As you add applications and infrastructure to your enterprise setup, the volume of log data grows enormously. Those logs are what you rely on to debug issues and keep your services up and running efficiently. In a large enterprise, you also need to ensure that logging covers not only compliance and security needs but also helps the business run effectively.
In this article, I share my experience of building a log analytics platform for a large multinational enterprise providing banking and financial solutions, and how automation helped us build a scalable logging layer.
The customer wanted to set up a centralized infrastructure and log monitoring platform with real-time alerting and reporting. They had 2000+ data sources, which were being monitored by 8 decentralized ELK nodes.
Our solution started with setting up an ELK Stack for infrastructure and application monitoring, with a 3-node cluster serving all their data sources and 5 Beats on each data source. The solution also included a real-time alerting system for the log pipeline and the data sources.
Operational Challenges:
- Installation and configuration of 5 beats on 2000+ servers
- Sizing and capacity planning
- Data on-boarding, index creation and management
- Implementing role-based access control for large teams
- Alerting and reporting implementation – Bank wise, Branch wise, Application wise
- Visualization and dashboarding of around 150+ dashboards and 2000+ visualizations based on different scenarios
Project Goal:
The project initially started with infrastructure and application (NGINX) monitoring for up to 20 nodes, using an Elastic cluster of 3 nodes plus 2 nodes for Logstash and Kibana. It covered installation and configuration of the ELK cluster and Beats (Filebeat, Metricbeat, Heartbeat, Packetbeat, Auditbeat, Winlogbeat), HA for Logstash and Kibana, alerting, reporting and dashboarding.
Architecture Overview:
The customer had 10+ servers running NGINX and NGINX Plus, which they wanted to monitor using ELK components such as Filebeat, Metricbeat and Heartbeat. The design looks as follows:
Physical Architecture:
The customer's infrastructure contains 2000+ nodes, all of which they wanted to monitor using the ELK stack. These nodes run different OS flavours: Windows (2008/2012) and Linux (RHEL 6/RHEL 7).
To build a monitoring platform for this kind of infrastructure, we proposed the following:
How We Solved the Challenges:
Installation and configuration of 5 beats on 2000+ servers:
Installing and configuring 5 Beats on 2000+ servers manually is not practical; it is a repeatable task with only minor variations, so we automated it with Ansible. We used the passive node of the Logstash/Kibana pair as the Ansible control server. Setting up the Ansible server on a node with no internet access was itself one of the challenging tasks, but we managed to take care of it. We wrote Ansible roles for the 5 Beats for each OS flavour: RHEL 6 and 7, and Windows 2008/2012. A minimal sketch of such a role is shown below.
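To give a feel for the approach, here is a minimal sketch of what the tasks of a Filebeat role for RHEL hosts might look like. The repository setup, template names and handler are illustrative assumptions, not the exact playbook used in the project.

```yaml
# roles/filebeat/tasks/main.yml (illustrative sketch for RHEL 6/7 hosts)
- name: Install Filebeat from the local (air-gapped) repository
  yum:
    name: filebeat
    state: present

- name: Deploy Filebeat configuration rendered from a Jinja2 template
  template:
    src: filebeat.yml.j2
    dest: /etc/filebeat/filebeat.yml
    owner: root
    group: root
    mode: "0644"
  notify: restart filebeat   # matching handler defined in handlers/main.yml

- name: Ensure Filebeat is enabled and running
  service:
    name: filebeat
    state: started
    enabled: yes
```

For the Windows 2008/2012 servers, the equivalent roles would rely on Ansible's Windows modules (win_copy, win_service and so on) instead of yum, template and service.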
Ansible to Data Source Flow:
Sizing and capacity planning:
Initially, we did cluster sizing for the customer:
- 64 GB RAM and 1 TB disk for each node of the Elastic cluster (3 nodes).
- 16 GB RAM and 500 GB disk for each Logstash and Kibana node (2 nodes).
This sizing was based on an events-per-second calculation assuming up to 50 nodes and 3 Beats (Filebeat, Metricbeat and Heartbeat). Once the number of data sources increased, we revised the cluster sizing for roughly 2200 events per second from 2000+ data sources.
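For context, a back-of-envelope estimate of what that ingest rate means is sketched below as a set of Ansible group variables. The average event size of ~500 bytes and the variable names are assumptions for illustration, not measured figures from the project.

```yaml
# Rough ingest estimate (assumed ~500 bytes per event):
#   2200 events/s x 86,400 s/day ≈ 190 million events/day
#   190M events  x 500 B         ≈ 95 GB/day of raw data
#   with 1 replica               ≈ 190 GB/day of cluster storage, before compression
# Revised cluster spec captured as group variables (illustrative names):
elasticsearch_nodes: 3
elasticsearch_ram_gb: 64
elasticsearch_disk_tb: 1
logstash_kibana_nodes: 2
logstash_kibana_ram_gb: 16
logstash_kibana_disk_gb: 500
```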
Data on-boarding, index creation and management:
Ingesting data from a large number of nodes, while making sure each event lands in the correct index with the right mapping, was one of the key challenges. At the outset, we on-boarded data for 20 servers, a mix of Windows and Linux machines, including NGINX and NGINX Plus application logs. While creating indexes, we considered how data would roll over and how snapshots would stay easy to manage. So we specified index names in the Beats configuration itself, except for Filebeat: Filebeat sends NGINX access and error logs to Elasticsearch via Logstash, which needs to filter the data based on the log pattern. We also configured replicas and shards in the Beats configuration dynamically, based on user input to the Ansible script, as in the sketch below.
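The fragment below is an illustrative excerpt of a Metricbeat template rendered by Ansible. The variable names (bank_name, number_of_shards, number_of_replicas, elasticsearch_host) are assumptions made for the example, not the exact variables from the project.

```yaml
# metricbeat.yml.j2 (rendered by Ansible with operator-supplied variables)
# When the default index name is overridden, setup.template.name and
# setup.template.pattern must be set as well.
setup.template.name: "metricbeat-{{ bank_name }}"
setup.template.pattern: "metricbeat-{{ bank_name }}-*"
setup.template.settings:
  index.number_of_shards: {{ number_of_shards }}
  index.number_of_replicas: {{ number_of_replicas }}

output.elasticsearch:
  hosts: ["{{ elasticsearch_host }}:9200"]
  index: "metricbeat-{{ bank_name }}-%{+yyyy.MM.dd}"
```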
Apart from NGINX, there are other applications whose logs need to be included in the same cluster, so we defined generic patterns for them. For the logs that go through Logstash, we created indexes for those applications, per the respective bank's servers, dynamically within Logstash itself.
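One simple way to drive that dynamic routing is to tag each event at the source. The sketch below is a hedged example of a Filebeat template (recent Filebeat versions; older ones use filebeat.prospectors instead of filebeat.inputs) that attaches bank, branch and application fields which Logstash can then interpolate into the output index name; the field and variable names are illustrative.

```yaml
# filebeat.yml.j2 (illustrative): NGINX logs go to Logstash, and every event
# carries fields that Logstash later uses to build a per-bank index name,
# e.g. an index pattern such as "<bank>-<application>-YYYY.MM.dd".
filebeat.inputs:
  - type: log
    paths:
      - /var/log/nginx/access.log
      - /var/log/nginx/error.log
    fields:
      bank: "{{ bank_name }}"
      branch: "{{ branch_name }}"
      application: nginx
    fields_under_root: true

output.logstash:
  hosts: ["{{ logstash_host }}:5044"]
```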
Implementing role-based access control for large teams:
The customer has a large team monitoring this huge infrastructure, and everyone accesses the same Kibana (UI) server, so the challenge was restricting who can read which data. We implemented role-based access for their users and automated user and role creation and configuration using the APIs Elastic provides, which made it easier and reduced the overall time.
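As a hedged illustration of that automation, the task below uses Ansible's uri module to create a read-only role scoped to one bank's indexes through the Elasticsearch security API. The endpoint path depends on the stack version (/_security/... on 7.x, /_xpack/security/... on 6.x), and the role name, index pattern and credentials variables are assumptions for the example.

```yaml
- name: Create a read-only role scoped to a single bank's indexes
  uri:
    url: "https://{{ elasticsearch_host }}:9200/_security/role/{{ bank_name }}_read"
    method: PUT
    user: "{{ es_admin_user }}"
    password: "{{ es_admin_password }}"
    force_basic_auth: yes
    body_format: json
    body:
      indices:
        - names: ["{{ bank_name }}-*"]
          privileges: ["read", "view_index_metadata"]
    status_code: [200, 201]
```

User creation follows the same pattern against the security user endpoint, with the role above attached, so on-boarding a new bank or team is just another run of the playbook.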
Alerting and reporting implementation – Bank wise, Branch wise, Application wise:
This was again a challenging task: alerting and reporting become difficult when there is no RBAC in place. Once we implemented RBAC based on users and roles, we could send alerts to the right users. Because we had already separated the data logically and physically while creating the indexes, it was easy to create and configure alerts. Again, we used the REST APIs that Elastic provides, this time to automate alert creation.
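Below is a hedged sketch of how such an alert could be registered through the Watcher API from Ansible (PUT _watcher/watch/... on 7.x, _xpack/watcher/watch/... on 6.x). The index pattern, query field, schedule and recipients are placeholders, and the email action assumes an email account is already configured in Elasticsearch.

```yaml
- name: Create an NGINX error watch for one bank
  uri:
    url: "https://{{ elasticsearch_host }}:9200/_watcher/watch/{{ bank_name }}_nginx_errors"
    method: PUT
    user: "{{ es_admin_user }}"
    password: "{{ es_admin_password }}"
    force_basic_auth: yes
    body_format: json
    body:
      trigger:
        schedule:
          interval: "5m"
      input:
        search:
          request:
            indices: ["{{ bank_name }}-nginx-*"]
            body:
              query:
                match:
                  nginx.access.response_code: 500
      condition:
        compare:
          ctx.payload.hits.total:
            gt: 0
      actions:
        email_ops_team:
          email:
            to: "{{ alert_recipients }}"
            subject: "5xx errors detected for {{ bank_name }}"
    status_code: [200, 201]
```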
Visualization and dashboarding of around 150+ dashboards and 2000+ visualizations based on different scenarios:
Creating a large number of visualizations and dashboards manually for different banks, and for the branches of those banks, is a tedious and repetitive task. So we created the visualizations and dashboards for one bank and replicated them for the rest, as sketched below.
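One way to script that replication, assuming a Kibana version that exposes the saved objects API (older stacks would copy objects through the .kibana index instead), is to export the reference bank's objects as NDJSON and re-import an adapted copy per bank. The task below is an illustrative sketch; host and path variables are assumptions.

```yaml
- name: Export the reference bank's dashboards and visualizations
  uri:
    url: "http://{{ kibana_host }}:5601/api/saved_objects/_export"
    method: POST
    headers:
      kbn-xsrf: "true"
    body_format: json
    body:
      type: ["dashboard", "visualization", "index-pattern"]
    return_content: yes
  register: exported_objects

- name: Save the export so it can be adapted and re-imported for other banks
  copy:
    content: "{{ exported_objects.content }}"
    dest: "/tmp/{{ bank_name }}_dashboards.ndjson"
```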
Solution Outcome:
- The automation scripts made it possible to scale up the number of data sources without vendor lock-in.
- With the ELK stack in place, the customer could monitor both logs and system metrics.
- The alerting system enabled the customer to receive real-time alerts for managing the log pipeline and data sources.