Centralized logging platform and its migration into Google Cloud
Introduction to the centralized logging platform (Cello)
Cello is a one-stop destination for all logs generated by the various microservices (including e-commerce and stores) managed by Fast Retailing. Cello currently serves more than 400 enterprise users across 6 brands and 20+ regions. Cello provides a user interface to query, search, visualize, and build dashboards on top of millions of raw logs.
Cello is built on Amazon Web Services, mainly using Amazon Managed Elasticsearch (now known as Amazon OpenSearch Service) and Airflow, an open-source data orchestration tool.
Why a logging platform
From a tech perspective, Cello provides a platform for storing, analyzing, searching, and visualizing the data in the millions of log files generated every day by our microservices.
The high-level view of the logging platform looks like this:
Current architecture
Until 2020, Cello ran on a self-managed Elasticsearch cluster built on top of AWS EC2 instances under the Elastic license.
In September 2020, after a successful POC, we migrated Cello from the self-managed Elasticsearch cluster to Amazon-managed Elasticsearch (AES).
Why we chose Amazon Elasticsearch
Pros
As you can see from the image, the logs collected from the different microservices initially land in the PET S3 buckets. The fluentd log shipper then copies the files from the PET S3 buckets to the DET S3 landing buckets. (PET S3 buckets serve transactional purposes, while DET S3 buckets serve analytical purposes.)
Each region has its own dedicated S3 bucket for storing logs. Since the current Cello instances run in the ap-northeast-1 (Tokyo) region, the logs from the landing buckets in the different regions are eventually copied to the Tokyo main bucket by NiFi. We then use Airflow to feed the logs from S3 into Elasticsearch in micro-batches (every 15-20 minutes). That is why logs appear with a delay of up to about 20 minutes.
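To see where the delay comes from, consider a cron expression like `7,27,47 * * * *` (the pattern used in our ingestion configs), which fires at minutes 7, 27, and 47 of every hour. A log landing just after one run waits close to 20 minutes for the next. The sketch below (illustrative, not the actual scheduler code) computes that waiting time:

```python
def minutes_until_next_run(current_minute: int, fire_minutes=(7, 27, 47)) -> int:
    """Return how many minutes until the next run of a cron pattern
    that fires at the given minutes of every hour."""
    for m in fire_minutes:
        if m >= current_minute:
            return m - current_minute
    # No run left this hour: wrap around to the first run of the next hour.
    return 60 - current_minute + fire_minutes[0]

# A log arriving at minute 28 waits 19 minutes for the :47 run.
print(minutes_until_next_run(28))  # 19
# A log arriving at minute 48 waits until :07 of the next hour.
print(minutes_until_next_run(48))  # 19
```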
{
    'src_name': 'ec_admin_secure_log_uq_jp',
    'schedule': '7,27,47 * * * *',
    'src_platform': 'ec_admin',
    'src_brand': 'uq',
    'src_region': 'jp',
    'src_log_type': 'secure_log_1m',
    'dynamic_field_update': False,
    'log_fields_key': 'ec_admin',
    'pipeline': None,
    'index_dur_type': 'Weekly',  # Daily/Weekly/Monthly/Yearly based on size
},
# Index name: src_platform.src_brand.src_region.src_log_type-index_dur_type-sl_no-YYYY-MM-DD
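To make the naming scheme concrete, here is a minimal sketch of how an index name could be derived from such a config entry (the serial-number width and the exact date handling are assumptions for illustration):

```python
from datetime import date

def build_index_name(cfg: dict, sl_no: int, run_date: date) -> str:
    """Compose an index name following the pattern
    src_platform.src_brand.src_region.src_log_type-index_dur_type-sl_no-YYYY-MM-DD."""
    prefix = ".".join([cfg["src_platform"], cfg["src_brand"],
                       cfg["src_region"], cfg["src_log_type"]])
    return f"{prefix}-{cfg['index_dur_type']}-{sl_no:02d}-{run_date:%Y-%m-%d}"

cfg = {
    "src_platform": "ec_admin",
    "src_brand": "uq",
    "src_region": "jp",
    "src_log_type": "secure_log_1m",
    "index_dur_type": "Weekly",
}
print(build_index_name(cfg, 1, date(2020, 9, 7)))
# ec_admin.uq.jp.secure_log_1m-Weekly-01-2020-09-07
```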
Kibana is the user interface of Elasticsearch for searching, analyzing, visualizing, and building dashboards.
Challenges
We have been running our Cello system on Elasticsearch for the last 5 years, and even after migrating from the self-managed Elasticsearch cluster to Amazon-managed Elasticsearch, we could not completely solve the major challenges. Some of the remaining items are listed below.
Current pain points of Amazon Managed Elasticsearch
Even though Amazon Elasticsearch is an Amazon-managed service, its overall health still depends on us. We have to place data correctly and scale up the system beforehand, and we have to break down large data into chunks before feeding it into Cello. The Cello team spent more than 6 months on performance tuning and customization alone, with 3 engineers.
So we are hoping to reduce overall costs in the long term with alternative software, which would of course benefit both users and engineers.
Various proof of concepts
As part of the organization-wide digital business transformation, we worked on the proof of concepts below to propose a next-generation logging platform that adds new capabilities and supports business growth:
Why we chose GCP Cloud Logging
From an organizational perspective
[1] The organization is growing dynamically and at large scale. It is time to add capabilities such as auto-scaling, serverless operation, pay-per-usage pricing, and log analytics.
[2] Cost: GCP Cloud Logging is pay-per-usage (large log volumes are seasonal and arrive on an ad-hoc basis when we launch new products and run campaigns).
[3] Scalability: since it is a serverless service that offers auto-scaling, it definitely helps with unexpected campaigns, new stores, new countries, etc.
[4] Hybrid cloud (AWS + GCP) for long-term growth.
From an engineering perspective
[5] We need engineers for tuning and managing the system, and it is always difficult to find the right talent at the right time to support company growth. With less work spent on managing the system, we can concentrate more on extracting insights from log data.
[6] AWS does not offer an SLA for Kibana.
[7] Fewer engineers are needed for operations.
From a user perspective
[8] Service outages caused by Kibana instability have affected many users; we hope to avoid such situations by using Google Cloud Logging.
[9] Promote log analytics via BigQuery, Data Studio, BI Engine, etc. Since Google Cloud Logging can export logs to a petabyte-scale data warehouse like BigQuery, you can query log data using SQL and draw more insights.
[10] Google Cloud Logging is globally accessible. We often hear complaints about Kibana being unreachable from specific IPs or responding with 5XX errors. Because Google Cloud Logging is a global, shared service rather than a VPC-native solution tied to a specific project like our Kibana deployment, these issues should be resolved.
[11] Google authentication offers a more secure approach to authentication and authorization.
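As an illustration of the SQL-based analytics this enables, the sketch below builds a BigQuery query over a log sink table. The table name `my-project.cello_logs.ec_admin_secure_log` and the `timestamp`/`severity` columns are hypothetical; actual column names depend on how the Cloud Logging sink is configured:

```python
def severity_count_query(table: str, hours: int = 24) -> str:
    """Build a BigQuery SQL query counting exported log entries
    by severity over the last `hours` hours."""
    return (
        f"SELECT severity, COUNT(*) AS entries "
        f"FROM `{table}` "
        f"WHERE timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), "
        f"INTERVAL {hours} HOUR) "
        f"GROUP BY severity ORDER BY entries DESC"
    )

# Hypothetical sink table exported from Cloud Logging.
print(severity_count_query("my-project.cello_logs.ec_admin_secure_log"))
```

The resulting query string can then be submitted through the BigQuery console or a client library.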
New architecture
We changed the overall architecture of Cello to the one below, to address the pain points listed above and to add new capabilities.
Change points
For the reasons mentioned above, we decided to go with Google Cloud Logging. Of course, we will keep Amazon-managed Elasticsearch (but this time with 75% fewer nodes) for the reasons below:
Conclusion
The logs from all regional buckets (PET) will be fed directly to the Japan main bucket (DET) instead of the landing buckets. By doing so we eliminate the Lambda + S3 + NiFi hop and save costs. AES usage will be reduced by 75%, which in turn will reduce OPEX costs on the AWS side. In addition, GCP Cloud Logging will be the main software for log processing, storage, and analytics. And last but not least, engineers will have a faster and more reliable experience when looking for the logs they need.
About the author
Vijay works as Engineering Manager, Data, at Fast Retailing and is based in Tokyo, Japan. He is passionate about growing new engineering teams and building data, logging, and machine learning platforms while bringing new capabilities to the company.