Centralized logging platform and its migration into Google Cloud

Introduction to the centralized logging platform (Cello) 

Cello is a one-stop destination for all logs generated by the various microservices (including e-commerce and stores) managed by Fast Retailing. Cello currently serves more than 400 enterprise users across 6 brands and 20+ regions, and provides a user interface to query, search, visualize, and build dashboards on top of millions of raw logs.

Cello is built on Amazon Web Services, mainly using Amazon Managed Elasticsearch (also called OpenSearch) and Airflow, an open-source data orchestration tool.


Why a logging platform

From a tech perspective, Cello provides a platform for storing, analyzing, searching, and visualizing the data in the millions of log files generated every day by our microservices.

  • Make sense of log data
  • Get insights from unstructured logs
  • Create visualizations/dashboards and extract business insights and opportunities
  • Retrieve aggregated information on particular fields from millions of files
  • Centralized stack for log data (handling ~1 TB/day, i.e. ~30 TB/month)
  • Role-based access for different teams
  • Serve users globally from the AWS Tokyo region
  • Anomaly detection using machine learning/Watcher
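As an illustration of the "aggregated information" use case, a query like the following counts error logs per service. This is a hypothetical sketch of an Elasticsearch query DSL body; the field names (`service.name`, `log.level`, `@timestamp`) are assumptions and depend on each microservice's actual log schema.

```python
# Sketch: an Elasticsearch terms aggregation that counts ERROR-level log
# entries per service over the last hour. Field names are illustrative
# assumptions, not the real Cello mappings.
agg_query = {
    "size": 0,  # we only want aggregation buckets, not raw hits
    "query": {
        "bool": {
            "filter": [
                {"term": {"log.level": "ERROR"}},
                {"range": {"@timestamp": {"gte": "now-1h"}}},
            ]
        }
    },
    "aggs": {
        "errors_per_service": {
            "terms": {"field": "service.name", "size": 10}
        }
    },
}
```

A body like this would be sent to the `_search` endpoint of an index; Kibana builds equivalent requests under the hood when rendering visualizations.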

The high-level view of the logging platform looks like this:

[Image: high-level view of the logging platform]

Current architecture

Until 2020, Cello ran on a self-managed Elasticsearch cluster built on top of AWS EC2 instances under an Elastic license.

In September 2020, after a successful POC, we migrated Cello from the self-managed Elasticsearch cluster to Amazon-managed Elasticsearch (AES).

Why you should choose Amazon Elasticsearch

  • Supports almost all Elastic features, except machine learning and Watcher, while being a managed service
  • Reduced OPEX cost
  • Scalability in a few clicks (reduced downtime)
  • Overall cost (license + infra) is lower compared to competitors
  • Amazon takes a backup every day, ensuring reliability and durability

Pros

  • Amazon-managed Elasticsearch, so an 80% reduction in OPEX cost
  • Dedicated UltraWarm (read-only) nodes for reduced storage cost
  • Integration with Cognito (unlimited scaling of users)
  • Scalable and easy to maintain
  • Availability via blue/green deployment
  • New features like ILM, cross-cluster search, and UltraWarm nodes
  • Easy to set up CloudWatch alarms, with more metrics available

[Image: current Cello architecture]

As you can see from the image, the logs collected from the different microservices initially land in the PET S3 buckets. The Fluentd log shipper then copies the files from PET S3 to the DET S3 landing buckets. (PET S3 buckets are transactional, while DET S3 buckets are for analytical purposes.)

Each region has its own dedicated S3 bucket for storing logs. Since the current Cello instances run in the ap-northeast-1 (Tokyo) region, the logs from the landing buckets in the different regions are eventually copied to the Tokyo main bucket by NiFi. We then use Airflow to feed logs from S3 into Elasticsearch in micro-batches (every 15 to 20 minutes). That is why you will see logs with a delay of around 15 minutes!

{
    'src_name': 'ec_admin_secure_log_uq_jp',
    'schedule': '7,27,47 * * * *',
    'src_platform': 'ec_admin',
    'src_brand': 'uq',
    'src_region': 'jp',
    'src_log_type': 'secure_log_1m',
    'dynamic_field_update': False,
    'log_fields_key': 'ec_admin',
    'pipeline': None,
    'index_dur_type': 'Weekly',  # Daily/Weekly/Monthly/Yearly based on size
},

#Index Name : src_platform.src_brand.src_region.src_log_type-index_dur_type-sl_no-YYYY-MM-DD
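The index name pattern above can be assembled from a per-source config entry. A minimal sketch, where `sl_no` and the date come from the ingestion run (the example values are illustrative, not taken from the real pipeline):

```python
# Sketch: build the Elasticsearch index name from a source config entry,
# following src_platform.src_brand.src_region.src_log_type-index_dur_type-sl_no-YYYY-MM-DD.
def index_name(cfg, sl_no, date_str):
    prefix = '.'.join([cfg['src_platform'], cfg['src_brand'],
                       cfg['src_region'], cfg['src_log_type']])
    return f"{prefix}-{cfg['index_dur_type']}-{sl_no}-{date_str}"

cfg = {
    'src_platform': 'ec_admin',
    'src_brand': 'uq',
    'src_region': 'jp',
    'src_log_type': 'secure_log_1m',
    'index_dur_type': 'Weekly',
}
print(index_name(cfg, 1, '2021-09-06'))
# ec_admin.uq.jp.secure_log_1m-Weekly-1-2021-09-06
```

Time-partitioned names like this let old indices be rolled over or moved to UltraWarm in bulk.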

Kibana is the user interface of Elasticsearch for searching, analyzing, visualizing, and building dashboards.

Challenges 

We have been running Cello on Elasticsearch for the last five years, and even after migrating from the self-managed Elasticsearch cluster to Amazon-managed Elasticsearch, we could not completely solve the major challenges. Some of the remaining items are listed below.

Current pain points of Amazon Managed Elastic search 

  • Data distribution does not happen correctly; it lacks intelligence.
  • Kibana becomes unstable when the search rate is high.
  • Ingesting large data at once puts pressure on one node, leading to unavailability of Kibana.
  • Cost is paid per machine/disk, not per usage.
  • Hot to UltraWarm movement is very slow, so hot node cost is high.
  • AWS support sometimes lacks the skills, and turnaround time is not as expected.
  • Kibana does not have an SLA.
  • Kibana does not support dynamic changes in mapping.

Description

Even though Amazon Elasticsearch Service is an Amazon-managed service, its overall health depends on us. We have to place data correctly and scale up the system beforehand. We also have to break large data down into chunks before feeding it into Cello. The Cello team spent more than six months on performance tuning and customization alone, with three engineers.
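The chunking idea can be sketched as follows: split a large batch of log records into fixed-size chunks so that a single bulk request never overloads one node. The chunk size and record shape here are illustrative assumptions, not the real pipeline values.

```python
# Sketch: split a batch of log records into fixed-size chunks. Each chunk
# would then be sent as one bulk-index request to Elasticsearch, instead
# of pushing the whole batch at a single node in one shot.
def chunked(records, chunk_size=5000):
    for i in range(0, len(records), chunk_size):
        yield records[i:i + chunk_size]

batch = [{'line': n} for n in range(12000)]  # pretend these are parsed log lines
sizes = [len(chunk) for chunk in chunked(batch)]
print(sizes)  # [5000, 5000, 2000]
```

Tuning this chunk size against node capacity is exactly the kind of manual work the team spent those months on.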

So we are hoping to reduce overall costs from a long-term perspective with alternative software, which would of course benefit both users and engineers.

Various proof of concepts

As part of the organization-wide digital business transformation, we worked on the proof of concepts below to propose the next-generation logging platform, add new capabilities, and support business growth:

  • Centralized logging platform using S3, Amazon-managed Elastic search & Airflow (Self-managed)
  • Centralized logging platform using S3, Amazon-managed Elastic search & Airflow (AWS-managed)
  • Centralized logging platform using S3, Amazon-managed Elastic search & Lambda
  • Centralized logging platform using S3, Splunk & Airflow (Self-managed)
  • Centralized logging platform using S3, Cloud Logging (GCP) & Composer (GCP managed Airflow)
  • Centralized logging platform using S3, Cloud Logging (GCP) & Airflow (Self-managed)
  • Centralized logging platform using S3, Cloud Logging (GCP) & Dataflow

Why we chose GCP Cloud Logging

From an organization perspective

[1] The organization is growing dynamically and at large scale. It is time to add capabilities such as auto-scaling, serverless systems, pay-per-usage, and log analytics.

[2] Cost: GCP Cloud Logging is pay-per-usage. (Large volumes are seasonal and ad hoc, e.g. when we launch new products and run campaigns.)

[3] Scalability: since it is a serverless service that offers auto-scaling, it definitely helps with unexpected campaigns, new stores, new countries, etc.

[4] Hybrid cloud (AWS + GCP) for long-term growth

From an engineer perspective

[5] We need engineers for tuning and managing the system, and it is always difficult to find the right talent at the right time to support company growth. With less work spent on managing the system, we can concentrate more on extracting insights from log data.

[6] AWS does not offer an SLA for Kibana.

[7] Fewer engineers needed to operate the platform

From a user perspective

[8] Service availability issues had hit many users because of Kibana instability, so we hope to avoid such situations by using Google Cloud Logging.

[9] Promote log analytics via BigQuery, Data Studio, BI Engine, etc. Since Google Cloud Logging lets you export logs to a petabyte-scale data warehouse like BigQuery, you can query log data using SQL and draw more insights.
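As a sketch of that workflow, the query below aggregates exported log entries in BigQuery. The project, dataset, table wildcard, and field names are assumptions for illustration; the real export schema depends on how the Cloud Logging sink is configured.

```python
# Sketch: a BigQuery SQL query over logs exported from Cloud Logging.
# `my-project.cello_logs.stdout_*` and the field names are hypothetical;
# exported log tables are typically date-sharded, hence _TABLE_SUFFIX.
query = """
SELECT
  resource.labels.container_name AS service,
  COUNT(*) AS error_count
FROM `my-project.cello_logs.stdout_*`
WHERE severity = 'ERROR'
  AND _TABLE_SUFFIX BETWEEN '20210901' AND '20210907'
GROUP BY service
ORDER BY error_count DESC
LIMIT 10
"""
# With the google-cloud-bigquery client this would run roughly as:
#   rows = bigquery.Client().query(query).result()
```

The same result set can then feed a Data Studio dashboard directly.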

[10] Google Cloud Logging is globally accessible. Many times we heard complaints about Kibana not being accessible from a specific IP or responding with 5XX errors. We hope to solve those issues, since Google Cloud Logging is a global shared solution, not a VPC-native one tied to a specific project like Kibana.

[11] Google authentication: a more secure way of handling authentication and authorization

New architecture 

To address the pain points listed above and to add new capabilities, we changed the overall architecture of Cello to the one below.

[Image: new Cello architecture with Google Cloud Logging]


Change points

Because of the reasons mentioned above, we decided to go with Google Cloud Logging. We will of course keep Amazon-managed Elasticsearch (but this time with 75% fewer nodes) for the reasons below:

  • Amazon-managed Elasticsearch serves as a backup solution in the worst case, when Google Cloud Logging goes down
  • Since we have limited visualizations in GCP, AES will serve as an alternative for high-priority visualizations
  • Overall log ingestion time from S3 to the logging user interface will decrease, as we eliminate the landing buckets in the middle

Conclusion

The logs from all regional buckets (PET) will be fed directly to the Japan main bucket (DET) instead of the landing buckets. By doing so we eliminate Lambda + S3 + NiFi and in turn save costs. AES usage will be reduced by 75%, which in turn will reduce OPEX costs on the AWS side. In addition, GCP Cloud Logging will be the main software for log processing, storage, and analytics. And last but not least, engineers will have a faster and more reliable experience when looking for the logs they need.

About the author

Vijay works as Engineering Manager, Data, at Fast Retailing. His work location is Tokyo, Japan. He is passionate about growing new engineering teams and building data/logging/machine learning platforms while bringing new capabilities to the company.
