Centralized logging platform and its migration into Google Cloud

Introduction to the centralized logging platform (Cello) 

Cello is a one-stop destination for all logs generated by the various microservices (including e-commerce and stores) managed by Fast Retailing. Cello currently serves more than 400 enterprise users across 6 brands and 20+ regions, and provides a user interface to query, search, visualize, and build dashboards on top of millions of raw logs.

Cello is built on Amazon Web Services, mainly using Amazon Managed Elasticsearch (also called OpenSearch) and Airflow, an open-source data orchestration tool.


Why a logging platform

From a tech perspective, Cello provides a platform for storing, analyzing, searching, and visualizing the data in the millions of log files generated every day by our microservices.

  • Make sense of log data
  • Get insights from unstructured logs
  • Create visualizations/dashboards and extract business insights and opportunities
  • Retrieve aggregated information on particular fields from millions of files
  • Centralized stack for log data (handling ~1 TB/day, i.e. ~30 TB/month)
  • Role-based access for different teams
  • Serve users globally from the AWS Tokyo region
  • Anomaly detection using machine learning/Watcher
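As an illustration of the "aggregated information" use case, a query like the following counts error logs per service. This is a hypothetical sketch of an Elasticsearch query DSL body; the field names (`service.name`, `log.level`, `@timestamp`) are assumptions and depend on each microservice's actual log schema.

```python
# Sketch: an Elasticsearch terms aggregation that counts ERROR-level log
# entries per service over the last hour. Field names are illustrative
# assumptions, not the real Cello mappings.
agg_query = {
    "size": 0,  # we only want aggregation buckets, not raw hits
    "query": {
        "bool": {
            "filter": [
                {"term": {"log.level": "ERROR"}},
                {"range": {"@timestamp": {"gte": "now-1h"}}},
            ]
        }
    },
    "aggs": {
        "errors_per_service": {
            "terms": {"field": "service.name", "size": 10}
        }
    },
}
```

A body like this would be sent to the `_search` endpoint of an index; Kibana builds equivalent requests under the hood when rendering visualizations.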

The high-level view of the logging platform looks like this:

[Image: high-level view of the logging platform]

Current architecture

Until 2020, Cello ran on a self-managed Elasticsearch cluster built on top of AWS EC2 instances under an Elastic license.

In September 2020, after a successful POC, we migrated Cello from the self-managed Elasticsearch cluster to Amazon-managed Elasticsearch (AES).

Why you should choose Amazon Elasticsearch

  • Supports almost all Elastic features, except machine learning and Watcher, while being a managed service
  • Reduced OPEX cost
  • Scalability in a few clicks (reduced downtime)
  • Overall cost (license + infra) is lower compared to competitors
  • Amazon takes a backup every day, ensuring reliability and durability

Pros

  • Amazon-managed Elasticsearch, so an 80% reduction in OPEX cost
  • Dedicated UltraWarm (read-only) nodes for reduced storage cost
  • Integration with Cognito (unlimited scaling of users)
  • Scalable and easy to maintain
  • Availability via blue/green deployment
  • New features like ILM, cross-cluster search, and UltraWarm nodes
  • Easy to set up CloudWatch alarms, with more metrics available

[Image: current Cello architecture]

As you can see from the image, the logs collected from the different microservices initially land in the PET S3 buckets. The Fluentd log shipper then copies the files from PET S3 to the DET S3 landing buckets. (PET S3 buckets are transactional, while DET S3 buckets are for analytical purposes.)

Each region has its own dedicated S3 bucket for storing logs. Since the current Cello instances run in the ap-northeast-1 (Tokyo) region, the logs from the landing buckets in the different regions are eventually copied to the Tokyo main bucket by NiFi. We then use Airflow to feed logs from S3 into Elasticsearch in micro-batches (every 15 to 20 minutes). That is why you will see logs with a delay of around 15 minutes!

{
    'src_name': 'ec_admin_secure_log_uq_jp',
    'schedule': '7,27,47 * * * *',
    'src_platform': 'ec_admin',
    'src_brand': 'uq',
    'src_region': 'jp',
    'src_log_type': 'secure_log_1m',
    'dynamic_field_update': False,
    'log_fields_key': 'ec_admin',
    'pipeline': None,
    'index_dur_type': 'Weekly',  # Daily/Weekly/Monthly/Yearly based on size
},

#Index Name : src_platform.src_brand.src_region.src_log_type-index_dur_type-sl_no-YYYY-MM-DD
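The index name pattern above can be assembled from a per-source config entry. A minimal sketch, where `sl_no` and the date come from the ingestion run (the example values are illustrative, not taken from the real pipeline):

```python
# Sketch: build the Elasticsearch index name from a source config entry,
# following src_platform.src_brand.src_region.src_log_type-index_dur_type-sl_no-YYYY-MM-DD.
def index_name(cfg, sl_no, date_str):
    prefix = '.'.join([cfg['src_platform'], cfg['src_brand'],
                       cfg['src_region'], cfg['src_log_type']])
    return f"{prefix}-{cfg['index_dur_type']}-{sl_no}-{date_str}"

cfg = {
    'src_platform': 'ec_admin',
    'src_brand': 'uq',
    'src_region': 'jp',
    'src_log_type': 'secure_log_1m',
    'index_dur_type': 'Weekly',
}
print(index_name(cfg, 1, '2021-09-06'))
# ec_admin.uq.jp.secure_log_1m-Weekly-1-2021-09-06
```

Time-partitioned names like this let old indices be rolled over or moved to UltraWarm in bulk.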

Kibana is the user interface of Elasticsearch for searching, analyzing, visualizing, and building dashboards.

Challenges 

We have been running Cello on Elasticsearch for the last five years, and even after migrating from the self-managed Elasticsearch cluster to Amazon-managed Elasticsearch, we could not completely solve the major challenges. Some of the remaining items are listed below.

Current pain points of Amazon Managed Elastic search 

  • Data distribution does not happen correctly; it lacks intelligence.
  • Kibana becomes unstable when the search rate is high.
  • Ingesting large data at once puts pressure on one node, leading to unavailability of Kibana.
  • Cost is paid per machine/disk, not per usage.
  • Hot to UltraWarm movement is very slow, so hot node cost is high.
  • AWS support sometimes lacks the skills, and turnaround time is not as expected.
  • Kibana does not have an SLA.
  • Kibana does not support dynamic changes in mapping.

Description

Even though Amazon Elasticsearch Service is an Amazon-managed service, its overall health depends on us. We have to place data correctly and scale up the system beforehand. We also have to break large data down into chunks before feeding it into Cello. The Cello team spent more than six months on performance tuning and customization alone, with three engineers.
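The chunking idea can be sketched as follows: split a large batch of log records into fixed-size chunks so that a single bulk request never overloads one node. The chunk size and record shape here are illustrative assumptions, not the real pipeline values.

```python
# Sketch: split a batch of log records into fixed-size chunks. Each chunk
# would then be sent as one bulk-index request to Elasticsearch, instead
# of pushing the whole batch at a single node in one shot.
def chunked(records, chunk_size=5000):
    for i in range(0, len(records), chunk_size):
        yield records[i:i + chunk_size]

batch = [{'line': n} for n in range(12000)]  # pretend these are parsed log lines
sizes = [len(chunk) for chunk in chunked(batch)]
print(sizes)  # [5000, 5000, 2000]
```

Tuning this chunk size against node capacity is exactly the kind of manual work the team spent those months on.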

So we are hoping to reduce overall costs from a long-term perspective with alternative software, which would of course benefit both users and engineers.

Various proof of concepts

As part of the organization-wide digital business transformation, we worked on the proof of concepts below to propose the next-generation logging platform, add new capabilities, and support business growth:

  • Centralized logging platform using S3, Amazon-managed Elastic search & Airflow (Self-managed)
  • Centralized logging platform using S3, Amazon-managed Elastic search & Airflow (AWS-managed)
  • Centralized logging platform using S3, Amazon-managed Elastic search & Lambda
  • Centralized logging platform using S3, Splunk & Airflow (Self-managed)
  • Centralized logging platform using S3, Cloud Logging (GCP) & Composer (GCP managed Airflow)
  • Centralized logging platform using S3, Cloud Logging (GCP) & Airflow (Self-managed)
  • Centralized logging platform using S3, Cloud Logging (GCP) & Dataflow

Why we chose GCP Cloud Logging

From an organization perspective

[1] The organization is growing dynamically and at large scale. It is time to add capabilities such as auto-scaling, serverless systems, pay-per-usage, and log analytics.

[2] Cost: GCP Cloud Logging is pay-per-usage. (Large volumes are seasonal and ad hoc, e.g. when we launch new products and run campaigns.)

[3] Scalability: since it is a serverless service that offers auto-scaling, it definitely helps with unexpected campaigns, new stores, new countries, etc.

[4] Hybrid cloud (AWS + GCP) for long-term growth

From an engineer perspective

[5] We need engineers for tuning and managing the system, and it is always difficult to find the right talent at the right time to support company growth. With less work spent on managing the system, we can concentrate more on extracting insights from log data.

[6] AWS does not offer an SLA for Kibana.

[7] Fewer engineers needed to operate the platform

From a user perspective

[8] Service availability issues had hit many users because of Kibana instability, so we hope to avoid such situations by using Google Cloud Logging.

[9] Promote log analytics via BigQuery, Data Studio, BI Engine, etc. Since Google Cloud Logging lets you export logs to a petabyte-scale data warehouse like BigQuery, you can query log data using SQL and draw more insights.
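As a sketch of that workflow, the query below aggregates exported log entries in BigQuery. The project, dataset, table wildcard, and field names are assumptions for illustration; the real export schema depends on how the Cloud Logging sink is configured.

```python
# Sketch: a BigQuery SQL query over logs exported from Cloud Logging.
# `my-project.cello_logs.stdout_*` and the field names are hypothetical;
# exported log tables are typically date-sharded, hence _TABLE_SUFFIX.
query = """
SELECT
  resource.labels.container_name AS service,
  COUNT(*) AS error_count
FROM `my-project.cello_logs.stdout_*`
WHERE severity = 'ERROR'
  AND _TABLE_SUFFIX BETWEEN '20210901' AND '20210907'
GROUP BY service
ORDER BY error_count DESC
LIMIT 10
"""
# With the google-cloud-bigquery client this would run roughly as:
#   rows = bigquery.Client().query(query).result()
```

The same result set can then feed a Data Studio dashboard directly.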

[10] Google Cloud Logging is globally accessible. Many times we heard complaints about Kibana not being accessible from a specific IP or responding with 5XX errors. We hope to solve those issues, since Google Cloud Logging is a global shared solution, not a VPC-native one tied to a specific project like Kibana.

[11] Google authentication: a more secure way of handling authentication and authorization

New architecture 

To address the pain points listed above and to add new capabilities, we changed the overall architecture of Cello to the one below.

[Image: new Cello architecture with Google Cloud Logging]


Change points

Because of the reasons mentioned above, we decided to go with Google Cloud Logging. We will of course keep Amazon-managed Elasticsearch (but this time with 75% fewer nodes) for the reasons below:

  • Amazon-managed Elasticsearch serves as a backup solution in the worst case, when Google Cloud Logging goes down
  • Since we have limited visualizations in GCP, AES will serve as an alternative for high-priority visualizations
  • Overall log ingestion time from S3 to the logging user interface will decrease, as we eliminate the landing buckets in the middle

Conclusion

The logs from all regional buckets (PET) will be fed directly to the Japan main bucket (DET) instead of the landing buckets. By doing so we eliminate Lambda + S3 + NiFi and in turn save costs. AES usage will be reduced by 75%, which in turn will reduce OPEX costs on the AWS side. In addition, GCP Cloud Logging will be the main software for log processing, storage, and analytics. And last but not least, engineers will have a faster and more reliable experience when looking for the logs they need.

About the author

Vijay works as Engineering Manager, Data, at Fast Retailing. His work location is Tokyo, Japan. He is passionate about growing new engineering teams and building data/logging/machine learning platforms while bringing new capabilities to the company.
