Mapping Business Requirements to Solution Delivery: User Profiling for Data Tiering Decisions
Elastic Data Tiers

Part 1: The User Profile Test

Longtime users of Elasticsearch know that Elastic is FAST. Its speed is one of many reasons that Elasticsearch is ranked the #1 Search Engine and has been widely adopted across use cases generally falling into three categories: Enterprise Search, Observability and Security.

While speed is great, it also has a direct correlation to cost and therefore needs to be considered as part of the solution requirements.

In November 2020, Elastic announced Searchable Snapshots, an Elasticsearch feature powering new Cold and Frozen Data Tiers. By May 2021, both tiers were GA. These tiers offer significant cost optimization on Elastic workloads, with savings that can scale to $100Ks+ per year.

The Benefit: Significant Cost Saving Opportunities! Yay!

The Challenge: Generally speaking, my team works with administrators delivering Elastic-as-a-Service to their internal teams. With this, two issues arise:

  1. Teams with existing Elastic clusters need to justify making an architecture change at all
  2. Search speed is reduced on lower data tiers so administrators need to justify how they have made decisions on Data Distribution Across Data Tiers

Over the last year, my team has engaged in deep discovery with both existing and prospective customers to understand how to think about data tier distribution and the associated cost savings opportunity. This is Part 1 of 2 on the results.

TLDR: Distribution across Data Tiers is better thought of as a business requirement, rather than a technical requirement. Therefore, decisions on data tier distribution should be driven by the Users of the Solution, rather than the Administrators.

As there is not a good way to get the type of information we are looking for from Elasticsearch metrics today, I created a behavioral 'User Profile Test' which works as follows:

Step 1: Administrators go to 2-3 trusted users/teams and ask the following question. (Note: For best results, this question should be asked blindly, with the users/teams not knowing why the question is being asked. This can be done by email, Slack, etc.)

"Complete the following with respect to your day-to-day use of {InsertCurrentToolName}*:

  1. X% of the time I am searching over A {Hours/Days}
  2. Y% of the time I am searching over B {Days/Weeks}
  3. Z% of the time I search the full data set and when I do, I expect a search result in C minutes/hours

Where: X is your most common search, Y is the next common search, etc."

Example answers from three real teams:

  • Example A: "50 % time, I search in the range 0-7 days, 20% time I search in the range of 7-30 days, 30 % time, I search in the range > 30 days (generally <=90 days), I expect a search result in 1 minute"
  • Example B: "...I would say 50% of the time my team would search for 1-2 days 40% would be 1-2 weeks and 10% is a month or even further."
  • Example C: "With our existing tool [not Elastic], 40-45 % time I am searching the last 7 days, 40-45% time I am searching the last 30 days, 10 % time, I search the full 365 days and when I do it takes 5+ min. to get a response"
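A response like these can be turned into numbers before any discussion of tiers. The following is a minimal sketch, assuming each line of a response is reduced to a share of searches and the oldest data those searches touch. The names SearchBucket and cumulative_coverage are hypothetical, and Example A's third bucket uses the respondent's stated 90-day upper bound.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchBucket:
    """One line of a User Profile response."""
    share: float       # fraction of searches, 0.0 to 1.0
    max_age_days: int  # oldest data touched by these searches


# Example A: 50% within 7 days, 20% within 7-30 days, 30% beyond 30 days
# (generally <= 90 days, so 90 is used as the upper bound here).
EXAMPLE_A = [
    SearchBucket(share=0.50, max_age_days=7),
    SearchBucket(share=0.20, max_age_days=30),
    SearchBucket(share=0.30, max_age_days=90),
]


def cumulative_coverage(buckets):
    """Return (max_age_days, cumulative share) pairs, most recent window first."""
    total = 0.0
    coverage = []
    for bucket in sorted(buckets, key=lambda b: b.max_age_days):
        total += bucket.share
        coverage.append((bucket.max_age_days, round(total, 2)))
    return coverage


for days, share in cumulative_coverage(EXAMPLE_A):
    print(f"{share:.0%} of searches are answered by the last {days} days of data")
```

The cumulative view is the useful one for tiering: it shows how much of the day-to-day search activity a given retention window on the faster tiers would serve.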

Interestingly, User Profile responses differed from customer to customer despite use case alignment. We did not observe trends that would allow us to say: "For any Logging Use Cases: 7 Days Hot, 30 Days Warm, etc." This further supports the conclusion that data tier distribution is a business requirement and should be considered based on the particular use case rather than a generalized "industry standard".

*Important: While {InsertCurrentToolName} can be an Elasticsearch cluster, the test is equally applicable to non-Elastic tools, i.e. Splunk, DataDog, etc.

Step 2: Map the written response(s) to a table and discuss.

Here's an example of what this looks like for Example C:

"With our existing tool, 40-45 % time I am searching the last 7 days, 40-45% time I am searching the last 30 days, 10 % time, I search the full 365 days and when I do it takes 5+ min. to get a response"

[Image: table mapping the Example C response to data tiers]

You'll note that while Elastic offers 4 Data Tiers, there is no requirement to use all 4 tiers. Therefore, we use this test to answer not only how we should distribute data across tiers but also which tiers we should use at all. For this particular example, we are justified in distributing 7 Days to the Hot Tier and 23 Days to the Warm Tier (this gives us a total of 30 Days across Hot and Warm). Given that only 10% of usage is beyond 30 Days and that users report a high tolerance for slower search, we distribute all data older than 30 days directly to the Frozen Data Tier. At this point, we don't have a clear reason to include the Cold Tier and so it is omitted.
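The Example C mapping can be sketched as a small calculation plus an index lifecycle management (ILM) policy body. This is an illustration under stated assumptions, not a recommended configuration: the function names, the daily rollover, the warm-phase priority, the delete phase at 365 days, and the repository name "my-snapshot-repo" are all hypothetical choices. Note also that when rollover is used, ILM measures min_age from rollover time, so actual time on each tier can run slightly longer than the nominal figures.

```python
import json


def tier_boundaries(hot_days, warm_until_days, retention_days):
    """Days of data held on each tier; the Cold tier is intentionally skipped."""
    return {
        "hot": hot_days,
        "warm": warm_until_days - hot_days,
        "frozen": retention_days - warm_until_days,
    }


def ilm_policy(hot_days, warm_until_days, retention_days, snapshot_repository):
    """Build a request body for PUT _ilm/policy/<name> with no cold phase."""
    return {
        "policy": {
            "phases": {
                "hot": {
                    "min_age": "0ms",
                    # Assumption: roll over daily; tune to your ingest volume.
                    "actions": {"rollover": {"max_age": "1d"}},
                },
                "warm": {
                    "min_age": f"{hot_days}d",
                    # Assumption: lower recovery priority once data leaves hot.
                    "actions": {"set_priority": {"priority": 50}},
                },
                "frozen": {
                    "min_age": f"{warm_until_days}d",
                    "actions": {
                        "searchable_snapshot": {
                            "snapshot_repository": snapshot_repository
                        }
                    },
                },
                # Assumption: data is deleted once it passes the full retention.
                "delete": {
                    "min_age": f"{retention_days}d",
                    "actions": {"delete": {}},
                },
            }
        }
    }


# Example C: 7 days hot, 30 days across hot and warm, 365 days in total.
print(tier_boundaries(7, 30, 365))
print(json.dumps(ilm_policy(7, 30, 365, "my-snapshot-repo"), indent=2))
```

Leaving out the cold phase is all it takes to skip that tier; the policy moves data from warm straight to frozen at the 30-day mark.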

In cases where multiple teams provided responses, we use the same model side-by-side to map each team's answer and compare/contrast overlaps in requirements.
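One way to lay out a side-by-side comparison is to evaluate every team's cumulative search coverage at every window any team mentioned. The sketch below uses Examples A and C from above. The function names are hypothetical, and Example C's two "40-45%" buckets are both taken as 45% so that the shares sum to 100% (an assumption).

```python
# Cumulative share of searches answered by the most recent N days of data.
TEAM_PROFILES = {
    "Example A": {7: 0.50, 30: 0.70, 90: 1.00},
    # Assumption: both "40-45%" buckets are read as 45%.
    "Example C": {7: 0.45, 30: 0.90, 365: 1.00},
}


def coverage_at(profile, days):
    """Share of a team's searches served by the most recent `days` of data."""
    return max(
        (share for window, share in profile.items() if window <= days),
        default=0.0,
    )


def side_by_side(profiles):
    """Map each window (in days) to every team's coverage at that window."""
    windows = sorted({w for profile in profiles.values() for w in profile})
    return {
        w: {team: coverage_at(profile, w) for team, profile in profiles.items()}
        for w in windows
    }


table = side_by_side(TEAM_PROFILES)
teams = list(TEAM_PROFILES)
print("window".ljust(10) + "".join(team.ljust(12) for team in teams))
for window, shares in table.items():
    cells = "".join(f"{shares[team]:.0%}".ljust(12) for team in teams)
    print(f"<= {window}d".ljust(10) + cells)
```

Rows where the teams agree point to shared tier boundaries; rows where they diverge are the ones worth discussing with the teams.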

Takeaways from this past year of study...

  1. Cold and Frozen Data Tiers with Searchable Snapshots are not limited to time-series data workloads. In fact, many large Search Use Cases resulted in User Profile Responses equally justifying lower data tier distributions, particularly use cases by Compliance Teams and within the Insurance and Banking industries.
  2. In all cases the initial data tiering requests from administrators followed generic models: 30 Days Hot, 60 Days Warm, etc. When asked to justify these requests, administrators were generally unable to provide substantive answers.
  3. In all cases the initial data tiering requests from administrators were overweighted towards Hot and Warm tiers when compared with actual User Profile responses.
  4. Commonly, administrators of Self-Managed Elasticsearch Clusters reported "All Hot Nodes" but true node architecture mapped closest to Elastic's "Warm Tier", making it very important to qualify naming conventions early on to avoid confusion.
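The overweighting in takeaway 3 can be expressed as simple arithmetic. The sketch below is purely illustrative: it pairs the generic "30 Days Hot, 60 Days Warm" request from takeaway 2 with the Example C mapping (7 Days Hot, 23 Days Warm). That pairing is hypothetical and does not describe any one customer, and the function name is made up for this example.

```python
def tier_overweight(requested_days, profiled_days):
    """Requested minus profiled days per tier; positive means over-provisioned."""
    tiers = sorted(set(requested_days) | set(profiled_days))
    return {
        tier: requested_days.get(tier, 0) - profiled_days.get(tier, 0)
        for tier in tiers
    }


ADMIN_REQUEST = {"hot": 30, "warm": 60}  # generic model from takeaway 2
USER_PROFILE = {"hot": 7, "warm": 23}    # Example C mapping

for tier, extra_days in tier_overweight(ADMIN_REQUEST, USER_PROFILE).items():
    print(f"{tier}: {extra_days} more days requested than the profile supports")
```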

Conclusion: We now know how to think about data tier distribution, and have a repeatable test we can use to get metrics for analysis. We still need to understand cost implications so that we can fully define the savings opportunity.

To learn more about Searchable Snapshots and Elastic Data Tiers, see the resources below:

Really cool! Thanks for sharing. :)

Great article and approach Danielle....and one that more Users of Elastic should embrace to ensure their implementation meets the business requirements.

Great article Danielle Abraham thanks for sharing your insights

Danielle Abraham Thank you for sharing this insight! I am curious to see if Brad Quarry will share his point of view on this. Brad has shared some thoughts with me about the use of the terms cold and frozen possibly causing users to expect less of Elastic's capabilities and responsiveness in those tiers. Mark Simoes Bindurao Kulkarni Dinakar Challa please take a look at this article from my colleague. I look forward to hearing your thoughts and to exploring how we can help optimize your service. Jennifer Zorza Alan Sizemore Katherine Sasek, PMP Eddie M. Shri Bodas Shane Davies Aram Favela Jisha Thekkittil
