Hype cycle of (data) platform efficiency
Cloud and AI have greatly accelerated time-to-market and technology/architecture evolution, yet the hype cycle is still applicable. History doesn't repeat itself, but it often rhymes.
There are a lot of best-practice videos on YouTube and plenty of guidance from vendors, but it is still quite difficult to adopt and apply them in real work. Why? One possible side effect of the Cloud and AI acceleration is that engineers are overwhelmed by tight deadlines and product features; they don't have the time or mental bandwidth to validate or understand database or Kubernetes internals, so they put a system in place that just works but does not scale. Dr. Vogels also said in his re:Invent keynote:
"Your customers are never going to tell you that your database engineers are doing amazing work and they love what they've done. Only you understand the work that goes into it. Most of what we build nobody will ever see. And the only reason why we do this well is our own professional pride in operational excellence. That is what defines the best builders. They do the things properly even when nobody's watching."
It is hard to hold on to that pride when your stakeholders don't care and your chain of command is not watching. How often have you heard from a product/service team that they have only a few weeks to launch, so please don't tell them they need to partition the table or redesign the query shape; or that they simply don't believe they need the "over-optimizations" you suggest? Everything goes fine for the first 18~24 months, because initially the data/transaction volume is not big enough to cause incidents, or it is easy to scale up hardware to hide the problem. After that, you have either scaled to the biggest node type or your cost:performance ratio has deteriorated badly, yet you still run into higher-than-expected latency and frequent reliability incidents.

Some product/service teams are willing to fix the tech debt, but they have very low confidence in the specialized or advanced data technology and design patterns; or they would rather cope with incident outages than proactively take a maintenance window to fix their data model and query shape, because the "just works" data architecture is quite difficult to evolve or refactor without major surgery. In such a scenario, right-sizing plus container scheduling is usually the preferred "optimization" approach, while application-logic and data-model changes are avoided or delayed unless a code yellow/red is summoned.
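Why does scale-up only hide the problem for a while? A minimal sketch (with a hypothetical event table and made-up volumes) shows the difference between the "just works" flat table and a date-partitioned layout: with partition pruning, a point-in-time query touches a bounded number of rows no matter how much history accumulates, while the flat scan touches everything ever written.

```python
from collections import defaultdict
from datetime import date, timedelta

# Hypothetical event store: one flat list, and the same rows
# bucketed by day (a stand-in for date-based table partitioning).
flat_table = []
partitions = defaultdict(list)

start = date(2024, 1, 1)
for day_offset in range(365):
    d = start + timedelta(days=day_offset)
    for event_id in range(1000):  # assume ~1k events/day
        row = (d, event_id)
        flat_table.append(row)
        partitions[d].append(row)

target = date(2024, 6, 15)

# Unpartitioned: the filter must examine every row ever written,
# so query cost grows with total history.
rows_examined_flat = len(flat_table)

# Partitioned: pruning bounds the scan to the single matching day.
rows_examined_partitioned = len(partitions[target])

print(rows_examined_flat, rows_examined_partitioned)  # 365000 1000
```

Bigger hardware makes the flat scan finish faster, but the examined-rows count keeps growing with retention; the partitioned shape is what actually flattens the curve.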
At this point (between [3] and [4]), some companies "buy" their way out and migrate the one or two most expensive use cases to a different vendor to temporarily stop the cost madness. But roughly 18 months later, the cost spike comes back, because the new system is being swamped by the same suboptimal workloads (folks are still too busy to learn the internals and apply design rigor).
Instead of buying their way out, other companies have the engineering determination to fix the inefficient patterns. At this stage, granular ownership and provenance are required to hold specific teams accountable. As data and engineering maturity improve over time, [6] and [7] become the standards and testaments of engineering excellence.
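Granular ownership can start as simply as tagging every workload with an owning team and rolling the bill up per tag. A minimal sketch, with made-up workload names and costs (in practice these records would come from resource tags, billing exports, and query logs):

```python
from collections import defaultdict

# Hypothetical workload records: (workload, owning_team, monthly_cost_usd).
workloads = [
    ("orders-oltp",      "checkout",  12_000),
    ("orders-analytics", "checkout",   8_500),
    ("search-index",     "discovery", 21_000),
    ("ml-features",      "discovery",  6_400),
    ("untagged-job-42",  None,         9_900),  # no owner: nobody is accountable
]

# Roll spend up per team; untagged workloads land in an explicit
# UNOWNED bucket, which is where "just works" debt tends to hide.
cost_by_team = defaultdict(int)
for name, team, cost in workloads:
    cost_by_team[team or "UNOWNED"] += cost

# Report spend per team, largest first.
for team, cost in sorted(cost_by_team.items(), key=lambda kv: -kv[1]):
    print(f"{team:10s} ${cost:,}")
```

Once every dollar has an owner (or is visibly unowned), the accountability conversation in [6] and [7] becomes a data problem instead of a political one.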
[8] is the next level of efficiency, reachable once the engineering team has a stable accumulation of talent and technical confidence. This capability can finally tame the next wave of significant growth without the rapid cost-spike trajectory. By this phase, the buy-vs-build preference has reached a new equilibrium.
[10] is often triggered by major M&A(s), which bring different infra, processes, and platforms into the established/paved landscape. The cycle of consolidation, migration, and optimization starts again. A few companies can avoid [2] and restart from [4], but most may have to go through the [2] phase again.
Which loop have you observed? [1]->[2]->[3]->[2] is very convenient and super expensive. For the data folks who still hold the pride of engineering excellence: hopefully you and I can still strive for [7] and [8], even when nobody's watching.