The Lean Data Warehouse – Lessons Learnt

Humans grow by applying learnt experiences to new challenges. This is why, whenever we learn a new skill, the world is suddenly full of opportunities to apply it. I dropped over 50 kilograms (nearly 8 stone, 110 lbs) over the last 16 months, at roughly the same time I started applying Data Ops principles in my professional life. The two endeavours naturally cross-referenced each other, and the lessons I learnt on one journey became drivers of successful navigation on the other. I would not go as far as to say I could not have done one without the other, but the two have become connected in my personal story.



When I try to define why, the best answer I can come up with is that although very different in context, the destination of the journey for better health and the destination of the journey to operational excellence in data engineering are similar: to turn the body/data pipeline into an efficient, lean mechanism.


And while I think there might be some value in the story of massive fat loss (and such stories told by others have been valuable to me on my journey), I'm writing this because I think the lessons learnt here are lessons that can be used by data professionals in their professional capacity. It is the data ops, not the weight loss, I'm advocating today.


Lesson 1: Tackle GIGO

GIGO, or "garbage in, garbage out", has been a mindset of data engineers and BI professionals for ages. The techies tend to focus on the tech. If the data pipeline is working - i.e. it is reading the data from source, transforming it as designed and writing it to a target - the techies have done their bit. The pipeline is there. Whatever you pump through it you'll get on the other side: pump in high quality, complete, accurate data, and it'll spit out high quality insights. Pump in garbage data, and your insights will not be as good. We, the engineers and developers and architects and such, are no more responsible for the quality of the data coming into your pipeline than your plumber is for the quality of the water in your pipes.

Well, that thinking must end. Your plumber would be remiss if he didn't offer a water softener where one seemed necessary, for example. It is our responsibility, in the long run, to build mechanisms that handle garbage data. Testing the data before loading, flagging possible data issues and so on all save time later: time spent correcting wrong data loaded 50 months ago, for instance.
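To make the idea concrete, here is a minimal sketch of what "testing the data before loading" might look like. The column names and validation rules are my own illustrative assumptions, not anything from a specific project: the point is only that rows are checked and garbage is flagged rather than silently passed downstream.

```python
# Hypothetical sketch: validate incoming rows before loading them,
# flagging garbage instead of silently passing it through the pipeline.
# Column names ("customer_id", "amount") and rules are illustrative only.

def validate_row(row):
    """Return a list of data-quality issues found in a single source row."""
    issues = []
    if row.get("customer_id") in (None, ""):
        issues.append("missing customer_id")
    amount = row.get("amount")
    if not isinstance(amount, (int, float)) or amount < 0:
        issues.append(f"suspect amount: {amount!r}")
    return issues

def split_clean_and_garbage(rows):
    """Partition rows into loadable records and flagged rejects."""
    clean, rejects = [], []
    for row in rows:
        issues = validate_row(row)
        if issues:
            rejects.append({"row": row, "issues": issues})
        else:
            clean.append(row)
    return clean, rejects

rows = [
    {"customer_id": "C1", "amount": 42.0},
    {"customer_id": "", "amount": -5},
]
clean, rejects = split_clean_and_garbage(rows)
```

The rejects can then be written to a quarantine table and surfaced to the data owners, rather than discovered months after they were loaded.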

Just as we are accountable for noting what we eat and correcting course when we've eaten something that doesn't serve a fat loss goal (you are what you eat, they say), we are accountable for checking, and sometimes correcting, the data flowing in. It's a move away from asking "is this my responsibility?" towards asking "will this yield the best outcome?".


Lesson 2: Exercise

I am a firm believer that long term fat loss is only possible by moving more. Exercise has many benefits. It's not just the calories burnt, but also the muscles built, the hunger controlled. But most critically, it provides a centre of balance, something one can actively do that arranges all the other related activities around it. Eating healthy/less is mostly passive: it is more about avoiding certain foods or quantities. But exercising is active, it is something you do, and as such it lends itself to becoming the centre of gravity of the whole lifestyle one is trying to lead.

How does that translate to data warehousing? Well, just like us, data pipelines need to run, and run often. Especially in green field projects, there's a fear of switching the system on for automated refreshes. A sense that we might not yet be ready. Well, until you try, you won't know the extent of the issues you'll be facing. Start running data in as soon as the pipeline is developed and tested, before it is used in anger. Just like with exercising, it will be very difficult at first. You might not be able to do it for very long, or very often, but the more you do it, the better it will get. The first time I ran all the data pipelines on my new warehouse was roughly when I first went to the gym. I could hardly walk half a mile, and my data loads were failing daily or creating duplicate records. But with consistency comes improvement, and today the pipelines are reliable and I'm training for a half marathon, after having completed a sub-hour 10k.
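A sketch of the habit described above, under my own assumptions: run every pipeline on the nightly schedule from day one, and when one fails, log it and carry on with the rest of the batch so the failures surface early and can be fixed one by one. The pipeline names and the `run_pipeline` placeholder are hypothetical.

```python
# Hypothetical sketch: run all pipelines on a schedule and record failures,
# so problems surface early instead of on first use "in anger".
# run_pipeline and the pipeline names are placeholders for illustration.

import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("nightly_refresh")

def run_pipeline(name):
    """Placeholder for a real pipeline run; raises on failure."""
    if name == "flaky_feed":
        raise RuntimeError("duplicate records detected")
    return f"{name}: ok"

def nightly_refresh(pipelines):
    """Run every pipeline, logging failures instead of aborting the batch."""
    results = {}
    for name in pipelines:
        try:
            results[name] = run_pipeline(name)
        except Exception as exc:
            log.error("pipeline %s failed: %s", name, exc)
            results[name] = f"failed: {exc}"
    return results

results = nightly_refresh(["sales", "flaky_feed", "inventory"])
```

In practice a scheduler (cron, or an orchestrator such as Airflow) would trigger this loop; the design choice that matters is that one flaky feed never stops the daily "exercise" of the healthy ones.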


Lesson 3: Focus on processes, not goals

This might be the most important lesson. I learnt it from a book (Atomic Habits by James Clear) and it has been instrumental in my personal and professional development. If you focus on a goal or a target, you lose sight of the bigger picture. Getting to a target weight, for example, is relatively easy – but maintaining it is the real goal, and most people won’t be able to maintain a weight they got to by crash dieting. On the flip side, if you focus on the process: eat better, exercise, sleep better – the weight loss will happen, and will be sustainable because you’ve altered your way of life, not your weight alone. 

The lesson carries over to data warehousing. When we are target oriented - trying to get a solution in, trying to meet a deadline for a certain pipeline to be available - we might cut corners, develop fast and loose, and promise ourselves, with the best intentions, that we'll get there first and put in more permanent solutions later. But the harsh reality of IT systems everywhere is that there's never time "later". You'll deliver your project, and another one will pop up. If it doesn't, your team will be downsized. And there you are, a few months later, with scaffolding just barely holding processes in place. I used to look at a new requirement, think how long it would take to develop, add 20 percent, and send that back as an estimate. Today I look at a new requirement and think how long it would take to develop to operational readiness, how long it would take to write automated tests, and how long it would take to stabilise - and that's the estimate I give. It is significantly longer, yes, but the outcome is significantly more robust. Meeting a deadline means nothing if the whole thing goes KABOOM two weeks after release - it is always better to add those two weeks to the deadline and deliver a product that's ready to use.
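The "automated tests" part of that estimate need not be elaborate. A sketch of the kind of invariant check I have in mind, with an assumed table shape (order IDs and order dates are hypothetical examples):

```python
# Hypothetical sketch: simple invariant checks on loaded data, the kind of
# automated test worth including in an estimate up front.
# The table shape (order_id, order_date) is an assumption for illustration.

from datetime import date

def check_no_duplicates(rows, key):
    """True if every row has a distinct value for the given key column."""
    keys = [r[key] for r in rows]
    return len(keys) == len(set(keys))

def check_no_future_dates(rows, field, today):
    """True if no row carries a date later than the load date."""
    return all(r[field] <= today for r in rows)

loaded = [
    {"order_id": 1, "order_date": date(2024, 1, 5)},
    {"order_id": 2, "order_date": date(2024, 1, 6)},
]
ok = (check_no_duplicates(loaded, "order_id")
      and check_no_future_dates(loaded, "order_date", date(2024, 2, 1)))
```

Run after every load, checks like these are what turn "delivered" into "operationally ready".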


These are some of the principles by which I design and build data pipelines, and by which I have also led a personal change. I hope my readers find them useful. If you have more principles or tips, drop them in the comments!


Very interesting Nimrod, both sides of the article. I think along with pipes, data also needs to go for “exercise” - painful at first but sometimes which data you want and how you will use it only becomes clear by activating the systems. All points to more iterative cycles.

