MLOps: Repeatability in Production ML Systems
Why "It Worked Yesterday" Is Not a Valid State
Most machine learning systems perform well in controlled environments.
They train successfully. They pass evaluation metrics. They produce expected outputs on test data.
Then they are deployed.
And something subtle changes.
Not dramatically. Not immediately. But enough to introduce inconsistency.
The same pipeline, with the same intent, begins to produce different results.
At first, this is dismissed as noise. Later, it becomes a pattern. Eventually, it becomes a problem no one can precisely explain and no one can reliably reproduce to fix.
What Repeatability Actually Means
Repeatability is often equated with experimental reproducibility.
In production systems, the requirement is stricter.
Given the same inputs, the system should produce the same outputs consistently across time and environments.
This requirement extends beyond the model. It includes data ingestion, feature generation, model inference, configuration, and the execution environment itself.
If any of these components are not controlled, repeatability breaks. It usually breaks quietly.
What Happens When Repeatability Is Missing
The absence of repeatability rarely appears as a single failure. It emerges through patterns, and teams that have lived through these recognize them immediately.
• Same pipeline, different outputs
A pipeline is executed with identical intent on two different days. The outputs are not the same.
There is no clear record of what changed. No versioned data snapshot. No traceable configuration difference. What appears stable at the surface is not reproducible underneath.
Common underlying causes include:
• source data updated in place, with no versioned snapshot
• floating library dependencies that resolved differently between runs
• unseeded randomness in training or preprocessing
• configuration changes that were never recorded
• Works locally, behaves differently in production
A model behaves as expected in a local environment. When deployed, the same pipeline produces different results. The difference is not in the model itself, but in the surrounding system.
Differences often come from:
• mismatched library or runtime versions between environments
• different hardware or OS-level numeric behavior
• environment variables and defaults that diverge from local settings
• data paths that resolve to different content in each environment
The system is technically functional, but operationally inconsistent.
• Debugging becomes reconstruction
An issue is reported in production. The first question should be simple: what changed?
Without repeatability, this question is difficult to answer. Teams begin reconstructing past runs. Which data was used. Which model version was loaded. Which configuration was active.
Diagnosis becomes archaeology. Hours are spent recovering context that should never have been lost.
• Hidden dependencies solidify into invisible infrastructure
In the absence of structured pipelines, systems begin to rely on implicit behavior:
• environment variables that no one documents
• files assumed to exist at fixed local paths
• manual preparation steps performed from memory
• state left behind by previous runs
The system continues to function, but only under conditions that are neither visible nor controlled.
Why This Is Not a Modeling Problem
When outputs become inconsistent, the instinct is to revisit the model.
In many cases, the model is not the source of the issue. The system around it is.
Repeatability is a systems property. Without it, even a well-performing model will behave unpredictably.
You cannot tune your way out of an infrastructure problem.
Designing for Repeatability
Repeatability does not emerge by default. It must be engineered deliberately. Teams that skip this step eventually find themselves doing reconstruction work under pressure.
Version everything that affects output. This includes:
• data snapshots and feature definitions
• training code and model artifacts
• configuration and hyperparameters
• dependency and runtime versions
If it influences output, it must be versioned.
An S3 path without a version identifier is not a data source. It is a variable.
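As a minimal sketch of what a pinned data reference can look like: the names here (`DatasetRef`, `verify_snapshot`) are illustrative, not from any particular library. The point is that a data reference carries both a version identifier and a content hash, so drift is detected instead of silently absorbed.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetRef:
    """Immutable reference to one specific snapshot of a dataset."""
    uri: str          # e.g. an S3 prefix
    snapshot_id: str  # version identifier assigned at ingestion time
    sha256: str       # content hash of the snapshot manifest


def verify_snapshot(ref: DatasetRef, manifest_bytes: bytes) -> None:
    """Fail loudly if the data behind the reference has drifted."""
    actual = hashlib.sha256(manifest_bytes).hexdigest()
    if actual != ref.sha256:
        raise ValueError(
            f"{ref.uri}@{ref.snapshot_id} changed: "
            f"expected {ref.sha256}, got {actual}"
        )
```

With a reference like this, a pipeline can refuse to run against data it cannot verify, rather than producing an output no one can trace.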
Enforce deterministic pipelines.
Determinism is a requirement, not an optimization.
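A minimal sketch of what enforcing this can look like in Python, assuming NumPy is part of the stack; the same idea extends to whatever frameworks are in use:

```python
import os
import random
from pathlib import Path

import numpy as np


def pin_randomness(seed: int = 42) -> None:
    """Fix every source of randomness the pipeline controls."""
    random.seed(seed)
    np.random.seed(seed)
    # Propagated to subprocesses only; the current process must be
    # launched with PYTHONHASHSEED already set for it to apply here.
    os.environ["PYTHONHASHSEED"] = str(seed)


def list_inputs(data_dir: str) -> list[Path]:
    """Filesystem listing order is not guaranteed; sort it."""
    return sorted(Path(data_dir).glob("*.parquet"))
```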
Standardize execution environments.
A model that works locally should not behave differently in production.
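One lightweight guard, sketched here with illustrative pins (in practice they would be generated from a lockfile), is to refuse to run when the runtime drifts from the pinned environment:

```python
from importlib.metadata import version

# Illustrative pins; in a real system these come from a lockfile.
PINNED = {"numpy": "1.26.4", "scikit-learn": "1.4.2"}


def assert_environment(pins: dict[str, str]) -> None:
    """Fail fast if installed versions do not match the pins."""
    drift = {
        pkg: {"expected": expected, "installed": version(pkg)}
        for pkg, expected in pins.items()
        if version(pkg) != expected
    }
    if drift:
        raise RuntimeError(f"Environment drift detected: {drift}")
```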
Eliminate implicit state.
No undocumented environment variables, no manual preparation steps, no files assumed to already exist.
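A small illustration of the difference, with hypothetical keys: implicit state reads whatever happens to be in the environment, while explicit state declares its inputs and fails loudly when one is missing.

```python
import json

# Implicit: the run silently depends on whatever the environment contains.
#   threshold = float(os.environ.get("THRESHOLD", "0.5"))


# Explicit: every run declares its configuration, and gaps fail loudly.
def load_config(path: str) -> dict:
    with open(path) as f:
        config = json.load(f)
    required = {"threshold", "model_version", "data_snapshot"}
    missing = required - config.keys()
    if missing:
        raise KeyError(f"Config missing required keys: {sorted(missing)}")
    return config
```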
Make every run traceable.
Every execution should answer what data was used, what model version was applied, what configuration was active, and what output was produced. Without traceability, repeatability cannot be verified.
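A run manifest can be as simple as a JSON record written next to every output. This sketch assumes the identifiers are known at run time; the field names are illustrative, not a standard schema:

```python
import hashlib
import json
import time
from pathlib import Path


def write_run_manifest(run_dir: str, data_snapshot: str,
                       model_version: str, config: dict,
                       output_path: str) -> Path:
    """Record what produced this output, plus a hash of the output itself."""
    manifest = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "data_snapshot": data_snapshot,
        "model_version": model_version,
        "config": config,
        "output_sha256": hashlib.sha256(
            Path(output_path).read_bytes()
        ).hexdigest(),
    }
    manifest_path = Path(run_dir) / "manifest.json"
    manifest_path.write_text(json.dumps(manifest, indent=2))
    return manifest_path
```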
A Simple Test
Can you rerun a pipeline from six months ago and obtain the same result?
If the answer depends on conditions, assumptions, or reconstruction, repeatability is not established.
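Given manifests like the one sketched above, the test reduces to comparing recorded output hashes across runs:

```python
import json
from pathlib import Path


def same_result(run_a: str, run_b: str) -> bool:
    """Two runs are repeatable if their output hashes match."""
    a = json.loads((Path(run_a) / "manifest.json").read_text())
    b = json.loads((Path(run_b) / "manifest.json").read_text())
    return a["output_sha256"] == b["output_sha256"]
```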
Closing Thought
Production ML systems do not fail because they are inaccurate.
They fail because they are inconsistent.
Accuracy can be measured. Inconsistency cannot be controlled without deliberate system design.
Repeatability is the first and most foundational layer of a reliable ML system. Without it, automation becomes unpredictable, observability loses its signal, and reliability cannot be sustained.
Every improvement layered on top of an unrepeatable system rests on unstable ground.
This is Part 1 of the RAOR series: Repeatability, Automation, Observability, Reliability.
Part 2 covers Automation: how to move from manually operated pipelines to systems that run consistently without intervention.
If your team has hit any of these patterns in production, I would be interested to hear how you approached it. Drop a comment below.
What has been the hardest part of maintaining consistency in production ML systems? Data versioning, environment drift, or something else?