Machine Learning Engineering
Oftentimes when we recruit data scientists, we prefer candidates with experience in research or machine learning (ML) competitions. For such candidates, the major shortcoming is engineering capacity, i.e., how to put an ML algorithm into production. Competitions pursue higher accuracy, and academic research pursues new ideas. An ML system certainly wants accuracy, but it also requires continuous delivery and steady improvement, and will be of little value otherwise. Therefore, building a stable, efficient, and flexible system is the top priority of any ML project.
In this blog I will summarize a few key points of ML engineering, drawn from years of practice and discussions with peers. This list is by no means comprehensive or exclusive. Please drop me a line or leave a comment if you find something important that I have left out.
1. Foundations: data collection, quality control, data pipeline, and experiment platform
Data collection needs to cover all aspects of the business and stay fresh. The task is usually challenged by volume and frequency. Some data are easier to collect, such as transaction data, because of their relatively low volume; others are more challenging, like behavioral data and external data. Data quality control is another issue: the data flow is easily interrupted by system changes or failures, and is subject to loss and contamination at every stage of collection. A good counter-strategy is to build a data quality monitoring program. Since the website is constantly being upgraded, we also face the risk that data collection and the data pipeline are left out of an upgrade.
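A monitoring program of this kind can start very small. The sketch below is only an illustration, with made-up field names and thresholds; it checks one batch of collected records for volume drops, null contamination, and staleness:

```python
from datetime import datetime, timedelta

def check_batch(rows, expected_min_rows, max_null_rate, max_age_hours, now=None):
    """Run basic quality checks on a batch of collected records.

    Each record is a dict; a None field value counts as a null, and the
    hypothetical "ts" field holds the record's collection timestamp.
    Returns a list of alert strings (empty means the batch looks healthy).
    """
    now = now or datetime.utcnow()
    alerts = []
    # Volume check: a sudden drop usually means collection broke upstream.
    if len(rows) < expected_min_rows:
        alerts.append(f"volume drop: {len(rows)} < {expected_min_rows}")
    # Contamination check: too many nulls suggests loss along the pipeline.
    nulls = sum(1 for r in rows for v in r.values() if v is None)
    total = sum(len(r) for r in rows) or 1
    if nulls / total > max_null_rate:
        alerts.append(f"null rate too high: {nulls / total:.2%}")
    # Freshness check: stale timestamps mean the flow has been interrupted.
    newest = max((r.get("ts") for r in rows if r.get("ts")), default=None)
    if newest is None or now - newest > timedelta(hours=max_age_hours):
        alerts.append("stale data: no recent timestamps")
    return alerts
```

In practice each check would be wired to an alerting channel; the point is that all three failure modes named above (interruption, loss, contamination) are cheap to detect automatically.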
Another dimension of data quality is post-processing the collected data. Behaviors from crawlers and the like should be removed; anomalous data should be detected, and the whole data circulation process should be rolled back and reprocessed from the point right before the anomaly.
With clean and stable data in hand, we need to deliver them to the designated places in a timely manner through the pipeline. The ML process is by nature a multi-source data merge, and it runs through two pipelines: the estimator pipeline for training and the transformer pipeline for prediction. Note that different pipelines have different processing logic and different speed requirements.
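To make the two-pipeline idea concrete, here is a minimal hand-rolled sketch (libraries such as scikit-learn provide the same fit/transform split): the training path fits each step's parameters, while the prediction path reuses those fitted parameters without refitting.

```python
class Standardize:
    """Learn mean/std at training time; apply the same scaling when serving."""
    def fit(self, values):
        n = len(values)
        self.mean = sum(values) / n
        var = sum((x - self.mean) ** 2 for x in values) / n
        self.std = var ** 0.5 or 1.0  # guard against zero variance
        return self

    def transform(self, values):
        return [(x - self.mean) / self.std for x in values]

class Pipeline:
    def __init__(self, steps):
        self.steps = steps

    def fit(self, values):
        # Estimator path (training): fit each step, then pass data onward.
        for step in self.steps:
            values = step.fit(values).transform(values)
        return self

    def transform(self, values):
        # Transformer path (prediction): reuse fitted parameters only.
        for step in self.steps:
            values = step.transform(values)
        return values
```

The key property is that the serving path never recomputes statistics, so training and prediction stay consistent even though they run with different logic and latency budgets.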
The data pipeline is like the blood vessels of an ML system, and the experiment platform is like its arms and legs. All ML results are delivered and evaluated through the experiment platform. A scientific experiment assigns each user to a random bucket. A number of strategies can be tested in parallel if they are independent of each other, in which case you can split the traffic at each layer.
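Bucket assignment is usually done with a salted hash, so that it is deterministic per user yet statistically independent across experiment layers. A minimal sketch, with hypothetical layer names:

```python
import hashlib

def bucket(user_id: str, layer: str, n_buckets: int = 100) -> int:
    """Deterministically assign a user to a bucket within an experiment layer.

    Salting the hash with the layer name decorrelates assignments across
    layers, so independent strategies can share the same traffic.
    """
    digest = hashlib.md5(f"{layer}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % n_buckets

def variant(user_id: str, layer: str, split: float = 0.5) -> str:
    """Map a bucket to a treatment/control arm at the given traffic split."""
    return "treatment" if bucket(user_id, layer) < split * 100 else "control"
```

The same user always lands in the same bucket for a given layer, which makes experiment membership reproducible without storing any assignment table.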
All experiments come with evaluations, and evaluation reports need to be automated. With evaluation results in hand, we need statistical tests to tell which strategy is better: a chi-square test suits categorical outcomes, while a beta distribution models Bernoulli outcomes in a Bayesian analysis. We also rely on Bayesian analysis to estimate how long an experiment needs to run. This part deserves a separate blog post of its own.
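As one illustration of the Bayesian side, the sketch below puts a Beta(1, 1) prior on each strategy's conversion rate and estimates by Monte Carlo how likely strategy B is to beat strategy A; the numbers in the usage note are invented:

```python
import random

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=20000, seed=0):
    """Monte Carlo estimate of P(rate_B > rate_A) under Beta(1, 1) priors.

    Each arm's conversion rate gets a Beta posterior
    (Beta(1 + conversions, 1 + failures)); we sample both posteriors
    and count how often B comes out ahead.
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        if b > a:
            wins += 1
    return wins / draws
```

For example, with 100 conversions out of 1,000 users on A versus 150 out of 1,000 on B, this probability comes out very close to 1, so B can be declared the winner; when the two arms are identical it hovers around 0.5 and the experiment should keep running.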
2. Rule model and bad case analysis system
Once the data infrastructure and the experiment infrastructure are ready, we can start our exciting journey: model building. Surprisingly, at the beginning of an ML system it is often the rule model, not the ML model, that is more effective. There are two reasons. First, it takes time to accumulate data, especially target data, for ML to perform well; even with data, it takes time to tune parameters. Second, the newly built data infrastructure is prone to bugs, and a rule model is usually more robust to them. As a result, the rule model tends to perform better at the beginning.
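A rule model can be as simple as an ordered list of named predicates, where the first matching rule decides the outcome. The fields and thresholds below are invented purely for illustration:

```python
# Each rule is a (name, predicate) pair over a transaction record; the order
# encodes priority, and the first matching rule decides the outcome.
RULES = [
    ("new_account_high_amount",
     lambda t: t["account_age_days"] < 7 and t["amount"] > 1000),
    ("velocity_spike",
     lambda t: t["orders_last_hour"] >= 5),
]

def score(txn):
    """Return (decision, fired_rule) for a transaction dict."""
    for name, pred in RULES:
        if pred(txn):
            return "reject", name
    return "accept", None
```

Returning the name of the rule that fired is important: it is exactly the hook the bad case analysis described next needs in order to attribute each outcome to a specific rule.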
While building the rule model, a bad case analysis system should be kicked off. A nice way to do this is to build a wide table that records all input features and result features. The result features should include the target value as well as plenty of information related to the results. This table is updated daily to help analysts identify statistical patterns in bad cases and then design counter-rules. After several iterations, you have a good chance of arriving at a rule combination that performs well. The disadvantages of the rule model, however, are its high manpower cost, high maintenance cost, and low potential for continued improvement.
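The wide table itself needs no special tooling to start with. A sketch, with hypothetical feature names, that flattens input and result features into prefixed columns and dumps a daily CSV for the analysts:

```python
import csv
import io

def wide_row(input_features, result_features):
    """Merge input and result features into one flat record.

    Prefixes keep the two sides distinguishable, so analysts can filter
    bad cases by any result feature and group by any input feature.
    """
    row = {f"in_{k}": v for k, v in input_features.items()}
    row.update({f"out_{k}": v for k, v in result_features.items()})
    return row

def dump_daily(rows):
    """Serialize the day's wide table as CSV text for the analysis team."""
    fields = sorted({k for r in rows for k in r})
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```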
Still, the rule model can serve as a benchmark once it is built, and as a backup when the online system falls apart, since it requires fewer resources.
3. Platforms (monitoring, feature engineering, machine learning, and delivery)
Now we can start to talk about ML. At this point, we need to run experiments, a lot of them. To run experiments successfully and smoothly, we need the help of platforms.
First, a monitoring platform provides thorough, automated, and easy-to-deploy monitoring capacity. Remember, every new experiment involves some sort of code change, and the only way to really test the new code is to run it online. A good monitoring program catches bugs early. It is also always a good idea to start with an idle run: deploy the new model exactly as you would in production, but route no traffic through it. Once it reaches a steady state, we start to assign traffic.
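The idle run can be expressed as shadow serving: the candidate model sees real traffic and its output is logged, but only the production model's answer is ever returned. A minimal sketch, with the models represented as plain callables:

```python
def serve(request, prod_model, shadow_model, log):
    """Shadow ("idle run") deployment.

    The shadow model runs on the same traffic as production, but its
    output and any crash it raises go only to the log; the user always
    receives the production model's answer.
    """
    prod_out = prod_model(request)
    try:
        log.append({"request": request,
                    "prod": prod_out,
                    "shadow": shadow_model(request)})
    except Exception as err:  # a shadow failure must never break serving
        log.append({"request": request,
                    "prod": prod_out,
                    "shadow_error": repr(err)})
    return prod_out
```

Comparing the logged shadow outputs against production output and against the eventual labels gives a full offline evaluation of the new code before any user is exposed to it.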
Second, a feature engineering platform, which includes a feature library and some feature manipulation tools. This topic deserves a blog post of its own, but here is a brief introduction. The feature library should come in two parts, an offline library and an online one, with a tooling layer on top for feature manipulation and exploration. Very likely, most ML work consists of fine-tuning existing features or enriching them with new ones, so a feature engineering platform is worth your time to build.
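To illustrate the offline/online split, here is a toy two-tier feature library; a real one would sit on a data warehouse plus a key-value store, but the interface has the same shape:

```python
class FeatureStore:
    """Toy two-tier feature library.

    The offline side is an append-only history used to build training
    sets; the online side keeps only the latest value per (entity,
    feature) for low-latency serving. A publish step keeps them in sync.
    """
    def __init__(self):
        self.offline = []   # full history: (entity, feature, value, ts)
        self.online = {}    # latest value keyed by (entity, feature)

    def publish(self, entity, feature, value, ts):
        self.offline.append((entity, feature, value, ts))
        self.online[(entity, feature)] = value

    def serve(self, entity, feature, default=None):
        """Online lookup used at prediction time."""
        return self.online.get((entity, feature), default)

    def training_rows(self, feature):
        """Offline scan used to assemble training data."""
        return [(e, v, ts) for e, f, v, ts in self.offline if f == feature]
```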
Third, a machine learning platform. A powerful and complete ML platform includes a rich model library, a model version control system, and a model evaluation system. Here I need to emphasize the importance of scientific methods: online experiments are scientific studies, which means results should be carefully curated, analyzed, and easy to reproduce. Notebook tools like Jupyter are helpful for such purposes.
Last but not least, deployment. Does the model need to be computed online or offline? If online, does it involve online features, or does the model itself also need to be updated online? Deployment capacity is closely tied to model complexity and data throughput. Competition players care about accuracy and often build gigantic models that are very difficult to deploy.
4. Considerations in architecture and team building
With the platforms defined, the basic architecture emerges. All platforms are standalone pieces connected through clearly defined interfaces, and the data pipeline makes these connections fast and flexible.
In the last section, we followed an ML system from scratch to full-fledged. In the early days, data engineers play a critical role in building the underlying data flow. After that, an analyst team comes into play to develop rule models; with multiple analysts and quick iterations, they can reach a reasonably good rule model. Then it becomes the game of scientists and algorithm engineers, who like to try different ML algorithms and invent new ones. Still, we need analysts to debug data problems and bad cases.
Collaboration between engineers and scientists is never easy. One of the best solutions I have come across is to have engineers build analytical tools, and have scientists and analysts use them. This way, both parties are happy: engineers find achievement in looking far ahead and abstracting analytical needs into a concrete toolbox, and analysts are happy to have full control of the analytical chain.
As for the career advancement of team members, it can match the project's progress. In general, data cleansing, monitoring, and evaluation metric design can be assigned to junior members under mentoring. With more experience, a person moves on to algorithm and platform work, which requires stronger analytical capacity and software development skill. Also, for an ML system, the upstream and downstream should be managed by different people, which helps with monitoring and with timely detection of any problem.
Fresh graduates with a research background usually know a lot about algorithms but lack an understanding of data and evaluation metrics, which prevents them from doing some basic tasks. To help them hit the ground running from day one, assign them tasks like cleaning data and designing evaluation metrics. These tasks help them quickly understand the nitty-gritty of the data as well as the general business goals.
5. Advanced topics
After an ML system has been online for a while, it will suffer from some chronic diseases. The most common one is the positive feedback loop, which results in overall performance degradation. The positive feedback loop, simply put, makes the rich richer and the poor poorer, so the recommendations become monotonous. Many things can cause a positive feedback loop; the most common is using the model's outcome as a factor in predicting the next round. For example, suppose we push to a user only those brands that he or she has ever clicked. When the user sees the customized push, he or she will click only a subset, so over time the user is pushed fewer and fewer brands, with no new brands at all. A simple solution is to add a few random brands; there are also tons of bandit algorithms you can try.
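Adding a few random brands is essentially epsilon-greedy exploration, the simplest bandit strategy. A sketch, with the slate size and exploration rate chosen arbitrarily:

```python
import random

def recommend(clicked_brands, all_brands, k=5, epsilon=0.2, rng=None):
    """Epsilon-greedy slate of k brands.

    Most slots exploit brands the user has clicked before, but a fraction
    (epsilon) is reserved for brands the user has never clicked, so the
    feedback loop cannot collapse onto a fixed subset.
    """
    rng = rng or random.Random()
    n_explore = max(1, int(round(epsilon * k)))
    # Exploit: previously clicked brands, up to the non-explore slots.
    exploit = [b for b in clicked_brands if b in all_brands][: k - n_explore]
    # Explore: sample only from brands the user has never clicked.
    pool = [b for b in all_brands if b not in clicked_brands]
    explore = rng.sample(pool, min(n_explore, len(pool)))
    return exploit + explore
```

Smarter bandits (UCB, Thompson sampling) replace the uniform random sample with an uncertainty-driven choice, but the structural fix is the same: every slate must carry some probability mass outside the model's own past outputs.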
Another problem is the ownership of data pipelines. In order to ship fast, many people borrow a semi-finished or finished pipeline from another product. The product ships and everyone is happy. Later on, however, due to the lack of a defined owner, this pipeline is very likely to be modified without downstream users being notified. In Internet companies where speed is the key, entangled data pipelines are commonplace, and the risk of a pipeline breaking down is very high.
The third problem is the obsolete threshold. ML systems tend to embed many thresholds for simple judgments, for example, applying adjustments to people who have not purchased in the last 120 days or who have purchased more than $2,000. These thresholds are usually derived from the data distribution at one point in time to filter out anomalies, e.g., beyond the 99th percentile. As time passes, a threshold that is not regularly refreshed may no longer be valid.
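The cure is to derive such thresholds from the current distribution on a schedule rather than hard-coding them. A nearest-rank sketch that a daily job could run over fresh data:

```python
def refresh_threshold(values, quantile=0.99):
    """Recompute a filter threshold from the latest data distribution.

    Uses the nearest-rank method on a sorted copy of the values, so the
    threshold tracks the distribution instead of silently going stale.
    """
    if not values:
        raise ValueError("no data to derive a threshold from")
    ordered = sorted(values)
    idx = min(len(ordered) - 1, int(quantile * len(ordered)))
    return ordered[idx]
```

Logging each refreshed value alongside the one currently in production also gives an automatic alarm: a large gap between the two means the hard-coded number has drifted away from reality.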
6. Summary
In summary, an ML system involves a number of interesting engineering challenges. The system requires both scalability and flexibility, which are inherently contradictory. Building such a system from scratch is a multi-faceted challenge: from the system perspective, how to design the software architecture and where the system's limits are; from the operations perspective, how to monitor, and the importance of the idle run; from the team management perspective, the composition of the team, the training of junior members, and the design of the various positions. With more time spent maintaining the system, and more contact with outside stakeholders, some 'chronic diseases' like positive feedback loops and code management challenges will emerge. It takes time and effort to solve all these problems. Still, it is easy to see the bright future of making ML an essential part of the entire business flow. Imagine the day when ML is truly democratized: every member of an organization can use ML tools at their disposal. That is the day when ML will have its largest impact.