A Description of Machine Learning, Written for MBAs
Description | Challenges | Business Strategy
Machine learning (ML) is powerful. IMO, it is basically the practical application of Artificial Intelligence (AI). A lot of leaders in data-driven businesses have an intuitive sense that they could make ML useful to their customers by integrating it into their products. Many of them just need to know what it’s all about, and what it looks like to “get there.” The purpose of this post is to give a high-level MBA view of making ML useful by describing it, illuminating challenges, and offering some opinions on how to run these types of efforts.
Description:
Machine learning is the activity of running data through algorithms to accomplish two basic things: (1) group similar data, or (2) “tag” data with a label that means something to you. You’ll see the words “clustering” and “classification” used in nerdy literature to describe these two basic things. Clustering is simply grouping similar data. Classification is simply tagging data with a label that describes it. Both of these classes of ML have two modes of operation, “deterministic” and “probabilistic”: deterministic algorithms give you a yes or no, while probabilistic algorithms give you an answer as a likelihood. ML also consists of two high-level methods of learning: supervised learning and unsupervised learning. Supervised means a human being must put data together and teach the algorithm with what is called a “training set.” Unsupervised means the algorithm looks at the data and figures things out on its own. Typically, unsupervised methods are used for clustering (grouping similar data), and supervised methods are used for classification (tagging data with a meaningful label). You may have seen mention of “deep learning,” which in practice is nothing more than a newer under-the-hood technique that can get you better classifications (but it requires more training data). All ML is driven by data formatted as what are called “features” or “vectors,” which are just fancy terms for “your data formatted in a way that makes it meaningful and compatible with ML algorithms.” Ok, so that’s a decent enough description of what ML actually is and generally does (albeit monumentally oversimplified from a computer scientist’s perspective). A tiny sketch of both activities follows below, and after that, a few real uses that I am living right now.
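To make those two activities concrete, here is a minimal sketch on made-up toy data; it assumes the freely available scikit-learn library, and nothing in it is specific to any product mentioned in this post.

```python
# The two basic ML activities on toy data; assumes numpy and scikit-learn.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# "Features"/"vectors": your data, formatted as rows of numbers.
X = np.array([[1.0, 2.0], [1.1, 1.9], [8.0, 9.0], [7.9, 9.2]])

# (1) Clustering: unsupervised grouping of similar data.
print(KMeans(n_clusters=2, n_init=10).fit_predict(X))  # e.g. [0 0 1 1]

# (2) Classification: supervised tagging, taught with a human-built training set.
y = ["low", "low", "high", "high"]
model = LogisticRegression().fit(X, y)
print(model.predict([[1.2, 2.1]]))        # yes/no-style answer: a single label
print(model.predict_proba([[1.2, 2.1]]))  # probabilistic answer: a likelihood per label
```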
At PlanetRisk, we use supervised learning to tag incoming data from news feeds with an “incident category.” PlanetRisk has a large database of incidents that humans created, so we build a model out of these incidents to tag incoming data streams. This ML will enable our Global Intelligence Operations Center to produce more incidents faster, which means our customers get better situational awareness (we provide this for thousands of customer assets). This is an example of supervised learning for probabilistic classification (we taught an algorithm how to label data with our incident taxonomy and the algorithm racks and stacks by likelihood). Use your imagination to think of interesting use cases…think about what you could do if you could automatically predict what a piece of data “is” as it hits your system. Think about what you could do if you knew what data was similar to any other given row/chunk of data; like a web. It’s fascinating what you might come up with.
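For flavor, here is a hedged, hypothetical sketch of this kind of probabilistic tagging. This is NOT PlanetRisk’s actual pipeline; the category names and training examples are invented for illustration, and scikit-learn is assumed.

```python
# Hypothetical probabilistic incident tagging; assumes scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Human-labeled historical incidents (the "training set"); invented examples.
texts = ["pipeline explosion near refinery", "armed robbery downtown",
         "flood closes major highway", "protest blocks city center"]
labels = ["industrial", "crime", "natural_disaster", "civil_unrest"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

# "Rack and stack" an incoming news item by likelihood.
incoming = "chemical leak forces plant evacuation"
probs = model.predict_proba([incoming])[0]
for label, p in sorted(zip(model.classes_, probs), key=lambda t: -t[1]):
    print(f"{label}: {p:.2f}")
```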
So far, I’ve just described what ML is and kind of how it works, and a tiny bit of what PlanetRisk does with it. Now I’ll philosophize a bit on the challenges of implementing these types of initiatives, from a business perspective (I will emulate business thinking as much as my technology brain allows).
Challenges
For the purposes of this article, I’ll illuminate challenges using a stream of thought, loosely organized around people, data, cost, usability, and expectation factors. I encourage copious use of inferential logic on your part. A lot of business executives get frustrated by how long these types of efforts take to bear fruit, and they also wonder why the tech folks can’t give them any definitive timelines or seemingly any degree of certainty about any aspect of the product. Here are a few bullet points of challenges, and how they affect (or probably should affect) expectations and ultimately the management of an ML (or data science) initiative:
· People who know how to implement ML so it can be integrated into real business systems are hard to find and expensive. It’s going to take longer than you think to build a team. A person who knows the business, understands the data and the use cases, and can implement the ML code in a real system is called a “unicorn.” If the timeline you were briefed assumes adequate resources are already in place, expect the start date to slip. Also, I never recommend hiring “theoretical data scientists,” because they often can’t do the data munging piece, which is 80% of the work (over-generalization, I know).
· The algorithms are commodity, but getting the data assembled, formatted, and stored, and scaling how data gets TO and FROM those commodity algorithms, is the hard part. The data munging will kill your timeline in a heartbeat (see the munging sketch after this list). Never assume any data is ready for ML. Also, every time you change your ML-related code (or model), you will often need to reprocess all of your data.
· Just because you hire an ML expert doesn’t mean they will understand your data, your use case, or your architecture immediately. There will be a learning curve, full of mistakes. What they are doing is going to be incomprehensibly complex to you – have operational patience and general human compassion (for as long as you can, anyway). Help them understand the data by asking them tough questions about it. The more involved you are, the more productive they will be.
· Your data might not be good for ML. Oftentimes a data science or ML initiative will uncover all types of data pedigree issues and data anomalies (in oversimplified terms, this means “bad data”). Don’t be surprised if your tech folks want you to buy better data to augment your own. In fact, it’s probably a good idea to first hire a consultant to estimate the value of your data for use in ML (a simple audit like the one sketched after this list is a sane first step).
· ML experts are technology-focused people. It can be hard to keep them on a practical track sometimes, because there are so many research “rat holes” they will need to go down to get to an end state. It’s hard to tell the difference between going down a natural rat hole and going completely off course while it’s happening. Also, don’t be surprised if V1 of the capability is a bit “off.” Avoid this by making sure an authoritative business owner provides continuous steering and knows details…real details.
· Your IT budget is going to grow. “Big Data” and ML require hardware, sometimes a lot of it, and sometimes it’s not just commodity hardware. Your people will start asking for “clusters” of computers, maybe GPU nodes, and your cloud cost may go up.
· If your data security policies are complicated (data encrypted at rest, employees who can’t see it, or only some who can see certain data but not others, etc.), then this will slow things down. Most ML experts are not used to a security-constrained data environment. This may also complicate things if the analysis you do with ML can’t mix all the data for all the customers. Make sure your IT folks are adhering to security policy on this journey.
· Your people will want either to use open source software or to buy COTS tools to get ML work done. ML and Big Data practitioners often jump to the open source world to solve problems. However, remember that in this case your people have to do a lot of foundational ditch digging, maintenance, glue coding, maintenance of glue code, etc. Force your people to think through what it’s going to take to build vs. buy, and do a long-term TCO analysis of the coding work and infrastructure cost. Don’t build the commodity pieces of the puzzle; build only the stuff that adds to your business value. Technologists like to play with stuff; don’t let them play with stuff unless you see real payoff.
· Sometimes users won’t understand your ML output (because it might be weird, or flat wrong). Don’t let your people use data they don’t understand in algorithms they don’t understand and then distribute results that they don’t understand. At least try to make sure they understand why they don’t understand. It’s your job to sanity check and keep it real, and make sure the output makes sense for the problem your customer is trying to solve. This is really easy to say, hard as hell to do.
· Not everyone has big data. Technologists love to think you (or they) do, so they can play with big data toys. Most businesses don’t really need to dive into the Big Data technology buzzword soup, but their technology people go there anyway because it’s cool. Encourage the use of the simplest tool for the job, and try to leverage existing investments. There is nothing wrong with reading data out of a regular, good old-fashioned database into an ML program and then putting the results right back into it, if your data and use case allow (see the database sketch after this list).
· People like to pre-optimize on the tech side. Build an MVP first; worry about scale and all that after you look at the results you’re getting and make sure someone cares about them. This is hard to do, because it can be hard to tell whether results are promising: they will seem like they could be, but it’s hard to be sure. This is an epistemologically challenged space.
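On the munging point above, here is a minimal sketch of that unglamorous work: turning raw, messy records into feature vectors an algorithm can consume. The field names and cleanup rules are hypothetical; pandas is assumed.

```python
# Hypothetical munging pass: raw, messy records -> clean feature vectors.
import pandas as pd

raw = pd.DataFrame({
    "amount": ["1,200", "950", None, "3,400"],   # numbers stored as strings
    "region": ["east", "West", "west", None],    # inconsistent casing, missing
})

df = raw.copy()
df["amount"] = pd.to_numeric(df["amount"].str.replace(",", ""), errors="coerce")
df["amount"] = df["amount"].fillna(df["amount"].median())   # impute missing values
df["region"] = df["region"].str.lower().fillna("unknown")   # normalize categories
features = pd.get_dummies(df, columns=["region"])           # one-hot encode for ML
print(features)
```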
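On the data-quality point, here is a quick, hypothetical sketch of the kind of audit worth running before betting an ML effort on your data. The file name and column layout are invented; pandas is assumed.

```python
# A quick, hypothetical data-quality audit of an invented file.
import pandas as pd

df = pd.read_csv("incidents.csv")  # hypothetical input file

print(df.isna().mean().sort_values(ascending=False))  # fraction missing, per column
print("duplicate rows:", df.duplicated().sum())       # exact duplicate records
print(df.describe(include="all"))                     # spot impossible ranges, odd categories
```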
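And on the “good old-fashioned database” point, here is a minimal sketch of that pattern: read rows out, score them with a model, and put the results right back in. The “business.db” schema is invented; pandas and scikit-learn are assumed, and sqlite3 ships with Python.

```python
# Hypothetical database round-trip: read rows, score them, write results back.
import sqlite3
import pandas as pd
from sklearn.linear_model import LogisticRegression

conn = sqlite3.connect("business.db")  # invented database and schema

# Train on the rows a human already labeled...
labeled = pd.read_sql("SELECT f1, f2, label FROM records WHERE label IS NOT NULL", conn)
model = LogisticRegression().fit(labeled[["f1", "f2"]], labeled["label"])

# ...score the rows that still need a label...
todo = pd.read_sql("SELECT id, f1, f2 FROM records WHERE label IS NULL", conn)
todo["predicted_label"] = model.predict(todo[["f1", "f2"]])

# ...and put the results right back into the same database.
todo[["id", "predicted_label"]].to_sql("ml_results", conn, if_exists="replace", index=False)
conn.close()
```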
Ok, I could go on forever with these tidbits of wisdom, and everyone who has ever been on this type of journey could add ten more items to my list. So, the question for the MBA-level business person is: what is a decent strategy for running something like this? Some principles, maybe?
I’m not an MBA, but I had the privilege of being mentored by a great one, and based on my experiences and that input, I have reduced an ML strategy down to these axioms and this sequence for this oversimplified post:
· Don’t deny the challenges. They are reality, not a nuisance. Love them.
· Decide which use case you think ML can accelerate for you. Don’t boil the ocean. The more specific you are, the more focused the tech team will be.
· Assess your data’s value against that use case:
o The data is guilty until proven innocent; it’s the enemy.
o Don’t invest in any infrastructure at all to do this, other than temporary cloud formations if really needed
· If your data has potential, have your tech guys create three competing courses of action (COAs) for implementation, including the following pieces when relevant:
o IT infrastructure plan (e.g., cloud vs. on-prem)
o ML tools and processing platforms (e.g., the Hadoop ecosystem vs. Oracle or Microsoft)
o Actual techniques and algorithms (Deep Learning vs. “normal” ML)
o TCO of each, and a rough timeline (don’t take the timeline too seriously)
o Force them to produce an MVP early, but not so early that it will inevitably suck
o Build in play time to tinker with stuff that no one on the team understands yet…perhaps have an R&D phase up front to kick the tires on some technologies.
· Have them brief the COAs to you, and collaborate your way to a simple modular plan
· Execute the plan and make sure someone is doing continuous azimuth checks
· Do your MBA stuff
That sums it up. I hope this was helpful. If you want more info, or you are interested in the Big Data analytics that PlanetRisk may be able to help you with (PlanetRisk has a professional services arm), feel free to reach out to me. Good luck!
"Don’t let your people use data they don’t understand in algorithms they don’t understand and then distribute results that they don’t understand." --that's a keeper.
Very well written.
quite impressive..upto the mark
Great article! The “play time” element is huge. Leveraging data is fundamentally exploratory, unlike other projects where you can set an agenda and follow it. You absolutely do not know anything about the data until you get your hands dirty, preprocess it, and start playing with it, and data has a way of wrecking your most elaborate and elegant models. Data is fierce, stubborn, mean, and ugly, and no amount of shiny new hardware will change that. Which means, from a management standpoint, data scientists need to be given plenty of time to *discover* what their data can provide; then management can act on it with planning and commitment of time and resources. Too often the agenda comes first, then the data exploration, and then in the middle of the project people find that the data doesn’t behave ... The value of play time: two weeks ago, while on vacation at the beach, I was able to figure out a question-answering scheme for a chatbot I built. One approach didn’t work, another didn’t work, and then one delivered good results: phrases matched, words matched, and the sentences more or less answered the question. When I came back, I plugged this feature in, and now the chatbot can answer a large number of random questions.
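For the curious, here is a hedged sketch of the kind of word-overlap question matching described in that comment. It is not the commenter’s actual chatbot code; the FAQ pairs are invented for illustration.

```python
# Hypothetical word-overlap question matcher, in the spirit of the comment above.

def overlap(a: str, b: str) -> float:
    """Share of words in `a` that also appear in `b` (crude but effective)."""
    a_words = set(a.lower().split())
    b_words = set(b.lower().split())
    return len(a_words & b_words) / max(len(a_words), 1)

# Invented FAQ pairs: known questions mapped to canned answers.
faq = {
    "what are your business hours": "We are open 9am to 5pm on weekdays.",
    "how do i reset my password": "Click 'forgot password' on the login page.",
}

def answer(question: str) -> str:
    """Answer by finding the known question with the most word overlap."""
    best = max(faq, key=lambda known: overlap(question, known))
    return faq[best]

print(answer("When are you open, and what are your hours?"))
# -> "We are open 9am to 5pm on weekdays."
```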