Analytics and Big Data: Re-framing the Information Scientist
My first encounter with artificial intelligence was at the Museum of Science in Boston. They had a Tic-Tac-Toe machine you couldn't beat. I remember thinking Tic-Tac-Toe would be perfect for my science fair project, and began wiring up my "if this, then that" rules. I never finished, but at some point while designing my wiring algorithm I realized that Mr. Tic-Tac-Toe wasn't really smart – just a bunch of rules.
Years later, with an engineering degree in hand, the challenge reemerged in the form of "automated route optimization". My hopes were higher. Technology had advanced. Software languages had evolved. There was even a company selling an AI framework. While the data sets were larger and I had a vehicle model to account for, I realized that once again I was looking at a rule-based system, albeit with far greater permutations. This wasn't intelligence. It was rule-writing labor.
Then I read an article that intrigued me regarding "real" artificial intelligence. There had been an ongoing debate in academia over how the human brain processes data and makes decisions. To emphasize the point, researchers named the concept "neural networks". What really caught my interest was that the founder of MIT's Artificial Intelligence Lab and published opponent of neural networks, Prof. Marvin Minsky, was now admitting he may have been wrong. Better yet, a conference was scheduled at MIT where Minsky and Stanford Professor Nils Nilsson, a pioneer in neural networks, would discuss the latest research, along with the new hero in town, Danny Hillis, who would demonstrate a neural net algorithm on one of his Thinking Machines. Really cool!
I was now firmly on the path of discovery. I took a master's-level course in Pattern Recognition and armed myself with the fundamentals, but it was like opening Pandora's Box, exposing me to new challenges. The algorithms needed data, and for accuracy in their predictions, they needed lots of data. I ended up being pointed towards multiple disparate data silos, and until you access the data yourself you don't know whether it contains the information you're looking for, or how useful it is given poor quality. I was beginning to appreciate the simplicity of rule-based engines.
Which leads me to the point of this article.
The primary challenge in Advanced Analytics and Big Data is understanding the full spectrum of data needed to gain insight and base decisions on.
The greater the number of data sources, the greater the confidence in decision making and the greater the benefits – but with this comes greater cost, and greater risk of failure if you don't start with the right approach.
You don't start with plans to "boil the ocean". You start small by employing data discovery and data modelling on internal, mature data sources, and if a pattern emerges – where the benefit of embedding a rule or alert in your decision-making process minimizes the chance of a more costly event occurring – then the effort is worth the cost.
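That "is the effort worth the cost" test can be made concrete with simple expected-value arithmetic. Here is a minimal sketch in Python; every number in it is an invented illustration, not a figure from any real project:

```python
# Hypothetical sketch: is embedding a rule/alert worth it?
# All numbers below are illustrative assumptions, not real figures.

def rule_is_worth_it(event_probability, event_cost,
                     rule_catch_rate, rule_cost):
    """Compare the expected loss avoided by a rule/alert
    against the cost of building and running it."""
    expected_loss_avoided = event_probability * event_cost * rule_catch_rate
    return expected_loss_avoided > rule_cost

# Example: a 5% chance of a $200,000 event, a rule that catches
# 60% of occurrences, and a $4,000 cost to implement the rule.
# Expected loss avoided = 0.05 * 200,000 * 0.60 = $6,000, so the
# rule pays for itself.
print(rule_is_worth_it(0.05, 200_000, 0.60, 4_000))  # True
```

The point isn't the arithmetic – it's that you write the numbers down at all, so the decision to build the rule is explicit rather than implied.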
Fun Note: You might consider just using Excel and sampling the data for possible patterns and anomalies. I had been using Excel for years and had never considered it as a self-service tool for data modelling until I came across John Foreman's book, "Data Smart: Using Data Science to Transform Information into Insight", http://www.john-foreman.com/data-smart-book.html. For fun, check out the author's comments by clicking on the darkened image for a "game like" introduction to his book.
Loyalty Coupons in the Transportation Industry
It doesn't need to be complicated to discover business value. A simple example we built years ago, embedding a data model into Siebel CRM, predicted the likelihood of losing high-value customers based on actual churn in the company's historical data. It was a simple model with one mature data source. Armed with this information the company offered discounts, coupons and upgrades on an individual basis rather than discounting an entire market segment, and gained the benefit of increased customer loyalty.
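The heart of a model like that can be as plain as historical churn frequencies by customer segment. The sketch below is a hypothetical illustration of the idea, not the actual Siebel-embedded model; the segment names and records are invented:

```python
from collections import defaultdict

# Hypothetical sketch: estimate churn likelihood from historical
# churn frequencies per segment. Data here is invented.

def churn_rates(history):
    """history: list of (segment, churned) pairs from past customers.
    Returns the observed churn rate for each segment."""
    counts = defaultdict(lambda: [0, 0])  # segment -> [churned, total]
    for segment, churned in history:
        counts[segment][0] += int(churned)
        counts[segment][1] += 1
    return {seg: churned / total for seg, (churned, total) in counts.items()}

history = [("frequent_flyer", True), ("frequent_flyer", False),
           ("occasional", True), ("occasional", True), ("occasional", False)]
rates = churn_rates(history)
print(round(rates["occasional"], 2))  # 2 of 3 churned -> 0.67
```

With rates like these in hand, the CRM can flag any high-value customer whose segment rate crosses a threshold – one mature data source, one simple lookup.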
Competition Analysis in Freight Shipping Grew in Complexity
Be careful of simple models growing in complexity. A favorite example my team and I stumbled into involved a major international freight shipping company looking to improve its forecasting, resource allocation, distribution and supply chain planning. Following good agile methodology, we first built a prototype showing the freight services contracted by high-value customers and how they varied over time. What we hadn't intended was the discovery the company's marketing team made: the negative trends they were seeing with specific customers might be the result of competitive advertising.
Initially, both projects started as simple data modelling exercises using high-quality historical data generated from mature transaction systems and processes.
It was in the freight shipper's model that the marketing team's needs grew in complexity. For the model to be effective, we would need to account for additional sources of data. Basing financial decisions on fending off a competitor required more data than just a drop-off in client shipping contracts.
- We needed to assess the impact of our own advertising, or lack thereof.
- We needed to account for variability across market segments.
- We needed to assess the impact of global economics and foreign exchange rates.
- We needed to monitor delivery performance and customer complaints.
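Each of those sources ultimately has to be joined into one record the forecasting model can consume. A minimal sketch of that assembly step, with invented source names, periods and values purely for illustration:

```python
# Hypothetical sketch: joining the extra sources above into one
# per-period feature record for a forecasting model. Source names
# and figures are invented for illustration.

ad_spend = {"2015-Q1": 120_000}   # our own advertising spend
fx_rates = {"2015-Q1": 1.08}      # e.g. average EUR/USD for the period
on_time = {"2015-Q1": 0.94}       # delivery performance
contracts = {"2015-Q1": 412}      # client shipping contracts signed

def feature_row(period):
    """Assemble one model-ready record; None marks a missing source."""
    return {
        "period": period,
        "ad_spend": ad_spend.get(period),
        "fx_rate": fx_rates.get(period),
        "on_time_pct": on_time.get(period),
        "contracts": contracts.get(period),
    }

print(feature_row("2015-Q1")["contracts"])  # 412
```

Even this toy version surfaces the real work: every added source brings its own keys, refresh cadence and gaps, which is exactly where the IT team's concerns about quality, translation and change come in.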
The IT team pointed out other areas of consideration, such as scalability, third-party data quality, data size, data translation and flexibility to incorporate change.
With all these great ideas flowing, we were also guilty of getting caught up in the excitement. One of the steps going forward should have been revisiting the ROI model and asking, "Is the answer still worth the investment?"
Here's what you don't know about Advanced Analytics and Big Data that can hurt you:
- It’s the data source you didn’t consider when building and testing your algorithm.
- It’s the assumptions made in reducing the data size and culling out what you believe isn’t needed to validate your model’s effectiveness.
- It’s trusting that data quality is high in Big Data, or that poor data quality isn’t statistically significant in Big Data.
To appreciate the uncertainty associated with data modelling and predictive analytics, you need only look at what the big banks are going through right now to pass the stress-testing scenarios required by Dodd-Frank (hint: no passes yet). As for how confident the banking system is in its probability of default (PD) models, the Dow Jones dropped 350 points on rumors the Basel Committee might raise the capital reserve rate. Yes, worries over China's economic recovery and the Fed hinting at raising interest rates occurred on the same day, but I'm betting those two items were already in Wall Street's data model.
An interesting corollary to the point I'm making about data modelling and advanced analytics is a blog post recently published by Stephen DeAngelis, CEO of Enterra Solutions, "Big Data Analytics and the Connected Supply Chain", http://www.enterrasolutions.com/2015/03/big-data-analytics-connected-supply-chain.html, where DeAngelis references several supply chain thought leaders on the benefits Big Data might provide early adopters:
- Boston Consulting Group analysts speaking to the combination of “fast moving, and varied streams of Big Data and advanced tools and techniques such as geoanalytics representing the next frontier of supply chain innovation”
- Deloitte analysts writing, “Maintaining a competitive edge means building a Digital Enterprise that’s capable of taking full advantage of social, mobile, web, cloud and analytic technologies. … It requires integration of people, processes, and capabilities to deliver an omni-channel experience.”
Clearly there's a lot of excitement about the possibilities Big Data offers supply chain management, but it's DeAngelis's reference to Ray Major, Chief Strategist of Halo Business Intelligence, and Major's "Understanding the Supply Chain Data Continuum" (published September 1, 2014, http://halobi.com/2014/09/understanding-the-supply-chain-data-continuum/), in which Major shares a framework "to categorize by maturity (and difficulty) the important aspects of Data", that I found particularly interesting.
The premise of Major's article is the benefit companies gain in business intelligence as they move up the data complexity spectrum: starting with single sources of data, migrating to multiple sources, adding in real-time data (critical to "supply chain information management"), and finally reaching "the holy grail of the data continuum", Big Data. But it's the references to "and difficulty", "data cleansing", and Big Data being the "most complex to manage and utilize" because it uses "unstructured data from external sources" that reinforce what I see as a growing need.
Re-framing the Role of the Information Scientist? Information Architect? Information Manager? Who has Governance?
I've read articles calling for Chief Data Officers, but that title sounds to me like a management position. I've worked with enough data scientists to know an engineer when I see one, and they're not cheap either. From my experience I believe there needs to be a role between what a data scientist provides in data mining, data discovery, statistical analysis and modeling, and what a business analyst provides in domain expertise, detailed business process diagramming and requirements documentation. Someone who has governance over what your company is extracting from Big Data and using as a basis for decision making.
Wikipedia offers a definition of the role of an Information Scientist:
“An information scientist is an individual, usually with a relevant subject degree or high level of subject knowledge, providing focused information to scientific and technical research staff in industry, a role quite distinct from and complementary to that of a librarian. The title also applies to an individual carrying out research in information science”.
Wikipedia goes on to describe how the field of Information Science is evolving and in my opinion the definition of an Information Scientist might need to evolve as well.
While I agree that the role requires a level of subject knowledge, I see this as business knowledge more than "research topic knowledge", and I see the role as providing information to both data scientists and business decision makers. I would reframe the role of the Information Scientist (suggestions for a better title welcome) to include familiarity with internal and external data sources, data integration, data cleansing and data security. An important responsibility for the role would be an unbiased weighing of options, along with creating and maintaining a cost-versus-benefit ROI, both prior to the design and during the build as "scope creep" inevitably enters the picture. As the firm's use of data sources and advanced analytics matures, the long-term role of the Information Scientist should include demystifying Big Data and overseeing the creation of "Data Lakes versus Data Swamps".