If your analysts are using Python, you may have a data problem.
Python is one of the most popular coding languages right now, primarily because (a) it is more user friendly than languages like Java or C#, and (b) it has a lot of built-in data manipulation functionality. Python is a great resource that can be used by relatively low-skilled users to acquire, combine, clean, sort, filter and transform data - i.e. to perform 'data wrangling'.
Data Wrangling: The informal, ungoverned, and largely manual preparation of data for analysis and reporting to meet the specific needs of individuals and/or their immediate colleagues.
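As a rough illustration of what this kind of wrangling looks like in practice, here is a minimal sketch using only Python's standard library. The sample data, the column names and the wrangle function are all hypothetical, invented for this example; a real analyst would more likely be reading an export from a source system.

```python
import csv
import io

# Hypothetical raw export: inconsistent casing, stray whitespace, a missing amount.
RAW_SALES = """region,amount
 North ,100
south,250
NORTH,
South,50
"""

def wrangle(raw_text):
    """Clean, filter, aggregate and sort the raw rows (illustrative only)."""
    totals = {}
    for row in csv.DictReader(io.StringIO(raw_text)):
        region = row["region"].strip().title()   # clean: normalize region names
        amount = row["amount"].strip()
        if not amount:                            # filter: skip rows with no amount
            continue
        # transform: total the amounts per region
        totals[region] = totals.get(region, 0) + int(amount)
    # sort: largest total first
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

print(wrangle(RAW_SALES))  # [('South', 300), ('North', 100)]
```

Every choice in this sketch (how to normalize names, what to do with the missing amount) is a decision made by one individual, which is exactly what makes wrangling informal and ungoverned.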
These are tasks that are commonly taken on by data scientists who are working with new data to determine if and how it will be useful to the broader analyst community in the longer term, or by data engineers building data integration pipelines to present clean, reliable data to that community.
Data Integration: The formal, governed, and automated preparation of data to make it available for a range of uses, including analysis and reporting by a broad range of enterprise users.
Alarm Bells
However, alarm bells should be ringing if your ordinary 'rank and file' data analysts are heavy Python users. The implication is that the data they are accessing is frequently in a rough and raw state, and that each analyst must put in significant manual effort to make it useful for analysis. In an enterprise setting where you may have dozens or even hundreds of analysts this is BAD (capital letters and emphasis intentional).
When large numbers of individuals need to do this kind of processing before using data, we can assume that (a) similar effort is being expended repeatedly by many individuals to get the data ready, and (b) it is highly unlikely that all individuals are 'wrangling' the data in exactly the same way.
The first point indicates that expensive analyst time and effort are being spent on mundane, repetitive tasks. Analysts should be spending their time analyzing data, not wrangling it.
The second point all but guarantees that idiosyncrasies, errors and discrepancies will creep in, causing confusion and disagreement both amongst analysts and amongst the consumers of the analysis they produce, i.e. business leaders.
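To make the second point concrete, here is a small hypothetical sketch of two analysts computing the 'same' average from the same raw data. The figures and function names are invented for illustration; the only difference between the two is how missing values are handled.

```python
# Hypothetical order amounts from a raw export; None marks a missing value.
ORDERS = [120.0, None, 80.0, None, 100.0]

def average_analyst_a(values):
    """Analyst A drops missing values before averaging."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present)

def average_analyst_b(values):
    """Analyst B treats missing values as zero."""
    filled = [0.0 if v is None else v for v in values]
    return sum(filled) / len(filled)

print(average_analyst_a(ORDERS))  # 100.0
print(average_analyst_b(ORDERS))  # 60.0
```

Both approaches are defensible in isolation, yet the two reports would show different numbers for what business leaders believe is a single metric.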
Shadow IT
The widespread and persistent use of Python or other tools like Power Query for data wrangling by business data analysts is a form of 'shadow IT'. Shadow IT is what develops when end users stop asking IT to deliver solutions for them, and decide to implement their own solutions instead. This may be because it is slow or expensive to get IT to facilitate their requirements, or because there is no capacity (technical or otherwise) for such requests to be implemented.
This results in ad hoc solutions, with tasks and processes that really should be centrally planned and controlled running freely 'in the wild'.
In a good enterprise data architecture, the data your analysts are using on a regular basis should already be clean, consistent, reliable and transformed, and aside from occasional proofs-of-concept, the use of Python or other tools for data wrangling should be largely unnecessary.
Where analysts need a new data wrangling task done regularly, they should assume that others need it too, or will need it in the future, and should seek to have it implemented in the data layer so that it is readily available for all users. In making this request, it may become apparent that the task has already been done for them in a manner they were unaware of, or that there is some reason why it shouldn't be done that way, avoiding wasted time and confusion.
Conclusion
The data that most organizations use regularly - sales, CRM, production, operations etc. - is stable and predictable. Even if the business wants to investigate different aspects of the data more closely or in different ways over time, the data itself doesn't really change.
End-user data wrangling in situations like this, like the development of 'spreadmarts' and other shadow IT, is typically a symptom of a dysfunction in data management, rather than a deliberate strategy. Where you see it developing, ask why, and no doubt the can of worms will soon be exposed.
John Thompson is a Director with EY's Technology Consulting practice. His primary focus for many years has been the effective design and management of enterprise data systems.