If your analysts are using Python, you may have a data problem.
(Image from https://pxhere.com/en/photo/760671)

Python is one of the most popular programming languages right now, primarily because (a) it is more user-friendly than languages like Java or C#, and (b) it offers a lot of data manipulation functionality, both built in and through widely used libraries. It is a great resource that can be used by relatively low-skilled users to acquire, combine, clean, sort, filter and transform data - i.e. to perform 'data wrangling'.

Data Wrangling: The informal, ungoverned, and largely manual preparation of data for analysis and reporting to meet the specific needs of individuals and/or their immediate colleagues.

These are tasks commonly taken on by data scientists who are working with new data to determine if and how it will be useful to the broader analyst community in the longer term, or by data engineers building data integration pipelines to present clean, reliable data to that community.

Data Integration: The formal, governed, and automated preparation of data to make it available for a range of uses, including analysis and reporting by a broad range of enterprise users.

Alarm Bells

However, alarm bells should be ringing if your ordinary 'rank and file' data analysts are heavy Python users. The implication is that the data they are accessing is frequently in a rough and raw state, and that significant individual, manual effort is needed to make it useful for analysis. In an enterprise setting where you may have dozens or even hundreds of analysts this is BAD (capital letters and emphasis intentional).

When large numbers of individuals need to do this kind of processing before using data, we can assume that (a) similar effort is being expended repeatedly by many individuals to get the data ready, and (b) it is highly unlikely that all individuals are 'wrangling' the data in exactly the same way.

The first point indicates that expensive analyst time and effort is being spent doing mundane, repetitive tasks. Analysts should be spending their time analyzing data, not wrangling it.

The second point all but guarantees that idiosyncrasies, errors and discrepancies will creep in, causing confusion and disagreement both amongst analysts and amongst the consumers of the analysis they produce, i.e. business leaders.
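To make the second point concrete, here is a minimal sketch of how two reasonable but different wrangling choices can produce different answers from the same raw extract. The records, field names and both analysts' rules are hypothetical, invented purely for illustration.

```python
# Hypothetical raw extract: one order has no region, one order is duplicated.
RAW_SALES = [
    {"order_id": 1, "region": "North", "amount": 100.0},
    {"order_id": 2, "region": None, "amount": 50.0},
    {"order_id": 3, "region": "South", "amount": 75.0},
    {"order_id": 3, "region": "South", "amount": 75.0},  # duplicate row
]


def analyst_a_total(rows):
    """Analyst A drops rows with no region but does not check for duplicates."""
    return sum(row["amount"] for row in rows if row["region"] is not None)


def analyst_b_total(rows):
    """Analyst B keeps rows with no region but de-duplicates on order_id."""
    seen = set()
    total = 0.0
    for row in rows:
        if row["order_id"] in seen:
            continue
        seen.add(row["order_id"])
        total += row["amount"]
    return total


total_a = analyst_a_total(RAW_SALES)
total_b = analyst_b_total(RAW_SALES)
print(f"Analyst A reports {total_a}, Analyst B reports {total_b}")
```

Neither analyst has done anything obviously wrong, yet the two 'total sales' figures will not reconcile when they reach the same business leader.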

Shadow IT

The widespread and persistent use of Python or other tools like Power Query for data wrangling by business data analysts is a form of 'shadow IT'. Shadow IT is what develops when end users stop asking IT to deliver solutions for them, and decide to implement their own solutions instead. This may be because it is slow or expensive to get IT to facilitate their requirements, or because there is no capacity (technical or otherwise) for such requests to be implemented.

This results in the development of ad hoc solutions, with tasks and processes that really should be centrally planned and controlled being executed freely 'in the wild'.

In a good enterprise data architecture, the data your analysts are using on a regular basis should already be clean, consistent, reliable and transformed, and aside from occasional proofs-of-concept, the use of Python or other tools for data wrangling should be largely unnecessary.

Where analysts need a new data wrangling task done regularly, they should assume that others need it too, or will need it in the future, and should seek to have it implemented in the data layer so that it is readily available for all users. In making this request, it may become apparent that the task has already been done for them in a manner they were unaware of, or that there is some reason why it shouldn't be done that way, avoiding wasted time and confusion.
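As a rough sketch of what 'implemented in the data layer' can mean in practice, the cleaning rules are written once, in one governed place, and every analyst consumes the result. The function name, field names and the rules themselves are assumptions made for illustration; in a real architecture this logic would typically live in a governed pipeline or database view rather than a standalone script.

```python
def prepare_sales(rows):
    """Apply one agreed set of rules (hypothetical): de-duplicate on
    order_id and label missing regions as 'Unknown'."""
    seen = set()
    clean = []
    for row in rows:
        if row["order_id"] in seen:
            continue
        seen.add(row["order_id"])
        clean.append({**row, "region": row["region"] or "Unknown"})
    return clean


# Hypothetical raw extract with a missing region and a duplicated order.
raw_sales = [
    {"order_id": 1, "region": "North", "amount": 100.0},
    {"order_id": 2, "region": None, "amount": 50.0},
    {"order_id": 3, "region": "South", "amount": 75.0},
    {"order_id": 3, "region": "South", "amount": 75.0},
]

clean_sales = prepare_sales(raw_sales)

# Any analyst working from the prepared data starts from the same rows.
total_by_region = {}
for row in clean_sales:
    region = row["region"]
    total_by_region[region] = total_by_region.get(region, 0.0) + row["amount"]

print(total_by_region)
```

The specific rules matter less than the fact that they are decided once and applied identically for everyone.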

Conclusion

The data that most organizations are using regularly - sales, CRM, production, operations etc. - is stable and predictable. Even if the business may want to investigate different aspects of the data more closely or in different ways over time, the data doesn't really change.

End-user data wrangling in situations like this, like the development of 'spreadmarts' and other shadow IT, is typically a symptom of a dysfunction in data management, rather than a deliberate strategy. Where you see it developing, ask why, and no doubt the can of worms will soon be exposed.

John Thompson is a Director with EY's Technology Consulting practice. His primary focus for many years has been the effective design and management of enterprise data systems.
