Agile Data Scientists Do Scale

According to a Harvard Business Review article, 'Data Scientists Don't Scale'[1]. I'm a Data Scientist, and, perhaps surprisingly, I pretty much agree.

Due to the hype and rapid growth of Big Data Engineering and Data Science, it seems many companies and practitioners have gotten so excited by hiring, building infrastructure, fashionable models and shiny technology that one crucial part of the field has gone missing - delivery. I hear countless stories, in both small and large companies, where teams are built, clusters bought, prototype algorithms written and software installed, but it then takes months or even longer to deliver working data-driven applications, or for insights to be acted on. Hype is thick in the air but delivery is thin on the ground.

The article and related blogs correctly point out that we should focus on AI applications, that is, automation. My addition is that these applications cannot always be easily bought in for many domains. In such cases they should be built in-house, and the builders ought to be Agile Big Data Engineers and Data Scientists who understand the importance of weekly or fortnightly iteration. The title of Data Scientist is not dead, but keeping Data Science alive will mean shifting the focus of the Data Scientist away from hacking, ad hoc analysis and prototyping and on to high-quality code, automation, applications and Agile methodologies. Let's remember the technology industry has a habit of finding ways to automate the jobs of those who lack the imagination to transition to automators, i.e. those that cannot be scaled.

A decade or two ago delivery used to be a problem in the software development industry as a whole. Releasing working software would also take months or even years. Then along came Agile, XP, Scrum and TDD, and for those practicing them correctly the problem seemed solved - suddenly working software could be delivered every couple of weeks.

I'm going to write some posts on how to apply and adapt Agile methodologies to Data Science and Big Data. I doubt I'll cover all areas; rather I'll focus on the areas that seem most difficult to adapt for data, and most overlooked.

Why Agile methodologies are so lacking in Data Science and Big Data is puzzling. Perhaps it's the youth of the industry? To be frank, I believe a smidgen of elitism aims to distinguish these industries from regular software development, as though the practices and principles are beneath the concerns of mighty data minds. Another issue seems to be a big misconception that "exploratory" work precludes frequent iteration on automated end-to-end applications - that is, Data Scientists claim they need to "explore" for a month or two before they can deliver. This I find ironic, since the tension between exploratory work and continuous delivery is exactly what Agile solves. Finally, another recurring misconception is that the day-to-day practices of Agile, like tests, automation, clean code and clean structure, are "time consuming" and will slow down "exploratory" work. This too is ironic, since again Agile aims to make exploratory work faster and less laborious. Hopefully the details of my posts will flesh out why these objections are misconceptions.

Lost Agile Concepts

Automated tests are absolutely critical to correctly practicing Agile, and from TDD evolved more acronyms and terms than many Data Scientists have written tests: TDD, BDD, DDD, ATDD, SDD, EDD, CDD, unit tests, integration tests, black-box tests, end-to-end tests, system tests, acceptance tests, property-based tests, example-based tests, functional tests, contract-based tests, etc. At a glance, things like interactive work, long-running jobs, unclear objectives and peculiar development environments seem to preclude *DD approaches. Nevertheless, if one strips away the unnecessary verbosity of *DD, the remaining core can easily accommodate such problems.
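To make this concrete, here is a minimal sketch (the function name and validation rules are illustrative, not from any particular project) of how even "exploratory" cleaning logic can be pulled into a plain function and covered by a unit test, with no framework required to get started:

```python
def parse_age(raw: str):
    """Normalise a raw age field: strip whitespace and reject
    non-numeric or out-of-range values by returning None."""
    value = raw.strip()
    if not value.lstrip("-").isdigit():
        return None
    age = int(value)
    return age if 0 <= age <= 120 else None

# A plain unit test: runs anywhere, documents the cleaning rules.
def test_parse_age():
    assert parse_age(" 42 ") == 42     # whitespace tolerated
    assert parse_age("abc") is None    # non-numeric rejected
    assert parse_age("-5") is None     # out of range rejected
    assert parse_age("200") is None
```

Once a rule like this lives in a tested function rather than a notebook cell, "exploring" a new dataset means re-running the tests, not re-deriving the logic.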

The next most lacking parts of the data profession seem to be high-quality code, code structure and cross-functional teams, which go hand in hand. Most definitions of Data Science actually include "hacking" as a skill. If writing bad code is part of the definition of Data Science, it's no wonder some believe the profession "can't scale". Cross-functional teams and cross-functional team members have the obvious benefit of being able to deliver end-to-end working (data-driven) software. In the data professions this means the team must be able to ETL, clean, prototype, collaborate and productionise. Collaboration and productionisation cannot happen without high-quality code, which is why I consider these principles to go hand in hand.
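One hedged sketch of what "high-quality code that can ETL, clean and productionise" might look like in miniature (the field names and rules are invented for illustration): each stage is a small pure function, so every stage is independently testable and the whole pipeline is just composition.

```python
import csv
import io

def extract(raw_csv: str):
    """Extract: parse raw CSV text into dict rows."""
    return list(csv.DictReader(io.StringIO(raw_csv)))

def transform(rows):
    """Transform: keep only rows with a numeric amount,
    converted to float; silently skip malformed rows."""
    cleaned = []
    for row in rows:
        try:
            cleaned.append({**row, "amount": float(row["amount"])})
        except (KeyError, ValueError):
            continue
    return cleaned

def load(rows):
    """'Load' step for this sketch: total amount per user."""
    totals = {}
    for row in rows:
        totals[row["user"]] = totals.get(row["user"], 0.0) + row["amount"]
    return totals
```

A prototype in a notebook can call `load(transform(extract(raw)))` directly, and the same functions can later be wired into a scheduler or a distributed framework, which is exactly the collaboration-and-productionisation property argued for above.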

Finally, the industry desperately needs to learn the lessons already learnt in web development around decoupling separate responsibilities. We should take inspiration from the MVC (Model View Controller) architectural pattern to clean up Data Science development. I'll introduce A2EM (Ad Hoc Analysis to ETL & Evaluation, Models & Mathematics), whose primary goal is to decouple production code (EM) from one's ad hoc analytical environment via a simple pattern and process in which two locations and two development tools (notebooks and IDEs) are used (by a single cross-functional team member).
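As a rough sketch of the decoupling idea (the module and function names are hypothetical, not part of the A2EM definition): the production side keeps model code pure, with no plotting, hard-coded paths or global state, and the notebook side merely imports and visualises it.

```python
# Production-side module (the "EM" side): pure, importable, testable.

def moving_average(xs, window):
    """A tiny 'model' computation kept free of notebook concerns."""
    if window <= 0 or window > len(xs):
        raise ValueError("window must be in 1..len(xs)")
    return [sum(xs[i:i + window]) / window
            for i in range(len(xs) - window + 1)]

# Notebook-side (the ad hoc analysis side) would only import and plot:
#   from em_module import moving_average
#   plt.plot(moving_average(series, window=7))
```

The point of the split is that the notebook stays a disposable view over the data, while everything worth keeping already lives in version-controlled, tested production code.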

Why Should Data Professionals Care?

Sometimes Data Science can feel like academia, except much better paid. So until the bubble bursts, which still doesn't seem to be any time soon, should we just have as much fun as possible?

I feel Data Scientists and Big Data Engineers have a responsibility to their business and to the profession to also focus on continuous delivery via Agile methodologies. When executives start noticing a lack of return on investment, slow delivery or scalability problems, they will start hunting for a silver bullet, i.e. an easy-to-understand solution to a complicated problem. Sometimes silver bullets do exist, particularly when the solution required is either not domain specific, or the domain is large enough to span many companies. For example, it would be insane for a company to try to build its own chat platform, since this is obviously going to be a solved problem. As platforms or outsourcing solutions are required by fewer and fewer companies, there comes a cut-off point where in-house solutions become more efficient. It's a trade-off between economies of scale and the diseconomies of middlemen and lack of specificity.

If data professionals fail to show ongoing business value in-house and repeat "just one more month", then executives may turn to an external solution even when such a solution is completely inappropriate and will fail miserably. The reaction in software development to healing the divide between the business and developers was Agile. Agile in-house teams would often be chosen over solution providers; consequently businesses saved money, developers made money and society overall became more efficient. In fact many developers now claim to actually enjoy Agile practices, even TDD.

It's up to us to change the culture from the bottom up, because Agile happens mainly at the bottom. Agile is not just physically standing up for daily meetings, or having a retrospective every two weeks. The devil is in the details, and we, engineers and scientists, are the closest to those details.

Ask yourself: "do I deliver working (data-driven and automated) output that has some business value and some entry point independent of my own skills, knowledge and environment, and do I do this every couple of weeks?" Or, more simply: "do I scale?"

Follow Up Posts

https://www.garudax.id/pulse/agile-data-code-structure-quality-sam-savage

https://www.garudax.id/pulse/data-science-test-driven-development-sam-savage 

References:

[1] - https://hbr.org/2015/05/data-scientists-dont-scale

Random Links & Inspiration:

Agile Manifesto: http://agilemanifesto.org

New London Meetup - Agile Data: http://www.meetup.com/Agile-Data-London-Meetup/

(Hilarious) The Land That Scrum Forgot: https://www.youtube.com/watch?v=hG4LH6P8Syk

Clean Architecture and Design: https://www.youtube.com/watch?v=Nsjsiz2A9mg

(Another) Clean Architecture: https://www.youtube.com/watch?v=Nltqi7ODZTM 

Simple Design: http://www.jamesshore.com/Agile-Book/simple_design.html

Done Means Done: http://www.allaboutagile.com/agile-principle-7-done-means-done/

(Another) Simple Design: http://guide.agilealliance.org/guide/simple-design.html

Professional Software Development: https://www.youtube.com/watch?v=zwtg7lIMUaQ

Refers to CRISP-DM as "waterfall" http://www.kbartocha.com/2013/07/10/agile-analytics-11-tips-to-make-your-repotinganalytics-more-lean/ 

Comments

Hi Sam, good to see people talking about this issue. On the research team I lead, we're experimenting with agile workflows for many of the reasons you mention here.


Hi Sam! Nice post. I am really interested in articles like yours. I think the last paragraph should be read by everyone who works with BI and Data Science. For company leaders the question should be "Are my teams scaling?" or something like that.


Jacek Kowalski Thanks for the extended explanation, I believe we are pretty much on the same page now. Yes, not everything can be automated *now*, there will always be exceptions, and in fact in my post I even support the notion that this becomes harder the more specialised a problem is. Nevertheless I still believe Agile methodologies can be applied to the vast majority of cases, and in some sense even those edge cases where code is executed only once. Yes, language choice can make certain Agile methodologies harder, especially R. I choose Scala since at a fundamental level it is vastly more powerful than R, Python and SAS (though it lacks ML libraries). There are a couple of jokes in the Scala world: "if you can't do it in one line, you're probably doing something wrong", and "once you go Scala you don't go back". Anyway, the main point here is that Scala is not only a highly mathematical language (since its foundation is Functional Programming, Category Theory, Lambda Calculus and Type Theory) but also a production-worthy language, since it is an easily testable, functional, statically (very) strongly typed language that interoperates easily with JVM tech stacks. Of course the biggest caveat with Scala is that it's notorious for having the steepest learning curve; yes it's 10x more powerful, but I'd say it takes at least 10x longer to learn than most other languages. Anyway, I digress. Thanks for your input, I enjoyed the discussion.


> I'm extremely suspicious of any claim of large amounts of non-trivial code that is executed once, with a single set of parameters, which then generates a report with a single unchanging view. [...] I think saying you "totally disagree" is a bit strong

I can give you many examples of complex one-off disposable code. Years ago we had to decide on peering agreements based on network traffic data. First the Internet traffic flow data needed to be collected, which required some experimental design. Then the flows had to be resolved by autonomous systems and domains, which was a rather slow process, so to speed it up we again resorted to experimental design (PPS sampling). Once we were done a decision was made, agreements were signed and the code was never revisited again. Another example was a network test transponder deployment strategy, which involved a lot of simulations for different experimental designs. The input was a large teletraffic matrix and some technology-specific experimental parameters. Once the optimum deployment strategy was found, that was it. What I am trying to demonstrate is that it all depends on the type and complexity of the problem you are dealing with.

I don't know if you have any experience writing R/S-Plus code, but your earlier comparison of it to a shell script is grossly misplaced. The power of R is that there are hundreds of packages with very powerful and often cutting-edge analytics written by a large community of users. Those packages are often written in C and contain thousands of lines of complex code which I don't have to write. Although my R code may have 100 lines, it is in fact very complex, but the complexity is hidden away from me. Given the typical size of the R code, its maintenance is not an issue. When I deal with a new problem and related data that is supposed to be processed on an ongoing basis, I always try to quickly prove the concept before making my recommendations for developers. There is usually too much work involved in building a proper production application to risk it without prior verification of the analytics, and as I mentioned earlier, unless you deal with a simple problem, analytics is the hardest part, at least in the kind of problems I deal with. Obviously R is not suitable for most production applications, but from my point of view it's the best exploration and PoC tool for analytics.

Obviously, saying that I totally disagree with you was an exaggeration in response to your equally strong claims about analytics being low-hanging fruit. As you know there is an area termed Predictive APIs that builds analytics for the masses, and I have a keen interest in it (I am on the program committee of the PAPIs 2015 Conference in Sydney, http://www.papis.io/sydney/, which I would recommend you attend). However, even companies that have been playing in this space for a while do not claim that they can automate everything and eliminate data scientists.


Spot on. Data Science can't scale because it is practiced like random hacking. The roots of Data Science come from research and academic institutions, and these have never been subject to 'productization'. So it is not so odd that Agile methodologies have not been applied to this area. I agree that Agile needs to be applied to this space; however, the big open question is how. It is not that obvious. At Alluviate we are working towards that goal: how to build a platform to do Data Science in a scalable manner.
