Statistical Computing Environment: Why?
In today’s world, every drug development organization should have a modern statistical computing environment (SCE). I have been involved in the creation and provisioning of several SCEs at this point, and I can vouch that setting one up is not easy, especially if you have many ongoing studies, but the switch is worth making.
What is an SCE?
In short, this is a system that houses all of the statistical computing needs of the company, including (1) data storage; (2) computing capabilities; and (3) statistical languages.
Many organizations have historically used desktops or desktop-like computing for the computing engine, which also hosts SAS as the language, and a shared drive for their storage. Computing grids and virtual machines expanded this capability, but did not fundamentally alter the paradigm. But in the modern era of large real-world evidence (RWE) data sets, AI capabilities, connected data, and more, this framework needs to be revisited.
The new SCE will make use of distributed computing, modern data flow architectures, and newer, more efficient languages. As I discussed in a prior post, programmers now need to be polyglots, not focused on a single language (e.g., SAS). Each language has its own advantages and disadvantages, and programmers should be able to take advantage of each.
An SCE is a cloud-based computing environment that provides statistical computing power, data storage, backups, multiple statistical languages, collaboration capability, global presence, and programming efficiency and robustness. You can outsource this capability to others, or, as larger organizations tend to do, bring it in-house.
Advantages of SCE
Compared with the traditional desktop-based computing environment (and I include virtual machines as part of the desktop environment), there are numerous advantages to an SCE in today’s environment. See the chart below.
I would like to talk a little about the collaboration and computational power advantages, because that is where SCEs shine.
Collaboration
A colleague and I were working on a tricky analysis, each on a different piece of the program. We had both been working on this particular analysis for about one working day, say 7 hours each. I was using Notepad++, a nice editor that eases collaboration: it alerts you when the file you are working on has changed on disk, so you can pull those changes into your version. The other programmer used Notepad, and overwrote everything I had done: 7 hours washed out!
Maybe if my colleague had been using Notepad++ this would have been less of an issue, but this sort of loss can still happen. Contrast this with systems like Git, which are typically baked into SCEs from the get-go. Git-like systems preserve every version that is committed, so if someone else stomps on top of your code, you can step back a version if needed.
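To make the "step back a version" idea concrete, here is a small sketch of the overwrite scenario replayed inside a Git repository, driven from Python. The file name and commit messages are purely illustrative; the point is that the earlier commit survives the overwrite and can be restored with one command.

```python
import pathlib
import subprocess
import tempfile

def git(repo, *args):
    """Run a git command inside `repo` (committer identity set inline so it works anywhere)."""
    return subprocess.run(
        ["git", "-c", "user.name=demo", "-c", "user.email=demo@example.com", *args],
        cwd=repo, check=True, capture_output=True, text=True,
    ).stdout

with tempfile.TemporaryDirectory() as repo:
    prog = pathlib.Path(repo, "analysis.sas")  # hypothetical program file
    git(repo, "init", "-q")

    # My version: committed, so Git remembers it.
    prog.write_text("* seven hours of careful work;\n")
    git(repo, "add", "analysis.sas")
    git(repo, "commit", "-qm", "my version of the analysis")

    # A colleague overwrites the file and commits on top.
    prog.write_text("* colleague's overwrite;\n")
    git(repo, "commit", "-aqm", "overwrite")

    # Step back one version: restore the file as it was in the previous commit.
    git(repo, "checkout", "HEAD~1", "--", "analysis.sas")
    restored = prog.read_text()
```

After the checkout, `restored` contains the original work; nothing was lost, because Git kept both versions in history.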
Statistical programming can be thought of as a software shop that produces lots of little programs, since most of the activity is programming. If we are professional programmers, it makes sense to use professional tools. Doing so allows more people to work on a given problem while eliminating code loss through overwrites, accidental moves, and much more.
Computational Power
You have a deadline 2 days away to finish a table, and it requires you to process 20 million patients from a real-world data set, put a sample of those into SDTM format, and include them in the analysis alongside the data collected from the controlled trial. And because you are working on a grid with a single CPU, your program will take 3 days to run. Yes, this scenario is becoming more common.
Another one: the FDA has asked for a pharmacokinetic (PK) model to be updated, requiring 1,000 bootstrapped models to be fit, and the response needs to go back out tomorrow.
Another one: you need to generate data listings that show the progress of data cleaning, and this requires merging several very large data sets, producing gigabyte-scale outputs. And your organization needs to do this daily on several large trials with upcoming database locks.
I could go on, but I think these few examples establish the need.
Clinical trial data sets have historically not been considered large, and have not required huge computational capabilities in the way we think of those now, but that is changing. Demographers have always had large data sets, such as census data, whereas a clinical trial might have data on perhaps 1,000 patients, 20 domains, and 8 visits per patient. With the advent of connected devices that transmit data frequently, and the inclusion of real-world data, data set sizes continue to grow.
To solve these problems, you need systems with parallel computing capabilities, with multiple nodes that can be spun up when needed, and that can utilize the computing language best suited to the task.
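The bootstrap scenario above is a good illustration of why parallelism matters: the 1,000 model fits are independent, so they can be fanned out across cores (or, on a real SCE, across nodes). Below is a minimal, self-contained sketch using only Python's standard library, with a simple mean estimate standing in for an actual PK model fit; all function names and numbers are illustrative, not a real workflow.

```python
import random
import statistics
from concurrent.futures import ProcessPoolExecutor

def fit_one_bootstrap(args):
    """Resample the data with replacement and refit the (toy) model."""
    data, seed = args
    rng = random.Random(seed)          # per-replicate seed for reproducibility
    resample = rng.choices(data, k=len(data))
    return statistics.mean(resample)   # stand-in for a real model fit

def bootstrap(data, n_boot=1000, workers=4):
    """Run n_boot independent bootstrap fits across a pool of worker processes."""
    tasks = [(data, seed) for seed in range(n_boot)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fit_one_bootstrap, tasks))

if __name__ == "__main__":
    rng = random.Random(0)
    observed = [rng.gauss(100, 15) for _ in range(500)]  # fake observed data
    estimates = sorted(bootstrap(observed))
    lo, hi = estimates[25], estimates[975]               # approx. 95% percentile CI
    print(f"95% bootstrap CI for the mean: ({lo:.1f}, {hi:.1f})")
```

Because each replicate is independent, wall-clock time drops roughly linearly with the number of workers; an SCE extends the same pattern from local cores to elastic clusters.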
Using SCEs that can handle the data storage, data transfer, and computational needs will help you solve these problems in the timeframe required.
Some Requirements To Consider
Your requirements will be a function of what you do, your organization’s size, interactions with others, how wedded you are to your current language, etc. But some considerations are below.
Compliance & Governance (GxP)
Language Agility
Data & Environment Control
Workflow Automation
Fit For Your Team
Quick Comparison of Some Top Vendors
Basic Implementation Roadmap
Some things to think about when setting up an SCE. I have scoped this at about 90 days, which is achievable for smaller organizations. For larger ones, expect it to take longer.
Conclusion
Moving to an SCE, done properly, allows for greater efficiency, more stable environments, broader analytical capabilities, and reduced timelines. The natural inclination is to assume that the current computing systems are adequate, since they are working. But the competitive environment is changing quickly, and we need to look at our computing needs in that light. Moving to an SCE requires planning, but is not as expensive as you may think.
If you would like a discussion about your SCE, feel free to reach out to me through a DM. I have been involved in several of these, including with Domino Data Labs and SAS, and both of these (and others) have a lot to offer.