Statistical Computing Environment: Why?

In today’s world, every drug development organization should have a modern statistical computing environment (SCE). I have been involved in the creation and provisioning of several SCEs at this point, and I can vouch that setting them up is not easy, especially if you have many ongoing studies. But there is a real need to make the switch.

What is an SCE?

In short, this is a system that houses all of a company’s statistical computing needs, including (1) data storage; (2) computing capability; and (3) statistical languages.

Many organizations have historically used desktops or desktop-like computing as the computing engine, hosting SAS as the language, with a shared drive for storage. Computing grids and virtual machines expanded this capability, but did not fundamentally alter the paradigm. In the modern era of large real-world evidence (RWE) data sets, AI capabilities, connected data, and more, this framework needs to be revisited.

The new SCE will make use of distributed computing, modern data flow architectures, and newer, more efficient languages. As I discussed in a prior post, programmers now need to be polyglots rather than focused on a single language (e.g., SAS). Each language has its own advantages and disadvantages, and programmers should be able to take advantage of each.

An SCE is a cloud-based computing environment that provides statistical computing power, data storage, backups, multiple statistical languages, collaboration capability, global presence, and programming efficiency and robustness. You can outsource this capability, though larger organizations will tend to bring it in-house.

Advantages of SCE

Compared with the traditional desktop-based computing environment (and I include virtual machines as part of the desktop environment), an SCE offers numerous advantages today. See the table below.

Table 1. Advantages of SCE vs traditional desktop-based systems

I would like to talk a little about the collaboration and computational power advantages, because that is where SCEs shine.

Collaboration

A colleague and I were working on a tricky analysis, each handling different pieces of the program. We had both been working on this particular analysis for about one working day, say 7 hours each. I was using Notepad++, a nice editor that facilitates collaboration: it lets you know if the file you are working on has changed elsewhere, and then pulls those changes into your version. The other programmer used Notepad, and overwrote everything I had done: 7 hours washed out!

Maybe if my colleague had also been using Notepad++ this would have been less of an issue, but this sort of loss can still happen. Contrast this with systems like Git, which are typically baked into SCEs from the get-go. Git-like systems preserve every version that is saved, so if someone stomps on top of your code, you can step back a version if needed.
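As a minimal sketch of what that recovery looks like, assuming a purely illustrative file path: the same two Git commands work at any terminal, and here they are driven from Python so the snippet is self-contained.

    # Recover the previous saved version of a file that was overwritten.
    import subprocess

    path = "programs/tables/t_ae_summary.R"  # illustrative path, not a real study file

    # Show the last few commits that touched the file, to pick the right one.
    subprocess.run(["git", "log", "--oneline", "-5", "--", path], check=True)

    # Restore the file as it was one commit before the current one (HEAD~1).
    subprocess.run(["git", "checkout", "HEAD~1", "--", path], check=True)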

Statistical programming can be thought of as a waterfall-style programming shop that produces lots of little programs, since most of the activity is programming. If we are professional programmers, it makes sense to use professional tools. These allow more people to work on a particular problem while eliminating the loss of code through overwriting, accidental moves, and much more.

Computational Power

You have a deadline two days away to finish a table, and it requires you to process 20 million patients from a real-world data set, put a sample of those into SDTM format, and include them in the analysis alongside the data collected from the controlled trial. And because you are working on the grid with a single CPU, your program will take 3 days to run. Yes, this scenario is getting more common.

Another one: the FDA has asked for a PK model to be updated, requiring 1,000 bootstrapped model fits, and the response needs to go back out tomorrow.
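Tasks like this parallelize naturally. Below is a minimal sketch in Python, with a stand-in fit_pk_model() in place of a real PK fit (both the function and the data are illustrative); on an SCE the same fan-out pattern can run across many nodes rather than the handful of local cores used here.

    # Fit 1,000 bootstrap replicates in parallel across local CPU cores.
    from concurrent.futures import ProcessPoolExecutor
    import numpy as np

    def fit_pk_model(values: np.ndarray) -> float:
        """Stand-in for a real PK model fit; returns one summary parameter."""
        return float(values.mean())

    def one_bootstrap(seed: int, values: np.ndarray) -> float:
        rng = np.random.default_rng(seed)  # per-replicate reproducible RNG
        resample = rng.choice(values, size=values.size, replace=True)
        return fit_pk_model(resample)

    if __name__ == "__main__":
        values = np.random.default_rng(0).lognormal(size=500)  # toy concentration data
        with ProcessPoolExecutor() as pool:
            fits = list(pool.map(one_bootstrap, range(1000), [values] * 1000))
        print(np.percentile(fits, [2.5, 97.5]))  # bootstrap 95% interval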

Another one: you need to generate data listings that show the progress of data cleaning, and this requires merging several very large data sets, creating gigabyte-sized outputs. And your organization needs to do this daily on several large trials with upcoming database locks.
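One way to keep a merge like that inside the memory of a modest worker is to stream the large file in chunks. A hedged sketch, with illustrative file names (USUBJID is the usual CDISC subject identifier), using Python's pandas:

    # Merge a very large labs extract against demographics without loading it whole.
    import pandas as pd

    demog = pd.read_csv("dm.csv")  # the smaller table fits comfortably in memory

    pieces = []
    for chunk in pd.read_csv("lb.csv", chunksize=1_000_000):  # stream the big table
        pieces.append(chunk.merge(demog, on="USUBJID", how="left"))

    listing = pd.concat(pieces, ignore_index=True)
    listing.to_parquet("cleaning_listing.parquet")  # compact and fast to re-read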

I could go on and on, but I think these few examples establish the need.

Clinical trials have historically not been considered large data sets, nor have they required huge computational capabilities as we think of those features now, but that is changing. Demographers have always had large data sets, such as census data, whereas a clinical trial might have data on maybe 1,000 patients, 20 domains, and 8 visits per patient. With the advent of connected devices that send data very frequently, and with the inclusion of real-world data, the data sets continue to grow.

To solve these problems, you need systems with parallel computing capabilities, with multiple nodes that can be spun up when needed, and that can utilize the computing language best suited to the task.

Using SCEs that can handle the data storage needs, the data transfer needs, and the computational needs will facilitate solving these problems in the timeframe required.

Some Requirements To Consider

Your requirements will be a function of what you do, your organization’s size, your interactions with others, how wedded you are to your current language, and so on. But some considerations are below.

Compliance & Governance (GxP)

  1. Audit Trails: Does it capture the "Who, What, When, and Why" for every code execution? This is becoming increasingly important to regulators.
  2. Validation: Does the vendor provide an IQ/OQ (Installation/Operational Qualification) package?
  3. Part 11 Compliance: Support for electronic signatures and forced versioning of outputs.
  4. eTMF: Ability to file programs and outputs into the eTMF automatically.

Language Agility

  1. Does it support SAS, R, and Python simultaneously? Legacy code is usually in SAS, though many bespoke procedures are in R, and for many biometrics functions Python will soon be the language of choice. Julia next?

Data & Environment Control

  1. Containerization: Can you "lock" a specific version of an R package (e.g., ggplot2 v3.4) so that an update doesn't break your 3-year-old study? This is really important for R (or for Python packages, if you use Python for the statistical analyses), but the problem arises even with SAS. When the FDA asks for a check on an analysis with a new population flag, you need to be able to replicate the old settings quickly; see the sketch after this list.
  2. Centralized Storage: Direct integration with Clinical Data Repositories (CDR) or EDCs.
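As a small illustration of the idea, here is a hedged sketch that checks a running Python environment against a study's pinned package versions (the pins are made up). Real solutions use containers or lockfile tools (renv for R, for example), but the principle is the same.

    # Verify the environment matches the versions the study was locked to.
    from importlib.metadata import version

    PINNED = {"pandas": "2.1.4", "numpy": "1.26.2"}  # illustrative pins

    for name, expected in PINNED.items():
        found = version(name)
        if found != expected:
            raise RuntimeError(f"{name}: study pinned {expected}, found {found}")
    print("Environment matches the study's pinned package versions.")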

Workflow Automation

  1. Does it have a "Promotion" workflow (Dev → Test → Prod)? Having some ability to automate and track this helps significantly.
  2. Can it automate the creation of a "Submission-Ready Package" (SDTM, ADaM, TLFs)? This is harder than it looks, and AI tools will get you some of the way there, but humans are still incredibly important for this task.
  3. Does it allow you to reuse work from a prior project on a new one (e.g., Table A from the last study supporting Table A in the current study, assuming the same customer for both)?

Fit For Your Team

  1. How easy is it to get information into and out of the SCE?
  2. How much training is required to use it?
  3. Can you expand it if needed, either for a specific project or because your business is growing?
  4. Do the commercial terms fit with your business’s cash flow and revenue needs? Do you need an asset to depreciate, or do you need to rent capability?
  5. How effective is the customer service? Some surprisingly expensive services have ineffective customer service.
  6. Are there any additional features that need to be added to the system? Any customization?
  7. What are the uptime guarantees? Global organizations need close to 99.9% uptime. Note that 99.9% uptime still allows around 9 hours of downtime per year (see the quick calculation below): can your organization afford that?
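For reference, turning an uptime guarantee into expected annual downtime is a one-line calculation; 99.9% works out to roughly 8.8 hours per year.

    # Hours of downtime per year implied by an uptime guarantee.
    for uptime in (0.999, 0.9995, 0.9999):
        print(f"{uptime:.2%} uptime -> {(1 - uptime) * 365 * 24:.1f} hours/year down")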

Quick Comparison of Some Top Vendors

Table 2. Quick comparison of some top vendors. You should always verify the suitability of a solution for your organization. Each of these offerings is complex and has large capabilities, most of which are not listed here. And there are other vendors.

Basic Implementation Roadmap

Some things to think about when setting up an SCE. I have sketched this as roughly a 90-day plan, which is achievable for smaller organizations. For larger ones, expect it to take longer.

Phase 1: Environment Hardening (Weeks 1–4)

  1. Set up a secure cloud VPC (e.g., AWS, Azure, Oracle). Note that some vendors do this for you.
  2. Install the SCE software and configure Single Sign-On (SSO). SSO itself can take several weeks, and many vendors charge an SSO fee.
  3. Plan out data flows from outside vendors and from internal data sources (e.g., Rave). Do you need to standardize data into a data lake? If so, I recommend doing this on the pilot study; it will add at least another month to the process.

Phase 2: Validation & Qualification (Weeks 5–8)

  1. Write and execute Installation Qualification (IQ) and Operational Qualification (OQ).
  2. Create Standard Operating Procedures (SOPs) for code promotion and data access. Your SOPs may need to change if the system handles code promotion for you. And do not forget data retention policies, which may need to be reviewed in light of current regulations and the new technologies. Archived data can be relegated to cheaper, slower storage.

Phase 3: Pilot Study Migration (Weeks 9–12)

  1. Select one "low-risk" study (usually a Phase I trial or an ongoing meta-analysis).
  2. Train employees on the new workflow. This includes data movement into and out of the SCE, code sharing, code development through Git-type features, output delivery methods, and any AI/ML capabilities in the system that may be new.

Conclusion

Moving to an SCE, done properly, allows for greater efficiency, more stable environments, broader analytical capabilities, and reduced timelines. The natural inclination is to assume that the current computing systems are adequate, since they are working. But the competitive environment is changing quickly, and we need to look at our computing needs in that light. Moving to an SCE requires planning, but it is not as expensive as you may think.

If you would like a discussion about your SCE, feel free to reach out to me through a DM. I have been involved in several of these, including with Domino Data Labs and SAS, and both of these (and others) have a lot to offer.
