Validating R and Python for GxP Use in the Life Sciences

Validation as a Life Science industry praxis and requirement is nothing new. Validating GxP solutions, systems and apps is well understood, and applying those same software validation principles to R and Python packages has been a point of discussion for quite some time. However, validating an R or a Python package for one-off analyses or to establish standard analysis methods is not the same as validating an app, system or solution with a predefined set of static features.

Most often, we associate validation with executing one or more tests. Testing is indeed part of validation, but validation is an entire process in which testing is one of the instruments we have to demonstrate and document that a package, in our case, is fit for purpose. Purpose is simply how, where and when we intend to use it.

The initial approach was to perceive R and Python, when used for analysis, as statistical analysis applications, similar to the classification of SAS, making their respective packages modules. This would make R and Python configurable (modular) software, which is notoriously difficult to validate if the number of modules exceeds a handful and the modules can theoretically be combined in any number and any order. However, once a package is installed and subsequently loaded or imported, i.e. via R's library and require functions and Python's from and import statements, the package functions are seldom modular. In a simplified perspective, they consist of static functions and computational methods. Hence, they would then be categorised as non-configurable software (GAMP 5 Category 3).
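As a minimal sketch, the distinction can be seen in how a package enters a session; jsonlite is used here purely as an arbitrary example:

    # Once installed, a package is loaded or imported into the session and its
    # functions behave as a fixed set of functions and computational methods,
    # not as freely re-combinable modules.
    install.packages("jsonlite")   # install once per environment
    library(jsonlite)              # attach the package to the session
    # require(jsonlite) behaves the same, but returns FALSE instead of
    # erroring when the package is missing
    fromJSON('{"x": 1}')           # a static function call, identical every session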

If we instead view packages as language extensions, then R and Python packages suddenly become indistinguishable from the base language. But we rarely validate a language, as it is essentially software infrastructure, i.e. GAMP 5 Category 1, whose deployment should be well documented but not necessarily independently validated as part of Computer System Validation (CSV). This point of view would conflict with internal expectations on validation and with ICH Good Clinical Practice (GCP), which stipulates that a system and its components and associated processes used for analysing clinical trial data should be validated. This low-level categorisation would probably also not pass an audit or inspection, not just for GCP but also for other GxP domains, so we have to do something.

As an industry, we have adopted a risk-based approach. How, then, do we manage the inherent risk of using the R and Python languages, including the language extensions in the form of packages, for data processing and analysis when the validation requirement is less than if they were part of a statistical application? Rather than validate R and Python plus packages from a software point of view, what would happen if we instead assessed the risk of using R and Python for data processing and analysis based on the Quality Control (QC) processes, e.g. the quality checks and output validation that are a staple of our day-to-day tasks, to ensure that the processing and analysis methods are applicable and appropriate and produce the correct result (ICH GCP (E6 R3) section 3.16.2(f))?

Packages and packages

There is a broad consensus that not all packages are equal. We use packages for everything from configuring our compute environment and program sessions to simple data processing and advanced analysis. Sometimes a single process or program will use packages across the entire spectrum.

The R Validation Hub, formed by the PSI AIMS Special Interest Group in 2018, published the white paper A Risk-based Approach for Assessing R package Accuracy within a Validated Infrastructure in early 2020. The white paper proposed that packages belong to one of two categories, Intended for Use and Imports, where the Intended for Use packages are in principle the only ones that users access and call directly. The collection of Imports is there only to ensure that the Intended for Use packages can function. However, it is very common that packages categorised as Imports are used directly, either by users or by AI-driven tools simply discovering that they are there, installed and available.

The white paper further divides packages into Statistical and Non-statistical categories, which are then employed as part of determining the overall risk for a particular package (more on that later).

In practice, we employ packages to serve different purposes. In a general sense, we could divide packages into three broad categories, illustrated in the sketch after the list, instead of the matrix of Intended for Use versus Imports and Statistical versus Non-statistical.

  • Utility - A package providing utility functions that facilitate work and are used for convenience, but are not strictly necessary
  • Framework - A package consisting of functions and computational methods that require direct user input to function
  • Method - A package that provides implementations of functions and computational methods with limited or no user input
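To make the categories concrete, consider the following minimal R sketch; the package-to-category assignments are illustrative, not normative, and the file name is hypothetical:

    # Utility: convenient but not strictly necessary, e.g. building file paths
    library(here)
    here("data", "adsl.rds")

    # Framework: functions that require direct user input to produce anything,
    # e.g. a plot built entirely from user-supplied mappings and layers
    library(ggplot2)
    ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()

    # Method: a computational method with limited or no user input,
    # e.g. a Kaplan-Meier estimate from the survival package
    library(survival)
    survfit(Surv(time, status) ~ 1, data = lung)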

The Utility and Framework packages have one other useful quality: if they do not produce the result we expect, we can either not use them or correct the issue ourselves to ensure that the expected result is produced. As a simple example, if the cell width of an output is too narrow, we simply increase the cell width; or if the model does not fit, we submit an updated model, when and if permitted.

The Method packages are often similar to the Statistical packages described in the R Validation Hub white paper, with the exception that a Statistical package may belong to either the Framework or the Method category depending on the degree of user input. As we will discover after a brief segue through basic validation principles, that degree of user input has a significant impact on risk.

Basic Validation Principles

Validation of GxP systems, tools and utilities has been a fundamental requirement in the industry for as far back as most can remember and, in many ways, has not changed, even with the introduction of risk-based and agile approaches. To keep it simple, we shall focus on GCP, but be mindful that the same approaches can be extended to any GxP domain where R and Python are used for data processing and analysis.

ICH Good Clinical Practice (E6) Revision 3 section 9.3

Computerised systems used in clinical trials should be fit for purpose (e.g., through risk-based validation, if appropriate), and factors critical to their quality should be addressed in their design or adaptation for clinical trial purposes to ensure the integrity of relevant trial data.

Validation can apply to anything from an entire solution or environment to one of its subcomponents, such as an application, a library, a package, a utility or a macro. For simplicity, most solutions and environments are commonly a sum of their validated parts, i.e. their components. If all components are validated and properly maintained, the solution or environment is deemed to be in a validated state. This approach is common since a change in a component then does not trigger a re-validation of the entire solution or environment, just of the affected parts. Note that we use the term component here instead of referring directly to a package, since a package may rely on system or external resources that would need to be included in the re-validation assessment and scope.

Validation is also a process that is performed in every case where R and Python and their respective packages are used for data processing and analysis in a GCP context. With very few exceptions, it is not optional. If the risk assessment determines that validation is not necessary, that decision and its rationale are nonetheless documented as part of the “validation” process. Unfortunately, just claiming that validation is not necessary is not good enough. It may seem counter-intuitive, but the decision not to validate is still part of the validation process.

Let us assume that you have determined validation is required and, to keep this from becoming abstract, our focus will be on validating R and Python packages rather than keeping it general. If you want to generalise the context, replace package with the Thing, meaning the application, library, package, utility, macro or whatever it is that you need to validate.

The R Validation Hub white paper refers to the FDA’s Glossary of Computer System Software Development Terminology for a definition of validation, one of the many recognised definitions.

Establishing documented evidence which provides a high degree of assurance that a specific process will consistently produce a product meeting its predetermined specifications and quality attributes.

The third revision of ICH GCP (E6), in section 4.3.4(b) and supported by additional sections further clarifying responsibilities, roles, scope and expectations, also defines validation as

Validation should demonstrate that the system conforms to the established requirements for completeness, accuracy and reliability and that its performance is consistent with its intended purpose.

Instead of picking apart, comparing and reading into the many different definitions and their supporting clauses and sections, we can rely on a general phrase that is useful when discussing validation.

Validation is a process to document and demonstrate that the R | Python package works as expected and is fit for its intended purpose

There are a lot of moving parts in that phrase, so let us break it down, even if it may seem obvious. But before we do, note that the phrase above does not mention risk or a risk-based approach, as risk applies to the entire process and not to any particular part, as we will discuss later.

Intended Purpose

The phrase its intended purpose is much more complex and involved than it is usually given credit for. The most basic question is to what end, i.e. what is the purpose of the package and what will it be used for? Is it for data storage, derivation, part of an analysis method, report output, etc.? Is it a utility, framework or method package?

The question of how the package will be used is equally important, since derivations, analyses and reporting can be performed in many different ways, to name just one of many use cases. The how also includes which business processes, i.e. Standard Operating Procedures (SOPs), Work Instructions (WIs) and guidelines, are in play when the package is being used, to ensure that its use is applicable, that it is used appropriately and that the results produced by the package functionality are correct.

As Expected

The term as expected is simply a description of the intended function of the package. In classic waterfall development, that would be your Business, User and Functional requirements. In agile methodologies, it is most often your Epics and User Stories.

In other words, you cannot validate something if you do not know what it is intended to do, how it is intended to function given a specific set of inputs and, given those inputs, what the expected outcome is.

In both cases, it is important to note that the requirement or user story should be testable, meaning that, as defined, one or more tests can be constructed to demonstrate that the requirement or story is fulfilled, e.g. as implemented it either works or it does not.
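As an illustration, consider the hypothetical user story below; the function name and scenario are invented for the example, but they show why a testable story must pin down inputs and expected outputs:

    library(testthat)

    # Hypothetical story: "As a statistician, I need a function that converts
    # glucose lab values from mg/dL to mmol/L." The story is testable because
    # the input, the conversion factor and the expected output are defined.
    mgdl_to_mmoll <- function(x) x * 0.0555   # standard factor for glucose

    test_that("glucose conversion fulfils the user story", {
      expect_equal(mgdl_to_mmoll(100), 5.55, tolerance = 1e-8)
      expect_equal(mgdl_to_mmoll(0), 0)
    })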

This implies that requirements and user stories are a necessary input to validating an R or a Python package. It would be reasonable to expect that you can create a set of comprehensive, well-defined requirements and user stories for any R or Python package developed in-house.

However, most of the packages we use are developed elsewhere and sourced from, say, the Comprehensive R Archive Network (CRAN), Bioconductor or the Python Package Index (PyPI), just to name a few sources. It is fairly unlikely that the authors would provide, or that your organisation would reverse engineer, a suitable set of requirements and user stories for every function and computational method in the package that you intend to validate, especially given the number and depth of packages. In such a case, we may not have a sufficient set of requirements and user stories that define the expected function of the package and would have to rely on general descriptions and the stated purpose, which are often too broad to be testable.

That is a significant gap where a traditional view of software validation fails, since we expect that each test we perform has one or more related requirements, customarily documented through the traceability matrix. As we shall discover, using a risk-based approach and considering the risk of using a package alongside QC processes, this gap may no longer be applicable or, at least, no longer significant.

Testing and the Evidence

We previously noted that testing is an essential instrument in the validation process; its principal role is to document and demonstrate that R and Python packages work as you intend. It is important to note that there is an expectation that functions and computational methods within the package have been tested in such a way that you have coherent documented evidence of the result and how it compares to the expectation (requirements and stories).

Tests can be divided into two scopes that serve different purposes. The first scope comprises tests that demonstrate that the package, as installed, is accessible and functional, most often referred to as Operational Qualification (OQ). In the majority of cases, the tests provided by the developer(s) of a package can serve this role; they are often more comprehensive than simple OQ or functional tests, but may stop short of numerical or method validation.

Numerical or method validation, the second scope, comprises comprehensive tests that demonstrate that package functions and computational methods produce a set of pre-determined expected outputs given known inputs. For example, given inputs A and B, we expect X and not Y. This includes scenarios where the inputs are valid (positive testing) and where the inputs are expected to generate an error or an incorrect result (negative testing). These tests can take different forms depending on the functionality being tested, e.g. unit tests, integration tests, etc.
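A minimal sketch in R, using a base method purely as an example of the given-inputs-expect-X pattern:

    library(testthat)

    # Positive test: a known input must reproduce a pre-determined result
    test_that("t.test reproduces the expected estimate for a known input", {
      x <- c(5.1, 4.9, 5.0, 5.2, 4.8)
      res <- t.test(x, mu = 5)
      expect_equal(unname(res$estimate), 5.0, tolerance = 1e-8)
    })

    # Negative test: invalid input must fail in a defined way
    test_that("t.test errors on an empty input vector", {
      expect_error(t.test(numeric(0)))
    })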

However, as a generalised statement, the functional tests can be seen as representing the minimal set of tests expected when a package is submitted to CRAN or Bioconductor, and they are used to demonstrate that the package functions on the current released and development versions of the R language across different operating systems. At the moment, tests are not required for package submissions to PyPI.
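For reference, the same check that CRAN runs on submission can be reproduced locally; the sketch below assumes the devtools package is available:

    # R CMD check, wrapped by devtools, builds the package and executes the
    # bundled tests, examples and vignettes against the installed R version
    devtools::check()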

The collection of tests is also how we manage risk, more so with risk-based approaches to validation. In a classic software development approach, we assess the risk of one or more requirements and user stories and test appropriately.

We can equally use the same risk-based testing approach when considering the risk of using a package together with QC processes. More often than not, the tests provided with the package may be sufficient, as any residual risk is handled by QC.

Risk

The term risk or the phrase risk-based approach is quite often thrown about in validation circles, and we have already leaned on it here more than once, without really considering what it means in a practical sense.

The idea is not new, since good enough from the past is a form of justifiable risk, i.e. a risk-based approach. Risk matrices have also been in use for as long as many of us can remember. What is new is that risk matrices, or documenting risk as part of the validation, are no longer optional if the validation is based on a risk-based approach or process.

An example of risk is a switch that fails 7 times out of 2000 uses, where the likelihood of failure increases after it is pressed or flipped 800 times. If it is mission critical and you are in a spaceship far from home, you may want to consider a second or back-up switch. The point is that some risk is quantifiable from engineering and empirical testing.

A more applicable example is the risk that a package will produce an incorrect result given a specific input or type of input, such as an argument or input data, which is why we validate arguments or program defensively in standard functions for data analysis. But the failure rate, compared to that of a switch, can be difficult to quantify. Instead, we most often use subjective assessments of the perceived risk that the package produces an incorrect result.

A common and popular approach, using R as the example, is to use standard risk metric algorithms and assessments, such as riskmetric, risk.assessr and apps like riskassessment, to name an arbitrary few. It is a perfectly valid risk-based approach but, in a general sense, these risk factors and metrics treat the package as a finite entity and focus solely on the package, most often individually, and do not consider the wider context of how the package will be used in situ and what quality controls, if any, are applied.
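As a sketch of that approach, the riskmetric workflow published by the R Validation Hub scores a package on such quality factors; the exact assessments available depend on the riskmetric version installed:

    library(riskmetric)

    # Reference the package, run the available assessments and convert
    # them into standardised scores on a 0-1 scale
    pkg_ref("ggplot2") |>
      pkg_assess() |>
      pkg_score()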

Many of these algorithms are based on factors of quality, i.e. factors that indicate whether the package is well established and maintained. Some of these factors can be somewhat misleading, such as the number of downloads. For example, package A, downloaded 2 million times, may be considered more well established than package B with only 20,000 downloads. But what if the population using package A is 20 million users and the target population for package B is 60,000? One could argue that package B is more established than package A, as a larger proportion of its target user base uses it.
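The arithmetic behind that argument, as a quick sketch with the hypothetical figures above:

    # Adoption as a share of the target population, not raw downloads
    2e6 / 20e6    # package A: 0.10, i.e. 10% of its potential user base
    2e4 / 6e4     # package B: ~0.33, i.e. a third of its potential user base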

Another popular metric is test coverage, the proportion of primary functions and methods that have one or more tests associated with them. In fact, the metric represents the number, or proportion, of primary functions and methods that are called from within a test program or test scenario, the principle being that the larger the coverage, the better the testing of the package. However, the metric is based on counts, not quality. There are countless examples and discussions where a package has near-perfect coverage but the tests contribute almost nothing to assert that the functions and methods work as expected and produce the correct result, sometimes referred to as gaming the metric or gaming coverage.
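A short sketch of how coverage, as measured by a tool such as covr, can be gamed; the function here is hypothetical:

    library(testthat)

    summary_stats <- function(x) c(mean = mean(x), sd = sd(x))

    # This test executes every line of summary_stats(), so covr reports
    # full coverage, yet nothing asserts that the results are correct.
    test_that("summary_stats runs", {
      summary_stats(c(1, 2, 3))
      expect_true(TRUE)   # trivially passes; coverage without quality
    })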

But, with emphasis, the inputs into package functions and computational methods are very seldom static when we use packages to process and analyse data, more specifically biological data within the Life Sciences, Healthcare and Medicine. The challenge is compounded since the data is derived, and those derivations and transformations are seldom replicated identically across projects even if the analysis and context remain the same. It is therefore very unlikely that the tests provided with the package, or additional tests we add, will cover all use cases and scenarios. We can employ QC processes with quality checks and output validation to manage these risks, but in current praxis this is more often assumed than formally and explicitly included when assessing the risk of a package.

We can continue along that path and define the risk for a particular package as the risk that one or more of its functions and computational methods produces an incorrect result, or results in an error, that cannot be identified and corrected as part of the QC process, i.e. a QC-based risk model. This is, of course, the aim we identified at the outset for using validated systems, applications and components together with quality processes.

The purpose of validation then becomes to manage the risks that cannot be managed by the QC process, or to take care of the repetitive QC tasks that already are, or can with certainty be, standardised as part of testing during validation.

We have to acknowledge that there may remain risks that the combination of QC processes and validation cannot mitigate, just as with validation that does not consider QC. In such a case, the risk could potentially be considered acceptable. It may also clearly highlight that additional QC process steps are required when using a particular package for a particular data processing or analysis task. This is not unprecedented, since we regularly use tools that require additional mandated training and standard procedures.

The quality metrics discussed previously are still useful, as they provide additional input to refine the risk under the QC-risk model. For example, a package with a higher quality assessment may warrant a lower QC threshold than one with a lower quality assessment, since the perceived risk is higher for packages of lower quality.

There are two additional cases to consider. If no QC processes exist, or none are considered as part of the risk assessment, then all risk is managed by validation, which is customary for apps and for automated and unmonitored data processes. We then simply default to the validation approach most organisations already employ.

On the other hand, if we simply use the software validation approach, say for an app, then the QC-risk model would naturally identify the need for additional validation, or simply that quality control steps are necessary to verify and corroborate the results produced and provided through the app, hopefully resulting in a more robust process.

These two cases also highlight that the QC-risk model defaults to the software and application validation model when QC processes are not included as part of the risk assessment.

Putting the QC-risk Model Into Practice

The approach to incorporating the QC-risk model into R and Python package validation is surprisingly straightforward, simply because we do not have to change the validation process and tools.

An obvious change will be to the risk assessment for each individual package, as it now needs to explicitly refer to the following (a sketch of such a record follows the list):

  1. the quality control processes already in place;
  2. the risk that a package function or computational method would produce a result, or result in an error, that would not be identified and corrected by the quality control process; and
  3. any additional quality control steps needed for the package work products.
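A hypothetical, minimal record reflecting those three points could look as follows in R; all field names, the SOP reference and the risk wording are invented for illustration:

    assessment <- list(
      package       = "survival",
      version       = "3.5-8",
      category      = "Method",
      qc_processes  = c("double programming", "output review per SOP-XYZ"),
      residual_risk = "low: an incorrect estimate would be caught by double programming",
      additional_qc = "none identified"
    )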

The most significant change is actually in regard to the expected level of testing. Since fit for its intended purpose is now based on quality control measures and processes, works as expected becomes more of a functional reference than numerical validation (see Basic Validation Principles above).

We would also only need to demonstrate and document that packages in the Utility and Framework categories function as installed, given that the QC-risk model is based on the principle that a user is able, and expected, to correct an incorrect result or resolve an error. In the vast majority of cases, this is well covered by the tests provided with the package by the developer(s).

As a consequence, the packages that may need extensive testing are the Method packages that implement methods with limited or no user input. In practice, those are few and far between, since Method packages are mostly used when there is certainty that the method is appropriate and applicable and is expected to produce a correct, reproducible result. In common cases, resolving incorrect results or errors from functions and computational methods in Method packages simply entails correcting or updating the input data.

Not surprisingly, this level of expected testing is easily automated, including the production of documented evidence of the test results, which many organisations have already done to some degree.
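As one possible sketch, testthat can write results to a machine-readable report suitable for archiving as evidence; JunitReporter ships with testthat and requires the xml2 package:

    library(testthat)

    # Run the package test suite and persist the outcome as documented
    # evidence that can be attached to the validation record
    test_dir(
      "tests/testthat",
      reporter = JunitReporter$new(file = "validation-evidence.xml")
    )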

Before we conclude, there is one important note. Throughout this article, we have not once mentioned open source, simply because the principles apply equally to open-source and commercial software. The QC-risk model is in many ways already applied to SAS across Life Science organisations, so it is awesome that the principles for R and Python apply equally to SAS language environments and solutions without carving out exceptions or finding novel convoluted arguments.

The entire package validation process can now be standardised to the point that the customary package Validation Plan becomes a Standard Operating Procedure (SOP), the risk assessment, testing and documentation become a Work Instruction (WI), and the entire experience more closely resembles a request-to-sign-off workflow rather than the tedious, complex validation we have become so accustomed to. There should be no reason why such a validation process cannot be implemented to allow quarterly updates of R and Python environments, but that bag of tricks is another story in itself.


