The Geometric Mean

Christopher Smith

Published Jan 11, 2019

The Beginning

I was recently tasked with creating an output for a clinical trial report that involved a geometric mean calculation. I thought for sure, having a mathematics and statistics background, that certainly I would have acquired this knowledge and that I simply didn't recall it. Having hung onto my college textbooks all these years, the search began. Scanning the index of these books, I encountered Geometric Series, Geometric Distribution, Geometric Random Variable, Geometric Probability Function, but no Geometric Mean. Down to the last book, it was a supplemental book from a Mathematical Statistical course called Stat Labs: Mathematical Statistics Through Applications by Deborah Nolan and Terry Speed, and there it was. A nice crisp definition, that couldn't be questioned like that of a Google search or Wikipedia. It stated,

It turns out that capitol pi symbol is not something from a language of an alien species, but it is telling us to multiply all the numbers together. So, the geometric mean of 2, 4, 6 is the cube root of 2*4*6 or ~3.63. Also, with a little elbow grease and dusting off of some high school algebra, we can use logarithm and exponent properties to show

Note that, this now requires the extra condition that our original values x > 0 since log 0 is undefined. Also note, that log is the natural log (base e).

Now, let's go on to talk about how to do this as a SAS programmer.

The Middle

DATA STEP

Unbeknown to me, SAS actually has a GEOMEAN() function. But, in the spirit of overkill here are 4 ways to calculate the geometric mean in SAS:

The X1 variable uses the original product definition. Notice that, GEOMEAN() used for X2 variable has the advantage of only considering non-missing values. Another interesting function, is the CONSTANT() function used by X3. Other constants this function can handle are the glorious pi value (3.14159) and Euler-Mascheroni constant (0.577), but I digress. Variables X3 and X4 use the equivalent alternative definition containing the log transformed values.

These methods of calculating the geometric mean are great and all, but data observations come in rows not variables (or columns). Without awkwardly transposing the data simply to apply the GEOMEAN() function, we are left to seek other solutions. I thought perhaps, the SQL procedure would let us use GEOMEAN as an aggregate function, but that was met with high resistance:

LET'S GET MEAN

Since we have an alternative definition of the geometric mean, the best approach in my opinion, is to

Apply the log transformation in a data step,
Calculate the arithmetic mean using the MEANS procedure
Exponentiate the mean using the EXP() function in a subsequent data step.

And, there we have it, all that's left is some extra data manipulation to incorporate this into the main report. However, the fun stuff is over, so let's wrap this up.

The End

Geometric mean is typically calculated when dealing with pharmacokinetic (PK) concentrations of a drug. However, there is usually a Lower Limit of Quantitation (LLOQ) where the drug concentration can not be detected. As it relates to the geometric mean calculation, it is good that we can't have negative numbers in this scenario since this violates the definition of a geometric mean. However, I am curious about the handling of these Below Quantitation Limit (BQL) results. What is the best way to handle those cases? If we handle them as being 0, this automatically makes the geometric mean 0 based on the original product definition. Is it better to treat them as missing or impute values as LLOQ divided by 2? If you are a statistician in the pharmaceutical industry, I'd love to hear from you in the comments below. Do these methods introduce any bias?

To view or add a comment, sign in

The Geometric Mean

Christopher Smith

The Beginning

The Middle

DATA STEP

LET'S GET MEAN

The End

More articles by Christopher Smith

Others also viewed

Using Machine Learning Model in R(Rice identification)

Check Regional Information via Coarsen

A Reference Notebook for (+30) Statistical Charts in Seaborn

Unveiling Price Predictions with Airplane Data

DAX ANALYSIS PART 2: HANDLING MISSING DATA AND TIME SERIES VISUALIZATION

Data manipulation with Pandas is beautiful but how NumPy can improve your algorithm? One small example.

Basics of Data Imputation :-

Set theory and SQL tables

R and Tidyverse 1.3.0:-

How to “apply” your Panda(s)

Explore content categories

The Beginning

The Middle

DATA STEP

LET'S GET MEAN

The End

More articles by Christopher Smith

Finding Hope

Marathon Man

Don’t Snooze Your Retirement Savings

Casino Games: A Losing Bet

Debt: It’s Personal

Good Programming Practices, Writing Clean Code, and Winning Over Your Teammates

The Kaplan-Meier Survival Time Percentiles

Partial End Date Imputation with Leap Year Considerations

Others also viewed

Using Machine Learning Model in R(Rice identification)

Check Regional Information via Coarsen

A Reference Notebook for (+30) Statistical Charts in Seaborn

Unveiling Price Predictions with Airplane Data

DAX ANALYSIS PART 2: HANDLING MISSING DATA AND TIME SERIES VISUALIZATION

Data manipulation with Pandas is beautiful but how NumPy can improve your algorithm? One small example.

Basics of Data Imputation :-

Set theory and SQL tables

R and Tidyverse 1.3.0:-

How to “apply” your Panda(s)

Explore content categories