The Geometric Mean
The Beginning
I was recently tasked with creating an output for a clinical trial report that involved a geometric mean calculation. I thought for sure, having a mathematics and statistics background, that certainly I would have acquired this knowledge and that I simply didn't recall it. Having hung onto my college textbooks all these years, the search began. Scanning the index of these books, I encountered Geometric Series, Geometric Distribution, Geometric Random Variable, Geometric Probability Function, but no Geometric Mean. Down to the last book, it was a supplemental book from a Mathematical Statistical course called Stat Labs: Mathematical Statistics Through Applications by Deborah Nolan and Terry Speed, and there it was. A nice crisp definition, that couldn't be questioned like that of a Google search or Wikipedia. It stated,
It turns out that capitol pi symbol is not something from a language of an alien species, but it is telling us to multiply all the numbers together. So, the geometric mean of 2, 4, 6 is the cube root of 2*4*6 or ~3.63. Also, with a little elbow grease and dusting off of some high school algebra, we can use logarithm and exponent properties to show
Note that, this now requires the extra condition that our original values x > 0 since log 0 is undefined. Also note, that log is the natural log (base e).
Now, let's go on to talk about how to do this as a SAS programmer.
The Middle
DATA STEP
Unbeknown to me, SAS actually has a GEOMEAN() function. But, in the spirit of overkill here are 4 ways to calculate the geometric mean in SAS:
The X1 variable uses the original product definition. Notice that, GEOMEAN() used for X2 variable has the advantage of only considering non-missing values. Another interesting function, is the CONSTANT() function used by X3. Other constants this function can handle are the glorious pi value (3.14159) and Euler-Mascheroni constant (0.577), but I digress. Variables X3 and X4 use the equivalent alternative definition containing the log transformed values.
These methods of calculating the geometric mean are great and all, but data observations come in rows not variables (or columns). Without awkwardly transposing the data simply to apply the GEOMEAN() function, we are left to seek other solutions. I thought perhaps, the SQL procedure would let us use GEOMEAN as an aggregate function, but that was met with high resistance:
LET'S GET MEAN
Since we have an alternative definition of the geometric mean, the best approach in my opinion, is to
- Apply the log transformation in a data step,
- Calculate the arithmetic mean using the MEANS procedure
- Exponentiate the mean using the EXP() function in a subsequent data step.
And, there we have it, all that's left is some extra data manipulation to incorporate this into the main report. However, the fun stuff is over, so let's wrap this up.
The End
Geometric mean is typically calculated when dealing with pharmacokinetic (PK) concentrations of a drug. However, there is usually a Lower Limit of Quantitation (LLOQ) where the drug concentration can not be detected. As it relates to the geometric mean calculation, it is good that we can't have negative numbers in this scenario since this violates the definition of a geometric mean. However, I am curious about the handling of these Below Quantitation Limit (BQL) results. What is the best way to handle those cases? If we handle them as being 0, this automatically makes the geometric mean 0 based on the original product definition. Is it better to treat them as missing or impute values as LLOQ divided by 2? If you are a statistician in the pharmaceutical industry, I'd love to hear from you in the comments below. Do these methods introduce any bias?