Designed inferences
Starting blocks
Anyway, we gave this design to the algorithm, which produced correctly the analysis of variance, and in one case one of the interactions came out in a higher stratum of error than the main effects that contributed to it. If you had asked me previously, I would have said that this would never happen, but that was what the algorithm produced and when we looked at it, it was entirely right. We regarded that as something of a triumph.(1)
This is John Nelder in conversation describing the development of his algorithm for analysing designed experiments that was eventually incorporated into Genstat. As part of my consulting work I recently came across an example of a similar phenomenon, which I thought might be interesting to share. It illustrates some important aspects of design and analysis and relates to components of variation, a sadly neglected topic.
Facts and factorials
The client wished to compare the effect of eight treatments on a within-subject basis. However, each treatment itself could be described in terms of the combination of three factors given simultaneously, which I shall call F1, F2, F3, each of which they would give at two levels, Low (L) or High (H). In what follows I shall use labels such as, for example, LHL to mean F1=L, F2=H, F3=L.
There was a limit on the number of periods for which subjects could be treated, and the scientists decided to use two periods only and to compare each treatment directly only with what they called its reciprocal, formed by exchanging each high level with a low level and vice versa. For example, they would compare HHH with LLL and HLH with LHL.
This means that the treatments could be regarded as being compared in four separate AB/BA cross-over trials each using two sequences of the treatments. The proposed design is shown in Figure 1 below.
Figure 1. Sequences for a possible clinical trial
Note that Trial is just a label to correspond to a conceptual grouping of the sequences and that what is shown is just the minimal rational design, which would, in practice, be replicated many times. The sequences can be read off by picking a given subject and looking at the treatment given in periods 1 and 2. For each subject given a pair of treatments in one order, another is given the same pair of treatments in the reverse order. Here the standard ordering of subjects is shown but in practice they would be randomised in equal numbers to one of the eight sequences (four trials, each with two sequences).
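As a rough illustration of how such a minimal design can be written down, here is a small Python sketch (not the Genstat code used for the figures). It pairs each treatment with its reciprocal and gives each pair in both orders. The ordering of trials and subjects, and the names `reciprocal`, `pairs` and `design`, are my own assumptions and need not match Figure 1.

```python
from itertools import product

def reciprocal(treatment):
    """Swap every High level for Low and vice versa, e.g. HLH -> LHL."""
    return "".join("L" if level == "H" else "H" for level in treatment)

# All eight combinations of three two-level factors: LLL, LLH, ..., HHH.
treatments = ["".join(levels) for levels in product("LH", repeat=3)]

# Each unordered pair {treatment, reciprocal} defines one AB/BA "trial".
pairs = sorted({tuple(sorted((t, reciprocal(t)))) for t in treatments})

# Minimal design: one subject per sequence, each pair given in both orders.
# (Assumed ordering; in practice subjects would be randomised to sequences.)
design = []
subject = 0
for trial, (a, b) in enumerate(pairs, start=1):
    for period1, period2 in ((a, b), (b, a)):
        subject += 1
        design.append(
            {"trial": trial, "subject": subject,
             "period1": period1, "period2": period2}
        )

for row in design:
    print(row["trial"], row["subject"], row["period1"], row["period2"])
```

Each treatment appears exactly once in each period, so period and treatment are balanced within every trial.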
Statisticians don't always know better
They just think they do. My first thought was that this was not necessarily such a good design. With eight possible treatments there are (8 x 7)/2 = 28 possible pairwise comparisons, and here only 4 of them (each treatment against its reciprocal) can be made on a within-subject basis. Might it be possible to run a design with all 28 x 2 = 56 possible sequences?
It then occurred to me, however, that I should check the design formally using Genstat with its unique approach to block and treatment structure. This requires you to declare the block structure, that is to say the structure that describes how the experimental material varies, and then the treatment structure, that is to say the structure of what you intend to apply to the experimental material. Figure 2 shows the code and the result. (Note that in Genstat, which uses the Wilkinson and Rogers(2) notation, A*B*C corresponds to A+B+C+A.B+A.C+B.C+A.B.C and . is used to denote an interaction.)
Figure 2. Dummy analysis of variance of the design using Genstat. Code in three lines at the top, followed by output.
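For readers unfamiliar with the notation, the expansion of A*B*C mentioned above can be mimicked in a few lines of Python. This is only a sketch of the fully crossed case; it is not Genstat code, it does not implement the rest of the Wilkinson and Rogers algebra (nesting, for instance), and the function name `expand_crossed` is hypothetical.

```python
from itertools import combinations

def expand_crossed(*factors):
    """Expand a fully crossed formula such as A*B*C into its terms:
    all main effects, then all two-way interactions, and so on."""
    terms = []
    for size in range(1, len(factors) + 1):
        for combo in combinations(factors, size):
            terms.append(".".join(combo))
    return terms

print(" + ".join(expand_crossed("A", "B", "C")))
# A + B + C + A.B + A.C + B.C + A.B.C
```

Applied to F1, F2 and F3, the same function lists the seven treatment terms whose strata the dummy analysis reports.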
Genstat has taken my data, noted what I have designated as the block structure (Subject + Period) and what I have designated as the treatment structure (F1*F2*F3), and then told me to what part of the block structure the variation in treatment structure must be compared. Here the two strata describing block variation are between subject and the subject-by-period interaction, which, in the context of a cross-over design, would be referred to as the within-subject error. Note that I have not had to provide it with any outcome data for it to do this.
The message is that the main effects F1, F2 and F3 can be estimated efficiently on a within-subject basis and so, rather surprisingly, can the three-way interaction F1.F2.F3. However, the two-way interactions cannot. They must be estimated on a between-subject basis.
Contrasting argument
Another way of looking at this is to take the weights for the contrasts that are involved for the various main effects and interactions and check whether they sum to zero over a given subject or not. If they sum to zero, the effect can be estimated on a purely within-subject basis and if they don't, it cannot. The situation is shown in Figure 3 below.
Figure 3. Weights for main effect and interaction contrasts for the proposed cross-over design
The top half of the figure shows the weights for the contrasts. (Or, strictly speaking, the values are proportional to the weights for the contrasts.) Once they have been established for the main effects, they can then be obtained for interactions by multiplying the relevant columns. Thus, for example, the weights for F13 are obtained by multiplying the weights for F1 by the weights for F3. The weights for F123 are obtained by multiplying the weights for F1 by those for F2 and then by those for F3.
The table in the lower half of the figure gives the sum of the weights by subject. Those for F1, F2, F3 and F123 sum to zero. Those for F12, F13 and F23 sum either to -2 or 2.
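The contrast argument is easy to check by machine. The following Python sketch is my own illustration, not the calculation behind Figure 3. It assumes the coding High = +1 and Low = -1, one subject per sequence, and an arbitrary ordering of subjects. Interaction weights are formed by multiplying the main-effect weights, and the weights are then summed over each subject's two periods.

```python
from itertools import product

def level_weight(level):
    """Code a factor level as +1 (High) or -1 (Low); an assumed coding."""
    return 1 if level == "H" else -1

def reciprocal(treatment):
    """Swap every High level for Low and vice versa."""
    return "".join("L" if level == "H" else "H" for level in treatment)

def contrast_weights(treatment):
    """Weights for main effects, and for interactions as products of them."""
    f1, f2, f3 = (level_weight(level) for level in treatment)
    return {
        "F1": f1, "F2": f2, "F3": f3,
        "F12": f1 * f2, "F13": f1 * f3, "F23": f2 * f3,
        "F123": f1 * f2 * f3,
    }

# One subject per sequence: each treatment followed by its reciprocal.
treatments = ["".join(levels) for levels in product("HL", repeat=3)]
sequences = [(t, reciprocal(t)) for t in treatments]

# Sum the weights over the two periods for each subject.
subject_sums = []
for period1, period2 in sequences:
    w1, w2 = contrast_weights(period1), contrast_weights(period2)
    subject_sums.append({effect: w1[effect] + w2[effect] for effect in w1})

effects = list(subject_sums[0])
print("sequence  " + " ".join(f"{e:>5}" for e in effects))
for (p1, p2), sums in zip(sequences, subject_sums):
    print(f"{p1},{p2}   " + " ".join(f"{sums[e]:>5}" for e in effects))
```

The reason is simple: the reciprocal flips the sign of every main-effect weight, so any product of an odd number of them changes sign between periods and cancels within a subject, while a product of two keeps its sign and does not.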
The bottom line
The bottom line is that the design is quite a good one for estimating the main effects of the three constituent factors. Rather surprisingly, it is also good at estimating the three-way interaction, although this is unlikely to be of direct interest in itself. However, for the purpose of estimating the two-way interactions it is rather weak, and this in turn implies that, with the exception of the direct comparisons of each of the eight treatments with its reciprocal, it will be poor for estimating the overall effect of treatments, since the effect of a treatment is the sum of the main effects and interactions involved. Genstat warns me of this too, since if I fit treatment as a whole I get the output in Figure 4.
Figure 4. Result of applying the block structure treatment structure approach to treatment as a whole.
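To see why only the reciprocal comparisons escape the problem, one can list which effects enter the difference between any two treatments. The sketch below is again my own Python illustration under the assumed +1/-1 coding, in which a treatment's expected response is taken to be a sum of weighted main effects and interactions. An effect contributes to the difference between two treatments only if its weight differs between them. The names `EFFECTS` and `effects_in_difference` are hypothetical.

```python
from itertools import combinations, product

# Each effect is identified by the positions of the factors it involves.
EFFECTS = {
    "F1": (0,), "F2": (1,), "F3": (2,),
    "F12": (0, 1), "F13": (0, 2), "F23": (1, 2),
    "F123": (0, 1, 2),
}
TWO_WAY = {"F12", "F13", "F23"}

def weight(treatment, positions):
    """Product of the +1 (High) / -1 (Low) codes of the factors involved."""
    w = 1
    for i in positions:
        w *= 1 if treatment[i] == "H" else -1
    return w

def reciprocal(treatment):
    """Swap every High level for Low and vice versa."""
    return "".join("L" if level == "H" else "H" for level in treatment)

def effects_in_difference(a, b):
    """Effects whose weights differ between treatments a and b."""
    return {name for name, pos in EFFECTS.items()
            if weight(a, pos) != weight(b, pos)}

treatments = ["".join(levels) for levels in product("HL", repeat=3)]
reciprocal_pairs, other_pairs = [], []
for a, b in combinations(treatments, 2):
    target = reciprocal_pairs if b == reciprocal(a) else other_pairs
    target.append((a, b))

for a, b in reciprocal_pairs:
    print(a, "vs", b, sorted(effects_in_difference(a, b)))
for a, b in other_pairs:
    print(a, "vs", b, sorted(effects_in_difference(a, b) & TWO_WAY))
```

Under this coding every reciprocal comparison involves only F1, F2, F3 and F123, all of which are within-subject effects, whereas every other pairwise comparison involves at least one two-way interaction.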
General lessons
The first lesson is that it is valuable to have a principled way of evaluating designs in terms of block structure and treatment structure. This is what Genstat provides and I am not aware that any other package does. Of course, by thinking about the weights and finding them, as I later did, I could have got there myself but that requires greater thought and insight and may be even more difficult in more complex cases.
The second lesson is that estimating standard errors appropriately can be difficult. Complex designed experiments bring this home to us but just because your data are not from a designed experiment does not mean that issues like this are not relevant. This has implications for the analysis of epidemiological studies and the use of historical data(3,4) and also for causal analysis. See To Infinity and Beyond for a discussion.
The third lesson is that close communication and collaboration between consulting statistician and life scientist client may be necessary to get the best out of designs and that, usually, both will have something valuable to learn.
References
1. Senn SJ. A conversation with John Nelder. Statistical Science. 2003;18(1):118-131.
2. Wilkinson G, Rogers C. Symbolic description of factorial models for analysis of variance. Applied Statistics. 1973;22(3):392-399.
3. Collignon O, Schritz A, Senn SJ, Spezia R. Clustered allocation as a way of understanding historical controls: Components of variation and regulatory considerations. Statistical Methods in Medical Research. 2020; 29(7):1960-1971.
4. Collignon O, Schritz A, Spezia R, Senn SJ. Implementing historical controls in oncology trials. The Oncologist. 2021.
Nice example! We have GENSTAT available for these sorts of questions. The "modern" approach of throwing data (simulated or real) at a Mixed Effects Model seems to discourage proper thought...
Thought provoking 😊