Latin Hypercube Sampling
In my previous post “Bayesian Optimization”, I demonstrated the optimization procedure based on Bayesian method. However, Bayesian optimization has an issue of determining the initial samples in order to kick off the optimization process, and we call this issue “code start”.
There are several ways to solve the cold start problem, and mostly often used methods include random search and Latin Hypercube sampling. Some research shows that Latin Hypercube sampling performs better than random search and SAS Visual Data Mining and Machine Learning uses Latin Hypercube sampling to generate the initial set of hyperparameters configurations. In this article, I will show you how to implement the Latin Hypercube sampling method with SAS/IML.
Revisit the Algorithm
The algorithm I used is from the paper Bayes Factors for Variance Components in the Mixed Linear Model, but in my implementation, I extended the algorithm to support any-dimensional data’s sampling, and the value space of hyperparameters does not limit to unit space.
To illustrate the algorithm, here we use a two-dimensional data’s sampling as an example.
In order to obtain a Latin hypercube sample of size T from the two-dimensional space, we devise the following algorithm:
1. Divide the interval of each dimension into T equally spaced subintervals.
2. Randomly sample two points Eta and Nu within each subinterval. Call them Eta(i) and Nu(i) when they are sampled from the interval(i), that is from [(i-1)/T, i/T).
3. At this point we have Eta(1), ..., Eta(T) and Nu(1),..., Nu(T).
Let i = 1 and consider Eta(i).
Pick an integer j at random such that 1 <= j <= T and let x(i) = (Eta(i), Nu(j)) be the first point in the Latin hypercube sample.
Now, refrain j from further consideration.
Increase i by 1 and repeat step 3 until all the Eta(i) are considered.
This gives us a Latin hypercube sample that we label as x(1),... ,X(T).
SAS Implementation
%macro latinHyperCubeSamp;
start latinHyperCubeSamp(seed=0, sampleSize, designTable, resultTable);
/* read design table */
use designTable;
read all var {minValue maxValue} into designTable;
read all var {parameterName} into varNames;
close designTable;
varNames = T(varNames);
/* sampling */
dimension = nrow(designTable);
Eta =j(dimension, sampleSize, .);
do d=1 to dimension;
minValue = designTable[d,1];
maxValue = designTable[d,2];
intervals = do(minValue, maxValue, (maxValue-minValue)/sampleSize);
do i=1 to sampleSize;
Eta[d, i]= intervals[1,i] + uniform(seed)*(maxValue-minValue)/sampleSize;
end;
end;
lhs = j(sampleSize, dimension, .);
mattrib lhs colname=varNames;
selectedFlag = j(dimension, sampleSize, 0);
do i=1 to sampleSize;
lhs[i, 1] = Eta[1, i];
do d=2 to dimension;
notSelected = loc(selectedFlag[d,] = 0);
u = uniform(seed);
selection = ceil(ncol(notSelected)*u);
j = notSelected[1, selection];
selectedFlag[d, j] = 1; /* refrain j from further consideration */
Nu = Eta[d, j];
lhs[i, d] = Nu;
end;
end;
tbl = TableCreate(varNames, lhs);
call TableWriteToDataSet(tbl, resultTable);
finish;
%mend latinHyperCubeSamp;
Instructions on the LHS function usage
1) Create a design table to specify the name, minimum and maximum of each hyperparameter.
In this example table, there are 5 hyperparameters. Column minValue saves minimum and column maxValue saves maximum.
2) Add the following two lines to your IML Proc code.
%latinHyperCubeSamp;
run latinHyperCubeSamp(0, 20, designTable, "lhsExampleResult1");
The function latinHyperCubeSamp has four parameters.
latinHyperCubeSamp(seed=0, sampleSize, designTable, resultTable);
The first parameter is random seed and its default value is 0, which uses the clock time as the seed.
The second parameter is the size of the sample.
The third parameter is the table name of your design table.
The last parameter is the table name where to save the sampling result.
Two Examples
Example 1 - Draw sample from two parameters.
data designTable;
input parameterName :$32. minValue maxValue ;
cards;
x1 0 1
x2 -2 -1
;
run;
proc iml;
%latinHyperCubeSamp;
run latinHyperCubeSamp(0, 20, designTable, "lhsExampleResult1");
run;
quit;
To check the sampling result, I created a plot to show all sample points. There are 20 sample points displayed as dots, and each axis draws 20 grid lines to show its subintervals. From the figure, we can see that each subinterval has one sample point only.
Example 2 - Draw sample from five parameters.
data designTable;
input parameterName :$32. minValue maxValue ;
cards;
parm1 0 1
parm2 -2 -1
parm3 10 15
parm4 0.1 0.3
parm5 100 200
;
run;
proc iml;
%latinHyperCubeSamp;
run latinHyperCubeSamp(0, 10, designTable, "lhsExampleResult2");
run;
quit;
The full code is available at GitHub.