Getting Started with DataFrames in Julia


Introduction

In our previous article we learned how to install Julia and run it in an IJulia Jupyter notebook. In this article we will create a notebook to explore a key component for working with large data sets: data frames. We will explore data frames using the well-known iris data set from R (another popular language for machine learning and statistics). iris is a collection of measurements for 3 different species of irises and is a useful data set for correlating a flower's measurements with its species. In our example, we will show you how to read this data into a data frame, then explore, alter, and feed the data into a machine learning algorithm to create a model for predicting the species of an iris.

In order to bring in the data set we need to import the RDatasets package. If you are running locally, you can bring this into your Julia environment by typing Pkg.add("RDatasets") at the Julia command prompt. After installing the package, we can import it and read in our iris data.
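A minimal sketch of the setup might look like this (the `dataset` function from RDatasets takes an R package name and a data set name):

```julia
using Pkg
Pkg.add("RDatasets")   # one-time installation

using RDatasets
iris = dataset("datasets", "iris")  # load the classic iris data as a DataFrame
```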

After executing the Jupyter cells, the program reads the iris data into a dataframe and assigns it to a variable named iris. From the output we can see that the dataframe is very much like a table or spreadsheet: it consists of rows and columns. Each row of our dataset represents an instance of data on a particular flower, and each column represents a measurement of the flower. For example, the first flower in the data frame has a sepal length of 5.1 cm. The last column in our data frame is the species of the flower. There are three species represented in the data set: Iris setosa, Iris virginica, and Iris versicolor. The data set appears to be ordered by species, but as you will soon see we can manipulate the data frame and order it however we want.

For example, if we wanted to sort the irises by sepal width, we could execute the following function on the dataframe which will sort it in-place:

sort!(iris, [:SepalWidth])

A few useful analysis functions for Data Frames

Before we get started manipulating the data, it's good to know a few exploratory functions for understanding the data in a dataframe. Executing describe on the data frame gives us some interesting overall statistics about the data:
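For example, a single call summarizes every column:

```julia
describe(iris)  # per-column min, mean, median, max, number unique, and missing counts
```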

For example, the average sepal length is 5.84333 cm, and it varies from the minimum value by only about a centimeter and a half. The maximum sepal length is almost twice the minimum. We can also tell from the describe output that we have only 3 unique species and no missing data at all.

A few other useful functions to help us get our bearings on the dataset are head and tail. The head function gives us the first 6 rows by default. tail gives us the last 6 rows.
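For example (note that newer versions of DataFrames replace `head` and `tail` with `first` and `last`):

```julia
head(iris)  # first 6 rows (on newer DataFrames versions: first(iris, 6))
tail(iris)  # last 6 rows  (on newer versions: last(iris, 6))
```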

We can also get some additional useful information out of functions like size and names. size gives us the dimensions of the frame, which is 150 rows of flowers and 5 columns of measurements (Sepal Length, Sepal Width, Petal Length, Petal Width, Species). names gives us the names of all the columns in an array.
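For example (the exact element type returned by `names` — symbols or strings — depends on the DataFrames version):

```julia
size(iris)   # (150, 5): 150 flowers, 5 columns
names(iris)  # the column names, e.g. SepalLength, SepalWidth, PetalLength, PetalWidth, Species
```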

Manipulating the Data Frame

Now that we know a little about our data frame, we want to be able to slice and dice it as we see fit. It turns out data frames are pretty simple to manipulate in Julia. Let's say we wanted to reduce the data set down to just sepal length and sepal width. We can do this simply with the following array indexing syntax:
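A sketch of that selection:

```julia
iris[:, [:SepalLength, :SepalWidth]]  # all rows, only the two sepal columns
```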

The first colon after the open bracket tells Julia to take all the rows. The [:SepalLength, :SepalWidth] tells it to take only the columns named SepalLength and SepalWidth. The above call produces a new data frame with a size of (150,2) instead of (150,5). If we want to use the new data frame, we can assign it to a variable. Note: the new data frame we created does not affect the original iris data set we loaded; function calls on dataframes return new copies unless we use a mutating function (indicated by a trailing !) that changes the data frame in place.

What if I wanted to pull out all the irises whose sepal length was greater than 5 cm? With the data frame framework, its pretty straightforward:
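Using the older column-access style the article uses elsewhere (newer DataFrames versions would write `iris.SepalLength`):

```julia
iris[iris[:SepalLength] .> 5, :]  # keep only the rows whose sepal length exceeds 5 cm
```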

iris[:SepalLength] .> 5 tells the framework to pull only the rows whose SepalLength is greater than 5. Note that you need the period before the greater-than symbol when filtering: the period tells Julia to do an elementwise comparison against the column. Without it, you will receive an error, because you would be comparing a whole column to a single number.

What if I wanted to know the sum of all the petal lengths and petal widths using data frames? The dataframe framework has a colwise function that lets us perform any function on a set of columns. For example, here is how I can tally up the sums of the petal length and petal width columns of the data frame:
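A sketch of that call (`colwise` is deprecated in newer DataFrames versions in favor of `mapcols`):

```julia
colwise(sum, iris[[:PetalLength, :PetalWidth]])  # sum each of the two columns
```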

I could even create my own column wise functions:
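A sketch of such a function (the name `mymaximum` is ours, to avoid shadowing Julia's built-in `maximum`):

```julia
# A hand-rolled maximum: reduce walks the array, always keeping the larger value
mymaximum(measurements) = reduce((a, b) -> (a > b) ? a : b, measurements)

# apply it column by column, as in the previous example
colwise(mymaximum, iris[[:PetalLength, :PetalWidth]])
```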

The above maximum function uses the built-in Julia reduce function to determine the maximum value in the measurements array.

The (a,b) -> (a > b) ? a : b expression passed to the reduce function is simply another way to write a function taking two parameters, a and b. The function checks whether a is greater than b; if so it returns a, otherwise it returns b. The reduce function applies this function cumulatively across the array, always keeping track of the largest value.

The colwise function then calls maximum on each of the arrays it passes into the second parameter (in this case the arrays are petal length, and petal width).

Here is a possibly useful function for colwise that builds on the previous example:
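A sketch of the normalize function described below:

```julia
# Normalize a column: divide every element by the column's maximum,
# so all values fall in (0, 1].
function normalize(measurements)
    maxval = reduce((a, b) -> (a > b) ? a : b, measurements)
    map(x -> x / maxval, measurements)
end

colwise(normalize, iris[[:PetalLength, :PetalWidth]])
```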

The reduce function still handles finding the maximum value as before, and the map function maps every element of the array to its value divided by the maximum. The normalize function thus scales every value in the array to 1.0 or less, in proportion to the maximum value of the array. Why is this useful? Normalization is a way to give every column in the data equal weight. Sometimes this helps prevent higher-valued attributes from biasing a particular learning algorithm.

Plot it Out

Nothing gives you the big picture like a graphical representation of the data. As the old saying goes, "a picture is worth a thousand words". Let's plot the iris columns against each other so we can get a visual feel for their overall correlation. We will use the Julia StatsPlots library, which is really convenient for plotting data frames.

The StatsPlots library has a correlation plot that will plot all the columns in the data frame against one another. This will tell us if there is any relationship between, for example, Petal Length and Petal Width. Or maybe there is a correlation between Petal Length and Sepal Length. Let's take a look:
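A sketch of the plot call (the `@df` macro from StatsPlots lets us refer to columns by name):

```julia
using StatsPlots

# plot every pair of measurement columns against each other
@df iris corrplot([:SepalLength :SepalWidth :PetalLength :PetalWidth])
```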

The corrplot (correlation plot) function displays the following: the lower left corner gives us scatter plots between the various combinations of columns, the diagonal provides histograms to give us a feel for how samples are distributed across the different features, and the upper right corner gives us a combined histogram/scatter plot indicating the density of how each flower's measurement pair correlates. We can take a closer look at an individual relationship by producing a scatter plot. For example, using the scatter function, we can plot SepalLength vs PetalLength.
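For instance (axis labels are our own additions):

```julia
@df iris scatter(:SepalLength, :PetalLength,
                 xlabel = "Sepal Length (cm)", ylabel = "Petal Length (cm)")
```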

Although the correlation between the two attributes is not perfectly linear, one can see that there is some relationship between SepalLength and PetalLength: as the sepal length increases, the petal length tends to increase as well.

Machine Learning a Random Forest

We got a head start working our way around data frames, and now it's time to make them useful! Given our current dataset of iris measurements, it would be nice if, when someone handed us a pretty iris flower, we could predict its species from the previous data. In other words, we want to build a mathematical model based on the data that will allow us to predict the species of a fresh iris handed to us, based on its measurements. Because we are classifying a flower based on the measurements of its features, we need a classification algorithm to help us predict the new iris's species. For this particular set of data, we will choose the Random Forest classification algorithm to spin up our model.

What is a Random Forest? Sounds like a place someone might get lost when walking through the woods, but it's actually a set of decision trees. Random Forest is in a class of algorithms called supervised learning. Supervised learning means that we know the output we are trying to achieve based on the input we are given. We use the input, output pairing to teach our model. In the case of the iris, the input/output pairing would be the

input: (Sepal Width, Sepal Length, Petal Width, Petal Length)

and

output: The Species of the Iris (either Iris setosa, Iris virginica, or Iris versicolor)

Once we have created the model, we feed the new iris data through the model and have it classify it as one of the possible output species.

The Random Forest is built by creating a multitude of decision trees from the training data. Each tree is built up from a set of rules that work their way through each branch of the tree and eventually point to a species at the end of the decision path. When we merge the decisions of all the trees (for example, by taking a majority vote across the trees' decisions), we come up with a final classification for our output, i.e. the determined species. By using many trees and randomly selecting the data each tree sees, the algorithm reduces variance and avoids overfitting. BTW, you can find a really good explanation of decision trees in this video by Brandon Rohrer.

Now let's get started by implementing the RandomForestClassifier in Julia. We will take advantage of a few different libraries here. We will use the DecisionTree package (with its ScikitLearn-style API) for building our model and evaluating its results, and we will use a package called MLDataUtils, which will allow us to split our iris data into training data, for training the model, and test data, for testing its predictive power.
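The setup might look like this (assuming the packages are registered under these names; MLBase is used later for evaluation):

```julia
using Pkg
Pkg.add(["DecisionTree", "MLDataUtils", "MLBase"])  # one-time installation

using DecisionTree, MLDataUtils, MLBase
```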

Preparing our data for learning

Before we start with our machine training, we need to get the data in a form we can use for the Random Forest algorithm. Currently the species is a string, and we need to turn it into an index representing one of the 3 unique species. We'll create an extra column on our dataframe mapping the species to a number (1 - 3).
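A sketch of how that mapping might look (the column name :SpeciesEnumerator comes from the article; the helper variable names are ours):

```julia
# the distinct species names, in order of first appearance
specieslist = unique(iris[:Species])

# map each species name (key) to an index 1-3 (value)
speciesdict = Dict(s => i for (i, s) in enumerate(specieslist))

# add the numeric column to the data frame
iris[:SpeciesEnumerator] = [speciesdict[s] for s in iris[:Species]]
```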

The code above determines the unique species in the iris data frame. Then it creates a dictionary to map the species name (as a key) to the species index. Finally it creates a new column in the data frame :SpeciesEnumerator that maps the current species to its corresponding index in the dictionary. So now we can work with numbers as outputs to our Random Forest model for training! Below we can see the new column added to our data set corresponding to the Species.

Next we will pull out the input data features out of the dataframe into a 2d array called x, and the output data into an array called y:
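A sketch of that extraction (older DataFrames versions support `convert(Array, df)`; newer versions use `Matrix(df)` instead):

```julia
featurecols = [:SepalLength, :SepalWidth, :PetalLength, :PetalWidth]
x = convert(Array, iris[featurecols])        # 150x4 matrix of inputs
y = convert(Array, iris[:SpeciesEnumerator]) # 150-element vector of outputs
```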

Now we need to be able to shuffle and split the data into training and test data sets. The reason we need to shuffle is that the data is currently sorted by species; we want to make sure the training data is random enough that it will include a variety of species when we split it. The MLDataUtils library gives us what we need to perform these tasks. Note that we need to transpose the x data to align properly with the output y data for this package; we can transpose the x data back once we've finished the shuffle and split.
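A sketch of the shuffle and split, assuming the MLDataUtils convention that observations live in columns:

```julia
# shuffle x and y together so the rows stay aligned
xs, ys = shuffleobs((transpose(x), y))

# 2/3 training data, 1/3 test data
(x_train, y_train), (x_test, y_test) = splitobs((xs, ys); at = 0.67)

# transpose back so rows are observations again
x_train = transpose(x_train); x_test = transpose(x_test)
```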

The code above uses two functions from the MLDataUtils package that help us shuffle and split our data: shuffleobs and splitobs. shuffleobs shuffles the x and y data together in a random order, so that the x and y rows still align. splitobs splits the data according to the at parameter, set to 0.67. This makes two thirds of the data training data and one third test data.

For a sanity check, we used println to take a look at the first few rows of training data after the shuffle and split. We can see from the first 4 output rows of the training x data and training y output that the data is sufficiently shuffled. So now we have the data from iris prepared to use in our Random Forest algorithm. We have the training data (x_train, y_train) to train the Random Forest Classifier model and we have the test data (x_test, y_test) to use to test the accuracy of our model.

Training the Data

Let's take advantage of the DecisionTree package to train the data we have prepared. Using the RandomForestClassifier, we can fit the data to our model. The first step is to pick the tuning parameters; how these are currently set seems to produce pretty good results. This classifier is set to 50 trees, each with a max depth of 4, but feel free to play around with the parameters by increasing or decreasing them. In theory, the more trees you pick, the lower the variance, but the longer training will take. Using the fit! function from the DecisionTree library, we can fit the model to our training data. Just a few lines of Julia code, and you've fitted a very effective machine learning classification model!
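A sketch of the training step, using DecisionTree's ScikitLearn-style API (note that fit! may require x_train as a plain matrix, e.g. `collect(x_train)`):

```julia
using DecisionTree

# 50 trees, each limited to a depth of 4
model = RandomForestClassifier(n_trees = 50, max_depth = 4)
fit!(model, x_train, y_train)
```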

Let's not get too excited yet; it's time to see how well the model works on the test data. We can use the predict function in the DecisionTree library and run it on our test data. Then we can compare the predicted data against the actual output data (y_test) to see how well we did, using the error rate function contained in the MLBase package:
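A sketch of the evaluation:

```julia
predictions = predict(model, x_test)

using MLBase
errorrate(y_test, predictions)  # fraction of test samples the model got wrong
```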

Our error rate looks pretty good: 0.06 is the fraction of samples that the model got wrong out of all the samples in the test set. Something called a confusion matrix can give us a little more information about the correctness of the prediction results:
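MLBase provides this as well (the first argument is the number of classes):

```julia
confusmat(3, y_test, predictions)  # 3x3 matrix: rows are true species, columns are predicted
```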

The values along the diagonal are all correctly predicted values, but the 3 in the third row tells us that the model incorrectly predicted 3 of the rows in the y_test data as species 2 instead of species 3. The matrix agrees with our error rate: out of the 50 samples in the confusion matrix, 3 flowers were misclassified (3/50 = 0.06).

Conclusion

Machine Learning is quickly becoming recognized as a means for making extremely accurate predictions on certain problem sets that contain rich data. Imagine, classification can be applied to genome data to detect disease, insurance data to predict fraud, or forensic data to catch a criminal. You could even use it to classify buy or sell categories for instruments on the stock market. The possibilities are endless and with the help of frameworks like Julia, this is becoming an easier task to implement on the data with a wide array of machine learning algorithms to choose from. The trick is cleaning up the data to ready it for these algorithms, and choosing the correct algorithm(s) for the problem at hand.


