Using Deep Learning to Predict One’s Future Occupation Based on Early Life Experiences
(full code on Github)
Many of the most influential people were child prodigies (Mozart started composing when he was 5; Tesla helped his mother invent small household appliances), while other geniuses like Einstein struggled with school. Can nature and (early) nurture really determine one’s destiny? We are here to find out by examining the relationship between early life experiences (family, childhood) and future occupation. Specifically, we want to learn this relationship with regard to the most influential people in history. The occupations considered are:
- Artist
- Athlete
- Author
- Businessmen (and women)
- Entertainer
- Politician
- Scientist
This is not to say that other occupations are not important – some are defined too ambiguously for machine learning purposes (e.g. “social”), some do not have enough data available, and some are assigned at birth (e.g. royalty).
For those reading this post as a tutorial, the basics of Recurrent Neural Networks (RNN), Convolutional Neural Networks (CNN), and Long Short Term Memory (LSTM) architecture are skimmed over. For details on how they work, this entertaining and educational blog on RNN by Andrej Karpathy, this overview of CNN written by Stanford, and Colah’s blog on LSTM are great resources and a lot of fun to read!
Data
I needed detailed descriptions of people’s early life experiences (family, childhood, school, etc). And in order to perform deep learning, I needed a lot of it, preferably in somewhat standardized formats.
Crawling Data
Luckily, Wikipedia has decently organized biography pages for famous people, often populated with at least one section on “early life”, “family life” or “childhood”. I crawled names, then extracted the early-life texts using the MediaWiki API and Python’s beautifulsoup4 package.
- Names of the mega-influencers in history – Nelson Mandela, William Shakespeare, and Thomas Jefferson – were easily crawled from pages such as List of Top 100 Famous People, 100 People who Changed the World, Inspirational People. This gave rise to ~250 people (clearly not enough for deep learning).
- Between 1901 and 2016, the Nobel Prizes and the Prize in Economic Sciences were awarded 579 times to 911 people and organizations. With some receiving the Nobel Prize more than once, this makes a total of 881 individuals and 23 organizations. That’s better, but now our data is heavily skewed towards scientists and authors.
- Finally, I resorted to lists of People on Wikipedia itself for A) less famous people, and B) more politicians, business people, athletes, artists and entertainers.
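As an illustration of the extraction step, here is a minimal sketch of picking out early-life sections from a biography page. The `pick_early_life` helper and its `(heading, text)` input format are hypothetical stand-ins for the actual MediaWiki + beautifulsoup4 pipeline:

```python
import re

# Headings that signal an early-life section (the pattern is illustrative).
EARLY_SECTION = re.compile(r"early life|childhood|family", re.IGNORECASE)

def pick_early_life(sections):
    """Keep the text of sections whose heading looks early-life related.

    `sections` is a list of (heading, text) pairs, as one might get after
    parsing a biography page's section tree.
    """
    return [text for heading, text in sections if EARLY_SECTION.search(heading)]
```

For example, given sections headed “Early life and education” and “Political career”, only the first section’s text is kept.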
Cleansing and Quality
Approximately 22% of the names do not have dedicated “early life” sections on Wikipedia (some pages are polluted with professional career information, and some lack such sections, or Wiki pages, altogether).
Texts were split into tokens with space as the delimiter (i.e. words and punctuation marks are tokens). Most data points have 100~700 tokens. To get a sense of how much information that is, check out a 254-token extract for Al Gore.
Histograms of the number of tokens and their logs are shown below.
In total there were 3,976 usable data points, divided evenly among the 7 occupations. The evenness was by design: having 10 times more data on entertainers than businessmen does not mean that there are 10 times more movie stars than office workers in real life. Popularity on Wikipedia simply reflects public interest, not the occupation ratio in reality. Therefore, I trimmed extra data points so that all occupations have the same number of samples. Perhaps research on the proportions of these occupations in real life would yield a more realistic sample and more accurate predictions of one’s destiny.
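The trimming described above amounts to downsampling every occupation to the size of its smallest peer. A minimal sketch (the `balance_classes` helper is mine, not from the project’s code):

```python
import random

def balance_classes(samples, seed=0):
    """Downsample every class to the size of the smallest one.

    `samples` is a list of (text, label) pairs. Overrepresented labels
    (e.g. entertainers) are randomly trimmed so that each occupation
    contributes the same number of data points.
    """
    by_label = {}
    for text, label in samples:
        by_label.setdefault(label, []).append(text)
    n = min(len(texts) for texts in by_label.values())
    rng = random.Random(seed)
    balanced = []
    for label, texts in by_label.items():
        rng.shuffle(texts)                      # random trim, not "first n"
        balanced.extend((t, label) for t in texts[:n])
    return balanced
```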
The data was split into 8:1:1 train:validation:test using shuffled stratified split to ensure the same proportions of occupations across training, validation and testing. So each of the 7 occupations has:
- 454 training data points
- 57 validation data points
- 57 testing data points
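A shuffled stratified split can be sketched in plain Python as follows (in practice one might reach for scikit-learn’s StratifiedShuffleSplit; the helper below is illustrative and its exact per-class counts depend on rounding):

```python
import random

def stratified_split(samples, ratios=(8, 1, 1), seed=0):
    """Shuffled stratified 8:1:1 split of (text, label) pairs.

    Each label is shuffled and sliced independently, so the train,
    validation and test sets keep the same occupation proportions.
    """
    rng = random.Random(seed)
    by_label = {}
    for item in samples:
        by_label.setdefault(item[1], []).append(item)
    train, val, test = [], [], []
    total = sum(ratios)
    for items in by_label.values():
        rng.shuffle(items)
        n = len(items)
        n_train = n * ratios[0] // total
        n_val = n * ratios[1] // total
        train.extend(items[:n_train])
        val.extend(items[n_train:n_train + n_val])
        test.extend(items[n_train + n_val:])
    return train, val, test
```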
Word2Vec
The first step was to convert text to numbers for training.
- Convert text to lowercase and split on spaces, so that each token corresponds to either a word or a punctuation mark.
- Convert unique tokens to vectors. For this I experimented with two approaches:
  - starting with a random embedding matrix and learning it as part of the training process; and
  - using the pre-trained Global Vectors for Word Representation (a.k.a. GloVe) embeddings produced by Stanford.
Given the amount of training data available, the pre-trained embeddings significantly outperformed the self-trained ones, so they were used for all models going forward.
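GloVe ships as plain text, one token per line followed by its vector components. A minimal sketch of loading it and embedding a tokenized text (the helper names and the `"unk"` fallback token are assumptions of this sketch, not the project’s actual code):

```python
def load_glove(lines):
    """Parse GloVe's plain-text format ('word v1 v2 ...') into a dict
    mapping token -> list of floats. In practice `lines` would come
    from a file such as glove.6B.300d.txt."""
    embedding = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        embedding[parts[0]] = [float(x) for x in parts[1:]]
    return embedding

def embed_tokens(tokens, embedding, unk="unk"):
    """Look up each lowercase token, falling back to an UNK vector
    for out-of-vocabulary tokens."""
    return [embedding.get(tok, embedding[unk]) for tok in tokens]
```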
The Elephant in the Room: Variable Lengths
One of the biggest challenges in this project is the fact that samples have drastically different numbers of tokens. (E.g. Anthony Doerr, a winner of the Pulitzer Prize for Fiction, has only 49 tokens, whereas Angelina Jolie has 844.) In fact, our token-length distribution has a standard deviation of 306.06, a mean of 286.50, and a coefficient of variation of 107%. So chopping off longer texts is clearly out of the question. Instead, shorter texts were padded with UNK tokens, in the hope that the model would figure out the rest. We shall see soon that this proved to be a challenge for some models. (Note: bucketing was used to keep memory usage in check.)
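Bucketing plus padding can be sketched as grouping samples of similar length into batches and padding each batch only to its own longest sequence, so short texts waste little memory (the `make_buckets` helper is illustrative, not the project’s actual code):

```python
def make_buckets(samples, batch_size, pad_token="UNK"):
    """Sort token lists by length, batch neighbours together, and pad
    each batch to its own maximum length with UNK tokens."""
    ordered = sorted(samples, key=len)
    batches = []
    for i in range(0, len(ordered), batch_size):
        batch = ordered[i:i + batch_size]
        max_len = max(len(s) for s in batch)
        padded = [s + [pad_token] * (max_len - len(s)) for s in batch]
        lengths = [len(s) for s in batch]  # true lengths, for the RNN later
        batches.append((padded, lengths))
    return batches
```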
Convolutional Neural Networks
To test the waters quickly, I first tried CNN, even though CNN is employed more often for image-related than text-related learning. (See here for a detailed description of how CNN works.)
Each text sample in a batch has num_seq rows and vecDim columns (the dimension of the word2vec embedding, e.g. 300 for 300-long vector representations of a word token). Note that num_seq is not equal to the actual number of tokens in the sample, because it was padded to have the same number of sequences as the other samples in the same batch.
As shown in the image below:
- Convolutions span all vecDim columns and slide along the sequence axis, so a filter of height k results in a vector of length num_seq - k + 1.
- Either Maxpool or Local Normalization is done after convolutions. Local Normalization had slightly better performance across almost all parameters.
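The convolution arithmetic can be checked with a tiny plain-Python example: a filter of height k spanning the full embedding width slides down the sequence, producing num_seq - k + 1 activations, which are then max-pooled. This sketch is for intuition only; the real model uses Tensorflow:

```python
def valid_conv_1d(rows, filt):
    """'VALID' convolution down the sequence axis.

    `rows` is a num_seq x vecDim token matrix; `filt` is a k x vecDim
    filter. Each output is the elementwise product-sum of the filter
    with k consecutive token vectors, giving num_seq - k + 1 values.
    """
    k = len(filt)
    out = []
    for i in range(len(rows) - k + 1):
        window = rows[i:i + k]
        out.append(sum(w * r
                       for w_row, r_row in zip(filt, window)
                       for w, r in zip(w_row, r_row)))
    return out
```

Max-pooling the returned vector (e.g. `max(out)`) collapses the sequence axis to a single feature per filter.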
A snippet of the code (full code on Github):
l = ConvMaxpoolLayer(layer1.output, layer1.output_shape,
                     convParams_={'filterShape': (filterSize, self.embeddingDim),
                                  'numFeaturesPerFilter': self.numFeaturesPerFilter,
                                  'activation': 'relu'},
                     maxPoolParams_={'ksize': (inputNumCols - filterSize + 1, 1),
                                     'padding': 'VALID'},
                     loggerFactory=self.loggerFactory)
Recurrent Neural Networks
RNNs are very well suited to Natural Language Processing problems, especially when combined with Long Short-Term Memory units. Furthermore, bi-directional RNNs take into account both the text before and after a given step, making them a good candidate for making sense of “stories” like the early life descriptions in our case (graph courtesy of Colah’s blog).
Code Layout
Proper code architecture as well as testing mechanisms are crucial to experimenting with different models rapidly.
Layer Base Class
Correctly handling Tensorflow’s tensor dimensions made it difficult to insert and swap layers like Lego blocks and build models quickly. For instance, constructing a Fully Connected layer requires the weight matrix’s dimensions to be constants instead of Variables, which means the other layers’ input and output dimensions have to be known at construction time. Therefore, each Layer class must expose an output_shape property. Also, different layers have different input-dimension requirements (e.g. Tensorflow’s convolution requires 4-D input even though in our case there is always only 1 channel), so each Layer class must also know how to reshape its input to the correct number of dimensions.
A snippet of the AbstractLayer class is shown below (full code here).
class AbstractLayer(metaclass=ABCMeta):

    @property
    def output(self):
        return self.__output

    @output.setter
    def output(self, val):
        self.__output = self.activationFunc(val)

    @abstractmethod
    def make_graph(self):
        raise NotImplementedError('This (%s) is an abstract class.' % self.__class__.__name__)

    @property
    @abstractmethod
    def output_shape(self):
        raise NotImplementedError('This (%s) is an abstract class.' % self.__class__.__name__)

    def input_modifier(self, val):
        return val

    @property
    def input(self):
        return self.__input

    @input.setter
    def input(self, val):
        self.__input = self.input_modifier(val)
Model Base Class
All models should be able to
- build its layers/graph
- train
- evaluate
- be assigned a new learning rate when needed
- store outputs (for debugging purposes)
- unit-test itself using various parameters
A snippet of the AbstractModel class is shown below (full code here).
class AbstractModel(metaclass=ABCMeta):

    def __init__(self, input_, initialLearningRate, loggerFactory_=None):
        ...
        self.make_graph()

        with name_scope('predictions'):
            self.pred = tf.argmax(self.output, 1)
            self.trueY = tf.argmax(self.y, 1)

        with name_scope('metrics'):
            self.cost = tf.reduce_mean(
                tf.nn.softmax_cross_entropy_with_logits(
                    logits=self.output, labels=self.y)) \
                + self.l2Loss
            self.accuracy = tf.reduce_mean(tf.cast(tf.equal(self.pred, self.trueY), tf.float32))
            summary.scalar('cost', self.cost)
            summary.scalar('accuracy', self.accuracy)

        with name_scope('optimizer'):
            self.optimizer = tf.train.AdamOptimizer(learning_rate=self._lr).minimize(self.cost)

        self.merged_summaries = summary.merge_all()

    def assign_lr(self, sess_, newLearningRate_):
        ...

    def train_op(self, sess_, feedDict_, computeMetrics_):
        ...

    def evaluate(self, sess_, feedDict_):
        ...

    @classmethod
    def run_thru_data(cls, dataReaderKlass, dataScale, modelParams, runScale,
                      useCPU=True, **otherDataReaderKwargs):
        ...

    @abstractmethod
    def make_graph(self):
        ...
Models
I experimented with 6 network structures, named Marks 1~6 (an Iron Man reference!).
All models are combinations of RNN and CNN in some way. In order to quickly gauge whether a model would be a viable candidate, I initially tested them on just 2 occupations – politician and scientist.
Mark 1
The first model, Mark 1, consists of multiple CNNs (Convolution followed by either Maxpool or Local Normalization) of various filter widths. Convolution is done across word2vec token vectors for neighboring tokens/sequences, then Maxpooled or Local-Normed across sequences. The CNNs’ outputs are then concatenated, passed through a dropout layer, and fed into a fully connected layer (pictorial representation below).
Code snippet of Mark 1 below (full code here).
for filterSize in self.filterSizes:
    l = ConvMaxpoolLayer(layer1.output, layer1.output_shape,
                         convParams_={'filterShape': (filterSize, self.embeddingDim),
                                      'numFeaturesPerFilter': self.numFeaturesPerFilter,
                                      'activation': 'relu'},
                         maxPoolParams_={'ksize': (inputNumCols - filterSize + 1, 1),
                                         'padding': 'VALID'},
                         loggerFactory=self.loggerFactory)
    layer2_outputs.append(l.output)

...

# layer3: dropout
self.add_layers(DropoutLayer.new(self.pooledKeepProb))

# layer4: fully connected
lastLayer = self.add_layers(FullyConnectedLayer.new(self.numClasses))

self.l2Loss = self.l2RegLambda * (tf.nn.l2_loss(lastLayer.weights) + tf.nn.l2_loss(lastLayer.biases))
When tested with the 2-occupation dataset of Politicians vs Scientists, validation and test accuracies hovered around 0.75 with either Maxpool or Local Normalization, and with either regularization scheme (L2 vs dropout). Although this showed signs of learning (better than 0.5), it did not seem quite good enough. Reviewing the weight matrix showed that the problem was with variable sequence lengths – as mentioned earlier, the number of sequences/tokens ranges from dozens to hundreds.
Mark 2
Then I experimented with RNNs. The “size” of the RNN depends on the sequence length, posing a similar problem to CNNs used in Mark 1. Shown below are illustrations of RNN chains for texts with few and many tokens/sequences/steps (these terms are used interchangeably in this post), respectively.
Luckily, Tensorflow has a bidirectional_dynamic_rnn method that handles varying sequence lengths across samples in a batch!
self.outputs = tf.concat(
    tf.nn.bidirectional_dynamic_rnn(self.forwardCells, self.backwardCells,
                                    time_major=False, inputs=self.x, dtype=tf.float32,
                                    sequence_length=self.numSeqs,
                                    swap_memory=True)[0], 2)
Caveat: one has to be careful to extract the correct last step’s output, which depends on the number of sequences.
def last_relevant(output_, lengths_, numRows_=1):
    batch_size = tf.shape(output_)[0]
    max_length = tf.shape(output_)[1]
    out_size = int(output_.get_shape()[2])
    index = tf.expand_dims(tf.range(0, batch_size), -1) * max_length \
        + tf.tile(tf.expand_dims(lengths_ - 1, -1), [1, numRows_]) \
        + tf.range(-numRows_ + 1, 1)
    flat = tf.reshape(output_, [-1, out_size])
    return tf.gather(flat, index)
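For intuition, here is a plain-Python mirror of that flatten-and-gather trick for the common single-row case (the helper name is mine):

```python
def last_relevant_flat(outputs, lengths):
    """Pick each sample's output at its true last step.

    `outputs` has shape (batch, max_len, out_size) as nested lists and
    `lengths` holds the real (1-based) sequence lengths. Flattening to
    (batch * max_len, out_size) and indexing row b * max_len + (len - 1)
    skips the padded steps at the end of shorter samples.
    """
    max_len = len(outputs[0])
    flat = [step for sample in outputs for step in sample]
    return [flat[b * max_len + (length - 1)] for b, length in enumerate(lengths)]
```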
Now we are ready to create Mark 2, where instead of Maxpool or Local Normalization, convolution layers are followed by RNNs.
Code snippet of Mark 2 below, full code here.
for filterShape, keepProb in zip(self.convFilterShapes, self.convKeepProbs):
    cnn = ConvLocalnormLayer(self.x, (-1, self.maxNumSeqs, self.vecDim),
                             convParams_={'filterShape': (filterShape[0],
                                                          self.vecDim if filterShape[1] == -1 else filterShape[1]),
                                          'numFeaturesPerFilter': self.convNumFeaturesPerFilter,
                                          'keepProb': keepProb,
                                          'activation': 'relu'})
    newInput, newInputNumCols = convert_to_3d(cnn.output, cnn.output_shape)
    rnn = RNNLayer({'x': newInput, 'numSeqs': self.numSeqs - filterShape[0] + 1},
                   (-1, cnn.output_shape[1], newInputNumCols),
                   self.rnnNumCellUnits, self.rnnKeepProbs)
Mark 2 gave a 2-occupation test accuracy of 0.80, compared to 0.75 for Mark 1.
Mark 3
Let’s go back to basics and see how well bi-directional LSTM-RNN does on its own. Mark 3 is just RNN followed by a fully-connected layer.
Code snippet of Mark 3 below, full code here.
self.add_layers(RNNLayer.new(self.rnnNumCellUnits), self.input, (-1, -1, self.vecDim))
With 3 LSTM cells of 64, 32 and 16 units, respectively, Mark 3 achieved a 2-occupation test accuracy of 0.89!
Marks 4 & 5
Given the success of the RNN-first approach, I experimented with feeding RNN’s output into CNN (Mark 4), as well as running RNN and CNN independently and concatenating the results (Mark 5).
Neither showed noticeable improvement over Mark 3.
Mark 6
From a data-driven product point of view, I prefer simpler models over chasing third-decimal-place accuracy gains with more complex ones.
Therefore, I decided to explore RNN-only structures, from wide and deep RNNs to parallel RNNs, which is Mark 6.
Code snippet of Mark 6 below (full code here).
makers = [RNNLayer.new(c.numCellUnits, c.keepProbs, activation=c.activation) for c in self.rnnConfigs]
self.add_layers(makers, self.input, (-1, -1, self.vecDim))
self.add_layers(DropoutLayer.new(self.pooledKeepProb, self.pooledActivation))
self.add_layers(FullyConnectedLayer.new(self.numClasses))
Results
Now we are ready to run Mark 3 and Mark 6 on the full dataset. Note that since there are 7 occupations, a random guess would have a 0.14 accuracy.
Mark 3
Mark 3 is the dynamic bi-directional RNN wrapped around layers of LSTM cells. Networks were built by tuning the following parameters:
- number of LSTM cells/layers
- number of hidden units in each LSTM cell
- the dropout probability of each LSTM cell
The plot below illustrates the relationship between test accuracy (y-axis) and:
- number of LSTM layers (x-axis)
- the number of hidden units of the largest LSTM cell (size of the bubbles): the sizes are 16, 32, 64, 128, 512, and 1024
- how the number of LSTM hidden units progress through the layers (color of the bubbles), e.g. “decreasing” refers to [64, 32, 16], “increasing” refers to [16, 32, 64], and “constant” refers to [32, 32, 32], etc.
- Note that the accuracy numbers are averaged across configurations of LSTM dropout probabilities. However, dropout probabilities of ~0.4 (i.e. 0.6 probability of keeping) gave the best results.
We can see that:
- 3 and 4 layers give the best results.
- Larger cells don’t correspond to the best results (perhaps due to the amount of data). 64 is the optimal largest cell size.
- “Constant” cell-size patterns consistently give the best results.
Mark 6
The top three Mark 3 structures that gave the highest test accuracies were combined to form Mark 6 (a series of Mark 3’s run in parallel, with their concatenated outputs fed into a fully connected layer). They are:
When tested on both the validation and test sets, Mark 6 achieved an accuracy of 0.74. Below are the Confusion Matrices. Since all occupations have the same number of samples, no normalization is necessary.
Here are precision, recall and F1 scores by occupations.
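Per-class precision, recall and F1 follow directly from the confusion matrix; a minimal sketch (the helper name is mine):

```python
def scores_from_confusion(cm, labels):
    """Per-class precision, recall and F1 from a square confusion
    matrix cm, where cm[i][j] counts samples of true class i that
    were predicted as class j."""
    results = {}
    n = len(labels)
    for i, label in enumerate(labels):
        tp = cm[i][i]
        fp = sum(cm[r][i] for r in range(n)) - tp   # column i minus diagonal
        fn = sum(cm[i]) - tp                        # row i minus diagonal
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        results[label] = (precision, recall, f1)
    return results
```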
Some interesting observations:
- Author is by far the most “misunderstood” occupation. The top 3 occupations incorrectly predicted as Author are Politician, Businessman and Artist. (After all, the difference between Artist and Author is not always black-and-white.)
- Scientist and Artist are the most “predictable” occupations.
Was the model wrong, or did they miss their calling?
Now here’s where we venture into the philosophical realm of modeling an imperfect real world with a theoretical Machine Learning model. Interesting “wrong cases”:
- Karl Marx is predicted to be a Politician.
- David Crausby is predicted to be an Artist instead of a Politician (which in some people’s opinion is a form of art).
- Nelson Mandela is predicted to be a (full-time) Author, which is not entirely wrong since he did author some books!
- Dwight Eisenhower is predicted to be an Athlete.
Caveats and Potential Improvements
This fun side project was thrown together in my spare time. Things can certainly be improved with more time and resources.
- Because the data was based on the most influential individuals in history, the prediction logic may not apply to “ordinary” people! I looked for detailed early life descriptions of everyday folks, but to no avail. Please ping me if you know a good source, because this would lead to an even more interesting project: Who Will Become Influential.
- Due to hardware and time constraints, I did not test larger/more complex network structures with more data.
- The world has changed since 200 years ago, what applied back then may no longer hold true in the present. It’d be interesting to see how the model behaviour changes when fed data from different eras.
Thanks for reading thus far! I hope this post was useful and entertaining for you. :) Finally, I enjoy meeting cool fellow creators by working on fun projects together. Feel free to reach out to me at jj@planetj.io if you’d like to collaborate on an interesting project.