Dimensionality reduction using Principal Component Analysis: A small case study
A couple of weeks ago I had the chance to play around with PCA and see its effects first-hand. I wrote a small piece of code that "guesses" a country when given its capital.
To note here: the algorithm does not know what a country is, nor what a capital is. It works purely off 300-dimensional word vectors (word2vec) and completes an analogy that I pose to it (for example, Canada : Ottawa :: ? : Cairo). This is done via the following steps, using cosine similarity:
- The analogy is put to use in the familiar way, King - Man + Woman = Queen, or in this particular case Country1 - Capital1 + Capital2 = Country2_to_be_found.
- I then run through a vocabulary of 10,000 words to find the word whose 300-dimensional vector subtends the smallest angle with Country2_to_be_found. This is done by finding the word whose cosine of the angle with Country2_to_be_found is the highest; this measure is called cosine similarity. That word becomes the best candidate, the closest match for Country2_to_be_found.
- I then measure the accuracy of the above method by comparing Country2_to_be_found against the actual country that Capital2 is the capital of.
- I get an accuracy of ~92% with the above method, using word vectors of length 300.
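The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the actual code from the experiment: `solve_analogy` and `embeddings` are hypothetical names, and the toy 3-dimensional vectors stand in for real 300-dimensional word2vec embeddings.

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|); highest cosine = smallest angle
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def solve_analogy(country1, capital1, capital2, embeddings):
    # Country1 - Capital1 + Capital2 ≈ Country2_to_be_found
    target = embeddings[country1] - embeddings[capital1] + embeddings[capital2]
    best_word, best_score = None, -np.inf
    for word, vec in embeddings.items():
        if word in (country1, capital1, capital2):
            continue  # exclude the query words themselves
        score = cosine_similarity(target, vec)
        if score > best_score:
            best_word, best_score = word, score
    return best_word

# Toy stand-in vectors; a real run would load 300-dim word2vec embeddings
embeddings = {
    "Canada": np.array([1.0, 0.9, 0.1]),
    "Ottawa": np.array([1.0, 0.1, 0.9]),
    "Egypt":  np.array([0.1, 0.9, 0.2]),
    "Cairo":  np.array([0.1, 0.1, 1.0]),
}
```

With these toy vectors, `solve_analogy("Canada", "Ottawa", "Cairo", embeddings)` returns "Egypt", since Canada - Ottawa + Cairo lands exactly on the Egypt vector.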
I then use Principal Component Analysis to reduce the dimensionality of the word vectors from 300 downwards in steps of 50, repeating the steps above for each transformed set of vectors, to see the effect on the accuracy of my objective.
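A sketch of that sweep, assuming the vocabulary's vectors are stacked into a matrix X of shape (n_words, 300). The SVD-based `pca_reduce` helper and the random matrix are my own illustrative stand-ins, not the original code; in the real experiment each reduced matrix would feed back into the analogy steps above.

```python
import numpy as np

def pca_reduce(X, n_components):
    # Centre the data, then project onto the top principal components.
    # The rows of Vt are the eigenvectors of the covariance matrix,
    # ordered by decreasing explained variance.
    X_centered = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T

# Toy stand-in: 1000 random "word vectors" of dimension 300
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 300))

# Sweep the retained dimensionality from 300 down in steps of 50
for k in range(300, 0, -50):
    X_k = pca_reduce(X, k)
    # X_k (shape: 1000 x k) would replace the original vectors
    # in the analogy/cosine-similarity pipeline, and accuracy
    # would be re-measured at each k
```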
The results were very interesting: the dimensionality of the vectors could be reduced from 300 all the way down to 70 and still retain the 92% accuracy! Below that, the accuracy falls off a cliff.
In other words, PCA helped reduce the dimensionality of the data four-fold while retaining all the information necessary for the job at hand, with no performance hit. The caveats that remain: this kind of result depends on the problem statement at hand and on the corpus the word vectors were trained on; but it was still a bit surprising.
Makes me wish I had paid more attention to eigenvectors and eigenvalues as a kid...