Can you teach a GPT model to write Ruby?
The last two weekends were a lot of fun for me. I have been learning Ruby for my new job @Shopify, and I thought, why not try to teach a GPT model to generate Ruby code so we can learn together :). To be honest, it was just an excuse to learn the HuggingFace GPT-2 APIs. I expected this to take quite some time and most probably not work, but to be fun anyway. And... was I wrong: after about 3 days of work I am seeing some interesting results, and I couldn't resist sharing (although I am far from done).
The approach was very simple (rough code sketches of each step follow the generated sample below):
- Write some Python code to search for public GitHub repos that contain Ruby code, clone them, and delete everything except the Ruby files. I kept cloning until I had around 150MB of Ruby files (I am sure there are a lot of repeated samples and tutorials in there).
- Treat the code like normal text (for simplicity): train a tokenizer, combine the Ruby files into one training data set, and encode it in a form the HuggingFace API can consume.
- Fine-tune a pre-trained GPT-2 model on that data set, following the HuggingFace GPT fine-tuning tutorial.
- Train the model for only 5 epochs (around 14 hours on my RTX 3090). Here is a sample of the code the model generated:
# frozen_string_literal: true
module ActionView
  module Template::Handlers
    class Raw
      def call(template, source)
        "#{source.inspect}.html_safe;"
      end
    end
  end
end
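For anyone curious what the steps could look like in code, here is a minimal sketch of the scraping step. This is not my exact script: the GitHub search query, directory names, and the 150MB cutoff logic are illustrative assumptions, and note that unauthenticated GitHub API requests are heavily rate-limited.

import os
import shutil
import subprocess

import requests

SEARCH_URL = "https://api.github.com/search/repositories"
TARGET_DIR = "ruby_corpus"        # hypothetical output directory
TARGET_BYTES = 150 * 1024 * 1024  # stop around 150MB of Ruby files, as in the post

def corpus_size(root):
    # Total bytes of the files collected so far
    return sum(
        os.path.getsize(os.path.join(dirpath, name))
        for dirpath, _, names in os.walk(root)
        for name in names
    )

def clone_and_strip(clone_url, dest):
    # Shallow-clone a repo, drop its git history, then keep only the .rb files
    subprocess.run(["git", "clone", "--depth", "1", clone_url, dest], check=True)
    shutil.rmtree(os.path.join(dest, ".git"), ignore_errors=True)
    for dirpath, _, names in os.walk(dest):
        for name in names:
            if not name.endswith(".rb"):
                os.remove(os.path.join(dirpath, name))

os.makedirs(TARGET_DIR, exist_ok=True)
page = 1
while corpus_size(TARGET_DIR) < TARGET_BYTES:
    resp = requests.get(SEARCH_URL, params={"q": "language:ruby", "sort": "stars", "page": page})
    resp.raise_for_status()
    for repo in resp.json()["items"]:
        dest = os.path.join(TARGET_DIR, repo["name"])
        if not os.path.exists(dest):
            clone_and_strip(repo["clone_url"], dest)
    page += 1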
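The tokenizer step could look something like this, using the HuggingFace tokenizers library. The vocabulary size and output directory are assumptions on my part; also, when fine-tuning a pre-trained GPT-2 you can simply reuse its stock tokenizer, which is what the next sketch does for simplicity.

import glob

from tokenizers import ByteLevelBPETokenizer

# All the .rb files collected in the previous step
ruby_files = glob.glob("ruby_corpus/**/*.rb", recursive=True)

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=ruby_files,
    vocab_size=50_257,                 # assumed; matches GPT-2's vocabulary size
    min_frequency=2,
    special_tokens=["<|endoftext|>"],  # GPT-2's end-of-text token
)
tokenizer.save_model("ruby_tokenizer")  # writes vocab.json and merges.txt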
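And a minimal sketch of the fine-tuning and sampling steps, loosely following the HuggingFace language-modeling fine-tuning tutorial (TextDataset is the simple, now legacy, helper from that tutorial era). Everything except the 5 epochs is an assumption here (batch size, block size, file names), and ruby_corpus.txt is assumed to be all the collected Ruby files concatenated into one text file.

from transformers import (
    DataCollatorForLanguageModeling,
    GPT2LMHeadModel,
    GPT2TokenizerFast,
    TextDataset,
    Trainer,
    TrainingArguments,
)

# Stock GPT-2 tokenizer and pre-trained weights
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# "ruby_corpus.txt" is assumed to be the concatenated .rb files
train_dataset = TextDataset(tokenizer=tokenizer, file_path="ruby_corpus.txt", block_size=128)
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

training_args = TrainingArguments(
    output_dir="ruby-gpt2",
    num_train_epochs=5,            # matches the 5 epochs in the post
    per_device_train_batch_size=4, # assumed
    save_steps=10_000,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
)
trainer.train()
trainer.save_model("ruby-gpt2")

# Sample some Ruby from the fine-tuned model
prompt = "# frozen_string_literal: true\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_length=120,
    do_sample=True,
    top_k=50,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))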
Not too bad, considering that I did not touch the generated sample above at all; it comes out of the model this way, formatting and everything. It actually learned very interesting things about formatting and the structure of classes, modules, and functions, even down to the frozen_string_literal comment at the start of the file :). It learned to use 2 spaces for indentation. I have to say it does a better job with formatting than I do :)
Of course, it is far from perfect. But being able to do this without being a data scientist, training on a local machine without crazy hardware (a single RTX 3090 GPU), in only a few days, on Ruby code scraped randomly from public Git repos without any cleaning whatsoever, makes me so excited about the possibilities.
I will be sharing the code as a training session to explain the whole process in detail if there is enough interest.
The moral of the story: never stop learning. If you have an idea, regardless of how crazy it looks (actually, the crazier the better), go for it and try; what do you have to lose? Worst case, you will learn that it is not doable. I never call these failures; I call them "I succeeded in learning how not to do <add idea here>" :)
Happy learning, everyone.
Useful links:
- HuggingFace: https://huggingface.co/
- My ML introduction sessions: https://github.com/mohmiim/MLIntroduction