Building an LLM wrapper prototype to develop my ML and coding skills...
It has been 7 months since I began up-leveling my technical skills through courses and projects. To apply these skills, I decided to develop a prototype that would demonstrate everything I've learned so far.
I leveraged my knowledge of algorithms, APIs, data cleaning and processing, large language models, machine learning, prompt engineering, and Python to create a system that understands the relationship between skills and job roles. This system allows job seekers to input their current skills and receive predicted and generated lists of potential career transition paths.
In this article, I will cover the steps I took to develop a solution I've named Skill(X).
System Overview
"Roles to consider":
"Matched Job Title #",
"Job description": "Concise and informative job description (around 2-3 sentences). Focus on key responsibilities and required skills.",
"Skill match analysis summary": "Overall analytical assessment of how well the user's *entire skillset* (all 5 skills combined) aligns with the requirements of this role. Consider the relative importance of each skill to the role. Examples: 'Strong overall match...', 'Partial match...', 'Weak overall match...'",
"Skill match analysis explained":
"Skill": "Skill Name",
"Match": "Weak/Medium/Strong",
"Justification": "Brief explanation."
"How your skills apply to this role": "Explain how the user's provided skills can be practically applied to this specific job. Provide concrete examples and focus on transferable skills. Aim for 3-4 sentences."
"Skill Development": "Provide recommendations to the user to help them develop this skill."
With that context in mind, for this next part, I shared my code with Google Gemini so it could generate a detailed description of the model creation, training, and testing process...
Here's what Gemini had to say about my code:
Step 1: Preparing the Data
The project begins by organizing data from a large database of job postings. Each job role (e.g., "Software Engineer") and skill set (e.g., "Python, SQL, Machine Learning") is extracted and prepped for analysis. To manage and manipulate this data efficiently, we use pandas, a Python library designed for data manipulation and analysis. By adding the data to a pandas DataFrame, we structure it in a tabular format, where each row represents a job posting, and columns contain details like job roles, skills, and other relevant information. This makes it easier to clean, transform, and analyze the data.
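As a minimal sketch of this step (the file name is an assumption; the Role and skills columns match the snippets later in the article), loading the postings might look like this:

import pandas as pd

# Load the job postings dataset (file name is illustrative)
df = pd.read_csv("job_postings.csv")

# Keep only the columns the model needs and drop incomplete rows
df = df[['Role', 'skills']].dropna()
print(df.head())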
Step 2: Understanding Skills and Roles with SBERT
To help the computer "understand" text, we use a special tool called Sentence-BERT (SBERT). SBERT is a pre-trained language model that turns text into embeddings – numerical representations that capture the meaning of words and sentences. Think of it like creating a unique, detailed barcode for each role and skill set, enabling the system to recognize patterns and similarities between them. For example:
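Here is a minimal sketch of the idea, using the same all-MiniLM-L6-v2 model as the project but with made-up inputs (an illustration, not the project's code):

from sentence_transformers import SentenceTransformer, util

sbert = SentenceTransformer('all-MiniLM-L6-v2')

# Turn a skill set and two candidate roles into embeddings (384-dimensional vectors)
skill_vec = sbert.encode("Python, SQL, Machine Learning")
role_vecs = sbert.encode(["Data Scientist", "Graphic Designer"])

# Cosine similarity: higher scores mean the texts are closer in meaning
scores = util.cos_sim(skill_vec, role_vecs)
print(scores)  # the "Data Scientist" score should be noticeably higher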
By converting both skills and job roles into embeddings, the system can calculate similarities between them and identify which roles align with which skill sets.
Step 3: Labeling Roles
Each job role is converted into a numerical label using Label Encoding from the sklearn library. This process assigns a unique number to each job role, making it easier for the system to process and classify them. For example, "Data Scientist" might become 0 and "Machine Learning Engineer" might become 1.
Label encoding ensures that roles are represented in a consistent and machine-readable format, simplifying their use in the model.
from sentence_transformers import SentenceTransformer
from sklearn.preprocessing import LabelEncoder

# Extract roles and skill sets from the pandas DataFrame loaded earlier
print("Extracting roles and skills from dataset...")
roles = df['Role'].values
skills = df['skills'].values
print(f"Number of roles: {len(roles)} | Number of skill sets: {len(skills)}")

# Initialize the pre-trained SBERT model
print("Initializing pre-trained SBERT model...")
sbert_model = SentenceTransformer('all-MiniLM-L6-v2')

# Create embeddings for roles and skills
print("Creating embeddings for roles and skills...")
role_embeddings = sbert_model.encode(roles, show_progress_bar=True)
print("Role embeddings complete.")
print(f"Role Embeddings Shape: {role_embeddings.shape}")
skill_embeddings = sbert_model.encode(skills, show_progress_bar=True)
print("Skill embeddings complete.")
print(f"Skill Embeddings Shape: {skill_embeddings.shape}")

# Encode the roles as integer labels
print("Encoding roles as labels...")
label_encoder = LabelEncoder()
role_labels = label_encoder.fit_transform(roles)
print("Role labels encoded. Sample labels:")
print(role_labels[:5]) # Display first 5 encoded labels
Step 4: Splitting the Data
The dataset is divided into training and testing sets using the train_test_split function from sklearn. This step ensures that the model learns from one portion of the data (training set) and is tested on a different portion (testing set) to evaluate its performance. For example:
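Here is a minimal sketch, reusing the skill_embeddings and role_labels variables from the earlier snippet (the 80/20 split and random seed are assumptions, not necessarily the project's settings):

from sklearn.model_selection import train_test_split

# Hold out 20% of the data for testing (split ratio is illustrative)
X_train, X_test, y_train, y_test = train_test_split(
    skill_embeddings, role_labels, test_size=0.2, random_state=42
)
print(f"Training samples: {len(X_train)} | Testing samples: {len(X_test)}")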
This separation lets us check whether the model generalizes to new data rather than simply memorizing the training examples.
Step 5: Building the Brain (Neural Network)
A neural network acts as the brain of the system. Built using Keras, a high-level deep learning API in TensorFlow, this network is designed to find patterns in skill embeddings and map them to job roles. TensorFlow, the underlying framework, handles complex mathematical operations and optimization processes efficiently.
The network has multiple layers:
- Input Layer: This accepts the 384-dimensional SBERT skill embeddings.
- Hidden Layers: Three fully connected (Dense) layers with 512, 256, and 128 units use ReLU activations to learn patterns in the embeddings.
- Output Layer: This predicts the job role, outputting a probability for each role via softmax. For example, it might say there’s a 75% chance the role is "Data Scientist" and a 25% chance it’s "Machine Learning Engineer."
import tensorflow as tf

# Build the neural network model
print("Building the neural network model...")
model = tf.keras.Sequential([
    tf.keras.layers.Dense(512, activation='relu', input_shape=(384,)),  # explicit input shape: 384 = SBERT embedding size
    tf.keras.layers.Dense(256, activation='relu'),   # hidden layers learn patterns in the embeddings
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(376, activation='softmax') # one output per job role class; softmax gives probabilities
])
Step 6: Training the System
The neural network is trained using the training data. During training, the system adjusts its internal settings to minimize errors in predicting roles. This is achieved using a process called optimization, where the model continuously improves by comparing its predictions to the actual roles and correcting itself.
The optimization process relies on an algorithm called the Adam Optimizer, short for "Adaptive Moment Estimation." Adam combines two key techniques: momentum, which smooths each update by averaging recent gradients, and adaptive learning rates, which scale each parameter's step size based on how large its recent gradients have been.
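The exact training configuration isn't shown in the snippets above, so the loss function, epoch count, and batch size below are assumptions; a compile-and-fit step along these lines would train the network on the X_train/y_train split sketched earlier:

# Compile the model with the Adam optimizer (loss, epochs, and batch size are illustrative)
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',  # integer labels from LabelEncoder
              metrics=['accuracy'])

# Train on the training split, keeping a small validation slice to watch for overfitting
history = model.fit(X_train, y_train,
                    validation_split=0.1,
                    epochs=20,
                    batch_size=32)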
Step 7: Testing and Evaluating
After training, the model is evaluated on the testing data. It predicts roles for new skill sets, and we measure how accurate these predictions are. The accuracy score shows how well the system performs overall. For instance, if the model correctly predicts 90 out of 100 roles, the accuracy is 90%.
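A minimal sketch of this evaluation step, reusing the test split and label_encoder assumed in the earlier snippets, might look like:

import numpy as np

# Evaluate overall accuracy on the held-out test set
test_loss, test_accuracy = model.evaluate(X_test, y_test, verbose=0)
print(f"Test accuracy: {test_accuracy:.2%}")

# Predict a role for a single skill set and map it back to a job title
probabilities = model.predict(X_test[:1])
predicted_role = label_encoder.inverse_transform([np.argmax(probabilities)])
print(f"Predicted role: {predicted_role[0]}")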
End of Gemini response.
If you'd like to review the solution in more detail, feel free to explore my GitHub repo for Skill(X).
What's next?
Although this was just a prototype, it was fun to test the solution and review its predictions.
From here, there are many opportunities for fine-tuning and further development.
If you test the prototype, please note the following:
- Due to free hosting and usage limitations, the app might not be available 24/7. Limits on the number of concurrent users are also in place, and rate limiting may occur.
- The app is optimized for desktops. While functional on mobile devices, the skill match analysis table may not be fully responsive and can extend beyond the screen.
- Like all models, this one is only as accurate as the quality and mix of its training data and the inputs you provide. Gaps in the roles and skills training data may exist, leading to poor matches; as a fallback, Gemini is asked to generate replacements when this occurs.
Thanks for reading.
David