Tuning Supervised Machine Learning Models
As a continuation of the previous two posts, I will go into detail on the methods I used to fine-tune my Congress stock-purchasing AI.
I decided to use supervised machine learning models because they can leverage historical data to identify patterns, enabling a more data-driven, systematic decision-making process. By training on the past price movements of politicians' historical trades, the model builds decision trees that classify a given trade. Splitting the large dataset into 85% training and 15% testing data, the model optimizes for accuracy in predicting whether the stock will be up or down after a set period of time.
Choosing the right machine learning model was a crucial decision for this project. I prioritized speed, interpretability, and reliability, and thus landed on a Random Forest Classifier.
For rapid development, I wanted a model that was fast to train and iterate on. Unlike XGBoost or deep learning models, Random Forest trains its decision trees in parallel, which speeds up the training process.
I also wanted to apply my knowledge of fundamental analysis to determine the health of a company. With a model like a neural network, it would be difficult to understand why the model made certain decisions. The Random Forest Classifier gives a clear importance metric for each of my inputs, which helps me understand how my model works.
Other models like Gradient Boosting can overfit without careful tuning, and deep learning models require large datasets to be effective, which is not possible with only a few thousand transactions since 2014. Once again, the Random Forest Classifier was a solid middle ground that balanced accuracy, stability, and low maintenance.
Once the model was selected, the next step was to tune it with the data collected in my second article. This process was mostly straightforward: I split the dataset's features into categorical and numerical values, then passed them into a ColumnTransformer, which preprocesses each group separately.
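Before the ColumnTransformer can be built, the two transformers themselves need to be defined. A minimal sketch, assuming OneHotEncoder for the categorical features and StandardScaler for the numerical ones (the column names here are hypothetical placeholders, not my actual dataset fields):

```python
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical example columns; the real feature lists come from the dataset
categorical_features = ['Ticker', 'Party']
numerical_features = ['Amount', 'DaysToFile']

# Encode categories as one-hot vectors; scale numerics to zero mean, unit variance
categorical_transformer = OneHotEncoder(handle_unknown='ignore')
numerical_transformer = StandardScaler()
```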
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', categorical_transformer, categorical_features),
        ('num', numerical_transformer, numerical_features)
    ]
)
The classification label was computed by checking whether the asset's price ended above (1) or below (0) the stock's price on the date the filing was made.
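That label is a single vectorized comparison. A minimal sketch, assuming hypothetical column names `PriceAtFiling` and `PriceAfter7Days` (the real dataset's price fields may be named differently):

```python
import pandas as pd

# Toy rows standing in for the real trade data (hypothetical columns)
data = pd.DataFrame({
    'PriceAtFiling':   [100.0, 50.0, 75.0],
    'PriceAfter7Days': [105.0, 48.0, 75.0],
})

# 1 if the price 7 days later is above the price on the filing date, else 0
data['Profitable7'] = (data['PriceAfter7Days'] > data['PriceAtFiling']).astype(int)
```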
Next, the training and testing sets were created by splitting the X input (the categorical and numerical data) and the y label (the classifier variable) into the aforementioned 85-15 split.
X = data[categorical_features + numerical_features]
# 1 if profitable after 7 days, 0 otherwise
y = data['Profitable7']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=69)
The preprocessor and the classifier, the Random Forest Classifier, were then combined into a Pipeline.
model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(n_jobs=-1))
])
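Fitting this pipeline directly, before any tuning, gives the baseline accuracy to beat. A self-contained sketch on synthetic stand-in data (the real trade dataset and its columns are not reproduced here):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the trade dataset (hypothetical columns)
rng = np.random.default_rng(0)
X = pd.DataFrame({
    'Ticker': rng.choice(['AAPL', 'MSFT', 'NVDA'], size=200),
    'Amount': rng.uniform(1_000, 50_000, size=200),
})
y = rng.integers(0, 2, size=200)

preprocessor = ColumnTransformer(transformers=[
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['Ticker']),
    ('num', StandardScaler(), ['Amount']),
])
model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(n_jobs=-1, random_state=0)),
])

# Same 85-15 split as above, then fit and score the untuned pipeline
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, random_state=69)
model.fit(X_train, y_train)
print(f"Baseline accuracy: {model.score(X_test, y_test):.2f}")
```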
Once the model was trained, I used GridSearchCV to determine the optimal hyperparameter values for the classifier. To avoid over-tuning at this initial stage, I only tuned the n_estimators, max_depth, and min_samples_split parameters of the model.
# hyperparameter tuning
param_grid = {
    'classifier__n_estimators': [100, 250, 500],
    'classifier__max_depth': [3, 5, 7, 9, 15, 20],
    'classifier__min_samples_split': [2, 4, 6, 8],
}
grid_search = GridSearchCV(
    estimator=model,
    param_grid=param_grid,
    scoring='accuracy',
    cv=3,
    # with a single scoring metric, refit must be a boolean;
    # True refits the best estimator on the full training set
    refit=True,
    # verbose=2,
    n_jobs=-1,
)
grid_search.fit(X_train, y_train)
print("Best parameters found:")
print(grid_search.best_params_)
print("Best score found:")
print(grid_search.best_score_)
best_model = grid_search.best_estimator_
test_score = best_model.score(X_test, y_test)
print(f"Test score: {test_score}")
Through careful tuning of the parameter grid, I was able to boost the model’s accuracy from 63% to 68%. This improvement highlights the power of fine-tuning and reinforces the value of continuous iteration.