Optuna: The Best Hyperparameter Tuning Tool for Machine Learning
I was working as a Research Analyst with Zamilur Rahman, Ph.D. on a data science project to build automated machine learning for text classification. I cannot disclose the topic, but I can share insights from the work. While training machine learning models such as Logistic Regression, KNN, SVC, and so on, we faced the problem of tuning the parameters for each step in the pipeline as well as the model's own parameters.
At the beginning, we went with Grid Search or Random Search. However, both have problems. Grid Search follows a brute-force principle: it defines a grid of hyperparameter values and exhaustively searches all possible combinations. This becomes computationally expensive with a large number of hyperparameters or a wide range of values. Besides, Grid Search is less efficient for complex optimization problems, since it does not handle non-linear relationships between hyperparameters.
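To make the contrast concrete, here is a minimal Grid Search sketch over the kind of text pipeline used later in this article (the parameter values and the X_train/y_train training data are illustrative assumptions, matching the sample data shown below):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
pipe = Pipeline([('vect', CountVectorizer()), ('clf', LogisticRegression())])
param_grid = {
    'vect__max_features': [100, 500, 1000],
    'vect__ngram_range': [(1, 1), (1, 2)],
    'clf__C': [0.1, 1.0, 10.0],
}
# All 3 * 2 * 3 = 18 combinations are exhaustively cross-validated
grid = GridSearchCV(pipe, param_grid, cv=3)
grid.fit(X_train, y_train)
print(grid.best_params_)

Adding one more hyperparameter with k candidate values multiplies the trial count by k, which is exactly why the grid explodes.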
Random Search, on the other hand, is less complex but less accurate than Grid Search. It explores a predefined search space by randomly selecting combinations and evaluating model performance. Hence, there is no guarantee of finding the best set of hyperparameters; in exchange, this method costs less computational time.
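For comparison, Random Search over the same illustrative pipeline only samples a fixed number of combinations (this sketch assumes the pipe defined above and uses scipy's log-uniform distribution):

from scipy.stats import loguniform
from sklearn.model_selection import RandomizedSearchCV
param_distributions = {
    'vect__max_features': range(100, 1001),
    'vect__ngram_range': [(1, 1), (1, 2)],
    'clf__C': loguniform(0.01, 100),
}
# Only n_iter=10 randomly sampled combinations are evaluated, however large the space
search = RandomizedSearchCV(pipe, param_distributions, n_iter=10, cv=3, random_state=42)
search.fit(X_train, y_train)
print(search.best_params_)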
These problems led me to find another solution: Optuna. Its principle is based on Bayesian optimization techniques that rationally explore the search space, focusing on promising regions and reducing computational overhead. Optuna also handles non-linear relationships between hyperparameters and provides seamless integration with machine learning frameworks, along with customization options.
By prioritizing promising regions of the hyperparameter space to reduce the number of model evaluations required, Optuna simply outmatches Grid Search and Random Search. It is a preferred choice for complex, large-scale optimization problems because of its high efficiency and fast convergence.
Here is the sample code for Optuna:
import optuna
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import Pipeline
# Sample text data and labels (replace with your data)
X = ["text data sample 1", "text data sample 2", "text data sample 3", "text data sample 4"]
y = [0, 1, 0, 1] # Binary labels
# Splitting data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
def objective(trial):
    # Define the hyperparameters to tune
    max_df = trial.suggest_float('max_df', 0.5, 1.0)
    min_df = trial.suggest_float('min_df', 0.0, 0.5)
    max_features = trial.suggest_int('max_features', 100, 1000)
    # Categorical choices must be primitives (important for database storage later),
    # so the ngram_range tuple is encoded as a string and decoded here
    ngram_key = trial.suggest_categorical('ngram_range', ['1_1', '1_2', '2_2'])
    ngram_range = tuple(int(n) for n in ngram_key.split('_'))
    # Create a CountVectorizer instance with the parameters
    vectorizer = CountVectorizer(max_df=max_df, min_df=min_df, max_features=max_features, ngram_range=ngram_range)
    # Create a pipeline
    clf = Pipeline([
        ('vect', vectorizer),
        ('clf', LogisticRegression())
    ])
    # Evaluate the model with 3-fold cross-validation
    score = cross_val_score(clf, X_train, y_train, cv=3).mean()
    return score
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)
print(study.best_params)
Execute this script to start the hyperparameter optimization. Optuna will try different combinations of the CountVectorizer parameters to find the best configuration according to the cross-validated score.
Remember to adjust the parameters, classifier, and dataset to your specific needs. The ranges and choices in the suggest_* functions can be modified based on the nature of your dataset and the problem you are solving.
Once you have found the best hyperparameters for your pipeline using Optuna, you can use the tuned pipeline to make predictions on your test set. Here's how to do it:
from sklearn.metrics import classification_report
# Recreate the pipeline with the best parameters
best_params = study.best_params
# Decode the ngram_range string key back into the tuple CountVectorizer expects
best_ngram_range = tuple(int(n) for n in best_params['ngram_range'].split('_'))
pipeline_optimized = Pipeline([
    ('vect', CountVectorizer(max_df=best_params['max_df'], min_df=best_params['min_df'], max_features=best_params['max_features'], ngram_range=best_ngram_range)),
    ('clf', LogisticRegression())
])
# Fit the pipeline on the entire training data
pipeline_optimized.fit(X_train, y_train)
# Make predictions on the test set
predictions = pipeline_optimized.predict(X_test)
# Evaluate the predictions
print(classification_report(y_test, predictions))
In this code, the pipeline is rebuilt with the best parameters found by the study, refit on the entire training set, and evaluated on the held-out test set.
Typically, when this code is executed, Optuna runs on a single CPU core, which is not ideal for complex data. However, Optuna can certainly run on multiple cores, and it executes faster on many-core CPUs such as a Threadripper or an Intel Xeon. Ordinarily you do not need such a high-end processor; a current CPU like an Intel Core i5 or a Ryzen can be utilized as well. The only downside is that parallel execution requires Optuna to run against a storage back-end.
Native parallelization in Optuna, which executes multiple trials simultaneously, is managed through a shared database: each Optuna worker picks up a trial from the shared study and works on it. Here's a basic setup:
Step 1: Set Up a Shared Study Storage
First, you need a shared storage system that all workers can access. This could be a file-based SQLite database or a more robust system like MySQL or PostgreSQL for larger-scale parallelization.
# Shared storage URL that every worker connects to (replace the credentials with your own)
storage_url = "mysql+mysqlconnector://optuna_user:123456@localhost:3306/optuna_db"
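For a quick single-machine test, a file-based SQLite URL also works (the file name here is just an assumption):

# Alternative: a local SQLite file, no database server required
storage_url = "sqlite:///optuna_study.db"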
Step 2: Run the Optimization
You run the same optimization script on multiple cores/machines. Each script connects to the shared study and performs optimization. Optuna handles synchronization to ensure that the same trial is not evaluated more than once.
import optuna
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import Pipeline
# Sample text data and labels (replace with your data)
X = ["text data sample 1", "text data sample 2", "text data sample 3", "text data sample 4"]
y = [0, 1, 0, 1] # Binary labels
# Splitting data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
def objective(trial):
    # Define the hyperparameters to tune
    max_df = trial.suggest_float('max_df', 0.5, 1.0)
    min_df = trial.suggest_float('min_df', 0.0, 0.5)
    max_features = trial.suggest_int('max_features', 100, 1000)
    # Categorical choices must be primitives (required for database storage),
    # so the ngram_range tuple is encoded as a string and decoded here
    ngram_key = trial.suggest_categorical('ngram_range', ['1_1', '1_2', '2_2'])
    ngram_range = tuple(int(n) for n in ngram_key.split('_'))
    # Create a CountVectorizer instance with the parameters
    vectorizer = CountVectorizer(max_df=max_df, min_df=min_df, max_features=max_features, ngram_range=ngram_range)
    # Create a pipeline
    clf = Pipeline([
        ('vect', vectorizer),
        ('clf', LogisticRegression())
    ])
    # Evaluate the model with 3-fold cross-validation
    score = cross_val_score(clf, X_train, y_train, cv=3).mean()
    return score
# storage_url as defined in Step 1
storage_url = "mysql+mysqlconnector://optuna_user:123456@localhost:3306/optuna_db"
# Every worker attaches to the same named study through the shared storage
study = optuna.create_study(study_name="Count_LogisticRegression_study", direction='maximize', load_if_exists=True, storage=storage_url)
# n_jobs=-1 additionally runs trials in parallel threads within this process;
# true multi-process parallelism comes from launching this same script multiple times
study.optimize(objective, n_trials=100, n_jobs=-1)
# Best trial
trial = study.best_trial
print('Accuracy: {}'.format(trial.value))
print("Best hyperparameters: {}".format(trial.params))
Running Optuna in Distributed Mode
For truly distributed computing across multiple machines, you should ensure that every worker can reach the shared storage (a network-accessible MySQL or PostgreSQL server rather than a local SQLite file) and that all machines run the same objective code in the same environment.
Additionally, Optuna offers several visualization features to help you understand and analyze the optimization results. These visualizations can provide insights into the hyperparameter space, the performance of trials, and the convergence of the study. To use them, you need the plotly or matplotlib library installed. You do not need database storage to visualize, but without one this has to be done right after the tuning job finishes, since the in-memory storage won't retain any data once the Python session terminates.
from optuna.visualization import plot_optimization_history, plot_param_importances
# Plot optimization history
optimization_history = plot_optimization_history(study)
optimization_history.show()
# Plot hyperparameter importance
param_importances = plot_param_importances(study)
param_importances.show()
For a long-term study, or when comparing studies across models, this in-memory approach is not wise. It is recommended to use a database back-end, which is not only faster but also more reliable. Here is the code to visualize a study loaded from the database:
import optuna
from optuna.visualization import plot_optimization_history, plot_param_importances
# Assuming the study has already been created and optimized elsewhere,
# load it from the shared database using the same study name and storage URL as before
loaded_study = optuna.load_study(study_name='Count_LogisticRegression_study', storage=storage_url)
# Visualization: Optimization History
optimization_history = plot_optimization_history(loaded_study)
optimization_history.show()
# Visualization: Parameter Importance
param_importances = plot_param_importances(loaded_study)
param_importances.show()
This approach allows users to leverage the power of Optuna for hyperparameter optimization together with the robustness of a SQL database for storage, alongside the convenience of visual analysis directly from stored studies.
Thank you for spending time reading my article. I hope you find something useful here for your machine learning or data science studies.
Let's connect: My LinkedIn | My GitHub
Cliffton Nguyen