Optuna: The Best Hyperparameter Tuning Tool for Machine Learning
I was working as a Research Analyst with Zamilur Rahman, Ph.D. on a data science project to build automated machine learning for text classification. I cannot disclose the topic, but I can share insights from the work. While training machine learning models such as Logistic Regression, KNN, SVC, and so on, we faced the problem of tuning the parameters for each step in the pipeline as well as the model's own parameters.
At the beginning, we went with Grid Search or Random Search. However, both have problems. Grid Search follows a brute-force principle: it defines a grid of hyperparameter values and exhaustively searches all possible combinations. This becomes computationally expensive with a large number of hyperparameters or a wide range of values. Besides, Grid Search is less efficient for complex optimization problems, since it does not handle non-linear relationships between hyperparameters.
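To make the contrast concrete, here is a minimal Grid Search sketch over the kind of text pipeline used later in this article (the parameter values and the X_train/y_train training data are illustrative assumptions, matching the sample data shown below):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
pipe = Pipeline([('vect', CountVectorizer()), ('clf', LogisticRegression())])
param_grid = {
    'vect__max_features': [100, 500, 1000],
    'vect__ngram_range': [(1, 1), (1, 2)],
    'clf__C': [0.1, 1.0, 10.0],
}
# All 3 * 2 * 3 = 18 combinations are exhaustively cross-validated
grid = GridSearchCV(pipe, param_grid, cv=3)
grid.fit(X_train, y_train)
print(grid.best_params_)

Adding one more hyperparameter with k candidate values multiplies the trial count by k, which is exactly why the grid explodes.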
Random Search, on the other hand, is less complex but less accurate than Grid Search. It explores a predefined search space by randomly selecting combinations and evaluating model performance. Hence, there is no guarantee of finding the best set of hyperparameters; in exchange, this method costs less computational time.
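For comparison, Random Search over the same illustrative pipeline only samples a fixed number of combinations (this sketch assumes the pipe defined above and uses scipy's log-uniform distribution):

from scipy.stats import loguniform
from sklearn.model_selection import RandomizedSearchCV
param_distributions = {
    'vect__max_features': range(100, 1001),
    'vect__ngram_range': [(1, 1), (1, 2)],
    'clf__C': loguniform(0.01, 100),
}
# Only n_iter=10 randomly sampled combinations are evaluated, however large the space
search = RandomizedSearchCV(pipe, param_distributions, n_iter=10, cv=3, random_state=42)
search.fit(X_train, y_train)
print(search.best_params_)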
These problems led me to find another solution: Optuna. Its principle is based on Bayesian optimization techniques that rationally explore the search space, focusing on promising regions and reducing computational overhead. Optuna also handles non-linear relationships between hyperparameters and provides seamless integration with machine learning frameworks, along with customization options.
By prioritizing promising regions of the hyperparameter space to reduce the number of model evaluations required, Optuna simply outmatches Grid Search and Random Search. It is a preferred choice for complex, large-scale optimization problems because of its high efficiency and fast convergence.
Here is the sample code for Optuna:
import optuna
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import Pipeline
# Sample text data and labels (replace with your data)
X = ["text data sample 1", "text data sample 2", "text data sample 3", "text data sample 4"]
y = [0, 1, 0, 1] # Binary labels
# Splitting data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
def objective(trial):
    # Define the hyperparameters to tune
    max_df = trial.suggest_float('max_df', 0.5, 1.0)
    min_df = trial.suggest_float('min_df', 0.0, 0.5)
    max_features = trial.suggest_int('max_features', 100, 1000)
    # Categorical choices must be primitives (important for database storage later),
    # so the ngram_range tuple is encoded as a string and decoded here
    ngram_key = trial.suggest_categorical('ngram_range', ['1_1', '1_2', '2_2'])
    ngram_range = tuple(int(n) for n in ngram_key.split('_'))
    # Create a CountVectorizer instance with the parameters
    vectorizer = CountVectorizer(max_df=max_df, min_df=min_df, max_features=max_features, ngram_range=ngram_range)
    # Create a pipeline
    clf = Pipeline([
        ('vect', vectorizer),
        ('clf', LogisticRegression())
    ])
    # Evaluate the model with 3-fold cross-validation
    score = cross_val_score(clf, X_train, y_train, cv=3).mean()
    return score
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)
print(study.best_params)
Execute this script to start the hyperparameter optimization. Optuna will try different combinations of the CountVectorizer parameters to find the best configuration according to the cross-validated score.
Remember to adjust the parameters, classifier, and dataset to your specific needs. The ranges and choices in the suggest_* functions can be modified based on the nature of your dataset and the problem you are solving.
Once you have found the best hyperparameters for your pipeline using Optuna, you can use the tuned pipeline to make predictions on your test set. Here's how to do it:
from sklearn.metrics import classification_report
# Recreate the pipeline with the best parameters
best_params = study.best_params
# Decode the ngram_range string key back into the tuple CountVectorizer expects
best_ngram_range = tuple(int(n) for n in best_params['ngram_range'].split('_'))
pipeline_optimized = Pipeline([
    ('vect', CountVectorizer(max_df=best_params['max_df'], min_df=best_params['min_df'], max_features=best_params['max_features'], ngram_range=best_ngram_range)),
    ('clf', LogisticRegression())
])
# Fit the pipeline on the entire training data
pipeline_optimized.fit(X_train, y_train)
# Make predictions on the test set
predictions = pipeline_optimized.predict(X_test)
# Evaluate the predictions
print(classification_report(y_test, predictions))
In this code, the pipeline is rebuilt with the best parameters found by the study, refit on the entire training set, and evaluated on the held-out test set.
Typically, when this code is executed, Optuna runs on a single CPU core, which is not ideal for complex data. However, Optuna can certainly run on multiple cores, and it executes faster on many-core CPUs such as a Threadripper or an Intel Xeon. Ordinarily you do not need such a high-end processor; a current CPU like an Intel Core i5 or a Ryzen can be utilized as well. The only downside is that parallel execution requires Optuna to run against a storage back-end.
Native parallelization in Optuna, which executes multiple trials simultaneously, is managed through a shared database: each Optuna worker picks up a trial from the shared study and works on it. Here's a basic setup:
Step 1: Set Up a Shared Study Storage
First, you need a shared storage system that all workers can access. This could be a file-based SQLite database or a more robust system like MySQL or PostgreSQL for larger-scale parallelization.
# Shared storage URL that every worker connects to (replace the credentials with your own)
storage_url = "mysql+mysqlconnector://optuna_user:123456@localhost:3306/optuna_db"
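For a quick single-machine test, a file-based SQLite URL also works (the file name here is just an assumption):

# Alternative: a local SQLite file, no database server required
storage_url = "sqlite:///optuna_study.db"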
Step 2: Run the Optimization
You run the same optimization script on multiple cores/machines. Each script connects to the shared study and performs optimization. Optuna handles synchronization to ensure that the same trial is not evaluated more than once.
import optuna
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import Pipeline
# Sample text data and labels (replace with your data)
X = ["text data sample 1", "text data sample 2", "text data sample 3", "text data sample 4"]
y = [0, 1, 0, 1] # Binary labels
# Splitting data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
def objective(trial):
    # Define the hyperparameters to tune
    max_df = trial.suggest_float('max_df', 0.5, 1.0)
    min_df = trial.suggest_float('min_df', 0.0, 0.5)
    max_features = trial.suggest_int('max_features', 100, 1000)
    # Categorical choices must be primitives (required for database storage),
    # so the ngram_range tuple is encoded as a string and decoded here
    ngram_key = trial.suggest_categorical('ngram_range', ['1_1', '1_2', '2_2'])
    ngram_range = tuple(int(n) for n in ngram_key.split('_'))
    # Create a CountVectorizer instance with the parameters
    vectorizer = CountVectorizer(max_df=max_df, min_df=min_df, max_features=max_features, ngram_range=ngram_range)
    # Create a pipeline
    clf = Pipeline([
        ('vect', vectorizer),
        ('clf', LogisticRegression())
    ])
    # Evaluate the model with 3-fold cross-validation
    score = cross_val_score(clf, X_train, y_train, cv=3).mean()
    return score
# storage_url as defined in Step 1
storage_url = "mysql+mysqlconnector://optuna_user:123456@localhost:3306/optuna_db"
# Every worker attaches to the same named study through the shared storage
study = optuna.create_study(study_name="Count_LogisticRegression_study", direction='maximize', load_if_exists=True, storage=storage_url)
# n_jobs=-1 additionally runs trials in parallel threads within this process;
# true multi-process parallelism comes from launching this same script multiple times
study.optimize(objective, n_trials=100, n_jobs=-1)
# Best trial
trial = study.best_trial
print('Accuracy: {}'.format(trial.value))
print("Best hyperparameters: {}".format(trial.params))
Running Optuna in Distributed Mode
For truly distributed computing across multiple machines, you should ensure that every worker can reach the shared storage (a network-accessible MySQL or PostgreSQL server rather than a local SQLite file) and that all machines run the same objective code in the same environment.
Additionally, Optuna offers several visualization features to help you understand and analyze the optimization results. These visualizations can provide insights into the hyperparameter space, the performance of trials, and the convergence of the study. To use them, you need the plotly or matplotlib library installed. You do not need database storage to visualize, but without one this has to be done right after the tuning job finishes, since the in-memory storage won't retain any data once the Python session terminates.
from optuna.visualization import plot_optimization_history, plot_param_importances
# Plot optimization history
optimization_history = plot_optimization_history(study)
optimization_history.show()
# Plot hyperparameter importance
param_importances = plot_param_importances(study)
param_importances.show()
For a long-term study, or when comparing studies across models, this in-memory approach is not wise. It is recommended to use a database back-end, which is not only faster but also more reliable. Here is the code to visualize a study loaded from the database:
import optuna
from optuna.visualization import plot_optimization_history, plot_param_importances
# Assuming the study has already been created and optimized elsewhere,
# load it from the shared database using the same study name and storage URL as before
loaded_study = optuna.load_study(study_name='Count_LogisticRegression_study', storage=storage_url)
# Visualization: Optimization History
optimization_history = plot_optimization_history(loaded_study)
optimization_history.show()
# Visualization: Parameter Importance
param_importances = plot_param_importances(loaded_study)
param_importances.show()
This approach allows users to leverage the power of Optuna for hyperparameter optimization together with the robustness of a SQL database for storage, alongside the convenience of visual analysis directly from stored studies.
Thank you for spending time reading my article. I hope you find something useful here for your machine learning or data science studies.
Let's connect: My LinkedIn | My GitHub
Cliffton Nguyen