Evaluation-driven development of an LLM intent classifier using Inspect
In the previous article, we discussed how to use Inspect, by the UK AI Security Institute, to develop a binary classifier with the evaluation-driven approach. We'll now work on an intent classifier - a commonly used method to route user requests to the appropriate workflow.
Recap
LLM application development requires a systematic approach to performing and keeping track of experiments. It usually starts with curating an evaluation dataset and deciding on the evaluation metrics, then running the current draft of the task against the dataset. The development process is iterative and driven by the evaluation results.
Installation
Follow the instructions at https://inspect.aisi.org.uk/#getting-started. To set it up with uv:
# Clone the repo
git clone https://github.com/corvuslee/inspect-ai.git
# Install the dependencies declared in pyproject.toml
cd inspect-ai
uv sync
This also installs the extra dependencies needed to invoke OpenAI-API-compatible models, such as gpt-oss on Amazon Bedrock.
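Depending on your setup, the provider may also need credentials and an endpoint. A minimal sketch, assuming the openai-api provider reads `BEDROCK_API_KEY` and `BEDROCK_BASE_URL` from the environment (the variable names, key, and region below are assumptions - check the Inspect provider documentation for your configuration):

import os

# Assumption: the openai-api provider resolves <PROVIDER>_API_KEY and
# <PROVIDER>_BASE_URL from the environment; the values here are illustrative.
os.environ["BEDROCK_API_KEY"] = "<your-bedrock-api-key>"
os.environ["BEDROCK_BASE_URL"] = "https://bedrock-runtime.us-west-2.amazonaws.com/openai/v1"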
Evaluation dataset
I have only one record per intent category in this example. You will want a larger dataset (10+ samples per intent) for development.
[
  {
    "input": [
      {
        "role": "user",
        "content": "What are your business hours?"
      }
    ],
    "target": "informational"
  },
  ...
]
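Before wiring the dataset into a task, a quick sanity check that Inspect parses the file as expected can save debugging later (a minimal sketch; the path matches the one used below):

from inspect_ai.dataset import json_dataset

# each record becomes a Sample with a chat-style input and a target label
dataset = json_dataset("data/intent_classifier/dataset.json")
print(len(dataset))
print(dataset[0].target)  # e.g. "informational"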
Evaluation metrics
The scorer is customisable. I'm using the built-in match scorer, which provides the default metrics: accuracy and standard error of the mean. The task code now looks like this:
from inspect_ai import Task, task
from inspect_ai.dataset import json_dataset
from inspect_ai.scorer import match


@task
def intent_classifier():
    dataset = json_dataset("data/intent_classifier/dataset.json")
    return Task(
        dataset=dataset,
        scorer=match(),
    )
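By default, `match` checks whether the target appears at the beginning of the model's completion. If any extra text should count as a failure, the scorer can be tightened (a sketch based on the scorer's built-in location option; see the Inspect scorer docs):

from inspect_ai.scorer import match

# require the completion to equal the target exactly,
# instead of merely starting with it
scorer = match(location="exact")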
Task
With the dataset and scorer in place, the remaining components are:
1. System prompt
For an intent classifier, I usually write the role, goal, and guidelines in the system prompt, as shown below. Copying and pasting from an existing guideline document is usually a good start, as modern LLMs (as of October 2025) are quite good at parsing long context.
The assistant is an intent classifier. It analyzes user messages and classifies them into appropriate categories.
It returns only the category name.
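The two lines above are the minimal version. A fuller prompt following the role/goal/guidelines structure might look like this (a sketch - the guideline text is hypothetical and should come from your own documentation):

Role: The assistant is an intent classifier for a customer support product.
Goal: Classify each user message into exactly one intent category by calling the classify_intent tool.
Guidelines: Use the category descriptions in the tool definition. If a message fits several categories, pick the dominant intent. If nothing fits, return "general".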
2. Tool definition
Automated applications require a consistent output format - in this case, JSON with a `category` field whose value is the category name. There is no `reason` field because I want the LLM's reasoning feature to write that as free text instead. The example uses custom tools instead of structured output because the former has wider LLM support.
Write a simple function (which does nothing beyond echoing its input) with a meaningful name, docstring, input, and output. Inspect will turn the function into the right tool format for the target LLMs.
from inspect_ai.tool import tool


@tool
def classify_intent():
    async def execute(category: str):
        """
        Classify the user message into an intent category.

        Args:
            category: the intent category:
                - informational: ... description ...
                - transactional: ... description ...
                - account_management: ... description ...
                - technical: ... description ...
                - customer_service: ... description ...
                - general: ... description ...
        """
        return str(category)

    return execute
3. Output formatter
As the built-in scorer only looks at the last assistant message, I have to use this hacky solver to extract the value from the tool call.
from inspect_ai.model import ModelOutput
from inspect_ai.solver import Generate, TaskState, solver


@solver
def format_output():
    async def solve(state: TaskState, generate: Generate):
        # copy the tool result into the model output so the scorer can see it
        model_output = ModelOutput.from_content(
            model=state.model.name, content=state.messages[-1].content
        )
        state.output = model_output
        return state

    return solve
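As an alternative to rewriting the output, a custom scorer can read the category straight from the tool call. This is not how the article's setup works - just a sketch using Inspect's scorer API, in case you prefer to keep the transcript untouched:

from inspect_ai.scorer import CORRECT, INCORRECT, Score, Target, accuracy, scorer, stderr
from inspect_ai.solver import TaskState


@scorer(metrics=[accuracy(), stderr()])
def tool_call_match():
    async def score(state: TaskState, target: Target):
        # find the category argument of the last assistant tool call
        category = None
        for message in reversed(state.messages):
            if message.role == "assistant" and message.tool_calls:
                category = str(message.tool_calls[0].arguments.get("category", ""))
                break
        return Score(
            value=CORRECT if category == target.text else INCORRECT,
            answer=category,
        )

    return score

The rest of this walkthrough keeps the formatter plus the built-in match scorer.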
The whole task now looks like this:
from inspect_ai import Task, task
from inspect_ai.dataset import json_dataset
from inspect_ai.scorer import match
from inspect_ai.solver import generate, system_message, use_tools


@task
def intent_classifier():
    dataset = json_dataset("data/intent_classifier/dataset.json")
    system = "data/intent_classifier/system.txt"
    return Task(
        dataset=dataset,
        solver=[
            system_message(system),          # 1. system prompt (from file)
            use_tools(classify_intent()),    # 2. tool definition
            generate(tool_calls="single"),   # generate with a single tool call
            format_output(),                 # 3. output formatter
        ],
        scorer=match(),
    )
Evaluation
Provide the task, model, and model parameters, then run. If you're using other LLMs, check the documentation to see how to enable reasoning.
from inspect_ai import eval

eval(
    intent_classifier(),
    model="openai-api/bedrock/openai.gpt-oss-20b-1:0",
    reasoning_effort="low",
)
Launch the log viewer.
inspect view
======== Running on http://127.0.0.1:7575 ========
Open the link in your browser (note that it is plain http, not https) and check the overall score and the details of each sample.
Next step
Provide your own dataset and system prompt, then evaluate on different models.
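For example, `eval` accepts a list of models, so candidates can be compared in a single run (the second model ID is a placeholder - substitute one available in your account):

from inspect_ai import eval

eval(
    intent_classifier(),
    model=[
        "openai-api/bedrock/openai.gpt-oss-20b-1:0",
        "openai-api/bedrock/openai.gpt-oss-120b-1:0",  # placeholder model ID
    ],
    reasoning_effort="low",
)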