Evaluation-driven development of an LLM intent classifier using Inspect
In the previous article, we discussed how to use Inspect, by the UK AI Security Institute, to develop a binary classifier with the evaluation-driven approach. We'll now work on an intent classifier - a commonly used method to route user requests to the appropriate workflow.
Recap
LLM application development requires a systematic approach to performing and keeping track of experiments. It usually starts with curating an evaluation dataset and deciding on the evaluation metrics, then running the current draft of the task against the dataset. The development process is iterative and driven by the evaluation results.
Installation
Follow the instructions at https://inspect.aisi.org.uk/#getting-started. To set it up with uv:
# Clone the repo
git clone https://github.com/corvuslee/inspect-ai.git
# Install the dependencies declared in pyproject.toml
cd inspect-ai
uv sync
This also installs the extra dependencies needed to invoke OpenAI-API-compatible models, such as gpt-oss on Amazon Bedrock.
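Depending on your setup, the provider may also need credentials and an endpoint. A minimal sketch, assuming the openai-api provider reads `BEDROCK_API_KEY` and `BEDROCK_BASE_URL` from the environment (the variable names, key, and region below are assumptions - check the Inspect provider documentation for your configuration):

import os

# Assumption: the openai-api provider resolves <PROVIDER>_API_KEY and
# <PROVIDER>_BASE_URL from the environment; the values here are illustrative.
os.environ["BEDROCK_API_KEY"] = "<your-bedrock-api-key>"
os.environ["BEDROCK_BASE_URL"] = "https://bedrock-runtime.us-west-2.amazonaws.com/openai/v1"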
Evaluation dataset
I have only one record per intent category in this example. You will want a larger dataset (10+ samples per intent) for development.
[
  {
    "input": [
      {
        "role": "user",
        "content": "What are your business hours?"
      }
    ],
    "target": "informational"
  },
  ...
]
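Before wiring the dataset into a task, a quick sanity check that Inspect parses the file as expected can save debugging later (a minimal sketch; the path matches the one used below):

from inspect_ai.dataset import json_dataset

# each record becomes a Sample with a chat-style input and a target label
dataset = json_dataset("data/intent_classifier/dataset.json")
print(len(dataset))
print(dataset[0].target)  # e.g. "informational"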
Evaluation metrics
The scorer is customisable. I'm using the built-in match scorer, which provides the default metrics: accuracy and standard error of the mean. The task code now looks like this:
from inspect_ai import Task, task
from inspect_ai.dataset import json_dataset
from inspect_ai.scorer import match


@task
def intent_classifier():
    dataset = json_dataset("data/intent_classifier/dataset.json")
    return Task(
        dataset=dataset,
        scorer=match(),
    )
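By default, `match` checks whether the target appears at the beginning of the model's completion. If any extra text should count as a failure, the scorer can be tightened (a sketch based on the scorer's built-in location option; see the Inspect scorer docs):

from inspect_ai.scorer import match

# require the completion to equal the target exactly,
# instead of merely starting with it
scorer = match(location="exact")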
Task
With the dataset and scorer in place, the remaining components are:
1. System prompt
For an intent classifier, I usually write the role, goal, and guidelines in the system prompt, as shown below. Copying and pasting from an existing guideline document is usually a good start, as modern LLMs (as of October 2025) are quite good at parsing long context.
The assistant is an intent classifier. It analyzes user messages and classifies them into appropriate categories.
It returns only the category name.
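The two lines above are the minimal version. A fuller prompt following the role/goal/guidelines structure might look like this (a sketch - the guideline text is hypothetical and should come from your own documentation):

Role: The assistant is an intent classifier for a customer support product.
Goal: Classify each user message into exactly one intent category by calling the classify_intent tool.
Guidelines: Use the category descriptions in the tool definition. If a message fits several categories, pick the dominant intent. If nothing fits, return "general".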
2. Tool definition
Automated applications require a consistent output format - in this case, JSON with a `category` field whose value is the category name. There is no `reason` field because I want the LLM's reasoning feature to write that as free text instead. The example uses custom tools instead of structured output because the former has wider LLM support.
Write a simple function (which does nothing beyond echoing its input) with a meaningful name, docstring, input, and output. Inspect will turn the function into the right tool format for the target LLMs.
from inspect_ai.tool import tool


@tool
def classify_intent():
    async def execute(category: str):
        """
        Classify the user message into an intent category.

        Args:
            category: the intent category:
                - informational: ... description ...
                - transactional: ... description ...
                - account_management: ... description ...
                - technical: ... description ...
                - customer_service: ... description ...
                - general: ... description ...
        """
        return str(category)

    return execute
3. Output formatter
As the built-in scorer only looks at the last assistant message, I have to use this hacky solver to extract the value from the tool call.
from inspect_ai.model import ModelOutput
from inspect_ai.solver import Generate, TaskState, solver


@solver
def format_output():
    async def solve(state: TaskState, generate: Generate):
        # copy the tool result into the model output so the scorer can see it
        model_output = ModelOutput.from_content(
            model=state.model.name, content=state.messages[-1].content
        )
        state.output = model_output
        return state

    return solve
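As an alternative to rewriting the output, a custom scorer can read the category straight from the tool call. This is not how the article's setup works - just a sketch using Inspect's scorer API, in case you prefer to keep the transcript untouched:

from inspect_ai.scorer import CORRECT, INCORRECT, Score, Target, accuracy, scorer, stderr
from inspect_ai.solver import TaskState


@scorer(metrics=[accuracy(), stderr()])
def tool_call_match():
    async def score(state: TaskState, target: Target):
        # find the category argument of the last assistant tool call
        category = None
        for message in reversed(state.messages):
            if message.role == "assistant" and message.tool_calls:
                category = str(message.tool_calls[0].arguments.get("category", ""))
                break
        return Score(
            value=CORRECT if category == target.text else INCORRECT,
            answer=category,
        )

    return score

The rest of this walkthrough keeps the formatter plus the built-in match scorer.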
The whole task now looks like this:
from inspect_ai import Task, task
from inspect_ai.dataset import json_dataset
from inspect_ai.scorer import match
from inspect_ai.solver import generate, system_message, use_tools


@task
def intent_classifier():
    dataset = json_dataset("data/intent_classifier/dataset.json")
    system = "data/intent_classifier/system.txt"
    return Task(
        dataset=dataset,
        solver=[
            system_message(system),          # 1. system prompt (from file)
            use_tools(classify_intent()),    # 2. tool definition
            generate(tool_calls="single"),   # generate with a single tool call
            format_output(),                 # 3. output formatter
        ],
        scorer=match(),
    )
Evaluation
Provide the task, model, and model parameters, then run. If you're using other LLMs, check the documentation to see how to enable reasoning.
from inspect_ai import eval

eval(
    intent_classifier(),
    model="openai-api/bedrock/openai.gpt-oss-20b-1:0",
    reasoning_effort="low",
)
Launch the log viewer.
inspect view
======== Running on http://127.0.0.1:7575 ========
Open the link in your browser (note that it is plain http, not https) and check the overall score and the details of each sample.
Next step
Provide your own dataset and system prompt, then evaluate on different models.
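For example, `eval` accepts a list of models, so candidates can be compared in a single run (the second model ID is a placeholder - substitute one available in your account):

from inspect_ai import eval

eval(
    intent_classifier(),
    model=[
        "openai-api/bedrock/openai.gpt-oss-20b-1:0",
        "openai-api/bedrock/openai.gpt-oss-120b-1:0",  # placeholder model ID
    ],
    reasoning_effort="low",
)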