Evaluation-driven development of an LLM binary classifier using Inspect

This tutorial shows how to use Inspect, the evaluation framework from the UK AI Security Institute, to systematically build an LLM-based binary classifier. Full code is at https://github.com/corvuslee/inspect-ai/blob/main/classifier.py

Introduction

LLM application development requires a systematic approach to running and keeping track of experiments. It usually starts with curating an evaluation dataset and deciding on the evaluation metrics, then running the current draft of the task against that dataset. The development process is iterative, with each revision driven by the evaluation results.

[Image: Workflow of evaluation-driven prompt development]

To illustrate how to do this using Inspect, I'll use an example binary classification scenario: deciding whether to recommend a cookery class to an applicant, based on an existing guideline.

Installation

Follow the instructions at https://inspect.aisi.org.uk/#getting-started. To do it using uv:

# Install libraries in pyproject.toml
uv sync        

This also installs the extra dependencies needed to invoke OpenAI API-compatible models, such as gpt-oss on Amazon Bedrock.

The VS Code extension is not mandatory, but its log viewer is quite handy.

Evaluation Dataset

The dataset class supports CSV, JSON, and JSONL formats. I'm using ChatMessage in a JSON file, as that makes it simple to express multi-turn conversations as input. The example below has two records, but for development you will typically want 10-100.

[
    {
        "input": [
            {
                "role": "user",
                "content": "I want to open the best steakhouse in the city"
            }
        ],
        "target": "True"
    },
    {
        "input": [
            {
                "role": "user",
                "content": "I want to look cool"
            }
        ],
        "target": "False"
    }
]        
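
To sanity-check what Inspect parses from this file, you can load it and print the samples (a quick sketch; the print format is just for illustration):

from inspect_ai.dataset import json_dataset

# Load the dataset and inspect the parsed samples
dataset = json_dataset("data/binary_classifier/dataset.json")
for sample in dataset:
    print(sample.input, "->", sample.target)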

Evaluation metrics

The scorer class is customisable (a sketch follows the snippet below). I'm using the built-in match scorer, which provides the default metrics: accuracy and standard error of the mean. The code of the task now looks like this:

from inspect_ai import Task, task
from inspect_ai.dataset import json_dataset
from inspect_ai.scorer import match

@task
def binary_classifier():
    dataset = json_dataset("data/binary_classifier/dataset.json")
    return Task(
        dataset=dataset,
        scorer=match(),
    )
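
As an aside on customisation: if accuracy and standard error ever prove insufficient, scorers can be written with the @scorer decorator. A minimal sketch of an exact-match scorer, just to show the shape of the API (my_exact_match is a made-up name, not part of the tutorial code):

from inspect_ai.scorer import (
    CORRECT,
    INCORRECT,
    Score,
    Target,
    accuracy,
    scorer,
    stderr,
)
from inspect_ai.solver import TaskState

@scorer(metrics=[accuracy(), stderr()])
def my_exact_match():
    async def score(state: TaskState, target: Target):
        # Compare the model output against the sample target
        answer = state.output.completion.strip()
        return Score(
            value=CORRECT if answer == target.text else INCORRECT,
            answer=answer,
        )

    return score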

Task

With the dataset and scorer in place, the remaining components are:

  1. System prompt
  2. Tool definition
  3. Output formatter for evaluation

1. System prompt

For a binary classifier, I usually write the role, goal, and guideline in the system prompt, as shown below. Copying and pasting from an existing guideline is usually a good start, as modern LLMs (as of July 2025) are quite good at parsing long documents.

You're an evidence-based AI decision maker. Your goal is to assess if the applicant can benefit from the cookery class based on this guideline, with solid reasons.

---
Cookery class recommendation guideline
1. Strong motivation
2. Potential of becoming a professional chef
3. Don't ask for any additional information from the applicant        
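
The prompt is saved as data/binary_classifier/system.txt and wired in with Inspect's system_message() solver, which accepts a file path as well as a literal string (you'll see this in the full task later):

from inspect_ai.solver import system_message

# Load the system prompt from the text file shown above
system_prompt = system_message("data/binary_classifier/system.txt")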

2. Tool definition

An automated application requires a consistent output format – in this case, JSON with a `recommend` field whose value is either True or False. There is no `reason` field because I want to use the LLM's reasoning feature to produce that as free text instead. The example here uses a custom tool rather than structured output because tools have wider LLM support.

Write a simple function (which does nothing) with a meaningful name, docstring, input, and output. Inspect will turn the function into the right tool format for the target LLMs.

from inspect_ai.tool import tool

@tool
def json_output():
    async def execute(recommend: bool):
        """
        Format the output to the predefined JSON schema.

        Args:
            recommend: whether or not you would recommend a cookery class
        """
        # The body is deliberately trivial: returning the value as a
        # string is enough for the tool result to carry the classification
        return str(recommend)

    return execute
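
A side note: some models occasionally answer in prose instead of calling the tool. If that happens, Inspect lets you force the tool call via tool_choice (a sketch, assuming the ToolFunction import path below):

from inspect_ai.solver import use_tools
from inspect_ai.tool import ToolFunction

# Force the model to answer via json_output rather than free text
use_tools(json_output(), tool_choice=ToolFunction(name="json_output"))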

3. Output formatter

As the built-in scorer will only look at the last assistant message, I have to use this somewhat hacky solver to extract the value from the tool call:

from inspect_ai.model import ModelOutput
from inspect_ai.solver import Generate, TaskState, solver

@solver
def format_output():
    async def solve(state: TaskState, generate: Generate):
        # The last message holds the tool result ("True" or "False");
        # wrap it in a ModelOutput so the match scorer can compare it
        # against the sample target
        model_output = ModelOutput.from_content(
            model=state.model.name, content=state.messages[-1].content
        )
        state.output = model_output
        return state

    return solve
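
If you'd rather not assume the tool result is always the very last message, a slightly more defensive variant (my own sketch, with a made-up name) walks back to the most recent tool message:

from inspect_ai.model import ChatMessageTool, ModelOutput
from inspect_ai.solver import Generate, TaskState, solver

@solver
def format_output_defensive():
    async def solve(state: TaskState, generate: Generate):
        # Find the most recent tool result instead of assuming its position
        for message in reversed(state.messages):
            if isinstance(message, ChatMessageTool):
                state.output = ModelOutput.from_content(
                    model=state.model.name, content=message.text
                )
                break
        return state

    return solve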

And now the whole task would look like this:

from inspect_ai import Task, task
from inspect_ai.dataset import json_dataset
from inspect_ai.scorer import match
from inspect_ai.solver import generate, system_message, use_tools

@task
def binary_classifier():
    dataset = json_dataset("data/binary_classifier/dataset.json")
    system = "data/binary_classifier/system.txt"
    return Task(
        dataset=dataset,
        solver=[
            system_message(system),         # 1. system prompt (from file)
            use_tools(json_output()),       # 2. make the custom tool available
            generate(tool_calls="single"),
            format_output(),                # 3. copy tool result into the output
        ],
        scorer=match(),
    )

Evaluation

Provide the task, model, and model parameters, then run the evaluation. If you're using another LLM, check the docs to see how to enable reasoning.

from inspect_ai import eval

eval(
    binary_classifier(),
    model="openai-api/bedrock/openai.gpt-oss-20b-1:0",
    reasoning_effort="low",
)
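
eval() also returns the logs, so the same numbers can be read programmatically, e.g. to gate a CI pipeline. A sketch, assuming the EvalLog structure described in the comments:

# eval() returns a list of EvalLog objects (one per task/model pair);
# headline metrics live under log.results.scores
logs = eval(
    binary_classifier(),
    model="openai-api/bedrock/openai.gpt-oss-20b-1:0",
    reasoning_effort="low",
)
for log in logs:
    if log.status == "success":
        for score in log.results.scores:
            print(score.name, {k: m.value for k, m in score.metrics.items()})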

Go to the log viewer (via the VS Code extension, or by running inspect view in a terminal) to check the overall score and the details of each sample.

[Image: Log of the binary classifier evaluation with two samples]

Next steps

Provide your own dataset and system prompt, then evaluate across different models, as sketched below.
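
For the model comparison, eval() accepts a list of models, so it's a one-line change (the second model ID below is only a placeholder):

# Evaluate the same task across several models in one run
eval(
    binary_classifier(),
    model=[
        "openai-api/bedrock/openai.gpt-oss-20b-1:0",
        "anthropic/claude-3-5-haiku-latest",  # placeholder: any Inspect-supported model
    ],
)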
