Evaluation-driven development of an LLM binary classifier using Inspect

This tutorial shows how to use Inspect, the evaluation framework from the UK AI Security Institute, to systematically build an LLM-based binary classifier. Full code is at https://github.com/corvuslee/inspect-ai/blob/main/classifier.py

Introduction

LLM application development requires a systematic approach to running and keeping track of experiments. It usually starts with curating an evaluation dataset and deciding on the evaluation metrics, then running the current draft of the task against that dataset. The development process is iterative, with each revision driven by the evaluation results.

[Image: Workflow of evaluation-driven prompt development]

To illustrate how to do this using Inspect, I'll use an example binary classification scenario: deciding whether to recommend a cookery class to an applicant, based on an existing guideline.

Installation

Follow the instructions at https://inspect.aisi.org.uk/#getting-started. To do it using uv:

# Install libraries in pyproject.toml
uv sync        

This also installs the extra dependencies needed to invoke OpenAI API-compatible models, such as gpt-oss on Amazon Bedrock.

The VS Code extension is not mandatory, but its log viewer is quite handy.

Evaluation Dataset

The dataset class supports CSV, JSON, and JSONL formats. I'm using ChatMessage in a JSON file, as that makes it simple to express multi-turn conversations as input. The example below has two records, but for development you will typically want 10-100.

[
    {
        "input": [
            {
                "role": "user",
                "content": "I want to open the best steakhouse in the city"
            }
        ],
        "target": "True"
    },
    {
        "input": [
            {
                "role": "user",
                "content": "I want to look cool"
            }
        ],
        "target": "False"
    }
]        
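
To sanity-check what Inspect parses from this file, you can load it and print the samples (a quick sketch; the print format is just for illustration):

from inspect_ai.dataset import json_dataset

# Load the dataset and inspect the parsed samples
dataset = json_dataset("data/binary_classifier/dataset.json")
for sample in dataset:
    print(sample.input, "->", sample.target)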

Evaluation metrics

The scorer class is customisable (a sketch follows the snippet below). I'm using the built-in match scorer, which provides the default metrics: accuracy and standard error of the mean. The code of the task now looks like this:

from inspect_ai import Task, task
from inspect_ai.dataset import json_dataset
from inspect_ai.scorer import match

@task
def binary_classifier():
    dataset = json_dataset("data/binary_classifier/dataset.json")
    return Task(
        dataset=dataset,
        scorer=match(),
    )
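
As an aside on customisation: if accuracy and standard error ever prove insufficient, scorers can be written with the @scorer decorator. A minimal sketch of an exact-match scorer, just to show the shape of the API (my_exact_match is a made-up name, not part of the tutorial code):

from inspect_ai.scorer import (
    CORRECT,
    INCORRECT,
    Score,
    Target,
    accuracy,
    scorer,
    stderr,
)
from inspect_ai.solver import TaskState

@scorer(metrics=[accuracy(), stderr()])
def my_exact_match():
    async def score(state: TaskState, target: Target):
        # Compare the model output against the sample target
        answer = state.output.completion.strip()
        return Score(
            value=CORRECT if answer == target.text else INCORRECT,
            answer=answer,
        )

    return score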

Task

With the dataset and scorer in place, the remaining components are:

  1. System prompt
  2. Tool definition
  3. Output formatter for evaluation

1. System prompt

For a binary classifier, I usually write the role, goal, and guideline in the system prompt, as shown below. Copying and pasting from an existing guideline is usually a good start, as modern LLMs (as of July 2025) are quite good at parsing long documents.

You're an evidence-based AI decision maker. Your goal is to assess if the applicant can benefit from the cookery class based on this guideline, with solid reasons.

---
Cookery class recommendation guideline
1. Strong motivation
2. Potential of becoming a professional chef
3. Don't ask for any additional information from the applicant        
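
The prompt is saved as data/binary_classifier/system.txt and wired in with Inspect's system_message() solver, which accepts a file path as well as a literal string (you'll see this in the full task later):

from inspect_ai.solver import system_message

# Load the system prompt from the text file shown above
system_prompt = system_message("data/binary_classifier/system.txt")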

2. Tool definition

An automated application requires a consistent output format – in this case, JSON with a `recommend` field whose value is either True or False. There is no `reason` field because I want to use the LLM's reasoning feature to produce that as free text instead. The example here uses a custom tool rather than structured output because tools have wider LLM support.

Write a simple function (which does nothing) with a meaningful name, docstring, input, and output. Inspect will turn the function into the right tool format for the target LLMs.

from inspect_ai.tool import tool

@tool
def json_output():
    async def execute(recommend: bool):
        """
        Format the output to the predefined JSON schema.

        Args:
            recommend: whether or not you would recommend a cookery class
        """
        # The body is deliberately trivial: returning the value as a
        # string is enough for the tool result to carry the classification
        return str(recommend)

    return execute
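
A side note: some models occasionally answer in prose instead of calling the tool. If that happens, Inspect lets you force the tool call via tool_choice (a sketch, assuming the ToolFunction import path below):

from inspect_ai.solver import use_tools
from inspect_ai.tool import ToolFunction

# Force the model to answer via json_output rather than free text
use_tools(json_output(), tool_choice=ToolFunction(name="json_output"))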

3. Output formatter

As the built-in scorer will only look at the last assistant message, I have to use this somewhat hacky solver to extract the value from the tool call:

from inspect_ai.model import ModelOutput
from inspect_ai.solver import Generate, TaskState, solver

@solver
def format_output():
    async def solve(state: TaskState, generate: Generate):
        # The last message holds the tool result ("True" or "False");
        # wrap it in a ModelOutput so the match scorer can compare it
        # against the sample target
        model_output = ModelOutput.from_content(
            model=state.model.name, content=state.messages[-1].content
        )
        state.output = model_output
        return state

    return solve
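
If you'd rather not assume the tool result is always the very last message, a slightly more defensive variant (my own sketch, with a made-up name) walks back to the most recent tool message:

from inspect_ai.model import ChatMessageTool, ModelOutput
from inspect_ai.solver import Generate, TaskState, solver

@solver
def format_output_defensive():
    async def solve(state: TaskState, generate: Generate):
        # Find the most recent tool result instead of assuming its position
        for message in reversed(state.messages):
            if isinstance(message, ChatMessageTool):
                state.output = ModelOutput.from_content(
                    model=state.model.name, content=message.text
                )
                break
        return state

    return solve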

And now the whole task would look like this:

from inspect_ai import Task, task
from inspect_ai.dataset import json_dataset
from inspect_ai.scorer import match
from inspect_ai.solver import generate, system_message, use_tools

@task
def binary_classifier():
    dataset = json_dataset("data/binary_classifier/dataset.json")
    system = "data/binary_classifier/system.txt"
    return Task(
        dataset=dataset,
        solver=[
            system_message(system),         # 1. system prompt (from file)
            use_tools(json_output()),       # 2. make the custom tool available
            generate(tool_calls="single"),
            format_output(),                # 3. copy tool result into the output
        ],
        scorer=match(),
    )

Evaluation

Provide the task, model, and model parameters, then run the evaluation. If you're using another LLM, check the docs to see how to enable reasoning.

from inspect_ai import eval

eval(
    binary_classifier(),
    model="openai-api/bedrock/openai.gpt-oss-20b-1:0",
    reasoning_effort="low",
)
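
eval() also returns the logs, so the same numbers can be read programmatically, e.g. to gate a CI pipeline. A sketch, assuming the EvalLog structure described in the comments:

# eval() returns a list of EvalLog objects (one per task/model pair);
# headline metrics live under log.results.scores
logs = eval(
    binary_classifier(),
    model="openai-api/bedrock/openai.gpt-oss-20b-1:0",
    reasoning_effort="low",
)
for log in logs:
    if log.status == "success":
        for score in log.results.scores:
            print(score.name, {k: m.value for k, m in score.metrics.items()})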

Go to the log viewer (via the VS Code extension, or by running inspect view in a terminal) to check the overall score and the details of each sample.

[Image: Log of the binary classifier evaluation with two samples]

Next steps

Provide your own dataset and system prompt, then evaluate across different models, as sketched below.
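
For the model comparison, eval() accepts a list of models, so it's a one-line change (the second model ID below is only a placeholder):

# Evaluate the same task across several models in one run
eval(
    binary_classifier(),
    model=[
        "openai-api/bedrock/openai.gpt-oss-20b-1:0",
        "anthropic/claude-3-5-haiku-latest",  # placeholder: any Inspect-supported model
    ],
)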
