Getting Structured Information with Validation using a State Monad

Casper Bollen

Published Mar 24, 2024

In this article, we will focus on extracting structured information from free text. Once this information is obtained, it undergoes a validation process. Should the validation not succeed, the system will make another attempt to derive a valid response. A second failure results in the answer being disregarded. To efficiently monitor the series of messages, or the dialogue with the LLM, we employ a state monad.

For those curious about what a state monad is and its operational mechanics, I highly recommend exploring the exceptional resources provided by Scott Wlaschin, which include an informative blog and an engaging video. You can find these resources at Scott Wlaschin's website.

First, set up the libraries and open the namespaces

#r "nuget: FSharpPlus"
#r "nuget: Newtonsoft.Json"
#r "nuget: NJsonSchema"

#r "../../Informedica.Utils.Lib/bin/Debug/net8.0/Informedica.Utils.Lib.dll"

#load "../Types.fs"
#load "../Utils.fs"
#load "../Texts.fs"
#load "../Prompts.fs"
#load "../Message.fs"
#load "../OpenAI.fs"
#load "../Fireworks.fs"
#load "../Ollama.fs"


open Newtonsoft.Json

open FSharpPlus
open FSharpPlus.Data
open Informedica.Utils.Lib.BCL

open Informedica.OpenAI.Lib

Installed Packages

FSharpPlus, 1.6.1
Newtonsoft.Json, 13.0.3
NJsonSchema, 11.0.0

Adjust the model settings

Adjustments to the model settings can be made to optimize the extraction of the most accurate information. It is important to note, however, that when employing JSON format for extraction, these adjustments may not significantly impact the outcome. The notable exception is the configuration of the seed, which enhances the reproducibility of the results.

How much these settings matter depends on the model and the context: settings can influence how a model processes and interprets its input regardless of the output format. A fixed seed, by contrast, helps make repeated runs with the same model and input reproducible.

Ollama.options.temperature <- 0.
Ollama.options.penalize_newline <- true
Ollama.options.top_k <- 10
Ollama.options.top_p <- 0.95
Ollama.options.seed <- 101

An initial system prompt and a generic method to extract structured data

The systemMsg variable primes the Large Language Model (LLM) to function as a medical expert capable of extracting structured information from free text.

The extract function takes a model (the name of the LLM), an initial zero structure, and the request message. It extracts information from free text and returns it as JSON. The function is written as a monad computation expression, which makes it possible to use a State monad. The State monad tracks the state of the system, which here is the list of messages exchanged during the "conversation" with the LLM.

let systemMsg text = [ text |> Texts.systemDoseQuantityExpert2 |> Message.system ]


let inline extract (model: string) zero msg =
    monad {
        // get the current list of messages
        let! msgs = State.get
        // get the structured extraction along with
        // the updated list of messages
        let msgs, res =
            msg
            |> Ollama.validate2
                model
                msgs
            |> Async.RunSynchronously
            |> function
                | Ok (result, msgs) -> msgs, result
                | Error (_, msgs)   -> msgs, zero
        // refresh the state with the updated list of messages
        do! State.put msgs
        // return the structured extraction
        return res
    }
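To make clearer what the monad block does behind the scenes, here is a minimal hand-rolled state monad with no library dependencies. It is only a sketch: the State type and the run, get, put, ret and bind functions below are simplified stand-ins, not the FSharpPlus implementation, and ask is a hypothetical replacement for the LLM call that returns a canned answer.

```fsharp
// A state computation: a function from a state to (result, new state).
// Simplified stand-in, NOT the FSharpPlus implementation.
type State<'s, 'a> = State of ('s -> 'a * 's)

module State =
    // run a computation with an initial state
    let run (State f) s = f s
    // read the current state
    let get<'s> : State<'s, 's> = State (fun (s: 's) -> s, s)
    // replace the current state
    let put s = State (fun _ -> (), s)
    // wrap a plain value without touching the state
    let ret x = State (fun s -> x, s)
    // chain two computations, threading the state from the first to the second
    let bind f m =
        State (fun s ->
            let x, s' = run m s
            run (f x) s')

// Hypothetical stand-in for an LLM call: appends the question and a
// canned answer to the message list and returns the answer.
let ask question answer =
    State.get
    |> State.bind (fun msgs ->
        State.put (msgs @ [ $"Q: {question}"; $"A: {answer}" ])
        |> State.bind (fun () -> State.ret answer))

// Two questions in sequence; the message list is passed along implicitly.
let conversation =
    ask "substance unit?" "mg"
    |> State.bind (fun su ->
        ask "time unit?" "dag"
        |> State.bind (fun tu -> State.ret (su, tu)))

let result, messages =
    State.run conversation [ "System: you are an expert" ]

printfn "%A" result
messages |> List.iter (printfn "%s")
```

Each let! in a monad block corresponds to one bind call here, which is why the second question automatically sees the messages produced by the first.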

A General Validation function to validate the LLM extraction

A validation function can be used to check two things:

Whether the response can be deserialized into a valid extraction structure
Whether the extracted structure itself is valid

let unitValidator<'Unit> text get validUnits s =
    // an empty list of valid units means that any unit is accepted
    let isValidUnit s =
        if validUnits |> List.isEmpty then true
        else
            validUnits
            |> List.exists (String.equalsCapInsens s)
    try
        // check 1: the response must deserialize to the expected structure
        let un = JsonConvert.DeserializeObject<'Unit>(s)
        // check 2: the unit must be a single, valid unit without a '/'
        match un |> get |> String.split "/" with
        | [u] when u |> isValidUnit ->
            // check 3: the unit must actually occur in the source text
            if text |> String.containsCapsInsens u then Ok s
            else
                $"{u} is not mentioned in the text"
                |> Error
        | _ ->
            if validUnits |> List.isEmpty then $"{s} is not a valid unit, the unit should not contain '/'"
            else
                $"""
{s} is not a valid unit, the unit should not contain '/' and the unit should be one of the following:
{validUnits |> String.concat ", "}
"""
            |> Error
    with
    | e ->
        // deserialization failed: report the exception as the validation error
        e.ToString()
        |> Error
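The way a validator drives the "retry once, then give up" behaviour described in the introduction can be sketched without any LLM at all. The code below is an illustration under assumptions, not the implementation behind Ollama.validate2: askWithRetry, fakeModel, stubbornModel and the simplified validate function are hypothetical names introduced here, and the retry message only imitates the wording seen in the conversation output.

```fsharp
// Hypothetical retry helper (NOT Ollama.validate2):
// `ask` stands in for a call to the LLM, `validate` returns Ok or an Error message,
// `zero` is the fallback value used when both attempts fail.
let askWithRetry ask validate zero (question: string) =
    let first = ask question
    match validate first with
    | Ok answer -> answer
    | Error reason ->
        // second attempt: tell the model why the first answer was rejected
        let retry =
            $"The answer: {first} was not correct because of {reason}. Please try again answering: {question}"
        match ask retry |> validate with
        | Ok answer -> answer
        // second failure: disregard the answer and fall back to zero
        | Error _ -> zero

let text = "10 mg/dag in 1 dosis"

// simplified validator: an answer is valid if it is empty or occurs in the text
let validate (answer: string) =
    if answer = "" || text.Contains(answer) then Ok answer
    else Error $"{answer} is not mentioned in the text"

// fake model that corrects itself once it is told to try again
let fakeModel (prompt: string) =
    if prompt.Contains("try again") then "" else "kg"

// fake model that keeps giving the same wrong answer
let stubbornModel (_: string) = "kg"

let corrected = askWithRetry fakeModel validate "<zero>" "What is the adjust unit?"
let fallback = askWithRetry stubbornModel validate "<zero>" "What is the adjust unit?"

printfn "corrected: '%s', fallback: '%s'" corrected fallback
```

The first call ends with the corrected, empty answer, while the second call ends with the zero value because both attempts fail validation.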

Functions that extract different pieces of structured information

We can leverage the previous general extraction and validation functions to develop specific functions capable of isolating and validating small segments of structured information. This modular approach divides a large task into more manageable, simpler tasks for the LLM to process.

Each specialized extraction function is associated with a zero structure. This zero structure serves as a fallback mechanism in instances where extraction fails. Additionally, each extraction function is complemented by a validator. This validator is a dedicated function responsible for assessing the integrity and accuracy of the extracted structured information.


let extractSubstanceUnit model text =
    let zero = {| substanceUnit = "" |}
    let validator = unitValidator text (fun (u: {| substanceUnit: string |}) -> u.substanceUnit)  []

    """
Use the provided schema to extract the unit of measurement (substance unit) from the medication dosage information contained in the text.
Your answer should return a JSON string representing the extracted unit.

Use schema: { substanceUnit: string }

Examples of usage and expected output:
 - For "mg/kg/dag", return: "{ "substanceUnit": "mg" }"
 - For "g/m2/dag", return: "{ "substanceUnit": "g" }"
 - For "IE/m2", return: "{ "substanceUnit": "IE" }"

Respond in JSON
"""
    |> Message.userWithValidator validator
    |> extract model zero


let extractAdjustUnit model text =
    let zero = {| adjustUnit = "" |}
    let validator = 
        ["kg"; "m2"; "mˆ2"]
        |> unitValidator text (fun (u: {| adjustUnit: string |}) -> u.adjustUnit)

    """
Use the provided schema to extract the unit by which a medication dose is adjusted, such as patient weight or body surface area, from the medication dosage information contained in the text.
Your answer should return a JSON string representing the extracted adjustment unit.

Use schema : { adjustUnit: string }

Examples of usage and expected output:
- For "mg/kg/dag", return: "{ "adjustUnit": "kg" }"
- For "mg/kg", return: "{ "adjustUnit": "kg" }"
- For "mg/m2/dag", return: "{ "adjustUnit": "m2" }"
- For "mg/m2", return: "{ "adjustUnit": "m2" }"

Respond in JSON
"""
    |> Message.userWithValidator validator
    |> extract model zero


let extractTimeUnit model text =
    let zero = {| timeUnit = "" |}
    let validator = 
        [
            "dag"
            "week"
            "maand"
        ]
        |> unitValidator text (fun (u: {| timeUnit: string |}) -> u.timeUnit)

    """
Use the provided schema to extract the time unit from the medication dosage information contained in the text.
Your answer should return a JSON string representing the extracted time unit.

Use schema : { timeUnit: string }

Examples of usage and expected output:
- For "mg/kg/dag", return: "{ "timeUnit": "dag" }"
- For "mg/kg", return: "{ "timeUnit": "" }"
- For "mg/m2/week", return: "{ "timeUnit": "week" }"
- For "mg/2 dagen", return: "{ "timeUnit": "2 dagen" }"

Respond in JSON
"""
    |> Message.userWithValidator validator
    |> extract model zero

Combine the extractions to create a larger structured extraction

The individual extraction functions can be combined into one comprehensive extraction structure. Because they are composed in a monad computation expression, the state, which is the ongoing conversation (a list of messages), is passed from one extraction to the next. This gives the LLM access to the "full picture" while it answers each question. The intent is that this fuller context improves the accuracy of the extracted information.

let createDoseUnits model text =
    monad {
        let! substanceUnit = extractSubstanceUnit model text
        let! adjustUnit = extractAdjustUnit model text
        let! timeUnit = extractTimeUnit model text

        return
            {|
                substanceUnit = substanceUnit.substanceUnit
                adjustUnit = adjustUnit.adjustUnit
                timeUnit = timeUnit.timeUnit
            |}
    }

Finally, the State monad can run the whole process of extraction

Running the extraction process returns the extracted structure and the full list of messages.

let un, msgs =
    let text = Texts.testTexts[3]
    State.run
        (createDoseUnits Ollama.Models.llama2 text)
        (systemMsg text)

printfn $"## The final extracted structure:\n{un}\n\n"

printfn "## The full conversation"
msgs
|> List.iter Message.print

## The final extracted structure:
{ adjustUnit = "" substanceUnit = "mg" timeUnit = "dag" }

## The full conversation

## System:
You are an expert on medication prescribing, preparation and administration. You will give exact answers. If there is no possible answer return an empty string. You have to answer questions about a free text between ''' that describes the dosing of a medication. You will be asked to extract structured information from the following text:
'''
amitriptyline 6 jaar tot 18 jaar Startdosering: voor de nacht: 10 mg/dag in 1 dosisOnderhoudsdosering: langzaam ophogen met 10 mg/dag per 4-6 weken naar 10 - 30 mg/dag in 1 dosis. Max: 30 mg/dag. Behandeling met amitriptyline mag niet plotseling worden gestaakt vanwege het optreden van ontwenningsverschijnselen; de dosering moet geleidelijk worden verminderd.Uit de studie van Powers (2017) blijkt dat de werkzaamheid van amitriptyline bij migraine profylaxe niet effectiever is t.o.v. placebo. Desondanks menen experts dat in individuele gevallen behandeling met amitriptyline overwogen kan worden.
'''
ONLY respond if the response is present in the text. If the response cannot be extracted respond with an empty string. Respond in JSON

## Question:
Use the provided schema to extract the unit of measurement (substance unit) from the medication dosage information contained in the text.
Your anser should return a JSON string representing the extracted unit.
Use schema: { substanceUnit: string }
Examples of usage and expected output:
- For "mg/kg/dag", return: "{ "substanceUnit": "mg" }"
- For "g/m2/dag", return: "{ "substanceUnit": "g" }"
- For "IE/m2", return: "{ "substanceUnit": "IE" }"
Respond in JSON

## Answer:
{"substanceUnit":"mg"}

## Question:
Use the provided schema to extract the unit by which a medication dose is adjusted, such as patient weight or body surface area, from the medication dosage information contained in the text.
Your answer should return a JSON string representing the extracted adjustment unit.
Use schema : { adjustUnit: string }
Examples of usage and expected output:
- For "mg/kg/dag", return: "{ "adjustUnit": "kg" }"
- For "mg/kg", return: "{ "adjustUnit": "kg" }"
- For "mg/m2/dag", return: "{ "adjustUnit": "m2" }"
- For "mg/m2", return: "{ "adjustUnit": "m2" }"
Respond in JSON

## Answer:
{"adjustUnit":"kg"}

## Question:
The answer: {"adjustUnit":"kg"} was not correct because of kg is not mentionned in the text. Please try again answering:
Use the provided schema to extract the unit by which a medication dose is adjusted, such as patient weight or body surface area, from the medication dosage information contained in the text.
Your answer should return a JSON string representing the extracted adjustment unit.
Use schema : { adjustUnit: string }
Examples of usage and expected output:
- For "mg/kg/dag", return: "{ "adjustUnit": "kg" }"
- For "mg/kg", return: "{ "adjustUnit": "kg" }"
- For "mg/m2/dag", return: "{ "adjustUnit": "m2" }"
- For "mg/m2", return: "{ "adjustUnit": "m2" }"
Respond in JSON

## Answer:
{"adjustUnit":""}

## Question:
Use the provided schema to extract the time unit from the medication dosage information contained in the text.
Your answer should return a JSON string representing the extracted time unit.
Use schema : { timeUnit: string }
Examples of usage and expected output:
- For "mg/kg/dag", return: "{ "timeUnit": "dag" }"
- For "mg/kg", return: "{ "timeUnit": "" }"
- For "mg/m2/week", return: "{ "timeUnit": "week" }"
- For "mg/2 dagen", return: "{ "timeUnit": "2 dagen" }"
Respond in JSON

## Answer:
{"timeUnit":"dag"}

Another attempt using a different LLM

A different LLM can be used to compare the responses to the same questions.

let un, msgs =
    let text = Texts.testTexts[0]
    State.run
        (createDoseUnits Ollama.Models.openhermes text)
        (systemMsg text)

printfn $"## The final extracted structure:\n{un}\n\n"

printfn "## The full conversation"
msgs
|> List.iter Message.print

## The final extracted structure:
{ adjustUnit = "kg" substanceUnit = "mg" timeUnit = "dag" }

## The full conversation

## System:
You are an expert on medication prescribing, preparation and administration. You will give exact answers. If there is no possible answer return an empty string. You have to answer questions about a free text between ''' that describes the dosing of a medication. You will be asked to extract structured information from the following text:
'''
alprazolam 6 jaar tot 18 jaar Startdosering: 0,125 mg/dag, éénmalig. Onderhoudsdosering: Op geleide van klinisch beeld verhogen met stappen van 0,125-0,25 mg/dosis tot max 0,05 mg/kg/dag in 3 doses. Max: 3 mg/dag. Advies inname/toediening: De dagdosis indien mogelijk verdelen over 3 doses.Bij plotselinge extreme slapeloosheid: alleen voor de nacht innemen; dosering op geleide van effect ophogen tot max 0,05 mg/kg, maar niet hoger dan 3 mg/dag.De effectiviteit bij de behandeling van acute angst is discutabel.
'''
ONLY respond if the response is present in the text. If the response cannot be extracted respond with an empty string. Respond in JSON

## Question:
Use the provided schema to extract the unit of measurement (substance unit) from the medication dosage information contained in the text.
Your anser should return a JSON string representing the extracted unit.
Use schema: { substanceUnit: string }
Examples of usage and expected output:
- For "mg/kg/dag", return: "{ "substanceUnit": "mg" }"
- For "g/m2/dag", return: "{ "substanceUnit": "g" }"
- For "IE/m2", return: "{ "substanceUnit": "IE" }"
Respond in JSON

## Answer:
{"substanceUnit":"mg"}

## Question:
Use the provided schema to extract the unit by which a medication dose is adjusted, such as patient weight or body surface area, from the medication dosage information contained in the text.
Your answer should return a JSON string representing the extracted adjustment unit.
Use schema : { adjustUnit: string }
Examples of usage and expected output:
- For "mg/kg/dag", return: "{ "adjustUnit": "kg" }"
- For "mg/kg", return: "{ "adjustUnit": "kg" }"
- For "mg/m2/dag", return: "{ "adjustUnit": "m2" }"
- For "mg/m2", return: "{ "adjustUnit": "m2" }"
Respond in JSON

## Answer:
{"adjustUnit":"kg"}

## Question:
Use the provided schema to extract the time unit from the medication dosage information contained in the text.
Your answer should return a JSON string representing the extracted time unit.
Use schema : { timeUnit: string }
Examples of usage and expected output:
- For "mg/kg/dag", return: "{ "timeUnit": "dag" }"
- For "mg/kg", return: "{ "timeUnit": "" }"
- For "mg/m2/week", return: "{ "timeUnit": "week" }"
- For "mg/2 dagen", return: "{ "timeUnit": "2 dagen" }"
Respond in JSON

## Answer:
{"timeUnit":"dag"}

Testing different LLMs

In this section, we introduce a test function that runs a given LLM over a collection of test texts, each paired with an expected outcome. The function performs the extraction on each text and compares the resulting structure with the expected one. The outcome is a success score: the number of texts for which the extracted information exactly matches the expected result.

let test model =
    [
        for (text, exp) in Texts.testUnitTexts do
            let un, _ =
                State.run
                    (createDoseUnits model text)
                    (systemMsg text)
            if un = exp then 1 else 0
    ]
    |> List.sum
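The scoring idea can be tried out without a running model. The sketch below is an assumption-laden illustration: score, naiveExtract and the two sample cases are hypothetical and are not taken from Texts.testUnitTexts. It relies on the fact that F# anonymous records with the same field names and types support structural equality, so a plain = comparison is enough.

```fsharp
// Hypothetical scoring helper: counts how many extractions equal the expected value.
let score extract cases =
    cases
    |> List.sumBy (fun (text, expected) ->
        if extract text = expected then 1 else 0)

// Made-up sample cases (not the article's test set): a text and the expected units.
let cases =
    [
        ("10 mg/dag in 1 dosis",
         {| substanceUnit = "mg"; adjustUnit = ""; timeUnit = "dag" |})
        ("max 0,05 mg/kg/dag in 3 doses",
         {| substanceUnit = "mg"; adjustUnit = "kg"; timeUnit = "dag" |})
    ]

// Stand-in for the LLM pipeline: always answers mg per dag without an adjust unit.
let naiveExtract (_: string) =
    {| substanceUnit = "mg"; adjustUnit = ""; timeUnit = "dag" |}

let naiveScore = score naiveExtract cases

printfn $"score: {naiveScore} from {cases |> List.length}"
```

Because the comparison is exact, an answer such as "2 dagen" where "dag" is expected counts as a miss, so the score is a strict lower bound on how useful the extractions are.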

Run the tests to discover which model performs best.

[
    Ollama.Models.llama2
    Ollama.Models.gemma
    Ollama.Models.openhermes
    Ollama.Models.mistral
    Ollama.Models.``llama-pro``
    Ollama.Models.``openchat:7b``
    Ollama.Models.``llama2:13b-chat``
]
|> List.map (fun model -> 
    printf $"- Testing: {model}: "
    let s = model |> test
    printfn $"score: {s}"
    model, s
)
|> List.maxBy snd
|> fun (m, s) -> printfn $"\n\n## And the winner is: {m} with a high score: {s} from {Texts.testUnitTexts |> List.length}"

- Testing: llama2: score: 3
- Testing: gemma: score: 0
- Testing: openhermes: score: 4
- Testing: mistral: score: 3
- Testing: llama-pro: score: 4
- Testing: openchat:7b: score: 4
- Testing: llama2:13b-chat: score: 4

## And the winner is: openhermes with a high score: 4 from 6

Conclusion

In conclusion, this notebook demonstrates a sophisticated approach to extracting structured information from free text using a Large Language Model (LLM). By defining specific extraction functions, each equipped with a zero structure for fallback and a validator for verifying the integrity of the extracted data, we can tackle complex extraction tasks in a modular, manageable manner. These functions are designed to isolate and validate small pieces of structured information, simplifying the extraction process.

Furthermore, by combining these functions into a larger extraction and using the monad computation expression, the state, which is the ongoing conversation or list of messages, is consistently passed along. This gives the LLM the full conversation as context, which may improve the accuracy and validity of the extracted information.

To assess the efficacy of our extraction methodology, we employ a test function that iterates through a series of test texts, comparing the LLM's extractions against predefined expected outcomes. The comparison results in a success score that reflects the number of matches, offering a quantifiable measure of the extraction process's effectiveness.

Overall, this code presents a robust framework for not only extracting structured information from unstructured text but also validating and testing the accuracy of these extractions. Such a system has vast applications, ranging from data analysis to automating information retrieval and processing tasks, highlighting the power of combining LLM capabilities with functional programming and validation techniques.
