Getting Structured Information with Validation using a State Monad

In this article, we will focus on extracting structured information from free text. Once this information is obtained, it undergoes a validation process. Should the validation not succeed, the system will make another attempt to derive a valid response. A second failure results in the answer being disregarded. To efficiently monitor the series of messages, or the dialogue with the LLM, we employ a state monad.

For those curious about what a state monad is and its operational mechanics, I highly recommend exploring the exceptional resources provided by Scott Wlaschin, which include an informative blog and an engaging video. You can find these resources at Scott Wlaschin's website.
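
To make the idea concrete before bringing in a library: a stateful step can be modelled as a function that takes the current state and returns a result together with the new state. The sketch below threads a conversation through two steps by hand. The names `ask` and `runByHand` and the canned answers are hypothetical stand-ins for real LLM calls; they are not part of the libraries used in this article.

```fsharp
// The state: the conversation so far, as a list of messages.
type Conversation = string list

// A stateful step: takes the current conversation and returns
// a result together with the updated conversation.
// The answer is canned here; a real step would call the LLM.
let ask (question: string) (cannedAnswer: string) (msgs: Conversation) : string * Conversation =
    cannedAnswer, msgs @ [ $"Q: {question}"; $"A: {cannedAnswer}" ]

// Threading the state by hand: every step receives the
// conversation produced by the previous step.
let runByHand (msgs0: Conversation) =
    let substanceUnit, msgs1 = ask "substance unit?" "mg" msgs0
    let timeUnit, msgs2 = ask "time unit?" "dag" msgs1
    (substanceUnit, timeUnit), msgs2

let units, conversation = runByHand [ "system prompt" ]

printfn "%A" units
conversation |> List.iter (printfn "%s")
```

A state monad packages exactly this pattern, so that the intermediate values `msgs0`, `msgs1` and `msgs2` no longer have to be passed around explicitly.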

First set up the libraries and open the namespaces

#r "nuget: FSharpPlus"
#r "nuget: Newtonsoft.Json"
#r "nuget: NJsonSchema"

#r "../../Informedica.Utils.Lib/bin/Debug/net8.0/Informedica.Utils.Lib.dll"

#load "../Types.fs"
#load "../Utils.fs"
#load "../Texts.fs"
#load "../Prompts.fs"
#load "../Message.fs"
#load "../OpenAI.fs"
#load "../Fireworks.fs"
#load "../Ollama.fs"


open Newtonsoft.Json

open FSharpPlus
open FSharpPlus.Data
open Informedica.Utils.Lib.BCL

open Informedica.OpenAI.Lib
        

Installed Packages

  • FSharpPlus, 1.6.1
  • Newtonsoft.Json, 13.0.3
  • NJsonSchema, 11.0.0

Adjust the model settings

The model settings can be tuned to make the extraction as accurate as possible. In practice, when the model is asked to respond in JSON, these settings may have little effect on the outcome. The notable exception is the seed, which makes the results more reproducible.

How much the other settings matter depends on the model and the task: settings such as temperature and top_p generally influence what a model returns, whatever the output format. Fixing the seed, as done here together with a temperature of 0, makes repeated runs much more likely to return the same result.

Ollama.options.temperature <- 0.
Ollama.options.penalize_newline <- true
Ollama.options.top_k <- 10
Ollama.options.top_p <- 0.95
Ollama.options.seed <- 101
        

An initial system prompt and a generic method to extract structured data

The systemMsg variable primes the Large Language Model (LLM) to function as a medical expert capable of extracting structured information from free text.

The extract function takes a model (specified by the name of the LLM), a zero structure to fall back on, and the request message. It extracts information from free text and returns it as a structure deserialized from the JSON response. The function is written as a monad computation expression, which makes it possible to use a State monad. The State monad tracks the state of the system, which here is the list of messages exchanged during the "conversation" with the LLM.

let systemMsg text = [ text |> Texts.systemDoseQuantityExpert2 |> Message.system ]


let inline extract (model: string) zero msg =
    monad {
        // get the current list of messages
        let! msgs = State.get
        // get the structured extraction along with
        // the updated list of messages
        let msgs, res =
            msg
            |> Ollama.validate2
                model
                msgs
            |> Async.RunSynchronously
            |> function
                | Ok (result, msgs) -> msgs, result
                | Error (_, msgs)   -> msgs, zero
        // refresh the state with the updated list of messages
        do! State.put msgs
        // return the structured extraction
        return res
    }
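
The fallback step in `extract` can be isolated as a small function. The sketch below assumes, as the pattern match above suggests, that a validated call yields either `Ok (result, messages)` or `Error (reason, messages)`. The name `orZero`, the string error type and the sample values are hypothetical.

```fsharp
// Turn the outcome of a validated call into (messages, value):
// keep the result on success, fall back to `zero` on failure.
// The updated messages are kept in both cases, so a failed
// attempt still remains part of the conversation.
let orZero (zero: 'a) (outcome: Result<'a * string list, string * string list>) =
    match outcome with
    | Ok (result, msgs) -> msgs, result
    | Error (_, msgs) -> msgs, zero

let zero = {| timeUnit = "" |}

let succeeded = orZero zero (Ok ({| timeUnit = "dag" |}, [ "Q"; "A" ]))
let failed = orZero zero (Error ("not valid", [ "Q"; "A"; "Q again"; "A again" ]))

printfn "%A" succeeded
printfn "%A" failed
```

Keeping the messages on failure is a design choice: later questions are then asked with the rejected answers still visible in the conversation.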
        

A General Validation function to validate the LLM extraction

A validation function can be used to check:

  • Whether the response can be deserialized as a valid extraction structure and
  • Whether the extracted structure itself is valid

let unitValidator<'Unit> text get validUnits s =
    let isValidUnit s =
        if validUnits |> List.isEmpty then true
        else
            validUnits
            |> List.exists (String.equalsCapInsens s)
    try
        let un = JsonConvert.DeserializeObject<'Unit>(s)
        match un |> get |> String.split "/" with
        | [u] when u |> isValidUnit -> 
            if text |> String.containsCapsInsens u then Ok s
            else
                $"{u} is not mentioned in the text"
                |> Error 
        | _ -> 
            if validUnits |> List.isEmpty then $"{s} is not a valid unit, the unit should not contain '/'"
            else
                $"""
{s} is not a valid unit, the unit should not contain '/' and the unit should be one of the following:
{validUnits |> String.concat ", "}
"""
            |> Error
    with
    | e ->
        e.ToString()
        |> Error
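
The retry behaviour described in the introduction (validate, try once more with the error fed back, then give up) can be sketched on its own. Everything below is a simplified assumption rather than the library's implementation: `cannedLlm` replaces the real model with canned answers, `mentionedIn` is a reduced validator that only checks whether the answer occurs in the text, and the retry wording merely mimics the conversation log shown further on.

```fsharp
open System

// Hypothetical stand-in for an LLM: returns canned answers in order.
let cannedLlm (answers: string list) =
    let remaining = ref answers
    fun (_prompt: string) ->
        match remaining.Value with
        | next :: rest ->
            remaining.Value <- rest
            next
        | [] -> ""

// Simplified validator: the answer must occur in the source text.
let mentionedIn (text: string) (answer: string) =
    if text.Contains(answer, StringComparison.OrdinalIgnoreCase) then
        Ok answer
    else
        Error $"{answer} is not mentioned in the text"

// Ask once; on a validation error ask once more, feeding the
// error back to the model. A second failure is final.
let askWithRetry validate (ask: string -> string) (question: string) =
    let first = ask question
    match validate first with
    | Ok answer -> Ok answer
    | Error reason ->
        let retry =
            $"The answer: {first} was not correct because of {reason}. Please try again answering: {question}"
        ask retry |> validate

let text = "Startdosering: 10 mg/dag in 1 dosis"

// First answer is rejected, second answer passes.
let recovered = askWithRetry (mentionedIn text) (cannedLlm [ "kg"; "mg" ]) "Which unit?"
// Both answers are rejected.
let rejected = askWithRetry (mentionedIn text) (cannedLlm [ "kg"; "m2" ]) "Which unit?"

printfn "%A" recovered
printfn "%A" rejected
```

In the pipeline of this article, a final `Error` is then replaced by the zero structure.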
        

Functions that extract different pieces of structured information

We can leverage the previous general extraction and validation functions to develop specific functions capable of isolating and validating small segments of structured information. This modular approach divides a large task into more manageable, simpler tasks for the LLM to process.

Each specialized extraction function is associated with a zero structure. This zero structure serves as a fallback mechanism in instances where extraction fails. Additionally, each extraction function is complemented by a validator. This validator is a dedicated function responsible for assessing the integrity and accuracy of the extracted structured information.

let extractSubstanceUnit model text =
    let zero = {| substanceUnit = "" |}
    let validator = unitValidator text (fun (u: {| substanceUnit: string |}) -> u.substanceUnit)  []

    """
Use the provided schema to extract the unit of measurement (substance unit) from the medication dosage information contained in the text.
Your answer should return a JSON string representing the extracted unit.

Use schema: { substanceUnit: string }

Examples of usage and expected output:
 - For "mg/kg/dag", return: "{ "substanceUnit": "mg" }"
 - For "g/m2/dag", return: "{ "substanceUnit": "g" }"
 - For "IE/m2", return: "{ "substanceUnit": "IE" }"

Respond in JSON
"""
    |> Message.userWithValidator validator
    |> extract model zero


let extractAdjustUnit model text =
    let zero = {| adjustUnit = "" |}
    let validator = 
        ["kg"; "m2"; "mˆ2"]
        |> unitValidator text (fun (u: {| adjustUnit: string |}) -> u.adjustUnit)

    """
Use the provided schema to extract the unit by which a medication dose is adjusted, such as patient weight or body surface area, from the medication dosage information contained in the text.
Your answer should return a JSON string representing the extracted adjustment unit.

Use schema : { adjustUnit: string }

Examples of usage and expected output:
- For "mg/kg/dag", return: "{ "adjustUnit": "kg" }"
- For "mg/kg", return: "{ "adjustUnit": "kg" }"
- For "mg/m2/dag", return: "{ "adjustUnit": "m2" }"
- For "mg/m2", return: "{ "adjustUnit": "m2" }"

Respond in JSON
"""
    |> Message.userWithValidator validator
    |> extract model zero


let extractTimeUnit model text =
    let zero = {| timeUnit = "" |}
    let validator = 
        [
            "dag"
            "week"
            "maand"
        ]
        |> unitValidator text (fun (u: {| timeUnit: string |}) -> u.timeUnit)

    """
Use the provided schema to extract the time unit from the medication dosage information contained in the text.
Your answer should return a JSON string representing the extracted time unit.

Use schema : { timeUnit: string }

Examples of usage and expected output:
- For "mg/kg/dag", return: "{ "timeUnit": "dag" }"
- For "mg/kg", return: "{ "timeUnit": "" }"
- For "mg/m2/week", return: "{ "timeUnit": "week" }"
- For "mg/2 dagen", return: "{ "timeUnit": "2 dagen" }"

Respond in JSON
"""
    |> Message.userWithValidator validator
    |> extract model zero
        

Combine the extractions to create a larger structured extraction

The individual extraction functions can be combined into one larger extraction structure. Because they are composed inside the monad computation expression, the state (the ongoing conversation, a list of messages) is passed from each extraction to the next. The LLM therefore sees the "full picture": every earlier question and answer. The intent is that this fuller context improves the accuracy of the extracted information.

let createDoseUnits model text =
    monad {
        let! substanceUnit = extractSubstanceUnit model text
        let! adjustUnit = extractAdjustUnit model text
        let! timeUnit = extractTimeUnit model text

        return
            {|
                substanceUnit = substanceUnit.substanceUnit
                adjustUnit = adjustUnit.adjustUnit
                timeUnit = timeUnit.timeUnit
            |}
    }        
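
How the conversation is propagated can be made visible with a hand-rolled state builder. This is a minimal sketch that does not use FSharpPlus: the `State` type, `StateBuilder`, `extractStub` and the canned answers are all hypothetical and only imitate the shape of the code above.

```fsharp
// A state computation: state in, (value, new state) out.
type State<'s, 'a> = State of ('s -> 'a * 's)

let run (State f) s = f s

// Minimal computation expression builder for State.
type StateBuilder() =
    member _.Return(x) = State (fun s -> x, s)
    member _.Bind(m, f) =
        State (fun s ->
            let a, s1 = run m s
            run (f a) s1)

let state = StateBuilder()

// Hypothetical extractor stub: records the question, how many
// messages it could "see", and a canned answer.
let extractStub (question: string) (answer: string) =
    State (fun (msgs: string list) ->
        let asked = $"Q: {question} (asked after {List.length msgs} earlier messages)"
        answer, msgs @ [ asked; $"A: {answer}" ])

let doseUnits =
    state {
        let! substanceUnit = extractStub "substance unit?" "mg"
        let! adjustUnit = extractStub "adjust unit?" "kg"
        let! timeUnit = extractStub "time unit?" "dag"

        return
            {| substanceUnit = substanceUnit
               adjustUnit = adjustUnit
               timeUnit = timeUnit |}
    }

let units, conversation = run doseUnits [ "system prompt" ]

printfn "%A" units
conversation |> List.iter (printfn "%s")
```

Each question is asked with all earlier questions and answers in the state, although no extraction function passes the message list along explicitly.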

Finally, the State monad can run the whole extraction process

Running the extraction process returns the extracted structure and the full list of messages.

let un, msgs =
    let text = Texts.testTexts[3]
    State.run
        (createDoseUnits Ollama.Models.llama2 text)
        (systemMsg text)

printfn $"## The final extracted structure:\n{un}\n\n"

printfn "## The full conversation"
msgs
|> List.iter Message.print        
## The final extracted structure:
{ adjustUnit = ""
  substanceUnit = "mg"
  timeUnit = "dag" }

## The full conversation

## System:
You are an expert on medication prescribing, preparation and administration. You will give exact answers.
If there is no possible answer return an empty string.
You have to answer questions about a free text between ''' that describes the dosing of a medication.
You will be asked to extract structured information from the following text:

'''
amitriptyline 6 jaar tot 18 jaar Startdosering: voor de nacht: 10 mg/dag in 1 dosisOnderhoudsdosering: langzaam ophogen met 10 mg/dag per 4-6 weken naar 10 - 30 mg/dag in 1 dosis. Max: 30 mg/dag. Behandeling met amitriptyline mag niet plotseling worden gestaakt vanwege het optreden van ontwenningsverschijnselen; de dosering moet geleidelijk worden verminderd.Uit de studie van Powers (2017) blijkt dat de werkzaamheid van amitriptyline bij migraine profylaxe niet effectiever is t.o.v. placebo. Desondanks menen experts dat in individuele gevallen behandeling met amitriptyline overwogen kan worden.
'''

ONLY respond if the response is present in the text. If the response cannot be extracted respond with an empty string.
Respond in JSON

## Question:
Use the provided schema to extract the unit of measurement (substance unit) from the medication dosage information contained in the text.
Your anser should return a JSON string representing the extracted unit.

Use schema: { substanceUnit: string }

Examples of usage and expected output:
- For "mg/kg/dag", return: "{ "substanceUnit": "mg" }"
- For "g/m2/dag", return: "{ "substanceUnit": "g" }"
- For "IE/m2", return: "{ "substanceUnit": "IE" }"

Respond in JSON

## Answer:
{"substanceUnit":"mg"}

## Question:
Use the provided schema to extract the unit by which a medication dose is adjusted, such as patient weight or body surface area, from the medication dosage information contained in the text.
Your answer should return a JSON string representing the extracted adjustment unit.

Use schema : { adjustUnit: string }

Examples of usage and expected output:
- For "mg/kg/dag", return: "{ "adjustUnit": "kg" }"
- For "mg/kg", return: "{ "adjustUnit": "kg" }"
- For "mg/m2/dag", return: "{ "adjustUnit": "m2" }"
- For "mg/m2", return: "{ "adjustUnit": "m2" }"

Respond in JSON

## Answer:
{"adjustUnit":"kg"}

## Question:
The answer: {"adjustUnit":"kg"} was not correct because of kg is not mentionned in the text. Please try again answering:
Use the provided schema to extract the unit by which a medication dose is adjusted, such as patient weight or body surface area, from the medication dosage information contained in the text.
Your answer should return a JSON string representing the extracted adjustment unit.

Use schema : { adjustUnit: string }

Examples of usage and expected output:
- For "mg/kg/dag", return: "{ "adjustUnit": "kg" }"
- For "mg/kg", return: "{ "adjustUnit": "kg" }"
- For "mg/m2/dag", return: "{ "adjustUnit": "m2" }"
- For "mg/m2", return: "{ "adjustUnit": "m2" }"

Respond in JSON

## Answer:
{"adjustUnit":""}

## Question:
Use the provided schema to extract the time unit from the medication dosage information contained in the text.
Your answer should return a JSON string representing the extracted time unit.

Use schema : { timeUnit: string }

Examples of usage and expected output:
- For "mg/kg/dag", return: "{ "timeUnit": "dag" }"
- For "mg/kg", return: "{ "timeUnit": "" }"
- For "mg/m2/week", return: "{ "timeUnit": "week" }"
- For "mg/2 dagen", return: "{ "timeUnit": "2 dagen" }"

Respond in JSON

## Answer:
{"timeUnit":"dag"}

Another attempt using a different LLM

The same questions can be put to a different LLM to compare the responses. This run uses openhermes on another test text.

let un, msgs =
    let text = Texts.testTexts[0]
    State.run
        (createDoseUnits Ollama.Models.openhermes text)
        (systemMsg text)

printfn $"## The final extracted structure:\n{un}\n\n"

printfn "## The full conversation"
msgs
|> List.iter Message.print        
## The final extracted structure:
{ adjustUnit = "kg"
  substanceUnit = "mg"
  timeUnit = "dag" }

## The full conversation

## System:
You are an expert on medication prescribing, preparation and administration. You will give exact answers.
If there is no possible answer return an empty string.
You have to answer questions about a free text between ''' that describes the dosing of a medication.
You will be asked to extract structured information from the following text:

'''
alprazolam 6 jaar tot 18 jaar Startdosering: 0,125 mg/dag, éénmalig. Onderhoudsdosering: Op geleide van klinisch beeld verhogen met stappen van 0,125-0,25 mg/dosis tot max 0,05 mg/kg/dag in 3 doses. Max: 3 mg/dag. Advies inname/toediening: De dagdosis indien mogelijk verdelen over 3 doses.Bij plotselinge extreme slapeloosheid: alleen voor de nacht innemen; dosering op geleide van effect ophogen tot max 0,05 mg/kg, maar niet hoger dan 3 mg/dag.De effectiviteit bij de behandeling van acute angst is discutabel.
'''

ONLY respond if the response is present in the text. If the response cannot be extracted respond with an empty string.
Respond in JSON

## Question:
Use the provided schema to extract the unit of measurement (substance unit) from the medication dosage information contained in the text.
Your anser should return a JSON string representing the extracted unit.

Use schema: { substanceUnit: string }

Examples of usage and expected output:
- For "mg/kg/dag", return: "{ "substanceUnit": "mg" }"
- For "g/m2/dag", return: "{ "substanceUnit": "g" }"
- For "IE/m2", return: "{ "substanceUnit": "IE" }"

Respond in JSON

## Answer:
{"substanceUnit":"mg"}

## Question:
Use the provided schema to extract the unit by which a medication dose is adjusted, such as patient weight or body surface area, from the medication dosage information contained in the text.
Your answer should return a JSON string representing the extracted adjustment unit.

Use schema : { adjustUnit: string }

Examples of usage and expected output:
- For "mg/kg/dag", return: "{ "adjustUnit": "kg" }"
- For "mg/kg", return: "{ "adjustUnit": "kg" }"
- For "mg/m2/dag", return: "{ "adjustUnit": "m2" }"
- For "mg/m2", return: "{ "adjustUnit": "m2" }"

Respond in JSON

## Answer:
{"adjustUnit":"kg"}

## Question:
Use the provided schema to extract the time unit from the medication dosage information contained in the text.
Your answer should return a JSON string representing the extracted time unit.

Use schema : { timeUnit: string }

Examples of usage and expected output:
- For "mg/kg/dag", return: "{ "timeUnit": "dag" }"
- For "mg/kg", return: "{ "timeUnit": "" }"
- For "mg/m2/week", return: "{ "timeUnit": "week" }"
- For "mg/2 dagen", return: "{ "timeUnit": "2 dagen" }"

Respond in JSON

## Answer:
{"timeUnit":"dag"}

Testing different LLMs

In this section, we introduce a test function that uses a given LLM to iterate over a collection of test texts, each paired with an expected outcome. The function runs the extraction on each text and compares the extracted structure with the expected one. The success score counts the texts for which the extraction exactly matches the expected result.

let test model =
    [
        for (text, exp) in Texts.testUnitTexts do
            let un, _ =
                State.run
                    (createDoseUnits model text)
                    (systemMsg text)
            if un = exp then 1 else 0
    ]
    |> List.sum        
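
The scoring can be tried without a model. The sketch below is an illustration under assumptions: `stubExtract` is a hypothetical stand-in for the LLM-backed extraction, and the three test cases are invented for this example; they are not the article's test set. It relies on the fact that F# anonymous records compare structurally, so `=` checks every field.

```fsharp
// Count the test cases for which the extraction equals the expected value.
let score extract cases =
    cases
    |> List.filter (fun (text, expected) -> extract text = expected)
    |> List.length

// Hypothetical stub extractor: always answers "mg" and only
// recognises "kg" and "dag"; a real run would call the LLM.
let stubExtract (text: string) =
    {| substanceUnit = "mg"
       adjustUnit = (if text.Contains("kg") then "kg" else "")
       timeUnit = (if text.Contains("dag") then "dag" else "") |}

// Hypothetical test cases: a text paired with the expected structure.
let cases =
    [ "10 mg/kg/dag", {| substanceUnit = "mg"; adjustUnit = "kg"; timeUnit = "dag" |}
      "5 mg/dag", {| substanceUnit = "mg"; adjustUnit = ""; timeUnit = "dag" |}
      "2 g/m2/week", {| substanceUnit = "g"; adjustUnit = "m2"; timeUnit = "week" |} ]

let result = score stubExtract cases

printfn $"score: {result} from {List.length cases}"
```

Because a case only counts when all three fields match, a single wrong field makes the whole extraction score zero for that text.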

Run the tests to discover which model performs best.

[
    Ollama.Models.llama2
    Ollama.Models.gemma
    Ollama.Models.openhermes
    Ollama.Models.mistral
    Ollama.Models.``llama-pro``
    Ollama.Models.``openchat:7b``
    Ollama.Models.``llama2:13b-chat``
]
|> List.map (fun model -> 
    printf $"- Testing: {model}: "
    let s = model |> test
    printfn $"score: {s}"
    model, s
)
|> List.maxBy snd
|> fun (m, s) -> printfn $"\n\n## And the winner is: {m} with a high score: {s} from {Texts.testUnitTexts |> List.length}"        
- Testing: llama2: score: 3
- Testing: gemma: score: 0
- Testing: openhermes: score: 4
- Testing: mistral: score: 3
- Testing: llama-pro: score: 4
- Testing: openchat:7b: score: 4
- Testing: llama2:13b-chat: score: 4

## And the winner is: openhermes with a high score: 4 from 6
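
One caveat when reading this output: `List.maxBy` returns the first element with the highest score, so openhermes is named as the winner although three other models also scored 4. A small sketch of how all top scorers could be listed instead, using the scores printed above (the names `scores`, `best` and `winners` are introduced here for illustration):

```fsharp
// Scores as printed in the test run above.
let scores =
    [ "llama2", 3
      "gemma", 0
      "openhermes", 4
      "mistral", 3
      "llama-pro", 4
      "openchat:7b", 4
      "llama2:13b-chat", 4 ]

// Highest score, and every model that reached it.
let best = scores |> List.map snd |> List.max

let winners =
    scores
    |> List.filter (fun (_, s) -> s = best)
    |> List.map fst

let joined = String.concat ", " winners
printfn $"Top score {best} shared by: {joined}"
```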

Conclusion

In conclusion, this notebook demonstrates an approach to extracting structured information from free text using a Large Language Model (LLM). By defining specific extraction functions, each equipped with a zero structure for fallback and a validator for checking the extracted data, we can tackle complex extraction tasks in a modular, manageable manner. Each function isolates and validates one small piece of structured information, which simplifies the task for the LLM.

Furthermore, by combining these functions into a larger extraction and using the monad computation expression, we ensure that the state (the ongoing conversation, a list of messages) is consistently passed along. The LLM thus keeps the full context, which may improve the accuracy and validity of the extracted information.

To assess the extraction, we use a test function that iterates through a series of test texts and compares the LLM's extractions with predefined expected outcomes. The resulting success score is the number of exact matches; in the run shown here, the best models scored 4 out of 6.

Overall, this code offers a framework for extracting structured information from unstructured text and for validating and testing those extractions. The same combination of LLM capabilities, functional programming and validation can be applied to other tasks, such as data analysis and automated information retrieval.
