Getting Structured Information with Validation using a State Monad
In this article, we will focus on extracting structured information from free text. Once this information is obtained, it undergoes a validation process. Should the validation not succeed, the system will make another attempt to derive a valid response. A second failure results in the answer being disregarded. To efficiently monitor the series of messages, or the dialogue with the LLM, we employ a state monad.
For those curious about what a state monad is and its operational mechanics, I highly recommend exploring the exceptional resources provided by Scott Wlaschin, which include an informative blog and an engaging video. You can find these resources at Scott Wlaschin's website.
First, set up the libraries and open the namespaces.
#r "nuget: FSharpPlus"
#r "nuget: Newtonsoft.Json"
#r "nuget: NJsonSchema"
#r "../../Informedica.Utils.Lib/bin/Debug/net8.0/Informedica.Utils.Lib.dll"
#load "../Types.fs"
#load "../Utils.fs"
#load "../Texts.fs"
#load "../Prompts.fs"
#load "../Message.fs"
#load "../OpenAI.fs"
#load "../Fireworks.fs"
#load "../Ollama.fs"
open Newtonsoft.Json
open FSharpPlus
open FSharpPlus.Data
open Informedica.Utils.Lib.BCL
open Informedica.OpenAI.Lib
Adjust the model settings
Adjustments to the model settings can be made to optimize the extraction of the most accurate information. It is important to note, however, that when employing JSON format for extraction, these adjustments may not significantly impact the outcome. The notable exception is the configuration of the seed, which enhances the reproducibility of the results.
Treat the first claim as an observation from these experiments rather than a general rule: how much the settings matter depends on the model and the task, and sampling settings can still influence the output regardless of its format. Fixing the seed makes repeated runs far more likely to produce the same result, which is what matters for reproducibility.
Ollama.options.temperature <- 0.
Ollama.options.penalize_newline <- true
Ollama.options.top_k <- 10
Ollama.options.top_p <- 0.95
Ollama.options.seed <- 101
An initial system prompt and a generic method to extract structured data
The systemMsg variable primes the Large Language Model (LLM) to function as a medical expert capable of extracting structured information from free text.
The extract function takes a model (the name of the LLM), a zero structure to fall back on, and the request message, and extracts information from free text as JSON. It is written with the monad computation expression from FSharpPlus, which makes it possible to use a State monad. The State monad tracks the state of the system, which here is the list of messages exchanged during the "conversation" with the LLM.
let systemMsg text = [ text |> Texts.systemDoseQuantityExpert2 |> Message.system ]

let inline extract (model: string) zero msg =
    monad {
        // get the current list of messages
        let! msgs = State.get
        // get the structured extraction along with
        // the updated list of messages
        let msgs, res =
            msg
            |> Ollama.validate2
                model
                msgs
            |> Async.RunSynchronously
            |> function
                | Ok (result, msgs) -> msgs, result
                | Error (_, msgs) -> msgs, zero
        // refresh the state with the updated list of messages
        do! State.put msgs
        // return the structured extraction
        return res
    }
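The retry behaviour described in the introduction (ask, validate, try once more, then give up) lives inside Ollama.validate2, which is not shown here. The following is only a minimal sketch of that idea under stated assumptions: Msg, askWithRetry, fakeAsk and validate are hypothetical names, the model call is replaced by a plain function, and the retry wording is modelled on the conversation transcript further down.

```fsharp
// Hypothetical message type; the article's Message module is richer.
type Msg = { Role: string; Content: string }

// Ask once, validate, retry once with the validation error, then give up.
// `ask` stands in for the LLM call: it receives the conversation so far
// and returns the model's answer as a string.
let askWithRetry
    (ask: Msg list -> string)
    (validate: string -> Result<string, string>)
    (msgs: Msg list)
    (question: string) =
    let attempt (msgs: Msg list) (question: string) =
        let msgs = msgs @ [ { Role = "user"; Content = question } ]
        let answer = ask msgs
        let msgs = msgs @ [ { Role = "assistant"; Content = answer } ]
        validate answer, answer, msgs

    match attempt msgs question with
    | Ok valid, _, msgs -> Ok (valid, msgs)
    | Error reason, answer, msgs ->
        let retry =
            $"The answer: {answer} was not correct because of {reason}. Please try again answering: {question}"
        match attempt msgs retry with
        | Ok valid, _, msgs -> Ok (valid, msgs)
        | Error reason, _, msgs -> Error (reason, msgs)

// A fake model that answers "kg" first and an empty string on the retry.
let fakeAsk (msgs: Msg list) =
    let userTurns = msgs |> List.filter (fun m -> m.Role = "user") |> List.length
    if userTurns = 1 then "kg" else ""

// A fake validator that rejects "kg" and accepts everything else.
let validate (answer: string) =
    if answer = "kg" then Error "kg is not mentioned in the text" else Ok answer

let result = askWithRetry fakeAsk validate [] "Which adjust unit?"
```

Both the accepted answer and the full message list come back together, so a caller such as extract can store the messages in the state whether or not validation succeeded.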
A General Validation function to validate the LLM extraction
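A validation function can be used to check whether the answer deserializes to the expected structure, whether it is a single unit (without a '/'), whether it is one of the allowed units when such a list is supplied, and whether the unit is actually mentioned in the source text: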
let unitValidator<'Unit> text get validUnits s =
    let isValidUnit s =
        if validUnits |> List.isEmpty then true
        else
            validUnits
            |> List.exists (String.equalsCapInsens s)
    try
        let un = JsonConvert.DeserializeObject<'Unit>(s)
        match un |> get |> String.split "/" with
        | [u] when u |> isValidUnit ->
            if text |> String.containsCapsInsens u then Ok s
            else
                $"{u} is not mentionned in the text"
                |> Error
        | _ ->
            if validUnits |> List.isEmpty then $"{s} is not a valid unit, the unit should not contain '/'"
            else
                $"""
                {s} is not a valid unit, the unit should not contain '/' and the unit should be one of the following:
                {validUnits |> String.concat ", "}
                """
            |> Error
    with
    | e ->
        e.ToString()
        |> Error
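To see the checks in isolation, here is a simplified stand-in that can be run without the project libraries. It is a sketch, not the validator above: validateUnit is a hypothetical name, it works on the bare unit string instead of deserializing JSON, and it uses only standard .NET string functions in place of the Informedica.Utils.Lib helpers.

```fsharp
open System

// Simplified stand-in for unitValidator (hypothetical): validates a bare
// unit string against the source text and an optional list of allowed units.
let validateUnit (text: string) (validUnits: string list) (u: string) : Result<string, string> =
    let equalsIgnoreCase (a: string) (b: string) =
        String.Equals(a, b, StringComparison.OrdinalIgnoreCase)
    // an empty list of valid units means that any unit is allowed
    let isAllowed =
        List.isEmpty validUnits || List.exists (equalsIgnoreCase u) validUnits

    if u.Contains("/") then
        Error $"{u} is not a valid unit, the unit should not contain '/'"
    elif not isAllowed then
        let allowed = String.concat ", " validUnits
        Error $"{u} is not a valid unit, the unit should be one of the following: {allowed}"
    elif text.IndexOf(u, StringComparison.OrdinalIgnoreCase) < 0 then
        Error $"{u} is not mentioned in the text"
    else
        Ok u

let text = "max 0,05 mg/kg/dag in 3 doses"

let accepted = validateUnit text [ "kg"; "m2" ] "kg"     // allowed and present in the text
let compound = validateUnit text [ "kg"; "m2" ] "mg/kg"  // rejected: contains '/'
let absent = validateUnit text [ "kg"; "m2" ] "m2"       // rejected: not in the text
```

The last case is the one that matters most for LLM output: an answer can be a perfectly valid unit and still be wrong for this particular text.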
Functions that extract different pieces of structured information
We can leverage the previous general extraction and validation functions to develop specific functions capable of isolating and validating small segments of structured information. This modular approach divides a large task into more manageable, simpler tasks for the LLM to process.
Each specialized extraction function is associated with a zero structure. This zero structure serves as a fallback mechanism in instances where extraction fails. Additionally, each extraction function is complemented by a validator. This validator is a dedicated function responsible for assessing the integrity and accuracy of the extracted structured information.
let extractSubstanceUnit model text =
    let zero = {| substanceUnit = "" |}
    let validator = unitValidator text (fun (u: {| substanceUnit: string |}) -> u.substanceUnit) []

    """
    Use the provided schema to extract the unit of measurement (substance unit) from the medication dosage information contained in the text.
    Your answer should return a JSON string representing the extracted unit.
    Use schema: { substanceUnit: string }
    Examples of usage and expected output:
    - For "mg/kg/dag", return: "{ "substanceUnit": "mg" }"
    - For "g/m2/dag", return: "{ "substanceUnit": "g" }"
    - For "IE/m2", return: "{ "substanceUnit": "IE" }"
    Respond in JSON
    """
    |> Message.userWithValidator validator
    |> extract model zero

let extractAdjustUnit model text =
    let zero = {| adjustUnit = "" |}
    let validator =
        ["kg"; "m2"; "mˆ2"]
        |> unitValidator text (fun (u: {| adjustUnit: string |}) -> u.adjustUnit)

    """
    Use the provided schema to extract the unit by which a medication dose is adjusted, such as patient weight or body surface area, from the medication dosage information contained in the text.
    Your answer should return a JSON string representing the extracted adjustment unit.
    Use schema : { adjustUnit: string }
    Examples of usage and expected output:
    - For "mg/kg/dag", return: "{ "adjustUnit": "kg" }"
    - For "mg/kg", return: "{ "adjustUnit": "kg" }"
    - For "mg/m2/dag", return: "{ "adjustUnit": "m2" }"
    - For "mg/m2", return: "{ "adjustUnit": "m2" }"
    Respond in JSON
    """
    |> Message.userWithValidator validator
    |> extract model zero

let extractTimeUnit model text =
    let zero = {| timeUnit = "" |}
    let validator =
        [
            "dag"
            "week"
            "maand"
        ]
        |> unitValidator text (fun (u: {| timeUnit: string |}) -> u.timeUnit)

    """
    Use the provided schema to extract the time unit from the medication dosage information contained in the text.
    Your answer should return a JSON string representing the extracted time unit.
    Use schema : { timeUnit: string }
    Examples of usage and expected output:
    - For "mg/kg/dag", return: "{ "timeUnit": "dag" }"
    - For "mg/kg", return: "{ "timeUnit": "" }"
    - For "mg/m2/week", return: "{ "timeUnit": "week" }"
    - For "mg/2 dagen", return: "{ "timeUnit": "2 dagen" }"
    Respond in JSON
    """
    |> Message.userWithValidator validator
    |> extract model zero
Combine the extractions to create a larger structured extraction
The individual extraction functions can be combined into one larger extraction. Inside the monad computation expression, the state (the ongoing conversation, a list of messages) is passed from one extraction to the next without any explicit plumbing. As a result, the LLM sees the "full picture" at every step: the system prompt plus all earlier questions and answers. The intent is that this fuller context improves the accuracy of each subsequent extraction.
let createDoseUnits model text =
    monad {
        let! substanceUnit = extractSubstanceUnit model text
        let! adjustUnit = extractAdjustUnit model text
        let! timeUnit = extractTimeUnit model text

        return
            {|
                substanceUnit = substanceUnit.substanceUnit
                adjustUnit = adjustUnit.adjustUnit
                timeUnit = timeUnit.timeUnit
            |}
    }
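For readers who want to see how the conversation is threaded through without FSharpPlus or an LLM, the following hand-rolled sketch may help. Everything in it is an assumption made for illustration: Stateful, StateBuilder, state and ask are hypothetical names, the conversation is a plain list of strings, and each "extraction" simply returns a canned answer. FSharpPlus's own State type is more general than this.

```fsharp
// The conversation is the state that gets threaded through every step.
type Conversation = string list

// A stateful computation: takes a conversation, returns a result
// together with the updated conversation.
type Stateful<'a> = Stateful of (Conversation -> 'a * Conversation)

let run (Stateful f) s = f s
let get = Stateful (fun s -> (s, s))
let put s = Stateful (fun _ -> ((), s))

type StateBuilder() =
    member _.Return x = Stateful (fun s -> (x, s))
    member _.Bind (m, k) =
        Stateful (fun s ->
            // run the first step, then feed its result and state to the next
            let a, s' = run m s
            run (k a) s')

let state = StateBuilder()

// Hypothetical stand-in for an LLM extraction: appends the question and
// a canned answer to the conversation and returns the answer.
let ask (question: string) (answer: string) =
    state {
        let! msgs = get
        do! put (msgs @ [ $"Q: {question}"; $"A: {answer}" ])
        return answer
    }

let doseUnits =
    state {
        let! substanceUnit = ask "substance unit?" "mg"
        let! adjustUnit = ask "adjust unit?" "kg"
        let! timeUnit = ask "time unit?" "dag"
        return {| substanceUnit = substanceUnit; adjustUnit = adjustUnit; timeUnit = timeUnit |}
    }

// Nothing happens until the computation is run with an initial state.
let units, conversation = run doseUnits [ "System: you are an expert" ]
```

Note that doseUnits never mentions the conversation: each let! hands the updated message list to the next step, which is exactly what keeps createDoseUnits above so short.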
Finally, the State monad can run the whole extraction process
Running the extraction process returns the extracted structure and the full list of messages.
let un, msgs =
    let text = Texts.testTexts[3]

    State.run
        (createDoseUnits Ollama.Models.llama2 text)
        (systemMsg text)

printfn $"## The final extracted structure:\n{un}\n\n"
printfn "## The full conversation"

msgs
|> List.iter Message.print
## The final extracted structure:
{ adjustUnit = ""
  substanceUnit = "mg"
  timeUnit = "dag" }

## The full conversation

## System:
You are an expert on medication prescribing, preparation and administration. You will give exact answers.
If there is no possible answer return an empty string.
You have to answer questions about a free text between ''' that describes the dosing of a medication.
You will be asked to extract structured information from the following text:

'''
amitriptyline 6 jaar tot 18 jaar Startdosering: voor de nacht: 10 mg/dag in 1 dosisOnderhoudsdosering: langzaam ophogen met 10 mg/dag per 4-6 weken naar 10 - 30 mg/dag in 1 dosis. Max: 30 mg/dag. Behandeling met amitriptyline mag niet plotseling worden gestaakt vanwege het optreden van ontwenningsverschijnselen; de dosering moet geleidelijk worden verminderd.Uit de studie van Powers (2017) blijkt dat de werkzaamheid van amitriptyline bij migraine profylaxe niet effectiever is t.o.v. placebo. Desondanks menen experts dat in individuele gevallen behandeling met amitriptyline overwogen kan worden.
'''

ONLY respond if the response is present in the text. If the response cannot be extracted respond with an empty string.
Respond in JSON

## Question:
Use the provided schema to extract the unit of measurement (substance unit) from the medication dosage information contained in the text.
Your anser should return a JSON string representing the extracted unit.
Use schema: { substanceUnit: string }
Examples of usage and expected output:
- For "mg/kg/dag", return: "{ "substanceUnit": "mg" }"
- For "g/m2/dag", return: "{ "substanceUnit": "g" }"
- For "IE/m2", return: "{ "substanceUnit": "IE" }"
Respond in JSON

## Answer:
{"substanceUnit":"mg"}

## Question:
Use the provided schema to extract the unit by which a medication dose is adjusted, such as patient weight or body surface area, from the medication dosage information contained in the text.
Your answer should return a JSON string representing the extracted adjustment unit.
Use schema : { adjustUnit: string }
Examples of usage and expected output:
- For "mg/kg/dag", return: "{ "adjustUnit": "kg" }"
- For "mg/kg", return: "{ "adjustUnit": "kg" }"
- For "mg/m2/dag", return: "{ "adjustUnit": "m2" }"
- For "mg/m2", return: "{ "adjustUnit": "m2" }"
Respond in JSON

## Answer:
{"adjustUnit":"kg"}

## Question:
The answer: {"adjustUnit":"kg"} was not correct because of kg is not mentionned in the text. Please try again answering:
Use the provided schema to extract the unit by which a medication dose is adjusted, such as patient weight or body surface area, from the medication dosage information contained in the text.
Your answer should return a JSON string representing the extracted adjustment unit.
Use schema : { adjustUnit: string }
Examples of usage and expected output:
- For "mg/kg/dag", return: "{ "adjustUnit": "kg" }"
- For "mg/kg", return: "{ "adjustUnit": "kg" }"
- For "mg/m2/dag", return: "{ "adjustUnit": "m2" }"
- For "mg/m2", return: "{ "adjustUnit": "m2" }"
Respond in JSON

## Answer:
{"adjustUnit":""}

## Question:
Use the provided schema to extract the time unit from the medication dosage information contained in the text.
Your answer should return a JSON string representing the extracted time unit.
Use schema : { timeUnit: string }
Examples of usage and expected output:
- For "mg/kg/dag", return: "{ "timeUnit": "dag" }"
- For "mg/kg", return: "{ "timeUnit": "" }"
- For "mg/m2/week", return: "{ "timeUnit": "week" }"
- For "mg/2 dagen", return: "{ "timeUnit": "2 dagen" }"
Respond in JSON

## Answer:
{"timeUnit":"dag"}
Another attempt using a different LLM
Different LLMs can be given the same questions to compare their responses. This run uses openhermes on another test text.
let un, msgs =
    let text = Texts.testTexts[0]

    State.run
        (createDoseUnits Ollama.Models.openhermes text)
        (systemMsg text)

printfn $"## The final extracted structure:\n{un}\n\n"
printfn "## The full conversation"

msgs
|> List.iter Message.print
## The final extracted structure:
{ adjustUnit = "kg"
  substanceUnit = "mg"
  timeUnit = "dag" }

## The full conversation

## System:
You are an expert on medication prescribing, preparation and administration. You will give exact answers.
If there is no possible answer return an empty string.
You have to answer questions about a free text between ''' that describes the dosing of a medication.
You will be asked to extract structured information from the following text:

'''
alprazolam 6 jaar tot 18 jaar Startdosering: 0,125 mg/dag, éénmalig. Onderhoudsdosering: Op geleide van klinisch beeld verhogen met stappen van 0,125-0,25 mg/dosis tot max 0,05 mg/kg/dag in 3 doses. Max: 3 mg/dag. Advies inname/toediening: De dagdosis indien mogelijk verdelen over 3 doses.Bij plotselinge extreme slapeloosheid: alleen voor de nacht innemen; dosering op geleide van effect ophogen tot max 0,05 mg/kg, maar niet hoger dan 3 mg/dag.De effectiviteit bij de behandeling van acute angst is discutabel.
'''

ONLY respond if the response is present in the text. If the response cannot be extracted respond with an empty string.
Respond in JSON

## Question:
Use the provided schema to extract the unit of measurement (substance unit) from the medication dosage information contained in the text.
Your anser should return a JSON string representing the extracted unit.
Use schema: { substanceUnit: string }
Examples of usage and expected output:
- For "mg/kg/dag", return: "{ "substanceUnit": "mg" }"
- For "g/m2/dag", return: "{ "substanceUnit": "g" }"
- For "IE/m2", return: "{ "substanceUnit": "IE" }"
Respond in JSON

## Answer:
{"substanceUnit":"mg"}

## Question:
Use the provided schema to extract the unit by which a medication dose is adjusted, such as patient weight or body surface area, from the medication dosage information contained in the text.
Your answer should return a JSON string representing the extracted adjustment unit.
Use schema : { adjustUnit: string }
Examples of usage and expected output:
- For "mg/kg/dag", return: "{ "adjustUnit": "kg" }"
- For "mg/kg", return: "{ "adjustUnit": "kg" }"
- For "mg/m2/dag", return: "{ "adjustUnit": "m2" }"
- For "mg/m2", return: "{ "adjustUnit": "m2" }"
Respond in JSON

## Answer:
{"adjustUnit":"kg"}

## Question:
Use the provided schema to extract the time unit from the medication dosage information contained in the text.
Your answer should return a JSON string representing the extracted time unit.
Use schema : { timeUnit: string }
Examples of usage and expected output:
- For "mg/kg/dag", return: "{ "timeUnit": "dag" }"
- For "mg/kg", return: "{ "timeUnit": "" }"
- For "mg/m2/week", return: "{ "timeUnit": "week" }"
- For "mg/2 dagen", return: "{ "timeUnit": "2 dagen" }"
Respond in JSON

## Answer:
{"timeUnit":"dag"}
Testing different LLMs
In this section, we introduce a test function that employs a specific LLM model to iterate over a collection of test texts, each paired with an anticipated outcome. The function performs extraction operations on the texts, outputting the results as JSON, which are then compared against the expected JSON outcomes. The effectiveness of this process is quantified through a success score, which tallies the number of instances where the extracted information precisely matches the expected results.
let test model =
    [
        for (text, exp) in Texts.testUnitTexts do
            let un, _ =
                State.run
                    (createDoseUnits model text)
                    (systemMsg text)

            if un = exp then 1 else 0
    ]
    |> List.sum
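The scoring idea can be tried out without any model. In the sketch below, DoseUnits, score, naiveExtract and the three cases are all hypothetical: the extractor is a naive substring check standing in for createDoseUnits, and the cases are made-up examples, not the contents of Texts.testUnitTexts.

```fsharp
// Hypothetical named record; the article uses an anonymous record instead.
type DoseUnits = { substanceUnit: string; adjustUnit: string; timeUnit: string }

// Count the cases for which the extraction equals the expected structure.
let score (extract: string -> DoseUnits) (cases: (string * DoseUnits) list) =
    cases
    |> List.sumBy (fun (text, expected) -> if extract text = expected then 1 else 0)

// A deliberately naive extractor that only knows mg, kg and dag.
let naiveExtract (text: string) =
    {
        substanceUnit = if text.Contains("mg") then "mg" else ""
        adjustUnit = if text.Contains("/kg") then "kg" else ""
        timeUnit = if text.Contains("/dag") then "dag" else ""
    }

let cases =
    [
        ("0,05 mg/kg/dag in 3 doses", { substanceUnit = "mg"; adjustUnit = "kg"; timeUnit = "dag" })
        ("10 mg/dag in 1 dosis", { substanceUnit = "mg"; adjustUnit = ""; timeUnit = "dag" })
        ("5 mg/m2/week", { substanceUnit = "mg"; adjustUnit = "m2"; timeUnit = "week" })
    ]

// The third case needs m2 and week, which the naive extractor does not know.
let total = score naiveExtract cases
```

Because the comparison uses structural equality on the whole record, a case only counts when all three units are right. That makes the score strict: one wrong field costs the full point.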
Run the tests to discover which model performs best.
[
    Ollama.Models.llama2
    Ollama.Models.gemma
    Ollama.Models.openhermes
    Ollama.Models.mistral
    Ollama.Models.``llama-pro``
    Ollama.Models.``openchat:7b``
    Ollama.Models.``llama2:13b-chat``
]
|> List.map (fun model ->
    printf $"- Testing: {model}: "
    let s = model |> test
    printfn $"score: {s}"
    model, s
)
|> List.maxBy snd
|> fun (m, s) -> printfn $"\n\n## And the winner is: {m} with a high score: {s} from {Texts.testUnitTexts |> List.length}"
- Testing: llama2: score: 3
- Testing: gemma: score: 0
- Testing: openhermes: score: 4
- Testing: mistral: score: 3
- Testing: llama-pro: score: 4
- Testing: openchat:7b: score: 4
- Testing: llama2:13b-chat: score: 4


## And the winner is: openhermes with a high score: 4 from 6
Conclusion
In conclusion, this notebook demonstrates a sophisticated approach to extracting structured information from free text using a Large Language Model (LLM). By defining specific extraction functions, each equipped with a zero structure for fallback and a validator for verifying the integrity of the extracted data, we can tackle complex extraction tasks in a modular, manageable manner. These functions are designed to isolate and validate small pieces of structured information, simplifying the extraction process.
Furthermore, by combining these functions into a larger extraction inside the monad computation expression, the state (the ongoing conversation, a list of messages) is passed along consistently. The LLM therefore keeps the full context of earlier questions and answers, which may improve the accuracy and validity of the extracted information.
To assess the efficacy of our extraction methodology, we employ a test function that iterates through a series of test texts, comparing the LLM's extractions against predefined expected outcomes. The comparison results in a success score that reflects the number of matches, offering a quantifiable measure of the extraction process's effectiveness.
Overall, this code presents a robust framework for not only extracting structured information from unstructured text but also validating and testing the accuracy of these extractions. Such a system has vast applications, ranging from data analysis to automating information retrieval and processing tasks, highlighting the power of combining LLM capabilities with functional programming and validation techniques.