Using a local LLM to systematically extract structured data
Below is an excerpt from an online notebook in which I show how local large language models (LLMs) can extract structured data from free text, a promising and powerful application of LLM technology.
To host a whole list of LLMs on my own machine I use Ollama, a fantastic application that runs entirely locally, so there are no privacy concerns and no hidden costs.
A prime example of its utility is my work on #GenPRES. That project requires precise dose rules, and a significant portion of them normally has to be extracted by hand from unstructured text. An LLM can automate much of this laborious and error-prone process.
Using Ollama to get structured output
Asking the model for structured output makes the result easier to extract and to validate.
First, set up the notebook.
#load "load.fsx"
open Informedica.OpenAI.Lib
open Ollama.Operators
let extraction = function
    | Ok x -> printfn $"## Extracted:\n{x}"
    | Error _ -> printfn "## Extraction failed"
Define a schema and a type for the output
The json function returns a value of the type passed as its type parameter. However, due to limitations of Ollama, you also need to add the schema to the prompt.
"""
Use schema: { number: int; unit: string }
What is the minimal corrected gestational age mentioned in the text between '''
'''A neonate 28 weeks to 32 weeks corrected gestational age.'''
Reply in JSON."""
|> Message.user
|> Ollama.json<{| number: int; unit: string |}>
    Ollama.Models.llama2
    []
|> Async.RunSynchronously
|> extraction
ℹ INFO:
EndPoint: http://localhost:11434/api/chat
Payload:
{"format":"json","messages":[{"content":"\nUse schema: { number: int; unit: string }\nWhat is the minimal corrected gestational age mentioned in the text between '''\n\n'''A neonate 28 weeks to 32 weeks corrected gestational age.'''\n\nReply in JSON.","role":"user"}],"model":"llama2","options":{"num_keep":null,"seed":101,"num_predict":null,"top_k":null,"top_p":null,"tfs_z":null,"typical_p":null,"repeat_last_n":64,"temperature":0.0,"repeat_penalty":null,"presence_penalty":null,"frequency_penalty":null,"mirostat":0,"mirostat_tau":null,"mirostat_eta":null,"penalize_newline":null,"stop":[],"numa":null,"num_ctx":2048,"num_batch":null,"num_gqa":null,"num_gpu":null,"main_gpu":null,"low_vram":null,"f16_kv":null,"vocab_only":null,"use_mmap":null,"use_mlock":null,"rope_frequency_base":null,"rope_frequency_scale":null,"num_thread":null},"response_format":{"schema":{
"$schema": "http://json-schema.org/draft-04/schema#",
"title": "<>f__AnonymousType3408661278OfIntegerAndString",
"type": "object",
"additionalProperties": false,
"properties": {
"number": {
"type": "integer",
"format": "int32"
},
"unit": {
"type": [
"null",
"string"
]
}
}
},"type":"json_object"},"stream":false}
## Extracted:
{ number = 28
unit = "weeks" }
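Because the schema has to be repeated in the prompt, one option is to build the prompt from its parts so that the schema text lives in a single place. The sketch below is only an illustration: buildPrompt is a hypothetical helper, not part of Informedica.OpenAI.Lib, and it only assembles a string in the same shape as the prompt used above.

```fsharp
// Hypothetical helper (not part of the notebook's library):
// assembles a prompt from a schema, a question and the source text.
let buildPrompt (schema: string) (question: string) (text: string) =
    String.concat "\n" [
        sprintf "Use schema: %s" schema
        sprintf "%s in the text between '''" question
        sprintf "'''%s'''" text
        "Reply in JSON."
    ]

let prompt =
    buildPrompt
        "{ number: int; unit: string }"
        "What is the minimal corrected gestational age mentioned"
        "A neonate 28 weeks to 32 weeks corrected gestational age."

printfn "%s" prompt
```

The resulting string can be piped into Message.user exactly like the literal prompt above.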
Getting the maximum age as a JSON structure
The output above is correct. Now try to get the maximum age.
"""
Use schema: { number: int; unit: string }
What is the maximum corrected gestational age mentioned in the text between '''
'''A neonate 28 weeks to 32 weeks corrected gestational age.'''
Reply in JSON."""
|> Message.user
|> Ollama.json<{| number: int; unit: string |}>
    Ollama.Models.llama2
    []
|> Async.RunSynchronously
|> extraction
ℹ INFO:
EndPoint: http://localhost:11434/api/chat
Payload:
{"format":"json","messages":[{"content":"\nUse schema: { number: int; unit: string }\nWhat is the maximum corrected gestational age mentioned in the text between '''\n\n'''A neonate 28 weeks to 32 weeks corrected gestational age.'''\n\nReply in JSON.","role":"user"}],"model":"llama2","options":{"num_keep":null,"seed":101,"num_predict":null,"top_k":null,"top_p":null,"tfs_z":null,"typical_p":null,"repeat_last_n":64,"temperature":0.0,"repeat_penalty":null,"presence_penalty":null,"frequency_penalty":null,"mirostat":0,"mirostat_tau":null,"mirostat_eta":null,"penalize_newline":null,"stop":[],"numa":null,"num_ctx":2048,"num_batch":null,"num_gqa":null,"num_gpu":null,"main_gpu":null,"low_vram":null,"f16_kv":null,"vocab_only":null,"use_mmap":null,"use_mlock":null,"rope_frequency_base":null,"rope_frequency_scale":null,"num_thread":null},"response_format":{"schema":{
"$schema": "http://json-schema.org/draft-04/schema#",
"title": "<>f__AnonymousType3408661278OfIntegerAndString",
"type": "object",
"additionalProperties": false,
"properties": {
"number": {
"type": "integer",
"format": "int32"
},
"unit": {
"type": [
"null",
"string"
]
}
}
},"type":"json_object"},"stream":false}
## Extracted:
{ number = 28
unit = "weeks" }
Try a different structured output
Somehow the prompt is misunderstood: the minimum age is returned instead of the maximum age.
Let's try again using a more complex structure.
"""
Use schema: { minAge: int; maxAge: int; unit: string }
What is corrected gestational age range mentioned in the text between '''
'''A neonate 28 weeks to 32 weeks corrected gestational age.'''
Reply in JSON."""
|> Message.user
|> Ollama.json<{| minAge: int; maxAge: int; unit: string |}>
    Ollama.Models.llama2
    []
|> Async.RunSynchronously
|> extraction
ℹ INFO:
EndPoint: http://localhost:11434/api/chat
Payload:
{"format":"json","messages":[{"content":"\nUse schema: { minAge: int; maxAge: int; unit: string }\nWhat is corrected gestational age range mentioned in the text between '''\n\n'''A neonate 28 weeks to 32 weeks corrected gestational age.'''\n\nReply in JSON.","role":"user"}],"model":"llama2","options":{"num_keep":null,"seed":101,"num_predict":null,"top_k":null,"top_p":null,"tfs_z":null,"typical_p":null,"repeat_last_n":64,"temperature":0.0,"repeat_penalty":null,"presence_penalty":null,"frequency_penalty":null,"mirostat":0,"mirostat_tau":null,"mirostat_eta":null,"penalize_newline":null,"stop":[],"numa":null,"num_ctx":2048,"num_batch":null,"num_gqa":null,"num_gpu":null,"main_gpu":null,"low_vram":null,"f16_kv":null,"vocab_only":null,"use_mmap":null,"use_mlock":null,"rope_frequency_base":null,"rope_frequency_scale":null,"num_thread":null},"response_format":{"schema":{
"$schema": "http://json-schema.org/draft-04/schema#",
"title": "<>f__AnonymousType1996685024OfIntegerAndIntegerAndString",
"type": "object",
"additionalProperties": false,
"properties": {
"maxAge": {
"type": "integer",
"format": "int32"
},
"minAge": {
"type": "integer",
"format": "int32"
},
"unit": {
"type": [
"null",
"string"
]
}
}
},"type":"json_object"},"stream":false}
## Extracted:
{ maxAge = 32
minAge = 28
unit = "weeks" }
Extraction with different units
Now try a Dutch text with different units for the minimum and maximum age.
"""
Use schema: { minAge: int, maxAge: int, minAgeUnit: string, maxAgeUnit: string }
What is age range mentioned in the text between '''
'''
paracetamol
Oraal: Bij milde tot matige pijn en/of koorts: volgens het Kinderformularium van het NKFK bij een leeftijd van 1 maand–18 jaar: 10–15 mg/kg lichaamsgewicht per keer, zo nodig 4×/dag, max. 60 mg/kg/dag en max. 4 g/dag.
'''
Respond in JSON
"""
|> Message.user
|> Ollama.json<{| minAge: int; maxAge: int; minAgeUnit: string; maxAgeUnit: string |}>
    Ollama.Models.llama2
    []
|> Async.RunSynchronously
|> extraction
ℹ INFO:
EndPoint: http://localhost:11434/api/chat
Payload:
{"format":"json","messages":[{"content":"\nUse schema: { minAge: int, maxAge: int, minAgeUnit: string, maxAgeUnit: string }\nWhat is age range mentioned in the text between '''\n\n'''\nparacetamol\nOraal: Bij milde tot matige pijn en/of koorts: volgens het Kinderformularium van het NKFK bij een leeftijd van 1 maand–18 jaar: 10–15 mg/kg lichaamsgewicht per keer, zo nodig 4×/dag, max. 60 mg/kg/dag en max. 4 g/dag.\n'''\n\nRespond in JSON\n","role":"user"}],"model":"llama2","options":{"num_keep":null,"seed":101,"num_predict":null,"top_k":null,"top_p":null,"tfs_z":null,"typical_p":null,"repeat_last_n":64,"temperature":0.0,"repeat_penalty":null,"presence_penalty":null,"frequency_penalty":null,"mirostat":0,"mirostat_tau":null,"mirostat_eta":null,"penalize_newline":null,"stop":[],"numa":null,"num_ctx":2048,"num_batch":null,"num_gqa":null,"num_gpu":null,"main_gpu":null,"low_vram":null,"f16_kv":null,"vocab_only":null,"use_mmap":null,"use_mlock":null,"rope_frequency_base":null,"rope_frequency_scale":null,"num_thread":null},"response_format":{"schema":{
"$schema": "http://json-schema.org/draft-04/schema#",
"title": "<>f__AnonymousType2405761160OfIntegerAndStringAndIntegerAndString",
"type": "object",
"additionalProperties": false,
"properties": {
"maxAge": {
"type": "integer",
"format": "int32"
},
"maxAgeUnit": {
"type": [
"null",
"string"
]
},
"minAge": {
"type": "integer",
"format": "int32"
},
"minAgeUnit": {
"type": [
"null",
"string"
]
}
}
},"type":"json_object"},"stream":false}
## Extracted:
{ maxAge = 18
maxAgeUnit = "dag"
minAge = 1
minAgeUnit = "maand" }
Use a more explicit structure
The flat structure above returned "dag" as the unit of the maximum age, although the text says 18 jaar. A more explicit structure carries more semantic meaning. The structure below is an explicit range with a min and a max object, each containing an age structure.
"""
Use schema: { ageRange : { minAge: { age: int, unit: string }, maxAge: { age: int, unit: string } } }
What is age range mentioned in the text between '''
'''
paracetamol
Oraal: Bij milde tot matige pijn en/of koorts: volgens het Kinderformularium van het NKFK bij een leeftijd van 1 maand–18 jaar: 10–15 mg/kg lichaamsgewicht per keer, zo nodig 4×/dag, max. 60 mg/kg/dag en max. 4 g/dag.
'''
Respond in JSON
"""
|> Message.user
|> Ollama.json<{| ageRange : {| minAge: {| age: int; unit: string |}; maxAge: {| age: int; unit: string |} |} |} >
    Ollama.Models.llama2
    []
|> Async.RunSynchronously
|> extraction
ℹ INFO:
EndPoint: http://localhost:11434/api/chat
Payload:
{"format":"json","messages":[{"content":"\nUse schema: { ageRange : { minAge: { age: int, unit: string }, maxAge: { age: int, unit: string } } }\nWhat is age range mentioned in the text between '''\n\n'''\nparacetamol\nOraal: Bij milde tot matige pijn en/of koorts: volgens het Kinderformularium van het NKFK bij een leeftijd van 1 maand–18 jaar: 10–15 mg/kg lichaamsgewicht per keer, zo nodig 4×/dag, max. 60 mg/kg/dag en max. 4 g/dag.\n'''\n\nRespond in JSON\n","role":"user"}],"model":"llama2","options":{"num_keep":null,"seed":101,"num_predict":null,"top_k":null,"top_p":null,"tfs_z":null,"typical_p":null,"repeat_last_n":64,"temperature":0.0,"repeat_penalty":null,"presence_penalty":null,"frequency_penalty":null,"mirostat":0,"mirostat_tau":null,"mirostat_eta":null,"penalize_newline":null,"stop":[],"numa":null,"num_ctx":2048,"num_batch":null,"num_gqa":null,"num_gpu":null,"main_gpu":null,"low_vram":null,"f16_kv":null,"vocab_only":null,"use_mmap":null,"use_mlock":null,"rope_frequency_base":null,"rope_frequency_scale":null,"num_thread":null},"response_format":{"schema":{
"$schema": "http://json-schema.org/draft-04/schema#",
"title": "<>f__AnonymousType535326819Of<>f__AnonymousType2485855837Of<>f__AnonymousType3030620811OfIntegerAndStringAnd<>f__AnonymousType3030620811OfIntegerAndString",
"type": "object",
"additionalProperties": false,
"properties": {
"ageRange": {
"oneOf": [
{
"type": "null"
},
{
"$ref": "#/definitions/OfF__AnonymousType2485855837OfOfF__AnonymousType3030620811OfIntegerAndStringAndOfF__AnonymousType3030620811OfIntegerAndString"
}
]
}
},
"definitions": {
"OfF__AnonymousType2485855837OfOfF__AnonymousType3030620811OfIntegerAndStringAndOfF__AnonymousType3030620811OfIntegerAndString": {
"type": "object",
"additionalProperties": false,
"properties": {
"maxAge": {
"oneOf": [
{
"type": "null"
},
{
"$ref": "#/definitions/OfF__AnonymousType3030620811OfIntegerAndString"
}
]
},
"minAge": {
"oneOf": [
{
"type": "null"
},
{
"$ref": "#/definitions/OfF__AnonymousType3030620811OfIntegerAndString"
}
]
}
}
},
"OfF__AnonymousType3030620811OfIntegerAndString": {
"type": "object",
"additionalProperties": false,
"properties": {
"age": {
"type": "integer",
"format": "int32"
},
"unit": {
"type": [
"null",
"string"
]
}
}
}
}
},"type":"json_object"},"stream":false}
## Extracted:
{ ageRange = { maxAge = { age = 18
unit = "jaar" }
minAge = { age = 1
unit = "maand" } } }
A more demanding extraction with different units
Now try to extract 6 months to 1 year. Comparing the bare numbers, 6 > 1, but once the units are taken into account the range is of course in the right order.
"""
Use schema: { ageRange : { minAge: { age: int, unit: string }, maxAge: { age: int, unit: string } } }
What is age range mentioned in the text between '''
'''
paracetamol
Oraal: Bij milde tot matige pijn en/of koorts: volgens het Kinderformularium van het NKFK bij een leeftijd van 6 maanden – 1 jaar: 10–15 mg/kg lichaamsgewicht per keer, zo nodig 4×/dag, max. 60 mg/kg/dag en max. 4 g/dag.
'''
Respond in JSON
"""
|> Message.user
|> Ollama.json<{| ageRange : {| minAge: {| age: int; unit: string |}; maxAge: {| age: int; unit: string |} |} |} >
    Ollama.Models.llama2
    []
|> Async.RunSynchronously
|> extraction
ℹ INFO:
EndPoint: http://localhost:11434/api/chat
Payload:
{"format":"json","messages":[{"content":"\nUse schema: { ageRange : { minAge: { age: int, unit: string }, maxAge: { age: int, unit: string } } }\nWhat is age range mentioned in the text between '''\n\n'''\nparacetamol\nOraal: Bij milde tot matige pijn en/of koorts: volgens het Kinderformularium van het NKFK bij een leeftijd van 6 maanden – 1 jaar: 10–15 mg/kg lichaamsgewicht per keer, zo nodig 4×/dag, max. 60 mg/kg/dag en max. 4 g/dag.\n'''\n\nRespond in JSON\n","role":"user"}],"model":"llama2","options":{"num_keep":null,"seed":101,"num_predict":null,"top_k":null,"top_p":null,"tfs_z":null,"typical_p":null,"repeat_last_n":64,"temperature":0.0,"repeat_penalty":null,"presence_penalty":null,"frequency_penalty":null,"mirostat":0,"mirostat_tau":null,"mirostat_eta":null,"penalize_newline":null,"stop":[],"numa":null,"num_ctx":2048,"num_batch":null,"num_gqa":null,"num_gpu":null,"main_gpu":null,"low_vram":null,"f16_kv":null,"vocab_only":null,"use_mmap":null,"use_mlock":null,"rope_frequency_base":null,"rope_frequency_scale":null,"num_thread":null},"response_format":{"schema":{
"$schema": "http://json-schema.org/draft-04/schema#",
"title": "<>f__AnonymousType535326819Of<>f__AnonymousType2485855837Of<>f__AnonymousType3030620811OfIntegerAndStringAnd<>f__AnonymousType3030620811OfIntegerAndString",
"type": "object",
"additionalProperties": false,
"properties": {
"ageRange": {
"oneOf": [
{
"type": "null"
},
{
"$ref": "#/definitions/OfF__AnonymousType2485855837OfOfF__AnonymousType3030620811OfIntegerAndStringAndOfF__AnonymousType3030620811OfIntegerAndString"
}
]
}
},
"definitions": {
"OfF__AnonymousType2485855837OfOfF__AnonymousType3030620811OfIntegerAndStringAndOfF__AnonymousType3030620811OfIntegerAndString": {
"type": "object",
"additionalProperties": false,
"properties": {
"maxAge": {
"oneOf": [
{
"type": "null"
},
{
"$ref": "#/definitions/OfF__AnonymousType3030620811OfIntegerAndString"
}
]
},
"minAge": {
"oneOf": [
{
"type": "null"
},
{
"$ref": "#/definitions/OfF__AnonymousType3030620811OfIntegerAndString"
}
]
}
}
},
"OfF__AnonymousType3030620811OfIntegerAndString": {
"type": "object",
"additionalProperties": false,
"properties": {
"age": {
"type": "integer",
"format": "int32"
},
"unit": {
"type": [
"null",
"string"
]
}
}
}
}
},"type":"json_object"},"stream":false}
## Extracted:
{ ageRange = { maxAge = { age = 12
unit = "maanden" }
minAge = { age = 6
unit = "maanden" } } }
Surprise! It figured out that 1 year = 12 months, so the max age is indeed 12 months, i.e. 1 year. At first glance I actually thought the LLM got it wrong ;-)
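Results like these can also be sanity-checked in code rather than by eye: convert both ends of the range to a common unit and confirm that the minimum does not exceed the maximum. The following is a minimal sketch under stated assumptions. The Age type, the daysPerUnit table with its Dutch unit names, and the approximate factors (30 days per month, 365 per year) are all made up for this illustration and are not part of the notebook.

```fsharp
// Assumed record shape, mirroring the { age; unit } structure in the prompt.
type Age = { age: int; unit: string }

// Approximate day counts per unit; names and factors are assumptions.
let daysPerUnit =
    Map.ofList [
        "dag", 1
        "dagen", 1
        "week", 7
        "weken", 7
        "maand", 30
        "maanden", 30
        "jaar", 365
    ]

// Convert an age to days; None when the unit is unknown.
let toDays (a: Age) =
    daysPerUnit
    |> Map.tryFind (a.unit.ToLowerInvariant())
    |> Option.map (fun days -> a.age * days)

// A range is plausible when both units are known and min <= max in days.
let isPlausible (minAge: Age) (maxAge: Age) =
    match toDays minAge, toDays maxAge with
    | Some lo, Some hi -> lo <= hi
    | _ -> false

// 6 maanden to 1 jaar: 180 <= 365
printfn "%b" (isPlausible { age = 6; unit = "maanden" } { age = 1; unit = "jaar" })  // true
// 1 maand to 18 dag, as returned by the flat schema earlier: 30 > 18
printfn "%b" (isPlausible { age = 1; unit = "maand" } { age = 18; unit = "dag" })    // false
```

A check like this would have flagged the earlier "18 dag" result, while accepting both "1 jaar" and "12 maanden" as a valid upper bound for a range starting at 6 maanden.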
Extract a dose structure
Let's try to extract a dose from a text.
"""
Use schema: { maxDose: float, unit: string }
What is the max dose mentioned in the text between '''
'''
paracetamol
Oraal: Bij milde tot matige pijn en/of koorts: volgens het Kinderformularium van het NKFK bij een leeftijd van 1 maand–18 jaar: 10–15 mg/kg lichaamsgewicht per keer, zo nodig 4×/dag, max. 60 mg/kg/dag en max. 4 g/dag.
'''
Respond in JSON
"""
|> Message.user
|> Ollama.json<{| maxDose: int; unit: string |}>
    Ollama.Models.llama2
    []
|> Async.RunSynchronously
|> extraction
ℹ INFO:
EndPoint: http://localhost:11434/api/chat
Payload:
{"format":"json","messages":[{"content":"\nUse schema: { maxDose: float, unit: string }\nWhat is the max dose mentioned in the text between '''\n\n'''\nparacetamol\nOraal: Bij milde tot matige pijn en/of koorts: volgens het Kinderformularium van het NKFK bij een leeftijd van 1 maand–18 jaar: 10–15 mg/kg lichaamsgewicht per keer, zo nodig 4×/dag, max. 60 mg/kg/dag en max. 4 g/dag.\n'''\n\nRespond in JSON\n","role":"user"}],"model":"llama2","options":{"num_keep":null,"seed":101,"num_predict":null,"top_k":null,"top_p":null,"tfs_z":null,"typical_p":null,"repeat_last_n":64,"temperature":0.0,"repeat_penalty":null,"presence_penalty":null,"frequency_penalty":null,"mirostat":0,"mirostat_tau":null,"mirostat_eta":null,"penalize_newline":null,"stop":[],"numa":null,"num_ctx":2048,"num_batch":null,"num_gqa":null,"num_gpu":null,"main_gpu":null,"low_vram":null,"f16_kv":null,"vocab_only":null,"use_mmap":null,"use_mlock":null,"rope_frequency_base":null,"rope_frequency_scale":null,"num_thread":null},"response_format":{"schema":{
"$schema": "http://json-schema.org/draft-04/schema#",
"title": "<>f__AnonymousType3097151138OfIntegerAndString",
"type": "object",
"additionalProperties": false,
"properties": {
"maxDose": {
"type": "integer",
"format": "int32"
},
"unit": {
"type": [
"null",
"string"
]
}
}
},"type":"json_object"},"stream":false}
## Extracted:
{ maxDose = 60
unit = "mg/kg/day" }
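Note that the prompt above declares maxDose as a float, while the F# type parameter uses int. For 60 this makes no difference, but an integer field cannot hold a fractional dose. The sketch below illustrates the point with System.Text.Json directly, not with the notebook's library. The helper names are hypothetical and the 7.5 value is invented purely for illustration.

```fsharp
open System.Text.Json

// Hypothetical helper: read maxDose as an int,
// None when the JSON number is not a whole 32-bit integer.
let tryMaxDoseAsInt (json: string) =
    use doc = JsonDocument.Parse(json)
    match doc.RootElement.GetProperty("maxDose").TryGetInt32() with
    | true, v -> Some v
    | false, _ -> None

// Hypothetical helper: read maxDose as a float.
let maxDoseAsFloat (json: string) =
    use doc = JsonDocument.Parse(json)
    doc.RootElement.GetProperty("maxDose").GetDouble()

// Same shape as the extracted result above.
let whole = """{ "maxDose": 60, "unit": "mg/kg/day" }"""
// Invented example of a fractional dose.
let fractional = """{ "maxDose": 7.5, "unit": "mg/kg" }"""

printfn "%A" (tryMaxDoseAsInt whole)       // the whole number fits an int
printfn "%A" (tryMaxDoseAsInt fractional)  // no int value can be read
printfn "%f" (maxDoseAsFloat fractional)   // a float field keeps the fraction
```

For dose rules, a float (or decimal) field in both the prompt schema and the F# type is therefore the safer choice.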