Using a local LLM to systematically extract structured data


Below is an excerpt from an online notebook in which I show how local large language models (LLMs) can extract structured data from free text, a promising and powerful application of LLM technology.

To host a whole list of LLMs on my own machine I use Ollama, a fantastic application that runs locally, so there are no privacy concerns and no hidden costs.

One prime example is my work on #GenPRES. That project requires precise dose rules, a significant portion of which normally has to be extracted manually from unstructured text. An LLM can automate much of this laborious and error-prone process.

Using Ollama to get structured output

Asking the model for structured output makes the extracted data easier to parse and to validate.

First, set up the notebook.

#load "load.fsx"

open Informedica.OpenAI.Lib
open Ollama.Operators

// Print the extracted value, or the error when the extraction fails
let extraction = function
| Ok x -> printfn $"## Extracted:\n{x}"
| Error e -> printfn $"## Extraction failed:\n{e}"
        

Define a schema and a type for the output

The json function returns a value of the type passed as its type parameter. However, due to limitations of Ollama, the schema also has to be added to the prompt.

"""
Use schema: { number: int; unit: string }
What is the minimal corrected gestational age mentioned in the text between '''

'''A neonate 28 weeks to 32 weeks corrected gestational age.'''

Reply in JSON."""
|> Message.user
|> Ollama.json<{| number: int; unit: string |}>
    Ollama.Models.llama2
    []
|> Async.RunSynchronously
|> extraction
        
ℹ INFO: 
EndPoint: http://localhost:11434/api/chat
Payload:
{"format":"json","messages":[{"content":"\nUse schema: { number: int; unit: string }\nWhat is the minimal corrected gestational age mentioned in the text between '''\n\n'''A neonate 28 weeks to 32 weeks corrected gestational age.'''\n\nReply in JSON.","role":"user"}],"model":"llama2","options":{"num_keep":null,"seed":101,"num_predict":null,"top_k":null,"top_p":null,"tfs_z":null,"typical_p":null,"repeat_last_n":64,"temperature":0.0,"repeat_penalty":null,"presence_penalty":null,"frequency_penalty":null,"mirostat":0,"mirostat_tau":null,"mirostat_eta":null,"penalize_newline":null,"stop":[],"numa":null,"num_ctx":2048,"num_batch":null,"num_gqa":null,"num_gpu":null,"main_gpu":null,"low_vram":null,"f16_kv":null,"vocab_only":null,"use_mmap":null,"use_mlock":null,"rope_frequency_base":null,"rope_frequency_scale":null,"num_thread":null},"response_format":{"schema":{
  "$schema": "http://json-schema.org/draft-04/schema#",
  "title": "<>f__AnonymousType3408661278OfIntegerAndString",
  "type": "object",
  "additionalProperties": false,
  "properties": {
    "number": {
      "type": "integer",
      "format": "int32"
    },
    "unit": {
      "type": [
        "null",
        "string"
      ]
    }
  }
},"type":"json_object"},"stream":false}

## Extracted:
{ number = 28
  unit = "weeks" }
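
Because the schema has to be written twice, once in the prompt and once as the type parameter, the two can drift apart. The following is a minimal sketch of one way to derive the prompt text from the type with reflection. It assumes a named record instead of the anonymous record used above; `AgeValue`, `typeName` and `promptSchema` are hypothetical names and not part of Informedica.OpenAI.Lib.

```fsharp
open System
open Microsoft.FSharp.Reflection

// Hypothetical named record mirroring the anonymous record used above
type AgeValue = { number: int; unit: string }

// Map a few .NET types to the names used in the prompt schema
let typeName (t: Type) =
    if t = typeof<int> then "int"
    elif t = typeof<float> then "float"
    elif t = typeof<string> then "string"
    elif t = typeof<bool> then "bool"
    else t.Name

// Build "{ field: type; ... }" from the fields of a record type
let promptSchema<'T> () =
    FSharpType.GetRecordFields(typeof<'T>)
    |> Array.map (fun p -> sprintf "%s: %s" p.Name (typeName p.PropertyType))
    |> String.concat "; "
    |> sprintf "{ %s }"

let schema = promptSchema<AgeValue> ()
printfn "Use schema: %s" schema
```

The resulting string could then be spliced into the prompt, so that a change to the record automatically changes the schema line as well.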
        

Getting the maximum age as a json structure

The above output is correct. Now try to get the maximum age.

"""
Use schema: { number: int; unit: string }
What is the maximum corrected gestational age mentioned in the text between '''

'''A neonate 28 weeks to 32 weeks corrected gestational age.'''

Reply in JSON."""
|> Message.user
|> Ollama.json<{| number: int; unit: string |}>
    Ollama.Models.llama2
    []
|> Async.RunSynchronously
|> extraction
        
ℹ INFO: 
EndPoint: http://localhost:11434/api/chat
Payload:
{"format":"json","messages":[{"content":"\nUse schema: { number: int; unit: string }\nWhat is the maximum corrected gestational age mentioned in the text between '''\n\n'''A neonate 28 weeks to 32 weeks corrected gestational age.'''\n\nReply in JSON.","role":"user"}],"model":"llama2","options":{"num_keep":null,"seed":101,"num_predict":null,"top_k":null,"top_p":null,"tfs_z":null,"typical_p":null,"repeat_last_n":64,"temperature":0.0,"repeat_penalty":null,"presence_penalty":null,"frequency_penalty":null,"mirostat":0,"mirostat_tau":null,"mirostat_eta":null,"penalize_newline":null,"stop":[],"numa":null,"num_ctx":2048,"num_batch":null,"num_gqa":null,"num_gpu":null,"main_gpu":null,"low_vram":null,"f16_kv":null,"vocab_only":null,"use_mmap":null,"use_mlock":null,"rope_frequency_base":null,"rope_frequency_scale":null,"num_thread":null},"response_format":{"schema":{
  "$schema": "http://json-schema.org/draft-04/schema#",
  "title": "<>f__AnonymousType3408661278OfIntegerAndString",
  "type": "object",
  "additionalProperties": false,
  "properties": {
    "number": {
      "type": "integer",
      "format": "int32"
    },
    "unit": {
      "type": [
        "null",
        "string"
      ]
    }
  }
},"type":"json_object"},"stream":false}

## Extracted:
{ number = 28
  unit = "weeks" }
        

Try a different structured output

Somehow, the prompt is misunderstood and the minimum age (28 weeks) is returned instead of the maximum age (32 weeks).

Let's try again using a more complex structure.

"""
Use schema: { minAge: int; maxAge: int; unit: string }
What is corrected gestational age range mentioned in the text between '''

'''A neonate 28 weeks to 32 weeks corrected gestational age.'''

Reply in JSON."""
|> Message.user
|> Ollama.json<{| minAge: int; maxAge: int; unit: string |}>
    Ollama.Models.llama2
    []
|> Async.RunSynchronously
|> extraction
        
ℹ INFO: 
EndPoint: http://localhost:11434/api/chat
Payload:
{"format":"json","messages":[{"content":"\nUse schema: { minAge: int; maxAge: int; unit: string }\nWhat is corrected gestational age range mentioned in the text between '''\n\n'''A neonate 28 weeks to 32 weeks corrected gestational age.'''\n\nReply in JSON.","role":"user"}],"model":"llama2","options":{"num_keep":null,"seed":101,"num_predict":null,"top_k":null,"top_p":null,"tfs_z":null,"typical_p":null,"repeat_last_n":64,"temperature":0.0,"repeat_penalty":null,"presence_penalty":null,"frequency_penalty":null,"mirostat":0,"mirostat_tau":null,"mirostat_eta":null,"penalize_newline":null,"stop":[],"numa":null,"num_ctx":2048,"num_batch":null,"num_gqa":null,"num_gpu":null,"main_gpu":null,"low_vram":null,"f16_kv":null,"vocab_only":null,"use_mmap":null,"use_mlock":null,"rope_frequency_base":null,"rope_frequency_scale":null,"num_thread":null},"response_format":{"schema":{
  "$schema": "http://json-schema.org/draft-04/schema#",
  "title": "<>f__AnonymousType1996685024OfIntegerAndIntegerAndString",
  "type": "object",
  "additionalProperties": false,
  "properties": {
    "maxAge": {
      "type": "integer",
      "format": "int32"
    },
    "minAge": {
      "type": "integer",
      "format": "int32"
    },
    "unit": {
      "type": [
        "null",
        "string"
      ]
    }
  }
},"type":"json_object"},"stream":false}

## Extracted:
{ maxAge = 32
  minAge = 28
  unit = "weeks" }
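
The wrong maximum returned earlier suggests that an extracted value should not be accepted without a check. For a text as regular as this one, a deterministic cross-check is possible. The sketch below is only an illustration under that assumption: it collects every "<number> weeks" mention with a regular expression and compares the smallest and largest with the extracted range. `WeekRange`, `weeksInText` and `checkRange` are hypothetical names.

```fsharp
open System.Text.RegularExpressions

// Hypothetical record standing in for the extracted anonymous record
type WeekRange = { minAge: int; maxAge: int; unit: string }

// Collect every "<number> weeks" mention in the source text
let weeksInText (text: string) =
    Regex.Matches(text, @"(\d+)\s+weeks")
    |> Seq.cast<Match>
    |> Seq.map (fun m -> int (m.Groups.[1].Value))
    |> Seq.toList

// Compare the extracted range with the smallest and largest mention
let checkRange (text: string) (r: WeekRange) =
    match weeksInText text with
    | [] -> Error "no week values found in the text"
    | ws when r.minAge = List.min ws && r.maxAge = List.max ws -> Ok r
    | ws -> Error (sprintf "expected %i-%i weeks, got %i-%i" (List.min ws) (List.max ws) r.minAge r.maxAge)

let text = "A neonate 28 weeks to 32 weeks corrected gestational age."

// The range extracted above passes the check
checkRange text { minAge = 28; maxAge = 32; unit = "weeks" } |> printfn "%A"
// A range that repeats the minimum as the maximum is rejected
checkRange text { minAge = 28; maxAge = 28; unit = "weeks" } |> printfn "%A"
```

Such a check only works when the wording is predictable, which is exactly the situation where an LLM is least needed, so it is better seen as a test harness for prompts than as a replacement for them.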
        

Extraction with different units

Now try a Dutch text with different units for the minimum and maximum age.

"""
Use schema: { minAge: int, maxAge: int, minAgeUnit: string, maxAgeUnit: string }
What is age range mentioned in the text between '''

'''
paracetamol
Oraal: Bij milde tot matige pijn en/of koorts: volgens het Kinderformularium van het NKFK bij een leeftijd van 1 maand–18 jaar: 10–15 mg/kg lichaamsgewicht per keer, zo nodig 4×/dag, max. 60 mg/kg/dag en max. 4 g/dag.
'''

Respond in JSON
"""
|> Message.user
|> Ollama.json<{| minAge: int; maxAge: int; minAgeUnit: string; maxAgeUnit: string |}>
    Ollama.Models.llama2
    []
|> Async.RunSynchronously
|> extraction
        
ℹ INFO: 
EndPoint: http://localhost:11434/api/chat
Payload:
{"format":"json","messages":[{"content":"\nUse schema: { minAge: int, maxAge: int, minAgeUnit: string, maxAgeUnit: string }\nWhat is age range mentioned in the text between '''\n\n'''\nparacetamol\nOraal: Bij milde tot matige pijn en/of koorts: volgens het Kinderformularium van het NKFK bij een leeftijd van 1 maand–18 jaar: 10–15 mg/kg lichaamsgewicht per keer, zo nodig 4×/dag, max. 60 mg/kg/dag en max. 4 g/dag.\n'''\n\nRespond in JSON\n","role":"user"}],"model":"llama2","options":{"num_keep":null,"seed":101,"num_predict":null,"top_k":null,"top_p":null,"tfs_z":null,"typical_p":null,"repeat_last_n":64,"temperature":0.0,"repeat_penalty":null,"presence_penalty":null,"frequency_penalty":null,"mirostat":0,"mirostat_tau":null,"mirostat_eta":null,"penalize_newline":null,"stop":[],"numa":null,"num_ctx":2048,"num_batch":null,"num_gqa":null,"num_gpu":null,"main_gpu":null,"low_vram":null,"f16_kv":null,"vocab_only":null,"use_mmap":null,"use_mlock":null,"rope_frequency_base":null,"rope_frequency_scale":null,"num_thread":null},"response_format":{"schema":{
  "$schema": "http://json-schema.org/draft-04/schema#",
  "title": "<>f__AnonymousType2405761160OfIntegerAndStringAndIntegerAndString",
  "type": "object",
  "additionalProperties": false,
  "properties": {
    "maxAge": {
      "type": "integer",
      "format": "int32"
    },
    "maxAgeUnit": {
      "type": [
        "null",
        "string"
      ]
    },
    "minAge": {
      "type": "integer",
      "format": "int32"
    },
    "minAgeUnit": {
      "type": [
        "null",
        "string"
      ]
    }
  }
},"type":"json_object"},"stream":false}

## Extracted:
{ maxAge = 18
  maxAgeUnit = "dag"
  minAge = 1
  minAgeUnit = "maand" }
        

Use a more explicit structure

The flat structure above returned "dag" as the unit of the maximum age, although the text says 18 jaar. A more explicit structure carries more semantic meaning. The structure below is an explicit range with a min and a max object, each containing an age structure.

"""
Use schema: { ageRange : { minAge: { age: int, unit: string }, maxAge: { age: int, unit: string } } }
What is age range mentioned in the text between '''

'''
paracetamol
Oraal: Bij milde tot matige pijn en/of koorts: volgens het Kinderformularium van het NKFK bij een leeftijd van 1 maand–18 jaar: 10–15 mg/kg lichaamsgewicht per keer, zo nodig 4×/dag, max. 60 mg/kg/dag en max. 4 g/dag.
'''

Respond in JSON
"""
|> Message.user
|> Ollama.json<{| ageRange : {| minAge: {| age: int; unit: string |}; maxAge: {| age: int; unit: string |} |} |} >
    Ollama.Models.llama2
    []
|> Async.RunSynchronously
|> extraction
        
ℹ INFO: 
EndPoint: http://localhost:11434/api/chat
Payload:
{"format":"json","messages":[{"content":"\nUse schema: { ageRange : { minAge: { age: int, unit: string }, maxAge: { age: int, unit: string } } }\nWhat is age range mentioned in the text between '''\n\n'''\nparacetamol\nOraal: Bij milde tot matige pijn en/of koorts: volgens het Kinderformularium van het NKFK bij een leeftijd van 1 maand–18 jaar: 10–15 mg/kg lichaamsgewicht per keer, zo nodig 4×/dag, max. 60 mg/kg/dag en max. 4 g/dag.\n'''\n\nRespond in JSON\n","role":"user"}],"model":"llama2","options":{"num_keep":null,"seed":101,"num_predict":null,"top_k":null,"top_p":null,"tfs_z":null,"typical_p":null,"repeat_last_n":64,"temperature":0.0,"repeat_penalty":null,"presence_penalty":null,"frequency_penalty":null,"mirostat":0,"mirostat_tau":null,"mirostat_eta":null,"penalize_newline":null,"stop":[],"numa":null,"num_ctx":2048,"num_batch":null,"num_gqa":null,"num_gpu":null,"main_gpu":null,"low_vram":null,"f16_kv":null,"vocab_only":null,"use_mmap":null,"use_mlock":null,"rope_frequency_base":null,"rope_frequency_scale":null,"num_thread":null},"response_format":{"schema":{
  "$schema": "http://json-schema.org/draft-04/schema#",
  "title": "<>f__AnonymousType535326819Of<>f__AnonymousType2485855837Of<>f__AnonymousType3030620811OfIntegerAndStringAnd<>f__AnonymousType3030620811OfIntegerAndString",
  "type": "object",
  "additionalProperties": false,
  "properties": {
    "ageRange": {
      "oneOf": [
        {
          "type": "null"
        },
        {
          "$ref": "#/definitions/OfF__AnonymousType2485855837OfOfF__AnonymousType3030620811OfIntegerAndStringAndOfF__AnonymousType3030620811OfIntegerAndString"
        }
      ]
    }
  },
  "definitions": {
    "OfF__AnonymousType2485855837OfOfF__AnonymousType3030620811OfIntegerAndStringAndOfF__AnonymousType3030620811OfIntegerAndString": {
      "type": "object",
      "additionalProperties": false,
      "properties": {
        "maxAge": {
          "oneOf": [
            {
              "type": "null"
            },
            {
              "$ref": "#/definitions/OfF__AnonymousType3030620811OfIntegerAndString"
            }
          ]
        },
        "minAge": {
          "oneOf": [
            {
              "type": "null"
            },
            {
              "$ref": "#/definitions/OfF__AnonymousType3030620811OfIntegerAndString"
            }
          ]
        }
      }
    },
    "OfF__AnonymousType3030620811OfIntegerAndString": {
      "type": "object",
      "additionalProperties": false,
      "properties": {
        "age": {
          "type": "integer",
          "format": "int32"
        },
        "unit": {
          "type": [
            "null",
            "string"
          ]
        }
      }
    }
  }
},"type":"json_object"},"stream":false}

## Extracted:
{ ageRange = { maxAge = { age = 18
                          unit = "jaar" }
               minAge = { age = 1
                          unit = "maand" } } }
        

A more demanding extraction with different units

Try to extract 6 months - 1 year. Naively 6 > 1, but once the units are taken into account that is of course not so!

"""
Use schema: { ageRange : { minAge: { age: int, unit: string }, maxAge: { age: int, unit: string } } }
What is age range mentioned in the text between '''

'''
paracetamol
Oraal: Bij milde tot matige pijn en/of koorts: volgens het Kinderformularium van het NKFK bij een leeftijd van 6 maanden – 1 jaar: 10–15 mg/kg lichaamsgewicht per keer, zo nodig 4×/dag, max. 60 mg/kg/dag en max. 4 g/dag.
'''

Respond in JSON
"""
|> Message.user
|> Ollama.json<{| ageRange : {| minAge: {| age: int; unit: string |}; maxAge: {| age: int; unit: string |} |} |} >
    Ollama.Models.llama2
    []
|> Async.RunSynchronously
|> extraction
        
ℹ INFO: 
EndPoint: http://localhost:11434/api/chat
Payload:
{"format":"json","messages":[{"content":"\nUse schema: { ageRange : { minAge: { age: int, unit: string }, maxAge: { age: int, unit: string } } }\nWhat is age range mentioned in the text between '''\n\n'''\nparacetamol\nOraal: Bij milde tot matige pijn en/of koorts: volgens het Kinderformularium van het NKFK bij een leeftijd van 6 maanden – 1 jaar: 10–15 mg/kg lichaamsgewicht per keer, zo nodig 4×/dag, max. 60 mg/kg/dag en max. 4 g/dag.\n'''\n\nRespond in JSON\n","role":"user"}],"model":"llama2","options":{"num_keep":null,"seed":101,"num_predict":null,"top_k":null,"top_p":null,"tfs_z":null,"typical_p":null,"repeat_last_n":64,"temperature":0.0,"repeat_penalty":null,"presence_penalty":null,"frequency_penalty":null,"mirostat":0,"mirostat_tau":null,"mirostat_eta":null,"penalize_newline":null,"stop":[],"numa":null,"num_ctx":2048,"num_batch":null,"num_gqa":null,"num_gpu":null,"main_gpu":null,"low_vram":null,"f16_kv":null,"vocab_only":null,"use_mmap":null,"use_mlock":null,"rope_frequency_base":null,"rope_frequency_scale":null,"num_thread":null},"response_format":{"schema":{
  "$schema": "http://json-schema.org/draft-04/schema#",
  "title": "<>f__AnonymousType535326819Of<>f__AnonymousType2485855837Of<>f__AnonymousType3030620811OfIntegerAndStringAnd<>f__AnonymousType3030620811OfIntegerAndString",
  "type": "object",
  "additionalProperties": false,
  "properties": {
    "ageRange": {
      "oneOf": [
        {
          "type": "null"
        },
        {
          "$ref": "#/definitions/OfF__AnonymousType2485855837OfOfF__AnonymousType3030620811OfIntegerAndStringAndOfF__AnonymousType3030620811OfIntegerAndString"
        }
      ]
    }
  },
  "definitions": {
    "OfF__AnonymousType2485855837OfOfF__AnonymousType3030620811OfIntegerAndStringAndOfF__AnonymousType3030620811OfIntegerAndString": {
      "type": "object",
      "additionalProperties": false,
      "properties": {
        "maxAge": {
          "oneOf": [
            {
              "type": "null"
            },
            {
              "$ref": "#/definitions/OfF__AnonymousType3030620811OfIntegerAndString"
            }
          ]
        },
        "minAge": {
          "oneOf": [
            {
              "type": "null"
            },
            {
              "$ref": "#/definitions/OfF__AnonymousType3030620811OfIntegerAndString"
            }
          ]
        }
      }
    },
    "OfF__AnonymousType3030620811OfIntegerAndString": {
      "type": "object",
      "additionalProperties": false,
      "properties": {
        "age": {
          "type": "integer",
          "format": "int32"
        },
        "unit": {
          "type": [
            "null",
            "string"
          ]
        }
      }
    }
  }
},"type":"json_object"},"stream":false}

## Extracted:
{ ageRange = { maxAge = { age = 12
                          unit = "maanden" }
               minAge = { age = 6
                          unit = "maanden" } } }
        

Surprise! It figured out that 1 year = 12 months, so the max age is indeed 12 months, i.e. 1 year. At first glance I actually thought the LLM got it wrong ;-)
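
Comparing ages that carry different units can also be done in code after the extraction. The following sketch converts each age to an approximate number of days and checks that the minimum does not exceed the maximum. The types, the function names and the list of Dutch unit spellings are assumptions made for this illustration, and the day counts are rough values meant for ordering only.

```fsharp
// Hypothetical types mirroring the nested schema used above
type Age = { age: int; unit: string }
type AgeRange = { minAge: Age; maxAge: Age }

// Approximate number of days per unit, for ordering only.
// The Dutch spellings are an assumption based on the example texts.
let daysPerUnit (u: string) =
    match u.Trim().ToLowerInvariant() with
    | "dag" | "dagen" -> Some 1
    | "week" | "weken" -> Some 7
    | "maand" | "maanden" -> Some 30
    | "jaar" | "jaren" -> Some 365
    | _ -> None

// Convert an age to days, or None when the unit is not recognised
let toDays (a: Age) =
    daysPerUnit a.unit |> Option.map (fun d -> a.age * d)

// Accept the range only when both units are known and min <= max
let checkOrder (r: AgeRange) =
    match toDays r.minAge, toDays r.maxAge with
    | Some lo, Some hi when lo <= hi -> Ok r
    | Some lo, Some hi -> Error (sprintf "min (%i days) exceeds max (%i days)" lo hi)
    | _ -> Error "unknown unit"

// The range returned by the nested schema above
let nested = { minAge = { age = 6; unit = "maanden" }; maxAge = { age = 12; unit = "maanden" } }
// The values returned earlier by the flat schema (1 maand to 18 dag)
let flat = { minAge = { age = 1; unit = "maand" }; maxAge = { age = 18; unit = "dag" } }

printfn "%A" (checkOrder nested)
printfn "%A" (checkOrder flat)
```

With these assumptions the 6 to 12 months range is accepted, while the earlier flat result of 1 maand to 18 dag is rejected because 30 days exceeds 18 days.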

Extract a dose structure

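Let's try to extract a dose from a text. Note that the prompt schema below says float while the type parameter uses int, so the generated JSON schema asks for an integer.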

"""
Use schema: { maxDose: float, unit: string }
What is the max dose mentioned in the text between '''

'''
paracetamol
Oraal: Bij milde tot matige pijn en/of koorts: volgens het Kinderformularium van het NKFK bij een leeftijd van 1 maand–18 jaar: 10–15 mg/kg lichaamsgewicht per keer, zo nodig 4×/dag, max. 60 mg/kg/dag en max. 4 g/dag.
'''

Respond in JSON
"""
|> Message.user
|> Ollama.json<{| maxDose: int; unit: string |}>
    Ollama.Models.llama2
    []
|> Async.RunSynchronously
|> extraction
        
ℹ INFO: 
EndPoint: http://localhost:11434/api/chat
Payload:
{"format":"json","messages":[{"content":"\nUse schema: { maxDose: float, unit: string }\nWhat is the max dose mentioned in the text between '''\n\n'''\nparacetamol\nOraal: Bij milde tot matige pijn en/of koorts: volgens het Kinderformularium van het NKFK bij een leeftijd van 1 maand–18 jaar: 10–15 mg/kg lichaamsgewicht per keer, zo nodig 4×/dag, max. 60 mg/kg/dag en max. 4 g/dag.\n'''\n\nRespond in JSON\n","role":"user"}],"model":"llama2","options":{"num_keep":null,"seed":101,"num_predict":null,"top_k":null,"top_p":null,"tfs_z":null,"typical_p":null,"repeat_last_n":64,"temperature":0.0,"repeat_penalty":null,"presence_penalty":null,"frequency_penalty":null,"mirostat":0,"mirostat_tau":null,"mirostat_eta":null,"penalize_newline":null,"stop":[],"numa":null,"num_ctx":2048,"num_batch":null,"num_gqa":null,"num_gpu":null,"main_gpu":null,"low_vram":null,"f16_kv":null,"vocab_only":null,"use_mmap":null,"use_mlock":null,"rope_frequency_base":null,"rope_frequency_scale":null,"num_thread":null},"response_format":{"schema":{
  "$schema": "http://json-schema.org/draft-04/schema#",
  "title": "<>f__AnonymousType3097151138OfIntegerAndString",
  "type": "object",
  "additionalProperties": false,
  "properties": {
    "maxDose": {
      "type": "integer",
      "format": "int32"
    },
    "unit": {
      "type": [
        "null",
        "string"
      ]
    }
  }
},"type":"json_object"},"stream":false}

## Extracted:
{ maxDose = 60
  unit = "mg/kg/day" }
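
The text actually mentions two maximum doses, 60 mg/kg/dag and 4 g/dag, while the schema has room for only one, and the returned unit is an English rendering of the Dutch original. A rough way to see what the model had to choose from is to list every "max." mention with a regular expression. This is a sketch under the assumption that maxima are always written as "max. <number> <unit>"; `maxDoses` is a hypothetical helper.

```fsharp
open System.Text.RegularExpressions

// Find every "max. <number> <unit>" mention in a dose text.
// Values are kept as strings to avoid culture-dependent number parsing.
let maxDoses (text: string) =
    Regex.Matches(text, @"max\.\s*(\d+(?:[.,]\d+)?)\s*([\w/]+)")
    |> Seq.cast<Match>
    |> Seq.map (fun m -> m.Groups.[1].Value, m.Groups.[2].Value)
    |> Seq.toList

let text =
    "10–15 mg/kg lichaamsgewicht per keer, zo nodig 4×/dag, max. 60 mg/kg/dag en max. 4 g/dag."

// Print each maximum found in the text
maxDoses text |> List.iter (fun (n, u) -> printfn "%s %s" n u)
```

A schema with one field per kind of maximum, or a list of doses, might capture both values, but whether the model fills such a schema reliably would have to be tested.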
        
