Streamlining Intelligent Document Processing: A Guide to Amazon Bedrock Data Automation

In today’s data-driven world, developers face challenges in extracting insights from unstructured content such as documents, images, audio, and videos. Working with foundation models, optimizing performance, and managing multiple AI systems often requires significant time and resources. Additionally, custom intelligent document processing (IDP) solutions and data pipelines are becoming harder to maintain and deploy. As technology evolves rapidly, many struggle to keep up with these changes and ensure their systems remain efficient.

Enter Amazon Bedrock Data Automation (BDA), a new feature within Amazon Bedrock that streamlines the development of generative AI applications and automates workflows involving unstructured multimodal content. This innovative capability offers a unified experience for developers of all skill levels, enabling them to effortlessly automate the extraction, transformation, and generation of insights from their data.

In this article, we'll look at its key features and benefits, and walk through an example of how to use Bedrock Data Automation with the AWS SDK for Python (Boto3) in an intelligent document processing workflow.

Key Features of Amazon Bedrock Data Automation

🔄 Unified Experience

  • Simplify your multimodal data processing with a single, powerful interface.
  • Process text, images, audio, and video seamlessly through the AWS Management Console or SDK.
  • Save time and resources by eliminating the need for multiple models and complex orchestration pipelines.
  • Leverage seamless integration with Amazon Bedrock Knowledge Bases to easily parse visually rich documents, extract meaningful information from unstructured content, and enhance your Retrieval Augmented Generation (RAG) workflows for more relevant responses.

🛠️ Customization

  • Achieve precise, tailored results for your use case.
  • Gain instant insights with pre-configured, modality-specific standard outputs (e.g., detection of toxic visual and audio content, explanations of document charts).
  • Generate custom outputs effortlessly using pre-defined catalog blueprints or custom blueprints for unique file types.
  • Define blueprints with specific fields to extract, data formats, and data transformation and normalization instructions.
  • Have full control over your output, ensuring seamless integration with existing applications and workflows.

🧠 Intelligent Model Orchestration and Output Validation

  • Attain industry-leading accuracy at scale across diverse data types.
  • Bedrock Data Automation automatically selects and combines state-of-the-art foundation models and task-specific models for optimal performance.
  • Ensure reliable outputs with built-in features like visual grounding (bounding boxes) and confidence scoring, enabling easy output auditing and enhancing result trustworthiness.

Intelligent Document Processing with Amazon Bedrock Data Automation

Bedrock Data Automation allows you to automate IDP workflows at scale, without needing to orchestrate complex document processing tasks such as classification, extraction, normalization, or validation.

In many cases, IDP workflows are used against a variety of different document types. For example, an insurance provider may want to automate the processing of a claims packet to streamline their pipeline and improve the accuracy of claims processing. Amazon Bedrock Data Automation simplifies the automation of complex IDP tasks such as document splitting, classification, data extraction, output format normalization and data validation.

Let's use the example of a claims packet which contains the following two documents: a prescription label, and a hospital discharge summary.

Example documents from a sample claims packet

A Bedrock Data Automation workflow consists of the following steps:

1. Define blueprints to generate custom outputs

Custom outputs use blueprints that specify output requirements using natural language or a schema editor. Blueprints can be created using the console or the API, and you can either use a pre-defined catalog blueprint for common document types (e.g., invoices, payslips, W-2s) or create a custom one for your use case.

To create a custom blueprint using the API, you call the CreateBlueprint operation on the Amazon Bedrock Data Automation client. The following example defines the hospital name, hospital contact, visit details, patient details, provider details, assessment details, and discharge summary as properties to extract from the claims packet. By defining instructions for the properties, we also leverage key normalization, guiding BDA to recognize and map different representations of the same fields to a standardized key.

import json

import boto3

# Client for the Bedrock Data Automation build-time operations
bedrock_data_automation_client = boto3.client('bedrock-data-automation')

bda_create_blueprint_response = bedrock_data_automation_client.create_blueprint(
    blueprintName='hospital-discharge-report',
    type='DOCUMENT',
    blueprintStage='LIVE',
    schema=json.dumps({
        "$schema": "http://json-schema.org/draft-07/schema#",
        "description": "A standard discharge summary report used by hospitals, containing details of the patient, medical provider and key facts on the visit, medical assessment and a summary of discharge.",
        "class": "Hospital Discharge Summary",
        "type": "object",
        "definitions": {
            "VisitDetails": {
                "type": "object",
                "properties": {
                    "admitted_date": {
                        "type": "string",
                        "inferenceType": "explicit",
                        "instruction": "Date of admission in MM-DD-YYYY format"
                    },
                    "discharged_date": {
                        "type": "string",
                        "inferenceType": "explicit",
                        "instruction": "Date of discharge in MM-DD-YYYY format"
                    },
                    "discharged_to": {
                        "type": "string",
                        "inferenceType": "explicit",
                        "instruction": "Where the patient was discharged to"
                    }
                }
            },
            "PatientDetails": {
                "type": "object",
                "properties": {
                    "name": {
                        "type": "string",
                        "inferenceType": "explicit",
                        "instruction": "Name of the patient"
                    },
                    "gender": {
                        "type": "string",
                        "inferenceType": "explicit",
                        "instruction": "Gender of the patient"
                    },
                    "patient_id": {
                        "type": "string",
                        "inferenceType": "explicit",
                        "instruction": "Unique id of the patient"
                    }
                }
            },
            "ProviderDetails": {
                "type": "object",
                "properties": {
                    "name": {
                        "type": "string",
                        "inferenceType": "explicit",
                        "instruction": "Name of the provider"
                    },
                    "provider_id": {
                        "type": "string",
                        "inferenceType": "explicit",
                        "instruction": "Unique id of the provider"
                    }
                }
            },
            "AssessmentDetails": {
                "type": "object",
                "properties": {
                    "reported_symptoms": {
                        "type": "string",
                        "inferenceType": "explicit",
                        "instruction": "Reported symptoms and history of present illness"
                    }
                }
            }
        },
        "properties": {
            "hospital_name": {
                "type": "string",
                "inferenceType": "explicit",
                "instruction": "Name of the hospital"
            },
            "hospital_contact": {
                "type": "string",
                "inferenceType": "explicit",
                "instruction": "Contact details of the hospital"
            },
            "visit_details": {
                "$ref": "#/definitions/VisitDetails"
            },
            "patient_details": {
                "$ref": "#/definitions/PatientDetails"
            },
            "provider_details": {
                "$ref": "#/definitions/ProviderDetails"
            },
            "assessment_details": {
                "$ref": "#/definitions/AssessmentDetails"
            },
            "discharge_summary": {
                "type": "string",
                "inferenceType": "explicit",
                "instruction": "Summary of discharge instructions"
            }
        }
    }),
)

The CreateBlueprint response returns the blueprintArn for the discharge summary's custom blueprint:

'blueprintArn: arn:aws:bedrock:us-west-2:<AWS_ACCOUNT_ID>:blueprint/<BLUEPRINT_ID>'        
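If you are scripting blueprint creation, the ARN can be pulled straight from the response object; a minimal sketch, assuming the ARN is nested under a top-level blueprint key (the account ID and blueprint ID below are placeholder values):

```python
def extract_blueprint_arn(create_blueprint_response: dict) -> str:
    """Pull the ARN out of a CreateBlueprint response.

    Assumes the response nests the ARN under 'blueprint' -> 'blueprintArn';
    verify this shape against the SDK reference for your SDK version.
    """
    return create_blueprint_response['blueprint']['blueprintArn']

# Example with a stubbed response:
sample_response = {
    'blueprint': {
        'blueprintArn': 'arn:aws:bedrock:us-west-2:111122223333:blueprint/abc123'
    }
}
print(extract_blueprint_arn(sample_response))  # → arn:aws:bedrock:us-west-2:111122223333:blueprint/abc123
```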

2. Create a Data Automation Project

A project is a grouping of both standard and custom output configurations. This allows you to use a single resource for multiple file types. When processing documents, you may want to use multiple blueprints for different kinds of documents that are passed to your project. If you choose to pass a file that contains multiple documents, you can enable BDA to automatically split the file into individual documents and match each one to the correct blueprint for processing. This allows you to process different types of documents within the same project, each with its own custom extraction logic.

To create a project using the API, you invoke the CreateDataAutomationProject operation. The following is an example of how you can configure custom output using the custom blueprint for the hospital discharge summary we created above and the existing sample blueprint for prescription labels.

bda_bedrock_automation_create_project_response = bedrock_data_automation_client.create_data_automation_project(
    projectName='TEST_PROJECT',
    projectDescription='test BDA project',
    projectStage='LIVE',
    standardOutputConfiguration={
        'document': {
            'outputFormat': {
                'textFormat': {
                    'types': ['PLAIN_TEXT']
                },
                'additionalFileFormat': {
                    'state': 'ENABLED',
                }
            }
        },
    },
    customOutputConfiguration={
        'blueprints': [
            {
                'blueprintArn': 'arn:aws:bedrock:us-west-2:<AWS_ACCOUNT_ID>:blueprint/<BLUEPRINT_ID>'
            },
            {
                'blueprintArn': 'arn:aws:bedrock:us-west-2:aws:blueprint/bedrock-data-automation-public-prescription-label'
            },
        ],
    },
    overrideConfiguration={
        'document': {
            'splitter': {
                'state': 'ENABLED'
            }
        }
    },
)
        

To process different types of documents within a single document package using multiple blueprints in one project, the splitter configuration must be enabled through the API, as shown in the overrideConfiguration at the bottom of the call above.

overrideConfiguration={
    'document': {
        'splitter': {
            'state': 'ENABLED' | 'DISABLED'
        }
    }
},        

The API validates the input configuration, creates a new project, and returns the projectArn.

'arn:aws:bedrock:us-west-2:<AWS_ACCOUNT_ID>:data-automation-project/<PROJECT_ID>'        

3. Invoking the Analysis

Once the project is set up, you can start processing your files. The console allows you to easily test and preview the insights that can be extracted from your content, but these tests can only be performed on one document at a time. To process multiple documents, you can use the InvokeDataAutomationAsync API, which initiates asynchronous processing of files in a specified S3 bucket using the configuration defined in the project, identified by the project's ARN. You specify the input configuration (in this case, the S3 bucket where the claims packet resides) and the output configuration (where you want the results and metadata stored).

# Client for the Bedrock Data Automation runtime operations
bedrock_data_automation_runtime_client = boto3.client('bedrock-data-automation-runtime')

bda_invoke_data_automation_async_response = bedrock_data_automation_runtime_client.invoke_data_automation_async(
    inputConfiguration={'s3Uri': '<S3_URI>'},
    outputConfiguration={'s3Uri': '<S3_URI>'},
    dataAutomationProfileArn='arn:aws:bedrock:us-west-2:<AWS_ACCOUNT_ID>:data-automation-profile/bda.v1:0',
    dataAutomationConfiguration={
        'dataAutomationProjectArn': 'arn:aws:bedrock:us-west-2:<AWS_ACCOUNT_ID>:data-automation-project/<PROJECT_ID>',
        'stage': 'LIVE'
    }
)

This API call returns the invocationArn:

'arn:aws:bedrock:us-west-2:<AWS_ACCOUNT_ID>:data-automation-invocation/<INVOCATION_ID>'        

4. Retrieving the results

Since the processing is asynchronous, you can use the GetDataAutomationStatus API to check the status of the processing job, passing the invocationArn from above.

bda_get_data_automation_status_response = bedrock_data_automation_runtime_client.get_data_automation_status(
    invocationArn='arn:aws:bedrock:us-west-2:<AWS_ACCOUNT_ID>:data-automation-invocation/<INVOCATION_ID>'
)
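In a batch pipeline you would typically poll this status until the job finishes. A minimal polling helper sketch: it takes any callable with the shape of get_data_automation_status (in practice, the runtime client's method), and the 'Created'/'InProgress' status values are an assumption about the job lifecycle, so check the API reference for the full set.

```python
import time

def wait_for_bda_job(get_status, invocation_arn, poll_seconds=10, max_attempts=60):
    """Poll a BDA invocation until it leaves the in-progress states.

    `get_status` is any callable with the signature of
    get_data_automation_status; in practice, pass
    bedrock_data_automation_runtime_client.get_data_automation_status.
    The 'Created'/'InProgress' status values are an assumption about the
    async job model -- verify against the API reference.
    """
    for _ in range(max_attempts):
        response = get_status(invocationArn=invocation_arn)
        if response['status'] not in ('Created', 'InProgress'):
            return response
        time.sleep(poll_seconds)
    raise TimeoutError(f'{invocation_arn} still running after {max_attempts} polls')

# Demonstration with a stubbed client that succeeds on the second poll:
_responses = iter([
    {'status': 'InProgress'},
    {'status': 'Success', 'outputConfiguration': {'s3Uri': 's3://bucket/out/'}},
])
final = wait_for_bda_job(lambda **kwargs: next(_responses),
                         'arn:aws:bedrock:us-west-2:111122223333:data-automation-invocation/demo',
                         poll_seconds=0)
print(final['status'])  # → Success
```

For long-running batches, an EventBridge notification or S3 event trigger is usually preferable to client-side polling.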

Once the job is completed, the results of the file processing are stored in the S3 bucket defined in the output configuration. The output includes unique structures depending on both the file modality and the operation types specified in the call to invoke the job.

In this example, BDA associated the hospital discharge summary document (2nd page) with the custom discharge summary blueprint with a high level of confidence:

"matched_blueprint": {"arn": "<BLUEPRINT_ARN>", "name": "hospital-discharge-report", "confidence": 0.9238664}        

Using the matched blueprint, BDA was able to accurately extract each field that was defined in the custom blueprint:

"inference_result": {
        "provider_details": {
            "name": "Mateo Jackson, Phd",
            "provider_id": "00988277891"
        },
        "discharge_summary": "Some activity restrictions suggested, full course of antibiotics, check back with physican in case of relapse, strict diet",
        "assessment_details": {
            "reported_symptoms": "35 yo M c/o stomach problems since 2 montsh ago. Patient reports epigastric abdominal pain non-radiating. Pain is described as gnawing and burning, intermitent lasting 1-2 hours, and gotten progressively worse. Antacids used to alleviate pain but not anymore; nothing exhacerbates pain. Pain unrelate"
        },
        "visit_details": {
            "discharged_date": "09-08-2020",
            "admitted_date": "09-07-2020",
            "discharged_to": "Home with support services"
        },
        "hospital_contact": "(999)-(888)-(1234)",
        "hospital_name": "Not a Memorial Hospital Of Collier Reg: PN/S/11011. Non-Profit",
        "patient_details": {
            "gender": "Male",
            "patient_id": "NARH-36640",
            "name": "John Doe"
        }
    },        
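Because the blueprint instructions asked for dates in MM-DD-YYYY format, the extracted values are easy to validate and use downstream; a minimal sketch over the visit details above:

```python
from datetime import datetime

def parse_visit_dates(visit_details: dict) -> dict:
    """Parse the MM-DD-YYYY dates the blueprint instructed BDA to emit.

    Raises ValueError if a value does not match the requested format --
    a useful guardrail before writing to a claims system.
    """
    fmt = '%m-%d-%Y'
    return {
        'admitted': datetime.strptime(visit_details['admitted_date'], fmt).date(),
        'discharged': datetime.strptime(visit_details['discharged_date'], fmt).date(),
    }

# Values taken from the sample inference_result above:
visit = {'admitted_date': '09-07-2020', 'discharged_date': '09-08-2020'}
dates = parse_visit_dates(visit)
print((dates['discharged'] - dates['admitted']).days)  # → 1 (length of stay)
```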

The output also includes explainability information: a confidence score and bounding geometry for each extracted field, which can be used for auditing and validation:

"explainability_info": [
        {
            "provider_details": {
                "name": {
                    "success": true,
                    "confidence": 0.93359375,
                    "geometry": [
                        {
                            "boundingBox": {
                                "top": 0.14783521583575837,
                                "left": 0.12684581227421549,
                                "width": 0.197725693303263,
                                "height": 0.011893643038294155
                            },
                            "vertices": [
                                {
                                    "x": 0.12684869245229458,
                                    "y": 0.14783521583575837
                                },
                                {
                                    "x": 0.3245715055774785,
                                    "y": 0.14799948879388436
                                },
                                {
                                    "x": 0.32456936528329655,
                                    "y": 0.15972885887405253
                                },
                                {
                                    "x": 0.12684581227421549,
                                    "y": 0.159564673184095
                                }
                            ],
                            "page": 8
                        }
                    ],
                    "type": "string",
                    "value": "Mateo Jackson, Phd"
                },
...        
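These per-field confidence scores make it straightforward to route uncertain extractions to human review. A minimal sketch that walks one entry of the explainability_info list and flags fields below a threshold (the nesting mirrors the sample output above; the 0.9 cutoff is an arbitrary choice):

```python
def low_confidence_fields(explainability: dict, threshold: float = 0.9, prefix: str = ''):
    """Recursively collect (field_path, confidence) pairs below the threshold.

    A leaf field is any dict carrying a 'confidence' key, mirroring the
    per-field structure in BDA's explainability_info output; pass one
    entry of the explainability_info list.
    """
    flagged = []
    for key, value in explainability.items():
        if not isinstance(value, dict):
            continue
        path = f'{prefix}{key}'
        if 'confidence' in value:
            if value['confidence'] < threshold:
                flagged.append((path, value['confidence']))
        else:
            # Nested group of fields (e.g. provider_details): recurse into it.
            flagged.extend(low_confidence_fields(value, threshold, path + '.'))
    return flagged

# Trimmed-down entry shaped like the sample output above:
sample = {
    'provider_details': {
        'name': {'success': True, 'confidence': 0.93, 'value': 'Mateo Jackson, Phd'},
        'provider_id': {'success': True, 'confidence': 0.62, 'value': '00988277891'},
    }
}
print(low_confidence_fields(sample))  # → [('provider_details.provider_id', 0.62)]
```

Fields flagged this way can be queued for manual verification, with the bounding boxes used to highlight the source region in the original document.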

This example showcases the power of Amazon Bedrock Data Automation in revolutionizing Intelligent Document Processing (IDP) workflows. By leveraging this service, organizations can effortlessly automate intricate document handling tasks, including document classification, data extraction, standardization, and validation. BDA significantly reduces the complexity of operations while enhancing processing efficiency. In the context of medical claims processing, this translates to increased capacity to handle larger claim volumes, minimized error rates, and overall operational optimization.

Conclusion

Amazon Bedrock Data Automation represents a significant leap forward in AI-powered data processing, addressing key challenges in managing unstructured data at scale. Its unified approach and customization capabilities make it a versatile tool across industries.

By democratizing access to advanced AI capabilities, Amazon Bedrock Data Automation enables organizations of all sizes to leverage generative AI without extensive technical expertise. For organizations looking to enhance their data strategy, BDA offers a compelling solution worth exploring.
