Streamlining Intelligent Document Processing: A Guide to Amazon Bedrock Data Automation
In today’s data-driven world, developers face challenges in extracting insights from unstructured content such as documents, images, audio, and videos. Working with foundation models, optimizing performance, and managing multiple AI systems often requires significant time and resources. Additionally, custom intelligent document processing (IDP) solutions and data pipelines are becoming harder to maintain and deploy. As technology evolves rapidly, many struggle to keep up with these changes and ensure their systems remain efficient.
Enter Amazon Bedrock Data Automation (BDA), a new feature within Amazon Bedrock that streamlines the development of generative AI applications and automates workflows involving unstructured multimodal content. This innovative capability offers a unified experience for developers of all skill levels, enabling them to effortlessly automate the extraction, transformation, and generation of insights from their data.
In this article, we'll look at its key features and benefits, then walk through an example of using Bedrock Data Automation with the AWS SDK for Python (Boto3) in an intelligent document processing workflow.
Key Features of Amazon Bedrock Data Automation
🔄 Unified Experience
🛠️ Customization
🧠 Intelligent Model Orchestration and Output Validation
Intelligent Document Processing with Amazon Bedrock Data Automation
Bedrock Data Automation allows you to automate IDP workflows at scale, without needing to orchestrate complex document processing tasks such as classification, extraction, normalization, or validation.
In many cases, IDP workflows are used against a variety of different document types. For example, an insurance provider may want to automate the processing of a claims packet to streamline their pipeline and improve the accuracy of claims processing. Amazon Bedrock Data Automation simplifies the automation of complex IDP tasks such as document splitting, classification, data extraction, output format normalization and data validation.
Let's use the example of a claims packet which contains the following two documents: a prescription label, and a hospital discharge summary.
A Bedrock Data Automation workflow consists of the following steps:
1. Define blueprints to generate custom outputs
Custom outputs use blueprints that specify output requirements using natural language or a schema editor. Blueprints can be created using the console or the API, and you can either use a pre-defined catalog blueprint for common document types (e.g., invoices, payslips, W-2s) or create a custom one for your use case.
To create a custom blueprint using the API, you invoke the CreateBlueprint operation on the Amazon Bedrock Data Automation client. The following example defines the hospital name, hospital contact, visit details, patient details, provider details, assessment details, and discharge summary as properties passed to CreateBlueprint, to be extracted from the claims packet. By defining instructions for the properties, we are also leveraging key normalization, guiding BDA to recognize and map different representations of the same field to a standardized key.
import json
import boto3

# Build-time client for Amazon Bedrock Data Automation
bedrock_data_automation_client = boto3.client('bedrock-data-automation')

bda_create_blueprint_response = bedrock_data_automation_client.create_blueprint(
    blueprintName='hospital-discharge-report',
    type='DOCUMENT',
    blueprintStage='LIVE',
    schema=json.dumps({
        "$schema": "http://json-schema.org/draft-07/schema#",
        "description": "A standard discharge summary report used by hospitals containing details of the patient, medical provider and key facts on the visit, medical assessment and a summary of discharge.",
        "class": "Hospital Discharge Summary",
        "type": "object",
        "definitions": {
            "VisitDetails": {
                "type": "object",
                "properties": {
                    "admitted_date": {
                        "type": "string",
                        "inferenceType": "explicit",
                        "instruction": "Date of admission in MM-DD-YYYY format"
                    },
                    "discharged_date": {
                        "type": "string",
                        "inferenceType": "explicit",
                        "instruction": "Date of discharge in MM-DD-YYYY format"
                    },
                    "discharged_to": {
                        "type": "string",
                        "inferenceType": "explicit",
                        "instruction": "Where the patient was discharged to"
                    }
                }
            },
            "PatientDetails": {
                "type": "object",
                "properties": {
                    "name": {
                        "type": "string",
                        "inferenceType": "explicit",
                        "instruction": "Name of the patient"
                    },
                    "gender": {
                        "type": "string",
                        "inferenceType": "explicit",
                        "instruction": "Gender of the patient"
                    },
                    "patient_id": {
                        "type": "string",
                        "inferenceType": "explicit",
                        "instruction": "Unique id of the patient"
                    }
                }
            },
            "ProviderDetails": {
                "type": "object",
                "properties": {
                    "name": {
                        "type": "string",
                        "inferenceType": "explicit",
                        "instruction": "Name of the provider"
                    },
                    "provider_id": {
                        "type": "string",
                        "inferenceType": "explicit",
                        "instruction": "Unique id of the provider"
                    }
                }
            },
            "AssessmentDetails": {
                "type": "object",
                "properties": {
                    "reported_symptoms": {
                        "type": "string",
                        "inferenceType": "explicit",
                        "instruction": "Reported symptoms and history of present illness"
                    }
                }
            }
        },
        "properties": {
            "hospital_name": {
                "type": "string",
                "inferenceType": "explicit",
                "instruction": "Name of the hospital"
            },
            "hospital_contact": {
                "type": "string",
                "inferenceType": "explicit",
                "instruction": "Contact details of the hospital"
            },
            "visit_details": {"$ref": "#/definitions/VisitDetails"},
            "patient_details": {"$ref": "#/definitions/PatientDetails"},
            "provider_details": {"$ref": "#/definitions/ProviderDetails"},
            "assessment_details": {"$ref": "#/definitions/AssessmentDetails"},
            "discharge_summary": {
                "type": "string",
                "inferenceType": "explicit",
                "instruction": "Summary of discharge instructions"
            }
        }
    }),
)
The CreateBlueprint response returns the blueprintArn for the discharge summary's custom blueprint:
'blueprintArn: arn:aws:bedrock:us-west-2:<AWS_ACCOUNT_ID>:blueprint/<BLUEPRINT_ID>'
2. Create a Data Automation Project
A project is a grouping of both standard and custom output configurations. This allows you to use a single resource for multiple file types. When processing documents, you may want to use multiple blueprints for different kinds of documents that are passed to your project. If you choose to pass a file that contains multiple documents, you can enable BDA to automatically split the file into individual documents and match each one to the correct blueprint for processing. This allows you to process different types of documents within the same project, each with its own custom extraction logic.
To create a project using the API, you invoke the CreateDataAutomationProject operation. The following is an example of how you can configure custom output using the custom blueprint for the hospital discharge summary we created above and the existing sample blueprint for prescription labels.
bda_bedrock_automation_create_project_response = bedrock_data_automation_client.create_data_automation_project(
    projectName='TEST_PROJECT',
    projectDescription='test BDA project',
    projectStage='LIVE',
    standardOutputConfiguration={
        'document': {
            'outputFormat': {
                'textFormat': {
                    'types': ['PLAIN_TEXT']
                },
                'additionalFileFormat': {
                    'state': 'ENABLED',
                }
            }
        },
    },
    customOutputConfiguration={
        'blueprints': [
            {
                'blueprintArn': 'arn:aws:bedrock:us-west-2:<AWS_ACCOUNT_ID>:blueprint/<BLUEPRINT_ID>'
            },
            {
                'blueprintArn': 'arn:aws:bedrock:us-west-2:aws:blueprint/bedrock-data-automation-public-prescription-label'
            },
        ],
    },
    overrideConfiguration={
        'document': {
            'splitter': {
                'state': 'ENABLED'
            }
        }
    },
)
To process different types of documents within a single document package using multiple blueprints in one project, the splitter configuration must be enabled through the API, as shown in the overrideConfiguration at the bottom of the call above.
overrideConfiguration={
    'document': {
        'splitter': {
            'state': 'ENABLED' | 'DISABLED'
        }
    }
},
The API validates the input configuration, creates a new project, and returns the projectArn.
'arn:aws:bedrock:us-west-2:<AWS_ACCOUNT_ID>:data-automation-project/<PROJECT_ID>'
3. Invoking the Analysis
Once the project is set up, you can start processing your files. The console lets you easily test and preview the insights that can be extracted from your content, but those tests run on only one document at a time. To process multiple documents, you can use the InvokeDataAutomationAsync API, which initiates asynchronous processing of your files in a specified S3 bucket, using the configuration defined in the project by passing the project's ARN. You specify the input configuration (in this case, the S3 bucket where the claims packet resides) and the output configuration (where you want the results and metadata stored).
# Runtime client for invoking analyses and checking job status
bedrock_data_automation_runtime_client = boto3.client('bedrock-data-automation-runtime')

bda_invoke_data_automation_async_response = bedrock_data_automation_runtime_client.invoke_data_automation_async(
    inputConfiguration={'s3Uri': '<S3_URI>'},
    outputConfiguration={'s3Uri': '<S3_URI>'},
    dataAutomationProfileArn='arn:aws:bedrock:us-west-2:<AWS_ACCOUNT_ID>:data-automation-profile/bda.v1:0',
    dataAutomationConfiguration={
        'dataAutomationProjectArn': 'arn:aws:bedrock:us-west-2:<AWS_ACCOUNT_ID>:data-automation-project/<PROJECT_ID>',
        'stage': 'LIVE'
    }
)
This API call returns the invocationArn:
'arn:aws:bedrock:us-west-2:<AWS_ACCOUNT_ID>:data-automation-invocation/<INVOCATION_ID>'
4. Retrieving the results
Since the processing is asynchronous, you can use the GetDataAutomationStatus API to check the status of the processing job, passing the invocationArn returned above.
bda_get_data_automation_status_response = bedrock_data_automation_runtime_client.get_data_automation_status(
    invocationArn='arn:aws:bedrock:us-west-2:<AWS_ACCOUNT_ID>:data-automation-invocation/<INVOCATION_ID>'
)
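Because processing can take a while, it is convenient to poll this API in a small loop. The sketch below is a minimal helper, not part of the BDA SDK: it assumes the status values ('InProgress', 'Success', 'ServiceError', 'ClientError') and the outputConfiguration field of the GetDataAutomationStatus response, and takes a zero-argument callable (e.g. a lambda wrapping the boto3 call above) so it stays easy to test.

```python
import time

def wait_for_bda_job(get_status, poll_interval=5, max_attempts=60):
    """Poll until a BDA job finishes; return the output S3 URI on success.

    get_status: zero-argument callable returning a GetDataAutomationStatus
    response dict, e.g. a lambda wrapping the boto3 call above.
    """
    for _ in range(max_attempts):
        response = get_status()
        status = response.get("status")
        if status == "Success":
            # On success, the response carries the S3 location of the results.
            return response["outputConfiguration"]["s3Uri"]
        if status in ("ServiceError", "ClientError"):
            raise RuntimeError(f"BDA job failed: {response.get('errorMessage')}")
        time.sleep(poll_interval)
    raise TimeoutError("BDA job did not complete within the polling window")
```

For example, `wait_for_bda_job(lambda: client.get_data_automation_status(invocationArn=arn))` would block until the job finishes and hand back the results URI.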
Once the job is completed, the results of the file processing are stored in the S3 bucket defined in the output configuration. The output includes unique structures depending on both the file modality and the operation types specified in the call to invoke the job.
In this example, BDA associated the hospital discharge summary document (the second page of the packet) with the custom discharge summary blueprint with a high level of confidence:
"matched_blueprint": {"arn": "<BLUEPRINT_ARN>", "name": "hospital-discharge-report", "confidence": 0.9238664}
Using the matched blueprint, BDA was able to accurately extract each field that was defined in the custom blueprint:
"inference_result": {
    "provider_details": {
        "name": "Mateo Jackson, Phd",
        "provider_id": "00988277891"
    },
    "discharge_summary": "Some activity restrictions suggested, full course of antibiotics, check back with physican in case of relapse, strict diet",
    "assessment_details": {
        "reported_symptoms": "35 yo M c/o stomach problems since 2 montsh ago. Patient reports epigastric abdominal pain non-radiating. Pain is described as gnawing and burning, intermitent lasting 1-2 hours, and gotten progressively worse. Antacids used to alleviate pain but not anymore; nothing exhacerbates pain. Pain unrelate"
    },
    "visit_details": {
        "discharged_date": "09-08-2020",
        "admitted_date": "09-07-2020",
        "discharged_to": "Home with support services"
    },
    "hospital_contact": "(999)-(888)-(1234)",
    "hospital_name": "Not a Memorial Hospital Of Collier Reg: PN/S/11011. Non-Profit",
    "patient_details": {
        "gender": "Male",
        "patient_id": "NARH-36640",
        "name": "John Doe"
    }
},
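Downstream systems (for example, a claims database) often want a flat record rather than nested JSON. A small, hypothetical helper, not part of BDA, that flattens the inference_result into dotted field paths might look like this:

```python
def flatten_inference_result(result, prefix=""):
    """Flatten BDA's nested inference_result dict into dotted field paths."""
    flat = {}
    for key, value in result.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested groups like patient_details
            flat.update(flatten_inference_result(value, path))
        else:
            flat[path] = value
    return flat
```

Applied to the result above, this yields keys such as `patient_details.name` and `visit_details.admitted_date`, ready for a tabular store.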
The output also includes explainability information which returns a confidence score and bounding information for each extracted field, which can be used for auditing and validation:
"explainability_info": [
    {
        "provider_details": {
            "name": {
                "success": true,
                "confidence": 0.93359375,
                "geometry": [
                    {
                        "boundingBox": {
                            "top": 0.14783521583575837,
                            "left": 0.12684581227421549,
                            "width": 0.197725693303263,
                            "height": 0.011893643038294155
                        },
                        "vertices": [
                            {"x": 0.12684869245229458, "y": 0.14783521583575837},
                            {"x": 0.3245715055774785, "y": 0.14799948879388436},
                            {"x": 0.32456936528329655, "y": 0.15972885887405253},
                            {"x": 0.12684581227421549, "y": 0.159564673184095}
                        ],
                        "page": 8
                    }
                ],
                "type": "string",
                "value": "Mateo Jackson, Phd"
            },
            ...
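These per-field confidence scores make it straightforward to route uncertain extractions to human review. The following sketch is a hypothetical validation pass, not a BDA feature: it assumes each leaf field in an explainability_info entry is a dict containing a "confidence" key, and the 0.85 threshold is an arbitrary example you would tune for your use case.

```python
def flag_low_confidence_fields(explainability, threshold=0.85, prefix=""):
    """Walk one explainability_info entry and collect (field_path, confidence)
    pairs whose confidence falls below the threshold."""
    flagged = []
    for key, value in explainability.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            if "confidence" in value:
                # Leaf field: check its score against the threshold
                if value["confidence"] < threshold:
                    flagged.append((path, value["confidence"]))
            else:
                # Nested group such as provider_details: recurse
                flagged.extend(flag_low_confidence_fields(value, threshold, path))
    return flagged
```

Fields returned by this pass could be queued for manual verification, while high-confidence fields flow straight through the claims pipeline.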
This example showcases the power of Amazon Bedrock Data Automation in revolutionizing Intelligent Document Processing (IDP) workflows. By leveraging this service, organizations can effortlessly automate intricate document handling tasks, including document classification, data extraction, standardization, and validation. BDA significantly reduces the complexity of operations while enhancing processing efficiency. In the context of medical claims processing, this translates to increased capacity to handle larger claim volumes, minimized error rates, and overall operational optimization.
Conclusion
Amazon Bedrock Data Automation represents a significant leap forward in AI-powered data processing, addressing key challenges in managing unstructured data at scale. Its unified approach and customization capabilities make it a versatile tool across industries.
By democratizing access to advanced AI capabilities, Amazon Bedrock Data Automation enables organizations of all sizes to leverage generative AI without extensive technical expertise. For organizations looking to enhance their data strategy, BDA offers a compelling solution worth exploring.