Many companies have large volumes of paper or electronic documents that contain untapped business intelligence. With the advancement of generative AI, various large language models can be used to accurately extract relevant data from these documents. This post demonstrates an intelligent document processing pipeline that consists of both on-demand inference and batch inference options on Amazon Bedrock to enable the flexibility on the document processing time and cost. For time-sensitive requests, one can use the on-demand inference option, while the batch inference option is most cost optimized. It also explains how to dynamically specify the large language model and prompts at the document level, enabling you to extract data from multiple types of documents using the same pipelines.

Solution overview

If you, like one of our customers, have hundreds of millions land lease documents in scanned PDF format (PDF that contains only images without editable text, e.g. in this case, scanned land lease saved as PDF) in the backlog, and new documents are still piling up every day, this is a solution you can use to effectively extract data from these documents. As shown in the following diagram, this solution builds two inference pipelines, on-demand and batch, with a mechanism to invoke them dynamically. By using effectively designed prompts managed in Amazon Bedrock Prompt Management, the data can be extracted and standardized from scan PDFs, which often have varying formats and conventions, or from text files.