According to a recent press release, “Today, Amazon Web Services, Inc. (AWS), an Amazon.com company, announced the general availability of Amazon Textract, a fully managed service that uses machine learning to automatically extract text and data, including from tables and forms, in virtually any document without the need for manual review, custom code, or machine learning experience. Amazon Textract goes beyond simple optical character recognition (OCR) to identify the contents of fields in forms, information stored in tables, and the context in which the information is presented, such as a name or social security number from a tax form or the product SKU or quantity in a warehouse from an inventory report. The extracted text and data can be easily used to build smart searches on large archives of documents, or can be loaded into a database for use by applications, such as accounting, auditing, and compliance software.”
The release goes on, “Amazon Textract’s API supports multiple image formats like scans, PDFs, and photos, and customers can use it with database and analytics services like Amazon Elasticsearch Service, Amazon DynamoDB, and Amazon Athena and other machine learning services like Amazon Comprehend, Amazon Comprehend Medical, Amazon Translate, and Amazon SageMaker to derive deeper meaning from the extracted text and data. To get started with Amazon Textract, visit https://aws.amazon.com/textract.”
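As a rough illustration of the API the release describes, the following Python sketch assumes the boto3 SDK, configured AWS credentials, and a default region, none of which are mentioned in the release. The function names and the hand-written `SAMPLE_RESPONSE` are hypothetical; the sample imitates only a small subset of the response shape so that the parsing helpers can be tried without calling AWS.

```python
from collections import Counter
from typing import Any, Dict, List


def analyze_document_bytes(image_bytes: bytes) -> Dict[str, Any]:
    """Send one document image to Textract, asking for table and form analysis.

    Assumes boto3 is installed and AWS credentials and a region are configured.
    """
    import boto3  # imported lazily so the helpers below work without AWS access

    client = boto3.client("textract")
    return client.analyze_document(
        Document={"Bytes": image_bytes},
        FeatureTypes=["TABLES", "FORMS"],
    )


def extract_lines(response: Dict[str, Any]) -> List[str]:
    """Return the text of every LINE block, in the order Textract returned them."""
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
    ]


def count_block_types(response: Dict[str, Any]) -> Counter:
    """Count how many blocks of each type (LINE, TABLE, KEY_VALUE_SET, ...) came back."""
    return Counter(block.get("BlockType") for block in response.get("Blocks", []))


# Hypothetical, hand-written sample imitating a small part of a Textract response.
SAMPLE_RESPONSE: Dict[str, Any] = {
    "Blocks": [
        {"BlockType": "PAGE", "Id": "p1"},
        {"BlockType": "LINE", "Id": "l1", "Text": "Name: Jane Doe"},
        {"BlockType": "WORD", "Id": "w1", "Text": "Name:"},
        {"BlockType": "LINE", "Id": "l2", "Text": "Quantity: 12"},
        {"BlockType": "KEY_VALUE_SET", "Id": "k1", "EntityTypes": ["KEY"]},
        {"BlockType": "TABLE", "Id": "t1"},
    ]
}

if __name__ == "__main__":
    for line in extract_lines(SAMPLE_RESPONSE):
        print(line)
    print(dict(count_block_types(SAMPLE_RESPONSE)))
```

With real credentials, the bytes of a scanned form could be passed to `analyze_document_bytes` and the result handed to the same helpers. Loading the extracted lines into a database or search index, as the release suggests, is left out of this sketch.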
It continues, “Many companies extract text and data from files such as contracts, expense reports, mortgage guarantees, fund prospectuses, tax documents, hospital claims, and patient forms through manual data entry or simple OCR software. This is a time-consuming and often inaccurate process that produces an output requiring extensive post-processing before it can be put in a format that is usable by other applications. That’s because existing OCR technologies are unable to recognize common layouts like forms and tables, and only generate a lengthy and often inaccurate text dump. What organizations want instead is the ability to accurately identify and extract text and data from forms and tables in documents of any format and from a variety of file types and templates.”
Read more at Business Wire.