A step-by-step guide to OCR form processing
December 15, 2021
OCR
DATA-EXTRACTION
MACHINE-LEARNING
IDP

Optical Character Recognition (OCR) is a technology widely used to convert handwritten, typed, or scanned text, or text inside images, into machine-readable text. This makes it well suited to processing forms, among other document types. Different OCR solutions suit different kinds of forms: for structured forms, template-based OCR is the answer, whereas semi-structured and unstructured forms require a more sophisticated data extraction solution.

What is OCR form processing? How does it work? Let’s find out in this blog. 

Lenders, insurers, and other industries need to process numerous forms in their day-to-day operations. These forms can be divided into two categories:- 

i) Structured Forms

ii) Semi-structured Forms

The division is made on the basis of structure, template, and layout of different forms. This classification is important as it affects how these forms are processed.

What’s the difference between structured and semi-structured forms?

Let’s have a look at both types of forms one by one:-

Structured forms

Structured forms are made up of clearly defined text-blocks with fields that are always in the same place. They only change in terms of the information populated in each field. OCR works well with structured forms because the data remains at the same place on each page.

This fixed structure of forms allows for higher data extraction accuracy. However, accuracy can still drop when information is typed over the printed lines of the form. For example, if a “1” is typed so close to a field's border that it touches the line, the OCR engine may not capture the number “1” at all.

Semi-structured forms

For semi-structured forms, the locations of key identifiers and checkboxes vary along with the data fields. This poses a problem for template-based OCR software, which may capture incorrect data from a fixed zone while the correct value sits somewhere else on the page.

Data extraction from semi-structured forms relies upon the use of business rules to locate the 'position information' for a data point. These rules rely upon the fact that the data to be extracted is always in the same relative position to a defining characteristic.

Form processing: Use-cases

Let’s take a look at some of the most common use-cases of form processing for lenders and insurers:-

IRS Forms
Sample IRS Forms

IRS forms are used by individuals and businesses to report their financial activities to the federal government to calculate their tax liability. Some of the most common IRS tax forms are:-

  • Form 1040 - US Individual Income Tax Return
  • Form W-4 - Employee's Withholding Certificate
  • Form W-9 - Request for Taxpayer Identification Number (TIN) and Certification
  • Form 941 - Employer's Quarterly Federal Tax Return
  • Form W-2 - Wage and Tax Statement
ACORD forms
Sample ACORD Forms

ACORD is the acronym for Association for Cooperative Operations Research and Development. The organization helps create a universal language and standardized documentation that insurance agencies use throughout the USA.

ACORD forms are available in different formats, including eForms, PDF files, and electronic fillables.

Insurance/Loan Application Forms

These forms are used to collect an applicant’s personal information for underwriting and claims purposes.

How does OCR form processing work?

OCR can only process digitized forms, so paper forms must first be scanned and converted into images before data can be extracted from them. Even PDF forms are first converted to images for the OCR data capture solution to process.

Let’s dive into the steps involved in OCR form processing:-

Format detection

As the first step of OCR form processing, the format of the file is identified. This is done so that non-image formats can be converted into images, which is essential for performing OCR.
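As a rough illustration of this step, here is a minimal sketch that guesses a file's format from its first few bytes (its "magic bytes"). The function names `detect_format` and `needs_rasterization` are made up for this example, and only a handful of common formats are covered; a production pipeline would likely recognize many more.

```python
# Well-known leading byte signatures for a few common form file formats.
SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpeg"),
    (b"II*\x00", "tiff"),  # little-endian TIFF
    (b"MM\x00*", "tiff"),  # big-endian TIFF
]


def detect_format(header: bytes) -> str:
    """Return a format name based on the first bytes of a file, or 'unknown'."""
    for magic, name in SIGNATURES:
        if header.startswith(magic):
            return name
    return "unknown"


def needs_rasterization(fmt: str) -> bool:
    """PDF pages have to be rendered to images before OCR; image formats do not."""
    return fmt == "pdf"


if __name__ == "__main__":
    fmt = detect_format(b"%PDF-1.7\n")
    print(fmt, needs_rasterization(fmt))
```

In practice you would read the header with something like `open(path, "rb").read(8)` and pass it to `detect_format`.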

Image pre-processing

In this step, the quality of the scanned image is improved with noise reduction. Noise is a random variation of brightness or color in an image that makes it difficult to distinguish the text from the background. Blurring or smoothing of the image is also performed at this step to remove “outlier” pixels that may be noise.
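To make the idea of removing "outlier" pixels concrete, here is a small sketch of a median filter, one common smoothing technique (the article does not say which filter a given OCR product uses, so treat this choice as an assumption). The image is represented as a plain list of rows of grayscale values to keep the example dependency-free; real pipelines would normally use an image library instead.

```python
from statistics import median


def median_filter(image, size=3):
    """Replace each pixel with the median of its size x size neighbourhood.

    `image` is a list of rows of grayscale values (0-255). Pixels near the
    border use only the neighbours that fall inside the image.
    """
    height, width = len(image), len(image[0])
    r = size // 2
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            window = [
                image[j][i]
                for j in range(max(0, y - r), min(height, y + r + 1))
                for i in range(max(0, x - r), min(width, x + r + 1))
            ]
            row.append(int(median(window)))
        out.append(row)
    return out


if __name__ == "__main__":
    # A white 5x5 patch with a single dark speck in the middle.
    noisy = [[255] * 5 for _ in range(5)]
    noisy[2][2] = 0
    cleaned = median_filter(noisy)
    print(cleaned[2][2])
```

A single isolated dark pixel is outvoted by its bright neighbours, so the speck disappears while larger connected strokes (such as characters) are mostly preserved.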

Data Extraction

Structured or semi-structured, both kinds of forms include key-value pairs and tables in some form. In this section, we discuss how OCR is used to extract line-item data and key-value pairs:-

Tables

OCR form processing software detects lines and other visual features in order to perform a proper table extraction. Simple character recognition alone is not enough for table extraction, which is why it’s one of the biggest challenges in document capture. To provide context to the extracted data, computer vision and machine learning algorithms are used.
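As a simplified illustration of line detection, the sketch below finds ruling lines in a binarized image with a projection profile: a row or column counts as a line when almost all of its pixels are ink. This is a basic heuristic chosen for the example, not the computer vision or machine learning approach a commercial product would use, and the names `ruling_lines` and `cell_spans` are hypothetical.

```python
def ruling_lines(binary, axis, min_fill=0.9):
    """Return indices of rows (axis=0) or columns (axis=1) that are mostly ink.

    `binary` is a list of rows with 1 for dark (ink) pixels and 0 for background.
    """
    lines = binary if axis == 0 else list(zip(*binary))
    return [i for i, line in enumerate(lines) if sum(line) / len(line) >= min_fill]


def cell_spans(line_indices):
    """Turn ruling-line positions into (start, end) spans between them.

    `end` is exclusive. Adjacent indices (a thick line) produce no span.
    """
    return [(a + 1, b) for a, b in zip(line_indices, line_indices[1:]) if b - a > 1]


if __name__ == "__main__":
    # A tiny 2x2 table: outer border plus one inner horizontal and vertical line.
    grid = [
        [1, 1, 1, 1, 1, 1, 1],
        [1, 0, 0, 1, 0, 0, 1],
        [1, 1, 1, 1, 1, 1, 1],
        [1, 0, 0, 1, 0, 0, 1],
        [1, 1, 1, 1, 1, 1, 1],
    ]
    rows = cell_spans(ruling_lines(grid, axis=0))
    cols = cell_spans(ruling_lines(grid, axis=1))
    print(rows, cols)
```

Each combination of a row span and a column span is one table cell, which can then be cropped and passed to character recognition separately.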

Key-Value Pair Mapping

Key-value pairs are essentially two data items, a key and a value, linked together as one. Template-based OCR is able to extract key-value pairs efficiently from structured forms because keys and values have defined position references in these documents.
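Here is a minimal sketch of how template-based (zonal) extraction could work once an OCR engine has returned words with their positions. The word format `(text, x_center, y_center)`, the template layout, and the function name `extract_by_zone` are all assumptions made for this illustration; real OCR engines return richer structures.

```python
from typing import Dict, List, Tuple

Word = Tuple[str, float, float]            # (text, x_center, y_center)
Zone = Tuple[float, float, float, float]   # (x0, y0, x1, y1)


def extract_by_zone(words: List[Word], template: Dict[str, Zone]) -> Dict[str, str]:
    """Collect the words whose centre falls inside each field's fixed zone."""
    result = {}
    for field, (x0, y0, x1, y1) in template.items():
        inside = [w for w in words if x0 <= w[1] <= x1 and y0 <= w[2] <= y1]
        inside.sort(key=lambda w: w[1])  # left to right; assumes single-line fields
        result[field] = " ".join(w[0] for w in inside)
    return result


if __name__ == "__main__":
    # Hypothetical OCR output for a structured form.
    words = [
        ("Name", 20, 10), ("Jane", 120, 10), ("Doe", 160, 10),
        ("Wages", 20, 40), ("52,000.00", 130, 40),
    ]
    # Hypothetical template: where each value sits on every copy of the form.
    template = {"name": (100, 0, 300, 20), "wages": (100, 30, 300, 50)}
    print(extract_by_zone(words, template))
```

The template is only valid as long as the values stay inside their zones, which is exactly the property structured forms have.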

To extract key-value pairs from semi-structured forms, the solution needs to go beyond zonal OCR. OCR is coupled with business and document-based rules that define the ‘position information’ of the value to be extracted for each required key.
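One simple rule of this kind is "the value sits to the right of its key on the same line". The sketch below applies that rule to OCR words given as `(text, x_center, y_center)` tuples; the word format, the tolerance `max_dy`, and the function name `value_right_of` are assumptions for illustration, and real rule sets are usually more elaborate.

```python
def value_right_of(words, key_text, max_dy=5.0):
    """Return the text to the right of the word `key_text` on the same line.

    `words` is a list of (text, x_center, y_center) tuples. Words count as
    being on the same line when their y centres differ by at most `max_dy`.
    Returns None when the key or its value cannot be found.
    """
    anchors = [w for w in words if w[0].lower() == key_text.lower()]
    if not anchors:
        return None
    _, ax, ay = anchors[0]
    same_line = [w for w in words if abs(w[2] - ay) <= max_dy and w[1] > ax]
    same_line.sort(key=lambda w: w[1])
    return " ".join(w[0] for w in same_line) or None


if __name__ == "__main__":
    # The same field at two different places on two layouts of a form.
    layout_a = [("Wages:", 20, 40), ("52,000.00", 130, 40), ("Name:", 20, 10)]
    layout_b = [("Name:", 50, 280), ("Wages:", 200, 300), ("52,000.00", 310, 302)]
    print(value_right_of(layout_a, "Wages:"), value_right_of(layout_b, "Wages:"))
```

Because the rule is relative to the key rather than to the page, it finds the value in both layouts even though the absolute coordinates differ.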

Limitations of OCR form processing

OCR is a fundamental data extraction technology, but it is far from perfect. Let’s have a look at some of its limitations when it comes to form processing:-

  • Font size - OCR may find it difficult to convert characters with very large or very small font sizes.
  • Uni-Dimensional - OCR identifies and extracts characters horizontally, so it only knows that a character comes before or after another character, not above or below it.
  • Case handling during correction - Spell checking used to correct OCR text typically does not take the case of the letters into account, e.g., ‘abc’ and ‘ABC’ will be treated alike.

Intelligent Document Processing: Alternative to OCR form processing

Intelligent Document Processing (IDP) is a better alternative to OCR as it helps overcome the limitations of OCR. Benefits of Machine Learning and Artificial Intelligence-based form processing include:-

1. Scalability - As a business, you can process more forms as compared to manual form processing. IDP solutions can adapt to any layout/template changes so you don’t need to retrain the solution for the most recent form version.

2. Growth - Extract data from forms automatically and help people concentrate on more important tasks. Grow your business without having to hire people for manual data entry.

3. Accuracy - 99%+ field-level accuracy for form processing, which is hard to achieve manually. Docsumo’s document AI solution offers over 95% Straight Through Processing, which means you don’t even have to look at 95% of the total forms you process; they get processed automatically.

4. Analytics - With Docsumo's automated form processing APIs, you get better data quality using document-level data validation. Data validation against your database adds to this accuracy.
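To give a feel for what document-level validation can look like, here is a generic sketch that checks extracted fields against a few rules. It does not use or represent Docsumo's API; the field names (`ssn`, `wages`, `tax_withheld`) and the rules themselves are hypothetical examples.

```python
import re


def validate_fields(fields):
    """Return a list of human-readable problems found in the extracted fields."""
    problems = []

    # Format rule: a US Social Security number looks like NNN-NN-NNNN.
    if not re.fullmatch(r"\d{3}-\d{2}-\d{4}", fields.get("ssn", "")):
        problems.append("ssn: expected NNN-NN-NNNN")

    # Type rule, then a cross-field rule that only makes sense for numbers.
    try:
        wages = float(fields.get("wages", "").replace(",", ""))
        withheld = float(fields.get("tax_withheld", "").replace(",", ""))
    except ValueError:
        problems.append("wages/tax_withheld: not numeric")
    else:
        if withheld > wages:
            problems.append("tax_withheld exceeds wages")

    return problems


if __name__ == "__main__":
    extracted = {"ssn": "123-45-678O", "wages": "52,000.00", "tax_withheld": "6,100.00"}
    print(validate_fields(extracted))
```

A form with no problems can go straight through, while a form with a flagged field (such as the letter "O" misread in place of a zero above) can be routed to a person for review.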

If you’re looking to automate form processing and digitize business workflows to offer better services to your customers, schedule a free demo with Docsumo, now.

Written by
Pankaj Tripathi