Analysis and Benchmarking of OCR Accuracy for Data Extraction Models
October 21, 2022

Optical Character Recognition (OCR) converts an image of text into machine-readable text. It underlies many data extraction solutions, including Intelligent Document Processing (IDP). However, OCR does not understand the context of a document: it works simply by distinguishing text pixels from the background and matching patterns. This limitation can introduce errors in the captured data, which directly affect the output of your data extraction model.

In this article, we discuss how OCR works, the metrics used to measure OCR accuracy, the limitations of OCR models, and how to overcome them.

Let's start with how OCR works.

How does OCR work?

An OCR engine (or OCR software) works through the following steps:

Step 1 - Image acquisition

OCR uses a scanner to process the physical form of a document. Once all pages are copied, OCR generates a black-and-white (two-color/one-bit) version of the color or grayscale scanned document. OCR is essentially a binary process: it recognizes things that are either there or not. If the original scanned image is perfect, any black it contains will be part of a character that needs to be recognized while any white will be part of the background. Reducing the image to black and white is therefore the first stage in figuring out the text that needs processing.

Step 2 - Preprocessing

The OCR software first cleans the image and removes errors to prepare it for reading. These are some of its cleaning techniques:

  • Deskewing, or rotating the scanned document slightly, to fix alignment issues introduced during the scan.
  • Removing digital image spots and smoothing the edges of text.
  • Cleaning up boxes and lines in the image.
  • Recognizing the script, for multi-language OCR.

Step 3 - Text recognition

This stage typically involves targeting one character, word, or block of text at a time. The two main types of processes that OCR uses for identifying characters are pattern recognition and feature extraction. Let’s look at these in turn:

A) Pattern recognition

Pattern recognition works by isolating a character image, called a glyph, and comparing it with stored glyphs. It works only if the stored glyph has a similar font and scale to the input glyph, so the method performs well on scanned images of documents that were typed in a known font.
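The comparison described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: glyphs are tiny 3x3 binary bitmaps, the `TEMPLATES` dictionary is a made-up stored glyph set, and similarity is just the fraction of pixels that agree. Real engines work on larger, normalized bitmaps.

```python
# Glyphs are tiny binary bitmaps: 1 = ink, 0 = background.
# TEMPLATES is a hypothetical stored glyph set, not a real font.
TEMPLATES = {
    "I": [[0, 1, 0],
          [0, 1, 0],
          [0, 1, 0]],
    "L": [[1, 0, 0],
          [1, 0, 0],
          [1, 1, 1]],
    "T": [[1, 1, 1],
          [0, 1, 0],
          [0, 1, 0]],
}


def similarity(glyph, template):
    """Fraction of pixels that agree between two same-sized bitmaps."""
    total = 0
    matches = 0
    for glyph_row, template_row in zip(glyph, template):
        for a, b in zip(glyph_row, template_row):
            total += 1
            matches += (a == b)
    return matches / total


def recognize(glyph, templates=TEMPLATES):
    """Return the (label, score) of the stored glyph that matches best."""
    return max(
        ((label, similarity(glyph, tpl)) for label, tpl in templates.items()),
        key=lambda pair: pair[1],
    )


# An "L" with one noisy pixel still matches the stored "L" best.
noisy_l = [[1, 0, 0],
           [1, 0, 1],
           [1, 1, 1]]
label, score = recognize(noisy_l)
print(label, round(score, 2))
```

Because the score depends on pixel-by-pixel agreement, a glyph in a different font or at a different scale would score poorly against every template, which is the limitation noted above.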

B) Feature extraction

Feature extraction decomposes glyphs into features such as lines, closed loops, line direction, and line intersections. It then uses these features to find the best match, or nearest neighbor, among the stored glyphs. Most modern omnifont OCR programs (ones that recognize printed text in any font) work by feature extraction rather than pattern recognition, and most of them use artificial intelligence.

Step 4 - Post-processing

An OCR program also analyzes the structure of a document image. It divides the page into elements such as blocks of text, tables, and images. Lines are divided into words and then into characters. Once the characters have been singled out, the program performs text recognition. After processing all likely matches, the program presents the recognized text. Depending on several factors, discussed later in this article, the OCR output can contain errors.

How do we clean up this text before feeding it into the next stage of the pipeline?

One approach is to run the text through a spell checker which identifies misspellings and suggests some alternatives. More recent approaches use AI architectures to train word/sub-word-based language models, which are in turn used for correcting OCR text output based on the context. This step increases the OCR accuracy.
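The spell-checker approach can be sketched with Python's standard `difflib` module. This is a rough illustration, not how any particular OCR product does it: `VOCABULARY` is a tiny hypothetical word list, the `cutoff` of 0.8 is an arbitrary assumption, and the correction ignores context and letter case.

```python
import difflib

# Hypothetical domain vocabulary; a real system would use a much larger lexicon.
VOCABULARY = ["invoice", "total", "amount", "date", "company", "document"]


def correct_word(word, vocabulary=VOCABULARY, cutoff=0.8):
    """Replace a word with its closest vocabulary entry, if one is close enough."""
    if word.lower() in vocabulary:
        return word
    candidates = difflib.get_close_matches(word.lower(), vocabulary, n=1, cutoff=cutoff)
    # Leave the word untouched when nothing in the vocabulary is similar enough.
    return candidates[0] if candidates else word


def correct_text(text):
    """Correct each whitespace-separated word independently."""
    return " ".join(correct_word(word) for word in text.split())


corrected = correct_text("invoico total amaunt campany")
print(corrected)  # prints "invoice total amount company"
```

A dictionary lookup like this cannot fix errors that produce another valid word, which is one reason the context-aware language models mentioned above are used instead.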

What is OCR accuracy?

As we have seen, OCR is executed in multiple steps and every one of them influences the accuracy level that is achieved at the end of the process. Since the text from OCR is fed to the next stage of any data extraction model built over it, OCR accuracy is important.

OCR accuracy is measured by comparing the OCR output with the original (ground truth) version of the same text. Suppose a document has 100 characters of ground truth. If the OCR output identifies 99 of them correctly, the character-level OCR accuracy is 99%.

Metrics to measure OCR accuracy

When it comes to OCR accuracy, two objective metrics are used to evaluate how reliable OCR is: Character Error Rate (CER) and Word Error Rate (WER). Let’s look at these in turn.

1. Character Error Rate (CER)

CER calculation is based on the concept of Levenshtein distance, where we count the minimum number of character-level operations required to transform the ground truth text into the OCR output.

Let’s look at an example:

  • Ground truth text: 619375128
  • OCR output text: 61g375Z8

The edits that separate the OCR output from the ground truth are:

  1. g instead of 9 (substitution)
  2. Missing 1 (deletion)
  3. Z instead of 2 (substitution)

Number of transforms (T) = 1+1+1 = 3

Number of correct characters (C) = 6

CER = T / (T + C) × 100% = 3/9 × 100% = 33.33%

Here T + C equals the number of characters in the ground truth (9).
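The same calculation can be reproduced with a short Levenshtein distance routine. This is a minimal sketch: the function names are our own, and the error rate is divided by the ground-truth length, which matches T + C in the example above because the OCR output contains no inserted characters.

```python
def levenshtein(reference, hypothesis):
    """Minimum number of insertions, deletions, and substitutions
    needed to turn `reference` into `hypothesis`."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            cost = 0 if ref_item == hyp_item else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(ground_truth, ocr_output):
    """CER as a percentage of the ground-truth length."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return levenshtein(ground_truth, ocr_output) / len(ground_truth) * 100


cer = character_error_rate("619375128", "61g375Z8")
print(f"CER = {cer:.2f}%")  # prints "CER = 33.33%"
```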

What is a good CER value?

There is no single benchmark for defining a good CER value, as it highly depends on the use case. Different scenarios and complexity (e.g. printed vs. handwritten, type of content, etc.) can result in varying OCR performances.

A 2009 review of OCR accuracy in large-scale Australian newspaper digitization programs proposed these benchmarks (for printed text):

  • Good OCR accuracy: CER 1-2% (i.e. 98-99% accurate)
  • Average OCR accuracy: CER 2-10% (i.e. 90-98% accurate)
  • Poor OCR accuracy: CER > 10% (i.e. below 90% accurate)

For complex cases involving handwritten texts with highly heterogeneous and out-of-vocabulary content (e.g., application forms), a CER value as high as around 20% can be considered satisfactory.

2. Word Error Rate (WER)

WER calculation is also based on the concept of Levenshtein distance, where we count the minimum number of word-level operations required to transform the ground truth text into the OCR output.

WER is generally well-correlated with CER (provided error rates are not excessively high), although the absolute WER value is expected to be higher than the CER value.

Let’s look at an example:

  • Ground truth text: Docsumo is a document AI company.
  • OCR output text: Docsumo iz document AI campany.

The edits that separate the OCR output from the ground truth are:

  1. iz instead of is (substitution)
  2. Missing a (deletion)
  3. campany instead of company (substitution)

Number of transforms (T) = 1+1+1 = 3

Number of correct words (C) = 3

WER = T / (T + C) × 100% = 3/6 × 100% = 50%
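The word-level calculation works the same way as the character-level one, with the edit distance taken over lists of words. The sketch below assumes that words are separated by whitespace and that punctuation stays attached to its word (so "company." is one token); other tokenization choices would change the result.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two sequences (here: lists of words)."""
    rows, cols = len(reference) + 1, len(hypothesis) + 1
    table = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        table[i][0] = i  # delete every reference word
    for j in range(cols):
        table[0][j] = j  # insert every hypothesis word
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            table[i][j] = min(
                table[i - 1][j] + 1,         # deletion
                table[i][j - 1] + 1,         # insertion
                table[i - 1][j - 1] + cost,  # substitution or match
            )
    return table[-1][-1]


def word_error_rate(ground_truth, ocr_output):
    """WER as a percentage of the number of ground-truth words."""
    reference = ground_truth.split()
    hypothesis = ocr_output.split()
    if not reference:
        raise ValueError("ground truth must contain at least one word")
    return edit_distance(reference, hypothesis) / len(reference) * 100


wer = word_error_rate(
    "Docsumo is a document AI company.",
    "Docsumo iz document AI campany.",
)
print(f"WER = {wer:.0f}%")  # prints "WER = 50%"
```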

While CER and WER are handy, they are not bulletproof performance indicators of OCR models. This is because the quality and condition of the original documents (e.g., handwriting legibility, image DPI, etc.) play a role that is at least as important as the OCR model itself.

What affects OCR accuracy and how to improve it?

Even the best OCR engine produces poor results when the quality of the input image or document is too low. In this section, we look at why source image quality matters and at the techniques that improve it, and with it OCR accuracy.

Things that affect OCR accuracy:

1. Quality of original document

If the original document is:

  • Wrinkled, torn, or otherwise damaged,
  • Faded or otherwise aged,
  • Discolored,
  • Noisy,
  • Smudged (or the text is otherwise obfuscated or distorted),
  • Printed with low-contrast or colored ink (purple, blue, and red provide low contrast; black ink provides the highest contrast),
  • Rendered with nonstandard fonts or in human handwriting, or
  • Printed on specific types of paper that decrease crispness and contrast between the background and foreground in the resulting scan,

then any scanned image of it (regardless of the quality of the scan) places an extra burden on the OCR engine when it recognizes the text.

2. Quality of scanned image

In a good-quality scanned image:

  • Characters are distinguishable from the background, with sharp character borders and high contrast.
  • Characters and words are well aligned, which ensures proper character, word, and line segmentation.
  • The resolution is good and the page is properly aligned.
  • There is little noise.

These features make a document better from the OCR perspective.

Let’s now look more closely at the possible image quality issues and the ways to tackle them.

How to improve OCR accuracy?

All is not lost even if we don’t start with high-quality documents. Here are a few steps that can improve the accuracy of OCR data extraction:

1. Scaling of the image

Image rescaling is important for OCR. For most OCR engines, images at 200-300 DPI (Dots Per Inch) work best. A DPI below 200 can produce inaccurate results, while a DPI above 600 unnecessarily increases the size of the output image without improving its quality.
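As a small worked example, the pixel dimensions needed to bring a scan to a target DPI follow directly from the ratio of the two DPI values. The sketch below only does this arithmetic. The function names and the default target of 300 DPI are our own assumptions, and the actual resampling would be done with an image library.

```python
def scale_factor_for_dpi(current_dpi, target_dpi=300):
    """Factor by which width and height must be multiplied to reach target_dpi."""
    if current_dpi <= 0:
        raise ValueError("current_dpi must be positive")
    return target_dpi / current_dpi


def rescaled_size(width_px, height_px, current_dpi, target_dpi=300):
    """Pixel dimensions of the image after rescaling to target_dpi."""
    factor = scale_factor_for_dpi(current_dpi, target_dpi)
    return round(width_px * factor), round(height_px * factor)


# An A4 page scanned at 150 DPI is about 1240 x 1754 pixels.
new_size = rescaled_size(1240, 1754, current_dpi=150)
print(new_size)  # prints "(2480, 3508)"
```

Rescaling changes the pixel count, but it cannot bring back detail the scanner never captured, so scanning at a suitable DPI in the first place is preferable where possible.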

2. Increase contrast

Low contrast can result in poor OCR, so the contrast and density should be increased before data extraction. Both are vital factors to consider before running OCR on a scanned image, because they make the text stand out more clearly from the background.
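One simple way to increase contrast is linear contrast stretching, sketched below on a grayscale image stored as a list of rows with pixel values from 0 to 255. This is just one possible technique, chosen here for illustration; the function name is our own.

```python
def stretch_contrast(image):
    """Linearly map the darkest pixel to 0 and the brightest to 255."""
    flat = [pixel for row in image for pixel in row]
    low, high = min(flat), max(flat)
    if low == high:
        # A perfectly flat image has no contrast to stretch.
        return [row[:] for row in image]
    return [
        [round((pixel - low) * 255 / (high - low)) for pixel in row]
        for row in image
    ]


# A faded scan that only uses the gray levels 100..160.
faded = [[100, 120],
         [140, 160]]
stretched = stretch_contrast(faded)
print(stretched)  # prints "[[0, 85], [170, 255]]"
```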

3. Binarizing the image

Binarization means converting a color image into an image that consists of only black and white pixels (black pixel value = 0, white pixel value = 255).

There are several algorithms for converting a color image to a binarized image, ranging from simple thresholding to adaptive thresholding (different threshold values for different regions).

This step helps the engine interpret the data, because any black pixel is part of a character that needs to be recognized, while any white pixel is part of the background. Binarizing an image can also reduce the size of the input.
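A global threshold can be chosen automatically with Otsu's method, which picks the gray level that best separates dark pixels from light ones. The sketch below is one example of simple thresholding, not the method of any specific OCR engine. It assumes the image has already been converted to grayscale and is stored as a list of rows with integer values from 0 to 255.

```python
def otsu_threshold(image):
    """Gray level that maximizes the between-class variance (Otsu's method)."""
    histogram = [0] * 256
    for row in image:
        for pixel in row:
            histogram[pixel] += 1
    total = sum(histogram)
    total_sum = sum(value * count for value, count in enumerate(histogram))

    best_threshold, best_variance = 0, -1.0
    dark_count, dark_sum = 0, 0
    for threshold in range(256):
        # Pixels <= threshold form the dark class, the rest the light class.
        dark_count += histogram[threshold]
        dark_sum += threshold * histogram[threshold]
        light_count = total - dark_count
        if dark_count == 0:
            continue
        if light_count == 0:
            break
        mean_dark = dark_sum / dark_count
        mean_light = (total_sum - dark_sum) / light_count
        variance = dark_count * light_count * (mean_dark - mean_light) ** 2
        if variance > best_variance:
            best_variance, best_threshold = variance, threshold
    return best_threshold


def binarize(image, threshold):
    """Map pixels at or below the threshold to black (0), the rest to white (255)."""
    return [[0 if pixel <= threshold else 255 for pixel in row] for row in image]


gray = [[30, 40, 200],
        [35, 210, 220],
        [25, 205, 215]]
threshold = otsu_threshold(gray)
binary = binarize(gray, threshold)
```

A single global threshold works when lighting is even across the page. For scans with shadows or uneven backgrounds, the adaptive thresholding mentioned above is usually the better fit.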

4. Noise removal

Noise can drastically reduce the quality of the information retrieved. The main objective of the noise removal stage is to smooth the image by removing small dots or patches whose intensity differs sharply from the rest of the image. This process is also called denoising.
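A median filter is one common way to remove isolated specks. The sketch below applies a 3x3 median filter to a grayscale image stored as a list of rows. The choice of filter and window size is an assumption for illustration, and border pixels simply use whichever neighbors exist.

```python
def median_filter(image):
    """Replace each pixel with the median of its 3x3 neighborhood."""
    height, width = len(image), len(image[0])
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            # Collect the neighbors that fall inside the image.
            window = sorted(
                image[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
            )
            row.append(window[len(window) // 2])
        result.append(row)
    return result


# A white patch of page with a single dark speck in the middle.
speckled = [[255, 255, 255],
            [255, 0, 255],
            [255, 255, 255]]
cleaned = median_filter(speckled)
```

The same filter can erase small genuine marks such as periods or the dots on an "i" when the window is large relative to the text, so the window size needs to suit the scan resolution.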

5. Skew correction

Skewed images directly impact the line segmentation of an OCR engine, reducing its accuracy. Scanned documents often become skewed (aligned at an angle to the horizontal) during scanning because of human negligence or other alignment errors.

Deskewing removes the skew by rotating the image by the same angle as its skew but in the opposite direction: clockwise if the image is tilted anti-clockwise, and anti-clockwise if it is tilted clockwise.
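The idea of rotating by the same angle in the opposite direction can be illustrated on a handful of points. The sketch below assumes we already have (x, y) coordinates sampled along one text baseline, estimates the skew angle from their least-squares slope, and rotates the points back. It works on coordinates only. Rotating the actual image pixels would be done with an image library, and production deskewing typically estimates the angle from the whole page rather than one line.

```python
import math


def estimate_skew_degrees(points):
    """Angle of the least-squares line through (x, y) baseline points."""
    count = len(points)
    mean_x = sum(x for x, _ in points) / count
    mean_y = sum(y for _, y in points) / count
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    return math.degrees(math.atan2(sxy, sxx))


def rotate_points(points, degrees):
    """Rotate points about the origin by the given angle."""
    theta = math.radians(degrees)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    return [(x * cos_t - y * sin_t, x * sin_t + y * cos_t) for x, y in points]


def deskew(points):
    """Estimate the skew and undo it by rotating the opposite way."""
    angle = estimate_skew_degrees(points)
    return angle, rotate_points(points, -angle)


# A baseline tilted by 5 degrees.
slope = math.tan(math.radians(5))
baseline = [(x, x * slope) for x in (0, 100, 200, 300)]
angle, straightened = deskew(baseline)
print(round(angle, 2))  # prints "5.0"
```

After the correction, every point on the baseline has a y value of approximately zero, meaning the line of text is horizontal again.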

The OCR engine outputs not only the text but also the position of the text in the document. IDP then uses this output to extract key-value information and tables. Skew correction is therefore an important step, because it affects the accuracy of IDP further down the line.

How IDP is more accurate than OCR

Organizations have taken three major approaches to extracting data from their documents: manual processing, OCR, and rule- and template-based extraction. They use either one of these or a combination of them.

It is hard to process documents using these existing tools, because:

  • Rules and workflows for each type of document often need to be hard-coded, then updated whenever a specific format changes or when multiple formats are involved.
  • These documents may come from third-party sources, so their format is outside the organization's control and can be very diverse.

None of the three approaches can deal with the variety and complexity of documents coming from diverse sources, and they struggle to provide consistency in the process.

Intelligent Document Processing uses document AI models and algorithms designed to automatically classify, extract, structure, and analyze information from business documents, accelerating automated document processing workflows.

Previously, document AI models relied on either pretrained CV models or NLP models and did not train jointly on textual and layout information, which resulted in relatively low accuracy. Along with the text, layout and style information is vital for understanding document images. Today, with advances in artificial intelligence, and more specifically the combined research of Computer Vision (CV) and Natural Language Processing (NLP) for extraction, IDP achieves highly accurate, state-of-the-art (SOTA) results.

IDP models the interactions between text and layout information across scanned document images, which is why it can extract information with high accuracy.

Want to see IDP in action? Try it now for free.

Written by
Amit Timalsina