Analysis and Benchmarking of OCR Accuracy for Data Extraction Models
This article explains how OCR tools work, which metrics measure OCR accuracy, which factors affect OCR models, and how to overcome the limitations of OCR.
Optical Character Recognition (OCR) is the technology that converts an image of text into machine-readable text. It is used for various data extraction solutions, including Intelligent Document Processing.
However, OCR does not understand the context of a document. It works by distinguishing text pixels from the background and matching patterns. This limitation can cause inaccuracies in the captured data, which directly affect the output of your data extraction model.
What is OCR Accuracy?
OCR accuracy is typically measured as a percentage, representing the ratio of correctly recognized characters to the total number of characters in the source material. High OCR accuracy means fewer errors in the recognized text, which leads to more reliable and usable digital documents.
Reliable OCR results are important for efficient document management for several reasons:
- Data Integrity: High OCR accuracy ensures the integrity of digitized data. Accurate conversion minimizes the risk of errors that can occur during manual data entry.
- Searchability and Retrieval: Accurate OCR results enhance the searchability of digital documents. When text is correctly recognized, users can run keyword searches and quickly locate specific information within large volumes of documents.
- Automation and Workflow Efficiency: Many document management systems rely on OCR technology to automate workflows. High OCR accuracy enables automation, such as automatic document indexing and routing. The aim is to reduce manual intervention and speed up processing times.
- Compliance and Legal Requirements: Accurate OCR is important for maintaining compliance in industries with strict regulatory requirements. It ensures that digitized documents are accurate representations of their original forms, which matters for audits, legal proceedings, and record-keeping.
Why is OCR Accuracy Important?
OCR accuracy matters to any enterprise evaluating OCR tools. Here are the main reasons:
1. Data Integrity
- Accuracy and Reliability: High OCR accuracy ensures that the text extracted from documents is a true and reliable representation of the original content.
- Error Reduction: Accurate OCR reduces the occurrence of errors that can arise from manual data entry.
2. Decision-Making
- Informed Decisions: Accurate and reliable data is important to effective decision-making.
- Efficiency in Analysis: High OCR accuracy makes analyzing large volumes of data efficient.
3. Cost Savings
- Reduced Manual Intervention: Accurate OCR reduces the need for manual data correction and validation.
- Operational Efficiency: Data entry, document management, and information retrieval processes become more efficient with high OCR accuracy.
- Compliance Costs: Maintaining accurate digital records is essential for regulatory compliance.
4. Time Savings
- Faster Processing: High OCR accuracy accelerates the document processing workflow.
- Improved Search and Retrieval: Accurate OCR results increase the searchability of digital documents.
- Streamlined Workflows: High OCR accuracy supports the automation of various tasks, such as indexing, classification, and data extraction.
How Does an OCR Model Work to Ensure Accuracy?
An OCR engine (also called OCR software) works through the following steps:
Step 1: Image acquisition
OCR uses a scanner to process the physical form of a document. Once all pages are copied, OCR generates a black-and-white (two-color/one-bit) version of the color or grayscale scanned document.
OCR is essentially a binary process: it recognizes things that are either there or not. If the original scanned image is perfect, any black it contains will be part of a character that needs to be recognized, while any white will be part of the background.
Reducing the image to black and white is the first stage in figuring out the text that needs processing.
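As a rough illustration of this reduction to black and white, the sketch below picks a global threshold with Otsu's method and maps every grayscale pixel to ink (1) or background (0). Otsu's method is one common choice and is an assumption here, not necessarily what any particular OCR engine uses. The function names and the tiny sample image are made up for the example.

```python
def otsu_threshold(pixels):
    """Return the gray level (0-255) that best separates dark from light pixels."""
    hist = [0] * 256
    for row in pixels:
        for value in row:
            hist[value] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_dark = 0    # number of pixels at or below the candidate threshold
    sum_dark = 0  # sum of their gray levels
    for t in range(256):
        w_dark += hist[t]
        sum_dark += t * hist[t]
        w_light = total - w_dark
        if w_dark == 0:
            continue
        if w_light == 0:
            break
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        # Between-class variance: large when the two groups are well separated.
        variance = w_dark * w_light * (mean_dark - mean_light) ** 2
        if variance > best_var:
            best_t, best_var = t, variance
    return best_t


def binarize(pixels):
    """Map grayscale pixels to 1 (ink) or 0 (background)."""
    t = otsu_threshold(pixels)
    return [[1 if value <= t else 0 for value in row] for row in pixels]


# Hypothetical 2x3 grayscale patch: dark text pixels on a light page.
image = [[20, 30, 220],
         [25, 210, 230]]
bitmap = binarize(image)
```

A fixed threshold (for example 128) would also work on clean scans. A data-driven threshold tends to cope better with faded or unevenly exposed pages.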
Step 2: Preprocessing
The OCR software first cleans the image and removes errors to prepare it for reading. These are some of its cleaning techniques:
- Deskewing, that is, rotating the scanned document slightly to fix alignment issues introduced during the scan.
- Removing any digital image spots or smoothing the edges of text images.
- Cleaning up boxes and lines in the image.
- Recognizing the script, for multi-language OCR technology.
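To make the spot-removal step concrete, here is a minimal sketch that deletes isolated ink pixels from a binarized image. It assumes the image is a list of rows with 1 for ink and 0 for background. The `despeckle` name and the rule "no ink among the eight neighbours" are illustrative choices, not a description of a specific product.

```python
def despeckle(bitmap):
    """Remove ink pixels that have no ink among their eight neighbours."""
    height, width = len(bitmap), len(bitmap[0])
    cleaned = [row[:] for row in bitmap]
    for y in range(height):
        for x in range(width):
            if bitmap[y][x] != 1:
                continue
            neighbours = sum(
                bitmap[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
                if (ny, nx) != (y, x)
            )
            if neighbours == 0:
                cleaned[y][x] = 0  # a lone speck, treated as scanner noise
    return cleaned


# The pixel in the top-left corner is a speck; the cluster on the right is text.
noisy = [[1, 0, 0, 0],
         [0, 0, 1, 1],
         [0, 0, 1, 0]]
clean = despeckle(noisy)
```

Real punctuation such as a period can be only a few pixels wide, so production systems usually filter on connected-component size rather than on single pixels.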
Step 3: Text recognition
This stage typically involves targeting one character, word, or text block at a time. Pattern recognition and feature extraction are the two main processes that the best OCR tools use to identify characters. Let’s look at these in turn:
A) Pattern recognition
Pattern recognition works by isolating a character image, called a glyph, and comparing it pixel by pixel with stored glyphs.
Pattern recognition works only if the stored glyph has a font and scale similar to the input glyph. This method works well with scanned images of documents typed in a known font.
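The idea can be sketched with tiny 3x3 bitmaps. The stored templates below are invented for the example and assume the input glyph has already been scaled to the same size. The score is simply the fraction of pixels that agree.

```python
# Hypothetical stored glyphs: "1" is ink, "0" is background.
TEMPLATES = {
    "I": ["010", "010", "010"],
    "L": ["100", "100", "111"],
    "T": ["111", "010", "010"],
}


def match_score(glyph, template):
    """Fraction of pixel positions where the glyph and the template agree."""
    pairs = [
        (g, t)
        for glyph_row, template_row in zip(glyph, template)
        for g, t in zip(glyph_row, template_row)
    ]
    return sum(g == t for g, t in pairs) / len(pairs)


def recognize(glyph):
    """Return the character whose stored template agrees with the glyph best."""
    return max(TEMPLATES, key=lambda ch: match_score(glyph, TEMPLATES[ch]))


# A "T" with one stray ink pixel in the bottom-right corner.
noisy_t = ["111", "010", "011"]
best = recognize(noisy_t)
```

Because the comparison is positional, the same letter in a different font or scale would score poorly, which is the limitation described above.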
B) Feature extraction
Feature extraction breaks down or decomposes the glyphs into features such as lines, closed loops, line direction, and line intersections. It then uses these features to find the best match or the nearest neighbor among its stored glyphs.
Most modern omnifont OCR programs (ones that recognize printed text in any font) work by feature extraction rather than pattern recognition, and most of them use Artificial Intelligence.
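A small sketch can show why features generalize better than pixel matching. It uses two hand-picked features, the number of closed loops and the ink density, and picks the nearest stored prototype. The feature set, the prototypes, and all function names are assumptions made for illustration. Real engines use far richer features or learn them.

```python
import math


def parse(rows):
    """Turn rows like "101" into lists of 0/1 pixels (1 = ink)."""
    return [[int(c) for c in row] for row in rows]


def count_loops(glyph):
    """Count background regions fully enclosed by ink (1 for O, 2 for 8)."""
    h, w = len(glyph), len(glyph[0])
    # Pad with background so everything outside the glyph is one region.
    grid = [[0] * (w + 2)]
    grid += [[0] + row + [0] for row in glyph]
    grid += [[0] * (w + 2)]
    seen, regions = set(), 0
    for y in range(h + 2):
        for x in range(w + 2):
            if grid[y][x] == 1 or (y, x) in seen:
                continue
            regions += 1
            stack = [(y, x)]
            seen.add((y, x))
            while stack:  # flood fill one background region
                cy, cx = stack.pop()
                for ny, nx in ((cy + 1, cx), (cy - 1, cx), (cy, cx + 1), (cy, cx - 1)):
                    if (0 <= ny < h + 2 and 0 <= nx < w + 2
                            and grid[ny][nx] == 0 and (ny, nx) not in seen):
                        seen.add((ny, nx))
                        stack.append((ny, nx))
    return regions - 1  # do not count the outside region


def features(glyph):
    """Feature vector: (closed loops, fraction of pixels that are ink)."""
    ink = sum(map(sum, glyph))
    return (count_loops(glyph), ink / (len(glyph) * len(glyph[0])))


PROTOTYPES = {
    "O": features(parse(["111", "101", "111"])),
    "I": features(parse(["010", "010", "010"])),
    "8": features(parse(["111", "101", "111", "101", "111"])),
}


def classify(glyph):
    """Nearest neighbour among the stored prototypes."""
    f = features(glyph)
    return min(PROTOTYPES, key=lambda ch: math.dist(f, PROTOTYPES[ch]))


# A larger "O" than the stored one: different size, same features.
big_o = parse(["1111", "1001", "1001", "1111"])
label = classify(big_o)
```

The larger "O" has a different pixel grid from the stored prototype but still has exactly one closed loop, so it lands nearest to "O".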
Step 4: Post-processing
An OCR program also analyzes the structure of a document image. It divides the page into elements such as blocks of text, tables, or images. The lines are divided into words and then into characters.
Once the characters have been singled out, the program does text recognition. After processing all likely matches, the program presents you with the recognized text. Depending on the factors we discuss later in the article, OCR output can potentially have errors.
How do we clean up this text before feeding it into the next pipeline stage?
One approach is to run the text through a spell checker that identifies misspellings and suggests alternatives.
More recent approaches use AI architectures to train word/sub-word-based language models, which are used for correcting OCR text output based on the context. This step increases the OCR accuracy.
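The spell-checker approach can be sketched with Python's standard `difflib` module. The vocabulary and the 0.75 similarity cutoff are hypothetical. A real pipeline would use a domain dictionary and tune the cutoff on its own data.

```python
import difflib

# Hypothetical domain vocabulary for invoices.
VOCABULARY = ["invoice", "total", "amount", "date", "payment"]


def correct_token(token, vocabulary=VOCABULARY, cutoff=0.75):
    """Replace a token with its closest vocabulary word, if one is close enough."""
    lowered = token.lower()
    if lowered in vocabulary:
        return token
    matches = difflib.get_close_matches(lowered, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else token


def correct_text(text):
    """Apply token-level correction to whitespace-separated OCR output."""
    return " ".join(correct_token(tok) for tok in text.split())


fixed = correct_text("lnvoice t0tal")
```

This kind of correction is context-free, so it can "fix" a rare but valid word into a common one. Language-model-based correction addresses that by looking at the surrounding words.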
Factors Affecting OCR Accuracy
1. Quality of original document
If the original document is:
- Wrinkled, torn, or otherwise damaged,
- Faded or otherwise aged,
- Discolored,
- Noisy,
- Smudged (or the text is otherwise obfuscated or distorted),
- Printed with low-contrast or colored ink (purple, blue, and red provide low contrast; black ink provides the highest contrast),
- Rendered with nonstandard fonts or in human handwriting, or
- Printed on specific types of paper that decrease crispness and contrast between the background and foreground in the resulting scan,
then any scanned image of it, regardless of the quality of the scan, places an extra burden on the OCR engine when it recognizes the text.
2. Quality of scanned image
In a good-quality scanned image:
- Characters are clearly distinguishable from the background, with sharp borders and high contrast.
- Characters and words are well aligned, which ensures proper character, word, and line segmentation.
- Image resolution is good and the page is straight.
- Noise is low.
3. Quality of input documents
- High Resolution: Scanning documents at a higher resolution (typically 300 DPI or higher) can capture finer details of the text, leading to better OCR accuracy.
- Consistency: Maintaining a consistent resolution across all scanned documents ensures uniform quality and improves the overall performance of OCR systems.
- Optimal Contrast: High contrast between text and background enhances OCR accuracy. Documents with clear, dark text on a light background or vice versa are easier for OCR software to process.
- Avoiding Shadows and Glare: Scanning documents in well-lit environments and avoiding shadows or glare can help maintain optimal contrast and improve OCR results.
4. Font type and size
- Standard Fonts: OCR software performs better with standard, well-defined fonts such as Arial, Times New Roman, and Calibri.
- Consistency: Using consistent fonts throughout a document can improve OCR accuracy, as the software can better predict and recognize the characters.
- Readable Size: Larger font sizes are generally easier for OCR software to recognize. Very small fonts may be difficult to read, leading to errors in character recognition.
- Minimum Size: Maintaining a minimum font size of 10-12 points can help ensure better OCR accuracy.
5. Language and Character Recognition Capabilities
- Multilingual OCR: OCR software with support for multiple languages can accurately recognize text in various languages.
- Special Characters and Accents: Languages with special characters, accents, or diacritics require OCR software that can accurately recognize these elements to maintain text integrity.
- Printed vs. Handwritten Text: OCR software is generally more accurate with printed text. Handwriting recognition is more challenging and requires advanced OCR systems trained specifically for this purpose.
- Training Data: Using OCR systems that have been trained on a diverse set of handwriting samples can improve accuracy in recognizing handwritten text.
6. Technological Factors
- Advanced Algorithms: Using OCR software with advanced recognition algorithms and machine learning capabilities can significantly improve accuracy.
- Regular Updates: Keeping OCR software updated with the latest improvements and bug fixes ensures optimal performance and accuracy.
- Image Enhancement: Utilizing image enhancement tools to preprocess documents before OCR can improve text clarity and recognition accuracy.
- Segmentation: Accurate segmentation of text from images and other elements in the document helps the OCR software focus on the relevant text, improving accuracy.
By addressing these factors and implementing best practices, organizations can enhance OCR accuracy and improve the overall efficiency of their document processing workflows.
How to improve OCR Accuracy?
1. Preprocessing techniques
- Binarization: Convert grayscale images to black-and-white to increase contrast between text and background. This process helps OCR software differentiate between characters and the background.
- Deskewing: Correct any skew or tilt in scanned documents to ensure text is properly aligned. Skewed text can cause OCR software to misinterpret characters.
- Denoising: Remove random noise or artifacts from the image that can interfere with text recognition.
- Removal of Background Patterns: Eliminate background patterns or textures that can confuse OCR software.
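As one example of these techniques, deskewing can be sketched as a two-step calculation: estimate the skew angle, then rotate by its negative. The sketch below assumes a few (x, y) points along a text baseline have already been found, for instance the lowest ink pixel in several columns. How those points are obtained is outside the sketch. It fits a least-squares line to get the angle.

```python
import math


def estimate_skew_degrees(baseline_points):
    """Angle of the least-squares line through (x, y) baseline points, in degrees."""
    n = len(baseline_points)
    mean_x = sum(x for x, _ in baseline_points) / n
    mean_y = sum(y for _, y in baseline_points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in baseline_points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in baseline_points)
    return math.degrees(math.atan2(sxy, sxx))


def rotate_point(x, y, degrees):
    """Rotate a point about the origin by the given angle."""
    r = math.radians(degrees)
    return (x * math.cos(r) - y * math.sin(r),
            x * math.sin(r) + y * math.cos(r))


# Hypothetical baseline that drifts 5 pixels for every 100 pixels of width.
points = [(0, 0), (100, 5), (200, 10), (300, 15)]
angle = estimate_skew_degrees(points)

# Rotating by the negative angle brings the baseline back to horizontal.
deskewed = [rotate_point(x, y, -angle) for x, y in points]
```

In practice the rotation is applied to the whole image with an imaging library. The point-wise rotation here only demonstrates that the estimated angle flattens the baseline.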
2. Training OCR models with diverse data sets
- Comprehensive Training: Train OCR models on a diverse set of fonts and sizes to improve their ability to recognize different text styles. Including a wide range of fonts in training data helps the model generalize better.
- Include Handwriting Samples: Incorporate samples of handwritten text to improve the model's ability to recognize handwriting, especially if the OCR application involves processing handwritten documents.
- Multilingual Training: Use training data that includes multiple languages and characters, including those with special symbols, accents, and diacritics.
3. Utilizing advanced algorithms and Machine Learning
- Convolutional Neural Networks (CNNs): CNNs can improve the recognition of complex characters and patterns. They are particularly effective for image-based tasks and can significantly enhance OCR accuracy.
- Recurrent Neural Networks (RNNs): Employ RNNs, especially Long Short-Term Memory (LSTM) networks, to recognize sequences of characters. RNNs are adept at handling sequential data, making them suitable for text recognition.
- Pre-trained Models: Utilize models trained on large datasets and fine-tune them on specific OCR tasks.
4. Implementing post-processing verification techniques
- Automated Correction: Use spell checkers and grammar correction tools to identify and correct errors in the recognized text.
- Contextual Analysis: Implement contextual analysis to verify the accuracy of recognized words within the context of the surrounding text.
- Manual Verification: Incorporate a human review process to verify and correct OCR results manually.
- Feedback Loops: Create feedback loops where corrections made by human reviewers are fed back into the OCR system to improve future accuracy. This iterative process helps the OCR model learn from its mistakes and improve over time.
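Manual verification and feedback loops are often combined by sending only low-confidence results to a reviewer and logging every correction. The sketch below assumes the OCR engine reports a confidence between 0 and 1 for each field. The `Field` class, the 0.9 threshold, and the log format are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Field:
    name: str
    value: str
    confidence: float  # assumed to be reported by the OCR engine, in [0, 1]


def split_for_review(fields, threshold=0.9):
    """Separate fields that pass straight through from those a human should check."""
    accepted = [f for f in fields if f.confidence >= threshold]
    needs_review = [f for f in fields if f.confidence < threshold]
    return accepted, needs_review


def record_correction(log, field, corrected_value):
    """Keep (OCR value, corrected value) pairs for later model retraining."""
    if corrected_value != field.value:
        log.append({"field": field.name, "ocr": field.value, "truth": corrected_value})
    return log


fields = [
    Field("invoice_number", "INV-1042", 0.99),
    Field("total", "1,2O0.00", 0.62),
]
accepted, needs_review = split_for_review(fields)

corrections = []
record_correction(corrections, needs_review[0], "1,200.00")
```

The correction log is the raw material for the feedback loop. How it is fed back (fine-tuning, rule updates, dictionary additions) depends on the OCR system.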
Metrics to Measure OCR Accuracy
The major objective metrics used to evaluate how reliable an OCR system is are given below.
1. Character Error Rate (CER)
CER calculation is based on the Levenshtein distance concept, where we count the minimum number of character-level operations required to transform the ground truth text into the OCR output.
Let’s look at an example:
- Ground truth text: 619375128
- OCR output text: 61g375Z8
The differences between the OCR output and the ground truth are:
- g instead of 9
- Missing 1
- Z instead of 2
Number of transforms (T) = 1+1+1 = 3
Number of correct characters (C) = 6
CER = T/(T+C) * 100%
= 3/9 * 100% = 33.33%
Here T + C = 9 is the number of characters in the ground truth.
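The calculation above can be reproduced with a short Levenshtein implementation. This sketch divides the edit distance by the length of the ground truth, which matches T/(T+C) in this example because the OCR output contains no extra inserted characters. The function names are illustrative.

```python
def levenshtein(reference, hypothesis):
    """Minimum number of substitutions, deletions, and insertions between two sequences."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            cost = 0 if ref_item == hyp_item else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def cer(ground_truth, ocr_output):
    """Character error rate as a fraction of the ground-truth length."""
    return levenshtein(ground_truth, ocr_output) / len(ground_truth)


edits = levenshtein("619375128", "61g375Z8")
rate = cer("619375128", "61g375Z8")
print(f"{edits} edits, CER = {rate:.2%}")
```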
What is a good CER value?
There is no single benchmark for defining a good CER value, as it depends on the use case. Different scenarios and complexity (e.g., printed vs. handwritten, type of content, etc.) can result in varying OCR performances.
An article published in 2009 on the review of OCR accuracy in large-scale Australian newspaper digitization programs came up with these benchmarks (for printed text):
- Good OCR accuracy: CER 1-2% (i.e. 98-99% accurate)
- Average OCR accuracy: CER 2-10%
- Poor OCR accuracy: CER > 10% (i.e. below 90% accurate)
For complex cases involving handwritten texts with highly heterogeneous and out-of-vocabulary content (e.g., application forms), a CER value as high as around 20% can be considered satisfactory.
2. Word Error Rate (WER)
WER calculation is also based on the concept of Levenshtein distance, where we count the minimum number of word-level operations required to transform the ground truth text into the OCR output.
WER is generally well-correlated with CER (provided error rates are not excessively high), although the absolute WER value is expected to be higher than the CER value.
Let’s look at an example:
- Ground truth text: Docsumo is a document AI company.
- OCR output text: Docsumo iz document AI campany.
The differences between the OCR output and the ground truth are:
- iz instead of is
- Missing a
- campany instead of company
Number of transforms (T) = 1+1+1 = 3
Number of correct words (C) = 3
WER = T/(T+C) * 100%
= 3/6 * 100% = 50%
Here T + C = 6 is the number of words in the ground truth.
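The same edit-distance idea works at word level once both texts are split into tokens. The sketch below splits on whitespace only, so "company." keeps its period. That is a simplifying assumption, and real evaluations usually normalize punctuation and case first.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two sequences of tokens."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_tok in enumerate(reference, start=1):
        current = [i]
        for j, hyp_tok in enumerate(hypothesis, start=1):
            cost = 0 if ref_tok == hyp_tok else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def wer(ground_truth, ocr_output):
    """Word error rate as a fraction of the number of ground-truth words."""
    ref_words = ground_truth.split()
    hyp_words = ocr_output.split()
    return edit_distance(ref_words, hyp_words) / len(ref_words)


rate = wer("Docsumo is a document AI company.",
           "Docsumo iz document AI campany.")
print(f"WER = {rate:.0%}")
```

A single wrong character makes a whole word wrong, which is why WER is normally higher than CER for the same output.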
3. Field-Level Accuracy
Field-level accuracy measures the correctness of each data field extracted by the OCR system. This metric is particularly important for structured and semi-structured documents where specific fields (e.g., names, dates, account numbers) must be accurately recognized.
High field-level accuracy indicates that the OCR system reliably extracts each piece of relevant information without errors.
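A minimal sketch of field-level accuracy, assuming both the expected and the extracted values are available as dictionaries keyed by field name and that a field counts as correct only on an exact match. The field names and values are invented for the example.

```python
def field_level_accuracy(expected, extracted):
    """Share of expected fields whose extracted value matches exactly."""
    correct = sum(1 for name, value in expected.items()
                  if extracted.get(name) == value)
    return correct / len(expected)


expected = {
    "name": "Jane Doe",
    "date": "2024-01-05",
    "account_number": "619375128",
    "total": "1200.00",
}
extracted = {
    "name": "Jane Doe",
    "date": "2024-01-05",
    "account_number": "61g375Z8",  # OCR errors make the whole field wrong
    "total": "1200.00",
}
accuracy = field_level_accuracy(expected, extracted)
```

Exact match is strict: one wrong character fails the field. Some teams normalize whitespace, case, or number formats before comparing.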
4. Confusion Matrix
A confusion matrix is a useful tool for evaluating the performance of an OCR system. It provides a detailed breakdown of true positives, true negatives, false positives, and false negatives for each character or word recognized.
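At character level, a confusion matrix can be sketched as a count of (true character, recognized character) pairs. The sketch assumes each ground-truth string and its OCR output are already aligned and of equal length, so only substitutions are counted. Handling insertions and deletions would need an alignment step first.

```python
from collections import Counter


def confusion_counts(pairs):
    """Count (true_char, ocr_char) pairs over aligned, equal-length strings."""
    counts = Counter()
    for truth, ocr in pairs:
        if len(truth) != len(ocr):
            continue  # needs alignment first; skipped in this sketch
        counts.update(zip(truth, ocr))
    return counts


def top_confusions(counts, n=3):
    """Most frequent pairs where the recognized character is wrong."""
    errors = Counter({pair: c for pair, c in counts.items() if pair[0] != pair[1]})
    return errors.most_common(n)


# Hypothetical aligned samples: ground truth and OCR output.
samples = [("2019", "Z0l9"), ("1290", "l29O")]
counts = confusion_counts(samples)
worst = top_confusions(counts, n=1)
```

Frequent confusions such as 1 read as l or 0 read as O point to where targeted training data or post-processing rules would help most.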
5. Recall and Precision
Recall: Recall measures the proportion of actual positive instances correctly identified by the OCR system.
Recall = True Positives/(True Positives + False Negatives)
Precision: Precision measures the proportion of the system's positive identifications that are actually correct. High precision means the system's output is mostly correct.
Precision = True Positives/(True Positives + False Positives)
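The two formulas can be applied to extracted fields as follows. This sketch assumes one particular way of counting: a true positive is an extracted (field, value) pair that appears in the ground truth, a false positive is an extracted pair that does not, and a false negative is a ground-truth pair the system failed to produce. Other definitions are possible.

```python
def precision_recall(expected, extracted):
    """Precision and recall over sets of (field, value) pairs."""
    true_positives = len(expected & extracted)
    false_positives = len(extracted - expected)
    false_negatives = len(expected - extracted)
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return precision, recall


expected = {
    ("name", "Jane Doe"),
    ("date", "2024-01-05"),
    ("total", "100.00"),
    ("account_number", "619375128"),
}
extracted = {
    ("name", "Jane Doe"),
    ("date", "2024-01-05"),
    ("total", "180.00"),  # wrong value: a false positive, and the true total is missed
}
precision, recall = precision_recall(expected, extracted)
```

Here two of three extracted pairs are right, and two of four expected pairs were found.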
6. Processing Time
Processing time refers to the duration it takes for the OCR system to convert a document into machine-readable text. This metric is critical for evaluating the efficiency and scalability of the OCR solution. Lower processing times indicate a more efficient OCR system, which can handle larger datasets and deliver quicker results.
7. Error Rate
The error rate measures the frequency of inaccuracies in the OCR output. It can be calculated as the percentage of incorrect characters or words relative to the total number of characters or words processed. A lower error rate signifies higher OCR accuracy and reliability.
By monitoring these metrics, organizations can effectively evaluate and improve the performance of their OCR tools, ensuring high accuracy, efficiency, and reliability in their document processing workflows.
How IDP is more accurate than OCR
Organizations have taken three major approaches to extracting data from their documents: manual processing, OCR, and rules- and template-based extraction. They use either one of these or a combination of them.
It is hard to process documents using existing tools because:
- Rules and workflows for each document type often need to be hard-coded and updated with changes to the specific format or when dealing with multiple formats.
- These documents may come from third-party sources, so their format is outside our organization's control, and it can be very diverse.
None of those three systems can deal with the variety and complexity of documents coming from diverse sources, and they struggle to provide consistency in the process.
Intelligent Document Processing uses document AI models and algorithms designed to automatically classify, extract, structuralize, and analyze information from business documents, accelerating automated document processing workflows.
Previously, document AI models relied on either pre-trained CV models or pre-trained NLP models but did not train jointly on textual and layout information, which resulted in relatively low accuracy. Layout and style information are as vital for document image understanding as the text itself.
Today, with advances in Artificial Intelligence, and more specifically the combined research in Computer Vision (CV) and Natural Language Processing (NLP) for extraction, IDP achieves highly accurate, state-of-the-art (SOTA) results.
IDP combines model interactions between text and layout information across scanned document images for information extraction. Because of this, we can extract information with high accuracy.
Case Study: How Docsumo Helped Arbor Automate Insurance Compliance with 99% Accuracy
Arbor Realty Trust, Inc. is a real estate investment trust and direct lender, specializing in loan origination and servicing for multifamily, seniors housing, healthcare, and other commercial real estate assets across the United States.
Before Using Docsumo
Arbor's team manually extracted data from unstructured documents like Acord forms and flood certificates, which was time-consuming and error-prone. Elevation certificates, in particular, required extensive processing. There was little to no validation of captured data, necessitating double manual entry to ensure accuracy.
After Implementing Docsumo
Docsumo’s intelligent document processing software automated data extraction, reducing manual intervention to exception handling only. Certificates are now digitized with key data extracted automatically. Docsumo's algorithms classify documents and validate data in real-time with custom rules, achieving a 95%+ straight-through processing (STP) rate and 99%+ data extraction accuracy. Processing time for unstructured data improved by 10x, with 95% touchless processing.
The Challenge
Arbor received insurance claim documents in various formats and needed to manually capture data for analysis. Extracting data from semi-structured Acord forms and elevation certificates was labor-intensive, with a 20% error rate. The varying structures of elevation certificates added complexity, and the lack of validation procedures increased the risk of errors.
The Docsumo Solution
Docsumo provided API-based integration to ingest documents, pre-process them, and queue them for data extraction. Their OCR module handled various fonts, layouts, and resolutions, accurately extracting data from tables.
Proprietary NLP-based classification categorized key-value pairs, while a rule-based validation engine ensured data accuracy. The extracted data, formatted in JSON, was easily integrated into downstream software for analysis.
“Amongst others, the biggest advantage of partnering with Docsumo is the data capture accuracy they’re able to deliver. We’re witnessing a 95%+ STP rate, that means we don’t even have to look at risk assessment documents 95 out of 100 times, and the extracted data is directly pushed into the database.”
— Howard Leiner, CTO, Arbor Realty Trust
Conclusion: How Does OCR Accuracy Improve Document Processing
OCR accuracy ensures data integrity, reduces manual intervention, and enhances decision-making processes. High OCR accuracy minimizes errors, saving costs and time and improving overall operational efficiency.
AI and machine learning advancements are paving the way for more sophisticated OCR systems. Future trends include improved handwriting recognition, multilingual support, real-time processing, and the integration of OCR with other AI technologies to enhance document processing capabilities.
Optimizing OCR accuracy requires a combination of high-quality input documents, advanced preprocessing techniques, robust OCR models, and effective post-processing verification. By using Docsumo to refine these elements continually, you can achieve superior accuracy and efficiency in your document workflows.
Frequently Asked Questions
What types of documents can OCR technology process?
OCR can process various documents, including printed text, handwritten notes, and structured forms. It can handle various formats like PDFs, scanned images, and digital photographs.
How does OCR handle different languages and scripts?
Advanced OCR systems support multiple languages and scripts, including those with special characters and accents. Some OCR tools are specifically trained to recognize and process various languages accurately.
What are common challenges faced by OCR technology?
Poor document quality, complex layouts, and unusual fonts can affect OCR accuracy. Handwritten text and low-contrast images can also pose significant challenges for OCR systems.