Portable Document Format (PDF) files have become ubiquitous since the format was introduced in 1993. Adobe designed PDF in the 1990s with the goal of making any file look exactly the same no matter what screen it is viewed on. This was a massive advantage at a time when the main objective was to send documents digitally and have the receiving party see the exact same document when printed.
Why is it necessary to extract data from PDF files?
Businesses exchange a lot of information with each other via PDF files:
- Purchase orders
- Packing lists
- Bank statements
- Pay stubs &
Most of these documents are generated digitally by software and shared via email as PDF files. The problem arises when the receiving business needs to consume these documents digitally. The only option people and enterprises are left with is to manually copy text from the PDF files, paste it into MS Word or Excel, and take it from there. This process is not foolproof and is prone to all kinds of errors. That's why enterprises often have to outsource document processing or install document data capture software on their premises.
This has created a massive $30Bn document data capture software industry and a much larger data entry BPO industry, both of which specialize in getting data out of unstructured formats (PDF, paper or images) and into structured formats (JSON/XML/CSV/Excel).
Why is extracting data from PDF files so difficult?
The main issue is that a PDF document carries no markup or hierarchy of data.
A PDF file stores characters without any information about what that data represents (e.g. “Invoice No: 12345”, where “Invoice No” represents the “invoice_number_key” and “12345” represents the “invoice_number_value”).
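To make that gap concrete, here is a minimal Python sketch. It assumes the text has already been pulled out of the PDF as a plain string, and it only handles the simple "Key: value" shape from the example above. The helper name `split_key_value` and the regular expression are illustrative assumptions, not part of any PDF library or of Docsumo.

```python
import re

# Hypothetical pattern for lines shaped like "Key: value".
# Real documents need far more than this; it only illustrates the idea.
KEY_VALUE_PATTERN = re.compile(r"^\s*(?P<key>[^:]+?)\s*:\s*(?P<value>.+?)\s*$")


def split_key_value(line):
    """Split a 'Key: value' line into a dict, or return None if it does not match."""
    match = KEY_VALUE_PATTERN.match(line)
    if match is None:
        return None
    return {"key": match.group("key"), "value": match.group("value")}


# What the PDF gives you: a bare run of characters with no roles attached.
raw_text = "Invoice No: 12345"

# What downstream software needs: the same characters with their roles made explicit.
print(split_key_value(raw_text))
```

A rule like this breaks as soon as the label and the value sit on different lines or in different table cells, which is why the interpretation step is the hard part.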
The problem is even more complicated when it comes to images (PNG or JPG) or images converted to PDF files. In the case of images, the character-level data is also lost and needs to be recovered using OCR, which is never 100% accurate.
In both PDFs and images, the information about what the data represents needs to be interpreted in order to convert it into a structured format. This has led to the rise of advanced computer vision and deep learning software (including our software, Docsumo) that tries to classify data as key-value pairs, tables and entities.
How to extract data from PDF to Excel?
There are 3 main options: enter the data manually, outsource to a data entry BPO, or use automated data extraction software such as Docsumo.
1. Manually enter data
If you have only a few PDF files and this is a one-time task, the best option is to type the data out yourself or find a virtual assistant on Upwork to do it for you.
If you have text-based PDF files, you should be able to copy and paste most of the text. For tables, you can use Tabula, which is open source software.
2. Outsource to a data entry BPO
If you need to extract data on a regular basis, you can look at outsourcing to data entry providers in a country like India. They hire low-cost (~$4 to $6/hour as of 2019) data entry operators who manually open each file and type the corresponding data into Excel. Outsourcing comes with its own challenges, since you need to spend time hiring the right provider and then managing the process.
3. Automated data extraction software (Automated PDF Scraper)
Due to advancements in AI, you can now train an intelligent OCR solution such as Docsumo to automatically capture data from PDF files. The steps to set up a production-ready system are:
a. Training from samples - Upload documents and annotate the data you want to capture. Usually for repeating formats, the software learns with just a few samples.
b. Email integration/DMS integration to send data - Set up forwarding rules on your email client to automatically send emails with PDF attachments. You can also use Zapier or the API integration to push data.
c. CSV download or API integration to push data - You can use the API to send the extracted data to other software or a database.
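The last step can be sketched with the Python standard library alone. This is only an illustration of turning extracted fields into CSV text or a JSON payload. The `records` list, its field names and the file names are hypothetical placeholders, and the sketch does not call Docsumo's API or any other service.

```python
import csv
import io
import json

# Hypothetical extraction results: file names and field names are
# placeholders for illustration, not real output from any product.
records = [
    {"source_file": "invoice_a.pdf", "invoice_number": "12345"},
    {"source_file": "invoice_b.pdf", "invoice_number": "12346"},
]


def to_csv(rows):
    """Serialize a list of flat dicts to CSV text, using the first row's keys as the header."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def to_json(rows):
    """Serialize the same rows as JSON text, e.g. for the body of an API request."""
    return json.dumps(rows, indent=2)


print(to_csv(records))
print(to_json(records))
```

CSV suits spreadsheet users who want to open the result in Excel, while JSON is usually the easier format to hand to another program.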
At Docsumo, we use a combination of neural networks and reverse image search to extract data from documents. For repeating formats, reverse image search works best, as it finds repeating patterns in the document and is more robust than zonal OCR. Zonal OCR fails when the document has, say, a longer table, is rotated, or has extra text on some lines. For varying formats such as invoices, neural networks work better since they are able to generalize across different representations of key-value pairs.
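The fragility of zonal OCR can be shown with a small Python sketch. It assumes OCR has already produced words with page coordinates, and all of the sample words and coordinates are made up for illustration. The label-anchored lookup used for comparison is a deliberately simple stand-in. It is not reverse image search, not a neural network, and not how Docsumo works.

```python
from typing import NamedTuple


class Word(NamedTuple):
    text: str
    x: float  # left edge, in arbitrary page units
    y: float  # top edge, in arbitrary page units


def read_zone(words, x_min, y_min, x_max, y_max):
    """Zonal lookup: return the text of every word whose corner lies inside a fixed rectangle."""
    return [w.text for w in words if x_min <= w.x <= x_max and y_min <= w.y <= y_max]


def read_after_label(words, label, y_tolerance=2.0):
    """Label-anchored lookup: return the words to the right of `label` on the same line."""
    anchors = [w for w in words if w.text == label]
    if not anchors:
        return []
    anchor = anchors[0]
    same_line = [w for w in words if abs(w.y - anchor.y) <= y_tolerance and w.x > anchor.x]
    return [w.text for w in sorted(same_line, key=lambda w: w.x)]


# Hypothetical OCR output for two invoices from the same template.
# The second has one extra table row, which pushes the total down the page.
short_invoice = [Word("Widget", 10, 100), Word("40.00", 200, 100),
                 Word("Total", 10, 120), Word("40.00", 200, 120)]
long_invoice = [Word("Widget", 10, 100), Word("40.00", 200, 100),
                Word("Gadget", 10, 120), Word("60.00", 200, 120),
                Word("Total", 10, 140), Word("100.00", 200, 140)]

# Rectangle tuned by hand on the short invoice.
total_zone = dict(x_min=190, y_min=115, x_max=260, y_max=125)

print(read_zone(short_invoice, **total_zone))   # reads the total
print(read_zone(long_invoice, **total_zone))    # reads a line item instead
print(read_after_label(long_invoice, "Total"))  # still reads the total
```

The fixed rectangle silently returns the wrong number once the layout shifts, whereas any method that locates the value relative to the document's content keeps working.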
Moreover, Docsumo comes with an amazing edit and review tool, which makes it very easy to specify the fields that you want to capture.
Hi, I’m Rushabh.
Every day I speak to people who use our product to automate their workflows. Contact us and we will be happy to see how we can improve your processes.