Portable Document Format files, better known as PDFs, have become ubiquitous since the format was introduced in 1993. Adobe designed PDF in the 1990s so that a file would look exactly the same no matter what screen it was viewed on. This was a major advantage at a time when the main goal was to send documents digitally and have the receiving party see exactly the same document when it was printed.

Why is it necessary to extract data from PDF files?

Businesses exchange a lot of information with each other via PDF files:

- Invoices
- Purchase orders
- Packing lists
- Forms
- Bank statements
- Pay stubs
- Contracts

Most of these documents are generated digitally by some software and shared via email as PDF files. The problem arises when the receiving business needs to consume these documents digitally. This has created a massive $30Bn document data capture software industry and a much larger data entry BPO industry, both of which specialize in getting data out of unstructured formats (PDF, paper or images) and into structured formats (JSON/XML/CSV/Excel).

Why is extracting data from PDF files so difficult?

The main issue is that a PDF document carries no markup or hierarchy of data. A PDF file stores characters without any information about what that data represents. For example, in “Invoice No: 12345”, nothing in the file says that “Invoice No” is the “invoice_number_key” and “12345” is the “invoice_number_value”.
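To make the key-value idea concrete, here is a minimal sketch that recovers pairs from plain text already copied out of a PDF. It assumes every field sits on its own line in a "Label: value" shape, which real documents often do not follow. The pattern and the function name `extract_key_values` are illustrative assumptions, not part of any particular product.

```python
import re
from typing import Dict

# Assumed layout: one "Label: value" pair per line. Real documents vary widely.
KEY_VALUE = re.compile(
    r"^[ \t]*(?P<key>[^:\n]+?)[ \t]*:[ \t]*(?P<value>\S.*?)[ \t]*$",
    re.MULTILINE,
)

def extract_key_values(text: str) -> Dict[str, str]:
    """Map each 'Label: value' line to a snake_case key and its value."""
    fields: Dict[str, str] = {}
    for match in KEY_VALUE.finditer(text):
        # Normalize the label: lowercase, runs of non-alphanumerics become "_".
        key = re.sub(r"[^a-z0-9]+", "_", match.group("key").lower()).strip("_")
        fields[key] = match.group("value")
    return fields

sample = "Invoice No: 12345\nDate: 2019-03-01\nThank you for your business"
print(extract_key_values(sample))
# {'invoice_no': '12345', 'date': '2019-03-01'}
```

The last line of the sample has no colon, so it is skipped. A rule like this breaks as soon as a label and its value sit on different lines or in different table cells, which is why fixed text patterns alone rarely suffice.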

The problem is even more complicated when it comes to images (PNG or JPG) or images converted to PDF files. In the case of images, the character-level data is also lost and needs to be recovered using OCR, which is never 100% accurate.

In both PDFs and images, the information about what the data represents needs to be interpreted in order to convert it into a structured format. This has led to the rise of advanced computer vision and deep learning software (including our software, Docsumo) that tries to classify data as key-value pairs, tables and entities.

How to extract data from PDF to Excel?

There are three main options: enter the data manually, outsource to a data entry BPO, or use automated data extraction software such as Docsumo.

1. Manually enter data

If you have only a few PDF files and this is a one-time task, the best option is to type the data out yourself or find a virtual assistant on Upwork to do it for you. If you have text-based PDF files, you should be able to copy and paste most of the text. For tables, you can use Tabula, an open source tool.

2. Outsource to a data entry BPO

If you need to extract data on a regular basis, you can look at outsourcing to data entry providers in a country like India. They hire low-cost (~$4 to $6/hour as of 2019) data entry operators who manually open each file and type the corresponding data into Excel. Outsourcing comes with its own challenges, since you need to spend time hiring the right provider and then managing the process.

3. Automated data extraction software

Due to advancements in AI, you can now train an intelligent OCR solution such as Docsumo to automatically capture data from PDF files. The steps to set up a production-ready system are:

a. Training from samples - Upload documents and annotate the data you want to capture. Usually for repeating formats, the software learns with just a few samples.

b. Email integration/DMS integration to send data - Set up forwarding rules in your email client to automatically forward emails with PDF attachments. You can also use Zapier or the API integration to push data.

c. CSV download or API integration to push data - You can use the API to send the extracted data to other software or a database.
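As a rough illustration of step c, the sketch below turns already-extracted records into CSV and JSON using only the Python standard library. The field names (`invoice_number`, `total`) and the helper `records_to_csv` are hypothetical. This is not Docsumo's API or export format, only an example of what "structured output" can look like downstream.

```python
import csv
import io
import json
from typing import Dict, List

def records_to_csv(records: List[Dict[str, str]], fieldnames: List[str]) -> str:
    """Serialize extracted records to CSV text with a fixed column order."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=fieldnames,
        extrasaction="ignore",  # drop fields that are not in the column list
        lineterminator="\n",
    )
    writer.writeheader()
    for record in records:
        writer.writerow(record)  # missing fields are written as empty cells
    return buffer.getvalue()

# Hypothetical output of an extraction step
records = [
    {"invoice_number": "12345", "total": "99.50"},
    {"invoice_number": "12346", "total": "18.00", "notes": "not exported"},
]

print(records_to_csv(records, ["invoice_number", "total"]))
print(json.dumps(records, indent=2))
```

Fixing the column order up front keeps the CSV stable even when some documents yield extra or missing fields, which makes it easier to load into Excel or a database.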

Training Docsumo to specify fields that need to be captured

At Docsumo, we use a combination of neural networks and reverse image search to extract data from documents. For repeating formats, reverse image search works best, as it finds repeating patterns in the document and is more robust than Zonal OCR. Zonal OCR fails when, say, the document has a longer table, is rotated, or has extra text on some lines. For varying formats such as invoices, neural networks work better, since they are able to generalize across different representations of key-value pairs.
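The fragility of Zonal OCR can be shown with a toy example. The sketch below assumes OCR has already produced words with page coordinates, and reads a field from a fixed rectangle. The `Word` type, the coordinates and `TOTAL_ZONE` are all made up for illustration. It shows the failure mode only, not how Docsumo or any specific OCR engine works.

```python
from typing import List, NamedTuple, Tuple

class Word(NamedTuple):
    text: str
    x: float  # horizontal position on the page
    y: float  # vertical position, growing downward

Zone = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)

def read_zone(words: List[Word], zone: Zone) -> str:
    """Return the text of all words whose position falls inside a fixed zone."""
    x_min, y_min, x_max, y_max = zone
    inside = [w for w in words if x_min <= w.x <= x_max and y_min <= w.y <= y_max]
    inside.sort(key=lambda w: (w.y, w.x))  # reading order: top to bottom, left to right
    return " ".join(w.text for w in inside)

# Zone drawn around the invoice total on a sample document
TOTAL_ZONE: Zone = (400, 500, 500, 520)

short_invoice = [Word("Total", 300, 510), Word("99.50", 450, 510)]

# Same format with one extra table row, which pushes the total further down
long_invoice = [
    Word("Widget", 50, 510), Word("12.00", 450, 510),
    Word("Total", 300, 550), Word("99.50", 450, 550),
]

print(read_zone(short_invoice, TOTAL_ZONE))  # 99.50
print(read_zone(long_invoice, TOTAL_ZONE))   # 12.00 (a line item, not the total)
```

On the longer invoice the zone silently returns a line item price instead of the total. This is the kind of error that approaches matching on document content, rather than fixed coordinates, aim to avoid.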

Moreover, Docsumo comes with an edit and review tool that makes it very easy to specify the fields you want to capture. You can see a short demo below and, if you like it, try it for free.