Extracting raw data from PDFs

Extracting data from PDFs is a challenge because PDFs were designed as ‘electronic paper’. Their objective is to make the contents look the same across computers. A PDF loses the raw content definition: it records characters, shapes and the precise coordinates of its contents, but not the structure (tables, paragraphs) that those contents came from.

Tools for extracting data from PDFs

1) Tabula
One interesting tool I came across while looking for ways to extract data from PDFs is Tabula. It is mainly meant for extracting tables from a PDF document. The good things I have found about it are that it is free and open source, and that it is quick to get started with.

To use Tabula, you can download it from http://tabula.technology/. A Java runtime needs to be installed to run it. It runs on localhost and its GUI is quite intuitive.
There are two extraction methods in this tool: a) Stream and b) Lattice.
Stream looks for white space between columns, whereas Lattice looks for boundary lines.

The PDF I am trying to extract has lines separating the columns, so Lattice seems to work better than Stream.
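
To get a feel for what "looking for white space between columns" means, here is a toy sketch in Python. It is not Tabula's actual implementation, just an illustration of the idea on plain text rows: it finds character positions that are blank in every row and treats those runs as column separators. The function names and the `min_gap` parameter are my own, made up for this example.

```python
def whitespace_column_gaps(lines, min_gap=2):
    """Return (start, end) character ranges that are blank in every line."""
    width = max(len(line) for line in lines)
    padded = [line.ljust(width) for line in lines]
    # blank[i] is True when column i holds a space in all rows
    blank = [all(row[i] == " " for row in padded) for i in range(width)]

    gaps = []
    start = None
    # The trailing False closes a run that reaches the end of the line
    for i, is_blank in enumerate(blank + [False]):
        if is_blank and start is None:
            start = i
        elif not is_blank and start is not None:
            if i - start >= min_gap:
                gaps.append((start, i))
            start = None
    return gaps


def split_row(line, gaps):
    """Cut one text row into cells at the given gap ranges."""
    cells = []
    prev = 0
    for start, end in gaps:
        cells.append(line[prev:start].strip())
        prev = end
    cells.append(line[prev:].strip())
    return cells


if __name__ == "__main__":
    rows = [
        f"{'Name':<10}{'Qty':<6}{'Price'}",
        f"{'Widget':<10}{'10':<6}{'2.50'}",
        f"{'Gadget':<10}{'3':<6}{'10.00'}",
    ]
    gaps = whitespace_column_gaps(rows)
    for row in rows:
        print(split_row(row, gaps))
```

A table with ruled lines but tightly packed text gives this kind of approach very little white space to work with, which may be why Lattice does better on my PDF.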

Insights on this tool.

1) If I need to extract paragraphs of text instead of a table, it doesn’t work accurately. It treats all data as tables. But then the name of the tool is Tabula, so it is meant only for tables. I will have to look for a different tool for this.

2) There is an auto-detect feature for finding tables. I tried it with both the Stream and Lattice options, but it doesn’t seem to work accurately. It usually misses the last column of the table. One possible reason might be how it marks the table: the selection ends very close to the table’s outer border, so the last column may be getting cut off.

I have edited the auto-detected table markings to include more margin on the sides and top. Let’s see if it works better this time. (Waiting... it takes a while and is not that fast, which is another thing to note about it.)

Yes, it works better after editing the table selections on the PDF. To use this tool, I would run auto-detection of tables and then widen the selections on the sides. That is more time-consuming than auto-detection alone, but less so than selecting every table by hand.
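
If the detected selections are available as coordinates, the widening step could be scripted instead of done by hand. Below is a minimal Python sketch of that idea. It assumes each selection is a rectangle given as top, left, bottom, right in PDF points measured from the top-left corner of the page; the `Area` type, the `pad_area` function and the 612 x 792 page size (US Letter) are assumptions of mine for illustration, not something taken from Tabula.

```python
from typing import NamedTuple


class Area(NamedTuple):
    """A table selection in PDF points (assumed origin: top-left of page)."""
    top: float
    left: float
    bottom: float
    right: float


def pad_area(area, margin, page_width, page_height):
    """Grow a selection by `margin` on every side, clamped to the page."""
    return Area(
        top=max(0.0, area.top - margin),
        left=max(0.0, area.left - margin),
        bottom=min(page_height, area.bottom + margin),
        right=min(page_width, area.right + margin),
    )


if __name__ == "__main__":
    detected = Area(top=100, left=50, bottom=700, right=600)
    # Hypothetical page size: US Letter, 612 x 792 points
    print(pad_area(detected, margin=10, page_width=612, page_height=792))
```

Clamping to the page keeps the padded rectangle valid even when a table already sits near the page edge.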

So this tool seems good to me for tables. I will now look for a tool to extract text from PDFs.

 

2) Ghostscript

I stumbled on this Stack Overflow thread about text extraction from PDFs: http://stackoverflow.com/questions/3650957/how-to-extract-text-from-a-pdf. It lists many tools, and some of them look worth checking further.

I tried text extraction with Ghostscript using the commands below.
Windows:
gswin64c -sDEVICE=txtwrite -o output.txt input.pdf
Linux:
gs -sDEVICE=txtwrite -o output.txt input.pdf
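
The same two commands can be wrapped in a small Python script so that the right executable is picked per platform. This is only a sketch: it assumes Ghostscript is installed and on the PATH under the executable names used above (`gswin64c` on Windows, `gs` elsewhere), and the function names are my own.

```python
import subprocess
import sys


def ghostscript_command(pdf_path, txt_path, platform=sys.platform):
    """Build the txtwrite command line for the given platform."""
    # sys.platform is "win32" on Windows; other platforms are assumed to use "gs"
    exe = "gswin64c" if platform.startswith("win") else "gs"
    return [exe, "-sDEVICE=txtwrite", "-o", txt_path, pdf_path]


def extract_text(pdf_path, txt_path):
    """Run Ghostscript; raises CalledProcessError if it exits with an error."""
    subprocess.run(ghostscript_command(pdf_path, txt_path), check=True)


if __name__ == "__main__":
    # Only prints the command; call extract_text() to actually run it
    print(" ".join(ghostscript_command("input.pdf", "output.txt")))
```

Keeping the command construction separate from the call makes it easy to check what would be run before pointing it at a folder of PDFs.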

 

Sources:
1) https://www.propublica.org/nerds/item/heart-of-nerd-darkness-why-dollars-for-docs-was-so-difficult
