Musings

Day 1:

In Windows environment variables, there is no space after the semicolon that separates the entries in a variable's value. Strange that I hadn't noticed it before.
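To make the semicolon convention concrete, here is a minimal Python sketch that splits a Windows-style value into its entries. The sample value and the helper name split_windows_value are made up for illustration; a real PATH will differ from machine to machine.

```python
def split_windows_value(value: str) -> list[str]:
    """Split a Windows-style variable value on ';' and drop empty entries."""
    return [entry for entry in value.split(";") if entry]


# Hypothetical sample value, written the way Windows stores it:
# entries joined by ';' with no space after the separator.
sample = r"C:\Windows\system32;C:\Windows;C:\Program Files\Java\bin"

for entry in split_windows_value(sample):
    print(entry)
```

Spaces inside an entry (as in "Program Files") are part of the path; only the semicolon acts as a separator.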

I came across the error “Unbound Classpath Container” in Eclipse while trying to import an old project. I hadn't seen this error before either. It was caused by the JRE version the project was referring to. After I updated the JRE version, the error was resolved.

Now things seem to be up and running. Let's see what new things I come across in this project.

I have run into a library called XLLoop in my project. It is for programming with Excel. I am not yet sure what exactly it does, so I am trying to find good documentation for it (http://xlloop.sourceforge.net/, http://xlloop.sourceforge.net/javadoc/index.html).

 

 

PostScript & Ghostscript

 

PostScript is a page description language (PDL) developed by Adobe Systems. It is primarily a language for printing documents on laser printers, but it can be adapted to produce images on other types of devices. PostScript became a standard for desktop publishing.

All major printer manufacturers make printers that contain or can be loaded with PostScript software, which also runs on all major operating system platforms. A PostScript file can usually be identified by its “.ps” suffix.

Users can convert PostScript files to the Adobe Portable Document Format (PDF).

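PostScript describes graphics in an object-oriented (vector) way, meaning that it treats images, including fonts, as collections of geometrical objects rather than as bit maps. "Object-oriented" here refers to the graphics model, not to object-oriented programming.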

The principal advantage of object-oriented (vector) graphics over bit-mapped graphics is that object-oriented images take advantage of high-resolution output devices whereas bit-mapped images do not. A PostScript drawing looks much better when printed on a 600-dpi printer than on a 300-dpi printer. A bit-mapped image looks the same on both printers.
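The arithmetic behind that comparison can be sketched in a few lines of Python. This is only an illustration under simple assumptions: a hypothetical 300-pixel-wide bitmap printed 1 inch wide, with printer resolution counted per axis. The function names are my own.

```python
def device_dots(length_inches: float, printer_dpi: int) -> int:
    """Number of printer dots available along a length on the page."""
    return round(length_inches * printer_dpi)


def dots_per_image_pixel(image_pixels: int, length_inches: float, printer_dpi: int) -> float:
    """How many printer dots (per axis) each bitmap pixel gets spread over."""
    return device_dots(length_inches, printer_dpi) / image_pixels


for dpi in (300, 600):
    # A vector shape 1 inch wide can use every available dot.
    # A 300-pixel bitmap still carries only 300 samples of detail.
    print(dpi, device_dots(1.0, dpi), dots_per_image_pixel(300, 1.0, dpi))
```

At 600 dpi the vector shape is drawn with twice as many dots per inch, while each bitmap pixel is simply spread over 2 dots per axis, so the bitmap gains no detail.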

 

Ghostscript is an interpreter for PostScript and Portable Document Format (PDF) files. Ghostscript can read a PostScript or PDF file and display the results on the screen or convert them into a form you can print on a non-PostScript printer.

Text extraction from a PDF file with Ghostscript

Windows:
gswin64c -sDEVICE=txtwrite -o output.txt input.pdf
Linux:
gs -sDEVICE=txtwrite -o output.txt input.pdf

Extracting raw data from PDFs

Extracting data from PDFs is a challenge because PDFs were designed as ‘electronic paper’. Their objective is to make the contents look the same across computers. A PDF loses the raw structure of the content: it only records characters, shapes, and the precise coordinates of its contents.

Tools for extracting data from PDFs

1) Tabula
One interesting tool I came across while searching for ways to extract data from PDFs is Tabula. This tool is mainly meant for extracting tables from a PDF document. The good things I have found about it are that it is free and open source and that it is quick to get started with.

To use Tabula, you can download it from http://tabula.technology/. A Java runtime must be installed to run it. It runs on localhost, and its GUI is quite intuitive.
There are two extraction methods in this tool: a) Stream and b) Lattice.
Stream looks for white space between columns, whereas Lattice looks for boundary lines.

The PDF I am trying to extract from has lines separating the columns, so Lattice seems to work better than Stream.
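To get a feel for the Stream idea of splitting on white space between columns, here is a toy Python sketch. It is not Tabula's actual algorithm (Tabula works on character positions taken from the PDF). It assumes the table has already been turned into fixed-width text lines, and the function names and sample rows are made up.

```python
def find_column_gaps(padded: list[str]) -> list[tuple[int, int]]:
    """Return (start, end) character ranges that are blank in every row."""
    width = len(padded[0])
    gaps = []
    start = None
    for i in range(width):
        is_blank = all(row[i] == " " for row in padded)
        if is_blank and start is None:
            start = i                      # a blank run begins here
        elif not is_blank and start is not None:
            gaps.append((start, i))        # the blank run ended just before i
            start = None
    if start is not None:
        gaps.append((start, width))
    return gaps


def split_on_gaps(lines: list[str]) -> list[list[str]]:
    """Split fixed-width rows into cells at the shared blank columns."""
    width = max(len(line) for line in lines)
    padded = [line.ljust(width) for line in lines]
    gaps = find_column_gaps(padded)
    # Cells are the stretches between consecutive gaps.
    edges = [0] + [pos for gap in gaps for pos in gap] + [width]
    spans = [(edges[i], edges[i + 1]) for i in range(0, len(edges), 2)]
    return [[row[a:b].strip() for a, b in spans if a < b] for row in padded]


# Hypothetical sample table with white space (no ruling lines) between columns.
rows = [
    "Name    Qty  Price",
    "Apple   3    1.20",
    "Banana  12   0.50",
]
for cells in split_on_gaps(rows):
    print(cells)
```

A Lattice-style approach would instead look for the drawn lines between cells, which is why it suits a PDF whose columns are separated by rules.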

Insights on this tool.

1) If I need to extract paragraphs of text instead of a table, it doesn't work accurately. It treats all data as tables. But then, the name of the tool is Tabula, so it is meant only for tables. I have to look for a different tool for this.

2) There is an auto-detect feature for finding tables. I tried it with both the Stream and Lattice options, but it doesn't seem to work accurately: it usually misses the last column of the table. One possible reason is how it marks the table. The marking sits very close to the table's ending border, so the last column might get cut off.

I have edited the auto-detected table markings to include more empty space on the sides and top. Let's see if it works better this time. (Waiting... it takes a while. It is not that fast, which is another thing to note about it.)

Yes, it works better after editing the table selections on the PDF as described above. To use this tool, I will run auto-detection of tables and then widen the selections on the sides. This takes more time than auto-detection alone, but less than selecting every table by hand.
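The manual widening I did in the GUI amounts to padding each detected rectangle and keeping it inside the page. A minimal Python sketch of that idea follows. The Rect type, the coordinate convention (points, origin at the top-left corner) and the margin values are all assumptions for illustration and are not part of Tabula's interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Hypothetical selection box in page coordinates (origin at top-left)."""
    top: float
    left: float
    bottom: float
    right: float


def widen_selection(rect: Rect, side: float, top: float,
                    page_width: float, page_height: float) -> Rect:
    """Add margin on the left, right and top, clamped to the page bounds."""
    return Rect(
        top=max(0.0, rect.top - top),
        left=max(0.0, rect.left - side),
        bottom=min(page_height, rect.bottom),   # bottom edge left unchanged
        right=min(page_width, rect.right + side),
    )


# Made-up detected box on a 612 x 792 point page.
detected = Rect(top=100, left=50, bottom=400, right=500)
print(widen_selection(detected, side=10, top=10, page_width=612, page_height=792))
```

The extra margin on the right is the part that matters for the missing last column.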

So this tool seems good to me for tables. I will now search for a tool to extract text from PDFs.

 

2) Ghostscript

I stumbled on this Stack Overflow question about extracting text from a PDF: http://stackoverflow.com/questions/3650957/how-to-extract-text-from-a-pdf. It lists many tools, and some of them look worth checking further.

I tried text extraction with Ghostscript using the command below.
Windows:
gswin64c -sDEVICE=txtwrite -o output.txt input.pdf
Linux:
gs -sDEVICE=txtwrite -o output.txt input.pdf
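If the same extraction has to run on both Windows and Linux, the two commands above can be wrapped in a small Python helper. This is only a sketch: it assumes Ghostscript is installed and on the PATH under the executable names used above, and the function names are my own.

```python
import platform
import shutil
import subprocess
from typing import Optional


def build_txtwrite_command(pdf_path: str, txt_path: str,
                           system: Optional[str] = None) -> list[str]:
    """Build the Ghostscript txtwrite command line for the given platform."""
    system = system or platform.system()
    # Executable names taken from the commands above.
    executable = "gswin64c" if system == "Windows" else "gs"
    return [executable, "-sDEVICE=txtwrite", "-o", txt_path, pdf_path]


def extract_text(pdf_path: str, txt_path: str) -> None:
    """Run Ghostscript to write the text of pdf_path into txt_path."""
    command = build_txtwrite_command(pdf_path, txt_path)
    if shutil.which(command[0]) is None:
        raise FileNotFoundError(f"{command[0]} was not found on the PATH")
    # check=True raises CalledProcessError if Ghostscript exits with an error.
    subprocess.run(command, check=True)


# Usage (not executed here):
#   extract_text("input.pdf", "output.txt")
```

Passing the arguments as a list avoids shell quoting problems when a file name contains spaces.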

 

Sources:
1) https://www.propublica.org/nerds/item/heart-of-nerd-darkness-why-dollars-for-docs-was-so-difficult