
Tabular data from PDFs with the help of Ruby

Sometimes I need to get data out of tables buried in PDFs, so I've written a tiny adaptable Ruby hack for when copying and pasting doesn't cut it - exploiting a couple of excellent libraries. Here's an example.

(Admittedly not the most elegant of scripts, but hey! it gets the job done.)

What I do first here is just load up the relevant libraries. The PDF reader I use is pdf-reader (https://github.com/yob/pdf-reader), installable from the command line with a simple 'sudo gem install pdf-reader' if you are using Ruby 1.9. (Drop the sudo if you're on Windows.)

Next up is to instantiate the input PDF and the output table. Straightforward enough.
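For reference, the setup might look roughly like this. It is a sketch under assumptions, not the original script: the output file name and the sample row are made up, and open_pdf is a hypothetical helper that only loads pdf-reader when it is called, so the CSV part runs on its own.

```ruby
require 'csv'
require 'tmpdir'

# Hypothetical helper: opens the input PDF with the pdf-reader gem.
# The gem is required lazily, so this file loads even without it.
def open_pdf(path)
  require 'pdf-reader'
  PDF::Reader.new(path)
end

# Hypothetical output location; the post does not name the real file.
OUTPUT_PATH = File.join(Dir.tmpdir, 'pdf_table_sketch.csv')

CSV.open(OUTPUT_PATH, 'w') do |table|
  # Every array pushed onto the CSV object becomes one row in the file.
  # This row is made-up sample data in the shape described in the post.
  table << ['Some Region', 'Country Name', '12.3', '10.0-15.0']
end
```

Using the block form of CSV.open means the file is closed automatically once the rows are written.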

By examining the PDF I found the data I was interested in on pages 42 through 69, so a simple call to pages gives me those - pdf_reader.pages[42..69]. (pages returns an ordinary zero-indexed array, so the range may need shifting by one depending on how you count pages.) I'll go through the text on each of them line by line and decide if they contain what I want based on regular expressions. These are my simple rules (in this example):
  1. If a line contains only letters and whitespace, i.e. it matches /^[a-z|\s]*$/i (the regular expression is case-insensitive due to the added 'i'), it means, in this case, that it is a geographic location, and I update the current area variable.
  2. If not, I try to read the line as a data line. A data line in my case looks something like this:
    Country Name12.3(10.0-15.0)12.3(10.0-15.0)12.3(10.0-15.0)12.3(10.0-15.0)
    To get the country name I split the line on digits with line.split(/[0-9]/) and take the first element of the result with first, which is everything before the first digit. I remove this element from the line (line.sub(country,'')) so that my data line now looks like this:
    12.3(10.0-15.0)12.3(10.0-15.0)12.3(10.0-15.0)12.3(10.0-15.0)
    This is easy to split into the values and the confidence intervals with split(/[\(|\)]/), which simply splits on either an opening or a closing parenthesis. Since I want to keep all the data I just prepend the area and country name to the resulting array with unshift(country).unshift(area) before writing the line to the output file.
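Put together, the two rules can be sketched like this. It is my reconstruction from the description above rather than the original script: the method names and the "Some Region" heading are made up, blank lines are skipped as an extra precaution, and the regular expressions are slightly simplified (the | inside a character class is a literal pipe, so I left it out).

```ruby
# Rule 1: a line of only letters and whitespace is an area heading.
AREA_LINE = /\A[a-z\s]+\z/i

# Turns an array of text lines into rows of
# [area, country, value, interval, value, interval, ...].
def rows_from_lines(lines)
  area = nil
  rows = []
  lines.each do |raw|
    line = raw.strip
    next if line.empty? # my addition: skip blank lines

    if line.match?(AREA_LINE)
      area = line
      next
    end

    # Rule 2: everything before the first digit is the country name.
    country = line.split(/[0-9]/).first.to_s
    # Drop the name, then split on either kind of parenthesis.
    values = line.sub(country, '').split(/[()]/)
    rows << values.unshift(country).unshift(area)
  end
  rows
end

# Accepts anything whose elements respond to #text, such as
# pdf_reader.pages[42..69] from pdf-reader.
def rows_from_pages(pages)
  rows_from_lines(pages.flat_map { |page| page.text.lines })
end

sample = ['Some Region', 'Country Name12.3(10.0-15.0)12.3(10.0-15.0)']
rows_from_lines(sample).each { |row| puts row.join(',') }
# prints: Some Region,Country Name,12.3,10.0-15.0,12.3,10.0-15.0
```

Keeping the line rules separate from the page handling means only rows_from_lines needs to change when the next PDF uses a different layout.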
Voila! Suddenly I have lots and lots of data in a format that can be manipulated and played around with. Next time I need data out of a PDF I'll just change rules 1 and 2 above depending on the formatting used.

(PS: This code will not work on Ruby 1.8 unless you install the fastercsv gem - and even then I'm not sure...)
