Sometimes I need to get data out of tables buried in PDFs, so I've written a tiny adaptable Ruby hack for when copying and pasting doesn't cut it - exploiting a couple of excellent libraries. Here's an example.
(Admittedly not the most elegant of scripts, but hey! it gets the job done.)
The first thing I do is load the relevant libraries. (The pdf-reader I use is this one: https://github.com/yob/pdf-reader - installable from the command line with 'sudo gem install pdf-reader' if you are using Ruby 1.9; drop the sudo if you're on Windows.)
Next up is to instantiate the input PDF and the output table. Straightforward enough.
By examining the PDF I found the data I was interested in on pages 42 through 69, so a simple call to pages gives me those - pdf_reader.pages[42..69]. I'll go through the text on each of them line by line and decide whether a line contains what I want based on regular expressions. These are my simple rules (in this example):
- If a line contains only letters and whitespace, i.e. it matches /^[a-z|\s]*$/i (the trailing 'i' makes the regular expression case-insensitive), it is, in this case, a geographic area, and I update the current area variable.
- If not, I try to read the line as a data line. A data line in my case looks something like this:
Country Name12.3(10.0-15.0)12.3(10.0-15.0)12.3(10.0-15.0)12.3(10.0-15.0)
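The two rules can be sketched as a small classifier. This is only an illustration under my own assumptions: the helper name classify, the constant AREA_LINE and the sample lines are made up for the example, while the regular expression is the one quoted in the first rule.

```ruby
# Rule 1: a line made up only of letters and whitespace names a geographic area.
# Inside [...] the '|' is a literal pipe character, not alternation, so the
# class also accepts '|'. Because of the '*', an empty string matches as well.
AREA_LINE = /^[a-z|\s]*$/i

# Hypothetical helper: returns :area or :data for one line of page text.
def classify(line)
  line =~ AREA_LINE ? :area : :data
end

current_area = nil
sample_lines = ["Northern Europe", "Norway12.3(10.0-15.0)12.3(10.0-15.0)"]

sample_lines.each do |line|
  if classify(line) == :area
    current_area = line.strip        # rule 1: remember the area
  else
    puts "#{current_area}: #{line}"  # rule 2: treat as a data line
  end
end
```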
To get the country name I split the line at the first digit with a line.split(/[0-9]/) and take the first element in the result with first. I remove this element from the line (line.sub(country,'')) so that my data line now looks like this:
12.3(10.0-15.0)12.3(10.0-15.0)12.3(10.0-15.0)12.3(10.0-15.0)
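As a rough sketch of that step, assuming the name never contains a digit (split_country is a hypothetical helper name, not something from the original script):

```ruby
# Hypothetical helper: split a data line into the country name and the rest.
def split_country(line)
  country = line.split(/[0-9]/).first  # everything before the first digit
  # The first argument to sub is a String, so the name is matched literally
  # and only its first occurrence is removed.
  rest = line.sub(country, '')
  [country, rest]
end

country, rest = split_country("Country Name12.3(10.0-15.0)12.3(10.0-15.0)")
puts country
puts rest
```

Note that an empty line would make first return nil, so in practice blank lines need to be skipped before this step.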
This is easy to split into the values and the confidence intervals with a split(/[\(|\)]/) - that simply splits on either opening or closing parenthesis. Since I want to keep all the data I just prepend the area and country name to the resulting array with an unshift(country).unshift(area) before writing the line to the output file.
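A minimal sketch of the split and the two unshift calls, with parse_row as a hypothetical helper name and placeholder values for the area and country:

```ruby
# Hypothetical helper: turn the numeric part of a data line into one output row.
def parse_row(area, country, data)
  # Split on either parenthesis; Ruby drops the trailing empty string that
  # the final ')' would otherwise leave behind.
  fields = data.split(/[\(|\)]/)
  # unshift returns the array itself, so the calls can be chained.
  fields.unshift(country).unshift(area)
end

row = parse_row("Some Area", "Country Name", "12.3(10.0-15.0)12.3(10.0-15.0)")
puts row.join(",")
# prints: Some Area,Country Name,12.3,10.0-15.0,12.3,10.0-15.0
```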
Voilà! Suddenly I have lots and lots of data in a format that can be manipulated and played around with. Next time I need data out of a PDF I'll just change the two rules above depending on the formatting used.
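Putting the steps together, a minimal end-to-end sketch might look like the following. It is a reconstruction from the description above, not the original script. The helper extract_rows, the FakePage stand-in and the sample text are my own assumptions, and skipping blank lines is an extra guard I added because the area pattern also matches an empty line.

```ruby
require 'csv'

AREA_LINE = /^[a-z|\s]*$/i  # rule 1: letters and whitespace only

# Hypothetical helper. `pages` is any collection whose elements respond to
# #text (pdf-reader's page objects do); `out` is anything that accepts `<<`
# with an array, such as a CSV handle.
def extract_rows(pages, out)
  area = nil
  pages.each do |page|
    page.text.each_line do |raw|
      line = raw.strip
      next if line.empty?              # assumption: blank lines carry no data
      if line =~ AREA_LINE
        area = line                    # rule 1: remember the current area
      else
        country = line.split(/[0-9]/).first
        fields  = line.sub(country, '').split(/[\(|\)]/)
        out << fields.unshift(country).unshift(area)  # rule 2: data row
      end
    end
  end
  out
end

# Stand-in for PDF pages so the sketch runs without a PDF file.
FakePage = Struct.new(:text)
pages = [FakePage.new("Some Area\n\nCountry Name12.3(10.0-15.0)12.3(10.0-15.0)\n")]

puts CSV.generate { |csv| extract_rows(pages, csv) }
```

With the pdf-reader gem installed, pages could instead come from PDF::Reader.new('input.pdf').pages[42..69] (the file name is a placeholder) and out from a CSV.open('output.csv', 'w') block. Since pages returns an ordinary Ruby array, the range is zero-based, so it is worth checking that the indexes line up with the page numbers you expect.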
(PS: This code will not work on Ruby 1.8 unless you install the fastercsv gem - and even then I'm not sure...)