PDF to Excel
testing links:
For programmers:
Blog post intro
Please don't hit this installation with automated scrapers yet. If you want to do that, install it on your own site.
This is also suitable for gluing into a crowdsourcing project.
Source code
The JSON of the raw parse is dumped into the Excel comment field of every cell that has text in it, so you can just ignore the actual cell content.
If you're careful, you can try making the request with a full-size bounding box for page 0. It might work (for some PDFs), which means you can automate this in 2 GET requests. We'll work on making this simpler. The code you get back is the MD5 hex checksum of the PDF you ask it for.
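If you want to predict or verify that code on the client side, here is a minimal Python sketch. It assumes the checksum is taken over the raw bytes of the PDF file; the page does not say whether it is the file contents or the URL that gets hashed, so check against a real response first. The function name pdf_md5_hex is made up for this example.

```python
import hashlib


def pdf_md5_hex(pdf_bytes):
    """Return the MD5 hex digest (32 lowercase hex chars) of the given bytes.

    Assumption: the service hashes the raw PDF bytes. If it hashes
    something else (e.g. the URL), pass that in as bytes instead.
    """
    return hashlib.md5(pdf_bytes).hexdigest()
```

To use it, read the PDF in binary mode (open(path, "rb").read()) and compare the result with the code the service hands back.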
The source code requires pdftohtml to be in /usr/local/bin (the one with the -xml option; pdf2html won't do), and the following Perl modules:
- use CGI qw/param/;
- use LWP::Simple;
- use File::Temp;
- use Digest::MD5 qw/md5_hex/;
- use XML::Simple;
- use JSON qw/to_json/;
- use Spreadsheet::WriteExcel;
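Before installing on your own site, you may want to check that all of the above is in place. The following is a rough Python sketch, not part of the project. It tests each module by running `perl -MModule -e 1` and checks that /usr/local/bin/pdftohtml exists and is executable. It does not check that the binary supports -xml. The names missing_dependencies and perl_module_loads are made up for this example.

```python
import os
import subprocess

# Path and module list taken from the requirements above.
PDFTOHTML = "/usr/local/bin/pdftohtml"
PERL_MODULES = [
    "CGI",
    "LWP::Simple",
    "File::Temp",
    "Digest::MD5",
    "XML::Simple",
    "JSON",
    "Spreadsheet::WriteExcel",
]


def perl_module_loads(module):
    """Return True if `perl -MModule -e 1` exits with status 0."""
    try:
        result = subprocess.run(
            ["perl", "-M" + module, "-e", "1"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except OSError:
        # perl itself could not be started
        return False
    return result.returncode == 0


def missing_dependencies(module_loads=perl_module_loads, binary=PDFTOHTML):
    """Return the names of everything that appears to be missing."""
    missing = []
    if not (os.path.isfile(binary) and os.access(binary, os.X_OK)):
        missing.append(binary)
    missing.extend(m for m in PERL_MODULES if not module_loads(m))
    return missing
```

Calling missing_dependencies() with no arguments returns an empty list when everything was found. The module_loads parameter is only there so the check can be exercised without Perl installed.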
If you like this, you may want to sign up for notifications from the LinkedGov project run by friends.