On the topic of OCR, you could also check out specialized Optical
Music Recognition systems. In the university department where I wrote
my diploma thesis, Gamera was used for this task.
Cheers, Daniel
2010/4/3 Luke Peterson <luke.peterson(a)gmail.com>:
PDFsam -- PDF Split-And-Merge -- is a handy open-source tool
(http://www.pdfsam.org/).
But its title is its featureset, for the most part. It allows you to reorder
PDFs, pull pages out, add pages in, rotate pages 90, 180, 270 degrees, etc.
It's command-line driven, but there's also a GUI console.
It's got a Windows installer, but should run anywhere Java is available.
Sounds like on top of the scanning and organizing solution, you need to
figure out some OCR application to extract metadata from each of the PDFs in
a large-scale way.
If you're planning to put these out for public consumption, you can use
Google to assist you in your scanning and indexing:
http://www.labnol.org/software/convert-scanned-pdf-images-to-text-with-goog…
Alternatively, the open-source OCR world is getting better fast. Check out
OCRopus (http://en.wikipedia.org/wiki/OCRopus) -- it's a Linux-based
command-line OCR tool. You should be able to incorporate this into a
workflow: it'll spit out what it thinks your PDF says in an HTML-ish format
(specified here: http://docs.google.com/View?docid=dfxcv4vc_67g844kf).
I could see a workflow on your end that creates four rotations of each page
scanned, then attempts to OCR them in each degree of rotation with OCRopus,
compares the results, and persists in your datastore the one with the
highest combination of recognized characters and recognition score. I
suppose this is only really helpful if a) your PDFs often get scanned
upside-down or sideways, and b) all your PDFs have some amount of digital
typography on them.
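A rough sketch of that rotate-and-compare loop, in Python. This is only an
illustration under assumptions: `rotate` and `ocr` are hypothetical callables
you would supply yourself (they are not OCRopus functions), `ocr` is assumed
to return the recognized text plus a confidence between 0.0 and 1.0, and the
ranking formula (alphanumeric character count times confidence) is just one
possible way to combine the two signals.

```python
from dataclasses import dataclass
from typing import Any, Callable, Sequence, Tuple


@dataclass
class OcrResult:
    angle: int         # rotation applied before OCR, in degrees
    text: str          # text the OCR engine recognized
    confidence: float  # engine's recognition score (assumed 0.0 to 1.0)


def rank(result: OcrResult) -> float:
    """Combine how much was recognized with how sure the engine was."""
    recognized = sum(1 for ch in result.text if ch.isalnum())
    return recognized * result.confidence


def best_rotation(
    page: Any,
    rotate: Callable[[Any, int], Any],          # hypothetical: returns rotated page
    ocr: Callable[[Any], Tuple[str, float]],    # hypothetical: returns (text, confidence)
    angles: Sequence[int] = (0, 90, 180, 270),
) -> OcrResult:
    """OCR the page at each rotation and return the highest-ranked result."""
    results = []
    for angle in angles:
        text, confidence = ocr(rotate(page, angle))
        results.append(OcrResult(angle, text, confidence))
    return max(results, key=rank)
```

The returned `OcrResult` carries the winning angle alongside the text, so the
datastore can record both the extracted text and how the page needs to be
turned.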
Anyway, a couple ideas.
-----
Luke Peterson