[LAU] How to handle large amount of pdf files (music sheets) for collaborative usage

Daniel Appelt daniel.appelt at gmail.com
Sat Apr 3 18:37:39 EDT 2010


On the topic of OCR, you could also check out specialized Optical
Music Recognition systems
(http://en.wikipedia.org/wiki/Optical_music_recognition). In the
university department where I wrote my diploma thesis, Gamera
(http://gamera.informatik.hsnr.de/) was used for this task.

Cheers, Daniel

2010/4/3 Luke Peterson <luke.peterson at gmail.com>:
> PDFsam -- PDF Split-And-Merge is a handy open-source tool.
> (http://www.pdfsam.org/)
>
> But its title is its feature set, for the most part. It allows you to reorder
> PDFs, pull pages out, add pages in, rotate pages 90, 180, 270 degrees, etc.
> It is command-line driven, but there's also a GUI console.
>
> It's got a Windows installer, but should run anywhere Java is available.
>
> Sounds like on top of the scanning and organizing solution, you need to
> figure out some OCR application to extract metadata from each of the PDFs in
> a large-scale way.
>
> If you're planning to put these out for public consumption, you can use
> Google to assist you in your scanning and indexing:
>
> http://www.labnol.org/software/convert-scanned-pdf-images-to-text-with-google-ocr/5158/
>
> Alternatively, the open-source OCR world is getting better fast. Check out
> OCRopus (http://en.wikipedia.org/wiki/OCRopus) -- it's a Linux-based
> command-line OCR tool. You should be able to incorporate this into a
> workflow; it'll spit out what it thinks your PDF says in an HTML-like format
> (specified here: http://docs.google.com/View?docid=dfxcv4vc_67g844kf).
>
> I could see a workflow on your end that creates four rotations of each page
> scanned, then attempts to OCR them in each degree of rotation with OCRopus,
> compares the results, and persists in your datastore the one with the
> highest combination of recognized characters and recognition score. I
> suppose this is only really helpful if a) your PDFs often get scanned
> upside-down or sideways, and b) all your PDFs have some amount of digital
> typography on them.
>
> Anyway, a couple ideas.
>
> -----
> Luke Peterson
>
>
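
The four-rotation workflow quoted above could be sketched roughly like
this. It is only an illustration under stated assumptions: `rotate` and
`ocr` are hypothetical callables you would have to supply yourself (for
example, thin wrappers around your image tool and your OCR engine), and
they are not OCRopus functions. "Combination of recognized characters and
recognition score" is interpreted here as the alphanumeric character count
multiplied by a confidence value between 0 and 1, which is just one
possible choice.

```python
from dataclasses import dataclass
from typing import Any, Callable, Sequence, Tuple


@dataclass
class OcrResult:
    """Hypothetical container for what an OCR wrapper returns."""
    text: str
    confidence: float  # assumed to be in the range 0.0 .. 1.0


def combined_score(result: OcrResult) -> float:
    # Assumption: weight the number of recognized letters/digits
    # by the engine's confidence for the page.
    recognized = sum(1 for ch in result.text if ch.isalnum())
    return recognized * result.confidence


def best_rotation(
    page: Any,
    rotate: Callable[[Any, int], Any],   # hypothetical: (page, degrees) -> page
    ocr: Callable[[Any], OcrResult],     # hypothetical: page -> OcrResult
    angles: Sequence[int] = (0, 90, 180, 270),
) -> Tuple[int, OcrResult]:
    """OCR the page at each angle and return the best (angle, result)."""
    candidates = [(angle, ocr(rotate(page, angle))) for angle in angles]
    return max(candidates, key=lambda pair: combined_score(pair[1]))
```

The returned angle and text are what you would then store alongside the
PDF. Pages with no printed text at all (pure notation) would score about
the same in every rotation, so a threshold on the winning score may be
worth adding.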

