Re: [LAU] How to handle large amount of pdf files (music sheets) for collaborative usage

3 Apr 2010

PDFsam -- PDF Split-And-Merge -- is a handy open-source tool
(http://www.pdfsam.org/).
Its title is its feature set, for the most part. It allows you to reorder
PDFs, pull pages out, add pages in, rotate pages 90, 180, 270 degrees, etc.
It's command-line driven, but there's also a GUI console.
It's got a Windows installer, but should run anywhere Java is available.
Sounds like on top of the scanning and organizing solution, you need to
figure out some OCR application to extract metadata from each of the PDFs in
a large-scale way.
If you're planning to put these out for public consumption, you can use
Google to assist you in your scanning and indexing:
http://www.labnol.org/software/convert-scanned-pdf-images-to-text-with-goog…
Alternatively, the open-source OCR world is getting better fast. Check out
OCRopus (http://en.wikipedia.org/wiki/OCRopus) -- it's a Linux-based
command-line OCR tool. You should be able to incorporate this into a
workflow; it'll spit out what it thinks your PDF says in an HTML-like format
(specified here: http://docs.google.com/View?docid=dfxcv4vc_67g844kf).
I could see a workflow on your end that creates four rotations of each page
scanned, then attempts to OCR them in each degree of rotation with OCRopus,
compares the results, and persists in your datastore the one with the
highest combination of recognized characters and recognition score. I
suppose this is only really helpful if a) your PDFs often get scanned
upside-down or sideways, and b) all your PDFs have some amount of digital
typography on them.
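
A minimal sketch of that selection step, in Python. It doesn't call OCRopus
itself: the `ocr` argument is a hypothetical callable you'd write around
whatever OCR tool you use, assumed to return the recognized text plus a
confidence between 0 and 1 for a page at a given rotation. Multiplying the
count of recognized letters/digits by that confidence is just one possible
way to "combine" the two; the names `OcrResult`, `score`, `best_rotation` and
`fake_ocr` are made up for illustration.

```python
from typing import Callable, NamedTuple, Sequence, Tuple


class OcrResult(NamedTuple):
    rotation: int      # degrees the page was rotated before OCR
    text: str          # what the OCR engine thinks the page says
    confidence: float  # engine-reported score, assumed to be in [0, 1]


def score(result: OcrResult) -> float:
    """Combine recognized-character count with the recognition score."""
    recognized = sum(1 for ch in result.text if ch.isalnum())
    return recognized * result.confidence


def best_rotation(
    page: object,
    ocr: Callable[[object, int], Tuple[str, float]],
    rotations: Sequence[int] = (0, 90, 180, 270),
) -> OcrResult:
    """OCR the page at each rotation and keep the highest-scoring result."""
    results = [OcrResult(r, *ocr(page, r)) for r in rotations]
    return max(results, key=score)


def fake_ocr(page: object, rotation: int) -> Tuple[str, float]:
    """Stand-in for a real OCR call: this page was scanned upside-down."""
    samples = {
        0: ("~~ .", 0.2),
        90: ("", 0.0),
        180: ("Sonata No. 14", 0.9),
        270: ("l|l", 0.3),
    }
    return samples[rotation]


best = best_rotation("page-001", fake_ocr)
print(best.rotation, best.text)
```

The winning `OcrResult` is what you'd persist in your datastore, with
`rotation` telling you how to turn the page before saving it back.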
Anyway, a couple ideas.
-----
Luke Peterson
On Sat, Apr 3, 2010 at 11:56 AM, Nils Hammerfest <list(a)nilsgey.de> wrote:
...
 Get a quality enterprise content management system. There is a
 full-blown FOSS one called Alfresco that you could try out and see how
 it goes:
 http://www.alfresco.com/products/networks/compare/

 Thanks for your information. I will have a look at this.

  > 2) Do you know any free PDF editor besides "PDFEdit" (which seemed
  > fine from screenshots and descriptions, but first tries were not
  > successful)

 Sorry, have no clue there, the usual advice about editing PDFs is to
 edit the SOURCE document and regenerate the PDF. PDFs really aren't
 intended to be edited - text editing even in Adobe Acrobat is tedious.

 The problem here is that those PDFs are generated by scans, so in fact we
 are dealing mostly with images of notation. The main tasks are to
 add/delete pages, rotate single pages or "all odd-numbered" pages etc., or
 crop borders for a section, a selection, all pages etc., and in the end save
 them again. I think with some scripts it should be possible to do the
 decompress/compress round trip from pdf->images->pdf, but this still leaves
 the point of page-wise editing. I don't know for sure, but I think gimp or
 inkscape cannot do this.
 Nils
 _______________________________________________
 Linux-audio-user mailing list
 Linux-audio-user(a)lists.linuxaudio.org
 http://lists.linuxaudio.org/listinfo/linux-audio-user
