PDF cataloging

Started by Graham Rhind on 8/18/2016
Graham Rhind 8/18/2016 8:06 am
Hi all,

We have a large number of (mainly) large PDF files containing a lot of disparate information. Currently nobody is reading them, and nobody knows where to look in them for important information.

We're looking for a system where we can easily catalogue what is in each PDF, search both the index and the PDFs, and (ideally) go directly to the part of the relevant PDF that contains the information found in the catalogue. So, a sort of library system that's easy to use. It needs to be centrally accessible for a team, based on HTML (locally) or cross-platform, and trivially priced.

Have you good people any ideas on this one?

Thanks,

Graham
Slartibartfarst 8/18/2016 9:35 am
Well, quite some time ago, one of my clients had a similar problem and we addressed it satisfactorily with Qiqqa: http://www.outlinersoftware.com/topics/reply/6596/25920

Hard to beat.
Graham Rhind 8/18/2016 12:18 pm
That link doesn't take me to a topic thread, but I'll check qiqqa out - thanks.
MadaboutDana 8/18/2016 4:33 pm
The cheapest option would be to:
- set up a network file share
- store your PDFs in a carefully structured folder hierarchy (the "index")
- use Adobe Reader to carry out Advanced Searches either on the entire folder tree, or on individual folders (and subfolders if you wish)

The cost of that approach would be minimal, while giving you direct access to specific sentences containing the stuff you're looking for (Adobe Reader's Advanced Search is very powerful, providing filename, context, search term highlighting etc.).

If you wanted to speed it up, you could invest in a full version of Adobe Acrobat and generate proper indexes for the PDFs. Adobe Reader can (I believe; at least, it used to be able to) then use this index to run Advanced Searches.

An alternative would be to set up a Mac Mini with DEVONthink on it, and use the latter's web server to publish your PDFs to your local network. But the web server's search function is rudimentary compared to the desktop version of DEVONthink.
Graham Rhind 8/18/2016 5:01 pm
Thanks Madaboutdana.

What I don't see Reader helping with (I haven't looked at Qiqqa yet) is the metadata issue.

If I have a dozen pdfs each with 1000 pages, results of searches for almost anything are going to be full of noise. I want a solution that starts off being useful enough for people to use. So I want the pdfs themselves to be accompanied by a searchable metadata catalogue.

Let's say there's an annual sales document and between pages 230 and 240 there's a section related to widget sales in Antarctica in 1989. I would like that information to be stored in a catalogue which would then allow people to search, for example, for widget sales figures for Antarctica but not South Africa in 1989, and have this bring up the "index card" with a link to that section of the PDF, wherever it may be. If I were just to search all the PDFs for "Antarctica and 1989" there would probably be too many results, each of which would need checking to ascertain its relevance, and that would put people off using the system.

This is basic database stuff which I can do, but only if each pdf gets read (or at least skimmed) and the relevant information entered into a catalogue.

I wondered if there was an easier way.
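
For illustration, the "index card" catalogue described above could be sketched as a small SQLite database. This is only a rough sketch under assumptions, not a recommendation of any product mentioned in this thread: the table layout, the function names and the file name sales-1989.pdf are all made up, and the link relies on the "#page=N" URL fragment that many browsers and PDF viewers honour when opening a PDF.

```python
import sqlite3


def open_catalogue(path=":memory:"):
    """Open (or create) a hypothetical catalogue database."""
    conn = sqlite3.connect(path)
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS cards (
            id         INTEGER PRIMARY KEY,
            pdf        TEXT    NOT NULL,  -- file name or URL of the PDF
            page_start INTEGER NOT NULL,
            page_end   INTEGER NOT NULL,
            summary    TEXT    NOT NULL,
            year       INTEGER
        );
        CREATE TABLE IF NOT EXISTS card_keywords (
            card_id INTEGER NOT NULL REFERENCES cards(id),
            keyword TEXT    NOT NULL      -- stored in lower case
        );
        """
    )
    return conn


def add_card(conn, pdf, page_start, page_end, summary, keywords, year=None):
    """Store one index card plus its keywords."""
    cur = conn.execute(
        "INSERT INTO cards (pdf, page_start, page_end, summary, year) "
        "VALUES (?, ?, ?, ?, ?)",
        (pdf, page_start, page_end, summary, year),
    )
    conn.executemany(
        "INSERT INTO card_keywords (card_id, keyword) VALUES (?, ?)",
        [(cur.lastrowid, k.lower()) for k in keywords],
    )
    conn.commit()


def search_cards(conn, include, exclude=(), year=None):
    """Return (summary, link) for cards that have every keyword in
    `include`, none of the keywords in `exclude`, and the given year."""
    has_kw = (
        "EXISTS (SELECT 1 FROM card_keywords k "
        "WHERE k.card_id = c.id AND k.keyword = ?)"
    )
    sql = "SELECT c.summary, c.pdf, c.page_start FROM cards c WHERE 1=1"
    params = []
    for word in include:
        sql += " AND " + has_kw
        params.append(word.lower())
    for word in exclude:
        sql += " AND NOT " + has_kw
        params.append(word.lower())
    if year is not None:
        sql += " AND c.year = ?"
        params.append(year)
    sql += " ORDER BY c.pdf, c.page_start"
    # "#page=N" asks the viewer to open the PDF at the card's first page.
    return [
        (summary, f"{pdf}#page={start}")
        for summary, pdf, start in conn.execute(sql, params)
    ]


# Hypothetical usage: Antarctica but not South Africa, 1989.
conn = open_catalogue()
add_card(conn, "sales-1989.pdf", 230, 240, "Widget sales in Antarctica",
         ["widgets", "sales", "Antarctica"], year=1989)
results = search_cards(conn, ["widgets", "antarctica"],
                       exclude=["south africa"], year=1989)
```

The cards still have to be written by someone who has read or skimmed each PDF; the sketch only covers storing and querying them.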
CRC 8/18/2016 6:21 pm
Acrobat Pro can build a fully searchable full text index for a group of PDFs. It does a good job and the searches are fast and complete.
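
To give a rough idea of what a full-text index over a group of PDFs does, here is a minimal sketch of an inverted index. It is not how Acrobat builds or stores its index; it assumes the text of each page has already been extracted by some other tool, and the names build_index and search_pages are made up for the example.

```python
import re
from collections import defaultdict


def build_index(pages):
    """pages: iterable of (pdf_name, page_number, page_text).
    Returns a mapping from each word to the set of (pdf, page) it occurs on."""
    index = defaultdict(set)
    for pdf, page, text in pages:
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add((pdf, page))
    return index


def search_pages(index, query):
    """Return the sorted (pdf, page) pairs that contain every query word."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return []
    hits = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(hits)
```

Because the index is built once, each search is a few set lookups rather than a scan of every file, which is why indexed searches are fast.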
Dr Andus 8/18/2016 9:06 pm
You could check whether any of these academic software packages could be adapted for your needs:

"Comparison of Docear with Zotero and Mendeley"

http://www.docear.org/software/details/#Comparison_with_Zotero_and_Mendeley
Graham Rhind 8/19/2016 12:55 pm
Thanks Dr Andus.

It looks like Mendeley and Zotero just allow search across documents, which is not ideal, as explained previously. Docear looks interesting - that's much more what I'm thinking of... but it doesn't allow collaboration.
dan7000 8/20/2016 12:34 am
I think qiqqa does what you want. I used it on a project for a while. You can add an annotation to either a document or to a specific location in a document. Then you can search those annotations. You can also tag both documents and locations in documents if you want to use tags instead of free text annotations.

I used a qiqqa "local library" for security reasons - I could not put the material on the cloud. Unfortunately, the software was painfully, painfully slow for me. It might have been my machine or it might have been the local database. I would be interested to know if it performs better with a cloud-based library.
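
The annotation-and-tag approach described above can be pictured with a small data model. This is a hedged sketch only: it is not Qiqqa's data model or API, and the names Note and find_notes are invented for the example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Note:
    """An annotation attached to a whole PDF or to one page in it."""
    pdf: str
    text: str = ""
    tags: set = field(default_factory=set)
    page: Optional[int] = None  # None means the note is about the whole document


def find_notes(notes, text=None, tag=None):
    """Return notes whose text contains `text` and/or that carry `tag`
    (both comparisons ignore case)."""
    found = []
    for note in notes:
        if text is not None and text.lower() not in note.text.lower():
            continue
        if tag is not None and tag.lower() not in {t.lower() for t in note.tags}:
            continue
        found.append(note)
    return found
```

Searching then runs over the notes rather than over the full text of the PDFs, so the results are only as noisy as the annotations people have written.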
