CRIMP Alert: A Compiled List of PDF Managing and Search Tools

Started by Dominik Holenstein on 1/23/2008
Dominik Holenstein 1/23/2008 7:53 am
I have done some brief web research on PDF managing and search tools.

You can find the compiled list here. I am sure this list is not complete:

Filehand
http://www.filehand.com/
It's free now.

DocuQuest
http://www.docuquest.com/Index.htm
Pricing reduced from US$ 200 to US$ 30

Archivarius
http://www.likasoft.com

Sleuthhound PDF Search
http://www.isleuthhound.com/sleuthhound/download_pdf_search.php

PDF Search Assistant
http://www.search-pdf.com/solutions.htm

Advanced PDF Manager
http://www.manage-pdf.com/
US$ 95

Search Inform Desktop
http://www.searchinform.com/search-site/en/main/full-text-search-products-searchinform-desktop.html

dtSearch
http://www.dtsearch.com/
Expensive: US$ 199

ISYS Desktop
http://www.isys-search.com/products/desktop/index.html

PDF Explorer
http://homepage.oniduo.pt/pdfe/pdfe.html
€ 60


Happy testing!

Dominik


Alexander Deliyannis 2/6/2008 4:27 pm
I can't say that I have tried all of the tools listed, but after trialing Archivarius I went ahead and registered it. It's very fast and works very well with Greek (as well as a multitude of other languages).

Archivarius will take quite a bit of your drive space because it stores text versions of the files it indexes. The benefit is that you can view its findings from within the program without opening the actual file (or even having access to it, as it may be on a disconnected network drive). Once you know what you want, double-clicking will open the original file.

Brilliant!
alx

Derek Cornish 2/7/2008 4:20 am
Dominik,

Thanks very much for these; I've been meaning to say that for some time.

It is very difficult to decide how best to handle pdf files - whether to leave them in their Windows folders and index and search them, or to keep them within specialized container programs (many of which don't do a very good job of searching them, but at least keep them tidily in one location).

Derek
Derek Cornish 2/7/2008 5:59 am


Alexander Deliyannis wrote:

Archivarius will take quite a bit of your drive space
because it stores text versions of the files it indexes. The benefit is that you can
view its findings from within the program without opening the actual file (or even
having access to it, as it may be on a disconnected network drive). Once you know what you
want, double-clicking will open the original file.

Brilliant!
alx


This is how Redtree's recently retired "Wilbur" freeware indexed search software [http://wilbur.redtree.com/ ] handles pdf files, too. The text versions of the pdf files are kept in a folder under "My Documents". I think this is probably how most programs using pdftotext.exe work, although I can't speak for Archivarius.
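For what it's worth, the general pattern described above - extract a text version of each PDF (e.g. via pdftotext), cache it, and index the cached text so hits can be viewed without opening the original - can be sketched in a few lines of Python. The file names and sample text below are made up for illustration; this is just the shape of the approach, not how Wilbur or Archivarius actually implement it.

```python
import re
from collections import defaultdict

# Hypothetical cached text versions, as pdftotext might produce them
# (in reality these would be files written to a folder under "My Documents").
text_cache = {
    "report.pdf": "annual sales report for the northern region",
    "manual.pdf": "user manual for the sales database",
}

# Build an inverted index: word -> set of source PDFs containing it.
index = defaultdict(set)
for pdf_name, text in text_cache.items():
    for word in re.findall(r"[a-z]+", text.lower()):
        index[word].add(pdf_name)

def search(word):
    """Return the PDFs whose cached text contains the word."""
    return sorted(index.get(word.lower(), set()))

print(search("sales"))   # -> ['manual.pdf', 'report.pdf']
print(search("manual"))  # -> ['manual.pdf']
```

The point of the cached text is visible here: a hit can be displayed straight from `text_cache` without ever touching the original PDF, which is why such tools still work when the source file is on a disconnected drive.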

Redtree is now developing a new freeware program, named Wilma, which is still in beta [http://wilma.redtree.com/en/help/index.html]. To quote the developer: "The current version makes use of the multiplatform capabilities of RealBasic to provide native desktop interfaces for Linux, Mac and Windows machines, while C++ is used for the core functionality."

Wilma does not use the Windows system registry. It doesn't work on WIN98 or ME (but Wilbur still works on WIN9x and XP). Wilma may work on Vista - I don't know. Since its developer's focus is on Linux, information about how to use pdftotext as one of its external analyzers on Windows tends to be rather sparse. Consequently I am still using Wilbur, which was designed for Windows.

Derek
Alexander Deliyannis 2/7/2008 8:33 am
Derek Cornish wrote:
It is very difficult to decide how best to handle pdf files - whether to leave them in
their Windows folders and index and search them, or to keep them within specialized
container programs (many of which don't do a very good job of searching them, but at
least keep them tidily in one location).


I have little difficulty in deciding myself. I have always opted for leaving them in (a few) permanent Windows folders and indexing/linking to them from database programs such as UltraRecall or whatever. The sheer size of the files is such that it would make no sense to include them _within_ a database file.

For me, PDFs fall into the greater category of media files, which includes image and audio files, as well as video, which I barely use. I have a folder called Library (with various subfolders) containing all my electronic versions of books and related references. My PDFs alone run to several hundred megabytes; my audiobooks to several gigabytes.

Apart from the size issue, I believe that files as such will be accessible for quite some time, whereas database programs come and go. Think of the time involved in importing such files to a database and then exporting them to one's next information manager.

That said, information is not knowledge. A library of references makes little sense unless one invests in slowly building one's comprehension of the ideas within that material, i.e. one's knowledge, whether visually (with mind maps etc.) or textually (with a classic outliner). For this I find many of the tools we discuss in this forum absolutely invaluable.

An indexing program complements the building of such an 'idea structure' by helping reference and support themes, once one knows what one is after. Personally, I was attracted to Archivarius by its support for an amazing multitude of file formats, as well as for my own working language, which is often unsupported by Anglo-Saxon-made/oriented software.

Cheers
alx


Dominik Holenstein 2/7/2008 9:46 am
Derek,

Thank you for your kind words!

I will go for the 'combined approach':
Storing PDFs and other files in the appropriate place in UltraRecall where necessary. Thanks to logical linking I can 'store', or rather 'link', the files to different places. This allows me to store the original PDF files in one folder on my USB drive.

After evaluating different tools for searching text in files, I decided to buy Archivarius 3000. I think it has the best price-to-functionality ratio.

Dominik


Derek Cornish 2/7/2008 6:13 pm
Dominik,

I have tended to use a combined approach, too, although I can't claim that the resulting way of working is as efficient as it probably could be. It looks more like a series of workarounds developed over time in order to cope with the limitations of the software I have used.

Yes, Archivarius 3000 looks like a good buy. If I had alx's language requirements, and didn't already use dtSearch (and Wilbur for some tasks), I'd probably choose it. As it is, I am tempted by its claims to index Zoot's zot files - but not quite enough - yet - to buy it.

Derek

Derek Cornish 2/8/2008 1:46 am
Alexander Deliyannis wrote:
I have little
difficulty in deciding myself. I have always opted for leaving them in (a few)
permanent Windows folders and indexing/linking to them from database programs such
as UltraRecall or whatever. The sheer size of the files is such that it would make no
sense to include them _within_ a database file.

The difficulty in deciding doesn't arise in connection with the notes/ideas manager I use. For example, I use Zoot as my ideas database and as a capture tool for text snippets extracted from the web and from files. Like you, I have tended to store pdf, doc, and htm files in organized Windows folders and link them to Zoot items. (This is in any case not a matter of choice as, unlike UltraRecall, Zoot cannot store these types of files, since it currently only accepts plain text. OTOH, using Zoot avoids the temptation of loading one's notes/ideas manager with wodges of irrelevant information.)

Storing the 'real' files in the Windows folder system also has the advantage of keeping them available for indexing and searching by any competent desktop search engine. Zoot databases can themselves be indexed and searched by the same software, although they first have to be converted into (large) htm files - unless one is using Archivarius, apparently. (Does Archivarius index and search UR files yet?)

For me, the difficulty in deciding arises at the point of downloading the pdf, htm, etc. files from the web, and this is where the question of whether to store these types of files in the Windows folder system or in dedicated web capture database software comes in. For a long time I used Net Snippets, which uses the Windows folder system to store its files and so allows desktop search engines to index and search them. But if you want to organize your downloaded files in more complex ways - for example, by using keywords or multiple categories - then you may have to look to dedicated web capturing tools like Surfulater or Web Research - even though their content may not be easily accessed by desktop search engines, nor easily linked to one's chosen information manager.

Web Research (WR) is currently my main web capture tool for certain purposes - e.g., for pdf, htm, doc, and image files connected with particular projects, and for files I am keeping for semi-permanent reference purposes (e.g., software specifications, manuals, and so on). I can hyperlink from WR to Zoot and vice versa, so from that point of view it works well. The downside is that my search engine of choice, dtSearch, can only index and search the htm files stored in WR. Maybe I could persuade the Archivarius developers to take a look at WR's database file format...

Given their potential drawbacks, why would one ever want to use dedicated web capture tools in preference to simply downloading files into Windows folders and linking to Zoot? I think there are a number of reasons: (1) quicker real-time saving and categorizing - or dumping first and classifying later; (2) easy re-organization of imported files via categories/keywords when necessary; (3) highlighting and metadata; (4) an intermediate store for files en route to the Windows folder system; (5) a useful place in which to browse through files.

Apart for the size issue, I
believe that files as such will be accesible for quite some time, whereas database
programs come and go. Think of the time involved in importing such files to a database
and then exporting them to one's next information manager.

I think this is a good argument for not downloading files into one's notes/ideas manager - where, in any case, they may just clog things up - but it is less valid in the case of web capture tools, as these have some value as both temporary and permanent storage sites for particular projects or purposes - and usually offer quite good bulk exporting features these days.

That said, information
is not knowledge. A library of references makes little sense unless one invests in
slowly building their comprehension of the ideas within that material, i.e. their
knowledge, whether visually (with mind maps etc) or textually (with a classic
outliner). For this I find many of the tools we discuss in this forum absolutely
invaluable.

Absolutely agree on this.


An indexing program complements the building of such an 'idea
structure' by helping reference and support themes, once one knows what they are
after. Personally, I was attracted to Archivarius by its support for an amazing
multitude of file formats, as well as for my own working language which is often
unsupported by anglosaxon made/oriented software.

Can't argue with that :-).

Derek



Ike Washington 2/8/2008 1:42 pm
Derek

I index my zoot databases using dtSearch directly, without converting them first into html files. Works okay - some garbage indexed too. Having searched within dtSearch, I launch the file containing my search term from there; the correct Zoot database opens; I search within it to locate exactly whatever I'm looking for.

Where to store data? I switched from Net Snippets to Scrapbook/Firefox. A real delight to use - html, pdf, txt, doc, jpg, gif. One of the main reasons why I use Firefox. And dtSearch indexes Scrapbook files perfectly, and Zoot links to them perfectly.

Ike

Derek Cornish wrote:
.... Zoot databases can themselves be indexed and searched by the
same software, although they first have to be converted into (large) htm files -
unless one is using Archivarius, apparently.

.... For a long time I used Net Snippets, which uses the Windows folder
system to store its files and so allows desktop search engines to index and search
them. But if you want to organize your downloaded files in more complex ways - for
example, by using keywords or multiple categories - then you may have to look to
dedicated web capturing tools like Surfulater or Web Research - even though their
content may not be easily accessed by desktop search engines, nor easily linked to
one's chosen information manager.

Web Research (WR) is currently my main web
capture tool for certain purposes - e.g., for pdf, htm, doc, and image files connected
with particular projects, and for files I am keeping for semi-permanent reference
purposes (e.g., software specifications, manuals, and so on). I can hyperlink from
WR to Zoot and vice versa, so from that point of view it works well. The downside is that
my search engine of choice, dtSearch, can only index and search the htm files stored in
WR. Maybe I could persuade the Archivarius developers to take a look at WR's database
file format...
Stephen Zeoli 2/8/2008 7:33 pm
I use OneNote to archive my PDFs. This is doable for me for two reasons: I don't have hundreds of PDF files to worry about, and most of the ones I do want to store are single pages (price quotes from printers, for instance). So OneNote is a great solution because I can drop a particular PDF into the appropriate notebook. ON stores the original and optionally includes a searchable printout of the PDF in the notebook. Very handy.

Steve z.
Derek Cornish 2/8/2008 9:01 pm


Ike Washington wrote:

I index my zoot databases using dtSearch directly, without converting them
first into html files. Works okay - some garbage indexed too. Having searched within
dtSearch, I launch the file containing my search term from there; the correct Zoot
database opens; I search within it to locate exactly whatever I'm looking
for.

Yes, I should have emphasized that, since Zoot uses plain text, its *.zot databases can be directly indexed, and the display is pretty good. In fact, if I'm just after a quotation, I often simply cut-and-paste what I want directly from the dtSearch display, without launching the file in Zoot. I do prefer the html display, however, although it is a bit of a pain to have to keep updating the exported file (and searching is slower, too, because the file is so large).


Where to store data? I switched from Net Snippets to Scrapbook/Firefox. A real
delight to use - html, pdf, txt, doc, jpg, gif. One of the main reasons why I use Firefox.
And dtSearch indexes scrapbook files perfectly, Zoot links,
perfectly.

Agreed, it is a great add-on and has some advantages over Net Snippets - especially the fact that it is still being developed. From my POV the two disadvantages are (i) no means yet - I think - of categorizing or keywording the files, although I'm sure that will change; (ii) although the imported files are stored in the Windows filing system, the names under which they were originally saved are replaced by numbers - unlike the case with Net Snippets. This means that they tend to be harder to find, consult, and browse through outside Scrapbook. (It also raises problems if one wants to sync the Scrapbook folders with Zoot - something one can do with Net Snippets ones.)

I wouldn't want to exaggerate the importance of these issues, especially as using Web Research entails a whole lot of other compromises. But it means that for me there are projects where Web Research is a better fit for the things I want to do. I tend to use it as a combination of reference library and project organizer for downloaded files. OTOH, I tend to use Scrapbook as a temporary holder of miscellaneous files.

Derek

Ike Washington 2/10/2008 4:12 pm
Derek Cornish wrote:

From my POV the two
disadvantages are (i) no means yet - I think - of categorizing or keywording the files,
although I'm sure that will change; (ii) although the imported files are stored in the
windows filing system the names under which they were originally saved are replaced
by numbers - unlike the case for Net Snippets....

I don't worry too much about keywording scraps, since I tend to use Scrapbook to store the full text of the pdf or html file. Every word stored there gets indexed by dtSearch. I sometimes add an annotation, say a unique project title - this gets indexed too.

Yes, I'm not too keen either on Scrapbook imposing time stamps as Windows folder titles.

Still, quite amazing for a free application.

Anyone interested in finding out more about Scrapbook should read its manual: http://amb.vis.ne.jp/mozilla/scrapbook/ - scroll down the page to the pdf tutorial link. "Tutorial" sells it short. Thinking I didn't need a scrapbook tutorial, I didn't bother with it until recently. But it's full of best practice gems. In particular, check out the section towards the end - "Using ScrapBook in web-based research".

I wouldn't want to exaggerate the importance of these issues, especially as
using Web Research entails a whole lot of other compromises. But it means that for me
there are projects where Web Research is a better fit for the things I want to do.

I remember Web Research as being pretty good. But, as I remember, it didn't allow the folder view to be filtered. This became a problem for me, since I had created a complicated system that tried to cover all aspects of the research task.

Yes, I think it's a good idea to use applications for particular purposes: in my case, Scrapbook as a reference library, Zoot as a project and clips organizer, local wikipedia, among others, for long-term notes.

Perhaps I can find a specific use for Web Research. Always a crimper...

(Not the greatest name in a googleverse full of "web research" software - http://www.macropool.com/en/products/webresearch/professional/index.html takes you to Web Research. Perhaps Macropool should have stuck with "ContentSaver"?)

Ike
Derek Cornish 2/10/2008 7:07 pm

Ike Washington wrote:
I don't worry too much about keywording scraps, since I
tend to use Scrapbook to store the full text of the pdf or html file. Every word stored
there gets indexed by dtSearch. I sometimes add an annotation, say a unique project
title - this gets indexed too.

I like to use WR's categories as a way of classifying/keywording my imported files in multiple ways. This gets over the limitations of the Windows filing system or WR's tree - in both of which cases, one can only store the file in question in one place at a time.
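The many-to-many relationship described above - one file, several categories, versus a folder tree that forces a single location - can be pictured as two maps kept in sync. A toy sketch in Python (the file and category names are made up purely for illustration; this is not how WR stores anything):

```python
from collections import defaultdict

# A file sits in exactly one folder, but can carry many categories.
file_categories = defaultdict(set)   # file -> its categories
category_files = defaultdict(set)    # category -> its files (reverse map)

def categorize(filename, *categories):
    """Attach one or more categories to a file, keeping both maps in sync."""
    for cat in categories:
        file_categories[filename].add(cat)
        category_files[cat].add(filename)

# The same PDF can now be found under several headings at once:
categorize("specs.pdf", "manuals", "project-x")
categorize("quote.pdf", "project-x")

print(sorted(category_files["project-x"]))   # -> ['quote.pdf', 'specs.pdf']
print(sorted(file_categories["specs.pdf"]))  # -> ['manuals', 'project-x']
```

This is what keyword/category schemes buy over a plain folder hierarchy: `specs.pdf` is stored once but appears under both "manuals" and "project-x".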

I remember Web Research as being pretty good. But, as I remember, it
didn't allow the folder view to be filtered. This became a problem for me since I
created a complicated system which tried to cover all aspects of the research
task.

That's right, it doesn't. But you can always hide the folder tree, which only provides a basic way of filing content, and just work with categories via the "categories" view. I don't know if this would allow you to achieve what you want, but it is an option.

Perhaps Macropool should have stuck with "ContentSaver"?

Yes, using common names - "Brainstorm", "Keynote" etc. - does create problems when trying to search for them on the web. Perhaps a good reason after all for names like "Infohesive"? :-)

Derek
Ike Washington 2/23/2008 5:45 pm
Derek Cornish wrote:
I like to use WR's categories as a way of
classifying/keywording my imported files in multiple ways. This gets over the
limitations of the Windows filing system or WR's tree - in both of which cases, one can
only store the file in question in one place at a time. ...
... you
can always hide the folder tree, which only provides a basic way of filing content, and
just work with categories via the "categories" view. I don't know if this would allow
you to achieve what you want, but it is an option.

Yes, I think WR beats Scrapbook on this point. Having to decide which folder a scrap should go into can become tedious, as some of my multi-scrapbooks are quite complex - folders inside folders inside folders - and because I've become accustomed to Zoot/EverNote and the idea that a folder is notional, really just a tag. However, I spend much of my time in Firefox; I don't want yet another window to worry about. Choices... choices.

Ike
Susanne 2/24/2008 8:48 am
Ike Washington wrote:
.... However, I spend much of my time in Firefox; I don't want
yet another window to have to worry about. Choices... choices.

I too spend much (too much) time in Firefox, so yes, WR is another application. But one of the things that I really like about it is that when capturing information, like Scrapbook, it allows me to decide _while capturing_ where to put the page or clip, and to assign any categories. So there is no need to "switch windows" to classify the information right away.
Much as I appreciate other things about Surfulater, this ability to decide where to put the information while I am in the capture process has been decisive for me (maybe the new Surfulater version, which promises to add tags, will change that).
And yes, I am aware of the fact that UltraRecall will at least allow you to decide what folder or item to assign a new clipping to as well - and, while I really like the feature-rich UltraRecall (and I do use it, if only for limited purposes), getting information into UR is just t o o  s l o w. Loading a web page may take 10 seconds in WR or Surfulater - the same page takes up to 45 seconds in UR, and the first page of the day can take up to 1.5 minutes! Same notebook, same day. That is just prohibitive.

So, thanks Ike - now you've got me posting my 2 cents, and I've added one woman (white, European) to the forum demographics ;-)
Derek Cornish 2/26/2008 6:36 pm


Susanne wrote:
And yes, I am aware of the fact that UltraRecall will at least
allow you to decide what folder or item to assign a new clipping to as well - and, while I really like
the feature-rich UltraRecall (and I do use it, if only for limited purposes), getting
information into UR is just t o o  s l o w. Loading a web page may take 10 seconds in WR or
Surfulater - the same page takes up to 45 seconds in UR, and the first page of the day can take
up to 1.5 minutes! Same notebook, same day. That is just prohibitive.


Thanks for cutting to the chase on the issues of speed and flexibility. (Beats my earlier stream-of-consciousness ramblings.)

There is no doubt that Surfulater is catching up with WR. Not there yet, though.

Incidentally, and talking about speed, I've complained from time to time here about the slowness of WR's start-up from cold, whether using the browser plug-ins or firing up the main program. Recently I realised to my embarrassment, however, that this was largely because I was loading five enormous archives - the product of years of using ContentSaver/WR - every time I ran the program. Simply unloading the ones I don't use often (the equivalent of NOT loading every Zoot database I have when I run Zoot) cured the problem.

I mention this because it's so easy to make hasty judgements about software when the problem is a user one :-).

Derek
Susanne 2/26/2008 7:45 pm
Derek Cornish wrote:
Thanks for cutting to the chase on the issues of speed and flexibility.

Thank YOU, Derek, for believing me. After reading a number of seemingly slightly irritated replies on the UR forum to complaints about speed (it must be the PC, the configuration, wrong commands... whatever), I don't even want to try to bring it up there; I just look for alternative products. And that is too bad, because UR really is a very powerful product... oh well,

BTW: now that I have started posting here, I hope I know when to stop/shut up again ;-)

Susanne
Derek Cornish 2/27/2008 8:36 pm
Susanne wrote:
BTW: now that I have
started posting here, I hope I know when to stop/shut up again ;-)

Susanne

It is quite addictive, isn't it? No stop rules needed (or heeded) here. People just run out of energy or ideas, and threads naturally tail off into silence...:-)

Derek
Ken Ashworth 2/27/2008 9:55 pm


Susanne wrote:
... After reading a number of
seemingly slightly irritated replies on the UR forum to complaints about speed (it
must be the PC, the configuration, wrong commands... whatever), I don't even want to try
to bring it up there; I just look for alternative products. And that is too bad,
because UR really is a very powerful product... oh well,

Yes, it is difficult to understand where this slowdown with UR web capture comes from. I think part of it stems from UR having to download the entire page to a temp location before sending it into UR proper, but even with UR configured to use the IE cache, any speed increase is barely perceptible.

Even with everything sitting in the cache, after just loading the page in your browser, it's as if UR needs to reload the page just to be sure that something hasn't changed in the few seconds between the original loading of the page and the decision to send it to UR.

There's something in the UR web capture workflow that needs addressing but I'll be danged if I know what it is.

'Course, on the other hand, I have experienced some pretty speedy captures, but these are mostly from pages that are predominantly text.
Thomas 2/27/2008 11:24 pm
Speed-wise, UR was generally quick enough for me (I don't remember if there were exceptions). However, occasionally it just didn't import the page at all. At other times, it imported into a different database from the one I had currently active (I keep several databases open, but it is supposed to import into the active one). All of that works now that I have stopped using the Database toolbar and loading multiple databases, though.
Alexander Deliyannis 2/29/2008 8:17 pm
Ken Ashworth wrote:
Even with everything sitting in the cache, after just loading the
page in your browser, it's as if UR needs to reload the page just to be sure that
something hasn't changed in the few seconds between the original loading of the page
and the decision to send it to UR.

As far as I know, UR indeed 'reloads' the whole page, for the simple reason that the only information it gets from the browser is the address. Contrast this with Surfulater, which gets all the page content from the browser itself.

There may be an alternative, though I personally have not tried it: to use UR itself as the internet browser. In that case, when one wants to capture a page, it would already be loaded in UR. I am unaware of the practicalities of such an approach, though I'd say that a browser provides an environment optimised for, well, browsing, whereas UR's browser window does not.

That's why I appreciate the browser integration of Surfulater (and other similar products), which provides a seamless way to focus on the browsing experience while clipping all the information one considers useful along the way.

alx

Dr Andus 3/8/2013 9:22 am
Debenu PDF Maximus is free today. I got it last time and found it overkill for my needs, but at 100% off I don't mind keeping it in the bag, just in case...

http://www.bitsdujour.com/software/debenu-pdf-maximus
MenAgerie 3/9/2013 1:06 pm
Trouble is, you have to give Rupert Murdoch an email address, and a bit of trade... that really rankles.