Large databases in UltraRecall? · Outliner Software

Chris Thompson 8/28/2010 12:38 am

Has anyone had any experience with storing large (10,000+) collections of documents in UR? Most of the documents would be PDF documents. Some of the posts on the UR forum seem to suggest that the database file can become unwieldy if UR is allowed to index the text of PDF documents, but it seems to me that turning this off would be a fairly big compromise.

(Normally I'd use DevonThink for this type of thing, but my current project has to use a Windows tool.)

-- Chris

quant 8/28/2010 7:46 pm

I don't know the answer to your question, as I try to minimize the amount of "not important" info in my UR databases. The reason is that UR won't come even close to what dedicated search/indexing software can provide. On the other hand, if I only need to link the files to some other items, add tags, etc., there's no need to index the files themselves.

Can I ask why you want to put so many files into an (indexed) database?

Thomas 8/28/2010 10:14 pm

I remember I had some problems getting my PDF files indexed in UltraRecall. That was long ago, and I believe it wasn't solved at that time; Kinook was using a third-party engine for that. But if you are going to rely on the indexing, I'd recommend testing it with a few large PDF files to check whether it indexes all of them.

Like quant, I also moved my PDFs out of the database and got them indexed by Archivarius 3000, but I only needed indexing, not the extra features UR provides, so that made the move easy.

Alexander Deliyannis 8/29/2010 10:16 am

Is there a reason other than indexing, for actually importing (embedding) the files into UltraRecall? If not, I would opt for Archivarius too, as I have done.

As I understand:

- UltraRecall only indexes PDFs that are imported; if they are only linked to as external files, they are not indexed. I may be wrong. If I am, then there should be no trouble in terms of the database size.

- If files are imported, the database file becomes huge; this is true not just of UltraRecall, but of any other program. The size of the database will probably be significantly greater than the sum of the sizes of the individual files. With 10,000 files, even if they are simple text with limited graphics, you are bound to run into performance problems, even solely from the database file's fragmentation on the hard disk.

quant 8/29/2010 1:44 pm

Yes, there is a reason: to have the files with you in your database, so that when you move it to, say, your USB key, you know you're going to have access to all the files that you imported into it.

Also, files are indexed not according to whether they are linked or stored, but according to whether the given file extension is set to be indexed.

Jon Polish 8/30/2010 12:54 pm

I don't know if this helps, but I have no trouble with my almost 8GB database. It has 12,364 items, with a large percentage of these items PDF files. Most are stored in UR, but some are linked. All non-image pdf files are indexed (I differentiate because pdf's that are scans (no OCR performed) are not indexable). I do not have performance problems other than importing large amounts of data all at once. I have experienced slow import on small and large databases, so I don't think size is a factor.

Jon

Chris Thompson 8/30/2010 3:14 pm

Thanks for the feedback. It sounds like it's doable, though perhaps with a very large database file.

To answer Quant's question, I'm doing some consulting for an organization that has a large, baroque paper filing system (more than 250,000 documents, though I'm primarily concerned with a smaller core set). In order to handle what PIM users would consider "cloning", they would copy a document and file it in multiple places, sometimes five or six (occasionally a dozen) places. The locations of those "clones" are themselves metadata, because someone some time ago had inspected the document and found it to be relevant to the given categories. This system was effective at storage, but it's inefficient to extract meaning from the document set. One can browse files and end up looking at documents multiple times in multiple locations. Thus, search is essential, but I'd like a PIM that supports cloning to mirror the underlying physical filing system.

-- Chris

Alexander Deliyannis 8/30/2010 6:59 pm

Jon Polish wrote:

All non-image pdf files are indexed (I differentiate because pdf's that are scans (no OCR performed) are not indexable).

A very important point.

For the record, Archivarius doesn't index image PDFs either. However, the Evernote Premium service does, as does OneNote (which I don't use, so I can't comment on its performance).

Regarding Chris' cloning approach, I assume an alternative is to use keywords/tags to replicate the folder structure, as entries can be tagged under multiple categories.
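To make the tag idea concrete, here is a minimal sketch in Python. It is not tied to UltraRecall or any other tool mentioned in this thread; the document IDs, folder names, and the `build_tag_index` and `browse` helpers are all hypothetical, made up for illustration. It treats each physical filing location as a tag, then lets you browse several locations while seeing each document only once.

```python
from collections import defaultdict

# Hypothetical sample data: document id -> every folder where a paper copy was filed.
filings = {
    "doc-001": ["Contracts/2009", "Vendors/Acme"],
    "doc-002": ["Vendors/Acme"],
    "doc-003": ["Contracts/2009", "Audits/2010", "Vendors/Acme"],
}


def build_tag_index(filings):
    """Invert document -> locations into tag -> set of document ids."""
    index = defaultdict(set)
    for doc_id, locations in filings.items():
        for location in locations:
            index[location].add(doc_id)
    return index


def browse(index, tags):
    """Return each matching document once, even if it carries several of the tags."""
    seen = set()
    for tag in tags:
        # .get() avoids creating empty entries for unknown tags.
        seen |= index.get(tag, set())
    return sorted(seen)


index = build_tag_index(filings)
print(browse(index, ["Contracts/2009", "Vendors/Acme"]))
```

The filing locations survive as metadata on each document, but a document filed in six places still shows up only once when you browse across those places.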

Chris Thompson 8/30/2010 10:13 pm

Unfortunately, OneNote doesn't index PDF attachments. If you place a PDF on a page, it *will* index the bitmap page images that it creates, but then there's no way to actually get the original PDF out again. This is probably my biggest frustration with OneNote.

-- Chris
