Getting it OUT!

Posted by ureadit on 7/9/2004

ureadit 7/9/2004 1:08 am

Aperiodically someone points out that we tend to dump info into a PIM, may not know quite what we dumped (or later forget) and then may have a devil of a time retrieving the info we need, or info that would be useful but we haven't realized that we need it. We might not even know what to ask for.

Some have suggested that we create links among the information, including ones based on concepts. This can take a long time and might not include what we need at some future time.

We need something that can automatically create links and find info that is likely to be useful to us. Programs with boolean and fuzzy and "this term near that term" search functions help, as do programs like Zoot.

I would like to propose that this thread be continued with suggestions for functions/capabilities that a perfect information storage and retrieval program should have. Not a program for writing... just for getting useful information in and out, including getting info out that we might not know or remember is in there.

-sc

zeoli 7/9/2004 6:52 pm

Finding the information I want is always the most frustrating aspect of all these programs. That's mostly because I'm an idiot half the time, forgetting the correct keyword, or which topic I filed the information under in the first place. It's also partially due to the fact that I get impatient. Truly, the only system that I would be satisfied with is to think about the information I want and have it appear... I guess that's what my own brain is for, but it has as many bugs in it as InfoRecall.

Steve Z.

thompson.chris 7/10/2004 12:42 am

This is a topic I find quite interesting. My own belief is that any good system needs to be able to easily re-categorize data after you enter it, perhaps weeks or months later, rather than just at the time when you enter it. It seems that most major applications are moving in this direction. As an example, consider email. For the longest time, email programs have let you file away emails into folders, but this was always a pain in the neck if you later needed a group of emails gathered together using a different organization scheme that you hadn't thought of before. Nowadays most email programs are moving away from the folder concept and towards "virtual folders." If you need a specific grouping of emails, perhaps related to one specific project at work, you set up a query and all the relevant emails appear in a virtual folder, based not on you directly putting them there but based on the result of a search. The next generation of Mac OS X and Windows Longhorn will both support this concept at the operating system level, e.g. you can set up a folder in the OS asking for documents related to the Jorgensen account and everything relevant on your computer just "appears" in this folder: Word documents, emails, address book entries, etc. (I'm not entirely certain that Longhorn will do this for anything other than files, but the recent OS X Tiger demo definitely did do this.)
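
As a rough illustration of the virtual-folder idea, here is a minimal Python sketch. The `Item` and `VirtualFolder` names and the sample data are invented for this example and are not taken from any real mail program or operating system. The point is only that the folder stores a query, not the items themselves.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Item:
    kind: str   # e.g. "email", "document", "contact" (hypothetical types)
    title: str
    text: str


@dataclass
class VirtualFolder:
    name: str
    query: Callable[[Item], bool]  # saved search, re-run every time the folder is opened

    def contents(self, store: List[Item]) -> List[Item]:
        # Nothing is filed here by hand; membership is the result of the query.
        return [item for item in store if self.query(item)]


store = [
    Item("email", "Re: Jorgensen renewal", "Contract terms attached."),
    Item("document", "Q3 budget", "Includes the Jorgensen account forecast."),
    Item("contact", "Pat Smith", "Dentist."),
]

jorgensen = VirtualFolder(
    "Jorgensen",
    lambda item: "jorgensen" in (item.title + " " + item.text).lower(),
)

titles = [item.title for item in jorgensen.contents(store)]
print(titles)

# A new item shows up in the folder without anyone filing it there.
store.append(Item("email", "Lunch?", "Can we talk about Jorgensen over lunch?"))
titles_after = [item.title for item in jorgensen.contents(store)]
print(titles_after)
```

Because the folder is just a saved search, the same items can sit in any number of virtual folders at once, which is what makes re-categorizing after the fact cheap.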

However, with virtual folders you still have the chicken-and-egg problem. How do you know to set up a query to find something if you can't remember even vaguely that you've entered it or simply have no idea what to search for? What if you can't even remember what relevant categories/folders to use because you have so many? I think the ideal solution to this problem is to use Bayesian learning techniques (ironically, the same statistical techniques used to train many spam filters). Since in any good PIM/heavy outliner you'll have a set of categories/folders, you can use things explicitly placed in those categories as training examples for a Bayesian classifier. Once you have that, every time the user starts entering a note/item, the system can present the user with an automatically generated list of "suggested categories," which makes it easier to file things away accurately. The Haystack program from MIT appears to do exactly this. However, we can take the idea even further. Instead of using queries to put data into virtual folders, why not use a Bayesian system instead? Let's say you suddenly need to create a category that pulls in all data from the Jorgensen account. Instead of setting up queries, all you would need to do is find half a dozen documents related to this project (doing a basic manual find) and set them as learning examples for the new "Jorgensen" category. The program would do the rest, evaluating everything in the database according to these learned examples and putting items that pass into the new category. It will, of course, make mistakes, but you can select a couple of the mistakes and use them as negative training examples. The system can then re-evaluate everything in the database and pull in everything that fits the new set of learned examples. Rinse and repeat as necessary. Like the best modern spam filters, you would only need a handful of training examples to get very good, relevant predictive categorization.
Assuming this kind of system was implemented properly, it would be great for discovering unexpected but relevant little bits of information that you've squirreled away.
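
To make the idea concrete, here is a minimal sketch of a naive Bayes text classifier in plain Python, trained on a few positive and negative examples for a "jorgensen" category. Everything in it is an assumption for illustration: the `NaiveBayes` class, the "other" label for negative examples, and the sample notes are all made up, and a real PIM would need far more data and better tokenizing than this.

```python
import math
import re
from collections import Counter, defaultdict


def tokens(text):
    # Crude tokenizer: lowercase runs of letters and digits.
    return re.findall(r"[a-z0-9]+", text.lower())


class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of training examples
        self.vocab = set()

    def train(self, text, label):
        words = tokens(text)
        self.word_counts[label].update(words)
        self.doc_counts[label] += 1
        self.vocab.update(words)

    def log_score(self, text, label):
        # log P(label) + sum of log P(word | label) over known words.
        score = math.log(self.doc_counts[label] / sum(self.doc_counts.values()))
        total_words = sum(self.word_counts[label].values())
        for word in tokens(text):
            if word in self.vocab:  # words never seen in training are ignored
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
        return score

    def classify(self, text):
        return max(self.doc_counts, key=lambda label: self.log_score(text, label))


nb = NaiveBayes()
# Positive examples found with a basic manual search.
for note in ["Jorgensen contract renewal terms",
             "Invoice for Jorgensen account shipment",
             "Meeting notes Jorgensen shipment schedule"]:
    nb.train(note, "jorgensen")
# Negative examples: items that should stay out of the category.
for note in ["Dentist appointment Tuesday",
             "Recipe for lemon cake",
             "Car insurance renewal reminder"]:
    nb.train(note, "other")

database = ["Call about the shipment schedule for the account",
            "Lemon cake for Tuesday"]
matches = [note for note in database if nb.classify(note) == "jorgensen"]
print(matches)
```

Note that the first database entry never mentions "Jorgensen" at all; it is pulled in because its vocabulary resembles the training examples. Correcting a mistake is just another call to `train` with the right label, followed by re-running `classify` over the database.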

-- Chris

srdiamond15 7/11/2004 1:45 pm

Daly mentioned facet analysis, an approach developed by library scientists to solve this problem. Two personal programs implement this approach, MDE InfoHandler and Sycon Idea! (The exclamation point is part of the name, not my emphasis.) InfoHandler is more systematic but requires more work to implement. Idea! accomplishes this end with less formality, and can be used to construct ordinary hierarchies too. (Karteset and NoteLens are also suitable for implementing a personal facet approach.)

Facet analysis is hierarchical but the hierarchy doesn't pertain to the objects classified as such but to aspects of the objects. You try to sketch out a system of aspects (facets) of the objects you want to classify such that the system is exhaustive but NOT mutually exclusive. For many domains this won't be possible, but you try to approximate the ideal.

Here's how it works. Say you want to classify animals, so that you can recognize a xanthu when you see one, even though you forgot you ever heard of one. You would put in place a facet system in which the facets relate to observable characteristics of the animal. One facet might be size, and under it you would have the alternatives: small, medium, and large. Another might be speed of locomotion: fast, medium, slow. Another might be means of locomotion: wings, two legs, four legs, many legs. Once you have a database with objects cross-classified in a rich set of facets, you can pinpoint the animal from the traits you observe.
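
A small Python sketch of that lookup, using the three facets above. The animals and their facet values are placeholder data chosen for the example, and `find_by_facets` is a made-up helper, not a function from InfoHandler or Idea!.

```python
# Each object is cross-classified under every facet (sample data, for illustration only).
animals = {
    "hummingbird": {"size": "small", "speed": "fast", "locomotion": "wings"},
    "ostrich": {"size": "large", "speed": "fast", "locomotion": "two legs"},
    "tortoise": {"size": "medium", "speed": "slow", "locomotion": "four legs"},
    "centipede": {"size": "small", "speed": "medium", "locomotion": "many legs"},
}


def find_by_facets(catalog, **observed):
    """Return the names of all entries that match every observed facet value."""
    return sorted(
        name
        for name, facets in catalog.items()
        if all(facets.get(facet) == value for facet, value in observed.items())
    )


# One observed trait narrows the field; each further trait narrows it again.
small_animals = find_by_facets(animals, size="small")
small_and_fast = find_by_facets(animals, size="small", speed="fast")
print(small_animals)
print(small_and_fast)
```

You never need to remember the name or where you filed it. You start from whatever traits you happen to observe, in any order, and the candidates shrink as you add facets.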

Perhaps the purest example of facet analysis is the periodic table of the elements, wherein the elements are classified based on facets of their atomic structure. Contrast this with Linnaeus's classification of organisms, which works in the manner of an ordinary hierarchy.