Re: Getting it OUT!


Note: This message is from the outliners.com archive kindly provided by Dave Winer.

Outliners.com Message ID: 2034

Posted by thompson.chris
2004-07-10 00:42:22

This is a topic I find quite interesting. My own belief is that any good system needs to let you easily re-categorize data after you enter it, perhaps weeks or months later, rather than only at the time of entry. It seems that most major applications are moving in this direction. As an example, consider email. For the longest time, email programs have let you file emails away into folders, but this was always a pain in the neck if you later needed a group of emails gathered together under a different organization scheme that you hadn’t thought of before. Nowadays most email programs are moving away from the folder concept and towards “virtual folders.” If you need a specific grouping of emails, perhaps related to one specific project at work, you set up a query and all the relevant emails appear in a virtual folder, not because you put them there directly but because they match a search. The next generation of Mac OS X and Windows Longhorn will both support this concept at the operating system level, e.g. you can set up a folder in the OS asking for documents related to the Jorgensen account and everything relevant on your computer just “appears” in this folder: Word documents, emails, address book entries, etc. (I’m not entirely certain that Longhorn will do this for anything other than files, but the recent OS X Tiger demo definitely did do this.)
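To make the virtual folder idea concrete, here is a minimal sketch in Python. It is not how any particular mail client or OS implements it; `Item`, `VirtualFolder` and `mentions` are names made up for illustration. The point is only that the folder stores a query rather than a list of items, so its contents are recomputed from whatever is in the store at the moment you look.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Item:
    """Hypothetical record type standing in for an email, document or contact."""
    kind: str
    title: str
    body: str


@dataclass
class VirtualFolder:
    """A folder that holds a saved query instead of a fixed list of items."""
    name: str
    query: Callable[[Item], bool]

    def contents(self, store: Iterable[Item]) -> List[Item]:
        # Nothing is filed here by hand; membership is the result of the search.
        return [item for item in store if self.query(item)]


def mentions(term: str) -> Callable[[Item], bool]:
    """Build a simple case-insensitive keyword query (an assumed query style)."""
    needle = term.lower()
    return lambda item: needle in (item.title + " " + item.body).lower()


store = [
    Item("email", "Re: Jorgensen contract", "Draft attached."),
    Item("document", "Q3 budget", "Includes the Jorgensen retainer."),
    Item("contact", "Pat Lee", "Dentist"),
]

jorgensen = VirtualFolder("Jorgensen account", mentions("Jorgensen"))
titles = [item.title for item in jorgensen.contents(store)]
print(titles)
```

Because the folder is just a query, anything added to the store later shows up the next time `contents` is called, with no re-filing.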

However, with virtual folders you still have the chicken-and-egg problem. How do you know to set up a query to find something if you can’t remember even vaguely that you’ve entered it, or simply have no idea what to search for? What if you can’t even remember which categories/folders are relevant because you have so many? I think the ideal solution to this problem is to use Bayesian learning techniques (ironically, the same statistical techniques used to train many spam filters). Since in any good PIM/heavy outliner you’ll have a set of categories/folders, you can use things explicitly placed in those categories as training examples for a Bayesian classifier. Once you have that, every time the user starts entering a note/item, the system can present an automatically generated list of “suggested categories,” which makes it easier to file things away accurately. The Haystack program from MIT appears to do exactly this.

However, we can take the idea even further. Instead of using queries to put data into virtual folders, why not use a Bayesian system instead? Let’s say you suddenly need to create a category that pulls in all data from the Jorgensen account. Instead of setting up queries, all you would need to do is find half a dozen documents related to this project (doing a basic manual find) and set them as learning examples for the new “Jorgensen” category. The program would do the rest, evaluating everything in the database against these learned examples and putting items that pass into the new category. It will, of course, make mistakes, but you can select a couple of the mistakes and use them as negative training examples. The system can then re-evaluate everything in the database and pull in everything that fits the new set of learned examples. Rinse and repeat as necessary. As with the best modern spam filters, you would need only a handful of training examples to get very good, relevant predictive categorization.

Assuming this kind of system were implemented properly, it would be great for discovering unexpected but relevant little bits of information that you’ve squirreled away.
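
The train, check, correct loop described above can be sketched with a small two-class naive Bayes classifier. This is only an illustration, not how Haystack or any real PIM does it: `CategoryLearner`, the toy `database` and the add-one smoothing are all choices made for the sketch. One extra assumption: plain naive Bayes needs at least one counterexample to compare against, so the sketch has the user mark one obviously unrelated item as a negative up front, then adds more negatives as mistakes turn up.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lowercase word tokens; a deliberately crude tokenizer for the sketch."""
    return re.findall(r"[a-z0-9']+", text.lower())


class CategoryLearner:
    """Naive Bayes over two classes: 'in the category' (True) vs 'not' (False)."""

    def __init__(self):
        self.word_counts = {True: Counter(), False: Counter()}
        self.doc_counts = {True: 0, False: 0}

    def train(self, text, belongs):
        """Add one positive (belongs=True) or negative (belongs=False) example."""
        self.word_counts[belongs].update(tokens(text))
        self.doc_counts[belongs] += 1

    def log_odds(self, text):
        """Log of P(in category | text) / P(not in category | text)."""
        vocab = set(self.word_counts[True]) | set(self.word_counts[False])
        size = len(vocab)
        total_in = sum(self.word_counts[True].values())
        total_out = sum(self.word_counts[False].values())
        # Class prior with add-one smoothing so an empty class is not fatal.
        score = math.log((self.doc_counts[True] + 1) / (self.doc_counts[False] + 1))
        for word in tokens(text):
            if word not in vocab:
                continue  # words never seen in training carry no evidence
            p_in = (self.word_counts[True][word] + 1) / (total_in + size)
            p_out = (self.word_counts[False][word] + 1) / (total_out + size)
            score += math.log(p_in / p_out)
        return score

    def members(self, database):
        """Re-evaluate every item and keep those that lean towards the category."""
        return [doc for doc in database if self.log_odds(doc) > 0]


database = [
    "Jorgensen contract renewal draft",
    "Invoice for the Jorgensen account",
    "Dentist appointment on Tuesday",
    "Call Jorgensen about the contract",
    "Invoice from the dentist",
    "Lunch on Tuesday",
]

learner = CategoryLearner()
learner.train(database[0], belongs=True)    # hand-picked positive examples
learner.train(database[1], belongs=True)
learner.train(database[2], belongs=False)   # one obviously unrelated item
first_pass = learner.members(database)

# The first pass wrongly includes the dentist invoice; mark it as a negative
# example and re-evaluate the whole database.
learner.train("Invoice from the dentist", belongs=False)
second_pass = learner.members(database)

print(first_pass)
print(second_pass)
```

On this toy data the first pass picks up the unlabeled "Call Jorgensen about the contract" but also the dentist invoice; after that one mistake is marked as a negative, the second pass drops it and keeps the rest. The same `log_odds` score, computed by one learner per existing category, could presumably be sorted to produce the “suggested categories” list for a note as it is being entered.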

—Chris
