RFC - New Software Project: Infosqueezer

Started by Lothar Scholz on 9/11/2019
Lothar Scholz 9/11/2019 11:15 am
Hello,

I want to announce that in October I will start working full-time for two years (this is now fully financed) on my information/knowledge application. I spent my spare time over the last years implementing some prototypes, and I have made enough mistakes to now be wise enough to create something great.


This Request For Comments is for you to provide me with your ideas, critiques and suggestions, as I’m going to implement a few totally new concepts and combine them with well-known stuff.

It will be a 1, 2 and even 3 pane outliner for information, a PDF+HTML reader with qualitative text analysis capability, a web clipper to provide this HTML, and some kind of file management tool.

Development will happen on macOS and Linux first. I am keeping an eye on portability to Android and iOS, but that will come later. The same goes for Windows, because Microsoft GUI development currently totally sucks, and until they have decided on their direction I am postponing all work on that platform.


*** Cards and Panes

Information is stored on little index cards which contain markdown text and optionally a represented file (its purpose is described later).
A card can link to other files but represents at most one file. Wiki-style linking among cards is possible.

There are two kinds of cards. Free Floating „Knowledge“ cards and cards that are the nodes of an outliner. Free floating nodes can be displayed in multiple outliners and multiple locations in one outliner or they can be outside of any outliner organization and only be revealed by search and browsing features.

The left pane outline is a full-featured outliner with columns, like OmniOutliner. It allows clones and mark+gather operations and all the features discussed and found useful on this board in the past. Each node in the outliner can also reference a free floating card, which is shown in the 2nd pane.

Compared to other 2-pane outliners, this concept makes the outline a full data document and not just the table of contents of a collection of cards. This is a unique feature. The reason is easy to understand. For example, if you write a book, you can use the outline pane to organize your chapters and add comments about progress and todos, while the cards contain the content of your book. This decoupling is important because an organized set of content items is much more than the sum of its content, and often information only makes sense because of the combination of the items. No outliner so far could model this.

I just want to mention that there will be no difference between folders and items in the outline. That distinction was a concept taken from the technical implementation of file systems, and I don’t know why so many outliners just copied it.

The third pane shows cards that are related to the current card shown in the 2nd pane.
These can be the destination/source cards of links on the current card, or cards that some AI algorithm considers similar.

Also, each outline node can contain an explicit list of related cards. In the book writing example you would use this list to attach cards to a node with knowledge that you want to mention in the text you write in the 2nd pane. This is a feature I’ve seen in the IBM Doors requirements engineering tool, where you keep and reference the source material from there, like legal requirements or technical specs. Sure, you could just add them as intrinsic links on the outline node, but having a separate list of documents is IMHO cleaner.

The 2nd and 3rd pane will show multiple cards at once.


*** Data Fields and Sections

As a huge fan of the abandoned Asksam and the ability to add data fields directly inside the document wherever you want I have to add this feature.
And in the current design it has turned out to be the key element.

The design rule differs from a normal database in that the user and the act of entering information have priority over the correctness of the database. If you have a year field in an art database you don’t need to enter a 4 digit year, you can also notate it as „painted during his orange phase living in Washington“. This will be stored but underlined with an error line. In the end it’s better to write it down than to forget it.
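A minimal sketch of this "store first, validate second" rule, in Python. The names `FieldValue` and `read_year` are hypothetical and not part of Infosqueezer; the only assumption taken from the text is that a year field expects four digits and that anything else is kept but flagged.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class FieldValue:
    raw: str                # always stored exactly as the user typed it
    parsed: Optional[int]   # typed value, or None if the text did not validate
    valid: bool             # False means the UI would draw the error underline


def read_year(raw: str) -> FieldValue:
    """Accept any text for a year field; only flag it when it is not 4 digits."""
    text = raw.strip()
    if re.fullmatch(r"\d{4}", text):
        return FieldValue(raw, int(text), True)
    # Nothing is rejected: the free text is kept and merely marked invalid.
    return FieldValue(raw, None, False)
```

The point of the design is that `read_year` never raises and never discards input; correctness is an annotation on the value, not a gate in front of it.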

Fields can be added one or multiple times. The content of a field can contain multiple data values, like a comma-separated list or a ZIP code/city/country combination that is automatically broken into data pieces by a pattern matcher.

For example if you write the following on a card in your movie database it will create a „Movie“ Record (declared by the @@ line) with the fields

@@Movie:
@Title: Star Wars: Episode IV - A New Hope
@Director | Writer: George Lucas
@Year: 1977
@Rating: 10
@Genre: SciFi
@Actor: Mark Hamill
@Actor: Harrison Ford, Carrie Fisher, Alec Guinness
@Synopsis:
*Luke Skywalker* joins forces with a [[Jedi Knight|Jedi]], a cocky pilot,
a Wookiee and two droids to save the galaxy from the Empire's world-destroying battle station,
while also attempting to rescue Princess Leia from the mysterious Darth Vader.


As you see, it’s possible to write in a condensed form, like the @Fieldname | Fieldname: syntax or the comma-separated list.
There are tons more details and fine tunings, but this explanation should be enough to understand the idea. A field can contain any markdown text, including images and links. In fact, fields just split the text into different sections.
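To make the syntax concrete, here is a rough Python sketch of a parser for such a card. It is only an illustration under stated assumptions: the function name `parse_card` is made up, comma splitting is applied only to fields listed in a hypothetical `list_fields` set (so that a synopsis containing commas stays intact), and lines that do not start with `@` are treated as a continuation of the field above them.

```python
import re
from collections import defaultdict

RECORD_LINE = re.compile(r"^@@(\w+):\s*$")          # e.g. "@@Movie:"
FIELD_LINE = re.compile(r"^@([^@:][^:]*):\s*(.*)$")  # e.g. "@Year: 1977"


def parse_card(text, list_fields=frozenset({"Actor"})):
    """Return (record_type, {field name: [values]}) for one card."""
    record_type = None
    fields = defaultdict(list)
    current = []  # field names that continuation lines belong to
    for line in text.splitlines():
        m = RECORD_LINE.match(line)
        if m:
            record_type, current = m.group(1), []
            continue
        m = FIELD_LINE.match(line)
        if m:
            # "@Director | Writer: ..." fills both fields with the same value.
            current = [n.strip() for n in m.group(1).split("|")]
            value = m.group(2).strip()
            for name in current:
                if name in list_fields:
                    fields[name].extend(
                        v.strip() for v in value.split(",") if v.strip())
                else:
                    fields[name].append(value)
            continue
        if current and line.strip():
            # Plain text line: append it to the most recent field value.
            for name in current:
                if fields[name]:
                    joined = fields[name][-1] + " " + line.strip()
                    fields[name][-1] = joined.strip()
                else:
                    fields[name].append(line.strip())
    return record_type, dict(fields)
```

Because a field is split at its first colon only, a title such as "Star Wars: Episode IV - A New Hope" survives unchanged, and repeating `@Actor:` simply appends to the same list.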

Now back to the outline. You can add children to a node automatically. For example, if you have your curated outline of movies and a node „Best SciFi“, you can add „@@Movies@Rating >= 9 AND @@Movies@Genre == SciFi SORT BY @Rating“ and the outline will automatically fill the child nodes with the top movies.

„@@Movies@Rating GROUP BY @Genre, @Year“ would fill the outline three levels deep and create items like
Movies
- SciFi
  - 1977
    - Star Wars: Episode IV

If you know databases, this is how the SQL GROUP BY clause works for selecting data. The results will be automatically inserted into the outline when you expand the item containing this query (and then stay this way until collapsed or explicitly refreshed).
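As a rough illustration of how such a query could turn into outline children, here is a small Python sketch. It does not parse the query language; `select` and `group_into_outline` are hypothetical helpers, the sort is assumed to be descending (the text only says "top movies"), and all records except Star Wars are placeholder data.

```python
def select(records, predicate, sort_by=None):
    """Filter records and optionally sort them, highest value first (assumed)."""
    hits = [r for r in records if predicate(r)]
    if sort_by is not None:
        hits.sort(key=lambda r: r[sort_by], reverse=True)
    return hits


def group_into_outline(records, group_fields, label_field="Title"):
    """Nest records one outline level per GROUP BY field; leaves are labels."""
    if not group_fields:
        return [r[label_field] for r in records]
    first, rest = group_fields[0], group_fields[1:]
    buckets = {}
    for r in records:
        buckets.setdefault(r[first], []).append(r)
    return {key: group_into_outline(rows, rest, label_field)
            for key, rows in buckets.items()}


# Placeholder data for illustration only.
movies = [
    {"Title": "Star Wars: Episode IV", "Genre": "SciFi", "Year": 1977, "Rating": 10},
    {"Title": "Movie A", "Genre": "SciFi", "Year": 1979, "Rating": 9},
    {"Title": "Movie B", "Genre": "Comedy", "Year": 1977, "Rating": 6},
]

# "@@Movies@Rating >= 9 AND @@Movies@Genre == SciFi SORT BY @Rating"
best_scifi = select(movies,
                    lambda r: r["Rating"] >= 9 and r["Genre"] == "SciFi",
                    sort_by="Rating")

# "GROUP BY @Genre, @Year" -> genre level, year level, then the titles
outline = group_into_outline(movies, ["Genre", "Year"])
```

Each dictionary level corresponds to one outline level below the node that holds the query, which is why two GROUP BY fields give three levels including the movie titles.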

So far I have never seen or heard of any outliner that automatically generates the data shown as children.

If you add nodes automatically, you can’t add any individual text or attach cards to them (there might be ways, but I’m not thinking about this at the moment). But you could add a description of the database search to the top-level „Movies“ node, so you know what happens when you expand the node.


*** Outliner and Columns

The data fields can be used to fill columns in the outline. Either fields taken from the referenced free floating knowledge cards or the outline node card. In the book example you could add a Progress column and add a „@Progress: mostly done“ line to your outline node and it will show the field value as column/row value.

Each pane will have a slider at the bottom to control how many lines of the markup are shown. So you can compact the outline to one line of text and then use the columns to get an easy status overview.


*** Tags

Outlines and Tags are complementary ways to organize notes.

Infosqueezer will support tags in an autogenerated tag tree view (I saw it in a PhD thesis some time ago but can’t find the reference anymore; AFAIK it has never been implemented in a real product so far). This is not the almost useless idea that Bear is doing but really smart, and it makes very good browsing through a document collection possible. Studies have shown that browsing and scrolling are still far more popular ways to find items than direct search.

If you add multiple hashtags to a card, say #Politics #Trump #BrExit, this will generate a tag tree/outline with the following nodes:

- Politics
  - Trump
    - BrExit
  - BrExit
    - Trump
- Trump
  - Politics
    - BrExit
  - BrExit
    - Politics
- BrExit
  - Trump
    - Politics
  - Politics
    - Trump

This means all permutations of tags are created in the tag tree. Each level means that cards must have at least the tags specified by the current node and all nodes above.
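One possible way to generate such a tree, sketched in Python. The function names and the representation of cards as a mapping from card id to a set of tags are assumptions for illustration. Under a given path, only tags that co-occur with every tag on that path become children, so the tree contains just the permutations that actually lead to cards.

```python
def cards_at(cards, path):
    """Cards that carry at least all tags on the path from the root."""
    wanted = set(path)
    return {cid: tags for cid, tags in cards.items() if wanted <= tags}


def build_tag_tree(cards, path=()):
    """Nested dict of tags; each level narrows the matching cards further."""
    matching = cards_at(cards, path)
    # Candidate children: tags on the matching cards that are not yet on the path.
    remaining = sorted({t for tags in matching.values() for t in tags} - set(path))
    return {tag: build_tag_tree(matching, path + (tag,)) for tag in remaining}


cards = {"card-1": {"Politics", "Trump", "BrExit"}}
tree = build_tag_tree(cards)
```

A lazy variant that only computes the children of a node when it is expanded would avoid building every permutation up front, which grows factorially with the number of tags on a card.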

You can filter the tags used to build the tree based on fields and search queries. For example, you can base the tag tree on every tag found inside the @@Movies@Synopsis field. This opens endless opportunities to fine tune your database.


*** Qualitative Text Analysis

This is another unique aspect I have never seen anywhere, but it will fit nicely and easily into the overall structure of Infosqueezer.

You can add PDF files or captured HTML webpages to the database. Each file gets its own free floating card and is represented by this card.

This allows you to easily add metadata like bibliography fields to the files. If you use the normal PDF annotation feature, the marked text and your own added text are added automatically to the card as a field inside an @@Annotation record. This will use the block quote and cite source syntax from the MultiMarkdown specification.

You can add hashtags to each annotation and use the tag tree growing and all the other powerful methods above.


*** Multiple Pages for Card

Each card has multiple pages. Currently I think about: Foreground, Background, References and Annotations.

The use case for a foreground and background page can be easily seen in Wikipedia where you have the knowledge page and the discussion page for each topic. Or in IMDB where the main page contains an overview (with a selection of the cast members) and the background page can contain the full very long data list.

It makes sense to add another References page for footnotes (which become endnotes), and for PDF-based cards the annotations go on the Annotations page.

In Markdown it’s easy. We already have the horizontal rule syntax (three dashes). A page break will be specified by three tilde characters followed by the page name, like

~ ~ ~
This is main page

~ ~ ~ Discussion
Lets talk about it

~ ~ ~ References
[^bible]: Isaiah 66:11 That ye may suck and be satisfied with the breasts of her consolations
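A minimal Python sketch of how a card could be split into its pages with this syntax. The function name `split_pages` is made up, and two things are assumptions: that a break line without a name starts the default page (called "Foreground" here, after the page list above), and that both `~ ~ ~` and `~~~` are accepted as the break marker.

```python
import re

# A page break: three tildes (optionally space-separated), then the page name.
PAGE_BREAK = re.compile(r"^~ ?~ ?~\s*(.*)$")


def split_pages(text, default_page="Foreground"):
    """Return {page name: page text} in the order the pages appear."""
    pages = {}
    current = default_page
    for line in text.splitlines():
        m = PAGE_BREAK.match(line)
        if m:
            # An unnamed break falls back to the default page (assumption).
            current = m.group(1).strip() or default_page
            pages.setdefault(current, [])
            continue
        pages.setdefault(current, []).append(line)
    return {name: "\n".join(lines).strip() for name, lines in pages.items()}
```

Note that `~~~` is also used as a code fence marker by some Markdown dialects, so a real implementation would have to decide which meaning wins.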




I think this is enough for the first presentation pitch.


Alexander Deliyannis 9/11/2019 5:40 pm
Quick reaction: I'm very impressed. There's much in there that I miss in existing software (or find it scattered in different programmes but never all at once).

More detailed feedback will follow.

Lothar Scholz wrote:
I think this is enough for the first presentation pitch.

Franz Grieser 9/11/2019 7:28 pm
Sounds interesting. And you have a track record of creating and maintaining a stable app :-)
I will surely dig deeper and comment on some points.

First question(s): You talk about a database. Will the data be stored in a proprietary file format? What about the PDF files and HTML data that can be added? Where will they be stored? And what about images, audio, video, equations...?

Lothar Scholz 9/11/2019 9:12 pm

> Sounds interesting. And you have a track record of creating and
> maintaining a stable app :-)

No. I do not. But I think we have all learned never to buy software based on promised features and timelines, but always on what is available right now.
Products come and go, single developer or big corporation. This is not a Kickstarter project.
I don't ask for your money in advance, I ask for your thoughts.

I'm an experienced programmer. I have been doing this since I was 14 years old and hit my 50th birthday a few months ago.
So I think I'm at least more qualified than the guy from Polywick Storyserver.

Oh yeah, for my German computer science master's thesis I wrote a search engine for Usenet news. It was used by the once popular German search
engine called "Fireball" in the early days of the internet in 1998. And my interest in information processing never stopped afterwards.

> First question(s): You talk about a database. Will the data be stored in
> a proprietary file format? What about the PDF files and HTML data that
> can be added? Where will they be stored? And what about images, audio,
> video, equations...?

There is no "database". I like the NoSQL "movement" because they have shown the world that SQL and relational databases are not the only way to do things.

I have developed a preprocessed format to store the markup text and index the data field / hashtag parts. This is good enough. The markdown of cards and outlines will be kept completely in memory (mmapped, so the system can swap it out under memory pressure) without special indexes. The data size is hardly a problem. Assume a few hundred megabytes, but even a few gigabytes will be OK. Just remember that all threads and messages on this board total less than 20 MB. People often overestimate this a lot.

By the way, exactly this question was why I asked here in February this year: https://www.outlinersoftware.com/topics/viewt/8580

The data itself is written generationally, so only the modified delta is stored, which reduces write operations on the SSD.

Because the program will run purely in single-user mode on your own local database on your SSD, there is no need for database optimizations. We now have disks with transfer rates of 3GB/sec and CPUs with a 40GB memory throughput and 8 or more cores in mainstream desktops and even phones. It's time to use them.
The program will not be cloud based, but I want to implement a Peer2Peer synchronization feature or an on-premises synchronization server.

PDF and HTML will be stored externally and so will any full text index. HTML snapshots are stored in a proprietary format to eliminate duplicate items.

I know very well that some people here love to have their data in the file system as normal markdown so that it can be accessed via Spotlight etc. Therefore I thought about storing a duplicate of the data in the filesystem, or about the very overengineered but fun idea of implementing a custom user file system that gets mounted via FUSE and could provide very interesting access patterns to the stored data, just in case anyone wants to run a script on the files or import them elsewhere. Anyone old enough to remember the MH mail client? That was nerd fun. But there is no FUSE on Windows, so I doubt it will happen.

Video and audio ... they will be implemented as simple file links, nothing else on the agenda at the moment.

For equations, I looked at how ConnectedText handles LaTeX. It is open source and I think I could integrate that. It's not on my agenda at the moment either, but I would say it has a much higher probability of getting on my agenda than many other features. In the second round of the markdown editor, tables and equations will be added. But this is 2+ years in the future.


Franz Grieser 9/11/2019 9:40 pm
Thanks for the clarifications.

Lothar Scholz wrote:

> Sounds interesting. And you have a track record of creating and
>maintaining a stable app :-)

No. I do not.

Oh, sorry. I confused you with Christian Tietze, who (co-)created The Archive and a Markdown table editor (https://zettelkasten.de/tools/).

To follow up on my "database"/file storage question: So the data will be stored in a proprietary format on disk? No way to get the data out when I decide to stop using your app or when you (one fine day) stop updating the software?



Lothar Scholz 9/11/2019 10:36 pm

> To follow up on my "database"/file storage question: So the data will be
> stored in a proprietary format on disk? No way to get the data out when
> I decide to stop using your app or when you (one fine day) stop updating
> the software?

Using the Export feature?
I don't think that the database itself needs to be non-proprietary.
I'm not designing the program to stop launching one day and prevent you from using the export menu option.

I have no intention to create a vendor lock in.

But if you use it there will be one automatically because of the unique feature set.
This happens with every program.

There will be native exports to plain text markdown, XML, JSON and OPML.

The JSON and XML file formats will be simple and documented, and they will be the base for anyone who wants to write transformation tools.
I will write at least one transformation tool for HTML myself.

If you have good ideas, I'm listening.

I'm a CRIMPer myself, I know that this is important.

Luhmann 9/12/2019 4:56 am
My main concern is workflow. It is hard to get a sense of this without mockups of the actual UX, but it sounds like things will be divided into "cards"? One of the things I like about both Ulysses and Dynalist is the ability to easily merge and/or divide other blocks of text into either larger or smaller units. In some apps, however, this can be quite cumbersome, requiring a lot of pre-processing of the text in something like BBEdit before it can be imported properly. Let's say I have a PDF which exports with improper line breaks. If I import those and it assigns each line of text to a different card, will I be able to easily merge them and clean up the line breaks within the app? Or the reverse, if I have a text that is long can I easily break it into smaller cards? This kind of thing actually takes up a considerable amount of my time as I move text from one app to another and the less I have to do of this the better. (For the same reason I probably won't actually use the app until iOS support is implemented, but I do like the idea!)

Simon 9/12/2019 7:13 pm


Luhmann wrote:
In some apps, however, this can be quite
cumbersome, requiring a lot of pre-processing of the text in something
like BBEdit before it can be imported properly.

Slightly off topic, but Textsoap is good for this: https://www.unmarked.com/textsoap/

I use it all the time and it’s pretty quick.

MadaboutDana 9/13/2019 8:27 am
Well, good for you, Lothar - I think this sounds like an exciting and desirable concept.

You might want to take a look at Ilaro on macOS/iOS, a rather neat research/reference management app (with some irritating flaws) that does some of what you’re describing.

Cheers,
Bill

Paul Korm 9/13/2019 2:18 pm
Good to see there will be documentation. I think only a vanishingly small number of people know or care to know what an XML transform is, let alone how to write one. So if you let others have the html transform tool as part of the package, that would be appreciated.

Lothar Scholz wrote:
The json and xml file format will be simple and documented and the base
for anyone who want to write transformation tools.
I will write at least one transformation tool for html myself.