Slightly Off-Topic: Organizing offline html files

Started by Mirce on 10/23/2020
Mirce 10/23/2020 1:28 pm
Although not strictly related to the main topics of this forum, I would like the input of fellow forum members on how to organize offline html pages.

Since the excellent Firefox addon Scrapbook X is dead (you can still run it on Waterfox or similar browsers, but its days are numbered) and its successor WebScrapbook requires a PhD to make it work like the old Scrapbook, I have been searching for alternatives to save and organize interesting web articles that I find.

Regarding the saving part, I have settled on SingleFile, a Chrome and Firefox addon which captures the whole page in one html file (including all elements: pictures, styles etc). It is much cleaner and more practical than the normal browser "Save as" / HTML file approach, or printing to PDF.

However, now that I have collected several dozen of those web pages, I am looking for a way to organize them. By organizing, I mean having a sort of TOC and ideally a search option (full-text search across the articles), maybe even some way of tagging the articles by topic.
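
To make the TOC idea concrete, here is a minimal Python sketch. It is only an illustration, not an existing tool: it assumes the saved pages are UTF-8 html files sorted into subfolders of one top folder, and the names `TitleParser`, `page_title` and `build_toc` are made up for this example. It reads the title of each page and writes a single index.html that links to all of them.

```python
import html
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import quote


class TitleParser(HTMLParser):
    """Collects the text of the first <title> element in a page."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.done = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.done = True  # ignore later <title> tags, e.g. inside inline SVG

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def page_title(path):
    """Return the page title, falling back to the file name without extension."""
    parser = TitleParser()
    parser.feed(path.read_text(encoding="utf-8", errors="replace"))
    return parser.title.strip() or path.stem


def build_toc(root, index_name="index.html"):
    """Write an index page in root that links to every saved .html file below it."""
    root = Path(root)
    out = root / index_name
    lines = ["<!DOCTYPE html>", "<meta charset='utf-8'>",
             "<title>Saved articles</title>", "<ul>"]
    for path in sorted(root.rglob("*.html")):
        if path == out:
            continue  # do not list the index itself
        rel = path.relative_to(root).as_posix()
        lines.append(
            f'<li><a href="{quote(rel)}">{html.escape(page_title(path))}</a>'
            f" ({html.escape(rel)})</li>"
        )
    lines.append("</ul>")
    out.write_text("\n".join(lines), encoding="utf-8")
    return out
```

Calling `build_toc("path/to/articles")` (the path is a placeholder) would regenerate the index after new pages are saved. Tagging is not covered here; the folder path shown next to each title is the only grouping.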

I have tried HelpNDoc (a help authoring application which exports web sites that you can use offline, with a search option). However, due to the "compression" (URI, base64, whatever) of the SingleFile files, the imported files contain only the text; the pictures and the overall styles are missing.
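
For anyone wondering what that "compression" looks like: a single-file page typically carries its images inline as base64 `data:` URIs instead of links to separate files. The Python sketch below is an illustration under that assumption; the function name `embedded_resources` is made up, and the pattern only covers the plain `data:<mime>;base64,<payload>` form, not URIs with extra parameters such as a charset.

```python
import base64
import re

# Matches e.g. data:image/png;base64,iVBORw0KGgo...
DATA_URI = re.compile(
    r"data:(?P<mime>[\w.+-]+/[\w.+-]+);base64,(?P<payload>[A-Za-z0-9+/=]+)"
)


def embedded_resources(html_text):
    """Return a list of (mime_type, raw_bytes) for each base64 data URI found."""
    found = []
    for match in DATA_URI.finditer(html_text):
        raw = base64.b64decode(match.group("payload"))
        found.append((match.group("mime"), raw))
    return found
```

An importer that keeps only the text and does not understand such URIs would drop the pictures, which may be what happens in the import described above.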

Does anybody have any ideas on how to organize those files?
Franz Grieser 10/23/2020 2:13 pm
Thanks for the hint to SingleFile. Great add-in.

You can combine it with NotebooksApp for Windows and Mac. However, NotebooksApp does not display the images included in the SingleFile html file. But I can select the "Show in Explorer" command in NotebooksApp and then double-click the html file to open it in Firefox/Chrome.


Kinook 10/23/2020 2:47 pm
You could import (store or link or use folder synchronization) the files into Ultra Recall -- https://kinook.com/UltraRecall/

On my computer, the SingleFile extension saves the files to C:\Users\user\Downloads

I created a .urd file at C:\Users\user

In UR, I added a folder item with a URL of Downloads, then used Item | Synchronize to import the files into UR, where the files can be viewed, searched, tagged, etc.

https://kinook.com/Download/Misc/SingleFile.zip

https://www.kinook.com/UltraRecall/Manual/foldersynchronization.htm

https://www.kinook.com/UltraRecall/Manual/managedocumentsandfiles.htm

https://www.kinook.com/UltraRecall/Manual/inserthyperlinkdialog.htm
Mirce 10/23/2020 2:58 pm
You're welcome, glad that it is useful for other people.

Thank you for the tip regarding NotebooksApp - I suppose it is the version by Alfons? It is a pity that it also doesn't show the saved page as it was (i.e. it leaves out the pictures).

Some more info (maybe useful to other people): I also tried to manage the html files with an app called Snap2Html.
This is basically a program to index the contents of external HDDs and produce a nice html file with a search option (in just one html file!). It also allows you to link to the files directly.
So, what I did was the following: I sorted the web-page html files into thematic folders, ran Snap2Html on the top folder and chose the option to link to the files. The html file exported by Snap2Html became an entry point or index of my collection, where even the file names could be searched. The linked web articles open in the same browser window in all their glory, with pics and styles.
However, this approach offers no full-text search (on the contents of the individual web articles) and no way of tagging.
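
The missing full-text search could in principle be bolted on with a small script. Below is a minimal Python sketch, assuming the saved pages are UTF-8 html files under one top folder. `TextExtractor`, `page_text` and `search` are hypothetical names for this example, only the standard library is used, and every page is re-read on each search, so this is a rough idea rather than a real indexer.

```python
from html.parser import HTMLParser
from pathlib import Path


class TextExtractor(HTMLParser):
    """Collects the text of a page, skipping <script> and <style> contents."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.chunks.append(data)


def page_text(path):
    """Return the text of one saved page with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(Path(path).read_text(encoding="utf-8", errors="replace"))
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


def search(root, phrase):
    """Return the saved pages under root whose text contains phrase (case-insensitive)."""
    needle = phrase.casefold()
    return [
        path
        for path in sorted(Path(root).rglob("*.html"))
        if needle in page_text(path).casefold()
    ]
```

For a few dozen articles a scan like this should be quick enough; a larger collection would probably want a proper index.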

I will keep on looking, and would be grateful if you all could point me to some other solutions or workarounds.
Mirce 10/23/2020 5:13 pm
Thank you for your suggestion. I see that you also use SingleFile, so I suppose the saved html files are shown correctly in the UltraRecall viewer (with pictures, styles etc.)?
UR could be a solution, but as I see it, if I only link to the files from the urd database, I won't have full-text search, but I can tag and manage the collection while the articles remain as they are (so I am not locked in).
If I import them into the database (urd file), I will have full-text search, but the files will be locked in. Given that the html files saved with SingleFile are rather big, what about the scalability of the database format? How big can it get without slowing down UltraRecall?

After the Firefox/Scrapbook X experience I am very hesitant to rely on a locked-in solution.

Given the ephemeral nature of web sites, I am really surprised that there are so few ways to archive and manage interesting web articles, at least for the average user (I am aware of the WARC format and various other approaches, but they lack the simplicity and straightforwardness of Scrapbook or SingleFile).
How do the other members of this forum keep interesting stuff found on the web?



Kinook wrote:
You could import (store or link or use folder synchronization) the files
into Ultra Recall -- https://kinook.com/UltraRecall/

On my computer, the SingleFile extension saves the files to
C:\Users\user\Downloads

I created a .urd file at C:\Users\user

In UR, I added a folder item with a URL of Downloads, then used Item |
Synchronize to import the files into UR, where the files can be viewed,
searched, tagged, etc.

https://kinook.com/Download/Misc/SingleFile.zip

https://www.kinook.com/UltraRecall/Manual/foldersynchronization.htm

https://www.kinook.com/UltraRecall/Manual/managedocumentsandfiles.htm

https://www.kinook.com/UltraRecall/Manual/inserthyperlinkdialog.htm
Kinook 10/23/2020 5:28 pm
Yes, the pages will be displayed in UR or UR Viewer.

Even if the files are imported as linked files and not stored, the content is still parsed, indexed, and searchable in UR.

https://kinook.com/UltraRecall/Manual/auto_generated.htm

UR can handle large files without difficulty.

https://www.kinook.com/Forum/showthread.php?t=709

https://www.kinook.com/Forum/showthread.php?t=731
Franz Grieser 10/23/2020 6:33 pm
Mirce wrote:
Thank you for the tip regarding NotebooksApp - I suppose it is the
version by Alfons? It is a pity that it also doesn't show the saved
page as it was (i.e. it leaves out the pictures).

Yes, I was talking about Alfons Schmid's NotebooksApp.
For me, the combination of NotebooksApp and SingleFile is good enough, as far as I can tell after a few hours of using it. Most of the time, the text is sufficient for my needs.


Gorski 10/24/2020 12:45 am

Mirce wrote:
How do the other members of this forum keep interesting stuff found on the web?

I hate all the crap that comes with saving a web page as html. I usually just want the words, so for years I've used an AutoHotkey script that just saves selected text from the web page to a text file with the URL.
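
For anyone without AutoHotkey, the same idea can be sketched in Python. This is not the script mentioned above, just a minimal illustration: the function name `save_clip`, the "Source:" header and the timestamped file name are all assumptions, and getting the selected text and URL out of the browser is left to whatever clipboard or hotkey tool you use.

```python
import re
from datetime import datetime
from pathlib import Path


def save_clip(text, url, folder, title="clip"):
    """Write text to a new .txt file in folder, with the source URL on top."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    # Replace characters that are awkward in file names.
    safe_title = re.sub(r"[^\w\- ]+", "_", title).strip() or "clip"
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    path = folder / f"{stamp} {safe_title}.txt"
    path.write_text(f"Source: {url}\n\n{text.strip()}\n", encoding="utf-8")
    return path
```

Plain text files like these stay searchable with any desktop search tool, which is the main appeal of the approach.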

When I do want the images too, I use the SingleFile extension.

Gorski 10/24/2020 12:50 am

I should add that while I prefer my AutoHotkey script, the markdown-clipper extension is very good for saving web pages to text in Markdown format.

https://chrome.google.com/webstore/detail/markdown-clipper/cjedbglnccaioiolemnfhjncicchinao


MadaboutDana 10/26/2020 10:45 am
I tend to print off articles etc. I find useful to PDF, saving them to a folder/set of folders indexed by FoxTrot Pro. I did use Curiota (macOS), the rather nice free app provided by the developer of Curio; this works extremely well if you’ve got a large set of subfolders. However, for various boring reasons I now use a simple PDF export option I set up myself in Safari’s print dialog box.

Having said that, some web pages won’t allow you to go into reader view, or even print off more than one page at a time. For these, I use Bear (macOS/iOS only), which has a truly great web page import option (driven by a Safari/Chrome extension). The page ends up as Markdown, plus whatever images were on the page copied to a subfolder. You can then export the Bear page (if you want to) as PDF or a number of other formats.

I regularly trawl through my (vast) collection of PDF files winnowing them down, or reducing file sizes using PDF Expert’s “Reduce File Size” option (again, macOS/iOS only).

Other options involving Safari extensions include Quiver (which handles web pages very well, but is restricted to macOS only) and Keep Everything (a rather good markdown-based information manager which hasn’t been updated for a while, but is very powerful; Bear is similar, however, in that it converts web pages into markdown).

MacJournal, DEVONthink and Scrivener also have “Save as PDF” options embedded in the macOS print dialog box, but I haven’t experimented with those. At one point I also set up Notebooks as a “Save to PDF” option, but found that this doesn’t always work predictably.

Cheers,
Bill
jbaltsar 10/27/2020 8:28 am
You could give Joplin a try (https://joplinapp.org). It has a web clipper which works very nicely and which also cleans and sanitizes the HTML. It's open source and actively developed. The only negative: you can only have one database, so if you plan to use it also for your personal notes etc. it can become crowded (but it works fine for me so far).

Other options (they all have some web clipping features):
Trilium (https://github.com/zadam/trilium)
InfoQube and IQOutliner (https://www.infoqube.biz)
QOwnNotes (https://www.qownnotes.org)
Notion

So far Joplin feels best for me, but all do the job.

Mirce 10/27/2020 8:31 am

Interesting workflow. I also used to Save As PDF / Print to PDF in Chrome for a while. Still use it for some pages (if they print correctly as PDF's), however I found that PDF is sort of a dead end (for me). If I have the articles as html, I can later convert them to PDF if needed, but the other way around (PDF --> html) is not feasible.

Ah, the Mac world. You are so "spoiled" by great applications. I've been tempted hundreds of times to invest in a Mac, but the costs are prohibitive from my point of view. I managed to set up a virtual machine with macOS and played around with some apps, but this solution is cumbersome (the resolution of the guest OS is low, as macOS is not officially supported by VirtualPC).


MadaboutDana wrote:
I tend to print off articles etc. I find useful to PDF, saving them to a
folder/set of folders indexed by FoxTrot Pro. I did use Curiota (macOS),
the rather nice free app provided by the developer of Curio; this works
extremely well if you’ve got a large set of subfolders. However,
for various boring reasons I now use a simple PDF export option I set up
myself in Safari’s print dialog box.

Having said that, some web pages won’t allow you to go into reader
view, or even print off more than one page at a time. For these, I use
Bear (macOS/iOS only), which has a truly great web page import option
(driven by a Safari/Chrome extension). The page ends up as Markdown,
plus whatever images were on the page copied to a subfolder. You can
then export the Bear page (if you want to) as PDF or a number of other
formats.

I regularly trawl through my (vast) collection of PDF files winnowing
them down, or reducing file sizes using PDF Expert’s “Reduce
File Size” option (again, macOS/iOS only).

Other options involving Safari extensions include Quiver (which handles
web pages very well, but is restricted to macOS only) and Keep
Everything (a rather good markdown-based information manager which
hasn’t been updated for a while, but is very powerful; Bear is
similar, however, in that it converts web pages into markdown).

MacJournal, DEVONthink and Scrivener also have “Save as PDF”
options embedded in the macOS print dialog box, but I haven’t
experimented with those. At one point I also set up Notebooks as a
“Save to PDF” option, but found that this doesn’t
always work predictably.

Cheers,
Bill
Mirce 10/27/2020 8:50 am
Thank you for your suggestions. I have tried Joplin for cross-platform note taking (Windows, Android, iOS). I did try the clipping option a couple of years ago and was not impressed; maybe it has been improved in the meantime. Although Joplin ticks most of the boxes (cross-platform, "own cloud", i.e. you can use Dropbox for sync, rudimentary tags, attachments etc.), I just cannot commit myself to using it actively. Maybe it's the lag of the Windows (Electron) app, or the clumsy editor window.

- Trilium is on my "check-out for CRIMP potential" list, will look into it more closely.
- InfoQube - man this app looks powerful, but I just don't have the time / willpower to get past the initial window. I have read a lot about it on this forum, and according to the author it can do a lot of useful things, but for me, it didn't manage to strike the balance between user interface (which should invite you to use it) and its underlying power.
- QOwnNotes: never heard of it, will look into the app.
- Notion: online only the last time I checked, so I will pass. I am "too old school" for that.

Thanks again for the suggestions!



jbaltsar wrote:
You could give Joplin a try (https://joplinapp.org). It has a web
clipper which works very nicely and which also cleans and sanitizes the
HTML. It's open source and actively developed. The only negative: you can
only have one database, so if you plan to use it also for your personal
notes etc. it can become crowded (but it works fine for me so far).

Other options (they all have some web clipping features):
Trilium (https://github.com/zadam/trilium)
InfoQube and IQOutliner (https://www.infoqube.biz)
QOwnNotes (https://www.qownnotes.org)
Notion

So far Joplin feels best for me, but all do the job.

jbaltsar 10/27/2020 9:15 am


Mirce wrote:
Thank you for your suggestions. I have tried Joplin for cross-platform
note taking (Windows, Android, iOS). I did try the clipping option a couple
of years ago and was not impressed; maybe it has been improved in the
meantime. Although Joplin ticks most of the boxes (cross-platform, "own
cloud", i.e. you can use Dropbox for sync, rudimentary tags,
attachments etc.), I just cannot commit myself to using it actively.
Maybe it's the lag of the Windows (Electron) app, or the clumsy editor
window.

Maybe you should give it another try - for me it feels slick enough (they cleaned
up the GUI recently) and the imported websites look very clean and readable.

- Trilium is on my "check-out for CRIMP potential" list, will look into
it more closely.

If you don't like Joplin for the Electron look-and-feel, you probably won't like this one,
and as far as I can see it doesn't have tags.

- InfoQube - man this app looks powerful, but I just don't have the time
/ willpower to get past the initial window. I read a lot about it on
this forum, according to the author it can do a lot of useful things,
but for me, it didn't manage to strike the balance between user
interface (which should invite you to use it) and its underlying power.

You could try the IQOutliner - it's a stripped-down version with a focus on the notes.
It uses mhtml containers for website storage, so in theory the pages should be like
the original but there were some glitches in the formatting. But I agree - it's somewhat overwhelming.

- QOwnNotes: never heard of it, will look into the app.
- Notion: online only the last time I checked, so I will pass. I am "too
old school" for that.

I too consider myself "old-school" and like to have my stuff on my own machine.
My son came up with this and I like some aspects (databases) but probably won't stay.
washere 10/27/2020 11:35 pm
QOwnNotes (very regularly updated) & Joplin are must-haves as far as markdown + private sync (free Dropbox account, etc.) are concerned.

NoteCase Pro can set up complex personal servers easily. It is a must-have too, and the best overall outliner, multi-platform.

For linking HTML files inside an outliner (Windows only), try the Pro version of RightNote. It is probably second best after NoteCase Pro, but good at linking html files:

https://www.bauerapps.com/rightnote-version-comparisons/

washere 10/27/2020 11:53 pm
P.S.

BTW, like IQ's Pierre, Trilium's dev, zadam, is on this forum too. I haven't had time this year to see what he has done. It started as a (free, synced) homage to NoteCase Pro, and he updates it regularly too.