how best to save web pages · Outliner Software

jimspoon 6/2/2014 9:45 pm

Probably very old news to many of you, but I've made a nice discovery today for saving webpages from Chrome - it works like this:

(1) install the Readability extension for Chrome.
(2) click on the Readability icon to save a displayed webpage.
(3) click the "Read Now" option. A beautiful, uncluttered rendition of the original page is displayed.
(4) use the "Share" option on the Readability page toolbar, or at the bottom of the article, and select "Print".
(5) print using the "Save to PDF" printer that comes with Chrome. Of course saving to Dropbox, Google Drive, etc., makes it accessible from other devices.

Voila - you have a very nice PDF file of the article. It contains links back to the original webpage, and Save to PDF also includes the save date in a header. This strikes me as a very useful way of saving web page content for future reference. Now to figure out how to incorporate these saved files in an info database.
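For anyone who would rather script the final print step than click through it, here is a minimal Python sketch. It assumes a newer Chrome build with headless mode (added well after this thread was written) and that the binary is on the PATH as `google-chrome`; the binary name and both function names are assumptions for illustration. It prints the page as served, so it skips the Readability clean-up step.

```python
import subprocess


def print_to_pdf_command(url, pdf_path, chrome_binary="google-chrome"):
    """Build the argument list for a headless Chrome print-to-PDF run.

    chrome_binary is an assumption; the executable name differs by platform.
    """
    return [chrome_binary, "--headless", f"--print-to-pdf={pdf_path}", url]


def save_page_as_pdf(url, pdf_path, chrome_binary="google-chrome"):
    """Run Chrome and raise CalledProcessError if it exits with an error."""
    subprocess.run(print_to_pdf_command(url, pdf_path, chrome_binary), check=True)
```

Calling `save_page_as_pdf("http://example.com/", "example.pdf")` would write the PDF to the current folder, provided Chrome is installed under that name.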

Of course - I could accomplish much the same thing by using the Evernote Web Clipper to send the article to Evernote. But I'm not a paid member so I'm limited in how much data I can send to Evernote every month for free.

There are of course many other ways to save web page content, either to files or to notes in a database. And I've tried many of them. For a long time I saved content to MHT files, I used the Scrapbook extension for Firefox, etc. I found ways to insert a clickable url back to the original article in these MHT files, but they were problematic. Printing to PDF from Readability solves that problem, and the resulting files are universally accessible, not browser-dependent, etc.

Would very much like to hear about methods that others have employed.

A related issue: one of the things I like to do is save historic photographs. I can right-click and choose "Save Image As ...", but the downloaded image file usually has no descriptive metadata. (I wish all online JPEG photos had ample descriptive metadata, including the URL of the page where the image resides, and I wish browsers would automatically add that URL when an image is saved using "Save Image As".) Often saving the whole web page that contains the image will provide the missing descriptive info. Then one is faced with the problem of the saving format: "Webpage, complete" produces a mess of files, while MHT and MAFF have their own problem of not being universally accessible. The PDF solution won't give you direct access to the image by itself, but including the link to the original page compensates a bit for that.
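One possible workaround for the missing image metadata, sketched in Python: keep a small "sidecar" text file next to each saved image that records the page URL and a description. This does not embed anything in the JPEG itself, and the function name, field names, and the `.json` naming scheme are all assumptions made up for this example.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def write_sidecar(image_path, page_url, description=""):
    """Write <image name>.json beside the image and return the sidecar path."""
    image_path = Path(image_path)
    # photo.jpg -> photo.jpg.json, so the pair sorts together in a folder
    sidecar = image_path.with_name(image_path.name + ".json")
    record = {
        "image": image_path.name,
        "source_page": page_url,
        "description": description,
        "saved_at": datetime.now(timezone.utc).isoformat(),
    }
    sidecar.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return sidecar
```

The sidecar is plain text, so it stays readable and searchable without any particular browser or viewer.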

Gorski 6/3/2014 12:58 am

My method now is to save to .mht with Chrome, which now supports .mht if you enable it (http://www.tagspaces.org/mhtml-saving-chrome/). Chrome can view .mht files as well now.

The URL of the page is automatically saved in the html at the top if you view the source.

I used to use the SingleFile extension, which is very good and also saves the URL in the source, but saving to mht is faster.

SingleFile: https://chrome.google.com/webstore/detail/singlefile/mpiodijhokgodhhofbcjdecpffjipkle?hl=en

Gorski 6/3/2014 12:59 am

BTW, if you enable mht with Chrome, you lose the ability to save as Web page complete, which isn't a loss for me.

jimspoon 6/3/2014 2:40 am

Hi Mark - it was driving me nuts. I couldn't find the original URL in the saved MHT file when I clicked on "View Source" in the browser, but when I dragged the same file to an editor, it was right there. For example, I saved this thread from Chrome to an MHTML file, and then opened that file in Notepad++. The URL of the original webpage was near the top, as follows:

From:
Subject: Outliner Software: how best to save web pages
Date: Tue, 2 Jun 2014 21:15:21 -0500
MIME-Version: 1.0
Content-Type: multipart/related;
    type="text/html";
    boundary="----=_NextPart_000_C823_733B6041.FCD1CC7F"

------=_NextPart_000_C823_733B6041.FCD1CC7F
Content-Type: text/html
Content-Transfer-Encoding: quoted-printable
Content-Location: http://www.outlinersoftware.com/topics/viewt/5402/0/how-best-to-save-web-pages

jimspoon 6/3/2014 3:09 am

The MHT file created by Chrome does contain the original URL, and you can readily see it when you view the file in a text editor. But "View Source" in the browser does not display the contents of the MHT file. Rather, the MHT is a container file, which the browser unpacks; View Source displays the HTML component of that container file - and that HTML component may not have the original URL in it. I should have figured that out sooner!
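Since the MHT file is a MIME container, the original URL can also be pulled out with a script instead of a text editor. Below is a minimal Python sketch, assuming the file follows the layout of the header quoted earlier in this thread. It uses the standard `email` module to walk the parts and returns the Content-Location of the first text/html part. The function name `original_url` and the placeholder HTML body in the sample are made up for illustration.

```python
import email


def original_url(mht_bytes):
    """Return the Content-Location of the first text/html part, or None."""
    msg = email.message_from_bytes(mht_bytes)
    for part in msg.walk():  # the container itself, then each part inside it
        if part.get_content_type() == "text/html":
            location = part.get("Content-Location")
            if location:
                return location.strip()
    return None


# Sample modeled on the header quoted above; the HTML body is a placeholder.
SAMPLE = (
    b"Subject: Outliner Software: how best to save web pages\n"
    b"MIME-Version: 1.0\n"
    b"Content-Type: multipart/related;\n"
    b"\ttype=\"text/html\";\n"
    b"\tboundary=\"----=_NextPart_000_C823_733B6041.FCD1CC7F\"\n"
    b"\n"
    b"------=_NextPart_000_C823_733B6041.FCD1CC7F\n"
    b"Content-Type: text/html\n"
    b"Content-Transfer-Encoding: quoted-printable\n"
    b"Content-Location: http://www.outlinersoftware.com/topics/viewt/5402/0/how-best-to-save-web-pages\n"
    b"\n"
    b"<html><body>placeholder</body></html>\n"
    b"------=_NextPart_000_C823_733B6041.FCD1CC7F--\n"
)

print(original_url(SAMPLE))
```

For a real file, something like `original_url(open("page.mht", "rb").read())` should work, where `page.mht` is a hypothetical filename.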

Gorski 6/3/2014 3:13 am

I don't know about should, but glad you figured it out. I've always opened those files in a text editor so didn't realize the URL wasn't visible with view source.

MadaboutDana 6/3/2014 12:05 pm

I generally copy the bit of web page I want to save and then paste it into (a) OneNote (which means it automatically also pastes the URL), or (b) Notebooks (which preserves the web page layout much better than OneNote). But your solution sounds more sensible, I must say.

On my new MacBook Air, I now print web pages as PDFs to GrowlyNotes - that works very well, and is extremely fast! Means I can take advantage of GrowlyNotes's very good search function. The only trouble is, it also means I have three repositories - two of them cross-platform. But I can extract the PDFs from GrowlyNotes if I need to, and save them in either OneNote or Notebooks. And sooner or later, an iPad version of GrowlyNotes will be coming out...

Garland Coulson 6/5/2014 8:45 pm

I prefer Evernote for saving web pages - usually as a simplified article version. I usually don't hit the limit as a free member of Evernote, but if I did, I would just go back on premium.

I prefer clipping web sites to Evernote for the searchability and multiple device access. And it saved me when the Internet was down at my place. I recently wrote an article called "Evernote Replaces The Internet" that outlined my experience.
http://captaintime.com/evernote-replaces-internet/

Neville Franks 6/6/2014 10:28 pm

If all you can do is save a copy of a web page, I humbly suggest you will quickly get into a quagmire as your collection grows. It needs to be searchable, indexed with tags, cross-referenced, and editable so you can add your own notes, etc. It needs to be a live, dynamic repository, not a static and stale one. Maybe saving PDFs or MHTs works for some; however, you'll likely be wanting something more useful and may well regret starting off this way. Of course I'm biased, but I felt it needed saying. ;-)

Chris Murtland 6/9/2014 4:26 pm

Another useful Chrome extension for copying stuff from web pages:

http://template-extension.org/

This allows a lot of customization, allows you to copy the selected text in Markdown, etc.

Dr Andus 6/9/2014 5:40 pm

It looks like there is a new OneNote clipper out for Chrome as well:

http://www.omgchrome.com/microsoft-onenote-clipper-extension/

John Deweerd 6/22/2014 2:44 pm

I used to save entire webpages but found I ended up with a lot of extra stuff that I didn't want, so now I copy a selection in Markdown format (using Easy Copy in Firefox or AutoCopy in Chrome). I do have to add the formatting (for headers, etc.) and drag and drop the images into the page, but it is saved as a small .txt file with just the bits I want.