how best to save web pages
Posted by jimspoon
Jun 2, 2014 at 09:45 PM
Probably very old news to many of you, but I’ve made a nice discovery today for saving webpages from Chrome - it works like this:
(1) install the Readability extension for Chrome.
(2) click on the Readability icon to save a displayed webpage.
(3) click the “Read Now” option. A beautiful, uncluttered rendition of the original page is displayed.
(4) use the “Share” option on the Readability page toolbar, or at the bottom of the article, and select “Print”.
(5) print using the “Save to PDF” printer that comes with Chrome. Of course saving to Dropbox, Google Drive, etc., makes it accessible from other devices.
Voila - you have a very nice pdf file of the article. It contains links back to the original webpage, and Save to PDF also includes the save date in a header. This strikes me as a very useful way of saving web page content for future reference. Now to figure out how to incorporate these saved files in an info database.
Of course - I could accomplish much the same thing by using the Evernote Web Clipper to send the article to Evernote. But I’m not a paid member so I’m limited in how much data I can send to Evernote every month for free.
There are of course many other ways to save web page content, either to files or to notes in a database. And I’ve tried many of them. For a long time I saved content to MHT files, I used the Scrapbook extension for Firefox, etc. I found ways to insert a clickable url back to the original article in these MHT files, but they were problematic. Printing to PDF from Readability solves that problem, and the resulting files are universally accessible, not browser-dependent, etc.
Would very much like to hear about methods that others have employed.
A related issue: one of the things I like to do is save historic photographs. I can right-click and choose Save Image As ..., but the downloaded image file usually has no descriptive metadata. (I wish all online JPEG photos had ample descriptive metadata, including the URL of the page where the image resides, and that browsers would automatically add that URL metadata when an image is saved using “Save Image As”.) Often saving the whole web page that contains the image will provide the missing descriptive info. Then one is faced with the problem of the saving format: “Webpage, complete” produces a mess of files, while MHT and MAFF have their own problems of not being universally accessible. The PDF solution won’t allow you direct access to the image by itself, but including the link to the original page compensates a bit for that.
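One workaround for the missing image metadata, until browsers do this themselves, is to keep a small "sidecar" file next to each saved image that records where it came from. Below is a minimal sketch of that idea. The function name `write_sidecar`, the `.json` naming scheme, and the field names are all assumptions made up for illustration, not part of any browser or extension.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def write_sidecar(image_path, page_url, description=""):
    """Write <image name>.json next to the image, recording its source page.

    Hypothetical helper: the file layout and field names are illustrative.
    """
    image_path = Path(image_path)
    # photo.jpg -> photo.jpg.json, so the pairing stays obvious in a file listing
    sidecar = image_path.with_name(image_path.name + ".json")
    record = {
        "image_file": image_path.name,
        "source_page": page_url,
        "description": description,
        "saved_at": datetime.now(timezone.utc).isoformat(),
    }
    sidecar.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return sidecar
```

You would call it right after Save Image As, e.g. `write_sidecar("photo.jpg", "http://example.com/page", "street scene")`. The image itself is left untouched, so this does not embed anything in the JPEG; it only keeps the URL and a description alongside it.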
Posted by Gorski
Jun 3, 2014 at 12:58 AM
My method now is to save to .mht with Chrome, which supports .mht if you enable it (http://www.tagspaces.org/mhtml-saving-chrome/). Chrome can view .mht files as well now.
The URL of the page is automatically saved in the html at the top if you view the source.
I used to use the SingleFile extension, which is very good and also saves the URL in the source, but saving to mht is faster.
SingleFile: https://chrome.google.com/webstore/detail/singlefile/mpiodijhokgodhhofbcjdecpffjipkle?hl=en
Posted by Gorski
Jun 3, 2014 at 12:59 AM
BTW, if you enable mht with Chrome, you lose the ability to save as Web page complete, which isn’t a loss for me.
Posted by jimspoon
Jun 3, 2014 at 02:40 AM
Hi Mark - it was driving me nuts. I couldn’t find the original URL in the saved MHT file when I clicked on “View Source” in the browser, but when I dragged the same file to an editor, it was right there. For example I saved this thread from Chrome to an mhtml file, and then opened that file in Notepad++. The URL of the original webpage was near the top, as follows:
From:
Subject: Outliner Software: how best to save web pages
Date: Tue, 2 Jun 2014 21:15:21 -0500
MIME-Version: 1.0
Content-Type: multipart/related;
    type="text/html";
    boundary="----=_NextPart_000_C823_733B6041.FCD1CC7F"
------=_NextPart_000_C823_733B6041.FCD1CC7F
Content-Type: text/html
Content-Transfer-Encoding: quoted-printable
Content-Location: http://www.outlinersoftware.com/topics/viewt/5402/0/how-best-to-save-web-pages
Posted by jimspoon
Jun 3, 2014 at 03:09 AM
The MHT file created by Chrome does contain the original URL, and you can readily see it when you view the file in a text editor. But “View Source” in the browser does not display the contents of the MHT file. Rather, the MHT is a container file, which the browser unpacks; View Source displays the HTML component of that container file - and that HTML component may not have the original URL in it. I should have figured that out sooner!
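Since an MHT file is laid out like a MIME email message, the original URL can also be pulled out with a script instead of a text editor. Here is a rough sketch using Python's standard `email` module. The function name `mht_page_url` is made up for illustration, and it assumes the page URL sits in the `Content-Location` header of the first `text/html` part, as in the Chrome-saved file quoted above. Files saved by other tools may be laid out differently.

```python
import email


def mht_page_url(raw):
    """Return the Content-Location of the first text/html part, or None.

    `raw` is the full contents of an .mht/.mhtml file as bytes.
    """
    msg = email.message_from_bytes(raw)
    # walk() yields the outer container first, then each part inside it
    for part in msg.walk():
        if part.get_content_type() == "text/html":
            location = part.get("Content-Location")
            return location.strip() if location else None
    return None
```

Usage would look something like `mht_page_url(open("saved-page.mht", "rb").read())`, where the file name is just a placeholder. This reads the container headers, which is exactly the part that “View Source” in the browser does not show.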