How do you mark the internet as "finished"?

Started by Wayne K on 9/29/2013

Wayne K 9/29/2013 2:41 am

This is a nagging problem that I've never been able to find a satisfactory answer for. I've finally settled on a reliable system for capturing net information but I don't know how to keep track of what's done. I tend to capture articles willy-nilly, then organize them later. The problem is, how do I keep track of what I've already captured? Right now I'm looking at a huge archive of interesting articles. I'll probably want to capture dozens of them. How do I mark articles as "finished" (i.e., captured)?

I know that I can keep my articles organized in a file manager and before capturing an article I can search in the file manager to see if I already have it, but that's a pain.

The ideal solution would be to highlight the article link or attach a note to the web page that says "Done". The note would then appear whenever I visit the page again.

There are some possible solutions but they've been frustrating to use. I thought the new, "improved" Diigo toolbar might do it but I'm already sick of it. The Stickies program allows you to attach notes to windows but I doubt that it's set up to handle thousands of notes. I've tried half a dozen other "highlighting" programs but nothing worked very well.

It's not a big issue for current research because I can usually remember what I've captured. But if I take up research that was done months ago, I have no idea what I've already gathered.

Does anyone have any other suggestions?

Wayne

Cassius 9/29/2013 5:46 am

Don't know if this might help, but...:

For each new article you save, include in the file name one or more key words, prefixing each with something like @@.

Keep a key word list in WordPad, some PIM, or whatever, adding to it as you create new key words. Re-sort the list so it remains alphabetized. Or use a two-pane PIM with categories in the left pane and, for each category, the appropriate @@keywords in the right pane.

When you want to find files corresponding to one or a combination of keywords, use a free, fast file search program such as MasterSeeker.
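
The @@keyword scheme above can also be searched with a few lines of script rather than a separate search program. What follows is only a minimal sketch in Python, not something described in the post: the function names, the rule that a keyword consists of letters, digits, hyphens or underscores, and the case-insensitive matching are all assumptions made for illustration.

```python
import re
from pathlib import Path

# Assumed convention: a keyword is "@@" followed by letters, digits, "-" or "_".
KEYWORD_PATTERN = re.compile(r"@@([A-Za-z0-9_-]+)")


def keywords_in(filename):
    """Return the lower-cased @@keywords embedded in a file name."""
    return {match.lower() for match in KEYWORD_PATTERN.findall(filename)}


def find_by_keywords(filenames, wanted):
    """Return the file names that carry every keyword in `wanted`."""
    wanted_set = {word.lower() for word in wanted}
    return [name for name in filenames if wanted_set <= keywords_in(name)]


def scan_folder(folder, wanted):
    """Walk `folder` recursively and match against file names only."""
    names = [path.name for path in Path(folder).rglob("*") if path.is_file()]
    return find_by_keywords(names, wanted)
```

Calling scan_folder with a (hypothetical) archive folder and a list such as ["climate", "dendro"] would return only the files whose names contain both @@climate and @@dendro.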

Alexander Deliyannis 9/29/2013 8:39 am

Wayne, you could try a tool such as the Google Chrome extension Note Anywhere. I copy from the description: "with this ext, you can make notes on any web page, any position. When you open that page again, the notes get loaded automatically."

So, when you capture a page, you should make a note such as "Captured" in Note Anywhere. When you visit the page again (in Chrome) the note will be there, and you will know that you have the page already in your files.

Wayne K 9/29/2013 3:06 pm

Cassius,

That looks like a good way to check for duplicates but I'm hoping to avoid the extra steps each time I want to capture an article.

Wayne K 9/29/2013 3:15 pm

Alexander,

That's what I'm looking for. Unfortunately, it's a Chrome extension and I use Firefox or IE to capture articles. I like to convert webpages to PDF's. This can be done by printing the webpage to PDF in any browser but the results are inconsistent (some pages won't print at all while others have jumbled text, etc).

For Internet Explorer I use a "Convert" add-on that came with my Adobe Pro software. It does an excellent job of capturing pages just as they appear online.

Recently, I found something I like even better: a Firefox add-on called "Print Pages to PDF". It's not quite as consistent as the IE add-on but I really like the job it does in automatically naming the files, which saves a step. It displays a list of recent captures in a sidebar but doesn't tell you if you're capturing a duplicate (it creates a second copy with a suffix).

I haven't found a Chrome add-on that does a similar job with pdf's. If I could find one, that would solve the problem -- or if I could find a Firefox add-on as good as Note Anywhere.

I guess a browser add-on is the only thing that's going to work but it's not ideal because there seems to be a rapid turnover in add-ons. They stop being supported then stop working with browser updates.

Gary Carson 9/29/2013 3:31 pm

I'm not sure I understand what the problem is here.

What difference does it make if you capture the same article twice? You can sort out the duplicates when you start organizing all the material.

Wayne K 9/29/2013 4:37 pm

Gary,

Nothing earth-shattering. It's just a waste of time to download the article again, figure out that it's a duplicate, and delete it. Often I change the name of the file when I download it from whatever default comes up. The next time I download the same article I might give it a slightly different name, so then I have to open both files to make sure they're duplicates.

It'd also be nice to immediately see that I've visited a website and captured the relevant information without having to re-read the website, re-read the articles, and re-evaluate the articles to see which ones I want to download. That's a significant amount of time and I've gone through this kind of duplicate effort a number of times when I re-visit a website I haven't seen in a number of years.

Dr Andus 9/29/2013 4:52 pm

Wayne K wrote:

It's just a waste of time to download the
article again, figure out that it's a duplicate, and delete it. Often I
change the name of the file when I download it from whatever default
comes up. The next time I download the same article I might give it a
slightly different name, so then I have to open both files to make sure
they're duplicates.

How about using some web capture software instead like Surfulater? Then you wouldn't need to go through the PDF production process, and it's very quick to capture a page (right-click, "add new article to Surfulater/or whatever" and that's it).

You can organise the pages into folders, and then it's a lot easier to see whether there is any duplication in the folder. Plus there is the Filter tool in SF that filters the captured content on the basis of the title of the web page, again, to see duplication.

Though I imagine you must have your reasons why you want them as PDFs.

Wayne K 9/29/2013 5:54 pm

Dr Andus wrote:

Wayne K wrote:
>It's just a waste of time to download the
>article again, figure out that it's a duplicate, and delete it. Often I
>change the name of the file when I download it from whatever default
>comes up. The next time I download the same article I might give it a
>slightly different name, so then I have to open both files to make sure
>they're duplicates.

How about using some web capture software instead like Surfulater? Then
you wouldn't need to go through the PDF production process, and it's
very quick to capture a page (right-click, "add new article to
Surfulater/or whatever" and that's it).

You can organise the pages into folders, and then it's a lot easier to
see whether there is any duplication in the folder. Plus there is the
Filter tool in SF that filters the captured content on the basis of the
title of the web page, again, to see duplication.

Though I imagine you must have your reasons why you want them as PDFs.

I tried web capture software a couple of years ago. The stopping point for me was the poor mark-up capabilities. Surfulater couldn't even highlight text that you've captured (I confirmed that with their tech support - maybe it's changed since then). By capturing them as PDF's, I can take advantage of software that can do any kind of markup I can imagine. I use PDF Revu but I know there are other excellent choices for pdf markups.

I also like the idea of staying relatively software-neutral with a file format that's likely to be around longer than I am.

As for seeing duplicates in folders, I can do the same thing with pdf files. It's just that it slows down the research and capture process. I like having some automatic organization on the front-end to save time on the back-end.

Alexander Deliyannis 9/29/2013 6:46 pm

Wayne K wrote:

I haven't found a Chrome add-on that does a similar job with pdf's.
If I could find one, that would solve the problem

I don't have anything to propose if you want the PDF page to be laid out as it is on the original website, but if content is more important than layout, you might want to try out Cleanprint https://chrome.google.com/webstore/detail/print-or-pdf-with-cleanpr/fklmmmdcofimkjmfjdnobmmgmefbapkf

You may already try it out in Firefox as it is included in the extensions available http://www.formatdynamics.com/bookmarklets/

For precise capture of the layout I use the pro version of Fireshot, also supporting several browsers (I believe that such tools are more likely to be around longer). However, there's a catch: the capture is in image form even if you choose to save it as PDF. So you can't search or copy the textual content unless you OCR it first. Fireshot's strong point is the ability to capture full web pages (including non visible parts).

Dr Andus 9/29/2013 10:21 pm

Wayne K wrote:

The stopping point
for me was the poor mark-up capabilities. Surfulater couldn't even
highlight text that you've captured (I confirmed that with their tech
support - maybe it's changed since then).

Yes, Surfulater can do highlights now in different colours. But you're right, you'd get more annotation options with PDF these days.

Wayne K 9/30/2013 12:11 am

Alexander Deliyannis wrote:

Wayne K wrote:
>I haven't found a Chrome add-on that does a similar job with pdf's.
>If I could find one, that would solve the problem

I don't have something to propose if you want the PDF page to be laid
out as it is the original website, but if content is more important than
layout, you might want to try out Cleanprint
https://chrome.google.com/webstore/detail/print-or-pdf-with-cleanpr/fklmmmdcofimkjmfjdnobmmgmefbapkf

You may already try it out in Firefox as it is included in the
extensions available http://www.formatdynamics.com/bookmarklets/

For precise capture of the layout I use the pro version of Fireshot,
also supporting several browsers (I believe that such tools are more
likely to be around longer). However, there's a catch: the capture is in
image form even if you choose to save it as PDF. So you can't search or
copy the textual content unless you OCR it first. Fireshot's strong
point is the ability to capture full web pages (including non visible
parts).

Just before you posted I found Cleanprint and was trying it out. It looks good. I like that it strips the ads out, though you have to be willing to wait considerably longer to get the capture completed.

Slartibartfarst 9/30/2013 4:15 am

Wayne K wrote:

Gary,

Nothing earth-shattering. It's just a waste of time to download the
article again, figure out that it's a duplicate, and delete it. Often I
change the name of the file when I download it from whatever default
comes up. The next time I download the same article I might give it a
slightly different name, so then I have to open both files to make sure
they're duplicates.

It'd also be nice to immediately see that I've visited a website and
captured the relevant information without having to re-read the website,
re-read the articles, and re-evaluate the articles to see which ones I
want to download. That's a significant amount of time and I've gone
through this kind of duplicate effort a number of times when I re-visit
a website I haven't seen in a number of years.

============================================
I have had very much the same requirements as you, for years - except for (b) below.
Having tried out various alternative approaches to meeting those requirements (including all those mentioned in the thread, as above), I have arrived at the conclusion - and I could be wrong, of course - that, at least to my knowledge, there is no single tool that will enable you to:
(a) save a web page from a particular URL.
(b) tell you it was/had been saved when you next happen to visit or try to save the same page at that URL.
(c) give you some form of structured management of the saved pages.
(d) provide index/search of the content of the saved pages.
(e) prevent you from unintentional duplication - saving the same web page more than once.
(f) keep track of pages saved and tell you when those pages change.

So I have used different tools. I have been unable to completely resolve (b) and (e), though.
I have overcome the difficulties of searching for and finding text data in all/most types of files that I am interested in.
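
Requirements (b) and (e) can at least be approximated outside the browser with a small ledger of captured URLs that is consulted before saving. The sketch below is an illustration in Python, not a feature of any tool mentioned in this thread: the file name captures.db, the table layout, the function names and the URL normalisation rules (lower-case host, no trailing slash, no #fragment) are all assumptions.

```python
import sqlite3
from datetime import datetime, timezone
from urllib.parse import urlsplit, urlunsplit


def normalize(url):
    """Reduce trivial variants of a URL to one key (an assumed rule set)."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower()
    path = parts.path.rstrip("/") or "/"
    # Drop the #fragment; keep the query, since it may select the article.
    return urlunsplit((parts.scheme.lower(), host, path, parts.query, ""))


def open_ledger(db_path="captures.db"):
    """Open (or create) the ledger; the file name is hypothetical."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS captures ("
        "url TEXT PRIMARY KEY, saved_as TEXT, captured_at TEXT)"
    )
    return conn


def already_captured(conn, url):
    """Return (saved_as, captured_at) for a recorded URL, else None."""
    return conn.execute(
        "SELECT saved_as, captured_at FROM captures WHERE url = ?",
        (normalize(url),),
    ).fetchone()


def record_capture(conn, url, saved_as):
    """Record a capture; return False if the URL was already in the ledger."""
    if already_captured(conn, url) is not None:
        return False
    conn.execute(
        "INSERT INTO captures VALUES (?, ?, ?)",
        (normalize(url), saved_as, datetime.now(timezone.utc).isoformat()),
    )
    conn.commit()
    return True
```

You would still have to feed it the URL each time you capture a page, so this removes the guesswork rather than the extra step, and it only recognises a page that comes back at the same (normalised) address.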

What I have done is:
1. Install the Firefox add-on called "Scrapbook" - https://addons.mozilla.org/En-us/firefox/addon/scrapbook/
This will give you (a), (c), (d).
That can save pages intact, including nested pages to "n" levels deep and any embedded types of document files, image files, mp3 files, zip files (or whatever you choose it to save). It's a superb cribbing/scavenging tool.
It also provides index and search functionality and you can create a tree structure of folders to save content into. I have my folders named with specific subject/category names (e.g., Education, Finance, Tree-ring data). When you start to save a page, Scrapbook pops up a list of the specific folders most recently saved to, for you to select ONE, which is helpful when you are gathering info on a particular subject/category. You can also add comments, notes, text, highlights to the saved pages, and the comments become part of the searchable data.
The Scrapbook saved files are in html and a near-exact replica of the original (excluding Java). Content in these pages can be indexed and searched by the extremely powerful Windows Desktop Search function (which is well worth the time invested in learning how to make it stand on the head of a pin.)

2. Install the Firefox extension "Update Scanner" - http://sourceforge.net/projects/updatescanner/
This will help you with (f).
This tool will tell you when any designated website changes.

3. Use duplication (rather than avoid it). I save the same web page to different relevant subject/category names (folders).
Ideally, one would like to be able to "Tag" Scrapbook contents. The most readily available method for this is a bit of a kludge: I search just the Scrapbook directory (using xplorer²) for files containing (say) the text "tree rings", then append the string "Tree rings" (or whatever string you want) to the Comments field for all files found, in one go. Comments are part of the Windows ADS (Alternate Data Stream) in NTFS, so may not be carried across if you move/copy or back up the files to a non-NTFS-compatible file storage volume. For export, you can pack ADSes up, together with their files, using an xplorer² function. Appending "Tag" words like this to the Comments field is a generic and useful way to categorise your data, as you can always search for those tagged files at some later stage, and add more tags, or edit them.

4. Use different search tools on Scrapbook contents.
- Windows Desktop Search (for the content of any filetype that I have set it to index).
- xplorer² search (I use it mostly for text-searching in document and html files).
- Everything search (for filename search).
- Picasa (can search all image files in Scrapbook files for filename, faces, colours, and text/notes in most EXIF and IPTC components of those image files). Note that not all image files have EXIF and IPTC components.

Hope this all makes sense or is of use. Bit of a rushed coredump as my 3 y/o son is clamouring for attention.

jimspoon 9/30/2013 7:54 am

Scrapbook is a great extension. It was the main thing that kept me using Firefox before finally switching to Chrome. Also has good annotation features. I used to use the "auto-save" extension so that I saved every page I browsed without any intervention on my part. I wished it had the ability to save the files in MHT format though.

Here is one option that might help Wayne or some others. Evernote Web Clipper extension for Chrome has this option: "When enabled, searching the web on supported search engines will also be performed on your Evernote account." I just enabled this and when I did a Google search in Chrome, the results page showed not only the usual results, but also matching items in my Evernote database.

MadaboutDana 9/30/2013 9:51 am

I think I've understood the problem, and while I can't pretend to be able to offer "the ultimate" solution, I've found the following approach useful.

I use the Windows version of Notebooks (by Alfons Schmid: notebooksapp.com) to copy and save web pages. Notebooks preserves the formatting of web pages almost unchanged, but also allows you to edit the pages (e.g. add comments, highlights, or even rewrite/reformat the things completely, etc.). So a typical Notebooks page consists of:

Title
URL (pasted)
Comments (by me)
Tags (by me)
Contents of web page (pasted)

This means that when I eventually read through the web page, I may decide to cut out bits that aren't directly relevant to my interests (easy: just select and delete). Other pages I make "read-only" (Notebooks offers that facility) so I can't change them (e.g. nice bits of writing I want to preserve for my future edification).

Notebooks automatically time-stamps pages anyway, and you can arrange them into folders. The actual pages are held as separate files (an HTML file plus a .plist file for each page, containing the index and references), and Notebooks automatically indexes them for searching (I have to say the iOS app's search function is much better than the Windows client's search function, but you can always use Windows Desktop Search or any other search app of your choice; I use Copernic, for example). Notebooks folders are thus actual folders in the file system, which makes Notebooks very "open".

While this doesn't obviate the issue of duplicate pages in particular, it does make it very easy to organize pages and delete them, annotate them (using highlights if you wish!), shove 'em about wherever you want 'em, and so on. Although it's a slightly lengthier process than using e.g. Surfulater or Scrapbook, I've found it's more flexible - and unlike Surfulater, Notebooks supports full UTF-8 encoding, so is compatible with most languages. Finally, if you want to manipulate your web pages without reference to Notebooks, you can easily do so in the actual Windows file system (or on a Mac - Notebooks also has a MacOS client).

The Notebooks Windows client is currently free (because it's still in beta). If you've got an iPad, you can synchronize easily via Dropbox (Notebooks defaults to Dropbox in any case); the cost of the iOS app is low (can't remember what, exactly). Notebooks has, as a result of all the above, become my go-to information repository.

Slartibartfarst 9/30/2013 3:40 pm

jimspoon wrote: Sep 30, 2013 at 07:54 AM
Scrapbook is a great extension. It was the main thing that kept me
using Firefox before finally switching to Chrome. Also has good
annotation features. I used to use the "auto-save" extension so that I
saved every page I browsed without any intervention on my part. I
wished it had the ability to save the files in MHT format though.
...

============================
Re MHT: Probably not much use to you now that you have switched to Chrome, but I can report useful results with my trials of these Firefox extensions: (pretty impressive results actually)
- UnMHT - http://www.unmht.org/unmht/en_index.html
- Mozilla Archive Format (MHT/MAFF) - http://maf.mozdev.org/

The caveat I have is that, though these both work perfectly, the content of saved MAFF files (pages) might not always be easily searchable. Whether you have saved single or multiple pages/tabs, viewing the archive file may be restricted to a Mozilla browser with the MAF extension. Universal Viewer and Internet Explorer do not seem to be able to decode the compression (.ZIP) format they use, though 7-zip can open the files OK as an archive.

Interestingly, I noticed that the MS Labs OneNote Canvas experimental software (circa 2009?) exported/converted all the OneNote .one files into .mht files. (It was a one-way conversion only.)

jimspoon 9/30/2013 8:18 pm

Slartibartfarst wrote:

Re MHT: Probably not much use to you now that you have switched to
Chrome, but I can report useful results with my trials of these Firefox
extensions: (pretty impressive results actually)
- UnMHT - http://www.unmht.org/unmht/en_index.html
- Mozilla Archive Format (MHT/MAFF) - http://maf.mozdev.org/

The caveat I have is that, though these both work perfectly, the content
of saved MAFF files (pages) might not always be easily searchable.
Whether you have saved single or multiple pages/tabs, viewing the
archive file may be restricted to a Mozilla browser with the MAF
extension. Universal Viewer and Internet Explorer do not seem to be able
to decode the compression (.ZIP) format they use, though 7-zip can open
the files OK as an archive.

Interestingly, I noticed that the MS Labs OneNote Canvas experimental
software (circa 2009?) exported/converted all the OneNote .one files
into .mht files. (It was a one-way conversion only.)

Also excellent extensions, I ran them both at the same time. When I think about the great Firefox extensions - Scrapbook, Tab Mix Plus, Tab Groups Manager, All in One Sidebar, I wonder why I switched! I liked that Chrome allowed me to make "application shortcuts" to webapps and webpages so I could run them in their own windows. Firefox dropped the ball on that. Also liked that Chrome runs each tab in its own process, less prone to crashes and freezes? It also seemed likely I'd have a better experience with Google Webapps in Chrome. Plus perhaps better cross-device synchronization? And a shift in development away from Firefox extensions to Chrome extensions/apps. But I still miss those great Firefox extensions. But I am getting away from Wayne's topic.

I used to save webpages to MHTs all the time; now very rarely. Like Wayne, I have mostly shifted to PDFs. I think links are usually no longer clickable when I save a page to PDF, but the PDFs are easy to view regardless of platform or device. I also think I'm going to save my scanned documents and even receipts as PDFs rather than JPEG/TIFF images for the same reason.

Wayne K 10/1/2013 12:59 am

Thanks for everyone's detailed suggestions. I have a lot to work through. As Jim mentioned, I want to stick with PDF's because I can more easily do markups. I used Scrapbook Plus for a while but finally gave it up because there was no way to do the markups I want. Plus, I don't like my pages saved into a folder of dozens of files. I want one simple file that can be marked up and easily exchanged with others.

Wayne K 10/1/2013 1:11 am

MadaboutDana wrote:

I think I've understood the problem, and while I can't pretend to be
able to offer "the ultimate" solution, I've found the following approach
useful.

I use the Windows version of Notebooks (by Alfons Schmid:
notebooksapp.com) to copy and save web pages. Notebooks preserves the
formatting of web pages almost unchanged, but also allows you to edit
the pages (e.g. add comments, highlights, or even rewrite/reformat the
things completely, etc.). So a typical Notebooks page consists of:

Title
URL (pasted)
Comments (by me)
Tags (by me)
Contents of web page (pasted)

This means that when I eventually read through the web page, I may
decide to cut out bits that aren't directly relevant to my interests
(easy: just select and delete). Other pages I make "read-only"
(Notebooks offers that facility) so I can't change them (e.g. nice bits
of writing I want to preserve for my future edification).

Notebooks automatically time-stamps pages anyway, and you can arrange
them into folders. The actual pages are held as separate files (an HTML
file plus a .plist file for each page, containing the index and
references), and Notebooks automatically indexes them for searching (I
have to say the iOS app's search function is much better than the
Windows client's search function, but you can always use Windows Desktop
Search or any other search app of your choice; I use Copernic, for
example). Notebooks folders are thus actual folders in the file system,
which makes Notebooks very "open".

While this doesn't obviate the issue of duplicate pages in particular,
it does make it very easy to organize pages and delete them, annotate
them (using highlights if you wish!), shove 'em about wherever you want
'em, and so on. Although it's a slightly lengthier process than using
e.g. Surfulater or Scrapbook, I've found it's more flexible - and unlike
Surfulater, Notebooks supports full UTF-8 encoding, so is compatible
with most languages. Finally, if you want to manipulate your web pages
without reference to Notebooks, you can easily do so in the actual
Windows file system (or on a Mac - Notebooks also has a MacOS client).

The Notebooks Windows client is currently free (because it's still in
beta). If you've got an iPad, you can synchronize easily via Dropbox
(Notebooks defaults to Dropbox in any case); the cost of the iOS app is
low (can't remember what, exactly). Notebooks has, as a result of all
the above, become my go-to information repository.

Do you have a link for this software? I'm afraid the name isn't the best choice for marketing. I just spent ten minutes in Google trying every combination of "Notebooks", "Windows", and "Software" I could think of but was unable to find it.

jimspoon 10/1/2013 1:54 am

wayne - here's the link for notebooks -

http://www.notebooksapp.com/

download links are at the bottom.

doesn't look like an Android version is coming any time soon.

http://www.helpify.de/notebooks-for-iphone/2191/notebooks-for-android-2?s=1