How do you mark the internet as "finished"?
Posted by Dr Andus
Sep 29, 2013 at 10:21 PM
Wayne K wrote:
>The stopping point
>for me was the poor mark-up capabilities. Surfulater couldn’t even
>highlight text that you’ve captured (I confirmed that with their tech
>support - maybe it’s changed since then).
Yes, Surfulater can do highlights now in different colours. But you’re right, you’d get more annotation options with PDF these days.
Posted by Wayne K
Sep 30, 2013 at 12:11 AM
Alexander Deliyannis wrote:
Wayne K wrote:
>>I haven’t found a Chrome add-on that does a similar job with pdf’s.
>>If I could find one, that would solve the problem
>
>I don’t have anything to propose if you want the PDF page to be laid
>out exactly as the original website is, but if content is more important
>than layout, you might want to try out Cleanprint:
>https://chrome.google.com/webstore/detail/print-or-pdf-with-cleanpr/fklmmmdcofimkjmfjdnobmmgmefbapkf
>
>You can also try it out in Firefox, as it is included in the
>extensions available at http://www.formatdynamics.com/bookmarklets/
>
>For precise capture of the layout I use the pro version of Fireshot,
>also supporting several browsers (I believe that such tools are more
>likely to be around longer). However, there’s a catch: the capture is in
>image form even if you choose to save it as PDF. So you can’t search or
>copy the textual content unless you OCR it first. Fireshot’s strong
>point is its ability to capture full web pages (including the
>non-visible parts).
Just before you posted, I found Cleanprint and was trying it out. It looks good. I like that it strips the ads out, though you have to be willing to wait considerably longer for the capture to complete.
Posted by Slartibartfarst
Sep 30, 2013 at 04:15 AM
Wayne K wrote:
>Gary,
>
>Nothing earth-shattering. It’s just a waste of time to download the
>article again, figure out that it’s a duplicate, and delete it. Often I
>change the name of the file when I download it from whatever default
>comes up. The next time I download the same article I might give it a
>slightly different name, so then I have to open both files to make sure
>they’re duplicates.
>
>It’d also be nice to immediately see that I’ve visited a website and
>captured the relevant information without having to re-read the website,
>re-read the articles, and re-evaluate the articles to see which ones I
>want to download. That’s a significant amount of time and I’ve gone
>through this kind of duplicate effort a number of times when I re-visit
>a website I haven’t seen in a number of years.
============================================
I have had very much the same requirements as you, for years - except for (b) below.
Having tried out various alternative approaches to meeting those requirements (including all those mentioned above in this thread), I have arrived at the conclusion - and I could be wrong, of course - that at least to my knowledge there is no single tool that will enable you to:
(a) save a web page from a particular URL.
(b) tell you it has already been saved when you next visit or try to save the same page at that URL.
(c) give you some form of structured management of the saved pages.
(d) provide index/search of the content of the saved pages.
(e) prevent you from unintentional duplication - saving the same web page more than once.
(f) keep track of pages saved and tell you when those pages change.
So I have used different tools, though I have been unable to completely resolve (b) and (e).
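For what it’s worth, (b) and (e) can at least be approximated with a small script that keeps an index of every URL you have already captured and consults it before saving. A rough sketch in Python - the index file name and the normalisation rules are my own assumptions, not any particular tool’s behaviour:

```python
import json
import os
from urllib.parse import urlsplit, urlunsplit

INDEX_FILE = "saved_pages.json"  # hypothetical index of already-captured URLs

def normalise(url):
    """Strip fragments and trailing slashes so trivially different
    URLs for the same page compare equal."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme, parts.netloc.lower(), path, parts.query, ""))

def load_index():
    if os.path.exists(INDEX_FILE):
        with open(INDEX_FILE) as f:
            return json.load(f)
    return {}

def record_save(index, url, saved_path):
    index[normalise(url)] = saved_path
    with open(INDEX_FILE, "w") as f:
        json.dump(index, f, indent=2)

# Usage: consult the index before capturing a page.
index = load_index()
url = "https://example.com/article/"
if normalise(url) in index:
    print("Already captured:", index[normalise(url)])
else:
    # ...capture the page with your tool of choice, then:
    record_save(index, url, "Scrapbook/Finance/article.html")
```

The normalisation step is what makes this useful against (e): the same article saved once with and once without a trailing slash should still count as a duplicate.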
I have overcome the difficulties of searching for and finding text data in all/most types of files that I am interested in.
What I have done is:
1. Install the Firefox add-on called “Scrapbook” - https://addons.mozilla.org/En-us/firefox/addon/scrapbook/
This will give you (a), (c), (d).
It can save pages intact, including nested pages to “n” levels deep, along with any embedded document files, image files, mp3 files, zip files (or whatever else you choose to have it save). It’s a superb cribbing/scavenging tool.
It also provides index and search functionality and you can create a tree structure of folders to save content into. I have my folders named with specific subject/category names (e.g., Education, Finance, Tree-ring data). When you start to save a page, Scrapbook pops up a list of the specific folders most recently saved to, for you to select ONE, which is helpful when you are gathering info on a particular subject/category. You can also add comments, notes, text, highlights to the saved pages, and the comments become part of the searchable data.
The Scrapbook saved files are HTML and a near-exact replica of the original (excluding JavaScript). Content in these pages can be indexed and searched by the extremely powerful Windows Desktop Search function (which is well worth the time invested in learning how to make it stand on the head of a pin).
2. Install the Firefox extension “Update Scanner” - http://sourceforge.net/projects/updatescanner/
This will help you with (f).
This tool will tell you when any designated website changes.
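If you are curious, the core idea behind such update checkers is just periodic fetch-and-compare. A hand-rolled sketch of that idea (this is not Update Scanner’s actual code, and the hash-store file name is made up):

```python
import hashlib
import json
import os
from urllib.request import urlopen

HASH_FILE = "page_hashes.json"  # hypothetical store of last-seen content hashes

def page_hash(url):
    """Fetch the page and return a digest of its raw bytes."""
    with urlopen(url) as response:
        return hashlib.sha256(response.read()).hexdigest()

def check_for_changes(urls):
    hashes = {}
    if os.path.exists(HASH_FILE):
        with open(HASH_FILE) as f:
            hashes = json.load(f)
    for url in urls:
        digest = page_hash(url)
        if url in hashes and hashes[url] != digest:
            print("CHANGED:", url)
        hashes[url] = digest
    with open(HASH_FILE, "w") as f:
        json.dump(hashes, f, indent=2)

check_for_changes(["https://example.com/article"])
```

A naive byte-for-byte hash like this also flags cosmetic changes (rotating ads, timestamps), which is why a dedicated tool such as Update Scanner is the better option in practice.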
3. Use duplication (rather than avoiding it). I save the same web page to several relevant subject/category folders.
Ideally, one would like to be able to “Tag” Scrapbook contents. The most readily available method for this is a bit of a kludge: I search just the Scrapbook directory (using xplorer²) for files containing (say) the text “tree rings”, then append the string “Tree rings” (or whatever string you want) to the Comments field for all files found, in one go. Comments are part of the Windows ADS (Alternate Data Stream) in NTFS, so they may not be carried across if you move/copy or back up the files to a non-NTFS file storage volume. For export, you can pack ADSes up, together with their files, using an xplorer² function. Appending “Tag” words like this to the Comments field is a generic and useful way to categorise your data, as you can always search for those tagged files at some later stage and add more tags, or edit them.
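For the technically inclined, NTFS alternate data streams can be read and written from a script simply by opening path + ":streamname". A rough Python sketch of the search-then-tag step above - the Scrapbook path and the stream name are assumptions (the stream xplorer² actually uses for its Comments field may differ):

```python
import os

SCRAPBOOK_DIR = r"C:\Scrapbook\data"  # hypothetical location of the Scrapbook files
STREAM = "Comments"                   # hypothetical ADS name; xplorer2 may use another

def files_containing(root, phrase):
    """Yield files under root whose text content contains the phrase."""
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, encoding="utf-8", errors="ignore") as f:
                    if phrase.lower() in f.read().lower():
                        yield path
            except OSError:
                pass

def append_tag(path, tag):
    """Append a tag word to the file's alternate data stream (Windows/NTFS only)."""
    with open(path + ":" + STREAM, "a") as f:
        f.write(" " + tag)

for hit in files_containing(SCRAPBOOK_DIR, "tree rings"):
    append_tag(hit, "Tree rings")
```

This only works on NTFS volumes under Windows; on any other file system the open() call for the stream will fail.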
4. Use different search tools on Scrapbook contents.
- Windows Desktop Search (for the content of any filetype that I have set it to index).
- xplorer² search (I use it mostly for text-searching in document and html files).
- Everything search (for filename search).
- Picasa (can search all image files in Scrapbook folders by filename, faces, colours, and text/notes in the EXIF and IPTC components of those image files). Note that not all image files have EXIF and IPTC components.
Hope this all makes sense or is of use. Bit of a rushed coredump as my 3 y/o son is clamouring for attention.
Posted by jimspoon
Sep 30, 2013 at 07:54 AM
Scrapbook is a great extension. It was the main thing that kept me using Firefox before finally switching to Chrome. Also has good annotation features. I used to use the “auto-save” extension so that I saved every page I browsed without any intervention on my part. I wished it had the ability to save the files in MHT format though.
Here is one option that might help Wayne or some others. Evernote Web Clipper extension for Chrome has this option: “When enabled, searching the web on supported search engines will also be performed on your Evernote account.” I just enabled this and when I did a Google search in Chrome, the results page showed not only the usual results, but also matching items in my Evernote database.
Posted by MadaboutDana
Sep 30, 2013 at 09:51 AM
I think I’ve understood the problem, and while I can’t pretend to be able to offer “the ultimate” solution, I’ve found the following approach useful.
I use the Windows version of Notebooks (by Alfons Schmid: notebooksapp.com) to copy and save web pages. Notebooks preserves the formatting of web pages almost unchanged, but also allows you to edit the pages (e.g. add comments, highlights, or even rewrite/reformat them completely). So a typical Notebooks page consists of the following (a scripted sketch of this layout follows the list):
Title
URL (pasted)
Comments (by me)
Tags (by me)
Contents of web page (pasted)
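As an aside, this layout is simple enough to reproduce by script if you ever want to automate your captures. A rough Python sketch of the convention (my own illustration, not Notebooks’ internal format):

```python
import datetime

def make_capture_page(title, url, comments, tags, body_html):
    """Assemble an HTML capture file in the Title/URL/Comments/Tags/Contents
    layout described above (a convention of my own, not Notebooks' format)."""
    stamp = datetime.datetime.now().isoformat(timespec="seconds")
    return f"""<!DOCTYPE html>
<html><head><meta charset="utf-8"><title>{title}</title></head>
<body>
<h1>{title}</h1>
<p><a href="{url}">{url}</a> (captured {stamp})</p>
<p><strong>Comments:</strong> {comments}</p>
<p><strong>Tags:</strong> {', '.join(tags)}</p>
<hr>
{body_html}
</body></html>"""

page = make_capture_page(
    "Example article", "https://example.com/article",
    "Worth re-reading for the methodology section.", ["Research", "Methods"],
    "<p>...pasted page contents...</p>")
with open("example-article.html", "w", encoding="utf-8") as f:
    f.write(page)
```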
Because the pages stay editable, when I eventually read through a web page I may decide to cut out bits that aren’t directly relevant to my interests (easy: just select and delete). Other pages I make “read-only” (Notebooks offers that facility) so I can’t change them (e.g. nice bits of writing I want to preserve for my future edification).
Notebooks automatically time-stamps pages anyway, and you can arrange them into folders. The actual pages are held as separate files (an HTML file plus a .plist file for each page, containing the index and references), and Notebooks automatically indexes them for searching (I have to say the iOS app’s search function is much better than the Windows client’s search function, but you can always use Windows Desktop Search or any other search app of your choice; I use Copernic, for example). Notebooks folders are thus actual folders in the file system, which makes Notebooks very “open”.
While this doesn’t solve the problem of duplicate pages as such, it does make it very easy to organize pages and delete them, annotate them (using highlights if you wish!), shove ‘em about wherever you want ‘em, and so on. Although it’s a slightly lengthier process than using e.g. Surfulater or Scrapbook, I’ve found it more flexible - and unlike Surfulater, Notebooks supports full UTF-8 encoding, so it is compatible with most languages. Finally, if you want to manipulate your web pages without reference to Notebooks, you can easily do so in the actual Windows file system (or on a Mac - Notebooks also has a MacOS client).
The Notebooks Windows client is currently free (because it’s still in beta). If you’ve got an iPad, you can synchronize easily via Dropbox (Notebooks defaults to Dropbox in any case); the cost of the iOS app is low (can’t remember what, exactly). Notebooks has, as a result of all the above, become my go-to information repository.