more idle thoughts - about file metadata
Posted by Slartibartfarst
Nov 16, 2012 at 12:47 PM
Just a quick core dump of some thoughts on this:
1. Metadata in Filenames: If you must have the metadata attached to the file itself, and the file format has no embedded metadata tags of its own (e.g., EXIF data for JPG files), then packing as much document metadata as possible into the filename is a tried-and-tested approach. You do need strict, basic minimum document naming standards in place and being observed/enforced, though, otherwise it could all turn to custard.
One document file-naming convention I recall from a while back had 7 fields which went something like this:
YYYY-MM-DD Project ID - Document Name - Version # (Author initials) Status.ext
- where:
- 1. YYYY-MM-DD was the ISO date;
- 2. Project ID was a numeric code;
- 3. Document Name was just that;
- 4. Version # would be vn-n - e.g., v2-5;
- 5. (Author initials) would be e.g., (DJW);
- 6. Status=DRAFT or FINAL;
- 7. .ext - appropriate file extension.
Dots (Full stops) were not otherwise allowed as they were a filename delimiter, but that rule could probably be relaxed today.
Special characters were not allowed as different file management systems might prohibit their use.
Max Path length of 255 characters was also a recognised constraint, but that rule could probably be relaxed today.
I have seen that sort of approach used on very large documentation projects. It can compensate to some extent for not having an expensive proprietary DMS (Document Management System) or a search engine that can read inbuilt metadata (e.g., metadata fields in MS Word documents) or embedded textual metadata. It also helps where the inbuilt/embedded metadata might not have been created, might be wrong, or may have been deleted, and you would have no way of knowing without opening each file and inspecting it.
Scanning tables of filenames could thus be a relatively quick and easy quality check, by comparison.
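As a rough illustration of that kind of quality check, here is a minimal Python sketch that parses the 7-field convention above and lists any filenames that do not conform. The regular expression, the function names (parse_filename, check_filenames) and the sample filenames are all assumptions made for illustration; the original convention was only recalled from memory, so the exact separators and allowed characters may have differed.

```python
import re
from datetime import date

# Assumed layout, following the convention described above:
# YYYY-MM-DD ProjectID - Document Name - vN-N (INITIALS) STATUS.ext
PATTERN = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) "      # 1. ISO date
    r"(?P<project>\d+) - "                # 2. numeric project code
    r"(?P<name>[A-Za-z0-9 _-]+?) - "      # 3. document name (no dots or special characters)
    r"v(?P<major>\d+)-(?P<minor>\d+) "    # 4. version, e.g. v2-5
    r"\((?P<author>[A-Z]+)\) "            # 5. author initials, e.g. (DJW)
    r"(?P<status>DRAFT|FINAL)"            # 6. status
    r"\.(?P<ext>[A-Za-z0-9]+)$"           # 7. file extension
)


def parse_filename(filename):
    """Return a dict of the 7 fields, or None if the name does not conform."""
    match = PATTERN.match(filename)
    if match is None:
        return None
    fields = match.groupdict()
    try:
        # Reject dates that have the right shape but are not real dates.
        date.fromisoformat(fields["date"])
    except ValueError:
        return None
    return fields


def check_filenames(filenames):
    """Return the filenames that break the naming convention."""
    return [name for name in filenames if parse_filename(name) is None]


# Hypothetical sample names, for illustration only.
sample = [
    "2012-11-16 1042 - Requirements Spec - v2-5 (DJW) DRAFT.docx",
    "meeting notes.txt",
]
non_conforming = check_filenames(sample)
```

Running check_filenames over a directory listing gives a quick list of files to rename, without opening any of them.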
__________________________________
2. Search, OCR, TIFF, Indexing: Until relatively recently, scanning and indexing/updating the index of the content of document files was pretty computer-intensive. However, Microsoft - possibly spurred on by Google’s now obsolete Desktop product - have made it a commodity with their Win 7 Search facility - which is a very good product.
If you select the system option of adding TIFF image file OCR scanning to the Search, then some of your data can be text held in image format. You could thus have a document’s filename and its metadata in the image, where it is uneditable.
It’s easy to put text into an image file. Just Copy the text to Clipboard, Open an image file in (say) irfanview, and then Paste it. Then save as a .TIFF file. The text can then be picked up from the resultant .TIFF file when it is scanned by the Indexing system (as long as OCR has been enabled, as above).
If you used a Reference Management System such as (say) Qiqqa, then you could let Qiqqa automate things and build the metadata for each document as it is being scanned inside Qiqqa, or search out metadata for each document, on the Internet and from scholarly reference sources.
As well as Indexing the text content of .PDF and .DOC documents, Qiqqa can also OCR scan “imaged” .PDF documents (which are not normally text-searchable), and make them text searchable. I think Qiqqa is not yet able to OCR scan .TIFF documents.
Google Drive can OCR scan and search for text in imaged PDF files and in TIFF images.
Thus, by using a combination of search/indexing engines - e.g., those in:
* Qiqqa (metadata, search and content index)
* Win 7 Search (search and content index)
* something like Everything or Locate32 (folder/filename search)
* Google Drive.
- you should be able to ensure that you can find just about anything you want in libraries of PDF, DOC and TIFF files.
3. NTFS streams: Fairly extensive metadata can be saved in streams in NTFS. You can “pack to go” your files and their stream data using file manager tools - xplorer² is one such that I use, with “ADS | Bundle to go” (Pack all selected files and their ADS contents in a bundle for transferring to a non-NTFS disk).
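For anyone who wants to experiment, here is a minimal Python sketch of writing and reading a named stream. It relies on the "filename:streamname" path syntax that Windows uses for alternate data streams on NTFS volumes. The function names and the stream name "metadata" are hypothetical. Note the assumption: this only creates a real stream on Windows with NTFS; on other systems the same code just creates a separate file whose name contains a colon.

```python
def stream_path(file_path, stream_name):
    """Build the 'file:stream' path used to address an NTFS alternate data stream."""
    return f"{file_path}:{stream_name}"


def write_stream(file_path, stream_name, text):
    """Write text into a named stream attached to an existing file.

    Assumption: file_path is on an NTFS volume under Windows. Elsewhere this
    creates an ordinary, separate file named 'file:stream' instead.
    """
    with open(stream_path(file_path, stream_name), "w", encoding="utf-8") as handle:
        handle.write(text)


def read_stream(file_path, stream_name):
    """Read back the text stored in a named stream."""
    with open(stream_path(file_path, stream_name), "r", encoding="utf-8") as handle:
        return handle.read()
```

For example, write_stream("report.docx", "metadata", "status=DRAFT") followed by read_stream("report.docx", "metadata") should hand the same text back, while the main content of report.docx is left untouched. Copying the file to a non-NTFS disk will normally drop the stream, which is why a "bundle to go" step is needed.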
Hope this makes sense and is useful to someone.
Posted by jimspoon
Nov 17, 2012 at 04:07 AM
Slartibartfarst, it does make sense and is very useful!
I am particularly intrigued by the use of alternate data streams ... I really didn’t know much about them and I’ve been doing a bit of research today ...
this was interesting - http://en.wikipedia.org/wiki/Fork_(file_system)
I didn’t know how easy it is to “attach” an ADS to a file .. as shown here ...
http://www.irongeek.com/i.php?page=security/altds
Seems like there could be a lot of potential here for an information organizer - if it could (1) attach data fields to external files in alternate data streams that travel with the files; (2) index those data fields; and (3) search and retrieve files matching queries.
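To make steps (2) and (3) a bit more concrete, here is a small Python sketch of an in-memory index over per-file data fields, with a lookup that returns the files matching every requested field. It assumes the fields have already been read from wherever they are stored (how they get attached to the files is left out), and the names build_index and search, plus the sample paths and fields, are made up for illustration.

```python
from collections import defaultdict


def build_index(metadata_by_file):
    """Map each (field, value) pair to the set of files that carry it."""
    index = defaultdict(set)
    for path, fields in metadata_by_file.items():
        for key, value in fields.items():
            index[(key, value)].add(path)
    return index


def search(index, **criteria):
    """Return the sorted list of files that match every field=value criterion."""
    matches = None
    for key, value in criteria.items():
        found = index.get((key, value), set())
        matches = found if matches is None else matches & found
    return sorted(matches or [])


# Hypothetical data fields, as they might be read back from each file.
metadata_by_file = {
    "spec.docx": {"status": "DRAFT", "author": "DJW"},
    "plan.docx": {"status": "FINAL", "author": "DJW"},
    "notes.pdf": {"status": "DRAFT", "author": "ABC"},
}
index = build_index(metadata_by_file)
drafts_by_djw = search(index, status="DRAFT", author="DJW")
```

A real organizer would need to keep such an index in step with the files as they change, which is the hard part this sketch skips.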
I don’t know if there are organizers which make use of the alternate data streams in this way.
I will keep reading and report back on any interesting findings.
jim
Posted by jimspoon
Nov 17, 2012 at 05:47 AM
Hmm so ADS is not the future?
“Alternate data streams are strictly a feature of the NTFS file system and may not be supported in future file systems. However, NTFS will be supported in future versions of Windows NT.
Future file systems will support a model based on OLE 2.0 structured storage (IStream and IStorage). By using OLE 2.0, an application can support multiple streams on any file system and all supported operating systems (Windows, Macintosh, Windows NT, and Win32s), not just Windows NT.”
http://support.microsoft.com/kb/105763
ok, so what is OLE 2.0 Structured Storage, IStream, and IStorage?
http://en.wikipedia.org/wiki/COM_Structured_Storage
http://msdn.microsoft.com/en-us/library/aa380369(v=vs.85).aspx
It seems like “file metadata” might be handled under the rubric of “properties and property sets”, including “user-defined property sets”
http://msdn.microsoft.com/en-us/library/aa380062(v=vs.85).aspx
“Persistent properties are stored as sets, and one or more sets may be associated with a file system entity. These persistent property sets are intended to be used to store data that is suited to being represented as a collection of fine-grained values. They are not intended to be used as a large data base. They can be used to store summary information about an object on the system, which can then be accessed by any other object that understands how to interpret that property set.”
“The UserDefined property set can be used to hold any properties. Typically, it is used to store named properties created by a user.”
Posted by Dr Andus
Nov 17, 2012 at 03:19 PM
jimspoon wrote:
Slartibartfarst, it does make sense and is very useful!
Yes, this is good stuff, thanks for this!
Posted by jimspoon
Jan 27, 2014 at 07:09 PM
in another thread, the topic of attaching custom fields to files has come up again.
http://www.outlinersoftware.com/topics/viewt/5280/0/wincatalog
It would be very useful to have this capability, particularly if it is implemented in a standard way, generally accessible from other computer programs.
It would seem that Windows Alternate Data Streams would offer the ability to attach data fields to files. But it seems like the capability is not widely used. And the continuing availability of ADS might be in doubt.
I would like to find a good utility that makes it easy to find files with alternate data streams, and to view the content of those streams.
For finding ADS streams, a good utility is Nirsoft AlternateStreamView:
http://www.nirsoft.net/utils/alternate_data_streams.html
So far I have not found an easy way to view the CONTENT of ADS streams.
But today I did find a good article about ADS. It describes how to use PowerShell to view the contents of streams, and also to create and delete them.
https://blogs.technet.com/b/askcore/archive/2013/03/24/alternate-data-streams-in-ntfs.aspx
And here is a little tutorial about ADS I found useful -
http://www.irongeek.com/i.php?page=security/altds
It shows how even executables can be put into the alternate data stream of a file.