more idle thoughts - about file metadata

Started by jimspoon on 11/12/2012
jimspoon 11/12/2012 4:12 am
We need to be able to attach user-defined metadata fields to any type of file.

We do have some metadata capabilities with NTFS - we have file comments, and extended attributes, and "alternate data streams".

The problem is - if this metadata is stored in the file system, rather than inside the file itself - it may be lost whenever the file is copied to another drive or location, or sent to another person, etc.

The metadata needs to travel with the file itself. And it does - for some kinds of files. JPEG images have very good built-in metadata capabilities - you have EXIF, IPTC, and XMP fields to hold all kinds of metadata. And when you copy the file to another location, it is not lost.

Problem - if you added such metadata fields to a simple text file - other programs, other operating systems, etc. would not know how to deal with it. For example - if you loaded such a file into a Linux text editor, it might show up as gibberish, and it could be easily corrupted.

Solution - there ought to be a cross-platform method of joining "application data" and "metadata" parts in one file. The operating system should mediate so that "dumb" applications would see only the "application data" part of the file, while "smart" programs would be able to see and manipulate the metadata section. The file could be sent anywhere without losing its metadata.
Slartibartfarst 11/13/2012 9:12 am
"We need to be able to attach user-defined metadata fields to any type of file."

Is this a general requirement?
To what purpose?
Could you take advantage of the "pack to go" files/folders, where the NTFS streams are bundled into the package, before the file is sent out somewhere or moved to non-NTFS media?
Cassius 11/13/2012 5:05 pm
Several years ago, I looked into the possibility of using metadata in "file properties" as a means of tagging files (in Win XP). However, I could not get it to work, although I no longer remember why.
jimspoon 11/13/2012 7:45 pm

Slartibartfarst wrote:
"We need to be able to attach user-defined metadata fields to any type
of file."

Is this a general requirement?
To what purpose?
Could you take advantage of the "pack to go" files/folders, where the
NTFS streams are bundled into the package, before the file is sent out
somewhere or moved to non-NTFS media?

Well - inevitably a large part of our information is stored in discrete files ... notes, documents, photos, sounds, etc., while other information is stored in structured databases - emails, notes, contacts, tasks, etc. Ideally we would be able to search both types of "information stores" at once, and filter and sort the retrieved information in useful ways.

Traditionally we have stored files in the directory tree structure. But as files have many different properties/attributes etc, they could be organized in many different ways. Just for example, you could arrange software programs by the author/vendor (the default method in Windows in the Program Files folder), or by function (the method I try to use in structuring my program files). You could organize your photo files by camera, date taken, event, persons appearing, etc. etc.

Now we could use "hard links" or "symlinks" to categorize files in different ways ... so they would appear in several different locations in the directory tree. But if that information is stored in the file or directory system ... rather than the files themselves, it can be easily lost when moved around.

With adequate file metadata capabilities, notes could be stored in individual files, and this metadata could be used to retrieve, filter, and sort both notes and other related files.

I think it's been an ongoing goal of Microsoft to transition from a tree-based structure to a database-type or tag-based structure for a long time ... as far back as Vista. A number of third-party file-tagging programs have appeared - but these suffer from the same problem ... the tags are usually lost when the files are moved out of the particular file system where they reside. Microsoft has come out with a new file system for Windows 8 called ReFS or something like that ... Resilient File System, which allows for much more in the way of metadata capabilities. For now, though, it appears to be only for servers.

That's an interesting idea about bundling NTFS data streams together when the data is sent out of the file system. The receiving system would have to know how to handle the bundled streams, of course. I suppose it's the lack of a standard that has prevented this sort of thing from happening.

Just for example ... when EXIF/IPTC/XMP metadata started to be included in image files, you had the problem that many image viewers/editors did not know how to handle this extra data in the file. Now most of them do, I suppose. A similar standard could be useful for other types of files too.

MP3 files have a well-established metadata system - players/editors on different operating systems know how to handle it.

A related point ... if "hard links" and symlinks were used to implement categorization of files ... that categorization would be lost if files are sent out without the other hardlinks/symlinks. Somehow the metadata needs to be contained in the files themselves.


jimspoon 11/13/2012 7:54 pm
p.s. I have improved a very crude metadata system for files ... i put metadata into the file name. Specifically, I often begin file names with a date in the format "2012.11.13". Then, using Voidtools Everything, I can instantly retrieve all files in the file system that have "2012.11.13" in the file name, and sort them by filename. I can drag and drop all the found files into a Xplorer2 "scrap container" to browse and view them. Very crude but I use it all the time.

You might wonder why I don't just search by file creation date / modified date ... those dates are easily modified by the system, and these dates are often not the date I am looking for. Say for example, I have files about items I have purchased, and I want to view these files sorted by date of purchase. The "creation date" and "modified date" fields are not proper locations for this info.
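
The date-in-the-filename habit described above can be sketched in a few lines of Python. This is only an illustration, not how Everything works internally: the function names `leading_date` and `find_by_date_in_name` are made up for this example, and it walks the directory tree instead of using an index, so it will be much slower on a large drive.

```python
import os
import re
from datetime import date

# Matches a leading date such as "2012.11.13" at the start of a file name.
DATE_PREFIX = re.compile(r"^(\d{4})\.(\d{2})\.(\d{2})")


def leading_date(name):
    """Return the date a file name starts with, or None if there is none."""
    m = DATE_PREFIX.match(name)
    if not m:
        return None
    try:
        return date(int(m.group(1)), int(m.group(2)), int(m.group(3)))
    except ValueError:
        # Looks like a date but is not a valid one, e.g. "2012.13.40".
        return None


def find_by_date_in_name(root, date_text):
    """Return paths under root whose file name contains date_text,
    sorted by file name (hypothetical helper for illustration)."""
    matches = []
    for folder, _dirs, files in os.walk(root):
        for name in files:
            if date_text in name:
                matches.append(os.path.join(folder, name))
    return sorted(matches, key=os.path.basename)
```

Because the date is written year first, a plain sort by file name is also a sort by that date, which is why the convention holds up even when the system's creation and modified dates change.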


jimspoon 11/13/2012 7:55 pm
"improvised" .. not "improved" haha.
Cassius 11/14/2012 1:15 am
jimspoon wrote:
p.s. I have improved a very crude metadata system for files ... i put
metadata into the file name. Specifically, I often begin file names
with a date in the format "2012.11.13". Then, using Voidtools
Everything, I can instantly retrieve all files in the file system that
have "2012.11.13" in the file name, and sort them by filename. I can
drag and drop all the found files into a Xplorer2 "scrap container" to
browse and view them. Very crude but I use it all the time.
================
I thought about this. One might run into problems if one wants to include a large number of tags in the file name. It probably would be better to place tags at the end of a file name, each preceded by a special character, such as ##.

Then, all one needs is a very fast search engine that has Boolean capabilities. Does anyone know of such, preferably one that doesn't require indexing? If so, LET US KNOW!

-c
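
A minimal Python sketch of the "##" tag idea with a Boolean filter follows. The exact syntax is an assumption made for the example: tags sit in the file name before the extension, each written as "##" followed by a run of characters with no spaces, and they are compared case-insensitively. The names `tags_in_name` and `matches` are hypothetical.

```python
import os
import re

# A tag is "##" followed by characters other than "#" and whitespace.
TAG = re.compile(r"##([^#\s]+)")


def tags_in_name(filename):
    """Return the set of lower-cased tags found in a file name."""
    stem, _ext = os.path.splitext(filename)
    return {t.lower() for t in TAG.findall(stem)}


def matches(filename, all_of=(), any_of=(), none_of=()):
    """Boolean filter: every tag in all_of (AND), at least one tag in
    any_of if given (OR), and no tag in none_of (NOT)."""
    tags = tags_in_name(filename)
    if not all(t.lower() in tags for t in all_of):
        return False
    if any_of and not any(t.lower() in tags for t in any_of):
        return False
    return not any(t.lower() in tags for t in none_of)
```

For example, `matches("invoice ##tax ##2012.pdf", all_of=["tax"], none_of=["draft"])` is true, while the same query against "invoice ##tax ##draft.pdf" is false.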


jimspoon 11/14/2012 4:51 am
Cassius, I'd recommend Voidtools Everything -

http://www.voidtools.com/download.php

It does index, but the indexing is very fast and it is continuously updated. Instead of traversing the directory tree, it relies on the NTFS "USN journal" aka change journal. It indexes file names, not the content.

Off hand I don't recall other search options, but if you type in two strings separated by a space, it finds all files/paths with those two strings anywhere in the file or path name. It narrows down the list of matching files as you type.

I wish it had a built in file viewer, but I can use xplorer2 scrap containers for that. Also for some reason it won't return matching files that are in my Dropbox folders. Haven't tried it with Skydrive or other syncing services.

If you think about using the filename or pathname for tagging, remember the limits on filename / pathname length - for NTFS it is as follows -

"Individual components of a filename (i.e. each subdirectory along the path, and the final filename) are limited to 255 characters, and the total path length is limited to approximately 32,000 characters. However, you should generally try to limit path lengths to below 260 characters (MAX_PATH) when possible. See http://msdn.microsoft.com/en-us/library/aa365247.aspx for full details."

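A rough Python check against the limits quoted above, useful before stuffing more tags into a name. It is a sketch under assumptions: the function name `path_length_problems` is made up, and it counts Python characters with `len()`, which only approximates how Windows counts path length.

```python
import ntpath

MAX_COMPONENT = 255  # per-component limit from the quoted text
MAX_PATH = 260       # the quoted advice is to stay below this total


def path_length_problems(path):
    """Return a list of human-readable problems for a Windows-style path."""
    problems = []
    if len(path) >= MAX_PATH:
        problems.append(f"total length {len(path)} is not below {MAX_PATH}")
    # Drop the drive ("C:") and check each folder and the file name in turn.
    _drive, rest = ntpath.splitdrive(path)
    for part in rest.replace("/", "\\").split("\\"):
        if len(part) > MAX_COMPONENT:
            problems.append(f"component of length {len(part)} exceeds {MAX_COMPONENT}")
    return problems
```

An empty list means the path fits within both limits as quoted.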


Tim the Red 11/14/2012 4:59 pm
Everything can do boolean and even regular expression searches, and it's awesome. Practically instantaneous. Important: only on local NTFS drives.

See: http://www.voidtools.com/faq.php#How_do_I_use_boolean_operators

I use file name tagging and Everything, like jimspoon described. Just a few tags per file, more than that doesn't work well. I've been doing it for a couple years. The major drawback is it's not cross-platform. The real solution requires a standard that spans all companies but I'm not holding my breath that this is going to happen any time soon.
jimspoon 11/14/2012 8:26 pm
Great stuff, Tim!

If anyone wants to try Everything - you'll see that the version on the download page is 1.2.1.371.

But you might want to install 1.2.1.451a instead - here's the link

http://www.voidtools.com/Everything-1.2.1.451a.zip


Slartibartfarst 11/16/2012 12:47 pm
Just a quick core dump of some thoughts on this:

1. Metadata in Filenames: If you must have the metadata somehow attached to the file, and absent any existing file-embedded metadata tags (e.g., EXIF data for JPG files), then the idea of packing as much document metadata into its filename as possible is a tried-and-tested approach, though you need to have pretty strict and basic minimum document naming standards in place and being observed/enforced, otherwise it could all turn to custard.

One document file-naming convention I recall from a while back had 7 fields which went something like this:
YYYY-MM-DD Project ID - Document Name - Version # (Author initials) Status.ext
- where:
- 1. YYYY-MM-DD was the ISO date;
- 2. Project ID was a numeric code;
- 3. Document Name was just that;
- 4. Version # would be vn-n - e.g., v2-5;
- 5. (Author initials) would be e.g., (DJW);
- 6. Status=DRAFT or FINAL;
- 7. .ext - appropriate file extension.

Dots (Full stops) were not otherwise allowed as they were a filename delimiter, but that rule could probably be relaxed today.
Special characters were not allowed as different file management systems might prohibit their use.
Max Path length of 255 characters was also a recognised constraint, but that rule could probably be relaxed today.
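
A naming convention like the one above is easy to check mechanically. The Python sketch below parses such a name into its fields. The exact spacing and separators in the pattern, the sample file name, and the function name `parse_document_name` are assumptions made for illustration, since the convention is only recalled roughly here.

```python
import re

# Assumed layout:
# "YYYY-MM-DD ProjectID - Document Name - vN-N (INITIALS) STATUS.ext"
NAME_PATTERN = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) "
    r"(?P<project>\d+) - "
    r"(?P<title>.+?) - "
    r"(?P<version>v\d+-\d+) "
    r"\((?P<author>[A-Za-z]+)\) "
    r"(?P<status>DRAFT|FINAL)"
    r"\.(?P<ext>[^.]+)$"
)


def parse_document_name(filename):
    """Return a dict of the seven fields, or None if the name does not
    follow the convention."""
    m = NAME_PATTERN.match(filename)
    return m.groupdict() if m else None
```

Running every file name in a project folder through `parse_document_name` and listing the ones that return `None` is one way to do the quick quality check on names.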

I have seen that sort of approach used on very large documentation projects. It can be used to compensate to some extent for the fact that you might not have an expensive proprietary DMS (Document Management System) or a search engine that can read inbuilt metadata (e.g., metadata fields in MS Word documents), or embedded textual metadata, or where the inbuilt/embedded metadata might have not been created, or might be wrong, or may have been deleted and you might have no way of knowing without opening each file and inspecting it.
Scanning tables of filenames could thus be a relatively quick and easy quality check, by comparison.
__________________________________
2. Search, OCR, TIFF, Indexing: Until relatively recently, scanning and indexing/updating the index of the content of document files was pretty computer-intensive. However, Microsoft - possibly spurred on by Google's now obsolete Desktop product - have made it a commodity with their Win 7 Search facility - which is a very good product.
If you select the system option of adding in TIFF image file OCR scanning to the Search, then you have the possibility that some of your data can be text but in image format. You could thus have a document's filename and its metadata in the image, and uneditable.
It's easy to put text into an image file. Just Copy the text to Clipboard, Open an image file in (say) irfanview, and then Paste it. Then save as a .TIFF file. The text can then be picked up from the resultant .TIFF file when it is scanned by the Indexing system (as long as OCR has been enabled, as above).

If you used a Reference Management System such as (say) Qiqqa, then you could let Qiqqa automate things and build the metadata for each document as it is being scanned inside Qiqqa, or search out metadata for each document, on the Internet and from scholarly reference sources.
As well as Indexing the text content of .PDF and .DOC documents, Qiqqa can also OCR scan "imaged" .PDF documents (which are not normally text-searchable), and make them text searchable. I think Qiqqa is not yet able to OCR scan .TIFF documents.
Google Drive can OCR scan and search for text in imaged PDF files and in TIFF images.

Thus, by using a combination of search/indexing engines - e.g., those in:
* Qiqqa (metadata, search and content index)
* Win 7 Search (search and content index)
* something like Everything or Locate32 (folder/filename search)
* Google Drive.

- you should be able to ensure that you can find just about anything you want in libraries of PDF, DOC and TIFF files.

3. NTFS streams: Fairly extensive metadata can be saved in streams in NTFS. You can "pack to go" your files and their stream data using file manager tools - xplorer² is one such that I use, with "ADS | Bundle to go" (Pack all selected files and their ADS contents in a bundle for transferring to a non-NTFS disk).

Hope this makes sense and is useful to someone.
jimspoon 11/17/2012 4:07 am
Slartibartfarst, it does make sense and is very useful!

I am particularly intrigued by the use of alternate data streams ... I really didn't know much about them and I've been doing a bit of research today ...

this was interesting - http://en.wikipedia.org/wiki/Fork_(file_system)

I didn't know how easy it is to "attach" an ADS to a file .. as shown here ...

http://www.irongeek.com/i.php?page=security/altds

Seems like there could be a lot of potential here for an information organizer - if it could (1) attach data fields to external files in alternate data streams that travel with the files; (2) index those data fields; and (3) search and retrieve files matching queries.
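
Step (1) can be sketched in Python, since on Windows a path of the form "file:streamname" opens a named stream on an NTFS file. Everything specific here is an assumption for illustration: the stream name "metadata", storing the fields as JSON, and the function names `write_fields` and `read_fields`. On a non-NTFS volume or another operating system the same code does not create a real alternate data stream (on Linux it just makes a separate file with a colon in its name), which is exactly the portability problem discussed in this thread.

```python
import json


def write_fields(path, fields, stream="metadata"):
    """Store a dict of user-defined fields as JSON in a named stream
    attached to the file at path (a real ADS only on Windows/NTFS)."""
    with open(f"{path}:{stream}", "w", encoding="utf-8") as f:
        json.dump(fields, f)


def read_fields(path, stream="metadata"):
    """Return the stored fields, or an empty dict if the stream is absent."""
    try:
        with open(f"{path}:{stream}", "r", encoding="utf-8") as f:
            return json.load(f)
    except FileNotFoundError:
        return {}
```

It is safer to pass a full path, because a short relative name such as "a:metadata" can be read by Windows as a drive letter. The main content of the file is not touched by either function.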

I don't know if there are organizers which make use of the alternate data streams in this way.

I will keep reading and report back on any interesting findings.

jim


jimspoon 11/17/2012 5:47 am
Hmm so ADS is not the future?

"Alternate data streams are strictly a feature of the NTFS file system and may not be supported in future file systems. However, NTFS will be supported in future versions of Windows NT.

Future file systems will support a model based on OLE 2.0 structured storage (IStream and IStorage). By using OLE 2.0, an application can support multiple streams on any file system and all supported operating systems (Windows, Macintosh, Windows NT, and Win32s), not just Windows NT."

http://support.microsoft.com/kb/105763

ok, so what is OLE 2.0 Structured Storage, IStream, and IStorage?

http://en.wikipedia.org/wiki/COM_Structured_Storage
http://msdn.microsoft.com/en-us/library/aa380369(v=vs.85).aspx

It seems like "file metadata" might be handled under the rubric of "properties and property sets", including "user-defined property sets"

http://msdn.microsoft.com/en-us/library/aa380062(v=vs.85).aspx

"Persistent properties are stored as sets, and one or more sets may be associated with a file system entity. These persistent property sets are intended to be used to store data that is suited to being represented as a collection of fine-grained values. They are not intended to be used as a large data base. They can be used to store summary information about an object on the system, which can then be accessed by any other object that understands how to interpret that property set."

"The UserDefined property set can be used to hold any properties. Typically, it is used to store named properties created by a user."

Dr Andus 11/17/2012 3:19 pm
jimspoon wrote:
Slartibartfarst, it does make sense and is very useful!

Yes, this is good stuff, thanks for this!
jimspoon 1/27/2014 7:09 pm
in another thread, the topic of attaching custom fields to files has come up again.

http://www.outlinersoftware.com/topics/viewt/5280/0/wincatalog

It would be very useful to have this capability, particularly if it is implemented in a standard way, generally accessible from other computer programs.

It would seem that Windows Alternate Data Streams would offer the ability to attach data fields to files. But it seems like the capability is not widely used. And the continuing availability of ADS might be in doubt.

I would like to find a good utility that makes it easy to find files with alternate data streams, and to view the content of those streams.

For finding ADS streams, a good utility is Nirsoft AlternateStreamView:

http://www.nirsoft.net/utils/alternate_data_streams.html

So far I have not found an easy way to view the CONTENT of ADS streams.

But today I did find a good article about ADS. It does describe how to use Powershell to view the contents of streams, and also to create and delete them.

https://blogs.technet.com/b/askcore/archive/2013/03/24/alternate-data-streams-in-ntfs.aspx

And here is a little tutorial about ADS I found useful -

http://www.irongeek.com/i.php?page=security/altds

It shows how even executables can be put into the alternate data stream of a file.

22111 1/28/2014 4:01 pm
Jim,

I should have said, I didn't find any info on ADS that both was intellectually accessible to my mind, AND gave me some hint how to PROCESS those ADS in a file M environment. I had seen the Russinovich tool, but didn't see a way to have then all this necessary ADS info ready for processing = sorting / filtering for it (as is done in some of those file managers though); just looking, one by one, at those ADS attributes, for hundreds of files, by batch-triggering the same tool again and again, and fetching the individual results, did not seem to be a viable solution for me; I would have needed "more direct" access to it (as X2 and DO certainly will both have found, but their developers have way more know-how to correctly "read" highly technical info).

This being said, if I got the necessary info which enabled me to access it, "in numbers", so that I could then process it, by AHK, I'd be happy to share any AHK script I then wrote to process such info from any file manager. (Here again, it would be some additional tool, automatically triggered, then closed, but no way to do it WITHIN the respective file commander - and as said in the other thread, both X2 and DO seem to "withhold" their respective know-how on these matters.)

Perhaps I overestimate the burden on today's computers when I think there should be an easier way to access that data but by running a 41 kb tool 1,000 times, again and again, to produce the raw data for such a list which then could be sorted/filtered.
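
One pass over a folder can gather the raw data without launching an external tool once per file. The Python sketch below (not AHK) assumes that the stream of interest has a known name and holds plain text, and the function name `collect_stream` is made up for the example. As with any "file:stream" path, it reads real alternate data streams only on Windows with NTFS.

```python
import os


def collect_stream(folder, stream):
    """Return (file name, stream text) pairs for every file in folder that
    has the named stream, sorted by the stream text."""
    rows = []
    with os.scandir(folder) as entries:
        for entry in entries:
            if not entry.is_file():
                continue
            try:
                # Undecodable bytes are replaced rather than raising an error.
                with open(f"{entry.path}:{stream}", "r",
                          encoding="utf-8", errors="replace") as f:
                    rows.append((entry.name, f.read()))
            except OSError:
                # No such stream on this file (or it cannot be opened): skip.
                continue
    return sorted(rows, key=lambda row: row[1])
```

The resulting list can then be filtered or re-sorted in memory, or written out as a table for another tool to pick up.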

Another idea: X2 cannot rename/add further such ADS attributes, but what about its displaying (and then even processing/changing the contents of) EVERY "non-standard" attribute, created by a third-party application (sw or just little tool?). Unfortunately, it's not worthwhile to "try" this with MS Office, since very certainly X2 has "incorporated" attributes created by Office, as "standard" ones into those it can handle - does anyone know about some "exotic" prog though that creates really "exotic" ADS attributes, and from which a file, with such attributes, could then be checked from within X2 (or DO)?