Lady Bumps and Data Dumps
Posted by 22111
Jul 12, 2022 at 09:51 AM
Ad Mad:
Tabs
So my first answer from yesterday is gone. After losing it, I reloaded the recently closed tabs and, before closing any tab, checked whether there was another page "behind" the current one in each tab: to no avail. This is just another example of tabs being among the WORST navigational devices: fine for some 6, 8, perhaps 12 tabs, where you can still read (the beginning of) each title, but beyond that number you need something really better - hence tree form. (Old Firefox add-ons that provided this don't work on newer FF versions anymore…)
Specifics re personally used tools, and re tools not used personally
You, Mad, now criticize (for at least the second time) my doing both, but consider:
1) I always say that my knowledge of the latter comes from reading web manuals and web comments, viewing YT vids, etc. - for DT, for example, I have watched several hours of such videos. I also ask to be corrected if I'm mistaken about some detail, and I never present assumptions as facts.
2) It's a fact, though, that my writings about (just some, in fact two) Mac software are the best info anyone considering that software can get in this forum, and any "invitation" by me to the (quite numerous) real users of that software here, to inform us (i.e. me and prospects for these tools) better than I can, has been in vain. It's also a fact that even trialing software (which Mac users would then allegedly do before buying) will not necessarily tell the prospect whether it really suits their real work: trialing happens in a short time frame and without "real data", so in practice most of it can't be more than "playing around" a little - let alone tell them anything about the long run, i.e. when additional tasks will have to be handled with that software.
And that's because the individual software prospect does not act as corporations do: the individual will not first build a functional spec, or when they do - a rare exception - they will very probably overlook requirements which only become apparent as highly important later on.
By then, the software user will be more or less stuck with a piece of software which isn't really appropriate for their tasks at hand anymore, and most professionals don't have "crimping" as their favorite hobby. ;-)
Thus, in the absence of real users' "interest" in describing the "real work" they do with their tools here, my "initiative" makes up for that lack of "sharing" somewhat.
3) Because, as I have mentioned and even "underlined" above, software - whenever you look at its specifics, at its purpose (!), and at the "cooperation" between the two - is NOT interchangeable, or only very rarely, and then only for quite basic tasks.
I don't have to prove this here; just look at my partly comparative and in every case as-detailed-as-possible software descriptions, which even spell out what's possible "factory-wise" and for which functionality you'd need some (often simple, sometimes more elaborate) "macros". Of comparative interest in this respect: I can't remember having read about any such personal "enhancements" of their respective tools from other contributors here, and - "as more or less expected by me"? - even my "invitation" to share some info on possible Mac macros that "upgrade" their Mac software "out-of-the-box"... fell flat.
I admit, though, that I stopped reading threads about online tools a long time ago - and those are the majority here - so I might have missed such info over there, albeit I can't easily imagine how you would apply your own macros to online tools?
At the end of the day, and from an objective POV, many of my posts are among the most instructive and useful ones here. But I was indeed mistaken yesterday in assuming that people who risk getting their adrenaline flowing when reading me could simply check who started the thread (which is standard in standard forum software; and please forgive me - here, when you write, you don't see any thread unless you load it into another tab).
Unfortunately, I can't edit existing thread titles, but I promise I'll try to think of adding some (beware) or similar to any future title. On the other hand, whenever the title isn't just technical but presents some text of copyright value (i.e. reaching the threshold of originality for copyright law to apply), you know it's by me, even without a special warning. ;-)
Posted by tightbeam
Jul 12, 2022 at 11:08 AM
And why do you assume I’m referring to your posts? What about your posts might lead you to draw that conclusion? Isn’t it possible I was referring to MadaboutDana?
22111 wrote:
>"Posted by tightbeam
>Jul 11, 2022 at 06:57 PM
>I would like to submit again my request for the ability to block
>specific users on this forum."
>
>I don't understand: When tightbeam sees a thread has been started by
>22111, why do they click on that thread, to begin with? Didn't his
>mother tell 'em they could get wholly new experiences by pawing at the
>burner?
Posted by 22111
Jul 22, 2022 at 08:46 AM
On the road again, and more about dumps - and, below, about getting wikipedia data onto your desktop (or wherever):
As said, I have NOT found a single adequate iPhone/iPad "app" for data dumps, but I found exactly one for Android, which, on a cheap "smartphone", gives immediate and visually pleasant "hit lists" (i.e. "filtering") from dumps in the 1-million-char range - but for single search strings only. Thus, you had better put your dump lines in alphabetical order upon export, so that your "bergman, ingmar (year)" entries (found by "bergman"; "bergm", without the quotes, will suffice, too) follow each other in chronological order, and only afterwards do you get the movies with the respective actress, instead of a blunt mix.
You would have to do some scripting anyway, upon export, since diacritics, as you know, ain't people fed up to the core after a 3-hour slide-show evening, but chars like ä or é, and searching for those on your hand-held's virtual keyboard would be a nightmare - so ä>ae, é>e, ñ>n, ß>ss, etc., before the dump. You see here that just switching the default US keyboard to some - one specific - "national" one would not resolve the problem…
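For illustration only, a rough sketch of such a pre-export normalization (here in Python, just to show the idea; the mapping table and file names are my assumptions, not a finished tool):

# Sketch: transliterate diacritics and sort the dump lines alphabetically
# before sending them to the hand-held. Extend the mapping to taste.
TRANSLIT = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss",
            "é": "e", "è": "e", "ñ": "n", "á": "a"}

def normalize(line: str) -> str:
    for src, dst in TRANSLIT.items():
        line = line.replace(src, dst)
    return line

with open("dump.txt", encoding="utf-8") as f:
    lines = [normalize(l.rstrip("\n")) for l in f]

lines.sort(key=str.lower)    # alphabetical order, as suggested above

with open("dump_for_handheld.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(lines) + "\n")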
Btw, I don't know why people use 60-bucks tools for transmitting files to and from their hand-helds, or even pay annual subscriptions for such tools - maybe they are "necessary" for appleware? For Android at least, your USB charging cable, connected to your PC, will do it - it works fine in both directions - but perhaps not if you charge your battery by "induction" now, or whatever they call it. All I know for sure is that proud iMobile users don't like it, not at all, when you imply that if you use your iTablet in your grocery store, other (grocery, not necessarily apple…) customers might take you for a clerk - well, wear a white coat then, and they'll figure you for the manager!
I'm speaking here of dumps for quickly reviewing info, not for editing; for the latter you would need better soft- and hardware, and such inputs on a hand-held's virtual keyboard are error-prone in my view. Even for just new telephone numbers, it's always a good idea to ring that number immediately, also perhaps in order to check whether the number's owner (e.g. female, diverse?) committed an oral communication error (e.g. in the above-mentioned grocery situation). But of course, you can also script the info the other way, from the hand-held back to your (more or less "stationary") Windows device.
As for two-way: I just read in some forum that between Scrivener and the .fdx format (Final Draft and others), it's possible to transfer your data forth AND back, including comments (i.e. "ScriptNotes"), and while I do this from UR to FD and FI one-way only, it obviously comes in enormously handy to have this two-way and out-of-the-box. Pay attention though: most web comments re Scrivener refer to the Mac version, not to the heavily crippled Windows version, so the above probably doesn't apply to the latter…
_______________
Now for wikipedia dumps. You might prefer to look it all up "online", but in good ol' Europe at least, more and more countries are currently falling back to third-world standards, and governments are thinking about heavily taxing "traffic" - not the traffic of human beings, mind you, but the electronic one - so having "your" data at home, together with some good, heavy batteries, might come in handy for quite everybody soon. Whatever:
First, the "national" dumps-without-pics are near 30 GB, and the English one is about 50 GB - "download while you can!", hehe!
Then, you will have difficulty finding the "necessary" software to handle such data. Speaking for the Windows Club here only: there are some "XML editors" out there, with prices near (or, incl. VAT, reaching) 4 digits - and you would prefer an "XML database" anyway?
Now there are several of those, even free ones, but then try to get the wiki data into them, let alone the "necessary" indexing by the different "page" elements… but is that necessary, in the end? Good luck to you; I failed, since I'm not willing to spend a fortnight on that "problem". And yes, there is an out-of-the-box wikipedia db, called "WikiTaxi", whose developer really knows what he does: e.g. its size (after trouble-free import) is just a fraction of the original dumps (about 35-40 percent), and the page title strings are indexed, so that specific search is instantaneous.
Unfortunately, the WikiTaxi developer knows "too well" what he does, since - my assumption only - he deliberately (?) discarded any possibility to select and to copy (and there's no comment functionality either). Import the dump into your SQL or other general db then, with full-text search built up upon import by SQLite or by the db's maker? That's possible in theory indeed, via the "work-flow" that follows, i.e. your general db (UR in my case) will probably offer automated import of the text files within a folder (and its sub-folders), with every file (i.e. originally: wiki "page") becoming an "item".
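(For completeness, a minimal sketch of that SQLite full-text route - FTS5, which current SQLite builds include; the table and column names are my assumptions:)

# Sketch: build a full-text index over the wiki "pages" upon import.
import sqlite3

con = sqlite3.connect("wiki.db")
con.execute("CREATE VIRTUAL TABLE IF NOT EXISTS pages USING fts5(title, body)")

def add_page(title: str, body: str) -> None:
    # call this once per extracted "page", then con.commit() at the end
    con.execute("INSERT INTO pages (title, body) VALUES (?, ?)", (title, body))

# afterwards, hits are near-instantaneous, e.g.:
# con.execute("SELECT title FROM pages WHERE pages MATCH 'bergman'").fetchall()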
On the other hand, importing multiple millions of files, together 100 or more GB, into your (even db-based) "outliner" would be an incredible "stress test" for it - and I'm not even speaking of the "answer times" after import, or of the fact that in some months you might want to "update" your dump, i.e., technically, to download and then process a brand-new one. The maintenance of this in a Postgres-backed "outliner" would be hassle-free, but there is no such thing, and my experience with SQLite ("outliners") makes me avoid that adventure before even trying.
Hence: you will split those multi-GB dumps into single "pages" again, one file per "page", i.e. you will get multiple millions of files, necessarily spread over a set of (just numbered) (sub-)folders, each one containing perhaps 2,000 to 5,000 files (up to 5,000 each is reasonable in NTFS; modern Macs have some other file system whose characteristics I don't know, but as said, I'm describing the Windows work-flow here anyway).
You would have, for example, d:\w for wikipedia, then d:\w\f for the French wikipedia, and in there d:\w\f\1 … d:\w\f\400, each with 5,000 files, 1.txt … 5000.txt (or .xml or .w or whatever you like; you then set, in Windows, a default "app" for that suffix, for "Enter" on the file-system entry). Instead of 1…5000 in every one of the 400 folders, you might get 1…2000000 instead, depending on your script or on the (free or paid) tool you use, or you might create 1,000 folders in d:\w\f, each with just 2,000 files, whatever.
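For illustration, a rough sketch of that numbering scheme (Python; the base path and the per-folder count are just the example values from above):

# Sketch: map a running page counter to d:\w\f\<folder>\<n>.txt,
# 5,000 files per folder (400 folders for ~2,000,000 pages).
import os

BASE = r"d:\w\f"
PER_FOLDER = 5000

def target_path(page_no: int) -> str:
    # 1-based running page number -> folder number and file name
    folder = (page_no - 1) // PER_FOLDER + 1     # 1 .. 400
    file_no = (page_no - 1) % PER_FOLDER + 1     # 1 .. 5000
    path = os.path.join(BASE, str(folder))
    os.makedirs(path, exist_ok=True)
    return os.path.join(path, f"{file_no}.txt")

# e.g. page 12345 lands in d:\w\f\3\2345.txt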
Now, how to split? I have not found any (even paid) tool which, instead of just numbering the files, fetches the page titles and then names the files accordingly, be it with additional numbering or even without; in fact, any worthwhile numbering would be by the page IDs anyway. Btw, the page titles may contain chars not allowed in file names, so your script would have to replace those accordingly before trying to create the files. Also, none of the tools will delete the leading indentation spaces contained in the dumps, which technically are not needed for the xml construction - let alone discard unwanted metadata like redactors, revisions, etc. Your own script could delete those easily, since that's the "beauty" of well-formed xml: you just delete all lines between and including the opening and closing tags of the elements you don't want, e.g. if you want your "output" text somewhat "neater".
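A rough sketch of the file-name replacement just mentioned (Python; the replacement character and the fallback name are arbitrary choices of mine):

# Sketch: turn a wiki "page" title into a legal Windows/NTFS file name.
FORBIDDEN = '<>:"/\\|?*'

def sanitize_title(title: str, replacement: str = "_") -> str:
    cleaned = "".join(replacement if c in FORBIDDEN or ord(c) < 32 else c
                      for c in title)
    return cleaned.rstrip(" .") or "untitled"

print(sanitize_title('What is "AI"? A:B/C'))   # -> What is _AI__ A_B_C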
Thus, from the (paid or free) tools available, you'll get several millions of "page" files per dump, just numbered (or, in some cases, named a, aa, aaa, aaaa, aaab, etc. instead of 1…n), all of them with all sorts of "content" parts which you may not be interested in, and with leading spaces before the tags... You then run your own "cleaning", and especially "meaningful-title", script on those millions of files (outer loop for the folders, intermediate loop for the files, inner loop for the lines… and then, finally, innermost loops for replacements within some lines, etc.). Technically, this is no problem at all, but this "work-flow" means writing millions of files (by the tool), then opening, changing, and saving millions of files again, one by one (by your script). (Some of the wiki pages being titled identically, you will need some lines of additional code, checking whether the intended file already "exists" (in that sub-folder) as a homonym, and then adding "order numbers" (i.e. 1, 2, 3…) if necessary.)
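The homonym check from the last parenthesis could look roughly like this (Python sketch; helper name and numbering scheme assumed):

# Sketch: if a file with the intended title already exists in the target
# sub-folder, append a running order number (1, 2, 3, ...).
import os

def unique_path(folder: str, title: str, ext: str = ".txt") -> str:
    candidate = os.path.join(folder, title + ext)
    n = 0
    while os.path.exists(candidate):
        n += 1
        candidate = os.path.join(folder, f"{title} {n}{ext}")
    return candidate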
Thus, needing your script anyway… why not do it "better"? Ideally, you would run a script over your 50 GB dump directly, reading line by line and creating the necessary, already-"cleaned" files, and that might be possible indeed. I, using AutoHotkey, cannot do that, since the smartest = fastest and most reliable ways of doing this there only allow reading into variables (i.e. not processing files directly) of less than 1 GB, forcing me to begin by splitting the dumps - not in AutoHotkey - into multiple such chunks.
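In a language without that variable-size limit, the "ideal" one-pass approach could look roughly like this (Python sketch, not my AHK script; the dump path is made up, and title sanitizing / homonym handling are as sketched above):

# Sketch: read the dump line by line, cut it at the <page>...</page>
# boundaries, take the <title> line for the file name, strip the leading
# indentation, and handle one "page" at a time - the whole dump is never
# loaded into memory. Assumes the dump's usual one-element-per-line layout.
import re

DUMP = r"d:\w\frwiki.xml"    # assumed location and name of the raw dump

def extract_pages(dump_path):
    # Yields (title, page_text) pairs while streaming through the file.
    buf, title, in_page = [], None, False
    with open(dump_path, encoding="utf-8") as f:
        for line in f:
            stripped = line.lstrip()
            if stripped.startswith("<page>"):
                in_page, buf, title = True, [], None
            if in_page:
                if title is None:
                    m = re.match(r"<title>(.*?)</title>", stripped)
                    if m:
                        title = m.group(1)
                buf.append(stripped)          # leading spaces already removed
            if stripped.startswith("</page>"):
                in_page = False
                yield title or "untitled", "".join(buf)

# for title, text in extract_pages(DUMP):
#     ...sanitize the title, pick the target folder, write the file...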
With a paid tool, you can do exactly that splitting: set a limit of less than 1 GB, then have the tool put "enough" complete "pages" into each chunk to come as close as possible to that limit without exceeding it. For a 50 GB dump you'll thus get 51 chunks, and then you run your script on those 51 files, similarly to the description above - it's just 51 source files to read line by line, instead of millions.
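What that paid splitter is described as doing - filling each chunk with complete "pages" up to a size limit - could be sketched like this (again Python; names and the exact limit are assumptions, and non-"page" header lines are simply dropped here):

# Sketch: fill each chunk with complete <page>...</page> blocks until the
# next page would push it over the limit.
LIMIT = 950 * 1024 * 1024          # stay safely below the 1 GB constraint

def split_dump(dump_path, chunk_prefix="chunk"):
    chunk_no, chunk_size = 1, 0
    out = open(f"{chunk_prefix}_{chunk_no:03}.xml", "w", encoding="utf-8")
    page, in_page = [], False
    with open(dump_path, encoding="utf-8") as f:
        for line in f:
            if line.lstrip().startswith("<page>"):
                in_page, page = True, []
            if in_page:
                page.append(line)
            if line.lstrip().startswith("</page>"):
                in_page = False
                size = sum(len(l.encode("utf-8")) for l in page)
                if chunk_size and chunk_size + size > LIMIT:
                    out.close()
                    chunk_no, chunk_size = chunk_no + 1, 0
                    out = open(f"{chunk_prefix}_{chunk_no:03}.xml", "w", encoding="utf-8")
                out.writelines(page)
                chunk_size += size
    out.close()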
You can do something similar with free tools, but among them I haven't found any that does it as well. The ones I found will either set the limit by lines (which risks exceeding the size limit - here, for AutoHotkey, 1 GB) while fetching complete "pages" (as the paid tool above does), or will set the limit by size (here 1 GB) but then fill up the chunks with as many lines as fit, without taking care to split only after a complete "page". Thus, in the first alternative, you have to set the line limit low enough not to exceed the (not settable) size limit (which multiplies the chunks), and in the second alternative, you get the minimum number of chunks, but your script code is somewhat "complicated" since your "page" loop crosses the (chunk) file loop.
So I had settled for the first alternative, and the free split tool split 23 GB into 55 chunks in less than 10 minutes (on an HDD). Then I ran my (up to now, just "cleaning, analyzing and target file creating") script on one of those chunks (i.e. 450 MB read into a variable, then working from that variable), and it neatly created 110,000 correctly processed and meaningfully named files (in 55 sub-folders of 2,000 files each), again in less than 10 minutes. I'll now write the complete script, for the situation where the split tool creates, for 23 GB, just 24 chunks, with "pages" overlapping two chunks; processing the whole 23 GB will then take about 7 to 8 hours, not counting the 10 minutes for the chunk creation. (The "pages" in the dumps are not alphanumerically ordered, and the "pages" within the first chunks tend to be much "longer", i.e. those contain, at equal size, many fewer "pages" - hence the high "page" number mentioned above for a "later" chunk: 110,000 for "just one" out of, in fact, 52 "and a half".)
Then, the big moment - as said, I have already created 110,000 ("final", not "dummy") files like that, and "trialed" them: all the power of Voidtools' "Everything" (even from the command line if needed, incl. regex and all) is available on these - combined or distinct - "sets", for their file names = "page" titles. And if you buy some indexing search tool (in which case you must name or rename your files to ".txt", or even ".xml" - the latter probably in order to take advantage of the tool's xml categorization functionality?), even the files' contents will be indexed, i.e. available instantly.
I think I'll be happy with "Everything"'s power over the meaningful titles (i.e. file names), with which not only "stored searches" are possible, but also the building of your own "collections", e.g. by automatic "renaming", i.e. bulk-adding some "collection" code (e.g. " .ar") to just the currently selected (sic!) files within any search result.
Needless to say, for "people on the road" (nowadays: "active people"), any laptop or even slate will do (even a 500-bucks one - "Everything" is that fast!), but Windows it must be: Applers' mileage shall vary.
Posted by Cyganet
Jul 22, 2022 at 11:30 AM
For the use case you’re describing, I think I would try Easy Data Transform. Available for both Windows and Macs.
Posted by 22111
Jul 24, 2022 at 01:21 PM
At Cyganet: You're right, there are many more powerful tools than just more-or-less crappy AHK; unfortunately, that's the only scripting language & tool I can currently come forward with, so I can only describe the - successful, albeit inferior - Windows "work flow" I use: GSplit 3 first, then - successfully - my AHK script. E.g., most "splitters" (incl. the much-touted HJSplit) just split by size, regardless of line ends / returns, and for the much-touted awk - on Windows: gawk - I don't even find, after 30 minutes of searching among all those crap alleged "download" "links", a single Windows download link for some "executable" - obviously, the vast majority of "coders" just try to shine, instead of trying to share.
My way of "downloading", and of "making useful", at least works for the layperson (having followed my previous AHK links), with, obviously, some speed concessions; and I'm perfectly aware of the fact that UNIX has got much better tools, e.g. for avoiding the chunk creation altogether - "mmap" being the "core search term" here - but then, I can't delve for a fortnight into these "just-speed" considerations - if you've got some links-plus-code to share, I'll owe you.
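For what it's worth, a minimal illustration of that "mmap" idea with Python's standard mmap module (file name assumed): the OS maps the chunk into memory, so the substring search runs without manually reading the file.

# Sketch: memory-map a chunk, then search it directly.
import mmap

with open("chunk_001.xml", "rb") as f:                       # assumed file name
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        pos = mm.find(b"<title>Ingmar Bergman</title>")
        print("found at byte", pos if pos != -1 else "(nowhere)")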