Word frequency counters and some concordancers (a prerequisite for smart text expander = autocompleter usage)
Posted by 22111
Oct 26, 2013 at 02:00 PM
1)
In order to do text expanding, it’s a good thing to run a word frequency counter on typical texts first, to identify
- the real, relative frequency of all those short words we all use by the thousands
- the typical words in such typical texts, i.e. those 10, 20 or 40 “special” words it would make sense to enter in an abbreviated form here. That said, depending on your work, you might not need to identify such different terminology groups, because you use the same “special” terminology in all of your texts… or both are true: there will be a base group of words of yours, and then some more typical terms for different text sub-groups; as said, good text expanders (and AHK, especially) permit combinations of several such abbreviation groups, i.e. several abbreviation files running concurrently if you wish.
Also, it’s important to know that those word frequency list tools are totally different from word count tools, which don’t build lists but help translators bill their customers per translated word. For this different software category, there are also many offerings, some showing clemency to the customers, while others do not (but tend to inflate the “word” count).
Also, all prices are plus VAT, here in Europe, so a price of 40$ comes to 40€ for example, which is a real nuisance. In ancient times, many vendors sold directly, without bothering about these awful taxes (then directly sent to people who live on us, e.g. around the Mediterranean basin and in Brussels); today, they almost all make you pay through payment services that take those taxes from you, then (perhaps) send them to your respective tax collector - this fact very often makes me refrain from buying software, even if it’s “only” 20$, since for your feeling of being stolen from, 20$ equal 200$.
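For anyone who would rather script this than install a tool, here is a minimal sketch of what a word frequency list boils down to. It is not how any of the programs discussed in this thread work internally; the function name and the sample sentence are made up for illustration.

```python
import re
from collections import Counter

def word_frequencies(text):
    """Return a Counter mapping each lowercased word to its number of occurrences."""
    # [^\W\d_]+ matches runs of letters only (umlauts and ß included).
    words = re.findall(r"[^\W\d_]+", text.lower())
    return Counter(words)

# Made-up sample; in practice you would read your exported text file instead.
sample = "Die Maus und der Hund und die Katze und das Haus"
freq = word_frequencies(sample)

# Print the list sorted by descending frequency.
for word, count in freq.most_common():
    print(count, word)
```

To run it on a real export, replace the sample with something like open("export.txt", encoding="utf-8").read(), where "export.txt" is just a placeholder file name.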
2)
So you need a word frequency counter, or a concordancer (most, or all?, of which also let you create just a word frequency list). I have to add that my text, exported from my outliner, is plain text of some 2,1 million bytes and some 340,000 words (this latter number heavily depends on the tool used), so I didn’t test the tools with some tiny texts, but with a real-world example.
From google, you’ll get some real crap first if you leave out the “frequency” part of your search term.
There is “Word List Expert”, 15$, one of the worst programs I have ever installed on my system. Sometimes, the creation of the word list takes just 10 minutes, and at other times, even after 30 minutes, the progress bar just shows about 15 p.c. “progress”... and if you get the word list, it will only show the very first 780 words or so (on my system), and there are big differences from what I get from other tools. Except for viruses, I’ve never been so happy as when I de-installed this piece of software.
There is “Word List Creator” and / or “Word List Maker”, both from the same developer, I suppose, but available screenshots show some differences. There are several “buy” sites for them (and one even with 30 p.c. off), but there is not a single download site for either of them (perhaps there is, but I searched for about 30 minutes or so), and all their “original” sites are unavailable: http://www.wordlistcreator.com and http://www.wordlistmaker.com and http://www.mysoftwarefactory.net - There’s also “Word Sorter”, 10$, from the same developer, which cannot be found either. I did not try to buy, in anticipation of big problems with the second part, “immediately after payment, you’ll get the direct download link”, when all those trial download links were 404’s…
There is “Word List”, from “www.i496.net”, cannot be found.
There is WordMetry (29$ or 25$), from http://guoshesen.51.net/ - another 404.
I did not try “Crunch Wordlist Generator” (from sourceforge), “WordList 1.0” (free), “Free Wordlist Generator 1.9” (free), “WordList Generator” (free, from sourceforge), “Word List Compiler” (free, from LastBit Corp).
I did not try “Translator’s Abacus”, I did not try “WordCounter for MS Word” (20$, from Editorium), but both seem to be serious offerings.
3)
There is “Word Frequency Count Software 7.0” from Sobolsoft - their specialty is to have about 1,000 applications, since they systematically cut up their software into minimum scope: e.g. a word frequency tool for Word texts, for .txt texts, for .pdf texts, and so on, so you buy 10 times what should be the same software (for anything else, they do the same slicing-up). I never bought anything from them, and I never will.
There is http://www.seasite.niu.edu/trans/wordfrequency.htm - did not try.
There is MyWordCount, from mywritertools.com (15$) - does not seem to be bad, but there is no trial, and what do I know about software perhaps choking my real-world examples like the above if no trial is permitted?
There is “Word Frequency Counter” (free or 20$) from wordfrequencycounter.com - another 404; there is also a beta from 2009 under wordfrequency.codeplex.com - did not touch it.
There is WordStat (see below), from provalisresearch.com - well, this is 3,000$, and they also offer other qualitative analysis software and such, QDA Miner and more (academic prices are about 600$).
4)
There are some free concordancers from some universities, and which also do just word frequency lists:
There is “Abundantia Verborum”, from the university of Louvain / Leuven, Belgium. If you ask for access codes, they will perhaps send them to you; I didn’t bother to do so. ( wwwling.arts.kuleuven.ac.be/genling/abundant )
There is “AntConc”, seemingly from a British scientist at some Japanese university - http://www.antlab.sci.waseda.ac.jp/software.html -, and it’s quite well known, but I didn’t have real success with it: From its word list tab, it builds up the word list in just some seconds, but then takes 100 p.c. of my processor, for hours (!), and whilst other applications crawl (but ain’t entirely dead, as is AntConc), I’m unable to scroll the word list, so I only get the very first 39 hits (with XP and 2 gigabytes of memory, most of it free). Perhaps on your system it works correctly, and then it would be a very good choice for a free program; it does a lot of things (if you can get it to work).
You’ll have “Kwic Concordance”, from another Japanese university, http://www.chs.nihon-u.ac.jp/... - not tried but seems to be a serious offering; 5.0 is for XP, 5.1 being for more “modern” operating systems.
There is TextStat2 (not to be mixed up with WordStat above), from Freie Universität Berlin, and I ended up using this program since it’s similar to AntConc (but with some early “difficulty”), and didn’t choke on 2,1 million characters but put out the word list after just several seconds. The “difficulty” lies in its GUI: For the file to be analyzed, there is no extra pane, but you’ll put it into the main pane which afterwards will contain your list, once you trigger the list building command. So this is not very intuitive, but it works perfectly. On the other hand, I got some false positives with it, e.g. “tten” with 140 occurrences, when in fact, in the text, this “tten” is not a word, but just a part of words like “hatten”, “hätten”, etc. - perhaps my settings weren’t correct. Still, my first 39 hits were more or less identical to those from AntConc, with only minor differences in frequency, so for creating better text expansions than just from my “guesses”, as before, I’ll now use TextStat2… AND also one of the programs below!:
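A side note on the “tten” false positive: I cannot say what TextStat2 does internally, but one plausible way such word fragments arise is a tokenizer (or a wrong encoding setting) that does not treat umlauts as letters, so that “hätten” gets cut at the “ä”. This would not explain a split of “hatten”, so take it as one possible cause only. The sketch below, with made-up function names, shows the difference between the two ways of splitting.

```python
import re

def ascii_tokens(text):
    # Treats only unaccented a-z / A-Z as letters, so "ä" acts as a separator.
    return re.findall(r"[A-Za-z]+", text)

def unicode_tokens(text):
    # Treats any Unicode letter (including umlauts and ß) as part of a word.
    return re.findall(r"[^\W\d_]+", text)

print(ascii_tokens("sie hätten"))
print(unicode_tokens("sie hätten"))
```

With the first function, “hätten” falls apart into “h” and “tten”; with the second, it stays one word. If your list shows such fragments, the file encoding and language settings of the tool are the first thing I would check.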
5)
Now for some commercial offerings:
“Hermetic Word Frequency Counter”, from Switzerland, 40$ or, as “Advanced Version”, 60$. If you often need such a tool, get the advanced version (don’t buy the regular version; those 20$ more (plus tax, of course…) are really a good investment): you’ll have lots of options, and it’s in continuous development... but this might be considered a problem, since there will then be frequent, paid updates… For both versions, there is “PayPal 5 p.c. off”, which is very original, since PayPal is rather expensive for both sides (for the customer, too, because of their bad exchange rates), so in many cases, PayPal payments bear a surcharge, not a rebate. (They also offer 1-year licenses and 3-month licenses, another rare thing but devoid of sense, since either you need such a thing “every day”, or you’ll make do with free offerings like I do here.)
6)
Textanz (40$) from http://www.textanz.com - not tried. Looks like a serious offering, but then, those 40$ end up as 31,02€ plus VAT (with the € currently at 1,38$!!! So it always comes to “dollar to euro 1:1”, and I HATE THIS TO THE POINT OF NOT BUYING ANYMORE BUT IN CASE OF ABSOLUTE NECESSITY), and Franz Grieser wrote here:
http://www.outlinersoftware.com/topics/viewt/1560/15
Posted by Franz Grieser
Aug 27, 2010 at 12:48 PM
Hugh
>Curiously the single piece of pure desktop
>text-analysis software that I’d previously heard of, Textanz, isn’t mentioned in
>the list on the second link. Textanz is aimed at writers, not corporate or
>governmental data-miners, and is on the PC platform. It was last updated in 2009; a
>note on its website placed in 2010 says that a cross-platform Java version is being
>developed, but as the note spells its own software “Textans”, I don’t have high
>hopes!
I wouldn’t count Textanz as a text-analysis tool. What it does
- create a concordance (= a list of words used in a text)
- create a list of frequently used words and phrases
- show how often each word is used
- show where in the text a selected word is used
That is pretty useful for writers: I use the tool for checking my texts for repetitions.
But it’s not really what I consider text ANALYSIS.
Just my 2 cents.
Franz
7)
Which brings us to SmartEdit, also commented on here. Somebody (Delyannis) said it hadn’t left its free version, which is not true, but both a good screenshot and the free version are well hidden. Here’s a comparison: http://www.smart-edit.com/compare-smartedit.html and at the bottom of the page, there’s the download link for the lite version. Install it, then press (in the ribbon, unfortunately!) “Help”, which again brings you to the developer’s site, and here you’ll finally have a good screenshot of the program: http://www.smart-edit.com/help.html?free (Of course, this link works from here, too.)
Now this lite version is really crippled: There is no way to export your word frequency list, except by screenshot, scroll, another screenshot…
But for what I need this for, I’m happy with what I get, and I can sort (as in most other programs) both by frequency and alphabetically (the list is called “Repeated Words”). Here again, it’s not too intuitive: first, click on “Options”, and then, in “Select Checks to Run”, de-select everything except for “Run Word Frequency Counter”, click “OK”, and only then click on “Run Checks”. Then, the program just takes some seconds, and you’ll get your word frequency list, and I cannot tell you HOW INSTRUCTIVE such a list will be: 8,000 “die”, 7,000 “der”, 6,500 “und”, 3,000 “das” / “nicht” / “ist”, 2,700 “zu” / “den”, 2,500 “sie” (attention: TextStat2 says, 1,500 “Sie” and 1,000 “sie”!), 2,300 “auch”, 2,100 “mit” / “es” / “oder” / “sich”, 2,000 “ein”, 1,900 “auf”, 1,700 “für” / “im”, 1,500 “des” / “eine” / “dass”, 1,400 “aber” / “dem” / “als” / “wenn”, 1,200 “bei” / “nur” / “sind”, 1,100 “er” / “werden” / “dann” / “man” / “wird”, 1,000 “wie” / “vor”, etc., etc. - you’ll understand that such a list is TOTALLY different from what I expected my most frequent words to be, and this explains why I always have been so unhappy with my abbreviations: As soon as you get the “real numbers”, you can tweak them accordingly, so that they finally become really useful to you!
But back to SmartEdit, the paid version: It’s 50$ for normal people, and about 50€ for us EU-subjects/-slaves, and this passage from the license, “SmartEdit is new software, first released in December, 2012. Though minor upgrades are free to licensed users, the software is sold as is, for what it does today, not for what it might do in the future.” has given rise to more than one comment here and elsewhere. This being said, it has a lot of fine, potentially useful features, but I’m not sure at all it’s a concordancer, i.e. that it also shows the hits, from the list, within their respective context!
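The “Sie” / “sie” difference between the two tools looks like a matter of case handling: one list apparently folds upper and lower case together, the other keeps them apart. That is my reading of the numbers, not something either vendor documents here. A small sketch with a made-up sample and a hypothetical fold_case switch shows the effect:

```python
import re
from collections import Counter

def count_words(text, fold_case):
    """Count words, optionally treating "Sie" and "sie" as the same entry."""
    words = re.findall(r"[^\W\d_]+", text)
    if fold_case:
        words = [w.lower() for w in words]
    return Counter(words)

sample = "Sie sagen, dass sie kommen. Haben Sie Zeit?"
exact = count_words(sample, fold_case=False)   # keeps "Sie" and "sie" apart
folded = count_words(sample, fold_case=True)   # merges them under "sie"

print(exact["Sie"], exact["sie"], folded["sie"])
```

For abbreviation planning, the folded count is probably the more useful one, since a text expander abbreviation usually covers both spellings.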
8)
As does “Concordance”, from http://www.concordancesoftware.co.uk/ (87$ plus VAT), so perhaps you’ll need both in the end if you really need them to look into your texts before publishing. On the other hand, the main purpose of SmartEdit seems to be to VARY your terminology, when in fact, it doesn’t help you with CONSOLIDATING, so its appeal to non-fiction writers is greatly reduced by this choice of scope, at this moment in time at least - development in just this very first year of its lifetime has been impressive, though.
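To make the difference between a plain word list and a concordance concrete: a concordancer shows every hit of a word together with some text to its left and right (often called “keyword in context”). The following is only a bare-bones sketch of that idea, with a made-up function name and a fixed character width for the context; real concordancers offer far more.

```python
import re

def kwic(text, keyword, width=20):
    """Return one line per hit of keyword, with `width` characters of context on each side."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for m in pattern.finditer(text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Right-align the left context so that all hits line up in one column.
        lines.append(f"{left:>{width}}[{m.group()}]{right}")
    return lines

sample = "Die Maus sah die Katze. Die Katze sah die Maus nicht."
for line in kwic(sample, "Katze", width=12):
    print(line)
```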
9)
There are other concordancers, sometimes with 2-year licences only, from Athel/Athelstan, or then, they have got a word list, and a collocation list, but not in the same frame (e.g. tlCorpus Concordance 6.0, 55$: so you’ll have to switch back and forth between these two views).
And then, there are aligners, but that’s another story: Aligners have always fascinated me since we also discussed text atomization - which they automate, for translation purposes -, and I’m wondering if some of their concepts could be transposed to resolve similar problems - cross-referencing between and/or gathering of paragraphs all over the tree - in outliners (especially the semi-AI found in expensive aligners: not the free/200$, but the 1,000$ variety).
As it is, I highly recommend SmartEdit Lite for building up frequency lists for your autocompleter files, and if you need to export those lists, check out the university offerings mentioned above, especially TextStat2 (or AntConc if it doesn’t choke on your system, too).
Posted by 22111
Oct 26, 2013 at 02:47 PM
I just discovered that the SmartEdit word list only goes down to 5 occurrences of a given term (i.e. it only lists terms with at least 5 occurrences), which means that contrary to the (free or paid) concordancers, you can NOT use it for checking for typos. And of course, all these programs list singular/plural forms (in English, let alone German plural forms like Maus/Mäuse, or then, declension forms of words, nominative die Maus, genitive der Maus, but nominative das Haus, genitive des Hauses, and a myriad of others) - so much for “text analysis”: some better algorithms and/or prefetched word lists would be needed for such simili-AI here. So don’t get too excited and expect what such programs can NOT deliver yet.
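Why the cut-off matters for typo hunting: a typo tends to occur once or twice, so it sits at the very bottom of the frequency list, exactly the part that is dropped when only terms with 5 or more occurrences are shown. A minimal sketch (function name and sample are made up) that lists only those rare words:

```python
import re
from collections import Counter

def rare_words(text, max_count=1):
    """Return, alphabetically, the words occurring at most max_count times."""
    counts = Counter(re.findall(r"[^\W\d_]+", text.lower()))
    return sorted(w for w, n in counts.items() if n <= max_count)

sample = "das Haus, das Haus, das Hasu"
print(rare_words(sample))
```

In a real text, most of the rare words will of course be perfectly good words; the list only narrows down where to look.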
Posted by 22111
Oct 26, 2013 at 02:49 PM
“list singular/plural forms” meant “list singular/plural forms distinctively, instead of consolidating them by option”.
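What “consolidating by option” could look like, in its crudest form: a lookup table that maps inflected forms to a base form before counting. The table below is hand-written and purely hypothetical; a usable version would need a real word list or a proper lemmatizer for the language in question, which is exactly the “prefetched word lists” part.

```python
import re
from collections import Counter

# Hypothetical, hand-made lookup table: inflected form -> base form.
BASE_FORMS = {"mäuse": "maus", "hauses": "haus", "häuser": "haus"}

def consolidated_counts(text, base_forms=BASE_FORMS):
    """Count words after replacing known inflected forms by their base form."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    # Words missing from the table are counted as they are.
    return Counter(base_forms.get(w, w) for w in words)

sample = "die Maus, die Mäuse, das Haus, des Hauses"
counts = consolidated_counts(sample)
print(counts["maus"], counts["haus"])
```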
Posted by 22111
Oct 31, 2014 at 01:03 PM
This is just for adding another aspect to this monologist thread: Word lists ain’t everything, even with context (concordance). (And within parentheses: Some people really try very hard, and deliver lots of value to their customers, for quite modest sums of money over the years, Zoot being an outstanding example of this way of treating your customers as real friends. And other developers seem to work the easy way, offering lots of lists and such, with lots of code recycling, but also with very frequent paid “upgrades” that finally do not bring anything really new AND exceptional, and where, paid upgrade after paid upgrade, the really needed things continue to be missing. Anyway, my comment on SmartEdit for MS Word over at bits will certainly put into perspective some of the alleged utility of easy-to-code list fabrication:)
http://www.bitsdujour.com/software/smartedit-for-word#comments94403
Some remarks.
SmartEdit (not for MS Word) had been on offer some time ago; there also was a cut-down free edition available, which does not seem to be available anymore.
I quickly became disenchanted with my trial; in fact so fast that I even had time to decide I finally did not want to buy even at the very low price for the full version, and afterwards, I never used the free version either - but this is all me, your mileage might grossly vary.
It’s understood SmartEdit does not offer spelling correction, so you need some dedicated tool for that, or you need that functionality to be integrated within your main program; fortunately this is the case with MS Word.
At the time, the developer spoke of his intention to cut his program into two: One for helping editing fiction and such, and one for textbooks and such; this would have been devoid of sense. So he obviously cut his program into the stand-alone tool, and an MS Word add-in, which obviously DOES make a lot of sense, not only commercially, but also for the user, three quarters or more of prospects certainly preferring the MS Word add-in now.
From the stand-alone version, though, you can deduce that “major” (which means: paid) updates are quite frequent, so keeping your version up-to-date could become quite onerous, and here it becomes evident that developer’s and users’ interests might not be entirely consistent in the long run.
Anyway, most/lots of SmartEdit’s functionality (both stand-alone and MS Word add-in) is about list making, which from a coding point of view is rather easy, but which for editing is not as wonderfully helpful as you might think beforehand; on the other hand, integrating more interactivity would mean lots more of coding effort, so…
From two very recent (Sept and Oct, 2014) discussions over at donationcoder.com (just search their forum for “smartedit”) re SmartEdit I deduce that even now, some major = paid updates after 1.0, the standalone version (and by extension, the MS Word spin-off) does not yet offer any functionality that will check for repeated use of meaningful words (i.e. instead of either using a synonym or rephrasing the wording altogether) within a sentence or two or three consecutive sentences, i.e. some sort of “progressive vicinity search” for words (let alone radicals: such a function would signal, e.g., different forms of the same verb following each other, as presumably unwanted) from word 1 of a text down to the last word of its very last sentence, for each of them checking any other word / phrase in the current sentence and in the next two sentences for possible unwanted repetition - hint: the coding would not be too difficult here either, even though such functionality would indeed comprise some exception tables, and even individual ones for this subject range or that one (in medical texts, for example, it’s evident that the very same Latin expressions would be repeated over and over again, without such a tool harassing you with “Do you really want to use this term again here?” dialogs).
Also, it’s clear as day that such functionality would be so much more handy in an add-in tool like the one on offer here, than in a stand-alone tool, since only the add-in version (if correctly programmed that is) would enable this check functionality in real-time, i.e. within Word when you type the original text, and yes, mid-term, there should be some integrated learning functionality grasping and correctly processing your reactions to these “Do you really” pop-ups for further interaction with the user; for example, there could be an “intermediate” reaction from the program, which for some things would not pop the dialog up anymore, but would indicate some POSSIBLY unwanted repetition by highlighting the previous occurrence of some word or phrase with some background color for some seconds or until your typing moves past the current sentence.
Thus - but that’s just me - it seems to me that both/all versions of SmartEdit, of all possible, imaginable functions out there, lack the most important, needed and helpful one of them all, this word / phrase repetition check, the overall word frequency list (where the frequency of 1 is left out btw, or has it been integrated in the meanwhile?) very probably being a useless gimmick for most users (or then, tell me why we would need it in real life, except perhaps for 1 user in 100?).
As it stands, I thus do not see much interest for me in this program; for MS Word users, this would be quite different indeed once SmartEdit got the above-described, so much needed functionality. Btw, and as for text expanding, such functionality should be integrated not within an MS Word add-in, but within some dedicated tool, running in the background of ANY word processor and other text program (outliners, screenplay software, and so on), just as any text expander does; technically, it would be quite easy for any such expansion / macro tool running in the background anyway to be spiced up this way: It would just need its own, dedicated memory for the very last 3 sentences, and it would do its repetition check within that memorized text; accordingly, whenever you edit some other, non-current-writing text, the tool would automatically fetch the sentence you edit, plus two sentences before and after, and work on that text compound.
In a word, it’s clear as day that “run-time” text editing (i.e. in the course of actual writing) is even more useful than after-the-fact editing (especially because corrections made while writing are much less time-demanding than the same corrections made afterwards, and also because a learning tool would not even present you repetitions and such anymore as possible candidates for correction once it has identified them as presumably not unwanted), so the idea to integrate SmartEdit into MS Word is to be followed… but then, please offer more really helpful functionality! Wherever really needed functionality is left out, much less needed functionality becomes auxiliary and sometimes borders on the irrelevant.
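To back up the “not too difficult” hint, here is a rough sketch of such a vicinity check. It is my own illustration, not anything SmartEdit contains: the sentence splitting is naive, the stop word list is a tiny made-up sample, word stems (“radicals”) are not handled, and the exceptions parameter stands in for the per-subject exception tables mentioned above.

```python
import re

# Tiny illustrative list of words that are not worth reporting.
STOP_WORDS = {"the", "a", "and", "of", "to", "in", "is", "it"}

def split_sentences(text):
    # Naive split: a sentence ends at ., ! or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def find_repetitions(text, lookahead=2, exceptions=frozenset()):
    """Return sorted (word, sentence_index, repeat_sentence_index) triples."""
    sentences = [re.findall(r"[^\W\d_]+", s.lower()) for s in split_sentences(text)]
    hits = set()
    for i, words in enumerate(sentences):
        following = sentences[i + 1:i + 1 + lookahead]
        for pos, word in enumerate(words):
            if word in STOP_WORDS or word in exceptions:
                continue
            # Repetition later in the same sentence.
            if word in words[pos + 1:]:
                hits.add((word, i, i))
            # Repetition in one of the next `lookahead` sentences.
            for offset, later in enumerate(following, start=1):
                if word in later:
                    hits.add((word, i, i + offset))
    return sorted(hits)

sample = "The report was long. The report was dull. Nothing else happened."
for word, first, second in find_repetitions(sample):
    print(f"'{word}' in sentence {first} is repeated in sentence {second}")
```

Passing exceptions={"report"} would silence that term, which is the crude equivalent of an exception table for a given subject range.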
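The “dedicated memory for the very last 3 sentences” can be sketched with a fixed-size queue: each finished sentence is compared against the remembered ones, then pushed in, and the oldest one drops out by itself. The class and method names are invented for illustration, the hooking into keystrokes that a real background tool would need is left out entirely, and repetitions inside the sentence being fed are not checked here.

```python
import re
from collections import deque

class RepetitionWatcher:
    """Remembers the words of the last few sentences and reports repeats."""

    def __init__(self, memory=3, ignore=frozenset()):
        # deque with maxlen discards the oldest sentence automatically.
        self.recent = deque(maxlen=memory)
        self.ignore = ignore  # words never reported (stop words, exceptions)

    def feed(self, sentence):
        """Return the words of `sentence` already seen in the remembered sentences."""
        words = re.findall(r"[^\W\d_]+", sentence.lower())
        seen = set().union(*self.recent)
        repeated = sorted({w for w in words if w in seen and w not in self.ignore})
        self.recent.append(set(words))
        return repeated

watcher = RepetitionWatcher(memory=3, ignore={"the"})
print(watcher.feed("The patient arrived."))
print(watcher.feed("The patient waited."))
```

For editing older text, the same class could be fed the two sentences before the edited one first, which matches the “fetch the sentence you edit, plus two sentences before and after” idea only for the backward direction.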
Posted by MadaboutDana
Nov 4, 2014 at 04:24 PM
Thanks for the somewhat rambling but interesting summary of concordancing software.
It’s also worth bearing in mind that most standard computer-assisted translation (CAT) tools have powerful concordancing systems built in (although their developers haven’t yet taken the next logical step and made these concordancing systems the core function of the software).
For Mac users, DEVONthink’s fairly basic concordancing system is an excellent alternative. And although the function is fairly basic, the complementary ‘See also’ and ‘Classify’ (including auto-classify) functions turn this information management tool into a very powerful comparative knowledge base. As does the excellent search engine.
Currently my favourite. All the other wannabes (Together, Stache, Yojimbo etc.) have been discarded. DEVONthink rulez!