Cross Comparison of documents
Started by Wojciech
on 2/22/2020
Wojciech
2/22/2020 9:26 pm
I am sorry for such an off-topic question, but I could not find the answer by myself :)
I have dozens of documents in doc/docx, pdf and txt files. These are my notes, summaries, quotations, full articles, shorter drafts etc. I want to compare them to find repetitions, borrowings, content I used in full or just in part and so on. Of course, there is a Word feature 'Compare documents', but it compares only two documents at a time, while I need to compare each file with all the rest. There is also 'Plagiarism Checker' software that has an option 'Bulk Search', but it only shows the percentage of duplicate content, while I want to know precisely which parts are duplicated and which are not. Do you possibly know a solution to such a problem? Maybe there is an outliner that makes it possible to import all these documents and cross-compare the content of all of them against each other?
I will be grateful for any help and advice.
Andy Brice
2/22/2020 10:52 pm
My favourite file comparison tool is the excellent 'Beyond Compare' by Scooter Software. I think it will only let you compare pairs of documents or pairs of folders, though.
Prion
2/23/2020 8:46 am
If you need to get an overview of similar documents (i.e. one against many), the excellent Devonthink is a strong contender. It does not show you which parts of the documents are similar or identical, but you get a list of documents ordered by their similarity to the document you are comparing against. It does give you a list of terms that contribute to the similarity, although this is only possible after selecting a document in the above list of candidates (i.e. a pairwise comparison).
macOS only I am afraid, there is an iOS companion app but it lacks this "see also" feature.
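To make the "ordered by similarity" idea concrete, here is a minimal Python sketch. It is not Devonthink's algorithm (which is not described in this thread); it simply assumes plain word counts and cosine similarity, and it assumes the documents have already been converted to plain text. The names `word_counts`, `cosine`, `shared_terms` and `rank_similar` are made up for illustration.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Count lower-cased words in a text."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors (0.0 to 1.0)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
        math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def shared_terms(a, b):
    """Words that occur in both texts, i.e. the terms behind the score."""
    return sorted(word_counts(a).keys() & word_counts(b).keys())


def rank_similar(target, docs):
    """docs maps name -> text. Return [(name, score)], most similar first."""
    vectors = {name: word_counts(text) for name, text in docs.items()}
    scores = [(name, cosine(vectors[target], vec))
              for name, vec in vectors.items() if name != target]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

Like the feature described above, this only tells you which documents resemble each other and which words they share, not which passages are duplicated.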
satis
2/23/2020 9:46 pm
You can compare one document to another with the 'diff' tool on the Mac command line:
https://osxdaily.com/2018/02/06/use-diff-compare-files-command-line-mac/
There are apparently graphic front-ends for it:
https://stackoverflow.com/questions/7871702/is-there-any-graphical-binary-diff-tool-for-mac-os-x
Here are some apps that do diff:
https://www.git-tower.com/blog/diff-tools-mac/
I've done diff on text/html documents using BBEdit, which has a free tier; not sure what is or isn't in the free tier, though it's a great app. (Currently $10 off, to $39, with coupon SMILEWORTHY)
https://www.barebones.com/products/bbedit/
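For anyone who prefers a script to the command line, the same kind of pairwise, line-by-line comparison can be sketched with Python's standard `difflib` module. This is only an illustration; the function name `diff_texts` and the default file labels are invented here, and the texts are assumed to be plain text already.

```python
import difflib


def diff_texts(old, new, old_name="a.txt", new_name="b.txt"):
    """Return a unified diff of two texts as a list of lines."""
    return list(difflib.unified_diff(
        old.splitlines(), new.splitlines(),
        fromfile=old_name, tofile=new_name, lineterm="",
    ))


if __name__ == "__main__":
    draft = "First line\nSecond line\nThird line"
    final = "First line\nSecond line, revised\nThird line"
    # Lines starting with "-" exist only in draft, "+" only in final.
    print("\n".join(diff_texts(draft, final)))
```

As with 'diff' itself, this compares exactly two texts at a time.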
Lucas
2/23/2020 10:19 pm
A bit of googling turned up this free (Windows) option, which might be worth a try:
https://plagiarism.bloomfieldmedia.com/software/wcopyfind/
"WCopyfind is an open source windows-based program that compares documents and reports similarities in their words and phrases."
See the instructions page:
https://plagiarism.bloomfieldmedia.com/software/wcopyfind-instructions/
"When the process finishes, a browser window will open, allowing you to examine the pairs of matching files. You can click on the files individually for ease of printing or you can click on the “side-by-side” option to display the pair of file together in adjacent panels of new browser window.
"When you view the files side-by-side, all the matching phrases are actively linked between the two files. If you click on a matching phrase in the left file panel, the corresponding phrase in the right file will move to the top of the right panel, and vice versa."
washere
2/24/2020 12:40 am
Coders know the old fc command well. As for applications, a couple of years ago I tested quite a few file comparison apps, and the best two for me were:
* Araxis Merge
* ExamDiff Pro (Master Edition)
MadaboutDana
2/24/2020 10:35 am
On the Mac, DeltaWalker is a fairly impressive option.
Franz Grieser
2/24/2020 11:57 am
Deltawalker is also available for Windows. Strangely, the website does not mention Windows - except for the Buy and the Download page, where you can get a 64 bit Windows version.
If you use a Mac, you could make use of the new Mac bundle on Bundlehunt and get Deltawalker Pro for $5 (instead of $59): https://bundlehunt.com/?ap_id=macappware
Franz Grieser
2/24/2020 12:02 pm
One more thing: Deltawalker Pro supports comparing 2 documents to a third one (as does ExamDiff Pro). The standard edition only supports comparing 2 documents (as do most of the other file comparing tools).
But if I am not mistaken: That is not exactly what Wojciech is looking for.
Wojciech
2/24/2020 8:57 pm
Many thanks to All for the hints! However, Franz is right - I need to cross-compare multiple documents against each other.
As I mentioned, Plagiarism Checker is able to show repetitive text, but only as a percentage and is only for Windows - I need to do it on a Mac.
xtabber
2/27/2020 12:22 am
I don’t think file comparison software is likely to be very helpful for your purpose.
If you can define the text patterns that you are looking for, even approximately, a good full text search program will allow you to search for them across multiple files and return both the number of hits and the files where the patterns occur.
File Locator Pro and dtSearch will both do that, but both require Windows. There may be Mac solutions available too, but I wouldn’t know about them.
If you want to analyze files to isolate text patterns that you might then want to search for, you should be looking at text mining software. There are many text mining systems targeted to the marketing research and legal professions, but they tend to be both very pricy and complex.
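The full-text search approach (a pattern searched across many files, returning hit counts per file) can be sketched roughly as follows. This is not how File Locator Pro or dtSearch work internally; it is a bare regular-expression scan, and `search_corpus` is a hypothetical helper that takes texts already loaded into a dict.

```python
import re


def search_corpus(pattern, docs, flags=re.IGNORECASE):
    """docs maps name -> text. Return {name: hit_count} for files with hits."""
    rx = re.compile(pattern, flags)
    hits = {}
    for name, text in docs.items():
        count = sum(1 for _ in rx.finditer(text))
        if count:
            hits[name] = count
    return hits


if __name__ == "__main__":
    notes = {
        "draft.txt": "Tacit knowledge matters; tacit knowing is hard to share.",
        "summary.txt": "Nothing relevant here.",
    }
    # An approximate pattern: "tacit" followed by any word starting with "know".
    print(search_corpus(r"tacit\s+know\w+", notes))
```

The catch is the one already mentioned: you have to know, at least roughly, what pattern to look for.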
MadaboutDana
2/27/2020 10:40 am
This is a really interesting issue, I must say.
Translation software uses the “concordance” feature to identify matching phrases in multiple documents, but it wouldn’t really work for the use case you’re describing.
The trouble is, most of the free/cheap concordance software you’ll find online only shows occurrences of individual words or predefined phrases; after that, as far as I can tell, you’d have to tackle the Very Expensive stuff lawyers love to use.
However, this might be a useful starting point. There’s an interesting list of free (Windows) software here, for example: https://listoffreeware.com/best-free-concordance-software-windows/
I’m not sure how many of them work across multiple documents, however. That’s where the Very Expensive analytics software comes in, I fear.
Another option would be to talk to the nice gentleman who developed, but then shelved, a Windows concordance app that was clearly very popular in the day: http://www.concordancesoftware.co.uk
There are a few open-source concordance programs about, it appears, but I have no idea how good they are. The Wikipedia article on “Concordancers” might provide you with some interesting starting points? https://en.wikipedia.org/wiki/Concordancer
Actually, I’ve just found an interesting page on all kinds of concordancers: http://martinweisser.org/corpora_site/concordancers.html
Might be worth a look!
Cheers,
Bill
MadaboutDana
2/27/2020 10:48 am
Alternatively, as xtabber suggests, you could use a good search engine.
My favourite on macOS is FoxTrot Pro, and I use it for trawling through my own corpus of research documents; you can customise indices to a considerable degree if you wish - and it’s not ridiculously expensive. dtSearch is an amazing tool, but only available on Windows (or as a web server).
Cheers!
Bill
MadaboutDana
2/27/2020 11:10 am
Now this looks more like the kind of thing you need (Windows only, unfortunately): https://tshwanedje.com/corpus/
There’s also the respected WordSmith (again, Windows only), here:
And an interesting open-source tool that looks pretty serious: http://www.ddc-concordance.org (available for Windows and Linux).
This might actually be your best bet - development looks lively!
Alexander Deliyannis
2/27/2020 5:54 pm
MadaboutDana wrote:
"Now this looks more like the kind of thing you need (Windows only, unfortunately): https://tshwanedje.com/corpus/"
Apparently it also works on MacOS with some limitations.
"There's also the respected WordSmith (again, Windows only), here:"
I assume you mean this:
https://lexically.net/wordsmith/index.html
Interesting, thanks!
CRC
2/28/2020 2:01 pm
This is a very interesting question. I think what it really leads you to is text mining or text analysis software. This is because you are probably not looking for character by character identical text, but text expressing particular concepts or ideas. You can find a number of these tools with a simple Google search for text mining.
As an example I once wanted to, given a corpus of documents representing proposals for past work by a particular company, find which one of those documents best matched the requirements in a request for a proposal. I experimented with this tool: https://gate.ac.uk/ . It turned into a very interesting and absorbing project, and while it was never used, it showed some real promise.
I will say that going down this road could be long and winding. If nothing else, you will find it incredibly engrossing and, perhaps, rewarding.
Wojciech
2/29/2020 7:11 pm
Dear All, many thanks for all your advice.
@CRC, yes, right, GATE looks promising, although I think it requires a lot of effort to master it.
No, text mining is not exactly what I need. To describe it again: it is something like an auto-plagiarism test. I have a collection of various types of documents, and I want to check which parts of them coincide or are identical. Of course, I can do this by using popular programs to compare the content of two documents side by side, but since I have plenty of them, it would take a lot of time. That's why I'm looking for a program that would make such a comparison – to compare every document with every other one. I was hoping that this is precisely what Plagiarism Checker X (option: Bulk Cross Comparison) does, but unfortunately, after installing the program, it turned out that it did not work – the program hangs – and the support did not find the time (or will) to answer me; another thing is that it only works on Windows. That's why I'm still looking for the right program. If any of you have any other ideas, I would be grateful.
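The "every document against every other" check can be sketched in Python without special software. The sketch below is not any of the programs named in this thread; it assumes the doc/docx/pdf files have already been converted to plain text, and it uses one simple technique: collect every run of `n` consecutive words per document and report the runs that two documents share. The names `shingles` and `shared_phrases` and the default `n=6` are arbitrary choices for illustration.

```python
import re
from itertools import combinations


def shingles(text, n=6):
    """Map each run of n consecutive words to the position of its first occurrence."""
    words = re.findall(r"\w+", text.lower())
    found = {}
    for i in range(len(words) - n + 1):
        found.setdefault(" ".join(words[i:i + n]), i)
    return found


def shared_phrases(docs, n=6):
    """docs maps name -> text. Return {(name_a, name_b): [shared phrases]}."""
    index = {name: shingles(text, n) for name, text in docs.items()}
    report = {}
    # combinations() yields each unordered pair once: k*(k-1)/2 pairs for k files.
    for a, b in combinations(sorted(index), 2):
        common = index[a].keys() & index[b].keys()
        if common:
            # List the phrases in the order they appear in document a.
            report[(a, b)] = sorted(common, key=index[a].get)
    return report


if __name__ == "__main__":
    notes = {
        "draft.txt": "As argued before, memory is a social construct shaped by ritual.",
        "article.txt": "We claim that memory is a social construct shaped by ritual and place.",
        "quotes.txt": "An unrelated quotation about something else entirely.",
    }
    for pair, phrases in shared_phrases(notes).items():
        print(pair, phrases)
```

A longer copied passage shows up as several overlapping phrases, so a real tool would merge them. A smaller `n` catches shorter borrowings at the cost of more noise, and paraphrased passages are not detected at all.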
Lucas
5/23/2020 9:55 pm
I just came across Textflo (for Windows), which offers some interesting text analysis algorithms. The site is here:
http://distributedcomputingsystems.co.uk/textfilter.html
The algorithms appear relatively esoteric, but might be relevant to this topic.
The program was very recently updated. The help file is extensive, although the link for the separate document on text filters doesn't seem to work at the moment.
