Cross Comparison of documents
Posted by CRC
Feb 28, 2020 at 02:01 PM
This is a very interesting question. I think what it really leads you to is text mining or text analysis software. This is because you are probably not looking for character by character identical text, but text expressing particular concepts or ideas. You can find a number of these tools with a simple Google search for text mining.
As an example, I once had a corpus of documents representing a particular company's proposals for past work, and I wanted to find which of them best matched the requirements in a request for proposal. I experimented with this tool: https://gate.ac.uk/ . It turned into a very interesting and absorbing project, and while it was never used, it showed some real promise.
I will say that going down this road could be long and winding. If nothing else, you will find it incredibly engrossing and perhaps rewarding.
Posted by Wojciech
Feb 29, 2020 at 07:11 PM
Dear All, many thanks for all your advice.
@CRC, yes, right, GATE looks promising, although I think it requires a lot of effort to master it.
No, text mining is not exactly what I need. To describe it again: what I mean is something like an auto-plagiarism test. I have a collection of various types of documents, and I want to check which parts of them overlap or are identical. Of course, I can do this with popular programs that compare the content of two documents side by side, but since I have plenty of documents, it would take a lot of time. That’s why I’m looking for a program that would compare every document with every other one. I was hoping that this is precisely what Plagiarism Checker X (option: Bulk Cross Comparison) does, but unfortunately, after I installed the program, it turned out not to work (the program hangs), and the support did not find the time (or the will) to answer me. Another drawback is that it only works on Windows. That’s why I’m still looking for the right program. If any of you have any other ideas, I would be grateful.
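For illustration, the "every document against every other" comparison described here can be sketched in a few lines of Python using only the standard library's difflib and itertools. This is a minimal sketch, not a replacement for a dedicated tool. The function names shared_passages and cross_compare and the min_words threshold are invented for this example. It only reports word-for-word runs that appear in the same relative order in both texts, so reordered or paraphrased passages would be missed.

```python
from difflib import SequenceMatcher
from itertools import combinations


def shared_passages(text_a, text_b, min_words=5):
    """Return word runs of at least min_words that occur in both texts."""
    words_a = text_a.split()
    words_b = text_b.split()
    # autojunk=False stops difflib from ignoring very frequent words
    # in long documents.
    matcher = SequenceMatcher(None, words_a, words_b, autojunk=False)
    return [
        " ".join(words_a[m.a:m.a + m.size])
        for m in matcher.get_matching_blocks()
        if m.size >= min_words
    ]


def cross_compare(documents, min_words=5):
    """Compare every document with every other one.

    documents maps a name to its text. The result maps each pair of
    names to the passages they share; pairs with no overlap are omitted.
    """
    results = {}
    for (name_a, text_a), (name_b, text_b) in combinations(documents.items(), 2):
        passages = shared_passages(text_a, text_b, min_words)
        if passages:
            results[(name_a, name_b)] = passages
    return results


if __name__ == "__main__":
    # Hypothetical sample texts, only to show the shape of the result.
    docs = {
        "a": "the quick brown fox jumps over the lazy dog today",
        "b": "yesterday the quick brown fox jumps over a fence",
        "c": "completely unrelated text here",
    }
    for pair, passages in cross_compare(docs).items():
        print(pair, passages)
```

For real files, the dictionary could be filled by reading each file's text (for example with pathlib.Path.read_text), after converting formats such as PDF or DOCX to plain text first. The number of comparisons grows with the square of the number of documents, so a large collection may take a while.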
Posted by Lucas
May 23, 2020 at 09:55 PM
I just came across Textflo (for Windows), which offers some interesting text analysis algorithms. The site is here:
http://distributedcomputingsystems.co.uk/textfilter.html
The algorithms appear relatively esoteric, but might be relevant to this topic.
The program was very recently updated. The help file is extensive, although the link for the separate document on text filters doesn’t seem to work at the moment.