Understanding the Similarity Score

The similarity score is the first thing you see when a document is processed. Because it is easy to read this number as a sign of trouble, new users of the system commonly ask: ‘What level of similarity score indicates a problem?’

The answer is that there is no ‘magic number’ that will tell you whether a document contains problematic content. The similarity score gives you a rough ‘headline’: it brings heavily duplicated papers straight to your attention and lets you quickly set aside papers with hardly any matches. Beyond that, the score alone gives no definitive answers, and it certainly cannot tell you whether you have a case of plagiarism.

Why is this?

Well, there are a number of factors that need to be taken into account when assessing a paper’s overall similarity score.

Firstly, it’s important to note that the similarity score tells you the total amount of matching text, which will probably be made up of a number of smaller matches. It is possible that a 30% score will turn out to be a 30% match to one source, but it’s much more likely that when you look at the reports you’ll find the 30% is made up of a number of smaller matches, the largest of which might be just 4 or 5%. Of course, a paper with six separate matches of 5% could well be as problematic as one that has copied 30% of its content from a single source, but it’s impossible to tell whether this is the case without looking at the reports.
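The arithmetic behind this point can be sketched in a few lines of Python. This is only an illustration: the function name summarize_matches and the sample figures are hypothetical, and the sketch assumes the per-source matches do not overlap, so that they simply add up to the overall score.

```python
def summarize_matches(match_percentages):
    """Return (total, largest) for a list of per-source match percentages.

    Assumes the matches do not overlap, so the total is a plain sum.
    """
    total = sum(match_percentages)
    largest = max(match_percentages, default=0)
    return total, largest


# Two hypothetical papers with the same headline score.
single_source = [30]                # one 30% match to a single source
many_sources = [5, 5, 5, 5, 5, 5]   # six separate 5% matches

print(summarize_matches(single_source))  # (30, 30)
print(summarize_matches(many_sources))   # (30, 5)
```

Both papers show the same total, yet the breakdown differs completely, which is why the score alone cannot replace a look at the reports.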

Secondly, where the match appears can sometimes be more important than how big the match is. For example, editors in certain subject areas may be less concerned about sizable matches in methods sections, where there are only so many ways to describe a certain process. A match in the discussion or conclusions with no appropriate citation, on the other hand, could set alarm bells ringing even though it only accounts for a small percentage of the manuscript.

Similarly, acceptable thresholds for one type of article may not be appropriate for another: review articles, for example, can be expected to have a higher overall similarity score than original research articles.

It is also important to bear in mind that simple errors in the unedited manuscript can cause matches to be picked up incorrectly. The exclude bibliography feature of the Crossref software relies on the reference section having a title on its own line within the document. If this title is omitted from the manuscript, the references will not be excluded.
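To see why a missing section title matters, here is a minimal Python sketch of a heading-based exclusion. It is not the software's actual implementation: the function strip_bibliography and the list of accepted heading names are assumptions made purely for illustration.

```python
import re

# Hypothetical heading names; the list the real software accepts is not given here.
HEADING = re.compile(r"^\s*(references|bibliography|works cited)\s*$", re.IGNORECASE)


def strip_bibliography(text):
    """Drop everything from the first reference heading found on its own line.

    If no such heading exists, the text is returned unchanged, so the
    reference list would still be compared against other sources.
    """
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if HEADING.match(line):
            return "\n".join(lines[:i])
    return text


with_heading = "Body text.\nReferences\n1. Smith 2010."
without_heading = "Body text.\n1. Smith 2010."

print(strip_bibliography(with_heading))     # Body text.
print(strip_bibliography(without_heading))  # unchanged: the reference stays in
```

In this sketch, a heading that shares a line with other text would not be recognized either, which mirrors the requirement that the title sit on its own line.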

Similarly, the exclude quotes feature looks for quotation marks. If the author has not used quotation marks, or has missed one at the start or end of a passage, the system will not recognize the passage as a quote, even though its layout and reference may make this apparent to the editor.
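The effect of a missing quotation mark can be illustrated with a simple pattern match. Again, this is a hedged sketch and not the real feature: the function quoted_spans and its regular expression are hypothetical, and they assume a quote is recognized only when both an opening and a closing mark are present.

```python
import re

# Matches text between a pair of straight quotes or a pair of curly quotes.
QUOTED = re.compile(r'"[^"]*"|“[^”]*”')


def quoted_spans(text):
    """Return every passage enclosed in a complete pair of quotation marks."""
    return QUOTED.findall(text)


balanced = 'The author states "the method is novel" in the introduction.'
unbalanced = 'The author states "the method is novel in the introduction.'

print(quoted_spans(balanced))    # ['"the method is novel"']
print(quoted_spans(unbalanced))  # []
```

With the closing mark missing, nothing is detected, so the quoted passage would be counted as ordinary matching text.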

For all of these reasons it’s important to look at the reports rather than rely on the similarity score alone.

(Please retain the reference in reprint: http://www.letpub.com/index.php?page=author_education_evidence)