IR & NLP

Evaluation of Fingerprint Selection Algorithms for Two-Stage Plagiarism Detection (Java code)

Version 1.0 (December 30, 2021) - download (GNU LGPL license)

This software was used for the experiments in the paper "Evaluation of Fingerprint Selection Algorithms for Two-Stage Plagiarism Detection" (also available from the publisher here).

This software was developed to evaluate the effectiveness of fingerprint selection algorithms in two-stage (source retrieval + alignment) local text reuse detection. It implements Full fingerprinting, Every p-th, 0 mod p, Winnowing, Hailstorm, Frequency-Biased Winnowing (FBW), and Modified Frequency-Biased Winnowing (MFBW) (see the paper above or the paper here for details). Indexing of the fingerprints is implemented using the Apache Lucene library.

Fingerprint selection algorithms for local text reuse detection (Java code)

Version 1.0 (June 5, 2020) - download (GNU LGPL license)

This software was used for the experiments in the paper "Evaluation of Fingerprint Selection Algorithms for Local Text Reuse Detection" (also available from the publisher here).

This software was developed to evaluate the effectiveness of fingerprint selection algorithms for the source retrieval stage of a local text reuse detection system. It implements the following fingerprint selection algorithms (see the paper for details):

Full fingerprinting;
Every p-th;
0 mod p;
Winnowing;
Hailstorm;
Frequency-Biased Winnowing (FBW);
Modified Frequency-Biased Winnowing (MFBW) - proposed in the paper.
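
As a rough illustration of what "fingerprint selection" means, the three simplest schemes from the list can be sketched as below. This is a minimal sketch, not the code from the download: the class and method names are made up for the example, the input is assumed to be a list of already computed n-gram hash values, and the Winnowing variant shown breaks ties in favour of the rightmost minimum. Hailstorm, FBW, and MFBW are not covered here; see the paper for those.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; names are hypothetical and not taken from the released software.
public class FingerprintSelectionSketch {

    /** Every p-th: keeps positions 0, p, 2p, ... of a sequence of n hashes. */
    static List<Integer> everyPth(int n, int p) {
        List<Integer> selected = new ArrayList<>();
        for (int i = 0; i < n; i += p) {
            selected.add(i);
        }
        return selected;
    }

    /** 0 mod p: keeps the positions whose hash value is divisible by p. */
    static List<Integer> zeroModP(List<Integer> hashes, int p) {
        List<Integer> selected = new ArrayList<>();
        for (int i = 0; i < hashes.size(); i++) {
            if (hashes.get(i) % p == 0) {
                selected.add(i);
            }
        }
        return selected;
    }

    /**
     * Winnowing: slides a window of w consecutive hashes over the sequence and
     * keeps the position of the minimum hash in each window (rightmost on ties).
     * A position that stays the minimum across several windows is kept once.
     */
    static List<Integer> winnow(List<Integer> hashes, int w) {
        List<Integer> selected = new ArrayList<>();
        int last = -1; // position selected for the previous window
        for (int start = 0; start + w <= hashes.size(); start++) {
            int minPos = start;
            for (int i = start + 1; i < start + w; i++) {
                if (hashes.get(i) <= hashes.get(minPos)) { // <= prefers the rightmost minimum
                    minPos = i;
                }
            }
            if (minPos != last) {
                selected.add(minPos);
                last = minPos;
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        // Example hash values chosen for illustration (assumed to come from hashed word n-grams).
        List<Integer> hashes = List.of(77, 74, 42, 17, 98, 50, 17, 98, 8, 88, 67, 39, 77, 74, 42, 17, 98);
        System.out.println("Every 4th:  " + everyPth(hashes.size(), 4));
        System.out.println("0 mod 2:    " + zeroModP(hashes, 2));
        System.out.println("Winnowing:  " + winnow(hashes, 4));
    }
}
```

With a window of 4, the Winnowing call on the example sequence selects positions 3, 6, 8, 11, and 15, so every window of four consecutive hashes contains at least one selected fingerprint.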

Local text reuse occurs when parts of a document, such as a paragraph or a sentence, are reused in another document. The reused text may also be modified by inserting, removing, replacing, or rearranging words or sentences, as well as interleaving text from one source with a text from another source. Detection of such reuse is central to a variety of applications, including plagiarism detection, origin detection, and information flow analysis.

StemmerLV: Latvian lemmatizer and stemmer for Java

Version 1.2 (November 6, 2020) - download (GNU LGPL license)

StemmerLV uses Hunspell affix and dictionary files created by Jānis Eisaks (available here).

StemmerLV was initially developed as a pure Java substitute for the HunspellJNA library to do lemmatization for Latvian but quickly got some additional functionality.

What StemmerLV can do:

Lemmatize a word according to the affix and dictionary files. The result is the same as with HunspellJNA (but unfortunately it works on average about 15% slower).
Save time by returning as soon as the first lemma is found. In this mode it works on average almost 3 times faster than HunspellJNA but never returns more than one lemma.
Stem a word by either finding or guessing its lemma and then stemming the lemma. Lemma guessing allows finding consistent short stems for unknown words that are not included in the dictionary.
List all word forms for a given lemma.
List all lemmas included in the dictionary together with all their word forms.

How StemmerLV does stemming:

Uses the affix file to generate lemma candidates for a given word.
Checks if any of the lemma candidates exist in the dictionary. If at least one candidate is there, discards all the candidates that are not there. If none of the candidates are there and guessing is disabled, just returns the original word.
If none of the candidates exist in the dictionary, filters out those with unnatural-looking endings but keeps all the rest of the candidates as guesses for the lemma. There can be up to about 20 different guesses. (This step is skipped if guessing is disabled.)
Stems all lemmas using four simple character removal rules designed specifically for stemming Latvian lemmas and returns the shortest stem.
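
The four steps above can be sketched roughly as follows. This is only an illustration of the control flow, not StemmerLV's actual code or API: the class, the method names, the tiny affix rule table, the list of plausible endings, and the ending-stripping rules are all hypothetical stand-ins for the real Hunspell files and the real Latvian stemming rules. Treating the input word itself as a lemma candidate is also an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Illustrative sketch only; all names and rule tables here are hypothetical.
public class StemmingPipelineSketch {

    // Toy affix rules: {word ending, lemma ending}. The real rules come from the affix file.
    static final String[][] AFFIX_RULES = { {"am", "s"}, {"am", "a"}, {"u", "s"}, {"u", "a"} };
    // Toy dictionary of known lemmas.
    static final Set<String> DICTIONARY = Set.of("galds", "liepa");
    // Toy list of endings a guessed lemma may have (stand-in for the "unnatural ending" filter).
    static final List<String> PLAUSIBLE_ENDINGS = List.of("is", "s", "a", "e");
    // Toy character removal rules for stemming a lemma.
    static final List<String> STEM_ENDINGS = List.of("is", "s", "a", "e");

    /** Step 1: generate lemma candidates by undoing each matching affix rule. */
    static List<String> lemmaCandidates(String word) {
        List<String> candidates = new ArrayList<>();
        candidates.add(word); // assumption: the word may already be a lemma
        for (String[] rule : AFFIX_RULES) {
            if (word.endsWith(rule[0])) {
                candidates.add(word.substring(0, word.length() - rule[0].length()) + rule[1]);
            }
        }
        return candidates;
    }

    /** Step 4 helper: strip the first matching ending, keeping at least two characters. */
    static String stemLemma(String lemma) {
        for (String ending : STEM_ENDINGS) {
            if (lemma.endsWith(ending) && lemma.length() > ending.length() + 1) {
                return lemma.substring(0, lemma.length() - ending.length());
            }
        }
        return lemma;
    }

    static String stem(String word, boolean guessingEnabled) {
        List<String> candidates = lemmaCandidates(word);

        // Step 2: keep only the candidates found in the dictionary, if there are any.
        List<String> lemmas = new ArrayList<>();
        for (String candidate : candidates) {
            if (DICTIONARY.contains(candidate)) {
                lemmas.add(candidate);
            }
        }

        if (lemmas.isEmpty()) {
            if (!guessingEnabled) {
                return word; // unknown word and no guessing: return it unchanged
            }
            // Step 3: keep the candidates with plausible endings as guesses.
            for (String candidate : candidates) {
                if (PLAUSIBLE_ENDINGS.stream().anyMatch(candidate::endsWith)) {
                    lemmas.add(candidate);
                }
            }
            if (lemmas.isEmpty()) {
                return word;
            }
        }

        // Step 4: stem every remaining lemma and return the shortest stem.
        String shortest = null;
        for (String lemma : lemmas) {
            String s = stemLemma(lemma);
            if (shortest == null || s.length() < shortest.length()) {
                shortest = s;
            }
        }
        return shortest;
    }

    public static void main(String[] args) {
        System.out.println(stem("galdam", true));   // known lemma "galds"
        System.out.println(stem("datoram", true));  // unknown word, lemma is guessed
        System.out.println(stem("datoram", false)); // unknown word, guessing disabled
    }
}
```

In this toy setup a dictionary word and an unknown word go through the same final stemming step, which is what lets guessed lemmas produce consistent short stems for words missing from the dictionary.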

To use any of the functionality of StemmerLV, add the .jar file to your project, create a StemmerLV object, and see the list of available functions - their names are pretty self-explanatory. Source code is included in the .jar file.

EN/ET/LT/LV/RU machine-readable dictionaries extracted from Wiktionary

download

Machine-readable English-Estonian, English-Lithuanian, English-Latvian, English-Russian, Lithuanian-Estonian, Lithuanian-Russian, Latvian-Estonian, Latvian-Lithuanian, Latvian-Russian, and Russian-Estonian dictionaries extracted from English Wiktionary dump 20210720.

The dictionaries are licensed under the Creative Commons Attribution-ShareAlike 3.0 Unported License (“CC BY-SA”).

EN-RU / RU-EN V. K. Mueller's machine-readable dictionary

download

Machine-readable restructured English-Russian as well as reversed Russian-English versions of V. K. Mueller's English-Russian dictionary (7th edition, 1961). The dictionaries are based on files from here and here.
The aim was to make English-Russian and Russian-English dictionaries with a very simple machine-readable structure (in contrast to the V. K. Mueller dictionary files typically circulating on the internet). The focus was on translations and tags; therefore, pronunciations and most of the explanatory phrases are not included.

Word counts:

English-Russian dictionary:
- 44886 entries
- 152250 translations
Russian-English dictionary (reversed):
- 72655 entries
- 152250 translations
Russian-English dictionary (reversed) with single-word Russian entries:
- 36116 entries
- 107058 translations

(See readme.txt for more details.)
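
The counts above show the same number of translations in both directions but a different number of entries. A minimal sketch of why reversal behaves that way is given below. It is not the tool used to build these files: the class and method names are hypothetical, the in-memory map stands in for whatever file format the dictionaries use, and the Russian words are transliterated to keep the example ASCII-only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch only; not the code used to produce the dictionary files.
public class DictionaryReversalSketch {

    /** Turns headword -> translations into translation -> headwords. */
    static Map<String, List<String>> reverse(Map<String, List<String>> dictionary) {
        Map<String, List<String>> reversed = new TreeMap<>();
        for (Map.Entry<String, List<String>> entry : dictionary.entrySet()) {
            for (String translation : entry.getValue()) {
                // Each (headword, translation) pair becomes one (translation, headword) pair.
                reversed.computeIfAbsent(translation, k -> new ArrayList<>()).add(entry.getKey());
            }
        }
        return reversed;
    }

    /** Counts all (entry, translation) pairs in a dictionary. */
    static int countTranslations(Map<String, List<String>> dictionary) {
        int total = 0;
        for (List<String> translations : dictionary.values()) {
            total += translations.size();
        }
        return total;
    }

    public static void main(String[] args) {
        // Toy English-Russian data (transliterated), assumed for the example.
        Map<String, List<String>> enRu = new TreeMap<>();
        enRu.put("house", List.of("dom", "zdanie"));
        enRu.put("home", List.of("dom"));
        enRu.put("building", List.of("zdanie"));

        Map<String, List<String>> ruEn = reverse(enRu);
        System.out.println(enRu.size() + " entries, " + countTranslations(enRu) + " translations");
        System.out.println(ruEn.size() + " entries, " + countTranslations(ruEn) + " translations");
    }
}
```

Every translation pair is carried over exactly once, so the pair count is preserved, while the entry count depends on how many distinct words appear on the other side.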

The dictionaries are licensed under the GNU GPL license (while it's not the best license for a dictionary, I can't do much about it).

EN-RU / RU-EN machine-readable dictionaries extracted from Wiktionary

download

Machine-readable English-Russian and Russian-English dictionaries extracted and merged from English and Russian Wiktionary dump 20180220. Both separate and merged dictionaries are available. Only translations and POS tags are included.
The extraction was done using JWKTL (Java-based Wiktionary Library).

Word counts for merged dictionaries:

Merged English-Russian dictionary:
- 53672 entries
- 119600 translations
Merged English-Russian dictionary with single-word English entries:
- 45839 entries
- 110433 translations
Merged Russian-English dictionary:
- 61205 entries
- 119600 translations
Merged Russian-English dictionary with single-word Russian entries:
- 55472 entries
- 113284 translations

(See readme.txt for more details.)

The dictionaries are licensed under the Creative Commons Attribution-ShareAlike 3.0 Unported License (“CC BY-SA”).

Gints Jēkabsons
