IR & NLP

StemmerLV: Latvian lemmatizer and stemmer for Java

Version 1.0 (January 18, 2020) - download (GNU LGPL license)

StemmerLV uses Hunspell affix and dictionary files created by Janis Eisaks (available here).

StemmerLV was initially developed as a pure Java substitute for the HunspellJNA library for Latvian lemmatization, but it soon gained additional functionality.

What StemmerLV can do:

  • Lemmatize a word according to the affix and dictionary files. The result is the same as with HunspellJNA, although on average it runs about 15% slower.
  • Save time by returning as soon as the first lemma is found. In this mode it works on average almost 3 times faster than HunspellJNA but never returns more than one lemma.
  • Stem a word by either finding or guessing its lemma and then stemming the lemma. Lemma guessing allows finding consistent short stems for unknown words that are not included in the dictionary.
  • List all word forms for a given lemma.
  • List all lemmas included in the dictionary together with all their word forms.

How StemmerLV does stemming:

  1. Uses the affix file to generate lemma candidates for a given word.
  2. Checks whether any of the lemma candidates exist in the dictionary. If at least one does, discards all candidates that do not. If none of the candidates are in the dictionary and guessing is disabled, returns the original word unchanged.
  3. If none of the candidates exist in the dictionary, filters out those with implausible endings and keeps the rest as guesses for the lemma. There can be up to about 20 different guesses. (This step is skipped if guessing is disabled.)
  4. Stems all lemmas using four simple character removal rules designed specifically for stemming Latvian lemmas and returns the shortest stem.
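The four steps above can be illustrated with a minimal sketch. This is not StemmerLV's code: the class name, the three toy affix rules, the two-word dictionary, the "plausible ending" filter, and the removal rules are all invented for illustration, and treating the word itself as a lemma candidate is an assumption. Only the overall control flow follows the description above.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class StemmingSketch {
    // Toy affix rules as {suffix to strip, suffix to add}; NOT taken from the real affix file.
    static final String[][] AFFIX_RULES = {{"iem", "s"}, {"iem", "is"}, {"ai", "a"}};
    // Toy dictionary of known lemmas (hypothetical stand-in for the dictionary file).
    static final Set<String> DICTIONARY = Set.of("galds", "liepa");
    // Hypothetical filter: endings a guessed lemma is allowed to have.
    static final List<String> PLAUSIBLE_ENDINGS = List.of("s", "a", "e");
    // Hypothetical removal rules, tried in order; NOT StemmerLV's actual rules.
    static final List<String> REMOVAL_RULES = List.of("is", "us", "s", "a", "e");

    // Step 1: the word itself plus the result of every affix rule that applies.
    static Set<String> lemmaCandidates(String word) {
        Set<String> candidates = new LinkedHashSet<>();
        candidates.add(word);
        for (String[] rule : AFFIX_RULES) {
            if (word.length() > rule[0].length() && word.endsWith(rule[0])) {
                candidates.add(word.substring(0, word.length() - rule[0].length()) + rule[1]);
            }
        }
        return candidates;
    }

    // Step 4 helper: strip the first matching ending, never producing an empty stem.
    static String stemLemma(String lemma) {
        for (String ending : REMOVAL_RULES) {
            if (lemma.length() > ending.length() && lemma.endsWith(ending)) {
                return lemma.substring(0, lemma.length() - ending.length());
            }
        }
        return lemma;
    }

    static String stem(String word, boolean guessingEnabled) {
        Set<String> candidates = lemmaCandidates(word);

        // Step 2: keep only the candidates found in the dictionary.
        List<String> lemmas = new ArrayList<>();
        for (String candidate : candidates) {
            if (DICTIONARY.contains(candidate)) {
                lemmas.add(candidate);
            }
        }
        if (lemmas.isEmpty()) {
            if (!guessingEnabled) {
                return word; // unknown word and no guessing: return it unchanged
            }
            // Step 3: keep candidates with a plausible ending as lemma guesses.
            for (String candidate : candidates) {
                for (String ending : PLAUSIBLE_ENDINGS) {
                    if (candidate.endsWith(ending)) {
                        lemmas.add(candidate);
                        break;
                    }
                }
            }
            if (lemmas.isEmpty()) {
                return word; // nothing plausible to guess from
            }
        }

        // Step 4: stem every remaining lemma and return the shortest stem.
        String shortest = null;
        for (String lemma : lemmas) {
            String stemmed = stemLemma(lemma);
            if (shortest == null || stemmed.length() < shortest.length()) {
                shortest = stemmed;
            }
        }
        return shortest;
    }

    public static void main(String[] args) {
        System.out.println(stem("galdiem", true));  // known lemma "galds" -> gald
        System.out.println(stem("blogiem", true));  // guessed "blogs"/"blogis" -> blog
        System.out.println(stem("blogiem", false)); // guessing disabled -> blogiem
    }
}
```

In this toy setup the unknown word "blogiem" yields two guesses that collapse to the same stem, which shows how guessing can give consistent short stems for words missing from the dictionary.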

To use StemmerLV, add the .jar file to your project, create a StemmerLV object, and browse its available methods; their names are largely self-explanatory. The source code is included in the .jar file.
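If no IDE is at hand, the public methods of a class can also be listed with standard Java reflection. The sketch below is generic and does not assume any StemmerLV method names. The fully qualified name of the StemmerLV class is not given here, so it is expected as a command-line argument (it can be looked up by listing the .jar contents, for example with `jar tf`). The helper class name `ListPublicMethods` and the `java.lang.String` fallback are only illustrative.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;
import java.util.Set;
import java.util.TreeSet;

final class ListPublicMethods {
    // Returns the sorted names of the public methods declared by the given class.
    static Set<String> publicMethodNames(String className) {
        Class<?> type;
        try {
            type = Class.forName(className);
        } catch (ClassNotFoundException e) {
            throw new IllegalArgumentException("Class not on the classpath: " + className, e);
        }
        Set<String> names = new TreeSet<>();
        for (Method method : type.getDeclaredMethods()) {
            if (Modifier.isPublic(method.getModifiers())) {
                names.add(method.getName());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // Pass the fully qualified class name; java.lang.String is only a demo fallback.
        String className = args.length > 0 ? args[0] : "java.lang.String";
        publicMethodNames(className).forEach(System.out::println);
    }
}
```

To inspect StemmerLV this way, the .jar file has to be on the classpath when the sketch is run.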

Latvian Twitter corpus

A Latvian Twitter corpus of more than 7 million tweets (11 million including retweets), gathered since 2017.
The corpus is available upon request for non-commercial research.

EN-RU / RU-EN V. K. Mueller's machine-readable dictionary

download

Machine-readable, restructured English-Russian and reversed Russian-English versions of V. K. Mueller's English-Russian dictionary (7th edition, 1961). The dictionaries are based on files from here and here.
The aim was to produce English-Russian and Russian-English dictionaries with a very simple machine-readable structure (in contrast to the V. K. Mueller dictionary files that typically circulate on the internet). The focus was on translations and tags, so pronunciations and most explanatory phrases are not included.
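As an illustration of what reversing a dictionary involves, here is a minimal sketch. It is not the procedure actually used to build these files: the class name, the in-memory map representation, and the toy entries (with transliterated Russian to keep the source ASCII-only) are assumptions. It shows that reversal changes the number of entries while preserving the number of translation pairs, which is consistent with the identical translation counts reported below for the two directions.

```java
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class ReverseDictionarySketch {
    // Turns entry -> translations into translation -> entries.
    static Map<String, Set<String>> reverse(Map<String, List<String>> forward) {
        Map<String, Set<String>> reversed = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> entry : forward.entrySet()) {
            for (String translation : entry.getValue()) {
                reversed.computeIfAbsent(translation, key -> new LinkedHashSet<>())
                        .add(entry.getKey());
            }
        }
        return reversed;
    }

    // Total number of (entry, translation) pairs in a dictionary.
    static int countTranslations(Map<String, ? extends Collection<String>> dictionary) {
        int total = 0;
        for (Collection<String> translations : dictionary.values()) {
            total += translations.size();
        }
        return total;
    }

    // Toy English-Russian data (transliterated); purely illustrative.
    static Map<String, List<String>> toyForward() {
        Map<String, List<String>> forward = new LinkedHashMap<>();
        forward.put("house", List.of("dom", "zdanie"));
        forward.put("home", List.of("dom"));
        forward.put("building", List.of("zdanie"));
        return forward;
    }

    public static void main(String[] args) {
        Map<String, List<String>> forward = toyForward();
        Map<String, Set<String>> reversed = reverse(forward);
        System.out.println(forward.size() + " entries, " + countTranslations(forward) + " translations");
        System.out.println(reversed.size() + " entries, " + countTranslations(reversed) + " translations");
        System.out.println(reversed);
    }
}
```

With the toy data, three English entries become two Russian entries, while the four translation pairs are kept.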

Word counts:

  • English-Russian dictionary:
    • 44886 entries
    • 152250 translations
  • Russian-English dictionary (reversed):
    • 72655 entries
    • 152250 translations
  • Russian-English dictionary (reversed) with single-word Russian entries:
    • 36116 entries
    • 107058 translations

(See readme.txt for more details.)

The dictionaries are licensed under the GNU GPL (not the best license for a dictionary, but I can't do much about it).

EN-RU / RU-EN machine-readable dictionaries extracted from Wiktionary

download

Machine-readable English-Russian and Russian-English dictionaries extracted and merged from the English and Russian Wiktionary dumps (20180220). Both separate and merged dictionaries are available. Only translations and POS tags are included.
The extraction was done using JWKTL (Java-based Wiktionary Library).

Word counts for merged dictionaries:

  • Merged English-Russian dictionary:
    • 53672 entries
    • 119600 translations
  • Merged English-Russian dictionary with single-word English entries:
    • 45839 entries
    • 110433 translations
  • Merged Russian-English dictionary:
    • 61205 entries
    • 119600 translations
  • Merged Russian-English dictionary with single-word Russian entries:
    • 55472 entries
    • 113284 translations

(See readme.txt for more details.)

The dictionaries are licensed under the Creative Commons Attribution-ShareAlike 3.0 Unported License (“CC BY-SA”).

Gints Jēkabsons, Dr.sc.ing.

Riga Technical University
Institute of Applied Computer Systems
Sētas str. 1, LV-1048, Riga, Latvia