Production of Greek language resources

Capitalizing on Greek Web content of about 20 million URIs, consisting of approximately 10 trillion characters, and using state-of-the-art technologies and open-source software, we produced the following language resources.

Production of vector word representations using fastText, as described in the relevant paper (a short loading sketch follows the file list).

  • File .vec (1.3G)
  • File .bin (5.6G)
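
As a quick sanity check, the vectors can be loaded with gensim. This is a minimal sketch, not part of the official release: gensim is assumed to be installed, and the file names are placeholders for the downloaded .vec and .bin files.

# Minimal sketch for loading the released Greek fastText vectors with gensim.
# The file names "greek_fasttext.vec" / "greek_fasttext.bin" are placeholders
# for the downloaded files; adjust the paths to your local copies.
from gensim.models import KeyedVectors
from gensim.models.fasttext import load_facebook_vectors

# .vec: plain word2vec text format, word vectors only (no subword information).
wv = KeyedVectors.load_word2vec_format("greek_fasttext.vec", binary=False)
print(wv.most_similar("αθήνα", topn=5))  # nearest neighbours, if the word is in the vocabulary

# .bin: full fastText binary with subword n-grams, so it can also build
# vectors for out-of-vocabulary words.
ft = load_facebook_vectors("greek_fasttext.bin")
print(ft["τριχοσμηγματογόνο"].shape)  # a vector even for rare or unseen words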

Datasets produced for the evaluation of Greek word embeddings, following the methodology described in the relevant paper (an example evaluation run is sketched below).
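
As an illustration of such an evaluation, gensim can score a set of word vectors against an analogy dataset. The sketch below is hedged: it assumes the evaluation file follows the standard word2vec analogy format (": section" headers, four words per line), and both file names are placeholders.

# Hedged sketch of an analogy-based evaluation with gensim; file names are
# placeholders, and the dataset is assumed to be in the standard word2vec
# analogy format (": section" headers, four words per line).
from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format("greek_fasttext.vec", binary=False)
score, sections = wv.evaluate_word_analogies("greek_analogies.txt")
print(f"overall analogy accuracy: {score:.3f}")

# Per-section accuracy (each section groups related analogy questions).
for section in sections:
    total = len(section["correct"]) + len(section["incorrect"])
    if total:
        print(section["section"], len(section["correct"]) / total)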

Datasets produced for the evaluation of CBOS, a new word-embedding method proposed in "An Ensemble Method for Producing Word Representations focusing on the Greek Language", following the methodology described in the relevant paper.

We are initially releasing a small ELECTRA pre-trained model: ELECTRA-Small, Uncased: 12-layer, 256-hidden, 14M parameters.

The model was trained on 80 GB of uncased Greek text produced as described in "Word Embeddings from Large-Scale Greek Web content". For training we used the official code provided in the GitHub repository.

We also plan to release an ELECTRA-Base and a BERT-Base model soon (a usage sketch follows the download links below).

Download the file (50M)

Download the file (954M)

Download the file (3.7G)
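
The released checkpoint can be used, for example, through Hugging Face transformers. The sketch below is only an illustration: the local directory name is a placeholder, and it assumes the downloaded checkpoint has already been converted to the Hugging Face format (the official ELECTRA code produces TensorFlow checkpoints, for which transformers provides conversion scripts).

# Hedged sketch of extracting contextual embeddings from the ELECTRA-Small
# model with Hugging Face transformers. "./greek-electra-small-uncased" is a
# placeholder directory; it assumes the checkpoint has been converted to the
# Hugging Face format.
import torch
from transformers import ElectraModel, ElectraTokenizerFast

tokenizer = ElectraTokenizerFast.from_pretrained("./greek-electra-small-uncased")
model = ElectraModel.from_pretrained("./greek-electra-small-uncased")

inputs = tokenizer("η αθηνα ειναι η πρωτευουσα της ελλαδας", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Contextual token embeddings: (batch, sequence_length, hidden_size=256)
print(outputs.last_hidden_state.shape)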

The lexicon contains 500K words with their counts in the Greek Web corpus. It was produced by applying various cleaning steps to the 2M unigram tokens (a loading sketch follows the download links below).

It is worth mentioning that there are thousands of newly appearing words (not included in any Greek thesaurus). A sample of them follows:

Word  Count
τριχοσμηγματογόνο 68
φωσφατιδυλική 68
συνεφρίνη 68
ομοκυστεϊναιμία 22
κυκλοκόνιου 22
ενωχιανές 22
θλιπτηρίου 22
διβαράτικο 22

Download the file (5M)

Download the file (2K)
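
A simple way to work with the lexicon is to read it into a dictionary keyed by word. The sketch below assumes one whitespace-separated "word count" pair per line; the file name is a placeholder for the downloaded lexicon.

# Minimal sketch for loading the word-count lexicon; "greek_web_lexicon.txt"
# is a placeholder file name, and each line is assumed to hold one
# whitespace-separated "word count" pair.
def load_lexicon(path):
    counts = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) != 2:
                continue  # skip malformed lines
            word, count = parts
            counts[word] = int(count)
    return counts

lexicon = load_lexicon("greek_web_lexicon.txt")
print(len(lexicon))                    # expected to be around 500K entries
print(lexicon.get("ομοκυστεϊναιμία"))  # e.g. one of the newly appearing words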

The raw text corpus, produced with the methodology described in the relevant paper (a streaming sketch follows the download link below).

Download the file (11G)
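
Given its size, the corpus is best processed as a stream rather than loaded into memory. The sketch below is a minimal line iterator with a placeholder file name, e.g. for feeding the text into further processing or training.

# Hedged sketch for streaming the ~11 GB raw corpus line by line;
# "greek_web_corpus.txt" is a placeholder file name.
def iter_corpus(path):
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield line

# Quick check on a sample of the corpus only.
total_chars = 0
for i, line in enumerate(iter_corpus("greek_web_corpus.txt")):
    total_chars += len(line)
    if i >= 100_000:
        break
print(total_chars)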

If you use these language resources, please cite the following papers:

@article{Outsios2018,
author = {Outsios, Stamatis and Skianis, Konstantinos and Meladianos, Polykarpos and Xypolopoulos, Christos and Vazirgiannis, Michalis},
journal = {arXiv preprint arXiv:1810.06694},
title = {Word Embeddings from Large-Scale Greek Web content},
year = {2018}
}

@article{Outsios2019,
author = {Outsios, Stamatis and Karatsalos, Christos and Skianis, Konstantinos and Vazirgiannis, Michalis},
journal = {arXiv preprint arXiv:1904.04032},
title = {Evaluation of Greek Word Embeddings},
year = {2019}
}

@article{Lioudakis2020,
author = {Lioudakis, Michalis and Outsios, Stamatis and Vazirgiannis, Michalis},
journal = {arXiv preprint arXiv:1904.04032},
title = {An Ensemble Method for Producing Word Representations focusing on the Greek Language},
year = {2020}
}

For any relevant information, please email: mvazirg@aueb.gr

Credits
Research group "Data & Web Mining" of the Department of Informatics (Athens University of Economics and Business)
Niarchos Foundation (partial funding)
Stamatis Outsios, Polykarpos Meladianos, Christos Karatsalos, Michalis Lioudakis, Marina Rekatsina, Konstantinos Skianis, Christos Xypolopoulos
