KLEJ Benchmark

NKJP-NER

NKJP-NER is based on a human-annotated part of the National Corpus of Polish (NKJP). We extracted sentences containing named entities of exactly one type. The task is to predict the type of the named entity.
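The extraction step described above can be sketched as a simple filter. This is only an illustrative sketch, not the authors' actual preprocessing code: the input format (a list of `(text, entity_types)` pairs), the function name `single_type_examples`, and the sample labels are all assumptions made for the example.

```python
from typing import List, Tuple

def single_type_examples(
    sentences: List[Tuple[str, List[str]]]
) -> List[Tuple[str, str]]:
    """Keep sentences whose named entities all share exactly one type.

    Each input item is (text, entity_types), with one label per annotated
    entity (a hypothetical format). Returns (text, entity_type) pairs.
    """
    examples = []
    for text, entity_types in sentences:
        distinct = set(entity_types)
        # Exactly one distinct type: drops sentences with no entities
        # and sentences that mix several entity types.
        if len(distinct) == 1:
            examples.append((text, next(iter(distinct))))
    return examples

# Illustrative data only; the labels are placeholders.
sample = [
    ("Adam mieszka w Warszawie.", ["persName", "placeName"]),
    ("Adam i Ewa poszli do kina.", ["persName", "persName"]),
    ("Pada deszcz.", []),
]
print(single_type_examples(sample))
```

Only the second sample sentence survives the filter, because it is the only one whose entities all have the same type.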

License:

GNU GPL v.3

Original citation:

@book{przepiorkowski2012narodowy,
title={Narodowy korpus j{\k{e}}zyka polskiego},
author={Przepi{\'o}rkowski, Adam},
year={2012},
publisher={Naukowe PWN}
}

CDSC-E

The Compositional Distributional Semantics Corpus consists of pairs of sentences which are human-annotated for their entailment.

License:

CC BY-NC-SA 4.0

Original citation:

@inproceedings{wroblewska2017polish,
title={Polish evaluation dataset for compositional distributional semantics models},
author={Wr{\'o}blewska, Alina and Krasnowska-Kiera{\'s}, Katarzyna},
booktitle={Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
pages={784--792},
year={2017}
}

CDSC-R

The Compositional Distributional Semantics Corpus consists of pairs of sentences which are human-annotated for their semantic relatedness.

License:

CC BY-NC-SA 4.0

Original citation:

CBD

The Cyberbullying Detection task was part of the 2019 edition of the PolEval competition. The goal is to predict whether a given Twitter message contains cyberbullying (harmful) content.

License:

BSD 3-Clause

Original citation:

@article{ptaszynski2019results,
title={Results of the PolEval 2019 Shared Task 6: First Dataset and Open Shared Task for Automatic Cyberbullying Detection in Polish Twitter},
author={Ptaszynski, Michal and Pieciukiewicz, Agata and Dyba{\l}a, Pawe{\l}},
journal={Proceedings of the PolEval 2019 Workshop},
publisher={Institute of Computer Science, Polish Academy of Sciences},
pages={89},
year={2019}
}

PolEmo2.0-IN

PolEmo2.0 is a set of online reviews from the medicine and hotels domains. The task is to predict the sentiment of a review. There are two separate test sets, allowing for both in-domain (medicine and hotels) and out-of-domain (products and university) validation.

License:

CC BY-NC-SA 4.0

Original citation:

@inproceedings{kocon-etal-2019-multi,
title = "Multi-Level Sentiment Analysis of {P}ol{E}mo 2.0: Extended Corpus of Multi-Domain Consumer Reviews",
author = "Koco{\'n}, Jan and
Mi{\l}kowski, Piotr and
Za{\'s}ko-Zieli{\'n}ska, Monika",
booktitle = "Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL)",
month = nov,
year = "2019",
address = "Hong Kong, China",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/K19-1092",
doi = "10.18653/v1/K19-1092",
pages = "980--991",
}

PolEmo2.0-OUT

License:

CC BY-NC-SA 4.0

Original citation:

DYK

The Did You Know (Polish: Czy wiesz?) dataset consists of human-annotated question-answer pairs. The task is to predict whether the answer is correct. As negatives, we chose the answers that have the largest token overlap with the question.
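Selecting negatives by token overlap can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure actually used to build the dataset: whitespace tokenization, lowercasing, counting shared unique tokens as the overlap measure, and the names `token_overlap` and `hardest_negatives` are all hypothetical choices.

```python
from typing import List

def token_overlap(question: str, answer: str) -> int:
    """Count unique lowercase whitespace tokens shared by both strings."""
    question_tokens = set(question.lower().split())
    answer_tokens = set(answer.lower().split())
    return len(question_tokens & answer_tokens)

def hardest_negatives(question: str, wrong_answers: List[str], k: int = 1) -> List[str]:
    """Return the k incorrect answers with the largest token overlap."""
    ranked = sorted(
        wrong_answers,
        key=lambda answer: token_overlap(question, answer),
        reverse=True,
    )
    return ranked[:k]

# Illustrative English data; the real dataset is in Polish.
question = "which river flows through warsaw"
wrong_answers = [
    "chopin was a composer",
    "the oder flows through wroclaw",
    "warsaw is the capital of poland",
]
print(hardest_negatives(question, wrong_answers, k=2))
```

Negatives chosen this way look superficially similar to the question, so a model cannot solve the task by lexical matching alone.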

License:

CC BY-SA 3.0

Original citation:

@inproceedings{marcinczuk2013open,
title={Open dataset for development of Polish Question Answering systems},
author={Marcinczuk, Micha{\l} and Ptak, Marcin and Radziszewski, Adam and Piasecki, Maciej},
booktitle={Proceedings of the 6th Language \& Technology Conference: Human Language Technologies as a Challenge for Computer Science and Linguistics, Wydawnictwo Poznanskie, Fundacja Uniwersytetu im. Adama Mickiewicza},
year={2013}
}

PSC

The Polish Summaries Corpus contains news articles and their summaries. We used summaries of the same article as positive pairs and sampled the most similar summaries of different articles as negatives.

License:

CC BY-SA 3.0

Original citation:

@inproceedings{ogro:kop:14:lrec,
title={The {P}olish {S}ummaries {C}orpus},
author={Ogrodniczuk, Maciej and Kope{\'c}, Mateusz},
booktitle = "Proceedings of the Ninth International {C}onference on {L}anguage {R}esources and {E}valuation, {LREC}~2014",
year = "2014",
}

AR

Allegro Reviews is a set of product reviews from a popular e-commerce marketplace (Allegro.pl). The task is to predict the rating of a review on a scale from 1 to 5.

License:

CC BY-SA 4.0

Original citation:

@inproceedings{rybak-etal-2020-klej,
title = "{KLEJ}: Comprehensive Benchmark for Polish Language Understanding",
author = "Rybak, Piotr and Mroczkowski, Robert and Tracz, Janusz and Gawlik, Ireneusz",
booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
month = jul,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.acl-main.111",
pages = "1191--1201",
}