Overview
GUM's strange cousin
GENTLE is a manually annotated multilayer corpus that follows the same design and annotation layers as GUM but covers unusual text types. The goal of the corpus is to provide a test set of challenging genres on which NLP systems can be evaluated. In particular, we aim to make data available to support:
- Evaluating NLP models trained on homogeneous data (or even multi-genre data, such as GUM) to find how much they degrade on out-of-domain data.
- Describing and understanding unusual text types in linguistic terms, drawing comparisons with other, more familiar genres (e.g. mathematical proofs are not particularly similar to any other genre, while poetry, which seems highly unconventional, is most similar to GUM's fiction genre).
Composition
GENTLE follows the same corpus design as GUM and serves as an extension to it by adding eight unusual genres:
Text type | Source | Docs | Tokens |
---|---|---|---|
Dictionary entries | Wiktionary | 3 | 2,423 |
Esports commentaries | YouTube | 2 | 2,149 |
Legal documents | Wikisource | 2 | 2,288 |
Medical notes | MTSamples | 4 | 2,164 |
Poetry | Wikisource | 5 | 2,090 |
Mathematical proofs | ProofWiki | 3 | 2,106 |
Syllabuses | GitHub | 2 | 2,431 |
Threat letters | casetext | 5 | 2,146 |
Total | | 26 | 17,797 |
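
The totals in the table can be re-derived from the per-genre rows. The following is a minimal sketch that copies the document and token counts from the table by hand; the names `GENRES` and `summarize` are illustrative assumptions and are not part of the corpus release or any GENTLE tooling.

```python
# Per-genre figures copied from the composition table: (documents, tokens).
# GENRES and summarize are hypothetical names used only for this illustration.
GENRES = {
    "Dictionary entries": (3, 2423),
    "Esports commentaries": (2, 2149),
    "Legal documents": (2, 2288),
    "Medical notes": (4, 2164),
    "Poetry": (5, 2090),
    "Mathematical proofs": (3, 2106),
    "Syllabuses": (2, 2431),
    "Threat letters": (5, 2146),
}


def summarize(genres):
    """Return genre count, document total, token total, and mean tokens per genre."""
    total_docs = sum(docs for docs, _ in genres.values())
    total_tokens = sum(tokens for _, tokens in genres.values())
    return {
        "genres": len(genres),
        "docs": total_docs,
        "tokens": total_tokens,
        "tokens_per_genre": total_tokens / len(genres),
    }


summary = summarize(GENRES)
print(summary)
```

Each genre contributes a little over 2,000 tokens, so the genres are roughly balanced by token count even though the number of documents per genre varies.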