Citations
Publications and citation information
License and attribution information
GUM is made available under a Creative Commons license in keeping with the underlying texts. The documents from Wikimedia (Wikinews, including interviews, and Wikivoyage) are available under a CC-BY attribution license, as are academic articles, Wikipedia biographies, OpenStax textbooks and YouTube vlogs (retrieved using YouTube's Creative Commons filtered search). Some of the political speeches included in the corpus did not specify exact licenses, but are made available by official government and UN websites which indicate that these speeches are in the public domain, and not subject to copyright. Conversations from the Santa Barbara Corpus have been made available for annotation in GUM under the CC-BY license, courtesy of Jack DuBois (UCSB).
However, please note that wikiHow texts and fiction texts are made available under a CC-BY-NC-SA license (non-commercial, share alike), meaning that commercial and/or non-open source use of those texts is prohibited. Data from Reddit forum discussions is not distributed with the corpus, but can be obtained using a script under the licensing conditions imposed by Reddit. When using the data, please make sure to cite the sources of the texts as required by their source sites, and give credit to the GUM annotators, who are listed below, for the annotated data.
Academic citations
As a scholarly citation for the corpus in articles, please use the paper most closely matching your use case:
- General citation for the corpus:
Zeldes, Amir (2017) "The GUM Corpus: Creating Multilayer Resources in the Classroom". Language Resources and Evaluation 51(3), 581–612.
- Papers using the Reddit subcorpus:
Behzad, Shabnam and Zeldes, Amir (2020) "A Cross-Genre Ensemble Approach to Robust Reddit Part of Speech Tagging". In: Proceedings of the 12th Web as Corpus Workshop (WAC-XII), 50–56.
- Papers focusing on entities:
Jessica Lin, and Amir Zeldes (2021), "WikiGUM: Exhaustive Entity Linking for Wikification in 12 Genres". In: Proceedings of The Joint 15th Linguistic Annotation Workshop (LAW) and 3rd Designing Meaning Representations (DMR) Workshop (LAW-DMR 2021). Punta Cana, Dominican Republic, 170–175.
- Using the OntoGUM annotations for coreference:
Yilun Zhu, Sameer Pradhan, and Amir Zeldes (2021), "OntoGUM: Evaluating Contextualized SOTA Coreference Resolution on 12 More Genres". In: Proceedings of ACL-IJCNLP 2021. Bangkok, Thailand, 461–467.

- Zeldes, Amir and Simonson, Dan (2016) Different Flavors of GUM: Evaluating Genre and Sentence Type Effects on Multilayer Corpus Annotation Quality. In: Proceedings of LAW X - The 10th Linguistic Annotation Workshop at the Annual Meeting of the ACL. Berlin, 68-78.
- Zeldes, Amir (2016) rstWeb - A Browser-based Annotation Interface for Rhetorical Structure Theory and Discourse Relations. In: Proceedings of NAACL-HLT 2016 System Demonstrations. San Diego, CA, 1-5.
- Horsmann, Tobias, Erbs, Nicolai and Zesch, Torsten (2016), Fast or Accurate? – A Comparative Evaluation of PoS Tagging Models. In: Proceedings of the International Conference of the German Society for Computational Linguistics and Language Technology. Duisburg, Germany, 22-30.
- Zeldes, Amir and Zhang, Shuo (2016) When Annotation Schemes Change Rules Help: A Configurable Approach to Coreference Resolution beyond OntoNotes. In: Proceedings of the NAACL2016 Workshop on Coreference Resolution Beyond OntoNotes (CORBON). San Diego, CA, 92-101.
- Wojatzki, Michael, Melamud, Oren and Zesch, Torsten (2016) Bundled Gap Filling: A New Paradigm for Unambiguous Cloze Exercises. In: Proceedings of the Building Educational Applications Workshop at NAACL 2016. San Diego, CA, 172-181.
- Plank, Barbara (2016) What to do about non-standard (or non-canonical) language in NLP. In: Proceedings of the 13th Conference on Natural Language Processing (KONVENS 2016). Bochum, Germany, 13-20.
- Meyer, Niklas, Wojatzki, Michael and Zesch, Torsten (2016) Validating Bundled Gap Filling – Empirical Evidence for Ambiguity Reduction and Language Proficiency Testing Capabilities. In: Proceedings of the NLP4CAL at SLTC 2016. Umeå, Sweden, 2016.
- Horsmann, Tobias and Zesch, Torsten (2016) Assigning Fine-grained PoS Tags based on High-precision Coarse-grained Tagging. In: Proceedings of COLING 2016. Osaka, 328-336.
- Krause, Thomas, Leser, Ulf and Lüdeling, Anke (2016) graphANNIS: A Fast Query Engine for Deeply Annotated Linguistic Corpora. Journal for Language Technology and Computational Linguistics 31(1), 1-25.
- Zeldes, Amir (2017) The GUM Corpus: Creating Multilayer Resources in the Classroom. Language Resources and Evaluation 51(3), 581-612. (this is the reference paper for citing the corpus)
- Zeldes, Amir (2017) A Distributional View of Discourse Encapsulation: Multifactorial Prediction of Coreference Density in RST. In: 6th Workshop on Recent Advances in RST and Related Formalisms at INLG. Santiago de Compostela, Spain.
- Peng, Siyao and Zeldes, Amir (2018) All Roads Lead to UD: Converting Stanford and Penn Parses to English Universal Dependencies with Multilayer Annotations. In: Proceedings of the Joint Workshop on Linguistic Annotation, Multiword Expressions and Constructions (LAW-MWE-CxG-2018) at COLING2018. Santa Fe, NM, 167-177.
- Rodriguez, Juan Diego, Caldwell, Adam and Liu, Alexander (2018) Transfer Learning for Entity Recognition of Novel Classes. In: Proceedings of COLING 2018. Santa Fe, NM, 1974-1985.
- Zeldes, Amir (2018) A Multi-Dimensional Analysis of RST Discourse Relations in Eight Genres. In: 14th American Association of Corpus Linguistics Conference (AACL 2018). Atlanta, GA.
- Peng, Siyao and Zeldes, Amir (2018) Validating and Merging a Growing Multilayer Corpus – the Case of GUM. In: 14th American Association of Corpus Linguistics Conference (AACL 2018). Atlanta, GA.
- Prange, Jakob, Schneider, Nathan and Abend, Omri (2019) Semantically Constrained Multilayer Annotation: The Case of Coreference. In: First International Workshop on Designing Meaning Representations (DMR). Florence, Italy.
- Yan, Jianwei and Liu, Haitao (2019) Which annotation scheme is more expedient to measure syntactic difficulty and cognitive demand? In: Proceedings of the First Workshop on Quantitative Syntax (Quasy, SyntaxFest 2019). Paris, France, 16-24.
- Philippe Muller, Chloé Braud, Mathieu Morey (2019) ToNy: Contextual embeddings for accurate multilingual discourse segmentation of full documents. In: Proceedings of the Workshop on Discourse Relation Parsing and Treebanking 2019. Minneapolis, MN, 115–124.
- Gessler, Luke, Peng, Siyao, Liu, Yang, Zhu, Yilun, Behzad, Shabnam and Zeldes, Amir (2020) AMALGUM – A Free, Balanced, Multilayer English Web Corpus. In: Proceedings of LREC 2020. Marseille, France, 5267-5275.
- Behzad, Shabnam and Zeldes, Amir (2020) A Cross-Genre Ensemble Approach to Robust Reddit Part of Speech Tagging. In: Proceedings of the 12th Web as Corpus Workshop (WAC-XII). Marseille, France, 50–56.
- Sanguinetti, Manuela, Bosco, Cristina, Cassidy, Lauren, Çetinoğlu, Özlem, Cignarella, Alessandra Teresa, Lynn, Teresa, Rehbein, Ines, Ruppenhofer, Josef, Seddah, Djamé and Zeldes, Amir (2020) Treebanking User-Generated Content: A Proposal for a Unified Representation in Universal Dependencies. In: Proceedings of LREC 2020. Marseille, France, 5240-5250.
- Hou, Yutai, Wanxiang Che, Yongkui Lai, Zhihan Zhou, Yijia Liu, Han Liu and Ting Liu (2020) Few-shot Slot Tagging with Collapsed Dependency Transfer and Label-enhanced Task-adaptive Projection Network. In: Proceedings of ACL 2020. Seattle, WA.
- Lan, Ouyu, Huang, Xiao, Lin, Bill Yuchen, Jiang, He, Liu, Liyuan and Ren, Xiang (2020) Learning to Contextually Aggregate Multi-Source Supervision for Sequence Labeling. In: Proceedings of ACL 2020, 2134-2146.
- Hao, Zhifeng, Lu, Di, Li, Zijian, Cai, Ruichu, Wen, Wen and Xu, Boyan (2021) Semi-Supervised Disentangled Framework for Transferable Named Entity Recognition. Neural Networks 135, 127-138.
- Sun, Kun, Wang, Rong and Xiong, Wenxin (2021) Investigating genre distinctions through discourse distance and discourse network. Corpus Linguistics and Linguistic Theory.
- Sarti, Gabriele, Brunato, Dominique and Dell'Orletta, Felice (2021) That Looks Hard: Characterizing Linguistic Complexity in Humans and Language Models. In: Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics at NAACL 2021. Online, 48-60.
@Article{Zeldes2017,
  author = {Amir Zeldes},
  title = {The {GUM} Corpus: Creating Multilayer Resources in the Classroom},
  journal = {Language Resources and Evaluation},
  year = {2017},
  volume = {51},
  number = {3},
  pages = {581--612},
  doi = {http://dx.doi.org/10.1007/s10579-016-9343-x}
}

@InProceedings{BehzadZeldes2020,
  author = {Shabnam Behzad and Amir Zeldes},
  title = {A Cross-Genre Ensemble Approach to Robust {R}eddit Part of Speech Tagging},
  booktitle = {Proceedings of the 12th Web as Corpus Workshop (WAC-XII)},
  pages = {50--56},
  year = {2020},
  url = {https://aclanthology.org/2020.wac-1.7/}
}

@inproceedings{lin-zeldes-2021-wikigum,
  title = {{W}iki{GUM}: Exhaustive Entity Linking for Wikification in 12 Genres},
  author = {Jessica Lin and Amir Zeldes},
  booktitle = {Proceedings of The Joint 15th Linguistic Annotation Workshop (LAW) and 3rd Designing Meaning Representations (DMR) Workshop (LAW-DMR 2021)},
  year = {2021},
  address = {Punta Cana, Dominican Republic},
  url = {https://aclanthology.org/2021.law-1.18},
  pages = {170--175},
}

@InProceedings{ZhuEtAl2021,
  author = {Yilun Zhu and Sameer Pradhan and Amir Zeldes},
  booktitle = {Proceedings of ACL-IJCNLP 2021},
  title = {{OntoGUM}: Evaluating Contextualized {SOTA} Coreference Resolution on 12 More Genres},
  year = {2021},
  address = {Bangkok, Thailand},
  pages = {461--467},
  url = {https://aclanthology.org/2021.acl-short.59.pdf}
}
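As a minimal sketch of how the BibTeX entries above might be used, assuming they have been saved to a file named gum.bib (a hypothetical filename, not something provided with the corpus) and that the standard plain bibliography style is acceptable for your venue, a LaTeX document could cite the corpus roughly as follows:

```latex
\documentclass{article}
\begin{document}

% General citation for the corpus
We use the GUM corpus \cite{Zeldes2017}.

% Cite only those papers that match the data actually used
We also use the Reddit subcorpus \cite{BehzadZeldes2020},
the entity linking annotations \cite{lin-zeldes-2021-wikigum},
and the OntoGUM coreference annotations \cite{ZhuEtAl2021}.

\bibliographystyle{plain}
% Assumes the entries are stored in gum.bib (hypothetical filename)
\bibliography{gum}

\end{document}
```

The citation keys are the ones defined in the entries above; the surrounding sentences are placeholders only.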
The ISLRN for the corpus is 421-566-418-865-2.
Papers using GUM
This is a (non-exhaustive) list of papers using the GUM corpus; feel free to let us know if you know of more:
For other research citing GUM, see also the Semantic Scholar entry for the reference paper.