Publications and citation information

License and attribution information

GUM is made available under a Creative Commons license in keeping with the underlying texts. The documents from Wikimedia (Wikinews, including interviews, and Wikivoyage) are available under a CC-BY (attribution) license, as are academic articles, Wikipedia biographies, OpenStax textbooks and YouTube vlogs (retrieved using YouTube's Creative Commons filtered search). Some of the political speeches included in the corpus do not specify exact licenses, but are made available by official government and UN websites which indicate that these speeches are in the public domain and not subject to copyright. Conversations from the Santa Barbara Corpus have been made available for annotation in GUM under the CC-BY license, courtesy of Jack DuBois (UCSB).

However, please note that wikiHow texts and fiction texts are made available under a CC-BY-NC-SA license (non-commercial, share alike), meaning that commercial and/or non-open-source use of those texts is prohibited. Data from reddit forum discussions is not distributed with the corpus, but can be obtained using a script under the licensing conditions imposed by reddit. When using the data, please make sure to cite the sources of the texts as required by their source sites, and give credit to the GUM annotators, who are listed below, for the annotated data.

Academic citations

As a scholarly citation for the corpus in articles, please use the paper most closely matching your use case: