GUM - The Georgetown University Multilayer Corpus

Overview

A multilayer corpus for research on discourse models

Latest stable version: 7.0.0
Latest commit:

What's new in GUM?

GUM is an open source multilayer corpus of richly annotated texts from 24 text types. Annotations include:

Multiple POS tags, morphological features, morphological segmentation and lemmatization
Sentence segmentation and rough speech act
Document structure in TEI XML (paragraphs, headings, figures, etc.)
Normalized ISO date/time annotations
Speaker and addressee information (where relevant)
Constituent and (enhanced) Universal Dependencies syntax
Select grammatical constructions using Construction Grammar (CxG)
Information status (given, accessible, new, split antecedent) and graded salience (scores of 0-6 for all entities)
Entity and coreference annotation, including bridging anaphora
Entity linking (Wikification)
Discourse parses in enhanced Rhetorical Structure Theory (eRST), including connective detection, non-projective relations and and discourse dependencies
Shallow discourse relation annotations according to the PDTB v3 guidelines, including explicit and implicit connectives, alternative lexicalizations/AltLexC, entity relations and hypophora
Abstractive summarization, with five summaries per document

The corpus is collected and expanded by students as part of the curriculum in LING-4427 Computational Corpus Linguistics at Georgetown University. Each year students begin by choosing a text from within one of four possible genres, and as we learn about different annotation types and standards, participants are responsible for analyzing their own document, to which they add more and more layers of analysis: from part-of-speech tagging, through treebanking, entity recognition, discourse parsing, and more. Texts are chosen from openly available sources, and students who wish to contribute their analyses at the end of semester can do so under a Creative Commons license. The resulting data is checked for consistency and published online via GitHub. See this page for a list of contributors.

Text types and sources

Genre, modality, intended recipients, background knowledge and communicative intent all influence how we use language extensively. The selection of text types in GUM is meant to represent different communicative settings, while coming from sources that are readily and openly available, so that new texts can be annotated and published with ease, without restrictive licenses and free of charge. In order to support a collaborative environment, each year we work on texts from four genres, creating small groups of students conducting research on one type of texts, which can be compared with three others within the classroom. Every three years, we change genres and select four new types of data to work on. The GUM corpus currently contains the following proportions of texts:

Text type	Source	Docs	Tokens
Academic writing	Various	18	17,169
Biographies	Wikipedia	20	18,213
CC Vlogs	YouTube	15	16,864
Conversations	UCSB Corpus	15	17,932
Courtroom transcripts	Various	14	17,094
Essays	Various	14	17,176
Fiction	Various	19	17,511
Forum	reddit	18	16,364
How-to guides	wikiHow	19	17,081
Interviews	Wikinews	19	18,196
Letters	Various	18	16,362
News stories	Wikinews	24	17,191
Podcasts	Various	14	16,176
Political speeches	Various	15	16,720
Textbooks	OpenStax	15	16,693
Travel guides	Wikivoyage	18	16,515
Total GUM		275	273,257
OOD GENTLE		26	17,799
All partitions		301	291,056