Multiple formats, open access

GUM in CoNLLU format

You can download the entire corpus or separate annotation layers in the following formats. Please make sure to read about reconstructing Reddit token data, which is not included in the downloadable version but can be added using a script. If you are interested in other subsets or formats of the data, please contact Amir Zeldes.

relANNIS3.3: all (merged), for search with ANNIS
PAULA XML: all (merged), in standoff XML
TreeTagger/CWB/CQPWeb XML: token annotations and TEI, including sentence types and speakers
Penn style brackets: tokens, pos, constituent categories and PTB functions
CoNLL-U: UD dependencies, morphology, sentence types, speakers, entities, coreference, Wikification and RST dependencies
CoNLL coreference format: untyped coreference and entities, excluding bridging relations
WebAnno TSV3 format: typed coreference, including bridging, entity types, Wikification and information structure
Rhetorical Structure Theory++: untokenized text with RST analyses in .rs4 XML, lisp brackets, DISRPT formats and RST dependencies
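
As a rough illustration of working with the CoNLL-U files, the sketch below reads sentences from CoNLL-U text using only the Python standard library. It assumes the standard CoNLL-U layout (ten tab-separated columns per token, comment lines starting with `#`, blank lines between sentences). The function name `read_conllu` and the sample sentence are invented for illustration and are not taken from the corpus or its tools.

```python
from typing import Dict, Iterator, List

# The ten standard CoNLL-U columns, in order.
CONLLU_FIELDS: List[str] = [
    "id", "form", "lemma", "upos", "xpos",
    "feats", "head", "deprel", "deps", "misc",
]


def read_conllu(text: str) -> Iterator[Dict[str, list]]:
    """Yield one dict per sentence with its comment lines and token rows.

    Hypothetical helper: a minimal reader, not the corpus's own tooling.
    """
    comments: List[str] = []
    tokens: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if tokens:
                yield {"comments": comments, "tokens": tokens}
            comments, tokens = [], []
        elif line.startswith("#"):
            # Sentence-level metadata is stored in comment lines.
            comments.append(line[1:].strip())
        else:
            cols = line.split("\t")
            if len(cols) != len(CONLLU_FIELDS):
                raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
            tokens.append(dict(zip(CONLLU_FIELDS, cols)))
    # Handle a final sentence that is not followed by a blank line.
    if tokens:
        yield {"comments": comments, "tokens": tokens}


# Made-up sample sentence (assumption: not real corpus data).
SAMPLE = (
    "# sent_id = example-1\n"
    "# text = Dogs bark.\n"
    "1\tDogs\tdog\tNOUN\tNNS\tNumber=Plur\t2\tnsubj\t_\t_\n"
    "2\tbark\tbark\tVERB\tVBP\t_\t0\troot\t_\tSpaceAfter=No\n"
    "3\t.\t.\tPUNCT\t.\t_\t2\tpunct\t_\t_\n"
    "\n"
)

sentences = list(read_conllu(SAMPLE))
for sentence in sentences:
    print([tok["form"] for tok in sentence["tokens"]])
```

The reader keeps every column as a raw string, so the values in the MISC and comment fields remain available for whatever further parsing a given annotation layer needs.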