Download
Multiple formats, open access

You can download the entire corpus or separate annotation layers in the following formats. Before using the data, please read about reconstructing the Reddit token data, which is not included in the downloadable version but can be restored using a script. If you are interested in other subsets or formats of the data, please contact Amir Zeldes.
Format | Annotations |
---|---|
relANNIS3.3 | all (merged), for search with ANNIS |
PAULA XML | all (merged), in standoff XML |
TreeTagger/CWB/CQPWeb XML | token annotations and TEI, including sentence types and speakers |
Penn-style brackets | tokens, POS tags, constituent categories and PTB functions |
CoNLL-U | UD dependencies, sentence types, speakers, entities, coreference, Wikification and RST dependencies |
CoNLL coreference format | untyped coreference and entities, excluding bridging relations |
WebAnno TSV3 format | typed coreference, including bridging, entity types, Wikification and information structure |
Rhetorical Structure Theory | untokenized text with RST analyses in .rs3 XML, Lisp-style brackets and RST dependencies |
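
As a rough illustration of working with one of the formats above, the following is a minimal sketch of reading CoNLL-U data in Python. It relies only on the general CoNLL-U conventions (ten tab-separated columns per token, comment lines starting with `#`, and a blank line between sentences). The function name `read_conllu`, the field list `CONLLU_FIELDS`, and the sample sentence are hypothetical and are not taken from the corpus. The corpus-specific annotations listed in the table are not decoded here; they would simply be kept as raw strings.

```python
import io

# Standard CoNLL-U column names, in order.
CONLLU_FIELDS = ["id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc"]


def read_conllu(stream):
    """Yield (comments, tokens) for each sentence in a CoNLL-U stream.

    comments is a list of comment strings without the leading '#';
    tokens is a list of dicts keyed by CONLLU_FIELDS.
    """
    comments, tokens = [], []
    for raw in stream:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if tokens:
                yield comments, tokens
            comments, tokens = [], []
        elif line.startswith("#"):
            comments.append(line[1:].strip())
        else:
            tokens.append(dict(zip(CONLLU_FIELDS, line.split("\t"))))
    # Emit a final sentence if the stream lacks a trailing blank line.
    if tokens:
        yield comments, tokens


# Invented sample sentence, for illustration only (not corpus data).
SAMPLE = (
    "# sent_id = example-1\n"
    "# text = Dogs bark.\n"
    "1\tDogs\tdog\tNOUN\tNNS\tNumber=Plur\t2\tnsubj\t_\t_\n"
    "2\tbark\tbark\tVERB\tVBP\t_\t0\troot\t_\tSpaceAfter=No\n"
    "3\t.\t.\tPUNCT\t.\t_\t2\tpunct\t_\t_\n"
    "\n"
)

if __name__ == "__main__":
    for comments, tokens in read_conllu(io.StringIO(SAMPLE)):
        print(comments)
        print([(t["form"], t["upos"], t["deprel"]) for t in tokens])
```

To read a downloaded file instead of the inline sample, an open file handle (for example `open(path, encoding="utf-8")`, with `path` pointing at a `.conllu` file) can be passed in place of the `io.StringIO` object.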