Our SIGDIAL 2023 paper on English RST parsing errors examines and models some of the factors associated with parsing difficulties
Our LAW-XVII 2023 (co-located with ACL 2023) paper on a Genre-Diverse Multilayer Challenge Set for English NLP and Linguistic Evaluation presents GENTLE, a new mixed-genre English challenge corpus totaling 17K tokens and consisting of 8 unusual text types for out-of-domain evaluation. It is openly released as part of Universal Dependencies version 2.12, available here
Our ACL 2023 Findings paper on Multi-Genre Data and Evaluation for English Abstractive Summarization presents a 12-genre challenge set for English abstractive summarization (the extreme summarization task) following both general and genre-specific guidelines
Our EACL 2023 paper presents a thorough investigation of RST generalizability issues, with a focus on the impact of data diversity, and promotes multi-genre benchmarks for RST parsing based on our experimental results
Please join us online for Digital Coptic 3, the virtual workshop for DH projects on Coptic!
Would you like to have more data to work with? Check out our LREC paper, where we present a freely available, genre-balanced English web corpus totaling 4M tokens and featuring a large number of high-quality automatic annotation layers, including dependency trees, non-named entity annotations, coreference resolution, and discourse trees in Rhetorical Structure Theory.
Thoughts on how to treebank social media? Read our LREC paper
RFTokenizer now supports Arabic
GUM version 5.0.0 has been released!
Janet and Amir will present a paper about anchoring discourse signals in RST-DT at SCiL 2019
Logan, Janet and Amir are presenting 3 papers at AACL 2018 in Atlanta
Logan is presenting a paper about UD GUM at MASC SLL in UMBC
V2.5.0 of Coptic Scriptorium corpora is released
Amir gave a talk about discourse signals at JHU