Overview

I am a computational linguist specializing in work on and with corpora, meaning I spend a lot of my time creating and analyzing datasets, as well as the NLP and annotation tools needed to build them. I also run the Georgetown University Corpus Linguistics lab, Corpling@GU, and I am currently president of the ACL Special Interest Group on Annotation (SIGANN).

My main research interests are in computational models of discourse, above the sentence level: I study how discourse is constructed in natural language and within computational models, such as LLMs. I'm especially interested in computational models of salience, referentiality and discourse relations, as well as the inferences that they entail. Which entities do we track in conversation, and how do we signal their relative importance for our message? How are they introduced into the discourse initially and referred back to later? How do we recognize discourse relations that signal how a current utterance relates to preceding or subsequent utterances, such as by contrasting with other claims or supporting them with evidence? How do we signal the main point of a text or a paragraph, and how do we signal supporting information?

To illustrate what I mean, consider this very short text:

Kim and Mary went home yesterday after school with their friend Jane. Suddenly Kim fell. Mary pushed her!

Reading this we probably infer:

  • "her" means "Kim" (anaphora)
  • First the three started going home, then Mary pushed Kim, then Kim fell (temporal sequence)
  • Kim fell because Mary pushed her (causality)
  • The school is the school they all go to (bridging anaphora)
  • The most salient characters are Kim and Mary, while Jane is less important (graded salience)
  • The first sentence provides context and is less important than the second and third (discourse unit nuclearity)

How do we make these inferences, which are not explicit in the text? And can we make computational models make the right ones? When models make correct inferences of this kind, we may not notice, but when they make mistakes, we call them 'hallucinations': outputs not supported by the data.

But notice that these human inferences are also not spelled out: they are not so different from LLM hallucinations, except that they are desirable ones. This is what makes representations of implicit discourse structure so important!
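
To make the kinds of annotations involved here concrete, below is a minimal sketch of how the inferences listed above could be written down as structured data. The class names, relation labels and salience ranks are purely illustrative assumptions for this example; they do not follow the conventions of eRST, the GUM corpus, or any other specific annotation scheme.

    # A toy representation of the discourse inferences discussed above.
    # Class names, relation labels and salience ranks are illustrative
    # assumptions only, not eRST/GUM annotation conventions.
    from dataclasses import dataclass

    @dataclass
    class Entity:
        name: str
        mentions: list   # coreferent surface mentions of this entity
        salience: int    # graded salience, simplified to a rank (1 = most salient)

    @dataclass
    class DiscourseRelation:
        label: str       # e.g. "cause", "context" (hypothetical labels)
        satellite: int   # index of the less central unit (nuclearity)
        nucleus: int     # index of the more central unit

    # Elementary discourse units of the example text
    units = [
        "Kim and Mary went home yesterday after school with their friend Jane.",
        "Suddenly Kim fell.",
        "Mary pushed her!",
    ]

    entities = [
        Entity("Kim", ["Kim", "her"], salience=1),          # anaphora: "her" = Kim
        Entity("Mary", ["Mary"], salience=1),
        Entity("Jane", ["their friend Jane"], salience=2),  # less salient character
        Entity("school", ["school"], salience=3),           # bridging: the school they attend
    ]

    relations = [
        DiscourseRelation("cause", satellite=2, nucleus=1),    # Kim fell BECAUSE Mary pushed her
        DiscourseRelation("context", satellite=0, nucleus=1),  # sentence 1 provides background
    ]

    for rel in relations:
        print(f"{rel.label}: [{units[rel.satellite]}] -> [{units[rel.nucleus]}]")

Even this toy structure makes explicit what the text leaves implicit: which mentions corefer, which units depend on which, and which characters matter most.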

Research Interests

  • Discourse relations (especially in enhanced Rhetorical Structure Theory, eRST)
  • Graded salience
  • Coreference and entity resolution
  • Universal Dependencies (UD)
  • Information structure
  • Corpus Linguistics
  • Building and using multilayer corpora
  • Digital Humanities for Coptic studies
  • Corpus search and annotation interfaces


News and events

Send me an e-mail if you'd like to join corpinfo, the GU mailing list for information on corpus linguistics events, jobs, and corpus releases at GU and in the DC area. For more news, check out the Corpling@GU page.
