I am a computational linguist specializing in work on and with corpora, including corpus linguistics studies, building corpora, and creating annotation interfaces and NLP tools that make corpus creation easier. I also run the Georgetown University Corpus Linguistics lab, Corpling@GU, and I am currently president of the ACL Special Interest Group on Annotation (SIGANN).

My main research interests are in computational models of discourse, above the sentence level: I study "how we construct discourse, given what we want to say". In particular, I have been working on predictive computational models of referentiality and discourse relations. Which entities do we track in conversation? How are they introduced into the discourse and referred back to? How do we recognize discourse relations which signal how a current utterance relates to preceding or subsequent utterances, such as by contrasting with other claims, or supporting them with evidence? How do we signal the main point of a text or a paragraph, and how do we signal supporting information?

For example, if we read even a very short text such as "Yun fell. Kim pushed her", we infer a lot of things: we understand that there are two events, that the same two people were involved in both (her=Yun), that the second probably happened before the first, and that the first was caused by the second. But how do we do this? And can we make computers understand these kinds of inferences? It turns out that computers find this very hard!

I am also interested in how we learn to be productive in our first, second and subsequent languages, producing utterances and combinations we have never heard before (though not only such utterances, and not just any of them). I believe that many factors constantly and concurrently influence the choice between competing constructions, which means that we need multifactorial methods and multilayer corpus data in order to understand what it is that we do when we produce and understand language.

Research Interests

  • Discourse relations (especially in Rhetorical Structure Theory, RST)
  • Coreference and entity resolution
  • Universal Dependencies (UD)
  • Information structure
  • Corpus linguistics
  • Building and using multilayer corpora
  • Predictive modelling of syntactic alternations
  • Digital Humanities for Coptic studies
  • Corpus search and annotation interfaces

Stuff I work on

News and events

Send me an e-mail if you'd like to join corpinfo, the GU mailing list for information on corpus linguistics events, jobs and corpus releases at GU and in the DC area. For more news, check out the Corpling@GU page.
