Search interfaces

The lab maintains two corpus search interfaces, which give students and the general public access to language data and statistical analysis tools, as well as an online dictionary.

NLP tools

We develop a number of NLP tools that help to build corpora automatically, or feed into manual correction loops:

  • RFTokenizer - a state-of-the-art trainable segmenter for morphologically rich languages
  • DisCoDisCo - discourse segmentation, connective detection, and relation classification
  • Coptic NLP - a complete pipeline for processing Coptic data
  • xrenner - multilingual non-named entity and coreference resolution
  • HebPipe - an NLP pipeline for Hebrew

Annotation tools

We provide a number of freely available annotation tools:

  • rstWeb - open source web interface for Rhetorical Structure Theory annotation
  • GitDox - a version controlled, online XML and spreadsheet editor with built-in validation
  • DepEdit - configurable rule-based editing for dependency corpora in the CoNLL-U format
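DepEdit works by matching token attributes in CoNLL-U data and rewriting them according to user-defined rules. A minimal sketch of that idea in plain Python (the function name and rule shape here are illustrative only; DepEdit's actual rule syntax is different and more expressive):

```python
# Sketch of rule-based editing on CoNLL-U lines, in the spirit of DepEdit.
# The rule here is hardcoded as "rewrite one DEPREL value"; real DepEdit
# rules can match and transform arbitrary token attributes and subtrees.

def relabel_deprel(conllu: str, old: str, new: str) -> str:
    """Rewrite the DEPREL column (index 7) wherever it equals `old`."""
    out = []
    for line in conllu.splitlines():
        cols = line.split("\t")
        if len(cols) == 10 and cols[7] == old:
            cols[7] = new
        out.append("\t".join(cols))
    return "\n".join(out)

sample = ("1\tdogs\tdog\tNOUN\tNNS\t_\t2\tnsubj\t_\t_\n"
          "2\tbark\tbark\tVERB\tVBP\t_\t0\troot\t_\t_")
print(relabel_deprel(sample, "nsubj", "nsubj:outer"))
```

Operating on the tab-separated columns directly, as above, is what makes this kind of editing easy to script and batch over a whole corpus.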

Corpora

Several of our corpora are freely available, open source projects:

  • GUM - The Georgetown University Multilayer corpus, created and published by our students in LING-367
  • UD Hebrew IAHLTwiki - a new UD treebank of Hebrew from Wikipedia
  • Coptic Corpora under CC licenses, including the Coptic Treebank
  • AMALGUM - A Machine-Annotated Lookalike of GUM, a freely available, genre-balanced English web corpus totaling 4M tokens and featuring a large number of high-quality automatic annotation layers
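Corpora such as GUM and AMALGUM are available in CoNLL-U format (among others), where token counts like the 4M figure above exclude comment lines, multiword-token ranges, and empty nodes. A minimal sketch of counting tokens under those conventions:

```python
# Count word tokens in a CoNLL-U string, skipping comment lines,
# multiword-token ranges (IDs like "1-2") and empty nodes (IDs like "3.1").

def count_tokens(conllu: str) -> int:
    n = 0
    for line in conllu.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        tok_id = line.split("\t")[0]
        if "-" in tok_id or "." in tok_id:
            continue
        n += 1
    return n

sample = """# sent_id = 1
1-2\tcannot\t_\t_\t_\t_\t_\t_\t_\t_
1\tcan\tcan\tAUX\tMD\t_\t3\taux\t_\t_
2\tnot\tnot\tPART\tRB\t_\t3\tadvmod\t_\t_
3\tgo\tgo\tVERB\tVB\t_\t0\troot\t_\t_"""
print(count_tokens(sample))  # → 3
```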

Featured research


GUM7 – four added genres, Wikification and more!

The first release of GUM series 7 now adds four more genres to our multilayer corpus, in addition to brand new annotation layers, corrections, and more.

New features in our Coptic NLP pipeline

Coptic Scriptorium’s Natural Language Processing (NLP) tools now support two new features...

RNN reads newspaper for discourse signals

A neural network reads the newspaper... in search of discourse signals! We now know a lot about which cues people use to identify discourse relations, but can we teach computers to notice the same signals?

More research