GitDox

A version-controlled annotation interface for XML and spreadsheet data.

GitDox is the Git Data-storage online XML and spreadsheet linguistic annotation editor. Its main features are:

  • Save data locally and directly to GitHub
  • One collaborative version per file: no forks, branches or conflicts possible
  • XML syntax highlighting and auto-complete using CodeMirror
  • Real-time collaborative spreadsheet editing with EtherCalc
  • Continuous validation with XSD schemas and spreadsheet validation rules
  • Skinnable for different projects

For more information and to cite this tool, please refer to the following paper:

    Zhang, Shuo and Zeldes, Amir (2017) "GitDOX: A Linked Version Controlled Online XML Editor for Manuscript Transcription". In: Proceedings of FLAIRS 2017, Special Track on Natural Language Processing of Ancient and other Low-resource Languages. Marco Island, FL, 619-623.

GitDox Instances

  • Coptic Scriptorium - a platform for interdisciplinary and computational research in texts in the Coptic language (http://copticscriptorium.org/)
  • LING-367 - annotation interface for the GUM corpus, built as part of the course LING-367 at Georgetown

[Screenshots: XML editor and spreadsheet mode]

Project management

GitDox lets you track progress on annotation projects:

  • Assign documents to users
  • Define statuses such as annotation, adjudication, published...
  • Group multiple documents under a corpus
  • Control XML, spreadsheet and metadata validity from a central dashboard

Annotation and Validation

XML mode

In XML mode, GitDox provides a browser-based XML editor for entering annotated text in any annotation scheme. You can assign an XSD schema to validate each document; results are reported to the user in the editor, and for all documents in the project dashboard.

You can also use CodeMirror's built-in syntax highlighting and auto-complete options to define your own tags.

Spreadsheet mode

You can use spreadsheet mode to annotate tokens of running text in a one-token-per-line format. Other columns express various annotations, with multi-row spans used to represent annotations that cover multiple tokens (e.g. entity annotations, chunks, paragraphs, etc.). Multiple users can collaborate in the spreadsheet simultaneously using EtherCalc's backend.
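As an illustration of the one-token-per-line layout with multi-row spans, here is a minimal Python sketch. The `Span` class, the column names and the example tokens are hypothetical and are not GitDox's internal data model; they only show how a span column relates to token rows.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """An annotation value covering token rows start..end (0-based, inclusive)."""
    start: int
    end: int
    value: str


# One token per row (hypothetical example sentence).
tokens = ["Mary", "Smith", "visited", "New", "York"]

# Each annotation column is a list of spans over the token rows.
# "pos" has one single-row span per token; "entity" uses multi-row spans.
columns = {
    "pos": [Span(i, i, tag) for i, tag in
            enumerate(["NNP", "NNP", "VBD", "NNP", "NNP"])],
    "entity": [Span(0, 1, "person"), Span(3, 4, "place")],
}


def covered_text(tokens, span):
    """Return the text of the tokens that a span covers."""
    return " ".join(tokens[span.start:span.end + 1])


entity_texts = [(s.value, covered_text(tokens, s)) for s in columns["entity"]]
```

Here `entity_texts` pairs each entity label with the tokens it spans, which is the same information a multi-row cell conveys visually in the spreadsheet.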

Validation control

Beyond XSD schemas, you can define validation rules for metadata and spreadsheet content that apply to all documents, to individual corpora, or to individual documents. Rules can require metadata and annotation columns to:

  • be present - the column or metadatum must exist, or
  • match a certain regular expression pattern
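The two rule types above can be sketched in a few lines of Python. This is not GitDox's rule engine: the function name `validate_field`, the dictionary layout and the choice of whole-value matching (`fullmatch`) are assumptions made for illustration.

```python
import re


def validate_field(doc, name, pattern=None):
    """Check one rule against `doc`, a dict of field name -> list of values.

    Returns a list of violation messages (empty if the rule passes).
    Assumption: a pattern must match the whole value, not just a part of it.
    """
    if name not in doc:
        return [f"{name}: missing"]
    if pattern is None:
        return []  # presence-only rule, and the field exists
    regex = re.compile(pattern)
    return [
        f"{name}: row {i + 1} value {value!r} does not match {pattern!r}"
        for i, value in enumerate(doc[name])
        if regex.fullmatch(value) is None
    ]


# Hypothetical document: one annotation column and one metadatum.
doc = {"pos": ["NNP", "VBD", "nnp"], "language": ["Coptic"]}

missing = validate_field(doc, "author")           # presence rule fails
bad_pos = validate_field(doc, "pos", r"[A-Z]+")   # third row is lower case
ok = validate_field(doc, "language", r"\w+")      # passes
```

Returning row numbers with each message mirrors the idea of reporting the violating lines, so that an interface could highlight the matching cells.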

Spreadsheet annotations can also be required to:

  • have the same span (rowspan must match across columns)
  • nest - spans in one column may never be smaller than, or break across, spans in another column
  • have a specific length (e.g. some annotation must always have a row span of 1, 2, ... etc.)

Violations are reported in the editor window and dashboard together with the offending line numbers, and the corresponding cells are highlighted automatically.
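The three span constraints can be sketched as follows, treating each span as a `(start, end)` pair of 0-based, inclusive row indices. The function names and the exact reading of "nest" (any span in the outer column that overlaps an inner span must fully contain it) are assumptions for illustration, not GitDox's implementation.

```python
def same_span_violations(col_a, col_b):
    """Spans in col_a that have no identical span in col_b."""
    spans_b = set(col_b)
    return [span for span in col_a if span not in spans_b]


def nesting_violations(inner, outer):
    """Pairs (inner span, outer span) that overlap without containment."""
    violations = []
    for start, end in inner:
        for o_start, o_end in outer:
            overlaps = o_start <= end and start <= o_end
            contains = o_start <= start and end <= o_end
            if overlaps and not contains:
                violations.append(((start, end), (o_start, o_end)))
    return violations


def length_violations(col, length):
    """Spans whose row count differs from the required length."""
    return [(s, e) for (s, e) in col if e - s + 1 != length]


# Hypothetical columns over five token rows.
entity = [(0, 1), (3, 4)]
chunk = [(0, 1), (2, 2), (3, 4)]
paragraph = [(0, 2), (3, 4)]
crossing = [(1, 3)]  # breaks across both paragraph spans

same_ok = same_span_violations(entity, chunk)
same_bad = same_span_violations(entity, paragraph)
nest_ok = nesting_violations(entity, paragraph)
nest_bad = nesting_violations(crossing, paragraph)
short = length_violations(chunk, 2)
```

Each function returns the offending spans rather than a boolean, so a caller can map them back to row numbers for reporting.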

Version Control

GitDox stores data locally on the server, but also uses GitHub as a backend for version control. Users with commit permissions can commit serialized XML files of their annotations, or SGML files of spreadsheets with conflicting spans, using the Corpus Workbench vertical format. GitDox users can be paired with GitHub user names and passwords to allow use of GitHub's own version history tools and user tracking: commit messages and committing users are logged as if the users had committed to GitHub with Git themselves.

Because only one branch and no forks are used in the repo configured for each document, it is impossible to create version conflicts, and annotators do not need to learn how to use GitHub: the latest committed version always overwrites the previous one, and in case of errors, data can be restored from GitHub itself.
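To make the vertical serialization of spreadsheets concrete, here is a rough Python sketch that writes one token per line with span tags on their own lines. The tag and attribute naming (`<entity entity="person">`) and the function `to_vertical` are assumptions; the files GitDox actually commits may be laid out differently. Because overlapping spans are closed wherever they end, the output need not be well-formed XML, which is why such files are treated as SGML.

```python
def to_vertical(tokens, columns):
    """Serialize tokens and span columns to a vertical-format string.

    columns: dict of element name -> list of (start, end, value),
    with 0-based, inclusive row indices. Attribute values are not
    escaped in this sketch.
    """
    opens = {}   # row index -> opening tags emitted before the token
    closes = {}  # row index -> closing tags emitted after the token
    for name, spans in columns.items():
        for start, end, value in spans:
            opens.setdefault(start, []).append(f'<{name} {name}="{value}">')
            closes.setdefault(end, []).append(f"</{name}>")

    lines = []
    for row, token in enumerate(tokens):
        lines.extend(opens.get(row, []))
        lines.append(token)
        lines.extend(closes.get(row, []))
    return "\n".join(lines)


# Hypothetical example with conflicting spans: "entity" covers rows 0-1,
# "chunk" covers rows 1-2, so neither span nests inside the other.
tokens = ["Mary", "Smith", "visited"]
columns = {
    "entity": [(0, 1, "person")],
    "chunk": [(1, 2, "x")],
}
vertical = to_vertical(tokens, columns)
```

In the resulting text, `</entity>` appears before `</chunk>` even though `<entity>` was opened first, which illustrates how conflicting spans lead to SGML-style rather than XML-style markup.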

Acknowledgments

GitDox has been developed at Georgetown University primarily by:

  • Emma Manning
  • Amir Zeldes
  • Shuo Zhang

Support for GitDox development has been provided by the National Endowment for the Humanities (NEH) and the German Research Foundation (DFG) as part of the bilateral KELLIA project.


© 2017 Amir Zeldes. Code released under the Apache 2.0 License.