A version-controlled annotation interface for XML and spreadsheet data.
GitDox is the Git Data-storage online XML and spreadsheet linguistic annotation editor. Its main features are:
For more information and to cite this tool, please refer to the following paper:
GitDox lets you track progress on annotation projects:
In XML mode, GitDox provides a browser-based XML editor for entering annotated text in any annotation scheme. You can assign an XSD schema to validate each document; results are reported to the user in the editor, and for all documents in the project dashboard.
You can also use CodeMirror's built-in syntax highlighting and auto-complete options to define your own tags.
You can use the spreadsheet mode to annotate tokens of running text in a one-token-per-line format. Other columns express various annotations, with multirow spans used to represent annotations that cover multiple tokens (e.g. entity annotations, chunks, paragraphs, etc.). Multiple users can collaborate in the spreadsheet simultaneously using EtherCalc's backend.
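As a rough illustration of the one-token-per-line layout, the Python sketch below models a sheet as a list of tokens plus span annotations, then expands the spans into per-row cells. The names `Span` and `expand_spans` are hypothetical and are not part of GitDox or EtherCalc; they only show how a multirow span relates to the token rows it covers.

```python
from dataclasses import dataclass


@dataclass
class Span:
    column: str  # annotation column name, e.g. "entity"
    start: int   # first token row covered (0-based, inclusive)
    end: int     # last token row covered (inclusive)
    value: str   # annotation value shown in the merged cell


def expand_spans(tokens, spans):
    """Return one dict per token row, mapping column name to annotation value."""
    rows = [{"tok": tok} for tok in tokens]
    for span in spans:
        # A multirow span contributes the same value to every row it covers.
        for i in range(span.start, span.end + 1):
            rows[i][span.column] = span.value
    return rows


tokens = ["Mary", "Smith", "visited", "Rome"]
spans = [
    Span("entity", 0, 1, "person"),  # two-token span
    Span("entity", 3, 3, "place"),   # single-token span
    Span("pos", 2, 2, "VERB"),
]
rows = expand_spans(tokens, spans)
```

Here the `entity` span over rows 0 and 1 corresponds to a single merged cell next to "Mary" and "Smith" in the spreadsheet view.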
Beyond using XSD schemas, you can define validation rules for metadata and spreadsheet content that apply to all documents, individual corpora, or individual documents. Rules can require metadata and annotation columns to be:
Spreadsheet annotations can also be required to:
Validation violations are reported in the editor window and dashboard along with the violating line numbers, and the corresponding cells are automatically highlighted.
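To make the reporting of violating line numbers concrete, here is a minimal Python sketch. It assumes a hypothetical rule that every value present in a given column must fully match a regular expression; the function `check_column` and the row layout are illustrative assumptions, not GitDox's actual rule engine or rule syntax.

```python
import re


def check_column(rows, column, pattern):
    """Return 1-based row numbers whose value in `column` does not fully match `pattern`."""
    regex = re.compile(pattern)
    violations = []
    for number, row in enumerate(rows, start=1):
        value = row.get(column)
        # Rows without a value in this column are ignored by this hypothetical rule.
        if value is not None and not regex.fullmatch(value):
            violations.append(number)
    return violations


rows = [
    {"tok": "Mary", "pos": "NOUN"},
    {"tok": "visited", "pos": "verb"},  # lowercase tag violates the pattern
    {"tok": "Rome", "pos": "NOUN"},
]
bad_rows = check_column(rows, "pos", r"[A-Z]+")
```

The returned row numbers are the kind of information an editor could use to highlight the offending cells.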
GitDox stores data locally on the server, but also uses GitHub as a backend for version control. Users with commit permissions can commit serialized XML files of their annotations, or SGML files of spreadsheets with conflicting spans, using the Corpus Workbench vertical format. GitDox users can be paired with GitHub user names and passwords to allow use of GitHub's own version history tools and user tracking: commit messages and committing users are logged as if they had committed to GitHub using Git themselves.
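The following Python sketch shows why spreadsheets with conflicting spans end up as SGML rather than well-formed XML. It writes one token per line, with an opening tag on its own line before the first token of a span and a closing tag after the last, in the general style of a Corpus Workbench vertical file. The function `to_vertical` and the `value` attribute name are assumptions made for illustration; GitDox's actual export may name and order tags differently.

```python
from html import escape


def to_vertical(tokens, spans):
    """Serialize tokens and (column, start, end, value) spans, one item per line."""
    opens = {}   # row index -> tags to emit before that token
    closes = {}  # row index -> tags to emit after that token
    for column, start, end, value in spans:
        tag = f'<{column} value="{escape(value, quote=True)}">'
        opens.setdefault(start, []).append(tag)
        closes.setdefault(end, []).append(f"</{column}>")
    lines = []
    for i, tok in enumerate(tokens):
        lines.extend(opens.get(i, []))
        lines.append(escape(tok))
        # No attempt is made to nest tags properly: overlapping spans
        # simply produce crossing open and close tags.
        lines.extend(closes.get(i, []))
    return "\n".join(lines)


tokens = ["a", "b", "c"]
spans = [
    ("chunk", 0, 1, "NP"),      # covers tokens a and b
    ("entity", 1, 2, "place"),  # covers tokens b and c, overlapping the chunk
]
vertical = to_vertical(tokens, spans)
```

In this example `</chunk>` is emitted while `<entity>` is still open, so the output cannot be parsed as XML even though each span is individually balanced.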
Because only one branch and no forks are used in the repo configured for each document, it is impossible to create version conflicts, and annotators do not need to learn how to use GitHub: the latest committed version always overwrites the previous one, and in case of errors, data can be restored from GitHub itself.
GitDox has been developed at Georgetown University primarily by:
Support for GitDox development has been provided by the National Endowment for the Humanities (NEH) and the German Research Council (DFG) as part of the bilateral KELLIA project.
© 2017 Amir Zeldes. Code released under the Apache 2.0 License.