GitDox - Documentation

Disclaimer

GitDox is still at a very preliminary release stage, and in places it may still be hard-wired to configuration options specific to our Georgetown server. We are working to reduce or eliminate these dependencies and make everything configurable, but this is a work in progress.

Please contact Amir Zeldes if you require assistance in getting GitDox up and running.

Installing

Requirements

Apache or similar web server
Python 2.7
github3.py
EtherCalc (only if using the spreadsheet editor)

Instructions

To install GitDox itself, simply download the GitHub repository's master branch and unpack to a directory accessible to your Web server (e.g. /var/www/html/gitdox/), then install github3.py:

pip install github3.py

If you plan to use the EtherCalc spreadsheet editor, you must also install EtherCalc following the instructions on EtherCalc's website. It is advisable to run both EtherCalc and GitDox from the same host, so that users do not see cross-site scripting warnings or lose access because of security settings.

Log in

User files are stored in users/ as plain text files with encrypted passwords. The initial password for the admin user is pass1. You should be able to log in and change your password by clicking on "admin". The admin user can create more user files, with three permission levels:

User - can save files, but not create/delete documents or commit to GitHub
Committer - can also commit to GitHub
Admin - super user

Configuration

Main configuration

Most project-specific configuration is found in users/config.ini. You will probably want to edit the following entries:

skin = css/your_stylesheet.css
project = your_projectname
banner = http://mysite/nav.html # OR: banner.html, a file in templates/
editor_help_link = <p>HTML with a help link to display in editor</p>
xml_nlp_button = """<i class="fa fa-cool"/> Caption1""" # Caption for XML NLP button
spreadsheet_nlp_button = """<i class="fa fa-cool2"/> Caption2""" # Same for Ether NLP
xml_nlp_api = http://myserver.org/api # API target for XML NLP function
spreadsheet_nlp_api = http://myserver.org/api2 # API target for Ether NLP function
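To illustrate the `key = value # comment` layout of these entries, here is a minimal sketch that reads them into a dictionary. It is not GitDox's own configuration loader: `parse_config_text` is a hypothetical helper, and it assumes one entry per line, with values either bare or wrapped in triple quotes.

```python
import re

def parse_config_text(text):
    """Read 'key = value  # comment' lines into a dict (illustrative only)."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, full-line comments, and anything without '='
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        value = value.strip()
        if value.startswith('"""'):
            # Triple-quoted value: keep everything up to the closing quotes
            end = value.find('"""', 3)
            value = value[3:end] if end != -1 else value[3:]
        else:
            # Bare value: drop a trailing comment introduced by whitespace + '#'
            value = re.split(r"\s+#", value, maxsplit=1)[0]
        settings[key.strip()] = value
    return settings

example = '''skin = css/your_stylesheet.css
banner = http://mysite/nav.html # OR: banner.html, a file in templates/
xml_nlp_button = """<i class="fa fa-cool"/> Caption1""" # Caption for XML NLP button
'''
for key, value in sorted(parse_config_text(example).items()):
    print(key, "->", value)
```

Requiring whitespace before `#` keeps URLs that contain a fragment (such as `http://mysite/page#top`) intact; whether the real loader treats such values the same way is not confirmed here.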

CodeMirror auto-complete

The XML editor uses CodeMirror for syntax highlighting, and can be configured to suggest XML tags and attributes (auto-complete). The specification follows the format described in the CodeMirror documentation and should replace the default one in templates/codemirror.html.

There is also an experimental tool that converts XSD schemas to CodeMirror specifications here: http://q42jaap.github.io/xsd2codemirror/

Spreadsheets

In spreadsheet mode, it is assumed that tokenized data is being annotated, such that every token (i.e. word form) occupies its own non-empty row. The token's string value is placed by convention in the leftmost column (A), which is given the header tok. This column may not contain spans (merged cells).

Subsequent annotation layers are each given a column with a header containing no spaces, which specifies the annotation's name. Headers are unique, i.e. multiple columns with the same name are not supported. Annotation columns may contain spans merged across rows, but cells from multiple columns may not be merged.

Merged cells need not nest properly across columns, although this can be enforced using validation rules.
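As a rough illustration of this layout, the sketch below models each column as a list of row spans and finds spans in two columns that overlap without nesting. The `(first_row, last_row, value)` tuples, the example columns and the `crossing_pairs` helper are assumptions made for this example, not GitDox's internal representation.

```python
# Each column is a list of (first_row, last_row, value) spans,
# using 1-based, inclusive token row numbers (illustrative model).
tok = [(1, 1, "The"), (2, 2, "old"), (3, 3, "man"), (4, 4, "slept")]
phrase = [(1, 3, "NP"), (4, 4, "VP")]
quote = [(3, 4, "q1")]  # starts inside the NP but ends outside it

def crossing_pairs(col_a, col_b):
    """Return value pairs for spans that overlap without one containing the other."""
    pairs = []
    for a_first, a_last, a_val in col_a:
        for b_first, b_last, b_val in col_b:
            overlap = a_first <= b_last and b_first <= a_last
            a_in_b = b_first <= a_first and a_last <= b_last
            b_in_a = a_first <= b_first and b_last <= a_last
            if overlap and not (a_in_b or b_in_a):
                pairs.append((a_val, b_val))
    return pairs

print(crossing_pairs(tok, phrase))    # tokens always nest inside phrases here
print(crossing_pairs(phrase, quote))  # the NP and the quote cross
```

Crossing spans like the NP and the quote are allowed by default; a nesting validation rule is what would flag them.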

Validations

GitDox can validate three types of information:

XML schemas
Spreadsheet annotations
Metadata

Single documents can be validated from their editor interface. To validate all documents, go to the document list and click the validate button. The validation column will show green for passed validations, and red for failing documents. Hover on a red icon to see validation errors.

XML schemas

All XML schemas (.xsd files) used by the interface should be placed in the schemas/ directory. They will then become available in the schema drop-down in the document's edit mode. Once a schema is set, the document will be validated against it both in the editor and from the document list dashboard.

Spreadsheet and metadata validation

You can configure validation rules for metadata and spreadsheet annotations from the validation management interface, accessible to administrators from the document list.

The editable validation table contains the following columns:

corpus: a string or regular expression specifying corpora for which this rule applies. Leave empty (null) to apply to all corpora
document: same as corpus, but filters by document name
domain: either meta for metadata, or ether for spreadsheet validation
name: name of the metadata key or annotation column header being checked
operator:
    exists: the specified annotation must exist
    ~ (regex match): the annotation value must match this regular expression
    | (span length, ether only): exact number of rows that annotations on this layer may span
    = (span match, ether only): annotations in this column must have the exact same row span as those in the object column (e.g. pos=tok)
    > (nest, ether only): annotations in this column must have the same or larger row span as those in the object column (e.g. sent>phrase)
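The operators above can be sketched as small checks over row spans. This is one possible reading, not GitDox's validation code: the `(first_row, last_row, value)` span model and all function names are hypothetical, `~` is assumed to require a match on the whole value, `=` is read as "every span coincides with some span in the object column", and `>` as "no span in the object column overlaps a span here without lying inside it".

```python
import re

def matches_regex(spans, pattern):
    """'~': every value matches the pattern (whole-value match is an assumption)."""
    anchored = re.compile("(?:" + pattern + r")\Z")
    return all(anchored.match(value) for _, _, value in spans)

def has_span_length(spans, length):
    """'|': every annotation covers exactly `length` rows."""
    return all(last - first + 1 == length for first, last, _ in spans)

def same_spans(spans, object_spans):
    """'=': every span has the same first and last row as a span in the object column."""
    targets = {(first, last) for first, last, _ in object_spans}
    return all((first, last) in targets for first, last, _ in spans)

def nests(spans, object_spans):
    """'>': object spans that overlap a span in this column lie entirely inside it."""
    for o_first, o_last, _ in object_spans:
        for first, last, _ in spans:
            overlaps = first <= o_last and o_first <= last
            contains = first <= o_first and o_last <= last
            if overlaps and not contains:
                return False
    return True

tok = [(1, 1, "The"), (2, 2, "old"), (3, 3, "man")]
pos = [(1, 1, "DT"), (2, 2, "JJ"), (3, 3, "NN")]
phrase = [(1, 3, "NP")]
sent = [(1, 3, "s1")]

print(matches_regex(pos, r"[A-Z]+"), has_span_length(pos, 1))
print(same_spans(pos, tok), nests(sent, phrase))
```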

Committing

In XML mode, users can save their work using the save button. In spreadsheet mode, any change is automatically saved to the server.

Users with the commit permission may also provide a commit message and push their current version to GitHub using the credentials supplied in their user file. Two-factor authentication may also be turned on, in which case a further box for entering a one-time code will appear in the editor.

XML data is committed to GitHub as is, while spreadsheet data is serialized as CWB SGML. Metadata is placed in a tag surrounding the document, with key-value pairs for each metadatum. If you want to suppress a sensitive metadatum from being committed, you can prefix its key with ignore:.
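The effect of the ignore: prefix can be sketched as a filter applied when the surrounding tag is built. This is an illustration under stated assumptions rather than GitDox's serializer: `wrap_with_metadata` is a hypothetical helper, and the wrapper tag name is passed in by the caller (the `doc` used below is only a placeholder).

```python
from xml.sax.saxutils import quoteattr

def wrap_with_metadata(body, metadata, wrapper_tag):
    """Wrap serialized content in a tag whose attributes are the metadata pairs."""
    # Keys prefixed with 'ignore:' are left out of the committed output
    kept = sorted((key, value) for key, value in metadata.items()
                  if not key.startswith("ignore:"))
    # quoteattr escapes the value and adds the surrounding quotes
    attrs = "".join(" %s=%s" % (key, quoteattr(value)) for key, value in kept)
    return "<%s%s>\n%s\n</%s>" % (wrapper_tag, attrs, body, wrapper_tag)

metadata = {"author": "Smith", "ignore:notes": "internal remark"}
print(wrap_with_metadata("The\nman\nslept", metadata, "doc"))
```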

Importing data

In addition to simply creating a new document and pasting in content, there are several ways to import data into the interface.

You can import CWB SGML into a spreadsheet by clicking the upload and import buttons in the editor. To include metadata, wrap your SGML in a tag containing key-value pairs as attributes.

You can also batch import data in the same format from the admin interface, specifying whether it should be imported into the XML editor or as a spreadsheet.

To use batch import, place the files to be imported in a zip archive. Document names will be generated from file names inside the zip, stripping away their extensions and replacing spaces with underscores. If there is a corpus metadatum, it will be used to determine the corpus name.
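The naming rule can be sketched as follows. `document_names` is a hypothetical helper, not part of GitDox; ignoring directory entries and any folder prefix inside the archive is an assumption of this example.

```python
import io
import os
import zipfile

def document_names(zip_bytes):
    """Derive document names from the file names in a zip archive."""
    names = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for member in archive.namelist():
            if member.endswith("/"):
                continue  # skip directory entries
            base = os.path.basename(member)   # drop any folder prefix (assumption)
            stem = os.path.splitext(base)[0]  # strip the extension
            names.append(stem.replace(" ", "_"))
    return names

# Build a small archive in memory to try the function out
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("my first doc.sgml", "<s>\nHello\n</s>")
    archive.writestr("second.xml", "<text/>")
print(document_names(buffer.getvalue()))
```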

NLP services

GitDox is currently built to integrate two types of NLP services:

XML services
Spreadsheet services

XML services take data from a document's XML editor, process it, and return output that is placed back in the XML editor's content box. This can be used to tokenize data in the XML editor, enrich the XML in some way, etc.

Spreadsheet services take content from the XML editor and return either EtherCalc's SocialCalc format, or Corpus Workbench vertical SGML, which is automatically transformed into SocialCalc format by GitDox. The resulting content is then placed in the spreadsheet editor. When using CWB SGML, this means that SGML spans become spreadsheet spans.
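To show how SGML spans map onto spreadsheet spans, here is a simplified sketch that reads vertical SGML (one token or tag per line) into token rows and row spans. It is not GitDox's converter: `sgml_to_rows` is a hypothetical helper, attributes on opening tags are ignored, and nested elements with the same name are not handled.

```python
import re

OPEN_TAG = re.compile(r"^<(\w+)[^>]*>$")
CLOSE_TAG = re.compile(r"^</(\w+)>$")

def sgml_to_rows(sgml):
    """Return (tokens, spans); spans are (name, first_row, last_row), 1-based."""
    tokens, spans, open_at = [], [], {}
    for line in sgml.splitlines():
        line = line.strip()
        if not line:
            continue
        closing = CLOSE_TAG.match(line)
        opening = OPEN_TAG.match(line)
        if closing:
            first = open_at.pop(closing.group(1), None)
            # Ignore unmatched closing tags and empty elements
            if first is not None and first <= len(tokens):
                spans.append((closing.group(1), first, len(tokens)))
        elif opening:
            # The span starts at the next token row
            open_at[opening.group(1)] = len(tokens) + 1
        else:
            tokens.append(line)  # any other line is a token
    return tokens, spans

sample = "<s>\n<np>\nThe\nman\n</np>\nslept\n</s>"
tokens, spans = sgml_to_rows(sample)
print(tokens)
print(spans)
```

In the spreadsheet, each span would then become a merged cell in the column named after its tag, covering the listed rows.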

Note that GitDox does not "know" what the services do - it just sends content back and forth and expects it to be in a compatible format. It can be important to make sure that error messages from services are handled in sensible ways.
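One way to guard against a failing service is to check its output before it replaces the editor content. The sketch below is an assumption about how such a check could look, not something GitDox is known to do: `accept_service_output` is a hypothetical helper, and it assumes the expected output is a well-formed XML document with a single root element.

```python
import xml.etree.ElementTree as ET

def accept_service_output(original, returned):
    """Return (content, error): keep the original content if the output is not XML."""
    try:
        ET.fromstring(returned)  # raises ParseError on malformed input
    except ET.ParseError:
        return original, "NLP service returned malformed XML; content left unchanged"
    return returned, None

content, error = accept_service_output("<text>Hello</text>", "Internal Server Error")
print(content, "|", error)
```

A check like this only catches output that is not well-formed; an error page that happens to be valid XML would still pass, so a stricter check (for example, against the document's schema) may be needed.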

Skins

The main CSS skin for each installation is configured in config.ini. You can use any CSS stylesheet (for example bootstrap or bootswatch), and the built-in GitDox styles will usually override it where necessary.

There are also a few other graphical resources that can be easily configured:

templates/header.html - the default location for top navigation bars or other headers. You can also specify a web resource as a header, which is fetched via requests
img/logo.png - your project logo, displayed e.g. on the login screen
favicon.ico - tab and favorites icon for the browser window