GitDox is still at a very preliminary release stage, and in places it may still be hard-wired to configuration options specific to our Georgetown server. We are working to reduce or eliminate these dependencies and make everything configurable, but this is a work in progress.
Please contact Amir Zeldes if you require assistance in getting GitDox up and running.
To install GitDox itself, download the master branch of the GitHub repository and unpack it into a directory accessible to your web server (e.g. /var/www/html/gitdox/), then install github3.py:
pip install github3.py
If you plan to use the EtherCalc spreadsheet editor, you must also install EtherCalc following the instructions on EtherCalc's website. It is advisable to run both EtherCalc and GitDox from the same host, so that users do not see cross-site scripting warnings or lose access because of security settings.
User files are stored in users/ as plain text files with encrypted passwords. The initial password for the admin user is pass1. You should be able to log in and change your password by clicking on "admin". The admin user can create more user files, with three permission levels:
Most project-specific configuration is found in users/config.ini. You will probably want to edit the following entries:
The XML editor uses CodeMirror for syntax highlighting, and can be configured to suggest XML tags and attributes (auto-complete). The specification follows the documentation here and should replace the default one in templates/codemirror.html
There is also an experimental tool that converts XSD schemas to CodeMirror specifications here: http://q42jaap.github.io/xsd2codemirror/
In spreadsheet mode, it is assumed that tokenized data is being annotated, such that every token (i.e. word form) occupies its own non-empty row. The token's string value is placed by convention in the leftmost column (A), which is given the header tok. This column may not contain spans (merged cells).
Subsequent annotation layers are each given a column with a header containing no spaces, which specifies the annotation's name. Headers are unique, i.e. multiple columns with the same name are not supported. Annotation columns may contain spans merged across rows, but cells from multiple columns may not be merged.
Merged cells need not nest properly across columns, although this can be enforced using validation rules.
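The layout conventions above can be checked mechanically. The following is a minimal sketch, not GitDox's own validation code: the function name check_layout and its inputs are hypothetical, and it assumes the header row is row 1, so that the first token sits in row 2.

```python
def check_layout(headers, tok_column):
    """Return a list of problems found in a spreadsheet layout.

    headers: column headers from left to right, e.g. ["tok", "pos"]
    tok_column: cell values of column A below the header row
    (Both inputs are hypothetical; GitDox reads its data from EtherCalc.)
    """
    problems = []
    # Column A holds the token strings and must be headed "tok".
    if not headers or headers[0] != "tok":
        problems.append("column A must have the header 'tok'")
    # Annotation names may not contain spaces.
    for header in headers:
        if " " in header:
            problems.append("header contains a space: %r" % header)
    # Multiple columns with the same name are not supported.
    duplicates = sorted({h for h in headers if headers.count(h) > 1})
    for header in duplicates:
        problems.append("duplicate header: %r" % header)
    # Every token must occupy its own non-empty row.
    # Assumption: the header is row 1, so tokens start at row 2.
    for row, value in enumerate(tok_column, start=2):
        if not value.strip():
            problems.append("empty token in row %d" % row)
    return problems
```

An empty list means that none of these checks failed. The sketch does not look at merged cells, since how spans are represented depends on the spreadsheet backend.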
GitDox can validate three types of information:
Single documents can be validated from their editor interface. To validate all documents, go to the document list and click the validate button. The validation column will show green for documents that pass validation and red for those that fail. Hover over a red icon to see the validation errors.
All XML schemas (.xsd files) used by the interface should be placed in the schemas/ directory. They will then become available in the schema drop-down in the document's edit mode. Once a schema is set, the document will be validated against it both in the editor and from the document list dashboard.
You can configure validation rules for metadata and spreadsheet annotations from the validation management interface, accessible to administrators from the document list.
The editable validation table contains the following columns:
In XML mode, users can save their work using the save button. In spreadsheet mode, any change is automatically saved to the server.
Users with the commit permission may also provide a commit message and push their current version to GitHub using the credentials supplied in their user file. Two-factor authentication may also be turned on, in which case a further box for entering a one-time code will appear in the editor.
XML data is committed to GitHub as is, while spreadsheet data is serialized as CWB SGML. Metadata is placed in a tag surrounding the document, with key-value pairs for each metadatum. If you want to suppress a sensitive metadatum from being committed, you can prefix its key with ignore:.
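The metadata wrapping described above might look roughly like the sketch below. It is an illustration under stated assumptions rather than GitDox's actual serializer: the function name wrap_with_metadata is hypothetical, the tag name meta is an assumed default, and sorting the keys is a choice made here only to keep the output stable.

```python
from xml.sax.saxutils import quoteattr


def wrap_with_metadata(body, metadata, tag="meta"):
    """Wrap serialized document content in a tag carrying the metadata.

    body: the document content as a string
    metadata: dict mapping metadata keys to string values
    tag: name of the wrapping tag (assumed; not confirmed by the text above)
    """
    attrs = []
    for key, value in sorted(metadata.items()):
        # Keys prefixed with "ignore:" are suppressed from the commit.
        if key.startswith("ignore:"):
            continue
        # quoteattr escapes the value and adds the surrounding quotes.
        attrs.append("%s=%s" % (key, quoteattr(value)))
    open_tag = "<" + " ".join([tag] + attrs) + ">"
    return "\n".join([open_tag, body, "</%s>" % tag])
```

For example, calling wrap_with_metadata("The\ndog", {"title": "Doc 1", "ignore:email": "x@y.z"}) yields a wrapper whose opening tag carries title="Doc 1" and omits the e-mail entry.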
In addition to simply creating a new document and pasting in content, there are several ways to import data into the interface.
You can import CWB SGML into a spreadsheet by clicking the upload and import buttons in the editor. To include metadata, wrap your SGML in a tag containing key-value pairs as attributes.
You can also batch import data in the same format from the admin interface, specifying whether it should be imported into the XML editor or as a spreadsheet.
To use batch import, place the files to be imported in a zip archive. Document names will be generated from file names inside the zip, stripping away their extensions and replacing spaces with underscores. If there is a corpus metadatum, it will be used to determine the corpus name.
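The naming rule can be sketched in a few lines. This is only an illustration: the function name document_name is hypothetical, and two details are assumptions not stated above, namely that any directory part of the path inside the zip is dropped and that only the last extension is stripped.

```python
import posixpath


def document_name(path_in_zip):
    """Derive a document name from a file path inside a zip archive."""
    # Zip member names use "/" as separator on every platform.
    # Assumption: the directory part is not used in the document name.
    base = posixpath.basename(path_in_zip)
    # Assumption: only the final extension is removed.
    stem, _extension = posixpath.splitext(base)
    # Spaces are replaced with underscores.
    return stem.replace(" ", "_")
```

Under these assumptions, a file stored as "batch/my first doc.sgml" would be imported as a document named my_first_doc.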
GitDox is currently built to integrate two types of NLP services:
XML services take data from a document's XML editor, process it, and return output that is placed back in the XML editor's content box. This can be used to tokenize data in the XML editor, enrich the XML in some way, etc.
Spreadsheet services take content from the XML editor and return either EtherCalc's SocialCalc format, or Corpus Workbench vertical SGML, which is automatically transformed into SocialCalc format by GitDox. The resulting content is then placed in the spreadsheet editor. When using CWB SGML, this means that SGML spans become spreadsheet spans.
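To illustrate how SGML spans can map onto spreadsheet spans, here is a minimal sketch of reading vertical SGML into a token list plus a list of spans. It is not GitDox's converter: the function name sgml_to_spans is hypothetical, it assumes one tag or one token per line with double-quoted attribute values, and it names each span after the attribute, which is an assumption about how columns are labeled.

```python
import re

# One opening tag per line, with zero or more key="value" attributes.
OPEN_TAG = re.compile(r'^<([^\s/>]+)((?:\s+[^\s=]+="[^"]*")*)\s*>$')
CLOSE_TAG = re.compile(r'^</([^\s>]+)>$')
ATTRIBUTE = re.compile(r'([^\s=]+)="([^"]*)"')


def sgml_to_spans(sgml):
    """Return (tokens, spans) for vertical SGML input.

    Each span is a tuple (name, value, first_token, last_token), using
    0-based token indices. A spreadsheet importer could turn each span
    into a cell merged across the corresponding token rows.
    """
    tokens = []
    spans = []
    open_elements = {}  # element name -> stack of (start index, attributes)
    for line in sgml.splitlines():
        line = line.strip()
        if not line:
            continue
        closing = CLOSE_TAG.match(line)
        opening = OPEN_TAG.match(line)
        if closing:
            # Close the most recently opened element of this name.
            start, attrs = open_elements[closing.group(1)].pop()
            for name, value in attrs:
                spans.append((name, value, start, len(tokens) - 1))
        elif opening:
            attrs = ATTRIBUTE.findall(opening.group(2))
            stack = open_elements.setdefault(opening.group(1), [])
            stack.append((len(tokens), attrs))
        else:
            # Anything that is not a tag is a token on its own row.
            tokens.append(line)
    return tokens, spans
```

The sketch raises an error on a closing tag that was never opened, which is one place where a real service integration would want a clearer error message. It also keeps spans that cover no tokens, which a real importer would probably need to filter out.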
Note that GitDox does not "know" what the services do - it just sends content back and forth and expects it to be in a compatible format. It can be important to make sure that error messages from services are handled in sensible ways.
The main CSS skin for each installation is configured in config.ini. You can use any CSS stylesheet (for example bootstrap or bootswatch) and the built-in GitDox styles will usually take over where important.
There are also a few other graphical resources that can be easily configured:
© 2017 Amir Zeldes. Code released under the Apache 2.0 License.