DepEdit

A simple configurable tool for manipulating dependency trees

Overview

DepEdit reads and writes files encoded in the CoNLL dependency format (10 columns). It's a simple Python script which can:

Change token attributes:
- token text
- part of speech
- dependency function
- other columns (typically used for lemma, morphology)
Connect different tokens in the tree by setting their head feature
Base its decisions on dependency sub-graphs or token distance
Use external configuration files for different scenarios
No language- or schema-specific details are hardwired into the system

You can also import it into your projects as a preprocessing module.

For detailed instructions please see the User Guide

What is it good for?

Here are some example scenarios in which DepEdit can be helpful:

You want to encode 'want to..' as an auxiliary, but your corpus has it as a main verb
An NLP component you use expects a slightly different tree
You discover your annotators treated a fixed expression like 'the world *over*' inconsistently
There is no parser for the language you are annotating, but you want to annotate some trivial cases automatically
The latest parser has a new schema, but your old data is in an older schema with subtle differences
You need to build a very rudimentary rule-based parser (not exactly recommended, but somewhat possible)

How can I use it?

DepEdit is a self-contained Python script that is compatible with Python 2.X and 3.X and only needs a configuration file to run.

You can download the script itself (just depedit.py, without installing), or optionally install it via pip and run it as a module in your project (see below). Command line usage is either file by file, or using a glob pattern (e.g. *.conll10), in which case output files are created with a configurable suffix such as '.depedit' before the extension:

Usage

> python depedit.py -c config_file.ini INPUT.conll10 > OUTPUT.conll10
> python depedit.py -c config_file.ini *.conllu
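To illustrate the batch naming described above, here is a minimal sketch of how a suffix can be placed before the file extension. The helper name `suffixed_name` and its default suffix are assumptions made for this example; DepEdit's own naming logic may differ in detail.

```python
import os


def suffixed_name(path, suffix=".depedit"):
    """Insert `suffix` between the base name and the extension of `path`.

    Hypothetical helper for illustration; not part of DepEdit.
    """
    base, ext = os.path.splitext(path)
    return base + suffix + ext


for name in ["doc1.conllu", "data/doc2.conll10"]:
    print(name, "->", suffixed_name(name))
```

With this scheme the input file is never overwritten, since the output name always differs from the input name.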

Configuration files are text files with one instruction per line and optional blank lines and comments (beginning with ';' or '#'). Each instruction contains 3 tab-separated columns, as in the following example:

config_file.ini

;Connect nouns to a preceding article or possessive pronoun with the 'det' function
pos=/DT\|PRP\$/;pos=/NNS?/	#1.#2	#2>#1;#1:func=det

;Change to-infinitive from aux to mark
text=/^[Tt]o$/&func=/aux/	none	#1:func=mark
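The layout of such a file can be sketched in a few lines of Python. This is an illustration only, not DepEdit's own reader: the function name `parse_config` is hypothetical, and the sketch assumes that columns are separated by tabs and that a line starting with '#S:' is a rule rather than a comment.

```python
def parse_config(text):
    """Split configuration text into (nodes, relations, actions) triples.

    Illustrative sketch: assumes tab-separated columns and treats lines
    starting with ';' or '#' as comments, except for '#S:' sentence
    annotation constraints.
    """
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(";"):
            continue
        if line.startswith("#") and not line.startswith("#S:"):
            continue
        columns = line.split("\t")
        if len(columns) != 3:
            raise ValueError("expected 3 tab-separated columns: %r" % line)
        rules.append(tuple(columns))
    return rules


config = (
    ";Change to-infinitive from aux to mark\n"
    "\n"
    "text=/^[Tt]o$/&func=/aux/\tnone\t#1:func=mark\n"
)
print(parse_config(config))
```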

Column 1: node definitions

The first column describes the tokens to be matched using regular expressions.

Constraints are given as regular expressions over the fields:
- num (column 1 of CoNLL format)
- text (column 2)
- lemma (column 3)
- pos (column 4, alias upos / upostag)
- cpos (column 5, alias xpos / xpostag)
- morph (column 6, alias feats)
- head (column 7) – this is the literal parent token’s ID number. Mostly useful when matching roots (head=/0/)
- func (dependency function, column 8, alias deprel)
- head2 (secondary head, for enhanced trees, alias deps)
- func2 (secondary function, for enhanced trees, alias misc)
- position – this is a special constraint which does not correspond to any column, but indicates the token’s position in the sentence. Possible values: first, last, and mid, matching the first token, last token, or neither first nor last respectively
Multiple tokens are separated by ';'
You can specify multiple criteria using '&', as in the second rule
You may specify negative criteria using !=, e.g. lemma!=/able/
Constraints on sentence annotations are applied like this: #S:s_type=/decl/. Note that the operator to use with such definitions is ‘>’ (see below).
You can use capturing groups in parentheses, which will be referenceable in the actions (third) column as $1, etc.
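As a rough sketch of how a node definition could be checked against a single token, the following code represents a token as a dictionary of field names to strings. The function `matches` is hypothetical and is not DepEdit's implementation; in particular, it assumes that a regular expression must match the whole field value, and it does not handle a '&' inside a regular expression.

```python
import re

# field name, operator (= or !=), and the regex between slashes
CRITERION = re.compile(r"^(\w+)(!?=)/(.*)/$")


def matches(definition, token):
    """Return True if `token` satisfies every '&'-joined criterion.

    Hypothetical sketch: assumes full-match semantics for each regex.
    """
    for criterion in definition.split("&"):
        parsed = CRITERION.match(criterion)
        if parsed is None:
            raise ValueError("cannot parse criterion: %r" % criterion)
        field, operator, pattern = parsed.groups()
        found = re.fullmatch(pattern, token[field]) is not None
        # '=' requires a match, '!=' requires the absence of one
        if found != (operator == "="):
            return False
    return True


token = {"text": "to", "lemma": "to", "pos": "TO", "func": "aux"}
print(matches("text=/^[Tt]o$/&func=/aux/", token))  # True
print(matches("lemma!=/able/", token))               # True
print(matches("func=/mark/", token))                 # False
```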

Column 2: relation definitions

The middle column defines relationships between tokens. It refers to each token in the definition by number
(#1, #2...) and specifies:

Adjacency (.): #1.#2 means the first token in column 1 is followed by the second
Distance (.n or .n,m): #1.4#2 means 4 tokens distance, and #1.1,4#2 means a distance of 1-4. You can also use the shorthand #1.*#2 (indirect precedence, which is the same as #1.1,1000#2).
Parentage (>): #1>#2 means the first token in column 1 is the head of the second token. This operator is also used for sentence annotations (#1>#2, where #1 is a sentence annotation and #2 is a token).
Column identity (field==): in addition to a distance/parentage constraint, two nodes may also specify value identity constraints. For example, #1:text==#2 means that #1 and #2 must have exactly the same text (replace ‘text’ with other fields as needed)
If the instruction refers to only one token, as in the second example, the middle column says 'none'.
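The adjacency, distance and parentage operators can be sketched as simple checks over token positions and head IDs. The function `relation_holds` below is a hypothetical illustration, not DepEdit's code; it assumes that adjacency corresponds to a distance of exactly 1, and it leaves out the field== identity constraints.

```python
import re

# '#a>#b', '#a.#b', '#a.n#b', '#a.n,m#b' or '#a.*#b'
RELATION = re.compile(r"^#(\d+)(>|\.(?:\*|\d+(?:,\d+)?)?)#(\d+)$")


def relation_holds(relation, nodes):
    """Check one relation such as '#1.#2', '#1.1,4#2' or '#1>#2'.

    `nodes` maps a node number to a dict with integer 'id' and 'head' values.
    Hypothetical sketch; field== identity constraints are not handled.
    """
    parsed = RELATION.match(relation)
    if parsed is None:
        raise ValueError("cannot parse relation: %r" % relation)
    first = nodes[int(parsed.group(1))]
    second = nodes[int(parsed.group(3))]
    operator = parsed.group(2)
    if operator == ">":
        return second["head"] == first["id"]
    spec = operator[1:]  # whatever follows the '.'
    if spec == "":
        low, high = 1, 1          # plain adjacency
    elif spec == "*":
        low, high = 1, 1000       # shorthand for .1,1000
    elif "," in spec:
        low, high = (int(part) for part in spec.split(","))
    else:
        low = high = int(spec)
    return low <= second["id"] - first["id"] <= high


# "the old dog barks": 'the' (token 1) depends on 'dog' (token 3)
nodes = {1: {"id": 1, "head": 3}, 2: {"id": 3, "head": 4}}
print(relation_holds("#1.#2", nodes))     # False, 'old' intervenes
print(relation_holds("#1.1,4#2", nodes))  # True
print(relation_holds("#2>#1", nodes))     # True
```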

Column 3: action definitions

The third column specifies what to do if a rule matches:

Change a property of token:
- text
- part of speech
- lemma
- dependency function
- morphological analysis
Make some token in the definitions the head of another: #1>#2
Add a sentence annotation with the special pointer #S:
- #S:new_anno_name=somevalue
You can refer back to values in capturing groups from the first column by using the number of that group, e.g. $1:
- text=/(.*)/&pos=/IN/ … #1:func=prep_$1
You can also convert the contents of $1, $2 etc. to lower or upper case by using $1L (the contents of $1, in lower case), or $1U (for upper case)
You can use an equals sign (‘=’) in the actions column, so the following works as expected (only the first ‘=’ separates the key and value):
- pos=/NEG/ … #1:morph=Polarity=Negative
The special instruction ‘last’ makes this rule the last rule to apply to a sentence if it is matched, e.g. the following means ‘set the lemma to NONE and stop processing this sentence’:
- #1:lemma=NONE;last
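Two of the conventions above, splitting an action on its first '=' only and filling in $1, $1L and $1U, can be sketched as follows. The helper `fill_groups` is hypothetical and only illustrates the notation; it is not taken from DepEdit.

```python
import re


def fill_groups(value, groups):
    """Replace $1, $1L and $1U in an action value with captured text.

    `groups` holds the strings captured in the first column.
    Hypothetical helper for illustration only.
    """
    def substitute(match):
        text = groups[int(match.group(1)) - 1]
        case = match.group(2)
        if case == "L":
            return text.lower()
        if case == "U":
            return text.upper()
        return text

    return re.sub(r"\$(\d+)([LU]?)", substitute, value)


# Only the first '=' separates the key from the value
target, _, assignment = "#1:morph=Polarity=Negative".partition(":")
key, _, value = assignment.partition("=")
print(target, key, value)  # #1 morph Polarity=Negative

captured = re.fullmatch(r"(.*)", "Of").groups()
print(fill_groups("prep_$1", captured))   # prep_Of
print(fill_groups("prep_$1L", captured))  # prep_of
print(fill_groups("prep_$1U", captured))  # prep_OF
```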

Variables

It is also possible to define variables for frequently used (parts of) regular expressions.

Variables can be declared at the beginning of the configuration file (before rules are listed), and named using the notation:
- {varname}=/regex/
For example, suppose you want to make a rule depend on the animacy of a head noun or pronoun, and you have a long list of nouns known to represent humans (just a few are given in this example), which you can encode using a variable named 'person':
- {person}=/I|you|s?he|people|friend|child/
You can then use this variable within subsequent DepEdit rules:
- pos=/V.*/;lemma=/{person}/&func=/obj/	#1>#2	#2:misc=AnimObj
- pos=/V.*/;lemma=/{person}/&func=/nsubj/	#1>#2	#2:misc=AnimSubj
You can use multiple variables within the same rule, and inside the same key value, combined with normal text, e.g. lemma=/{var1}abc{var2}/.
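Variable handling can be pictured as a textual substitution performed before the rules are interpreted. The sketch below makes that assumption explicit: the function `expand_variables` is hypothetical, and it inserts the declared regular expression verbatim wherever {varname} appears, which may or may not be exactly how DepEdit does it.

```python
import re

# A declaration line of the form {varname}=/regex/
VARIABLE = re.compile(r"^\{(\w+)\}=/(.*)/$")


def expand_variables(lines):
    """Collect {name}=/regex/ declarations and substitute them into rules.

    Hypothetical sketch: plain textual replacement, no extra grouping.
    """
    variables = {}
    expanded = []
    for line in lines:
        declaration = VARIABLE.match(line)
        if declaration:
            variables[declaration.group(1)] = declaration.group(2)
            continue
        for name, regex in variables.items():
            line = line.replace("{" + name + "}", regex)
        expanded.append(line)
    return expanded


lines = [
    "{person}=/I|you|s?he|people|friend|child/",
    "pos=/V.*/;lemma=/{person}/&func=/obj/\t#1>#2\t#2:misc=AnimObj",
]
for rule in expand_variables(lines):
    print(rule)
```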

Importing as a module

To import DepEdit into an existing project, you may want to install depedit as a module, rather than including depedit.py in your own codebase. You can install from PyPI via pip:

Installation

> pip install depedit

In your project, create a DepEdit object from a file handle for the configuration file, then call its run_depedit method with a file handle for the input file, or with a string representation of the dependency data to iterate over:

some_program.py

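from depedit import DepEdit

# Open the dependency data and the configuration; both handles are closed on exit
with open("path/to/infile.txt") as infile, open("path/to/config.ini") as config_file:
    d = DepEdit(config_file)        # the configuration is passed to the constructor
    result = d.run_depedit(infile)  # apply the rules to the input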

Alternatively, you can create a configuration inside your module, without reading it from a text file. There are several ways of doing this, all of which achieve the same result:

transformation_api_demo.py

from depedit import DepEdit

d = DepEdit()

##############################
# Ways to add transformations:
##############################

# From a single string per instruction
d.add_transformation("pos=/V/\tnone\t#1:func=x")

# From args
d.add_transformation("pos=/V/\tnone\t#1:func=z", "pos=/V/\tnone\t#1:func=y")

# From a list
d.add_transformation(["pos=/V/\tnone\t#1:func=a", "pos=/V/\tnone\t#1:func=b"])

# From a dictionary
d.add_transformation({"nodes": "pos=/V/", "rels": "none", "actions": "#1:pos=a"})

Citing

If you are using DepEdit in a scholarly paper, please cite the following reference:

 @InProceedings{PengZeldes2018,
   author    = {Siyao Peng and Amir Zeldes},
   title     = {All Roads Lead to {UD}: Converting {S}tanford and {P}enn Parses 
       to {E}nglish {U}niversal {D}ependencies with Multilayer Annotations},
   booktitle = {Proceedings of the Joint Workshop on Linguistic Annotation, 
       Multiword Expressions and Constructions ({LAW}-{MWE}-{C}x{G}-2018)},
   year      = {2018},
   pages     = {167--177},
   address   = {Santa Fe, NM},
   url       = {https://www.aclweb.org/anthology/W18-4918}
 }