DepEdit

A simple configurable tool for manipulating dependency trees



Overview

DepEdit reads and writes files encoded in the CoNLL dependency format (10 columns). It's a simple Python script which can:

  • Change token attributes:
    • token text
    • part of speech
    • dependency function
    • other columns (typically used for lemma, morphology)
  • Connect different tokens in the tree by setting their head feature
  • Base its decisions on dependency sub-graphs or token distance
  • Use external configuration files for different scenarios
  • Stay language- and schema-neutral: no language- or schema-specific details are hardwired into the system

You can also import it into your projects as a preprocessing module.


For detailed instructions please see the User Guide


What is it good for?

Here are some example scenarios in which DepEdit can be helpful:

  • You want to encode 'want to..' as an auxiliary, but your corpus has it as a main verb
  • An NLP component you use expects a slightly different tree
  • You discover your annotators treated a fixed expression like 'the world *over*' inconsistently
  • There is no parser for the language you are annotating, but you want to annotate some trivial cases automatically
  • The latest parser has a new schema, but your old data is in an older schema with subtle differences
  • You need to build a very rudimentary rule based parser (not exactly recommended, but somewhat possible)

How can I use it?

DepEdit is a self-contained Python script that is compatible with Python 2.X and 3.X and only needs a configuration file to run.

You can download the script itself (just depedit.py, no installation needed), or optionally install it via pip and run it as a module in your project (see below). On the command line, you can process a single file, or pass a glob pattern (e.g. *.conll10), in which case output files are created with a configurable suffix such as '.depedit' before the extension:

Usage

> python depedit.py -c config_file.ini INPUT.conll10 > OUTPUT.conll10
> python depedit.py -c config_file.ini *.conllu
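As a rough illustration of the naming scheme for glob runs, the following sketch places a suffix before the file extension. The helper name `suffixed_name` is hypothetical and not part of DepEdit; it only mirrors the behavior described above.

```python
import os

def suffixed_name(path, suffix=".depedit"):
    """Insert `suffix` between the base name and the file extension."""
    base, ext = os.path.splitext(path)
    return base + suffix + ext

print(suffixed_name("INPUT.conll10"))  # INPUT.depedit.conll10
```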

Configuration files are text files with one instruction per line and optional blank lines and comments (beginning with ';' or '#'). Each instruction contains 3 tab-separated columns, as in the following example:

config_file.ini

;Connect nouns to a preceding article or possessive pronoun with the 'det' function
pos=/DT|PRP\$/;pos=/NNS?/	#1.#2	#2>#1;#1:func=det

;Change to-infinitive from aux to mark
text=/^[Tt]o$/&func=/aux/	none	#1:func=mark
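To make the file layout concrete, here is a minimal sketch of how such a configuration could be read into three-column rules. It is not DepEdit's own parser: the function name `parse_config` is hypothetical, the tab separator is assumed from the module examples further down, and lines starting with '#S:' are assumed to be rules rather than comments. Variable declarations are not handled.

```python
def parse_config(text):
    """Return a list of (nodes, relations, actions) tuples, one per rule."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and comments (';' or '#'), but keep '#S:' rules
        # (assumption: sentence-annotation rules are not comments)
        is_comment = line.startswith(";") or (
            line.startswith("#") and not line.startswith("#S:"))
        if not line or is_comment:
            continue
        columns = line.split("\t")
        if len(columns) != 3:
            raise ValueError("expected 3 tab-separated columns: " + line)
        rules.append(tuple(columns))
    return rules

config = (
    ";Change to-infinitive from aux to mark\n"
    "text=/^[Tt]o$/&func=/aux/\tnone\t#1:func=mark\n"
    "\n"
    "pos=/DT|PRP\\$/;pos=/NNS?/\t#1.#2\t#2>#1;#1:func=det\n"
)
for nodes, rels, actions in parse_config(config):
    print(nodes, "|", rels, "|", actions)
```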

Column 1: node definitions

The first column describes the tokens to be matched using regular expressions.

  • Constraints are given as regular expressions over the fields:
    • num (column 1 of CoNLL format)
    • text (column 2)
    • lemma (column 3)
    • pos (column 4, alias upos / upostag)
    • cpos (column 5, alias xpos / xpostag)
    • morph (column 6, alias feats)
    • head (column 7) – this is the literal parent token’s ID number. Mostly useful when matching roots (head=/0/)
    • func (dependency function, column 8, alias deprel)
    • head2 (secondary head, for enhanced trees, alias deps)
    • func2 (secondary function, for enhanced trees, alias misc)
    • position – this is a special constraint which does not correspond to any column, but indicates the token’s position in the sentence. Possible values: first, last, and mid, matching the first token, last token, or neither first nor last respectively
  • Multiple tokens are separated by ';'
  • You can specify multiple criteria using '&', as in the second rule
  • You may specify negative criteria using !=, e.g. lemma!=/able/
  • Constraints on sentence annotations are applied like this: #S:s_type=/decl/. Note that the operator to use with such definitions is ‘>’ (see below).
  • You can use capturing groups in parentheses, which will be referenceable in the actions (third) column as $1, etc.
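The field constraints above can be pictured with a small sketch that checks one node definition against one token. This is an illustration only, not DepEdit's matcher: the helper names are hypothetical, head2 and func2 are assumed to sit in columns 9 and 10, and patterns are assumed to match the whole field value (the text above does not say whether matching is anchored). It needs Python 3.4+ for `re.fullmatch`.

```python
import re

# Field names in CoNLL column order (columns 9 and 10 assumed for head2/func2)
FIELDS = ["num", "text", "lemma", "pos", "cpos", "morph",
          "head", "func", "head2", "func2"]
# One criterion: field name, '=' or '!=', and a /regex/ up to the next '&' or end
CRITERION = re.compile(r"(\w+)(!?=)/(.*?)/(?=&|$)")

def token_from_line(line):
    """Turn one tab-separated 10-column CoNLL line into a field dict."""
    return dict(zip(FIELDS, line.rstrip("\n").split("\t")))

def node_matches(definition, token):
    """Check every '&'-joined criterion of one node definition."""
    for field, op, pattern in CRITERION.findall(definition):
        found = re.fullmatch(pattern, token[field]) is not None
        # '=' requires a match, '!=' requires that there is none
        if found != (op == "="):
            return False
    return True

tok = token_from_line("4\tto\tto\tPART\tTO\t_\t5\taux\t_\t_")
print(node_matches("text=/^[Tt]o$/&func=/aux/", tok))  # True
print(node_matches("lemma!=/able/", tok))              # True
```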

Column 2: relation definitions

The middle column defines relationships between tokens. It refers to each token in the definition by number (#1, #2...) and specifies:

  • Adjacency (.): #1.#2 means the first token in column 1 is followed by the second
  • Distance (.n or .n,m): #1.4#2 means 4 tokens distance, and #1.1,4#2 means a distance of 1-4. You can also use the shorthand #1.*#2 (indirect precedence, which is the same as #1.1,1000#2).
  • Parentage (>): #1>#2 means the first token in column 1 is the head of the second token. This operator is also used for sentence annotations (#1>#2, where #1 is a sentence annotation and #2 is a token).
  • Column identity (field==): in addition to a distance/parentage constraint, two nodes may also specify value identity constraints. For example, #1:text==#2 means that #1 and #2 must have exactly the same text (replace ‘text’ with other fields as needed)
  • If the instruction refers to only one token, as in the second example, the middle column says 'none'.
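The adjacency and distance operators can be read as a minimum and maximum offset between two token positions. The sketch below spells that reading out; the function names are hypothetical, plain adjacency is assumed to mean a distance of exactly 1, and `positions` is an assumed mapping from node number to token index in the sentence.

```python
import re

# '#1.#2', '#1.4#2', '#1.1,4#2' or '#1.*#2'
DIST_RE = re.compile(r"#(\d+)\.(\*|\d+(?:,\d+)?)?#(\d+)$")

def parse_distance(rel):
    """Return (first, second, min_dist, max_dist) for a '.' relation."""
    m = DIST_RE.match(rel)
    if m is None:
        raise ValueError("not a distance relation: " + rel)
    first, spec, second = int(m.group(1)), m.group(2), int(m.group(3))
    if spec is None:        # '#1.#2': adjacency (assumed distance 1)
        lo = hi = 1
    elif spec == "*":       # shorthand for 1,1000
        lo, hi = 1, 1000
    elif "," in spec:       # explicit range n,m
        lo, hi = (int(x) for x in spec.split(","))
    else:                   # exact distance n
        lo = hi = int(spec)
    return first, second, lo, hi

def distance_holds(rel, positions):
    """Check the relation against a {node number: token index} mapping."""
    first, second, lo, hi = parse_distance(rel)
    return lo <= positions[second] - positions[first] <= hi

print(distance_holds("#1.1,4#2", {1: 2, 2: 5}))  # True
print(distance_holds("#1.#2", {1: 2, 2: 5}))     # False
```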

Column 3: action definitions

The third column specifies what to do if a rule matches:

  • Change a property of a token:
    • text
    • part of speech
    • lemma
    • dependency function
    • morphological analysis
  • Make some token in the definitions the head of another: #1>#2
  • Add a sentence annotation with the special pointer #S:
    • #S:new_anno_name=somevalue
  • You can refer back to values in capturing groups from the first column by using the number of that group, e.g. $1:
    • text=/(.*)/&pos=/IN/ … #1:func=prep_$1
  • You can also convert the contents of $1, $2 etc. to lower or upper case by using $1L (the contents of $1, in lower case), or $1U (for upper case)
  • You can use an equals sign (‘=’) in the actions column, so the following works as expected (only the first ‘=’ separates the key and value):
    • pos=/NEG/ … #1:morph=Polarity=Negative
  • The special instruction ‘last’ makes this rule the last rule to apply to a sentence if it is matched, e.g. the following means ‘set the lemma to NONE and stop processing this sentence’:
    • #1:lemma=NONE;last
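Two of the points above lend themselves to a short sketch: splitting an assignment on the first '=' only, and expanding $1, $1L and $1U from captured groups. The helper names are hypothetical and this is not DepEdit's code; head-setting actions such as #1>#2 and the 'last' instruction are not covered. It needs Python 3.4+ for `re.fullmatch`.

```python
import re

def parse_action(action):
    """Split '#1:morph=Polarity=Negative' into ('#1', 'morph', 'Polarity=Negative')."""
    target, assignment = action.split(":", 1)
    key, value = assignment.split("=", 1)  # only the first '=' separates key and value
    return target, key, value

def expand_groups(value, groups):
    """Replace $1, $1L, $1U with the captured text (as is, lower, upper case)."""
    def repl(m):
        text = groups[int(m.group(1)) - 1]
        if m.group(2) == "L":
            return text.lower()
        if m.group(2) == "U":
            return text.upper()
        return text
    return re.sub(r"\$(\d+)([LU]?)", repl, value)

# A token with text 'Of' matched by text=/(.*)/ yields one captured group
captured = re.fullmatch(r"(.*)", "Of").groups()
target, key, value = parse_action("#1:func=prep_$1L")
print(target, key, expand_groups(value, captured))  # #1 func prep_of
```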

Variables

It is also possible to define variables for frequently used (parts of) regular expressions.

  • Variables can be declared at the beginning of the configuration file (before rules are listed), and named using the notation:
    • {varname}=/regex/
  • For example, suppose you want to make a rule depend on the animacy of a head noun or pronoun, and you have a long list of nouns known to represent humans (just a few are given in this example), which you can encode using a variable named 'person':
    • {person}=/I|you|s?he|people|friend|child/
  • You can then use this variable within subsequent DepEdit rules:
    • pos=/V.*/;lemma=/{person}/&func=/obj/	#1>#2	#2:misc=AnimObj
    • pos=/V.*/;lemma=/{person}/&func=/nsubj/	#1>#2	#2:misc=AnimSubj
  • You can use multiple variables within the same rule, and inside the same key value, combined with normal text, e.g. lemma=/{var1}abc{var2}/.
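One way to picture variables is as text substitution performed before the rules are compiled. The sketch below does exactly that and nothing more; the function name is hypothetical, and whether DepEdit wraps the inserted pattern in a group or substitutes it verbatim is not stated here, so plain replacement is an assumption.

```python
import re

# A declaration line of the form {varname}=/regex/
DECLARATION = re.compile(r"^\{(\w+)\}=/(.*)/$")

def expand_variables(lines):
    """Collect {name}=/regex/ declarations and substitute them into later lines."""
    variables, rules = {}, []
    for line in lines:
        m = DECLARATION.match(line)
        if m:
            variables[m.group(1)] = m.group(2)
            continue
        for name, pattern in variables.items():
            line = line.replace("{" + name + "}", pattern)
        rules.append(line)
    return rules

lines = [
    "{person}=/I|you|s?he|people|friend|child/",
    "pos=/V.*/;lemma=/{person}/&func=/obj/\t#1>#2\t#2:misc=AnimObj",
]
print(expand_variables(lines)[0])
```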

Importing as a module

To import DepEdit into an existing project, you may want to install depedit as a module, rather than including depedit.py in your own codebase. You can install from PyPI via pip:

Installation

> pip install depedit

In your project, create a DepEdit object from a file handle for the configuration file, then call its run_depedit method, which expects a file handle for the input file or a string representation of the dependency data to iterate over:

some_program.py

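from depedit import DepEdit

# Open the configuration and the input data; both handles are closed afterwards
with open("path/to/config.ini") as config_file, open("path/to/infile.txt") as infile:
    d = DepEdit(config_file)        # load the rules from the configuration
    result = d.run_depedit(infile)  # apply the rules to the input data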

Alternatively, you can also create a configuration inside your module, without reading it from a text file. There are several ways of doing this, which all achieve the same result:

transformation_api_demo.py

from depedit import DepEdit
d = DepEdit()
 
##############################
# Ways to add transformations:
##############################
 
# From a single string per instruction
d.add_transformation("pos=/V/\tnone\t#1:func=x")
# From args
d.add_transformation("pos=/V/\tnone\t#1:func=z","pos=/V/\tnone\t#1:func=y")
# From a list
d.add_transformation(["pos=/V/\tnone\t#1:func=a","pos=/V/\tnone\t#1:func=b"])
# From a dictionary
d.add_transformation({"nodes":"pos=/V/","rels": "none","actions":"#1:pos=a"})

Citing

If you are using DepEdit in a scholarly paper, please cite the following reference:

 @InProceedings{PengZeldes2020,
   author    = {Siyao Peng and Amir Zeldes},
   title     = {All Roads Lead to {UD}: Converting {S}tanford and {P}enn Parses 
       to {E}nglish {U}niversal {D}ependencies with Multilayer Annotations},
   booktitle = {Proceedings of the Joint Workshop on Linguistic Annotation, 
       Multiword Expressions and Constructions ({LAW}-{MWE}-{C}x{G}-2018)},
   year      = {2018},
   pages     = {167--177},
   address   = {Santa Fe, NM},
   url       = {https://www.aclweb.org/anthology/W18-4918}
 }

© 2015-2021 Amir Zeldes. Code released under the Apache 2.0 License.