DepEdit

The depedit module offers a powerful tool for manipulating input dependency trees before they are analyzed by the system. It is also available as a standalone script at: http://gucorpling.org/depedit/. For more detailed documentation and advanced DepEdit rules, see the documentation on the website.

The manipulations carried out by the script are determined by the file depedit.ini in the model directory. This file may be left empty or omitted if depedit should not be used: depedit takes a moment to run, so skipping it is an option when very high throughput on large numbers of documents is required. Each rule in the depedit file is a tab-delimited triple (token definitions, relations between the tokens, and actions), where each part consists of semicolon-delimited instructions in this configuration:

feature1=/value/;feature2=/value/...  #1op#2;#2op#3...        #1:feature3=value;...#1>#2;...

Where features include:

  • text - the token text
  • lemma - the token’s lemma
  • upos - the token’s POS tag (column 4)
  • xpos - the token’s other POS tag (column 5)
  • morph - morphological information (column 6)
  • func - the dependency label
  • head - the parent feature, only used if specifically targeting roots (head=0)

And operators op include:

  • > - dominance (#1>#2 means #1 is the head of #2)
  • . - adjacency (#1.#2 means #1 precedes #2)
  • .n - distance (#1.3#2 means #1 is three tokens before #2)
  • .n,m - distance range (#1.3,5#2 means #1 is three to five tokens before #2)
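
To make the rule layout and the operators above concrete, here is a minimal Python sketch. It is not DepEdit's own parser or matcher: the helper names parse_rule, dominates and precedes are hypothetical, tokens are plain dictionaries, and distance is assumed to be the difference between token positions.

```python
def parse_rule(line):
    """Split one tab-delimited rule into its three semicolon-delimited parts."""
    definitions, relations, actions = line.rstrip("\n").split("\t")
    return definitions.split(";"), relations.split(";"), actions.split(";")


def dominates(a, b):
    """#a>#b: token a is the head of token b."""
    return b["head"] == a["id"]


def precedes(a, b, min_dist=1, max_dist=None):
    """#a.#b, #a.n#b or #a.n,m#b: a is min_dist to max_dist tokens before b."""
    if max_dist is None:
        max_dist = min_dist
    return min_dist <= b["id"] - a["id"] <= max_dist


# Toy sentence "It is good that she came" with 1-based ids; head 0 marks the root.
words = ["It", "is", "good", "that", "she", "came"]
heads = [3, 3, 0, 6, 6, 3]
tokens = [{"id": i + 1, "text": w, "head": h}
          for i, (w, h) in enumerate(zip(words, heads))]
it, is_, good, that, she, came = tokens

definitions, relations, actions = parse_rule(
    "func=/discourse/;func=/dep/\t#1>#2\t#2:func=discourse")
print(definitions, relations, actions)
print(dominates(good, it))        # "good" is the head of "It"
print(precedes(it, is_))          # adjacency, as in #1.#2
print(precedes(it, that, 3))      # exact distance, as in #1.3#2
print(precedes(it, came, 3, 5))   # distance range, as in #1.3,5#2
```

Splitting on every semicolon is a simplification: a regular expression that itself contains a semicolon would need more careful handling.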

Below we describe some common scenarios using depedit to preprocess data for coreference resolution; for further operators and more advanced scenarios see the full documentation on the depedit website.

Common scenarios

Fixing probable parser errors

If you know of some consistently wrong or problematic parser output in your language, you can correct it using depedit by defining the subgraph features that identify the problem and then manipulating the tree. Here’s an example:

discourse marker chains - sometimes the head of a sequence of discourse markers is recognized with the label discourse, but its dependents are labeled dep. We could decide to fix this by relabeling all dep dependents of a discourse token as further cases of discourse. If we then rule out discourse as a possible markable head label, we can get rid of the whole chain, regardless of POS tags.

# Dep of discourse is also discourse
func=/discourse/;func=/dep/     #1>#2   #2:func=discourse
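
The effect of this rule on a small tree can be imitated in plain Python. This is only a sketch of the intended outcome, not DepEdit's matching engine: the function name relabel_dep_of_discourse and the dictionary-based tokens are hypothetical, and the patterns are assumed to match the whole label.

```python
import re


def relabel_dep_of_discourse(tokens):
    """Imitate: func=/discourse/;func=/dep/  #1>#2  #2:func=discourse"""
    by_id = {t["id"]: t for t in tokens}
    # Collect all matches first, then apply the action to each #2.
    matches = [
        t for t in tokens
        if re.fullmatch("dep", t["func"])
        and t["head"] in by_id
        and re.fullmatch("discourse", by_id[t["head"]]["func"])
    ]
    for tok in matches:
        tok["func"] = "discourse"
    return tokens


# Toy parse of "oh well , I see": "well" is labeled dep under "oh".
sentence = [
    {"id": 1, "text": "oh", "func": "discourse", "head": 5},
    {"id": 2, "text": "well", "func": "dep", "head": 1},
    {"id": 3, "text": ",", "func": "punct", "head": 5},
    {"id": 4, "text": "I", "func": "nsubj", "head": 5},
    {"id": 5, "text": "see", "func": "root", "head": 0},
]
relabel_dep_of_discourse(sentence)
print([(t["text"], t["func"]) for t in sentence])
```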

Flagging special constructions

Even if a tree is correct, you may be interested in flagging certain constellations as special with new labels, and catching those labels later on with coref_rules.tab or some other configuration. For example:

cataphora - suppose you want to catch ‘empty’ or cataphoric pronouns, such as it in ‘it’s good that she came’. You can recognize dependency subgraphs corresponding to these and change the pronoun’s function to a new one.

text=/^[Ii]t$/&func=/nsubj/;xpos=/JJ/;func=/ccomp/      #2>#1;#2>#3     #1:func=cata

This token can then be ignored by adding the function to forbidden Mark func in config.ini, or it can be caught with a coref rule in coref_rules.tab.

Adding morphological information

Above and beyond the rudimentary morphological substitutions provided by the morph_rules setting in config.ini, depedit can assign morphological information based on complex subgraphs. For example:

possessor definiteness - suppose you want both the possessor and the possessed NP in an example such as ‘the man’s brother’ to be treated as definite. You can recognize this subgraph, attach the article to the possessed noun, and add a ‘def’ feature to the possessor in the morphology column of the dependency tree:

# Attach preceding article to the possessed noun and mark the possessor as definite
func=/det/;func=/nmod:poss/;xpos=/POS/;text=/.*/        #2>#1;#2>#3;#4>#2       #4>#1;#2:morph=def

After this rule runs, the fourth token in the definition governs the first (article is attached to the main noun), and the possessor (#2) will be marked as definite, despite the fact that the article is attached to #4.
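
The outcome described above can be checked on a toy tree with a short Python imitation. This is a sketch under stated assumptions, not DepEdit's implementation: the function name mark_definite_possessor and the dictionary-based tokens are hypothetical, and the morph value is simply overwritten with ‘def’.

```python
def mark_definite_possessor(tokens):
    """Imitate the rule above: reattach the article (#4>#1), set #2:morph=def."""
    by_id = {t["id"]: t for t in tokens}
    for poss in tokens:  # candidate for #2, the possessor
        if poss["func"] != "nmod:poss" or poss["head"] not in by_id:
            continue
        children = [t for t in tokens if t["head"] == poss["id"]]
        article = next((c for c in children if c["func"] == "det"), None)  # #1
        marker = next((c for c in children if c["xpos"] == "POS"), None)   # #3
        if article is not None and marker is not None:
            article["head"] = poss["head"]  # #4>#1
            poss["morph"] = "def"           # #2:morph=def
    return tokens


# Toy parse of "the man 's brother" before the rule runs.
phrase = [
    {"id": 1, "text": "the", "xpos": "DT", "func": "det", "head": 2, "morph": "_"},
    {"id": 2, "text": "man", "xpos": "NN", "func": "nmod:poss", "head": 4, "morph": "_"},
    {"id": 3, "text": "'s", "xpos": "POS", "func": "case", "head": 2, "morph": "_"},
    {"id": 4, "text": "brother", "xpos": "NN", "func": "root", "head": 0, "morph": "_"},
]
mark_definite_possessor(phrase)
print([(t["text"], t["head"], t["morph"]) for t in phrase])
```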

Adding automatic sentence types for classification

If you plan on building a classifier using sentence types as a feature (e.g. tense, modality etc. - see Building classifiers), you can use depedit to give an approximate analysis. Here is an example of rough sentence type annotation with depedit. Note the use of the special instruction last to stop processing further rules for the current sentence, and the assignment of sentence annotations to #S:

# Rough sentence type based on GUM schema
#WH:: what/who/how verbed?
func=/root|ROOT/;pos=/WP|WRB/;text=/\?/&position=/last/ #1>#2;#1.*#3    #S:s_type=wh;last
#WH:: how x is y?
func=/root|ROOT/;pos=/WP|WRB|WDT/;pos=/^N.*|^J.*/;text=/\?/&position=/last/     #1>#3>#2;#1.*#4 #S:s_type=wh;last
#WH:: how x does y verb z?
func=/root|ROOT/;func=/dobj|nsubj/;pos=/WP|WRB|WDT/;pos=/^N.*|^J.*/;text=/\?/&position=/last/   #1>#2>#4>#3;#1.*#5      #S:s_type=wh;last
#Q:: other sent with question mark
func=/root|ROOT/;text=/\?/&position=/last/      #1.*#2  #S:s_type=q;last
#SUB:: x might verb (has subj and modal verb)
func=/root|ROOT/;func=/nsubj/;pos=/MD/&text!=/.*ll/     #1>#2;#1>#3     #S:s_type=sub;last
#DECL:: x verbs (has subj and verb or cop)
func=/root|ROOT/&pos=/V.*/;func=/nsubj/ #1>#2   #S:s_type=decl;last
func=/root|ROOT/;func=/nsubj/;func=/cop/        #1>#2;#1>#3     #S:s_type=decl;last
#INF:: has base form verb with 'to'
func=/root|ROOT/&pos=/^V.$/;pos=/TO/    #1>#2   #S:s_type=inf;last
#IMP:: has base form verb, and no subject (otherwise decl rule should have matched)
func=/root|ROOT/&pos=/^V.$/     none    #S:s_type=imp;last
#GER:: has gerund form verb, and no subject (otherwise decl rule should have matched)
func=/root|ROOT/&pos=/^V.G$/    none    #S:s_type=ger;last
#INTJ:: has interjection as root
func=/root|ROOT/&pos=/^UH$/     none    #S:s_type=intj;last
#FRAG:: has noun, preposition or pronoun as root
pos=/N.*|IN|PR?P/&func=/root|ROOT/      none    #S:s_type=frag;last
#DECL:: default for any sentence not matched by an earlier rule
text=/.*/       none    #S:s_type=decl;last
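
Because every rule ends in last, the rule file behaves like an ordered cascade in which the first matching rule decides the sentence type. The Python sketch below imitates that behavior for a reduced subset of the rules (wh, q, decl, imp and the final default). It is an illustration only: the names sentence_type and toy are hypothetical, tokens are plain dictionaries, and the patterns are assumed to match the whole tag.

```python
import re


def sentence_type(tokens):
    """Return the first matching type, like ordered rules that end in `last`."""
    root = next((t for t in tokens if t["func"] in ("root", "ROOT")), None)
    if root is None:
        return "decl"  # same default as the catch-all rule
    children = [t for t in tokens if t["head"] == root["id"]]
    final = tokens[-1]
    question = final["text"] == "?" and root["id"] < final["id"]
    if question and any(c["pos"] in ("WP", "WRB") for c in children):
        return "wh"
    if question:
        return "q"
    has_subject = any(c["func"] == "nsubj" for c in children)
    if has_subject and re.fullmatch(r"V.*", root["pos"]):
        return "decl"
    if re.fullmatch(r"V.", root["pos"]):
        return "imp"
    return "decl"


def toy(words):
    """Build tokens from (text, pos, func, head) tuples."""
    return [{"id": i + 1, "text": w, "pos": p, "func": f, "head": h}
            for i, (w, p, f, h) in enumerate(words)]


who_came = toy([("Who", "WP", "nsubj", 2), ("came", "VBD", "root", 0),
                ("?", ".", "punct", 2)])
did_she = toy([("Did", "VBD", "aux", 3), ("she", "PRP", "nsubj", 3),
               ("come", "VB", "root", 0), ("?", ".", "punct", 3)])
she_came = toy([("She", "PRP", "nsubj", 2), ("came", "VBD", "root", 0),
                (".", ".", "punct", 2)])
go_home = toy([("Go", "VB", "root", 0), ("home", "RB", "advmod", 1),
               ("!", ".", "punct", 1)])

for sent in (who_came, did_she, she_came, go_home):
    print(" ".join(t["text"] for t in sent), "->", sentence_type(sent))
```

The order of the checks matters in the same way as the order of the rules: the imperative test is only reached when the declarative test with a subject has already failed.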