config.ini

The main configuration file for a language model is config.ini (either in the model directory, or inside the compressed model archive). It contains several kinds of settings:

  • regex - regular expressions, always stand between two slashes, e.g. property=/[Vv]alue/
  • literal - exact values, have no delimiters, e.g. property=value
  • boolean - True or False
  • special - complex settings with special formats, typically containing arrays of settings with multiple separators (see individual entries below)
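
As a rough illustration of the first three kinds, the following sketch classifies a raw setting value by its delimiters. The function name parse_setting is hypothetical and is not part of xrenner; special settings are not handled and simply fall through as literal strings.

```python
import re

def parse_setting(raw):
    """Interpret a raw config value as regex, boolean or literal (illustrative only)."""
    value = raw.strip()
    # regex: the value stands between two slashes
    if len(value) >= 2 and value.startswith("/") and value.endswith("/"):
        return re.compile(value[1:-1])
    # boolean: exactly True or False
    if value in ("True", "False"):
        return value == "True"
    # literal: exact value, no delimiters
    return value

print(parse_setting("/[Vv]alue/").pattern)
print(parse_setting("True"))
print(parse_setting("value"))
```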

The config.ini file has several parts, outlined below.

General labels and settings

This section defines the main dependency label functions that are interpreted by xrenner in special ways. Functions such as subjecthood, apposition, coordination or possession have special meanings for the workflow (for example, possession is checked against the hasa table). This section also determines the thousands and decimal separators of the language, which are used for cardinality recognition.

### General Function Label Settings ###
subject_func=/^nsubj.*/
apposition_func=/appos/
possessive_func=/^nmod:poss|nmod_of$/
definite_possessive_func=/^nmod:poss$/
# Function for the second conjunct, NOT the conjunction; used to propagate function from first conjunct
conjunct_func=/^conj$/
# Dependency function identifying a coordinating conjunction
coord_func=/^cc$/
# Whether cc attaches left to right (from the first coordinate, e.g. Stanford Dependencies) or right to left (from second coordinate, e.g. Universal Dependencies)
cc_left_to_right=False
# Dependency function for determiners (if this function also matches functions in mod_func it will NOT be used to rule out incompatible modifiers)
det_func=/^(det.*|nummod)$/
# Dependency function indicating negation
neg_func=/^advmod_neg$/
# Decimal point separator in this language, e.g. period or comma
decimal_sep=.
# Thousands separator in this language, e.g. comma or period
thousand_sep=,
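
To illustrate what the two separator settings express, here is a minimal sketch that turns a cardinal number string into a float. The helper parse_cardinal is hypothetical; how xrenner itself applies decimal_sep and thousand_sep during cardinality recognition is not shown here.

```python
def parse_cardinal(text, decimal_sep=".", thousand_sep=","):
    """Convert a number string to float using language-specific separators."""
    # Drop thousands separators first, then normalize the decimal separator
    cleaned = text.replace(thousand_sep, "").replace(decimal_sep, ".")
    return float(cleaned)

# English-style settings, as in the listing above
print(parse_cardinal("1,234.5"))
# A language with the roles reversed (decimal_sep=, and thousand_sep=.)
print(parse_cardinal("1.234,5", decimal_sep=",", thousand_sep="."))
```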

Markable detection

Settings in this section are used to determine which tokens should be candidates for inclusion in a markable (i.e. an entity mention). These include settings for part of speech tags that indicate being a markable head, dependency function labels of descendants of that head to include in the markable span, and more.

# POS categories to consider for markable heads
mark_head_pos=/^(NN?P?S?|PRP\$|PP\$|NOUN|PROPN)$/
# POS categories to consider for markable head only if dependency function is not (!) ...
# This is NOT a regular expression
pos_func_heads=CD!amod;CD!nummod;CD!compound;CD!dep;CD!discourse;DT+nsubj;DT+nsubj:pass;DT+obj;DT+obl;DT+nmod;DT+obl_in;DT+obl_with;DT+obl_to;DT+obl_at;VB+cata;VV+cata;PRP!obl:npmod;PP!obl:npmod;$!dep;
# POS categories for verbs, only considered if seeking verbal antecedents for definite NPs with no antecedent is switched on
verb_head_pos=/^(V[BV][PZD]?)$/
# Dependency functions that rule out being a markable head
mark_forbidden_func=/amod|mark|punct|discourse|obl_as|nmod_as|flat|nummod|goeswith/
# Dependency functions that terminate markable sequence propagation
non_link_func=/^(aux|aux:pass|expl|nsubj|nsubj:pass|cop|dep|punct|appos|discourse|parataxis|orphan|advmod_neg|case|mark|csubj|csubj:pass|cc|list|nmod:tmod|obl.*|attribution)$/
# Token values that terminate markable sequence propagation
non_link_tok=/[Ww]hen|[Ii]f|[Aa]lso|[Tt]hen/
# Dependency functions for which stoplist is checked to rule out spurious markables and markable extensions are forbidden
stop_func=/amod|compound|xcomp|fixed|dep/
# Postprocess parser input to ensure tokens inside known entities depend on the entity head?
postprocess_parser=True
# Func substitutions - replace function depending on POS and direction of head - NOT a regular expression
func_substitute_backward=NN/dep/appos/;NNS/dep/appos;NNP/dep/appos;NNPS/dep/appos;NP/dep/appos;NPS/dep/appos;CD/dep/appos
func_substitute_forward=NN/dep/compound/;NNS/dep/compound;NNP/dep/compound;NNPS/dep/compound;NP/dep/compound;NPS/dep/compound;CD/dep/nummod
# Affix tokens to strip from markable core text
core_prefixes=/^ ?([Tt]h([oe](se)?|is|at)|an?|some|all|many|[Mm]y|[Yy]our|[Hh]is|[Hh]er|[Oo]ur|[Tt]heir|[Ii]ts)( |$)/
core_suffixes=/ (('s|'|[Cc]orp\.?|Inc\.?|[Cc]o\.?|&) ?)*$/
# Functions that can be stripped when determining atomicity, e.g. if 'police department' is atomic, then a modified 'rich police department' is too
non_essential_mod_func=/amod|nummod|det.*|nmod_of|advcl/
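
The effect of core_prefixes and core_suffixes can be sketched as two regex deletions applied to the markable text. The expressions are copied from the listing above; the function core_text is a hypothetical name, and applying the prefix expression before the suffix expression is an assumption.

```python
import re

# Expressions taken verbatim from the settings above (without the slashes)
CORE_PREFIXES = re.compile(
    r"^ ?([Tt]h([oe](se)?|is|at)|an?|some|all|many|[Mm]y|[Yy]our|[Hh]is|[Hh]er|[Oo]ur|[Tt]heir|[Ii]ts)( |$)"
)
CORE_SUFFIXES = re.compile(r" (('s|'|[Cc]orp\.?|Inc\.?|[Cc]o\.?|&) ?)*$")

def core_text(markable_text):
    """Strip affix tokens from a markable's text (illustrative sketch)."""
    text = CORE_PREFIXES.sub("", markable_text)   # assumed order: prefixes first
    return CORE_SUFFIXES.sub("", text)

print(core_text("the police department"))
print(core_text("Apple Inc."))
```

Note that a word such as "theory" is left untouched, because the prefix expression requires a space or the end of the string after the article.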

Entity recognition

This section contains settings related to recognizing what kind of markable we are dealing with, once a markable has been detected. This includes the NP form (common, pronoun, or proper noun), entity types such as person, and subtypes such as politician.

Special values:

  • lemma_rules: a cascade of regex replacement rules applied to generate the lemma field of each token, if not provided in the input. This can be used as a very rudimentary lemmatization engine. Rules are separated by semicolons, and each rule consists of three parts, each followed by a slash:
    • A part of speech regex, determining the POS tags to apply the rule to
    • A regular expression to find
    • The string to replace the match with
    • example: NNP?S/xes$/x/ - replaces word final xes with x, if the tag is NNPS or NNS
  • morph_rules: a cascade of regex replacement rules applied to the morph field of each token. Unlike lemma_rules, this is not POS tag dependent. Rules are separated by semicolons, and the find and replace expressions are separated by a slash. This is mainly useful for adapting the format of some morphological analysis in the input tree to the categories used in the model’s lexical data. Note that for simple, non-conditional substitutions, this is faster than using the depedit module.
    • example: 3SgM/Masc - suppose our input has categories like 3SgM for 3rd person masculine, but our lexicon has gender information like Masc. This rule adapts the input tag to the tag we want.
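
The two rule formats described above can be sketched as follows. This is a minimal illustration, not xrenner's own code: the function names are hypothetical, the POS expression is matched with re.search, and every matching rule is applied in order, which is one possible reading of "cascade".

```python
import re

def parse_rules(spec, parts_per_rule):
    """Split a semicolon-separated rule string into tuples of slash-separated parts."""
    rules = []
    for rule in spec.split(";"):
        if rule:
            # A trailing slash yields empty extra parts, so keep only what is needed
            rules.append(tuple(rule.split("/")[:parts_per_rule]))
    return rules

def apply_lemma_rules(word, pos, spec):
    """Apply each POS-conditioned find/replace rule in turn (assumed cascade behavior)."""
    lemma = word
    for pos_re, find_re, repl in parse_rules(spec, 3):
        if re.search(pos_re, pos):
            lemma = re.sub(find_re, repl, lemma)
    return lemma

def apply_morph_rules(morph, spec):
    """Apply each find/replace rule to the morph field, regardless of POS."""
    for find_re, repl in parse_rules(spec, 2):
        morph = re.sub(find_re, repl, morph)
    return morph

LEMMA_RULES = "NNP?S/xes$/x/;NNP?S/ches$/ch/;NNP?S/sses$/ss/;NNP?S/ies$/y/;NNP?S/s$//"
print(apply_lemma_rules("boxes", "NNS", LEMMA_RULES))
print(apply_lemma_rules("cities", "NNS", LEMMA_RULES))
print(apply_morph_rules("3SgM", "3SgM/Masc"))
```

Because every matching rule fires in this sketch, rule order matters; whether xrenner stops after the first rule that changes the word is not specified in this section.
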
# Default entity
default_entity=abstract
# Parts of speech for proper nouns
proper_pos=/NN?PS?/
# Parts of speech for pronouns
pronoun_pos=/PR?P\$?|DT/
# What can an article look like which is not part of a person's actual name?
articles=/^([Aa]n?|[Tt]he)$/
# What can articles or modifiers look like which entail definiteness?
definite_articles=/^([Tt]h(e|is|at|[oe]se)|[Mm]y|[Yy]our|[Hh](is|er)|[Oo]ur|[Tt]heir|'|'s)$/
# Are people and place names in this language capitalized?
cap_names=True
# Ad hoc lemmatization rules - cascade of POS dependent string replace rules to use if no lemma is available in input
lemma_rules=NNP?S/xes$/x/;NNP?S/ches$/ch/;NNP?S/sses$/ss/;NNP?S/ies$/y/;NNP?S/s$//;NPS/xes$/x/;NPS/ches$/ch/;NPS/sses$/ss/;NPS/ies$/y/;NPS/s$//
# Edit morphology information - cascade of string replace rules to use on the morph field in conll data if available
morph_rules=_/_
# Auto lower case behavior for lemmas when using lemma rules. Possible values: all, except_all_caps, none
auto_lower_lemma=except_all_caps
# Entity to guess for unknown acronyms
all_caps_entity=organization
# Maximum length of the suffix substring used on unknown heads to establish category (e.g. 4 catches -ness as usually abstract)
max_suffix_length=8
# Optional path to serialized pre-trained sequence classifier for entity head classification
sequencer=eng_flair_nner_distilbert.pt
# Probability threshold to believe sequencer classification as non-referential; use 1.0 to disable
sequencer_nonref_thresh=0.99
# POS tags which the sequencer is allowed to discard as entity heads if sequencer_nonref_thresh is exceeded
sequencer_nonref_pos=/^NN?P?$/
# Child function expression which prevents an entity head from being removed, even if sequencer_nonref_thresh is exceeded
sequencer_nonref_forbidden_childfunc=/flat/
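
The three values of auto_lower_lemma can be read as in the sketch below. The function lower_lemma is hypothetical, and treating a lemma as "all caps" when Python's str.isupper() is true is an assumption about the intended test.

```python
def lower_lemma(lemma, mode="except_all_caps"):
    """Lower-case a lemma according to the auto_lower_lemma mode (illustrative)."""
    if mode == "all":
        return lemma.lower()
    if mode == "except_all_caps":
        # Keep acronym-like forms such as NATO intact (assumed all-caps test)
        return lemma if lemma.isupper() else lemma.lower()
    # mode == "none": leave the lemma as it is
    return lemma

print(lower_lemma("Dog"))
print(lower_lemma("NATO"))
print(lower_lemma("NATO", mode="all"))
```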

Agreement class detection

This section contains settings for detecting the morphological agreement class (e.g. feminine, masculine, plural).

Special values:

  • pos_agree_mapping: a special list of semicolon-separated mappings of POS categories to agreement classes, each internally separated by a right angle bracket (>). For example: NNS>plural means the tag NNS gets mapped to the agreement class plural
  • agree_entity_mapping: similarly, a mapping of agreement classes to entity types. For example, in English, mapping female to person is usually correct: female>person;male>person
  • never_agree_pairs: semicolon-separated list of plus-separated pairs of agreement classes that are never compatible, e.g. male+plural
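
The three special formats above can be parsed with a few lines of string handling. The sketch below is only an illustration with hypothetical function names; it treats the pairs in never_agree_pairs as unordered, which is an assumption.

```python
def parse_mapping(spec):
    """Parse 'A>x;B>y' into a dict, as used by pos_agree_mapping and agree_entity_mapping."""
    mapping = {}
    for item in spec.split(";"):
        if item:
            key, value = item.split(">", 1)
            mapping[key] = value
    return mapping

def parse_pairs(spec):
    """Parse 'a+b;c+d' into a set of unordered pairs, as used by never_agree_pairs."""
    return {frozenset(pair.split("+")) for pair in spec.split(";") if pair}

def can_agree(class_a, class_b, never_pairs):
    """Two agreement classes are compatible unless listed as a never-agree pair."""
    return frozenset((class_a, class_b)) not in never_pairs

pos_agree = parse_mapping("NPS>plural;NNPS>plural;NNS>plural")
never = parse_pairs("male+female")
print(pos_agree["NNS"])
print(can_agree("female", "male", never))
```
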
pos_agree_mapping=NPS>plural;NNPS>plural;NNS>plural
agree_entity_mapping=female>person;male>person
default_agree=inanim
# Agreement classes compatible with unknown agreement - all other classes will require strict agreement
agree_with_unknown=/inanim|male|female/
# Strict disagree sets, e.g. male+female; note that putting singular/plural here disrupts possible isa-table connections between the two.
# Separate each pair internally with plus, and multiple pairs with semicolon
never_agree_pairs=male+female
# Does isa matching of subclasses require morphological agreement?
isa_subclass_agreement=True
# Agreement classes incompatible with person type entities
no_person_agree=/inanim/
# Agreement class for coordinate/aggregate entities; plain string NOT a regex (use underscore to assign no special class)
aggregate_agree=plural
# Default name and agreement for persons and places - set to global default if not using
person_def_entity=person
place_def_entity=place
time_def_entity=time
object_def_entity=object
organization_def_entity=organization
quantity_def_entity=quantity
event_def_entity=event
abstract_def_entity=abstract
person_def_agree=male
place_def_agree=inanim
time_def_agree=inanim
organization_def_agree=inanim
object_def_agree=inanim
quantity_def_agree=plural

Coreference detection

Settings affecting coreference matching.

Special values:

no_antecedent:

A detailed description of markables forbidden to have antecedents (e.g. bare plurals in English, or indefinites in general). A semicolon-separated series of patterns of initial (^x), final ($x) or head (@x) POS tags + text specification that rule out antecedents. Use ! for negation, and & for multiple criteria. Use e.g. @none/none to switch this off. For example, the following specification forbids markables headed by a percent token (with text ‘%’) from having an antecedent:

@.*/^%$

The following specification says that a markable headed (@) by the POS tag NNS and having any text (.*), but also not (!) starting (^) with a POS tag matching DT or PRP is forbidden from having an antecedent (English bare plural).

@NNS/.*&^!(DT|PRP)/.*
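
One way to read the pattern syntax above is sketched below. This is not xrenner's implementation: a markable is represented here as a list of (text, POS) pairs plus a head index, the POS expression must match the whole tag while the text expression is searched, and patterns are split naively on ; and &. All of these are assumptions made for illustration.

```python
import re

def parse_criterion(crit):
    """Split one criterion into position marker, negation flag, POS regex and text regex."""
    marker, rest = crit[0], crit[1:]      # '^' first token, '$' last token, '@' head
    negated = rest.startswith("!")
    if negated:
        rest = rest[1:]
    pos_re, text_re = rest.split("/", 1)
    return marker, negated, re.compile(pos_re), re.compile(text_re)

def criterion_holds(crit, tokens, head_index):
    marker, negated, pos_re, text_re = parse_criterion(crit)
    index = {"^": 0, "$": len(tokens) - 1, "@": head_index}[marker]
    text, pos = tokens[index]
    matched = bool(pos_re.fullmatch(pos)) and bool(text_re.search(text))
    return matched != negated             # '!' inverts the result

def forbids_antecedent(spec, tokens, head_index):
    """True if any pattern in the specification has all of its criteria satisfied."""
    return any(
        all(criterion_holds(c, tokens, head_index) for c in pattern.split("&"))
        for pattern in spec.split(";")
    )

SPEC = r"@NNS/.*&^!(DT|PRP)/.*;@.*/^%$"
print(forbids_antecedent(SPEC, [("dogs", "NNS")], 0))                  # bare plural
print(forbids_antecedent(SPEC, [("the", "DT"), ("dogs", "NNS")], 1))   # definite plural
print(forbids_antecedent(SPEC, [("5", "CD"), ("%", "NN")], 1))         # percent head
```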

# Whether to use machine learning classifiers, if available
use_classifiers=True
# Default score threshold for classifier positive decision (lower threshold promotes more liberal coreference decisions)
# Note that best scoring candidate wins regardless of this; the threshold just means NO antecedent is accepted if nothing scores above the threshold
score_thresh=0.35
# Override/flavor specific classifier file suffix for specific versions of a model (e.g. _GUM in gbm_GUM3.pkl, instead of gbm3.pkl). Usually empty except in an override profile.
classifier_suffix=
# Modifier functions that require strictly identical heads for coreference
ident_mod_func=/^(nmod:poss|titlemod|nmod_of)$/
# What POS categories should allow lemma matching of heads for coreference? e.g. /^NNS?$/ to allow singular and plural nouns to match based on lemma
lemma_match_pos=/none/
# Dependency functions that are interpreted as modifiers for compatible modifier, and other modifier checks
mod_func=/amod|compound|flat|nummod|nmod_of/
# Tokens that mark opening and closing quotation for direct speech detection
open_quote=/^(["'”’]|``)$/
close_quote=/^(["'”’]|'')$/
# Tokens used to identify a question sentence
question_mark=/^\?$/
# Markables forbidden to have antecedents - semicolon separated patterns of initial (^x), final ($x) or head(@x) POS + text that rule out antecedent. 
# Use ! for negation, & for multiple criteria; use e.g. @none/none to switch off
no_antecedent=@NNS/.*&^!(DT|PR?P\$)/.*;@.*/^%$
# Are indefinite markables allowed to be anaphors (i.e. can they have antecedents)?
allow_indef_anaphor=False
# Are indefinite markables allowed to be anaphors via isa? (only used if allow_indef_anaphor is True)
allow_indef_isa=False
# Should we match Title Case entities to acronyms made of each token's initial letters?
match_acronyms=True
# Words to ignore in acronym matching, e.g. 'of' in Federal Bureau *of* Investigation <- FBI
ignore_in_acronym=/of/
# Should it be possible for a later mention to include lexical modifiers not present in the antecedent?
no_new_modifiers=True
# Should exceptional new modifiers listed in the new_modifiers file be allowed in subsequent mentions? (only used if no_new_modifiers=True)
use_new_modifier_exceptions=True
# Do proper noun modifiers have to match exactly across mentions? (NB: this may include proper modifiers such as Mr.!! Often leaving this False is better)
proper_mod_must_match=False
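
Two of the settings above can be illustrated with a short sketch: the role of score_thresh when picking among scored candidates, and the use of ignore_in_acronym when building an acronym from a Title Case name. Both functions are hypothetical; reading "above the threshold" as strictly greater, and matching ignored words against whole tokens, are assumptions.

```python
import re

def choose_antecedent(scores, score_thresh=0.35):
    """Return the best-scoring candidate, or None if nothing scores above the threshold."""
    if not scores:
        return None
    best = max(scores, key=scores.get)    # the best candidate wins regardless of threshold
    return best if scores[best] > score_thresh else None

def acronym_of(name, ignore_in_acronym="of"):
    """Build an acronym from token initials, skipping tokens that match the ignore expression."""
    ignore = re.compile(ignore_in_acronym)
    return "".join(tok[0] for tok in name.split() if not ignore.fullmatch(tok))

print(choose_antecedent({"the senator": 0.81, "the bill": 0.40}))
print(choose_antecedent({"the senator": 0.20, "the bill": 0.30}))
print(acronym_of("Federal Bureau of Investigation"))
```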

Postprocessing settings

Settings for sanitizing or altering output, such as removing singletons, or markables with specific grammatical functions which are nevertheless used internally during processing (e.g. predicatives).

remove_head_func=/compound/
remove_child_func=/cop/
remove_singletons=True
remove_cataphora=True
# Should we add a markable around adjacent appositional mentions?
add_appos_envelopes=True
# Should we attempt to match definites with no antecedent to a verb of the same stem?
seek_verb_for_defs=True
# Specify affixes that the rudimentary stemmer should delete when attempting verbal coreference matching
stemmer_deletes=/((:?at)ion|e?ment|ed?|ing|th|al|e?s)$/
# Should we allow an additional plural markable around coordinate entities, even if there is no later aggregate reference?
remove_coordinate_envelopes=True
# Semicolon-separated nested entity types to remove, e.g. person,nn,person;... removes person with function nn nested in a person
remove_nested_entities=person,compound,person;person,flat,person
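
To show what seek_verb_for_defs and stemmer_deletes are meant to achieve together, the sketch below strips affixes with the expression from the listing above and compares the results. The function names are hypothetical, and applying the expression once to the lower-cased word is an assumption about how the rudimentary stemmer works.

```python
import re

# Expression copied verbatim from stemmer_deletes above (without the slashes)
STEMMER_DELETES = re.compile(r"((:?at)ion|e?ment|ed?|ing|th|al|e?s)$")

def stem(word):
    """Delete a matching affix from the end of the word (single pass, assumed)."""
    return STEMMER_DELETES.sub("", word.lower())

def same_stem(noun, verb):
    """Could a definite noun with no antecedent be linked to this verb?"""
    return stem(noun) == stem(verb)

print(stem("announcement"), stem("announced"))
print(same_stem("announcement", "announced"))
print(same_stem("announcement", "agreed"))
```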