Developers’ module documentation
Classes
Xrenner
class modules.xrenner_xrenner.Xrenner(model='eng', override=None, rule_based=False, no_seq=False)
analyze(infile, out_format)
Method to run coreference analysis with loaded model
Parameters:
- infile – file name of the parse file in the conll10 format, or the pre-read parse itself
- out_format – format to determine output type, one of: html, paula, webanno, conll, onto, unittest
Returns: output based on requested format
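For orientation, a minimal usage sketch based on the constructor and analyze() signatures documented on this page; the input file name is a placeholder:

    from modules.xrenner_xrenner import Xrenner

    xrenner = Xrenner(model="eng")                        # load the default English model
    output = xrenner.analyze("example.conll10", "conll")  # placeholder parse file in conll10 format
    print(output)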
analyze_markable(mark, lex)
Find entity, agreement and cardinality information for a markable
Parameters:
- mark – the Markable object to analyze
- lex – the LexData object with gazetteer information and model settings
Returns: void
load(model='eng', override=None)
Method to load model data. Normally invoked by constructor, but can be repeated to change models later.
Parameters:
- model – model directory in models/ specifying settings and gazetteers for this language (default: eng)
- override – name of a section in models/override.ini if configuration overrides should be applied
Returns: None
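A brief sketch of switching or re-loading a model after construction; the override section name below is hypothetical and would need to exist in models/override.ini:

    from modules.xrenner_xrenner import Xrenner

    xrenner = Xrenner()                         # constructor calls load() with the default model
    xrenner.load(model="eng", override="GUM")   # "GUM" is a hypothetical override.ini section name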
process_sentence(tokoffset, sentence)
Function to analyze a single sentence
Parameters:
- tokoffset – the offset in tokens for the beginning of the current sentence within all input tokens
- sentence – the Sentence object containing mood, speaker and other information about this sentence
Returns: void
serialize_output(out_format, parse=None)
Return a string representation of the output in some format, or generate PAULA directory structure as output
Parameters:
- out_format – the format to generate, one of: html, paula, webanno, conll, onto, unittest
- parse – the original parse input fed to xrenner; only needed for unittest output
Returns: specified output format string, or void for paula
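A sketch of requesting a second serialization after analysis, assuming the analysis state is kept on the Xrenner instance (this assumption is not stated explicitly above):

    from modules.xrenner_xrenner import Xrenner

    xrenner = Xrenner(model="eng")
    xrenner.analyze("example.conll10", "conll")      # placeholder file name
    html_output = xrenner.serialize_output("html")   # re-serialize the same analysis as HTML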
ParsedToken
Markable
class modules.xrenner_classes.Markable(mark_id, head, form, definiteness, start, end, text, core_text, entity, entity_certainty, subclass, infstat, agree, sentence, antecedent, coref_type, group, alt_entities, alt_subclasses, alt_agree, cardinality=0, submarks=[], coordinate=False, agree_certainty='')
extract_features(lex, antecedent=None, candidate_list=[], dump_position=False)
Function to generate feature representation of markables or markable-antecedent pairs for classifiers
Parameters:
- lex – the LexData object with gazetteer information and model settings
- antecedent – The antecedent Markable potentially coreferring to self
- candidate_list – The list of candidate markables under consideration, used to extract cohort size
- dump_position – Whether document name + token positions are dumped for each markable to compare to gold
Returns: dictionary of markable properties
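Since the return value is a plain dictionary of properties, it can be turned into classifier input with a standard dictionary vectorizer. A minimal sketch, assuming mark, antecedent and lex are objects obtained during a running analysis (they are not constructed here):

    from sklearn.feature_extraction import DictVectorizer

    # mark, antecedent and lex are assumed to come from an in-progress xrenner analysis
    feats = mark.extract_features(lex, antecedent=antecedent)
    vec = DictVectorizer()
    X = vec.fit_transform([feats])   # one markable-antecedent pair as a single feature row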
LexData
class modules.xrenner_lex.LexData(model, xrenner, override=None, rule_based=False, no_seq=False)
Class to hold lexical information from gazetteers and training data. Use the model argument to define a subdirectory under models/ for reading different sets of configuration files.
get_atoms()
Function to compile atom list for atomic markable recognition. Currently treats listed persons, places, organizations and inanimate objects from lexical data as atomic by default.
Returns: dictionary of atoms
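A sketch of inspecting the compiled atoms; the LexData constructor arguments follow the class signature above, and the lookup string is an invented example whose presence depends on the gazetteers:

    from modules.xrenner_xrenner import Xrenner
    from modules.xrenner_lex import LexData

    xrenner = Xrenner(model="eng")
    lex = LexData("eng", xrenner)        # arguments as in the class signature above
    atoms = lex.get_atoms()
    print("New York" in atoms)           # invented lookup; membership depends on the gazetteers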
get_filters(override=None)
Reads model settings from config.ini and possibly overrides from override.ini
Parameters: override – optional section name in override.ini
Returns: filters – dictionary of settings from config.ini with possible overrides
static get_first_last_names(names)
Collects separate first and last name data from the collection in names.tab
Parameters: names – The complete names dictionary from names.tab, mapping full name to agreement
Returns: [firsts, lasts] – list containing dictionary of first names to agreement and set of last names
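Because this is a static method, it can be sketched with a hand-built dictionary in the format described above (full name mapped to agreement); the entries are invented for illustration:

    from modules.xrenner_lex import LexData

    names = {"John Smith": "male", "Mary Jones": "female"}   # invented entries in the names.tab mapping format
    firsts, lasts = LexData.get_first_last_names(names)
    # firsts: dictionary of first names to agreement; lasts: set of last names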
get_func_substitutes()
Function for semi-hard-wired function substitutions based on function label and dependency direction. Uses func_substitute_forward and func_substitute_backward settings in config.ini
Returns: list of compiled substitutions_forward, substitutions_backward
get_morph()
Compiles morphological affix dictionary based on members of entity_heads.tab
Returns: dictionary from affixes to dictionaries mapping classes to type frequencies
get_pos_agree_mappings()
Gets dictionary mapping POS categories to default agreement classes, e.g. NNS > plural
Returns: mapping dictionary
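A sketch of inspecting the mapping; the LexData construction follows the class signature documented above, and the NNS lookup follows the example given in the description:

    from modules.xrenner_xrenner import Xrenner
    from modules.xrenner_lex import LexData

    lex = LexData("eng", Xrenner(model="eng"))   # constructed as in the class signature above
    mappings = lex.get_pos_agree_mappings()
    print(mappings.get("NNS"))                   # expected to be a plural agreement class, per the example above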
lemmatize(token)
Simple lemmatization function using rules from lemma_rules in config.ini
Parameters: token – ParsedToken object to be lemmatized
Returns: string – the lemma
parse_coref_rules(rule_list)
Reader function to parse coref_rules.tab into CorefRule objects in two lists: one for general rules and one also including rules to use when speaker info is available.
Parameters: rule_list – textual list of rules
Returns: two separate lists of compiled CorefRule objects with and without speaker specifications
process_morph(token)
Simple mechanism for substituting values in the morph feature of input tokens. For more elaborate sub-graph dependent manipulations, use the depedit module
Parameters: token – ParsedToken object whose morph feature is edited
Returns: string – the edited morph feature
read_antonyms()
Function to create a dictionary from each word to all its antonyms in antonyms.tab
Returns: dictionary from words to antonym sets
read_delim(filename, mode='normal', atom_list_name='atoms', add_to_sums=False, sep=', ')
Generic file reader for lexical data in model directory
Parameters:
- filename – string - name of the file
- mode – double, triple, quadruple, quadruple_numeric, triple_numeric or low reading mode
- atom_list_name – list of atoms to use for triple reader mode
- add_to_sums – whether to sum numbers from multiple instances of the same key
- sep – separator for double_with_sep mode
Returns: compiled lexical data, usually a structured dictionary or set depending on number of columns
Modules
depedit
DepEdit - A simple configurable tool for manipulating dependency trees
Input: CoNLL10 or CoNLLU (10 columns, tab-delimited, blank line between sentences, comments with pound sign #)
Author: Amir Zeldes
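To illustrate the input format described here, a minimal one-sentence fragment in the 10-column layout (columns are tab-delimited in real input; the token values are invented):

    # an invented two-token sentence; comment lines start with #
    1	Mary	Mary	PROPN	NNP	_	2	nsubj	_	_
    2	sleeps	sleep	VERB	VBZ	_	0	root	_	_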
xrenner_compatible
modules.xrenner_compatible.acronym_match(mark, candidate, lex)
Check whether a Markable’s text is an acronym of a candidate Markable’s text
Parameters:
- mark – The Markable object to test
- candidate – The candidate Markable with potentially acronym-matching text
- lex – the LexData object with gazetteer information and model settings
Returns: bool
modules.xrenner_compatible.agree_compatible(mark1, mark2, lex)
Checks if the agree property of two markables is compatible for possible coreference
Parameters:
- mark1 – the first of two markables to compare agreement
- mark2 – the second of two markables to compare agreement
- lex – the LexData object with gazetteer information and model settings
Returns: bool
modules.xrenner_compatible.best_candidate(markable, candidate_set, lex, rule, take_first=False)
Parameters:
- markable – markable to find best antecedent for
- candidate_set – set of markables which are possible antecedents based on some coref_rule
- lex – the LexData object with gazetteer information and model settings
- propagate – string with feature propagation instructions from coref_rules.tab in lex
- rule_num – the rule number of the rule producing the match in coref_rules.tab
- clf_name – name of the pickled classifier to use for this rule, or “_default_” to use heuristic matching
- take_first – boolean, whether to skip matching and use the most recent candidate (minimum token distance). This saves time if a rule is guaranteed to produce a unique, correct candidate (e.g. reflexives)
Returns: Markable object or None (the selected best antecedent markable, if available)
modules.xrenner_compatible.entities_compatible(mark1, mark2, lex)
Checks if the entity property of two markables is compatible for possible coreference
Parameters:
- mark1 – the first of two markables to compare entities
- mark2 – the second of two markables to compare entities
- lex – the LexData object with gazetteer information and model settings
Returns: bool
modules.xrenner_compatible.group_agree_compatible(markable, candidate, previous_markables, lex)
Parameters:
- markable – markable whose group the candidate might be joined to
- candidate – candidate to check for compatibility with all group members
- previous_markables – all previous markables which may need to inherit from the model/host
- lex – the LexData object with gazetteer information and model settings
Returns: bool
modules.xrenner_compatible.isa(markable, candidate, lex)
Staging function to check for and store new cached isa information. Calls the actual run_isa() function if the pair is still viable for a new isa match.
Parameters:
- markable – one of two markables to compare lexical isa relationship with
- candidate – the second markable, which is a candidate antecedent for the other markable
- lex – the LexData object with gazetteer information and model settings
Returns: bool
modules.xrenner_compatible.merge_entities(mark1, mark2, previous_markables, lex)
Negotiates entity mismatches across coreferent markables and their groups. Returns True if merging has occurred.
Parameters:
- mark1 – the first of two markables to merge entities for
- mark2 – the second of two markables to merge entities for
- previous_markables – all previous markables which may need to inherit from the model/host
- lex – the LexData object with gazetteer information and model settings
Returns: bool
modules.xrenner_compatible.modifiers_compatible(markable, candidate, lex, allow_force_proper_mod_match=True)
Checks whether the dependents of two markables are compatible for possible coreference
Parameters:
- markable – Markable: one of two markables to compare dependents for
- candidate – Markable: the second markable, which is a candidate antecedent for the other markable
- lex – the LexData object with gazetteer information and model settings
Returns: bool
modules.xrenner_compatible.run_isa(markable, candidate, lex)
Checks whether two markables are compatible for coreference via the isa-relation
Parameters:
- markable – one of two markables to compare lexical isa relationship with
- candidate – the second markable, which is a candidate antecedent for the other markable
- lex – the LexData object with gazetteer information and model settings
Returns: bool
modules.xrenner_compatible.score_match_heuristic(markable, candidate, features, lex)
Basic fall-back function for heuristic match scoring when no classifier is available
Parameters:
- markable –
- candidate –
- features –
- lex –
Returns:
modules.xrenner_compatible.update_group(host, model, previous_markables, lex)
Attempts to update the entire coreference group of a host markable with information gathered from a model markable discovered to be possibly coreferent with the host. If incompatible modifiers are discovered, the process fails and returns False. Otherwise updating succeeds and update_group returns True.
Parameters:
- host – the first markable discovered to be coreferent with the model
- model – the model markable, containing new information for the group
- previous_markables – all previous markables which may need to inherit from the model/host
- lex – the LexData object with gazetteer information and model settings
Returns: bool
xrenner_coref
modules.xrenner_coref.antecedent_prohibited(markable, conll_tokens, lex)
Check whether a Markable object is prohibited from having an antecedent
Parameters:
- markable – The Markable object to check
- conll_tokens – The list of ParsedToken objects up to and including the current sentence
- lex – the LexData object with gazetteer information and model settings
Returns: bool
modules.xrenner_coref.coref_rule_applies(lex, constraints, mark, anaphor=None)
Check whether a markable definition from a coref rule applies to this markable
Parameters:
- lex – the LexData object with gazetteer information and model settings
- constraints – the constraints defining the relevant Markable
- mark – the Markable object to check constraints against
- anaphor – if this is an antecedent check, the anaphor is passed for $1-style constraint checks
Returns: bool: True if ‘mark’ fits all constraints, False if any of them fail
modules.xrenner_coref.find_antecedent(markable, previous_markables, lex, restrict_rule='')
Search for antecedents by cycling through coref rules for previous markables
Parameters:
- markable – Markable object to find an antecedent for
- previous_markables – Markables in all sentences up to and including current sentence
- lex – the LexData object with gazetteer information and model settings
- restrict_rule – a string specifying a subset of rules that should be checked (e.g. only rules with ‘appos’)
Returns: candidate, matching_rule - the best antecedent and the rule that matched it
modules.xrenner_coref.search_prev_markables(markable, previous_markables, rule, lex)
Search for antecedent to specified markable using a specified rule
Parameters:
- markable – The markable object to find an antecedent for
- previous_markables – The list of known markables up to and including the current sentence; markables beyond the current markable but in its sentence are included for cataphora
- ante_constraints – A list of ConstraintMatcher objects describing the antecedent
- ante_spec – The antecedent specification part of the coref rule being checked, as a string
- lex – the LexData object with gazetteer information and model settings
- max_dist – Maximum distance in sentences for the antecedent search (0 for search within sentence)
- propagate – Whether to propagate features upon match and in which direction
Returns: the selected candidate Markable object
xrenner_marker
modules.xrenner_marker.assign_coordinate_entity(mark, markables_by_head)
Checks if all constituents of a coordinate markable have the same entity and subclass and if so, propagates these to the coordinate markable.
Parameters:
- mark – a coordinate markable to check the entities of its constituents
- markables_by_head – dictionary of markables by head id
Returns: void
modules.xrenner_marker.construct_modifier_substring(modifier)
Creates a list of tokens representing a modifier and all of its submodifiers in sequence
Parameters: modifier – A ParsedToken object from the modifier list of the head of some markable
Returns: Text of that modifier together with its modifiers in sequence
modules.xrenner_marker.disambiguate_entity(mark, lex)
Selects the preferred entity for a Markable with multiple alt_entities based on dependency information or the more common type
Parameters:
- mark – the Markable object
- lex – the LexData object with gazetteer information and model settings
Returns: predicted entity type as string
modules.xrenner_marker.get_mod_ordered_dict(mod)
Retrieves the (sub)modifiers of a modifier token
Parameters: mod – A ParsedToken object representing a modifier of the head of some markable
Returns: Recursive ordered dictionary of that modifier’s own modifiers
modules.xrenner_marker.is_atomic(mark, atoms, lex)
Checks if nested markables are allowed within this markable
Parameters:
- mark – the Markable object to check
- atoms – dictionary of atoms for atomic markable recognition (see LexData.get_atoms)
- lex – the LexData object with gazetteer information and model settings
Returns: bool
modules.xrenner_marker.lookup_has_entity(text, lemma, entity, lex)
Checks if a certain token text or lemma has the specified entity listed in the entities or entity_heads lists
Parameters:
- text – text of the token
- lemma – lemma of the token
- entity – entity to check for
- lex – the LexData object with gazetteer information and model settings
Returns: bool
modules.xrenner_marker.markables_overlap(mark1, mark2, lex=None)
Helper function to check if two markables cover some of the same tokens. Note that if the lex argument is specified, it is used to recognize possessives, which behave exceptionally. Possessive pronouns beginning after a main markable has started are tolerated in case of markable definitions including relative clauses, e.g. [Mr. Pickwick, who was looking for [his] hat]
Parameters:
- mark1 – the first of two markables to check for overlap
- mark2 – the second of two markables to check for overlap
- lex – optional LexData object with gazetteer information and model settings, used to recognize possessives
Returns: bool
modules.xrenner_marker.parse_entity(entity_text, certainty='uncertain')
Parses: entity -tab- subclass(/agree) + certainty into a tuple
Parameters:
- entity_text – the string to parse, must contain exactly two tabs
- certainty – the certainty string at end of tuple, default ‘uncertain’
Returns: quadruple of (entity, subclass, agree, certainty)
modules.xrenner_marker.recognize_entity_by_mod(mark, lex, mark_atoms=False)
Attempt to recognize entity type based on modifiers
Returns: String (entity type, possibly including subtype and agreement)
modules.xrenner_marker.remove_infix_tokens(marktext, lex)
Remove infix tokens such as dashes, interfixed articles (in Semitic construct state) etc.
Parameters:
- marktext – the markable text string to remove tokens from
- lex – the LexData object with gazetteer information and model settings
Returns: potentially truncated text
modules.xrenner_marker.remove_prefix_tokens(marktext, lex)
Remove leading tokens such as articles and other tokens configured as potentially redundant to citation form
Parameters:
- marktext – the markable text string to remove tokens from
- lex – the LexData object with gazetteer information and model settings
Returns: potentially truncated text
modules.xrenner_marker.remove_suffix_tokens(marktext, lex)
Remove trailing tokens such as genitive ‘s and other tokens configured as potentially redundant to citation form
Parameters:
- marktext – the markable text string to remove tokens from
- lex – the LexData object with gazetteer information and model settings
Returns: potentially truncated text
modules.xrenner_marker.resolve_cardinality(mark, lex)
Find cardinality for a Markable based on numerical modifiers or number words
Parameters:
- mark – the Markable object
- lex – the LexData object with gazetteer information and model settings
Returns: Cardinality as float, zero if unknown
modules.xrenner_marker.resolve_entity_cascade(entity_text, mark, lex)
Retrieve possible entity types for a given text fragment based on entities list, entity heads and names list.
Parameters:
- entity_text – the text fragment to look up
- mark – the Markable object being processed
- lex – the LexData object with gazetteer information and model settings
Returns: entity type; note that this is used to decide whether to stop the search, but the Markable’s entity is already set during processing together with matching subclass and agree information
modules.xrenner_marker.resolve_mark_agree(mark, lex)
Resolve Markable agreement based on morph information in tokens or gazetteer data
Parameters:
- mark – the Markable object
- lex – the LexData object with gazetteer information and model settings
Returns: void
xrenner_postprocess
Postprocessing module. Alters results of coreference analysis based on model settings, such as deleting certain markables or re-wiring coreference relations according to a particular annotation scheme
Author: Amir Zeldes and Shuo Zhang
modules.xrenner_postprocess.kill_zero_marks(markables, markstart_dict, markend_dict)
Removes markables whose id has been set to 0 in postprocessing
Parameters:
- markables – All Markable objects
- markstart_dict – Dictionary of token span start ids to lists of markables starting at that id
- markend_dict – Dictionary of token span end ids to lists of markables ending at that id
Returns: void
xrenner_preprocess
modules/xrenner_preprocess.py
Prepare parser output for entity and coreference resolution
Author: Amir Zeldes
modules.xrenner_preprocess.add_child_info(conll_tokens, child_funcs, child_strings, lex)
Adds a list of all dependent functions and token strings to each parent token
Parameters:
- conll_tokens – The ParsedToken list so far
- child_funcs – Dictionary from ids to child functions
- child_strings – Dictionary from ids to child strings
- lex – the LexData object with gazetteer information and model settings
Returns: void
modules.xrenner_preprocess.add_negated_parents(conll_tokens, offset)
Sets the neg_parent property on tokens whose head dominates a negation
Parameters:
- conll_tokens – token list for this document
- offset – token ID reached in last sentence
Returns: None
modules.xrenner_preprocess.replace_conj_func(conll_tokens, tokoffset, lex)
Function to replace functions of tokens matching the conjunction function with their parent’s function
Parameters:
- conll_tokens – The ParsedToken list so far
- tokoffset – The starting token for this sentence
- lex – the LexData object with gazetteer information and model settings
Returns: void
xrenner_propagate
modules/xrenner_propagate.py
Feature propagation module. Propagates entity and agreement features for coreferring markables.
Author: Amir Zeldes
modules.xrenner_propagate.propagate_agree(markable, candidate)
Propagate agreement between two markables if one has unknown agreement
Parameters:
- markable – Markable object
- candidate – Coreferent antecedent Markable object
Returns: void
modules.xrenner_propagate.propagate_entity(markable, candidate, direction='propagate')
Propagate class and agreement features between coreferent markables
Parameters:
- markable – a Markable object
- candidate – a coreferent antecedent Markable object
- direction – propagation direction; by default, data can be propagated in either direction from the more certain markable to the less certain one, but direction can be forced, e.g. ‘propagate_forward’
Returns: void