Developers’ module documentation
Classes
Xrenner
class modules.xrenner_xrenner.Xrenner(model='eng', override=None, rule_based=False, no_seq=False)
analyze(infile, out_format)
Method to run coreference analysis with loaded model
Parameters:
- infile – file name of the parse file in the conll10 format, or the pre-read parse itself
- out_format – format to determine output type, one of: html, paula, webanno, conll, onto, unittest
Returns: output based on requested format
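For orientation, a minimal usage sketch based on the constructor and analyze() signatures documented on this page; the input file name is a placeholder:

    from modules.xrenner_xrenner import Xrenner

    xrenner = Xrenner(model="eng")                        # load the default English model
    output = xrenner.analyze("example.conll10", "conll")  # placeholder parse file in conll10 format
    print(output)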
analyze_markable(mark, lex)
Find entity, agreement and cardinality information for a markable
Parameters:
- mark – the Markable object to analyze
- lex – the LexData object with gazetteer information and model settings
Returns: void
load(model='eng', override=None)
Method to load model data. Normally invoked by constructor, but can be repeated to change models later.
Parameters:
- model – model directory in models/ specifying settings and gazetteers for this language (default: eng)
- override – name of a section in models/override.ini if configuration overrides should be applied
Returns: None
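A brief sketch of switching or re-loading a model after construction; the override section name below is hypothetical and would need to exist in models/override.ini:

    from modules.xrenner_xrenner import Xrenner

    xrenner = Xrenner()                         # constructor calls load() with the default model
    xrenner.load(model="eng", override="GUM")   # "GUM" is a hypothetical override.ini section name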
process_sentence(tokoffset, sentence)
Function to analyze a single sentence
Parameters:
- tokoffset – the offset in tokens for the beginning of the current sentence within all input tokens
- sentence – the Sentence object containing mood, speaker and other information about this sentence
Returns: void
serialize_output(out_format, parse=None)
Return a string representation of the output in some format, or generate PAULA directory structure as output
Parameters:
- out_format – the format to generate, one of: html, paula, webanno, conll, onto, unittest
- parse – the original parse input fed to xrenner; only needed for unittest output
Returns: specified output format string, or void for paula
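A sketch of requesting a second serialization after analysis, assuming the analysis state is kept on the Xrenner instance (this assumption is not stated explicitly above):

    from modules.xrenner_xrenner import Xrenner

    xrenner = Xrenner(model="eng")
    xrenner.analyze("example.conll10", "conll")      # placeholder file name
    html_output = xrenner.serialize_output("html")   # re-serialize the same analysis as HTML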
ParsedToken
Markable
class modules.xrenner_classes.Markable(mark_id, head, form, definiteness, start, end, text, core_text, entity, entity_certainty, subclass, infstat, agree, sentence, antecedent, coref_type, group, alt_entities, alt_subclasses, alt_agree, cardinality=0, submarks=[], coordinate=False, agree_certainty='')
extract_features(lex, antecedent=None, candidate_list=[], dump_position=False)
Function to generate feature representation of markables or markable-antecedent pairs for classifiers
Parameters:
- lex – the LexData object with gazetteer information and model settings
- antecedent – The antecedent Markable potentially coreferring to self
- candidate_list – The list of candidate markables under consideration, used to extract cohort size
- dump_position – Whether document name + token positions are dumped for each markable to compare to gold
Returns: dictionary of markable properties
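Since the return value is a plain dictionary of properties, it can be turned into classifier input with a standard dictionary vectorizer. A minimal sketch, assuming mark, antecedent and lex are objects obtained during a running analysis (they are not constructed here):

    from sklearn.feature_extraction import DictVectorizer

    # mark, antecedent and lex are assumed to come from an in-progress xrenner analysis
    feats = mark.extract_features(lex, antecedent=antecedent)
    vec = DictVectorizer()
    X = vec.fit_transform([feats])   # one markable-antecedent pair as a single feature row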
LexData
class modules.xrenner_lex.LexData(model, xrenner, override=None, rule_based=False, no_seq=False)
Class to hold lexical information from gazetteers and training data. Use the model argument to define a subdirectory under models/ for reading different sets of configuration files.
get_atoms()
Function to compile atom list for atomic markable recognition. Currently treats listed persons, places, organizations and inanimate objects from lexical data as atomic by default.
Returns: dictionary of atoms
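A sketch of inspecting the compiled atoms; the LexData constructor arguments follow the class signature above, and the lookup string is an invented example whose presence depends on the gazetteers:

    from modules.xrenner_xrenner import Xrenner
    from modules.xrenner_lex import LexData

    xrenner = Xrenner(model="eng")
    lex = LexData("eng", xrenner)        # arguments as in the class signature above
    atoms = lex.get_atoms()
    print("New York" in atoms)           # invented lookup; membership depends on the gazetteers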
get_filters(override=None)
Reads model settings from config.ini and possibly overrides from override.ini
Parameters: override – optional section name in override.ini
Returns: filters – dictionary of settings from config.ini with possible overrides
static get_first_last_names(names)
Collects separate first and last name data from the collection in names.tab
Parameters: names – The complete names dictionary from names.tab, mapping full name to agreement
Returns: [firsts, lasts] – list containing dictionary of first names to agreement and set of last names
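Because this is a static method, it can be sketched with a hand-built dictionary in the format described above (full name mapped to agreement); the entries are invented for illustration:

    from modules.xrenner_lex import LexData

    names = {"John Smith": "male", "Mary Jones": "female"}   # invented entries in the names.tab mapping format
    firsts, lasts = LexData.get_first_last_names(names)
    # firsts: dictionary of first names to agreement; lasts: set of last names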
get_func_substitutes()
Function for semi-hard-wired function substitutions based on function label and dependency direction. Uses func_substitute_forward and func_substitute_backward settings in config.ini
Returns: list of compiled substitutions_forward, substitutions_backward
get_morph()
Compiles morphological affix dictionary based on members of entity_heads.tab
Returns: dictionary from affixes to dictionaries mapping classes to type frequencies
get_pos_agree_mappings()
Gets dictionary mapping POS categories to default agreement classes, e.g. NNS > plural
Returns: mapping dictionary
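A sketch of inspecting the mapping; the LexData construction follows the class signature documented above, and the NNS lookup follows the example given in the description:

    from modules.xrenner_xrenner import Xrenner
    from modules.xrenner_lex import LexData

    lex = LexData("eng", Xrenner(model="eng"))   # constructed as in the class signature above
    mappings = lex.get_pos_agree_mappings()
    print(mappings.get("NNS"))                   # expected to be a plural agreement class, per the example above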
lemmatize(token)
Simple lemmatization function using rules from lemma_rules in config.ini
Parameters: token – ParsedToken object to be lemmatized
Returns: string – the lemma
parse_coref_rules(rule_list)
Reader function to parse coref_rules.tab into CorefRule objects in two lists: one for general rules and one also including rules to use when speaker info is available.
Parameters: rule_list – textual list of rules
Returns: two separate lists of compiled CorefRule objects with and without speaker specifications
process_morph(token)
Simple mechanism for substituting values in the morph feature of input tokens. For more elaborate sub-graph dependent manipulations, use the depedit module
Parameters: token – ParsedToken object whose morph feature is edited
Returns: string – the edited morph feature
read_antonyms()
Function to create a dictionary from each word to all its antonyms in antonyms.tab
Returns: dictionary from words to antonym sets
read_delim(filename, mode='normal', atom_list_name='atoms', add_to_sums=False, sep=', ')
Generic file reader for lexical data in model directory
Parameters:
- filename – string - name of the file
- mode – double, triple, quadruple, quadruple_numeric, triple_numeric or low reading mode
- atom_list_name – list of atoms to use for triple reader mode
- add_to_sums – whether to sum numbers from multiple instances of the same key
- sep – separator for double_with_sep mode
Returns: compiled lexical data, usually a structured dictionary or set depending on number of columns
Modules
depedit
DepEdit - A simple configurable tool for manipulating dependency trees
Input: CoNLL10 or CoNLLU (10 columns, tab-delimited, blank line between sentences, comments with pound sign #)
Author: Amir Zeldes
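To illustrate the input format described here, a minimal one-sentence fragment in the 10-column layout (columns are tab-delimited in real input; the token values are invented):

    # an invented two-token sentence; comment lines start with #
    1	Mary	Mary	PROPN	NNP	_	2	nsubj	_	_
    2	sleeps	sleep	VERB	VBZ	_	0	root	_	_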
xrenner_compatible
modules.xrenner_compatible.acronym_match(mark, candidate, lex)
Check whether a Markable’s text is an acronym of a candidate Markable’s text
Parameters:
- mark – The Markable object to test
- candidate – The candidate Markable with potentially acronym-matching text
- lex – the LexData object with gazetteer information and model settings
Returns: bool
modules.xrenner_compatible.agree_compatible(mark1, mark2, lex)
Checks if the agree property of two markables is compatible for possible coreference
Parameters:
- mark1 – the first of two markables to compare agreement
- mark2 – the second of two markables to compare agreement
- lex – the LexData object with gazetteer information and model settings
Returns: bool
modules.xrenner_compatible.best_candidate(markable, candidate_set, lex, rule, take_first=False)
Parameters:
- markable – markable to find best antecedent for
- candidate_set – set of markables which are possible antecedents based on some coref_rule
- lex – the LexData object with gazetteer information and model settings
- propagate – string with feature propagation instructions from coref_rules.tab in lex
- rule_num – the rule number of the rule producing the match in coref_rules.tab
- clf_name – name of the pickled classifier to use for this rule, or “_default_” to use heuristic matching
- take_first – boolean, whether to skip matching and use the most recent candidate (minimum token distance). This saves time if a rule is guaranteed to produce a unique, correct candidate (e.g. reflexives)
Returns: Markable object or None (the selected best antecedent markable, if available)
modules.xrenner_compatible.entities_compatible(mark1, mark2, lex)
Checks if the entity property of two markables is compatible for possible coreference
Parameters:
- mark1 – the first of two markables to compare entities
- mark2 – the second of two markables to compare entities
- lex – the LexData object with gazetteer information and model settings
Returns: bool
modules.xrenner_compatible.group_agree_compatible(markable, candidate, previous_markables, lex)
Parameters:
- markable – markable whose group the candidate might be joined to
- candidate – candidate to check for compatibility with all group members
- previous_markables – all previous markables which may need to inherit from the model/host
- lex – the LexData object with gazetteer information and model settings
Returns: bool
modules.xrenner_compatible.isa(markable, candidate, lex)
Staging function to check for and store new cached isa information. Calls the actual run_isa() function if the pair is still viable for a new isa match.
Parameters:
- markable – one of two markables to compare lexical isa relationship with
- candidate – the second markable, which is a candidate antecedent for the other markable
- lex – the LexData object with gazetteer information and model settings
Returns: bool
modules.xrenner_compatible.merge_entities(mark1, mark2, previous_markables, lex)
Negotiates entity mismatches across coreferent markables and their groups. Returns True if merging has occurred.
Parameters:
- mark1 – the first of two markables to merge entities for
- mark2 – the second of two markables to merge entities for
- previous_markables – all previous markables which may need to inherit from the model/host
- lex – the LexData object with gazetteer information and model settings
Returns: bool
modules.xrenner_compatible.modifiers_compatible(markable, candidate, lex, allow_force_proper_mod_match=True)
Checks whether the dependents of two markables are compatible for possible coreference
Parameters:
- markable – Markable: one of two markables to compare dependents for
- candidate – Markable: the second markable, which is a candidate antecedent for the other markable
- lex – the LexData object with gazetteer information and model settings
Returns: bool
modules.xrenner_compatible.run_isa(markable, candidate, lex)
Checks whether two markables are compatible for coreference via the isa-relation
Parameters:
- markable – one of two markables to compare lexical isa relationship with
- candidate – the second markable, which is a candidate antecedent for the other markable
- lex – the LexData object with gazetteer information and model settings
Returns: bool
modules.xrenner_compatible.score_match_heuristic(markable, candidate, features, lex)
Basic fall-back function for heuristic match scoring when no classifier is available
Parameters:
- markable –
- candidate –
- features –
- lex –
Returns:
modules.xrenner_compatible.update_group(host, model, previous_markables, lex)
Attempts to update the entire coreference group of a host markable with information gathered from a model markable discovered to be possibly coreferent with the host. If incompatible modifiers are discovered, the process fails and returns False. Otherwise updating succeeds and update_group returns True.
Parameters:
- host – the first markable discovered to be coreferent with the model
- model – the model markable, containing new information for the group
- previous_markables – all previous markables which may need to inherit from the model/host
- lex – the LexData object with gazetteer information and model settings
Returns: bool
xrenner_coref
modules.xrenner_coref.antecedent_prohibited(markable, conll_tokens, lex)
Check whether a Markable object is prohibited from having an antecedent
Parameters:
- markable – The Markable object to check
- conll_tokens – The list of ParsedToken objects up to and including the current sentence
- lex – the LexData object with gazetteer information and model settings
Returns: bool
modules.xrenner_coref.coref_rule_applies(lex, constraints, mark, anaphor=None)
Check whether a markable definition from a coref rule applies to this markable
Parameters:
- lex – the LexData object with gazetteer information and model settings
- constraints – the constraints defining the relevant Markable
- mark – the Markable object to check constraints against
- anaphor – if this is an antecedent check, the anaphor is passed for $1-style constraint checks
Returns: bool: True if ‘mark’ fits all constraints, False if any of them fail
modules.xrenner_coref.find_antecedent(markable, previous_markables, lex, restrict_rule='')
Search for antecedents by cycling through coref rules for previous markables
Parameters:
- markable – Markable object to find an antecedent for
- previous_markables – Markables in all sentences up to and including current sentence
- lex – the LexData object with gazetteer information and model settings
- restrict_rule – a string specifying a subset of rules that should be checked (e.g. only rules with ‘appos’)
Returns: candidate, matching_rule - the best antecedent and the rule that matched it
modules.xrenner_coref.search_prev_markables(markable, previous_markables, rule, lex)
Search for antecedent to specified markable using a specified rule
Parameters:
- markable – The markable object to find an antecedent for
- previous_markables – The list of known markables up to and including the current sentence; markables beyond the current markable but in its sentence are included for cataphora
- ante_constraints – A list of ConstraintMatcher objects describing the antecedent
- ante_spec – The antecedent specification part of the coref rule being checked, as a string
- lex – the LexData object with gazetteer information and model settings
- max_dist – Maximum distance in sentences for the antecedent search (0 for search within sentence)
- propagate – Whether to propagate features upon match and in which direction
Returns: the selected candidate Markable object
xrenner_marker
modules.xrenner_marker.assign_coordinate_entity(mark, markables_by_head)
Checks if all constituents of a coordinate markable have the same entity and subclass and if so, propagates these to the coordinate markable.
Parameters:
- mark – a coordinate markable to check the entities of its constituents
- markables_by_head – dictionary of markables by head id
Returns: void
modules.xrenner_marker.construct_modifier_substring(modifier)
Creates a list of tokens representing a modifier and all of its submodifiers in sequence
Parameters: modifier – A ParsedToken object from the modifier list of the head of some markable
Returns: Text of that modifier together with its modifiers in sequence
modules.xrenner_marker.disambiguate_entity(mark, lex)
Selects the preferred entity for a Markable with multiple alt_entities based on dependency information or the more common type
Parameters:
- mark – the Markable object
- lex – the LexData object with gazetteer information and model settings
Returns: predicted entity type as string
modules.xrenner_marker.get_mod_ordered_dict(mod)
Retrieves the (sub)modifiers of a modifier token
Parameters: mod – A ParsedToken object representing a modifier of the head of some markable
Returns: Recursive ordered dictionary of that modifier’s own modifiers
modules.xrenner_marker.is_atomic(mark, atoms, lex)
Checks if nested markables are allowed within this markable
Parameters:
- mark – the Markable object to check
- atoms – dictionary of atoms for atomic markable recognition (see LexData.get_atoms)
- lex – the LexData object with gazetteer information and model settings
Returns: bool
modules.xrenner_marker.lookup_has_entity(text, lemma, entity, lex)
Checks if a certain token text or lemma has the specified entity listed in the entities or entity_heads lists
Parameters:
- text – text of the token
- lemma – lemma of the token
- entity – entity to check for
- lex – the LexData object with gazetteer information and model settings
Returns: bool
modules.xrenner_marker.markables_overlap(mark1, mark2, lex=None)
Helper function to check if two markables cover some of the same tokens. Note that if the lex argument is specified, it is used to recognize possessives, which behave exceptionally. Possessive pronouns beginning after a main markable has started are tolerated in case of markable definitions including relative clauses, e.g. [Mr. Pickwick, who was looking for [his] hat]
Parameters:
- mark1 – the first of two markables to check for overlap
- mark2 – the second of two markables to check for overlap
- lex – optional LexData object with gazetteer information and model settings, used to recognize possessives
Returns: bool
modules.xrenner_marker.parse_entity(entity_text, certainty='uncertain')
Parses: entity -tab- subclass(/agree) + certainty into a tuple
Parameters:
- entity_text – the string to parse, must contain exactly two tabs
- certainty – the certainty string at end of tuple, default ‘uncertain’
Returns: quadruple of (entity, subclass, agree, certainty)
modules.xrenner_marker.recognize_entity_by_mod(mark, lex, mark_atoms=False)
Attempt to recognize entity type based on modifiers
Returns: String (entity type, possibly including subtype and agreement)
modules.xrenner_marker.remove_infix_tokens(marktext, lex)
Remove infix tokens such as dashes, interfixed articles (in Semitic construct state) etc.
Parameters:
- marktext – the markable text string to remove tokens from
- lex – the LexData object with gazetteer information and model settings
Returns: potentially truncated text
modules.xrenner_marker.remove_prefix_tokens(marktext, lex)
Remove leading tokens such as articles and other tokens configured as potentially redundant to citation form
Parameters:
- marktext – the markable text string to remove tokens from
- lex – the LexData object with gazetteer information and model settings
Returns: potentially truncated text
modules.xrenner_marker.remove_suffix_tokens(marktext, lex)
Remove trailing tokens such as genitive ‘s and other tokens configured as potentially redundant to citation form
Parameters:
- marktext – the markable text string to remove tokens from
- lex – the LexData object with gazetteer information and model settings
Returns: potentially truncated text
modules.xrenner_marker.resolve_cardinality(mark, lex)
Find cardinality for a Markable based on numerical modifiers or number words
Parameters:
- mark – the Markable object
- lex – the LexData object with gazetteer information and model settings
Returns: Cardinality as float, zero if unknown
modules.xrenner_marker.resolve_entity_cascade(entity_text, mark, lex)
Retrieve possible entity types for a given text fragment based on entities list, entity heads and names list.
Parameters:
- entity_text – the text fragment to look up
- mark – the Markable object being processed
- lex – the LexData object with gazetteer information and model settings
Returns: entity type; note that this is used to decide whether to stop the search, but the Markable’s entity is already set during processing together with matching subclass and agree information
modules.xrenner_marker.resolve_mark_agree(mark, lex)
Resolve Markable agreement based on morph information in tokens or gazetteer data
Parameters:
- mark – the Markable object
- lex – the LexData object with gazetteer information and model settings
Returns: void
xrenner_postprocess
Postprocessing module. Alters results of coreference analysis based on model settings, such as deleting certain markables or re-wiring coreference relations according to a particular annotation scheme
Author: Amir Zeldes and Shuo Zhang
modules.xrenner_postprocess.kill_zero_marks(markables, markstart_dict, markend_dict)
Removes markables whose id has been set to 0 in postprocessing
Parameters:
- markables – All Markable objects
- markstart_dict – Dictionary of token span start ids to lists of markables starting at that id
- markend_dict – Dictionary of token span end ids to lists of markables ending at that id
Returns: void
xrenner_preprocess
modules/xrenner_preprocess.py
Prepare parser output for entity and coreference resolution
Author: Amir Zeldes
modules.xrenner_preprocess.add_child_info(conll_tokens, child_funcs, child_strings, lex)
Adds a list of all dependent functions and token strings to each parent token
Parameters:
- conll_tokens – The ParsedToken list so far
- child_funcs – Dictionary from ids to child functions
- child_strings – Dictionary from ids to child strings
- lex – the LexData object with gazetteer information and model settings
Returns: void
modules.xrenner_preprocess.add_negated_parents(conll_tokens, offset)
Sets the neg_parent property on tokens whose head dominates a negation
Parameters:
- conll_tokens – token list for this document
- offset – token ID reached in last sentence
Returns: None
modules.xrenner_preprocess.replace_conj_func(conll_tokens, tokoffset, lex)
Function to replace functions of tokens matching the conjunction function with their parent’s function
Parameters:
- conll_tokens – The ParsedToken list so far
- tokoffset – The starting token for this sentence
- lex – the LexData object with gazetteer information and model settings
Returns: void
xrenner_propagate
modules/xrenner_propagate.py
Feature propagation module. Propagates entity and agreement features for coreferring markables.
Author: Amir Zeldes
modules.xrenner_propagate.propagate_agree(markable, candidate)
Propagate agreement between two markables if one has unknown agreement
Parameters:
- markable – Markable object
- candidate – Coreferent antecedent Markable object
Returns: void
modules.xrenner_propagate.propagate_entity(markable, candidate, direction='propagate')
Propagate class and agreement features between coreferent markables
Parameters:
- markable – a Markable object
- candidate – a coreferent antecedent Markable object
- direction – propagation direction; by default, data can be propagated in either direction from the more certain markable to the less certain one, but direction can be forced, e.g. ‘propagate_forward’
Returns: void