Download software and data
Software & Demos
- GumDrop - a multilingual elementary discourse unit (EDU) segmentation and connective detection system (this system ranked in the top two places at the DISRPT2019 shared task). [download]
- xrenner - an externally configurable, language-independent coreferencer and entity recognizer [demo] [download]
- rstWeb - an open source, browser-based tool for collaborative online annotation in Rhetorical Structure Theory [demo] [download] [info]
- DepEdit - a configurable script for automatic editing of dependency trees [download & documentation]
- Excel annotation add-ins - a suite of freely available Excel Add-Ins for corpus annotation, including import and export scripts for EXMARaLDA, PAULA, TreeTagger/CWB, CoNLL10 and more. [downloads & documentation]
- Lexiconless IPA Transcription and Syllable Analysis - approximate phonetic transcription and tree-based syllable analysis for German and Polish words based purely on orthographic rules (no lexicon is used). [demo] [download for German] [download for Polish]
- Sound Change Transducers - a web page with transducers modeling Indo-European sound change laws. [demo] [sound correspondence table]
Corpora
Download
- GUM - The Georgetown University Multilayer corpus
- AMALGUM - A Machine Annotated Lookalike of GUM
- IAHLT HTB - A Revised Version of UD Hebrew
- Coptic SCRIPTORIUM - tagged Sahidic Coptic corpora
Search
- CQPWeb interface for large/flat annotated corpora
- ANNIS interface for richly annotated multilayer corpora
Annotate
- Arborator interface for dependency syntax trees
- GitDox interface for version controlled XML, spreadsheet and entity annotation
- Webanno interface for entity, information structure and coreference annotation
- SynAnnotri interface for constituent syntax trees
- rstWeb interface for Rhetorical Structure Theory annotation
Documentation
- corpling@GU Wiki
Part of Speech Tagging Models
Trained models for use with the freely available TreeTagger. Offered free under the Apache 2.0 software license with no warranty of any kind :)
- Hausa (Boko)
- Coptic (Sahidic) (see Coptic SCRIPTORIUM for more information)
Datasets
Some datasets from my publications and teaching. Instructions on how to cite the use of the data are given for each dataset.
Falko Noun Compounding Data
This dataset includes all nouns (POS tag "NN") in the Falko essay corpora FalkoEssayL2v2.2 (advanced German learners) and FalkoEssayL1v2.2 (comparable native speaker data). The data gives for each noun a classification into 'compound' or 'simplex', as well as lemma, head and modifier (for compounds) and the first native language of the writer.
The corpus itself is described in Reznicek et al. (2010) and the extraction and analysis of the nouns are described in Zeldes (2018). Please cite both references when making use of the data.
Datasets:
- compound_noun_falkoL1_v2.2.tab
- compound_noun_falkoL2_v2.2.tab
- Or all in one file: compound_noun_falko_all_v2.2.tab
References:
- Reznicek, Marc, Maik Walter, Karin Schmid, Anke Lüdeling, Hagen Hirschmann, Cedric Krummes & Thorsten Andreas (2010). Das Falko-Handbuch. Korpusaufbau und Annotationen. Version 1.0.1. Technical report, Humboldt-Universität zu Berlin.
- Zeldes, Amir (2018) "Compounds and Productivity in Advanced L2 German Writing: A Constructional Approach". In: Ortega, Lourdes, Tyler, Andrea, Uno, Mariko and Park, Hae In (eds.), Usage-inspired L2 Instruction: Researched Pedagogy. Amsterdam: Benjamins, 237-265.
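For readers who want to work with the Falko noun tables described above, the following is a minimal sketch of how a tab-separated file of this kind might be summarized by the writer's first language. The column names (`noun`, `lemma`, `type`, `head`, `modifier`, `l1`) and the three sample rows are assumptions invented for illustration; the actual .tab files may use different headers, column order and label spellings, so check the files before adapting it.

```python
import csv
import io
from collections import Counter

# Hypothetical header and invented rows, for illustration only.
# The real .tab files may be laid out differently.
SAMPLE = (
    "noun\tlemma\ttype\thead\tmodifier\tl1\n"
    "Studentenleben\tStudentenleben\tcompound\tLeben\tStudent\ten\n"
    "Haus\tHaus\tsimplex\t\t\ten\n"
    "Arbeitsmarkt\tArbeitsmarkt\tcompound\tMarkt\tArbeit\tda\n"
)


def compound_share_by_l1(handle):
    """Return {first language: share of nouns classified as 'compound'}."""
    totals = Counter()
    compounds = Counter()
    # QUOTE_NONE keeps stray quote characters in the text from being
    # interpreted as CSV quoting.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        totals[row["l1"]] += 1
        if row["type"] == "compound":
            compounds[row["l1"]] += 1
    return {l1: compounds[l1] / totals[l1] for l1 in totals}


shares = compound_share_by_l1(io.StringIO(SAMPLE))
print(shares)
```

To run this on a downloaded file, replace the `io.StringIO(SAMPLE)` argument with an opened file handle, for example `open("compound_noun_falko_all_v2.2.tab", encoding="utf-8", newline="")`, and adjust the column names to match the file's header. The UTF-8 encoding is itself an assumption.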
German Synthetic Compounds
This data includes German compounds of the form NOUN-VERBer, along with the automatically extracted noun and verb lemmas, the frequency of each compound, and the frequency with which the corresponding verb-object combination is attested in verb-final clauses. The data was extracted from the DEWAC Web corpus (Baroni et al. 2009), which is automatically POS tagged, so the raw results may (and do!) contain errors. The compounds were analyzed in Gaeta & Zeldes (2017); please cite both papers if using these datasets.
Datasets:
- Frequencies for VP vs. NOUN-VERB-er for compounds attested 5+ times
- Frequencies for OV candidates (superset of identified compound stems)
- Type counts of distinct objects vs. distinct compound modifiers for each verb stem
- Hapax compound candidates
- Compounds with known verbs not attested as VPs
References:
- Gaeta, Livio & Amir Zeldes (2017) "Between VP and NN: On the Constructional Types of German -er Compounds". Constructions and Frames 9(1), 1-40.
- Baroni, Marco, Silvia Bernardini, Adriano Ferraresi & Eros Zanchetta (2009). "The WaCky Wide Web: A Collection of Very Large Linguistically Processed Web-Crawled Corpora". Language Resources and Evaluation 43(3), 209–226.
PCC11 Information Status and Topicality Data
This dataset includes information structural annotations based on the guidelines in Dipper et al. (2007) for all discourse referents from pcc11, a sample of the Potsdam Commentary Corpus (Stede 2004). The data was extracted using ANNIS (Zeldes et al. 2009).
The data contains the string representation of each referent (with underscores in place of spaces), information status using the values "giv" (given), "new", "acc" (accessible) and "idiom" (for non-referential idiomatic phrases), and topicality with the values "ab" (aboutness topic), "fs" (framesetter) and "nt" (non-topic). A further column gives a more fine-grained information status tagset according to Dipper et al. (2007), adding subtypes for given (active and inactive) and accessible (inferable, general, situational and aggregate).
Dataset:
References:
- Dipper, Stefanie, Michael Götze & Stavros Skopeteas (eds.) (2007), "Information Structure in Cross-Linguistic Corpora: Annotation Guidelines for Phonology, Morphology, Syntax, Semantics, and Information Structure". Interdisciplinary Studies on Information Structure 7.
- Stede, Manfred (2004), "The Potsdam Commentary Corpus". In: Bonnie Webber & Donna K. Byron (eds.), Proceedings of the ACL-04 Workshop on Discourse Annotation. Barcelona, Spain, 96–102.
- Zeldes, Amir, Ritz, Julia, Lüdeling, Anke & Chiarcos, Christian (2009), "ANNIS: A Search Tool for Multi-Layer Annotated Corpora". In: Proceedings of Corpus Linguistics 2009, July 20-23, Liverpool, UK.
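The tag abbreviations in the PCC11 data can be expanded programmatically. The sketch below maps the information status and topicality values listed in the description to their full labels and restores spaces in the referent string. The three-field row layout, the function name `decode_referent` and the sample referent are assumptions made for illustration, and the fine-grained tagset column is not handled because its exact tag spellings are not given here.

```python
# Tag values as listed in the dataset description.
INFO_STATUS = {
    "giv": "given",
    "new": "new",
    "acc": "accessible",
    "idiom": "non-referential idiomatic phrase",
}
TOPICALITY = {
    "ab": "aboutness topic",
    "fs": "framesetter",
    "nt": "non-topic",
}


def decode_referent(text, infstat, topic):
    """Expand one referent row (hypothetical layout: text, infstat, topic)."""
    if infstat not in INFO_STATUS:
        raise ValueError(f"unknown information status: {infstat!r}")
    if topic not in TOPICALITY:
        raise ValueError(f"unknown topicality value: {topic!r}")
    return {
        # Assumes underscores stand in for spaces in the referent string.
        "referent": text.replace("_", " "),
        "information_status": INFO_STATUS[infstat],
        "topicality": TOPICALITY[topic],
    }


# Invented example row, not taken from the corpus.
decoded = decode_referent("die_Stadt", "giv", "ab")
print(decoded)
```

Raising an error on unknown tags, rather than passing them through, is a design choice that makes it easier to spot rows whose layout differs from the one assumed here.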
Sahidic Coptic Corpora
Several richly annotated corpora of Sahidic Coptic, created in collaboration with Prof. Caroline T. Schroeder (University of the Pacific), are now available under a CC-BY license. See details here: