## Overview
GUM's big brother
AMALGUM is a machine-annotated multilayer corpus following the same design and annotation layers as GUM, but substantially larger (around 4M tokens). The goal of this corpus is to close the gap between high-quality, richly annotated but small datasets and the larger but shallowly annotated corpora that are often scraped from the Web. In particular, we aim to make data available to support:
- Pretraining on large-scale, silver-quality data before fine-tuning on smaller gold-standard datasets (a minimal sketch follows this list)
- Active learning to supplement training data and iteratively improve AMALGUM's own data
- Better-than-out-of-the-box NLP quality, using every trick available, both as a tool for and a target of NLP research
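For the first point, the sketch below illustrates the silver-then-gold idea in its simplest form. It is not part of AMALGUM's own tooling: the file paths are placeholders, and an NLTK backoff tagger merely stands in for a real pretraining-plus-fine-tuning setup, with the gold-trained model taking precedence and the silver-trained model covering anything it has not seen.

```python
# Minimal sketch (not the AMALGUM pipeline): train on silver data first, then
# let a gold-trained tagger take precedence, backing off to the silver model.
# File paths below are placeholders.
import nltk

def read_tagged_sents(path):
    """Read (FORM, UPOS) sequences from a CoNLL-U file (columns 2 and 4)."""
    sents, sent = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if line.startswith("#"):              # sentence-level comments
                continue
            if not line:                          # blank line ends a sentence
                if sent:
                    sents.append(sent)
                    sent = []
                continue
            cols = line.split("\t")
            if "-" in cols[0] or "." in cols[0]:  # skip multiword/empty tokens
                continue
            sent.append((cols[1], cols[3]))
    if sent:
        sents.append(sent)
    return sents

silver_sents = read_tagged_sents("path/to/amalgum_silver.conllu")  # large, machine-annotated
gold_sents = read_tagged_sents("path/to/gum_gold.conllu")          # small, hand-validated

silver_tagger = nltk.UnigramTagger(silver_sents)                   # "pretraining" stage
tagger = nltk.UnigramTagger(gold_sents, backoff=silver_tagger)     # "fine-tuning" stage
print(tagger.tag(["The", "corpus", "is", "large"]))
```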
## Composition
AMALGUM follows the same corpus design as GUM and currently contains the same text types as the GUM version 6 series, drawing on some different sources to allow for the larger scale:
Text type | Source | Docs | Tokens
---|---|---|---
Interviews | Wikinews | 686 | 500,600
News stories | Wikinews | 778 | 500,090
Travel guides | Wikivoyage | 482 | 500,680
How-to guides | wikiHow | 613 | 500,014
Academic writing | MDPI | 662 | 500,285
Biographies | Wikipedia | 600 | 500,760
Fiction | Project Gutenberg | 457 | 500,088
Forum discussions | Reddit | 682 | 500,412
Total | | 4,960 | 4,002,929