Penn Parsed Corpora of Historical English (PPCHE) was developed at the
University of Pennsylvania and consists of running texts and text
samples of British English prose from the earliest Middle English
documents (1100 CE) up to the First World War (1914 CE).  PPCHE contains
three corpora covering traditionally recognized periods of English:

- Penn-Helsinki Parsed Corpus of Middle English, second edition (PPCME2)
- Penn-Helsinki Parsed Corpus of Early Modern English (PPCEME)
- Penn Parsed Corpus of Modern British English, second edition (PPCMBE2)

Each text comes in two forms: syntactically annotated (parsed) and
part-of-speech tagged.  The current release does not include unannotated
versions of the texts.  The annotations have been carefully reviewed
over many years by expert human annotators for accuracy and consistency.
(Please report remaining errors to beatrice DOT santorini AT gmail DOT
com.)  Each text also has an associated file with philological
information.  PPCHE was originally intended to aid research in the
history of English, especially the historical syntax of the language.
More recently, computational linguists have begun to exploit PPCHE's 
great range of stylistic and orthographic variation for research in 
domain adaptation.

The 2025 release is a corrected, revised, and slightly augmented version 
of the 2020 release.  The annotation guidelines have been streamlined
for consistency across time periods and with other historical corpora
that use the same guidelines.

Each of the three subcorpora has its own directory and should be cited
individually as follows:

Kroch, Anthony, and Ann Taylor.  2000-.  Penn-Helsinki Parsed Corpus of
Middle English, second edition (PPCME2), release 5.  LDC2025XXXX.  Web
download file.  Philadelphia, PA: Linguistic Data Consortium.

Kroch, Anthony, Beatrice Santorini, and Lauren Delfs.  2004-.
Penn-Helsinki Parsed Corpus of Early Modern English (PPCEME), release 4.
LDC2025XXXX.  Web download file.  Philadelphia, PA: Linguistic Data
Consortium.

Kroch, Anthony, Beatrice Santorini, and Ariel Diertani.  2016-.
Penn Parsed Corpus of Modern British English, second edition (PPCMBE2),
release 2.  LDC2025XXXX.  Web download file.  Philadelphia, PA:
Linguistic Data Consortium.

The directory for each subcorpus in turn contains two subdirectories:
data and docs.  The data directory contains the directories with the parsed
and POS-tagged files.  The docs directory contains a description of each
subcorpus and a philological_info_files directory with detailed
philological information for each text.
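
The layout described above can be inspected with a short script.  This
is a minimal sketch, not part of the release: the function name
describe_subcorpus is hypothetical, and the only names taken from this
description are data, docs, and philological_info_files.  The path of
the subcorpus directory itself is whatever the download provides.

```python
from pathlib import Path


def _names(directory, want_dirs):
    """Return sorted entry names in `directory`, or [] if it is absent."""
    if not directory.is_dir():
        return []
    return sorted(
        p.name for p in directory.iterdir()
        if (p.is_dir() if want_dirs else p.is_file())
    )


def describe_subcorpus(root):
    """Summarize the data/ and docs/ layout under one subcorpus directory.

    `root` is the path to a single subcorpus directory (an assumption:
    its actual name depends on the release).
    """
    root = Path(root)
    docs_dir = root / "docs"
    return {
        # Directories under data/ that hold the parsed and POS-tagged files
        "data_subdirs": _names(root / "data", want_dirs=True),
        # Per-text philological information files
        "philological_files": _names(
            docs_dir / "philological_info_files", want_dirs=False
        ),
    }
```

Calling describe_subcorpus on each of the three subcorpus directories
gives a quick check that a download is complete before any files are
opened.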

Finally, the release includes:

- the annotation guidelines, and
- the CorpusSearch 2 search program (which allows users to search the
  corpora for syntactic structures, word sequences and words), along
  with documentation

Authors:

Anthony Kroch, Beatrice Santorini, Ann Taylor, Ariel Diertani (Lauren Delfs)

Languages:

Middle English (1100-1500) (enm), 20.2%
Early Modern English (1500-1700) (eng), 31.3%
Modern British English (1700-1914) (eng), 48.3%

Expected use of corpus:

Linguistic research on historical English; domain adaptation for NLP

Collection procedure:

PPCHE is based in part on the Helsinki Corpus of English Texts.  PPCME2
(ca. 1.2M words) includes most of the Middle English texts from the
Helsinki Corpus and adds some not included in that corpus.  See the documentation
for PPCME2 for details.  PPCEME (over 1.7M words) includes all of the
Early Modern English texts from the Helsinki Corpus as well as
additional texts selected to give the same genre balance as the original
Helsinki Corpus; the additional texts are twice the size of the original
texts.  PPCMBE2 (ca. 2.8M words) covers a later time period than that
covered by the Helsinki Corpus, but the texts were selected to give the
same genre balance as the Early Modern English part.

Data:

All data are encoded in UTF-8.  The data files are presented as plain
text, and all philological information as HTML.  The parsed data are in
Penn Treebank format.
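
Because the parsed files are plain-text bracketed trees, they can be
read without special tooling.  The following is a minimal sketch of a
reader for bracketed (Penn Treebank style) trees.  It assumes balanced
parentheses and whitespace-separated tokens; the function names
parse_trees and leaves are hypothetical, and the sample tree in the
usage note is invented for illustration rather than taken from the
corpus.  File-specific conventions are covered in the annotation
guidelines and are not handled here.

```python
import re

# One token per parenthesis or per run of non-space, non-parenthesis characters
TOKEN_RE = re.compile(r"\(|\)|[^\s()]+")


def parse_trees(text):
    """Parse bracketed trees into nested lists of the form [label, child, ...].

    Assumes balanced parentheses; unbalanced input raises an error.
    """
    tokens = TOKEN_RE.findall(text)
    pos = 0

    def parse_node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError(f"expected '(' but found {tokens[pos]!r}")
        pos += 1
        node = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                node.append(parse_node())
            else:
                node.append(tokens[pos])
                pos += 1
        pos += 1  # consume the closing ")"
        return node

    trees = []
    while pos < len(tokens):
        trees.append(parse_node())
    return trees


def leaves(node):
    """Yield (tag, word) pairs from nodes of the form [tag, word]."""
    if len(node) == 2 and all(isinstance(x, str) for x in node):
        yield (node[0], node[1])
    else:
        for child in node:
            if isinstance(child, list):
                yield from leaves(child)
```

For example, with the invented input
"( (S (NP (PRO I)) (VBD saw) (NP (D the) (N man))) )", parse_trees
returns one tree, and list(leaves(tree)) gives the four pairs
(PRO, I), (VBD, saw), (D, the), (N, man).  When reading a real file,
open it with encoding="utf-8", as the data are UTF-8 encoded.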

