Welcome to CQPweb!
Legend for symbols
- Open access corpus, guest access allowed
- Restricted access corpus, see here for obtaining a login
- Licensed for GU students and staff (you will still need a login)
- Limited access license, please inquire on a case-by-case basis
- corpus language
- number of tokens in corpus
- number of documents in corpus
For all questions and details about obtaining a login to restricted corpora, see this page.
For richly annotated multilayer corpora/treebanks, see our ANNIS interface.
This page is maintained by the Corpus Linguistics lab, Corpling@GU.
English Reference Corpora
ACL Anthology (1983-2022) (aclanthology)
eng / 294,834,441 / 64,755
A Free, Balanced, Multilayer English Web Corpus (amalgum)
eng / 3,852,110 / 4,723
British National Corpus (bnc)
eng-uk / 111,246,947 / 4,054
The Brown Corpus (brown)
eng-us / 1,172,053 / 500
COCA - Corpus of Contemporary American English - Academic (cocaacademic)
eng-us / 122,445,373 / 26,203
COCA - Corpus of Contemporary American English - Blog (cocablog)
eng-us / 108,622,152 / 82,021
COCA - Corpus of Contemporary American English - Fiction (cocafiction)
eng-us / 126,484,317 / 25,993
COCA - Corpus of Contemporary American English - Magazine (cocamagazine)
eng-us / 129,459,423 / 86,225
COCA - Corpus of Contemporary American English - News (cocanews)
eng-us / 126,231,664 / 90,243
COCA - Corpus of Contemporary American English - Spoken (cocaspoken)
eng-us / 130,081,788 / 44,803
COCA - Corpus of Contemporary American English - TV and Movies (cocatvmovies)
eng-us / 153 / 23,975
COCA - Corpus of Contemporary American English - Web (cocaweb)
eng-us / 139 / 88,989
Corpus of Regional African American Language (CORAAL) (coraal)
eng-aae / 2,112,226 / 271
Frown Corpus J (frown_j)
eng-uk / 185,276 / 80
Georgetown University Multilayer Corpus v10 (gum10)
eng / 228,284 / 235
Penn Treebank CQPfied (gold tagged) (ptbcqp_gold)
eng-us / 2,467,944 / 3,153
Spoken Language Corpora
CIEMPIESS: A New Open-Sourced Mexican Spanish Radio Corpus (ciempiess)
spa / 349,321 / 174
HCRC Map Task Corpus (hcrcmap_2)
eng-sct / 188,898 / 128
The Michigan Corpus of Academic Spoken English (MICASE) (micase)
eng-us / 2,082,259 / 152
Switchboard Corpus (switchboard)
eng-us / 1,159,308 / 649
TED talks English (ted)
eng / 5,159,586 / 2,085
Non-English Reference Corpora
Lancaster Corpus of Mandarin Chinese (lcmc)
zho / 1,002,340 / 500
Literary Corpora
The Complete Works of Jane Austen (austen_complete)
eng-uk / 1,012,185 / 9
Charles Dickens Corpus (dickens)
eng-uk / 3,407,087 / 14
Don Quijote (don_quijote_spa)
spa / 429,855 / 1
Quotables: corpus of quotes by famous people (quotes)
eng / 1,081,762 / 39,269
Tom Sawyer (tom_sawyer_eng)
eng-us / 86,747 / 1
Political Corpora
Bush and Kerry Presidential Debate (bush_kerry_debate)
eng-us / 48,230 / 2
Inaugural Address Corpus (inaugural)
eng-us / 144,980 / 56
The Mueller Report Corpus (mueller)
eng-us / 228,799 / 2
German Bundestag Protocols (parlament)
deu / 36,723,139 / 837
State of the Union Corpus (1790-2021) (sotu2021)
eng-us / 1,962,452 / 235
State of the Union (1790-2023) (sotu2023)
eng-us / 2,004,687 / 237
Web Corpora
DECOW - COrpora from the Web - German - Part 01 (decow01)
deu / 300,002,861 / 198,608
DECOW - COrpora from the Web - German - Part 02 (decow02)
deu / 300,003,990 / 231,532
DECOW - COrpora from the Web - German - Part 03 (decow03)
deu / 300,008,102 / 325,463
DE Web as Corpus - Part 1 (dewac01)
deu / 268,848,124 / 289,824
DE Web as Corpus - Part 2 (dewac02)
deu / 268,848,124 / 288,223
DE Web as Corpus - Part 3 (dewac03)
deu / 268,884,554 / 290,941
DE Web as Corpus - Part 4 (dewac04)
deu / 268,931,207 / 289,139
DE Web as Corpus - Part 5 (dewac05)
deu / 268,908,956 / 288,386
DE Web as Corpus - Part 6 (dewac06)
deu / 282,733,943 / 305,382
ENCOW2016 - COrpora from the Web - English - Part 01 (encow01)
eng / 300,004,068 / 225,073
ENCOW2016 - COrpora from the Web - English - Part 02 (encow02)
eng / 300,000,358 / 222,638
ENCOW2016 - COrpora from the Web - English - Part 03 (encow03)
eng / 214,108,271 / 163,651
ENCOW2016 - COrpora from the Web - English - Part 04 (encow04)
eng / 289,139 / 289,139
ENCOW2016 - COrpora from the Web - English - Part 05 (encow05)
eng / 288,386 / 288,386
ENCOW2016 - COrpora from the Web - English - Part 06 (encow06)
eng / 161,936 / 161,936
Corpus of Web-Based Global English - US Blogs (glowbe_usblog)
eng-us / 142,425,833 / 106,385
Corpus of Web-Based Global English - US General Web (glowbe_usgenl)
eng-us / 272,905,012 / 168,771
Russian Internet Corpus Sampler (i_ru_sample)
rus / 5,231,112 / 843
Stanford Sentiment Analyzed Twitter Corpus (sentiment140)
eng / 24,473,485 / 2
UK Web as Corpus - Part 1 (ukwac01)
eng-uk / 277,566,848 / 330,390
UK Web as Corpus - Part 2 (ukwac02)
eng-uk / 277,590,843 / 331,233
UK Web as Corpus - Part 3 (ukwac03)
eng-uk / 277,580,108 / 332,942
UK Web as Corpus - Part 4 (ukwac04)
eng-uk / 277,586,138 / 332,744
UK Web as Corpus - Part 5 (ukwac05)
eng-uk / 277,569,079 / 333,030
UK Web as Corpus - Part 6 (ukwac06)
eng-uk / 277,609,694 / 331,553
UK Web as Corpus - Part 7 (ukwac07)
eng-uk / 277,587,296 / 330,634
UK Web as Corpus - Part 8 (ukwac08)
eng-uk / 308,472,497 / 370,107
Newspaper Corpora
Arabic Treebank CQPfied (arabictb)
ara / 168,722 / 734
Chinese Treebank 9.0 (chinese_treebank9)
zho / 2,080,333 / 3,726
New York Times - Arts Subcorpus (nyt_arts)
eng-us / 101,087,365 / 118,433
Slate Magazine Corpus (slate_alt)
eng-us / 4,929,752 / 4,531
Learner Corpora and Native Controls
Arabic Learner Corpus (arablearn)
ara / 444,321 / 1,585
Hong Kong City University Corpus of English Learner Academic Drafts (cityu)
eng-L2 / 7,720,912 / 11,170
English Language Questions and Answers (elqa)
eng / 36,656,346 / 71,052
The Gachon Korean EFL Learner Corpus (gachon)
eng-L2 / 1,824,373 / 16,111
International Corpus of Learner English (icle)
eng-L2 / 2,808,577 / 3,701
International Corpus Network of Asian Learners of English (icnale)
eng-L2 / 1,963,147 / 9,836
Louvain Corpus of Native English Student Essays (LOCNESS)
eng-uk/us / 346,906 / 388
The Michigan Corpus of Upper-Level Student Papers (micusp)
eng-us / 3,063,640 / 829
Spanish Learner Language Oral Corpora (SPLLOC) (splloc)
spa-L2 / 372,567 / 561
Bible Corpora
The King James Bible Corpus (biblekjv)
eng-uk / 915,179 / 66
The Luther Bible Corpus (bibleluther)
deu / 813,333 / 66
The Dutch Statenvertaling Bible (biblestv)
nld / 920,759 / 66
The World English Bible Corpus (bibleweb)
eng-us / 901,701 / 66
World English Corpora
Corpus of Web-Based Global English - Hong Kong (glowbe_hk)
eng-hk / 42,979,217 / 43,936
ICE Jamaica (ice_ja)
eng-ja / 1,156,149 / 500
ICE Singapore (ice_sg)
eng-sg / 1,163,008 / 500
National University of Singapore SMS Corpus (nus_sms)
eng-sg / 150,397 / 10,117
Historical Corpora
Ancient Chinese Corpus - Zuozhuan (acc)
zho-lzh / 194,258 / 2
Corpus of Historical American English (coha)
eng-us / 448,200,483 / 116,773
Georgetown University Historical Reddit Corpus 2020 (guhrc) (guhrc2020)
eng / 43,858,955 / 557,579
Penn Parsed Corpus of Middle English v2 (ppcme2)
eng-enm / 1,351,054 / 56
Sheffield Corpus of Chinese - Historical Chinese Sample (sheffieldchinese)
zho / 14,282 / 3
Coptic Scriptorium Corpora
Besa Letters Corpus (besa)
cop / 1,907 / 2
Chatino Zapotec Corpus (chazap)
zap / 721,976 / 290,292
Gradable Modal Expressions (for CQP, build 76) (GMEv01b76)
eng-us / 301,243 / 533
Gradable Modal Expressions (for CQP, build 78) (GMEv01b78)
eng-us / 301,243 / 533
Gradable Modal Expression (Version 1.0, Build 79) (GMEv10b79)
eng-us / 301,090 / 534
Rohingya News Corpus (rohingya)
eng / 168,372 / 412
Indexing a total of 7,962,997,456 tokens in 92 corpora.
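Each corpus entry above is followed by a statistics line of the form "language / tokens / documents". As a minimal sketch, assuming you have copied such lines from this page as plain text, they could be parsed and totalled as shown below. The names `CorpusStats` and `parse_stats` are hypothetical helpers written for this illustration and are not part of CQPweb.

```python
from typing import NamedTuple


class CorpusStats(NamedTuple):
    """One statistics line from the listing (hypothetical helper type)."""
    language: str
    tokens: int
    documents: int


def parse_stats(line: str) -> CorpusStats:
    """Parse a line such as 'eng-us / 1,172,053 / 500'.

    Splitting from the right on ' / ' keeps language codes that contain
    a bare slash (for example 'eng-uk/us') in one piece.
    """
    language, tokens, documents = line.strip().rsplit(" / ", 2)
    return CorpusStats(
        language=language.strip(),
        tokens=int(tokens.replace(",", "")),
        documents=int(documents.replace(",", "")),
    )


# Two statistics lines copied from the listing (brown and frown_j).
sample = [
    "eng-us / 1,172,053 / 500",
    "eng-uk / 185,276 / 80",
]
stats = [parse_stats(line) for line in sample]
total_tokens = sum(s.tokens for s in stats)
print(f"{total_tokens:,} tokens in {len(stats)} corpora")
```

Summing the parsed token counts over all entries is one way to cross-check the total reported on this page; entries whose counts look truncated would need to be verified against the corpus itself first.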