Have you ever wondered how people concede a counterpoint that is contrary to what they want to say? How does it work in speech versus in writing? How about in Spanish or Chinese? Our recently released DiscoExplorer project can help you to find out!
What does DiscoExplorer do?
At its core, DiscoExplorer is a search and visualization interface for data about discourse relations, which are types of connections between two pieces of dialog or text. For example, one part can express the cause of another part, or form a concession to it. Here are some examples of concessions in English speech and text, contrasted with some Chinese examples. Notice how written language differs from speech in many ways, and how we only get explicit markers like "although" or "虽然" in some cases (marked in blue), while other cases remain implicit.
- English:
  - [Although many theoretical and empirical models have been developed,]concession [some problems are still unsolved.] (fiction, narrative present, marker in conceded span)
  - [She wails,]concession [but lets me disentangle her hands] (fiction, narrative present, marker in main point)
  - [I told him a small.]concession [It's huge!] (YouTube vlog, no explicit marker - contrast marked by antonyms, "small" : "huge")
- Chinese:
  - [虽然他并没有公开其大纲,]concession [但是他在一封信中却对之作了解释。] (biography, notice the double marker strategy: 虽然...但是 'although ... but ...')
  - [新鲜的罗勒和干罗勒的味道很不同,]concession [当然,新鲜的罗勒比干罗勒更加可口。] (how-to guide on growing basil, no explicit concessive marker)
To find examples like these and more, check out the interface!
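To make the explicit versus implicit distinction concrete, here is a minimal Python sketch that checks whether a sentence contains one of the concessive markers seen in the examples above. The `MARKERS` lexicon and the `find_markers` helper are hypothetical names invented for this illustration, and the lexicon holds only the four markers from the examples. None of this is part of DiscoExplorer itself.

```python
import re

# Hypothetical mini-lexicon: only the markers seen in the examples above.
# A realistic marker inventory would be far larger.
MARKERS = {
    "eng": ["although", "but"],
    "zho": ["虽然", "但是"],
}


def find_markers(text: str, lang: str) -> list[str]:
    """Return the markers from MARKERS[lang] that occur in text."""
    if lang == "eng":
        # Match whole words so that e.g. "button" does not count as "but".
        words = set(re.findall(r"[a-z']+", text.lower()))
        return [m for m in MARKERS[lang] if m in words]
    # Chinese is written without spaces, so use plain substring search.
    return [m for m in MARKERS[lang] if m in text]


examples = [
    ("eng", "Although many theoretical and empirical models have been "
            "developed, some problems are still unsolved."),
    ("eng", "I told him a small. It's huge!"),
    ("zho", "虽然他并没有公开其大纲,但是他在一封信中却对之作了解释。"),
    ("zho", "新鲜的罗勒和干罗勒的味道很不同,当然,新鲜的罗勒比干罗勒更加可口。"),
]

for lang, text in examples:
    found = find_markers(text, lang)
    print(lang, "explicit" if found else "implicit", found)
```

Simple string matching like this overgenerates: "but" does not always signal a concession. This is one reason annotated corpora are more reliable than keyword search.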
What kinds of data are available?
DiscoExplorer indexes all 38 discourse relation datasets in 16 languages from the DISRPT shared task, an international competition on discourse structure and relation recognition. The task maps a variety of theoretical relation label inventories to a common set of 17 universal labels, allowing comparisons across languages and theoretical frameworks.
| Corpus | Language | Framework | Labels | Relations | Sentences | Tokens | Documents | Signals |
|---|---|---|---|---|---|---|---|---|
| ces.rst.crdt | Czech | RST | 17 | 1,249 | 835 | 14,664 | 54 | -- |
| deu.pdtb.pcc | German | PDTB | 11 | 2,109 | 2,193 | 33,222 | 176 | types |
| deu.rst.pcc | German | RST | 16 | 2,882 | 1,944 | 32,836 | 176 | -- |
| eng.dep.covdtb | English | dependencies | 11 | 4,985 | 2,343 | 60,907 | 300 | -- |
| eng.dep.scidtb | English | dependencies | 14 | 9,903 | 4,202 | 102,534 | 798 | -- |
| eng.erst.gentle | English | eRST | 17 | 2,552 | 1,334 | 17,979 | 26 | subtypes |
| eng.erst.gum | English | eRST | 17 | 30,747 | 14,158 | 254,890 | 255 | subtypes |
| eng.pdtb.gentle | English | PDTB | 12 | 786 | 1,334 | 17,979 | 26 | types |
| eng.pdtb.gum | English | PDTB | 13 | 13,879 | 14,158 | 254,890 | 255 | types |
| *eng.pdtb.pdtb | English | PDTB | 13 | 47,792 | 48,630 | 1,173,379 | 2,162 | types |
| eng.pdtb.tedm | English | PDTB | 13 | 529 | 381 | 8,185 | 6 | types |
| eng.rst.oll | English | RST | 17 | 2,751 | 2,156 | 46,471 | 327 | -- |
| *eng.rst.rstdt | English | RST | 17 | 19,778 | 8,318 | 208,912 | 385 | -- |
| eng.rst.sts | English | RST | 17 | 3,058 | 2,591 | 71,206 | 150 | -- |
| eng.rst.umuc | English | RST | 15 | 4,997 | 2,424 | 61,590 | 87 | -- |
| eng.sdrt.msdc | English | SDRT | 10 | 27,848 | 14,744 | 231,352 | 440 | -- |
| eng.sdrt.stac | English | SDRT | 11 | 12,271 | 7,394 | 52,271 | 1,101 | -- |
| eus.rst.ert | Basque | RST | 16 | 3,632 | 2,380 | 45,780 | 164 | -- |
| fas.rst.prstc | Farsi | RST | 14 | 5,191 | 2,179 | 66,926 | 150 | -- |
| fra.sdrt.annodis | French | SDRT | 12 | 3,321 | 1,507 | 32,699 | 86 | -- |
| ita.pdtb.luna | Italian | PDTB | 11 | 1,525 | 3,750 | 25,242 | 60 | types |
| nld.rst.nldt | Dutch | RST | 16 | 2,264 | 1,651 | 24,898 | 80 | -- |
| pcm.pdtb.disconaija | Naija | PDTB | 13 | 9,903 | 9,242 | 140,729 | 176 | types |
| pol.iso.pdc | Polish | ISO | 12 | 8,543 | 9,142 | 156,980 | 556 | types |
| por.pdtb.crpc | Portuguese | PDTB | 12 | 11,327 | 5,194 | 186,849 | 302 | types |
| por.pdtb.tedm | Portuguese | PDTB | 13 | 554 | 394 | 8,190 | 6 | types |
| por.rst.cstn | Portuguese | RST | 15 | 4,993 | 2,221 | 63,332 | 140 | -- |
| rus.rst.rrt | Russian | RST | 15 | 25,095 | 13,131 | 262,495 | 234 | -- |
| spa.rst.rststb | Spanish | RST | 16 | 3,049 | 2,089 | 58,717 | 267 | -- |
| spa.rst.sctb | Spanish | RST | 16 | 692 | 516 | 16,515 | 50 | -- |
| tha.pdtb.tdtb | Thai | PDTB | 12 | 10,861 | 6,534 | 256,523 | 180 | -- |
| *tur.pdtb.tdb | Turkish | PDTB | 13 | 3,176 | 31,197 | 496,358 | 197 | -- |
| tur.pdtb.tedm | Turkish | PDTB | 13 | 574 | 410 | 6,286 | 6 | types |
| zho.dep.scidtb | Mandarin | dependencies | 14 | 1,297 | 500 | 18,761 | 109 | -- |
| zho.pdtb.cdtb | Mandarin | PDTB | 9 | 5,270 | 2,891 | 73,314 | 164 | -- |
| zho.pdtb.ted | Mandarin | PDTB | 15 | 13,308 | 8,671 | 181,910 | 72 | types |
| zho.rst.gcdt | Mandarin | RST | 17 | 8,413 | 2,692 | 62,905 | 50 | -- |
| zho.rst.sctb | Mandarin | RST | 17 | 692 | 580 | 15,496 | 50 | -- |
| Total | 16 | 6 | 17 | 311,796 | 257,705 | 5,139,564 | 9,890 | 14 datasets |
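The corpus identifiers in the table appear to follow a `language.framework.corpus` pattern, with a three-letter language code first. Assuming that reading is right, a short Python sketch can split the identifiers and group datasets by framework. The helper name `parse_corpus_id` is made up for this example, and the leading `*` that marks some rows is simply stripped.

```python
from collections import Counter

# A handful of corpus identifiers copied from the table above.
corpus_ids = [
    "ces.rst.crdt",
    "deu.pdtb.pcc",
    "deu.rst.pcc",
    "eng.erst.gum",
    "*eng.pdtb.pdtb",
    "eng.sdrt.stac",
    "zho.dep.scidtb",
]


def parse_corpus_id(corpus_id: str) -> dict[str, str]:
    """Split 'lang.framework.name' into its parts (assumed naming pattern)."""
    lang, framework, name = corpus_id.lstrip("*").split(".")
    return {"lang": lang, "framework": framework, "name": name}


parsed = [parse_corpus_id(c) for c in corpus_ids]

# Count how many of the sampled datasets use each framework.
by_framework = Counter(p["framework"] for p in parsed)
for framework, n in sorted(by_framework.items()):
    print(f"{framework}: {n}")
```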
Quantitative interface
You can also use the interface to run quantitative comparisons. Here is a plot comparing discourse relation label prevalence in English and Chinese news articles:
Run this comparison live at this link!
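For readers who want to see what such a comparison computes, here is a rough Python sketch of label prevalence, that is, the share of all relations that carry each label. The counts below are made-up toy data, not figures from any DISRPT corpus, and `prevalence` is a hypothetical helper rather than anything from DiscoExplorer.

```python
from collections import Counter


def prevalence(labels: list[str]) -> dict[str, float]:
    """Map each label to its share of all relation instances."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


# Made-up toy data for illustration only, NOT real corpus counts.
english = ["cause"] * 3 + ["concession"] * 1 + ["other"] * 6
chinese = ["cause"] * 2 + ["concession"] * 3 + ["other"] * 5

eng_prev = prevalence(english)
zho_prev = prevalence(chinese)

# Print the two distributions side by side, one row per label.
print(f"{'label':<12}{'English':>9}{'Chinese':>9}")
for label in sorted(set(eng_prev) | set(zho_prev)):
    e = eng_prev.get(label, 0.0)
    z = zho_prev.get(label, 0.0)
    print(f"{label:<12}{e:>9.1%}{z:>9.1%}")
```

Normalizing by the total number of relations matters here, because the datasets differ greatly in size and raw counts would not be comparable.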
