Corpora
General
information about corpora
|
|
|
|
My homepage only
contains links to corpora that are freely accessible on the
internet.
Information on other corpora can be found on the following sites: |
|
|
|
|
|
Link
collection |
|
David Lee's site
http://tiny.cc/corpora
contains what is presumably the most comprehensive collection of
corpus-related links, including links to software, frequency lists, etc. |
|
|
|
|
|
Gateway
to Corpus Linguistics |
|
Yvonne
Breyer's Gateway to
Corpus Linguistics on the Internet offers a
structured overview of the most
important corpus sites. Among other things, you will find a list
of spoken-language corpora as well as information on the accessibility of
the different text collections. |
|
|
|
|
|
Corpus
Survey |
|
Richard
Xiao's Corpus
Survey gives you a detailed account of the history of corpus
linguistics and the "classical" corpora - together with links to
the respective homepages. |
|
English corpora
|
|
BNC |
|
The
100-million-word, balanced British National
Corpus is one of the most important English language corpora. |
|
|
|
|
|
|
|
| You can even search
the BNC for free on Mark Davies' BNC
web interface. Try this site - it is great! |
|
|
|
|
For BNC-based word frequency lists,
see below. |
|
|
|
|
|
Google
n-gram viewer |
|
The Google
n-gram viewer permits you to discover trends in published
books in English and some
of its regional varieties between 1800 and 2000.
The frequency results for several search items can be displayed
simultaneously on a timeline. More information about this exciting corpus
can be found here. |
|
|
|
|
|
Collins
Wordbank |
|
Collins
WordbanksOnline offers you the
opportunity to search a tagged 56-million-word corpus of contemporary
spoken and written English. |
|
|
|
|
|
Time
Corpus |
|
Mark Davies' Time Corpus
contains 100 million words of written American English
from
Time magazine (since 1923). Like the other corpora on his corpus
platform http://corpus.byu.edu,
the Time Corpus can be searched for free. |
|
|
|
|
|
COCA |
|
Even better, Mark Davies' 400-million-word Corpus
of Contemporary American English is balanced and contains 20
million words for each year from 1990 to 2009.
It is updated once or twice a year and is therefore well suited to
looking at current, ongoing changes in American English. |
|
|
|
|
|
COHA |
|
The Corpus
of Historical American English, by the same author, contains 400
million words from the period between
1810 and 2009. |
|
|
|
|
|
MICASE |
|
As the name suggests, the Michigan
Corpus of Academic Spoken English contains transcriptions
of 152 spoken English texts from academic
settings (e.g. lectures or seminars). They can be searched by
different variables, such as the speakers' native language or a particular
academic discipline. |
|
|
|
|
|
MICUSP |
|
The Michigan
Corpus of Upper-level Student Papers comprises about 830
grade A papers produced at the University of Michigan, Ann
Arbor. This corresponds to roughly 2.6 million
words from a range of academic
disciplines. |
|
|
|
|
|
SCoSE |
|
The Saarbrücken
Corpus of Spoken English contains transcripts of everyday
English conversations. The transcribed jokes are particularly
worth reading. |
|
English word frequency lists
|
|
BNC |
|
|
|
|
|
| The most visually appealing presentation of word
frequencies in the BNC can be found at www.wordcount.org,
which displays the 86,800 most frequent words together with their rank and
their neighbours in the ranking.
There is also a querycount that
shows the most frequently searched words, and a collection of
particularly interesting word sequences,
such as ranks 4304-4307: microsoft acquire salary tremendous. |
|
|
|
|
|
|
COCA |
|
Get different frequency lists from the Corpus
of Contemporary American English (see above). |
|
|
|
|
|
|
|
The site www.lextutor.ca
offers a number of well-known frequency lists for download, such as |
|
GSL |
|
|
|
AWL |
|
|
|
|
|
|
|
Basic
English |
|
Another classic that can be found online is Charles K.
Ogden's Basic
English word list. |
|
|
|
|
|
|
|
German corpora
|
|
Cosmas
II |
|
The classic among the German-language corpora: Cosmas
II, by the Institut für deutsche Sprache
(IDS) Mannheim. Use it, for example, to check whether your
translations into German are idiomatic.
The corpus contains mainly newspaper texts, but it is possible to compile
your own subcorpora. |
|
|
|
|
|
DWDS-Kernkorpus |
|
With 100 million words, the balanced DWDS-Kernkorpus
by the Berlin-Brandenburgische Akademie der Wissenschaften (BBAW) is the German
equivalent of the British National Corpus - or at least as
close as you can get. Make sure to read the introduction
to the search syntax, since it is not entirely intuitive. |
|
|
|
|
|
Text
messages |
|
Take a look at an interesting but unfortunately poorly
documented German
corpus of 1,500 text messages by German pupils. |
|
The internet as a corpus
|
|
Advantages
and disadvantages |
|
Having worked with electronic corpora, you may want to
consider the internet as the largest existing collection of texts.
However, this has both advantages and disadvantages. Read more on this
issue in texts by |
|
|
|
|
|
|
|
|
|
|
|
|
|
Search
engines |
|
The most straightforward way to carry out a linguistic
internet search is to use a search engine
such as Google
or Yahoo, or
even several of them, so that you can compare their
results.
Note these tips: |
|
|
|
| Putting quotation marks ("
") around a search term restricts the results to that exact
sequence of words. |
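| In many search engines, a plus sign (+) placed directly in front of a
search term has traditionally forced that exact term to appear in the
results. It does not by itself restrict how close the terms are to each
other in the text, and support for it varies between engines. |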
| Sometimes you may want to exclude
a particular word, e.g. when you are looking for internet pages that
contain A but not B. In this case, place a minus
sign (-) in front of B. |
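The quotation-mark and minus-sign tips can be combined in a single query string. The following is a minimal Python sketch of how such a string might be assembled. The function name build_query and the example search terms are hypothetical, and whether a given search engine honours these operators is an assumption you should check against its own help pages.

```python
from urllib.parse import quote_plus

def build_query(phrase, exclude=()):
    """Combine an exact phrase with excluded words into one query string.

    Hypothetical helper: it only builds text, it does not contact any
    search engine.
    """
    parts = [f'"{phrase}"']                      # quotation marks: exact sequence
    parts += [f"-{word}" for word in exclude]    # minus sign: exclude this word
    return " ".join(parts)

query = build_query("corpus linguistics", exclude=["software"])
print(query)              # "corpus linguistics" -software
print(quote_plus(query))  # %22corpus+linguistics%22+-software
```

The second line of output is the URL-encoded form, which is what would appear in a browser's address bar after the query parameter.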
|
|
|
|
|
|
Googlefight |
|
To make life easier (and more fun), there is also googlefight.
If you want to compare two similar words with regard to their general
frequency (e.g. Bewerbungsbrief vs. Bewerbungsschreiben),
just enter them in the two boxes and watch them fight. :-) |
|
|
|
|
|
Webcorp |
|
This is the search engine linguists have been waiting for,
and its latest version is a lot faster than before: Webcorp
allows you to search the internet like a corpus
and returns concordances, word
frequency lists for individual pages, etc. |
|
How to create your own corpus
|
|
|
|
Nothing could be easier: first, determine which
texts you need for your research question. Then find them and
digitize them, or download them
from a virtual library in the first place. |
|
Virtual
libraries |
|
| Project
Gutenberg attempts to digitize all important printed works
and make them generally accessible online. You can download them in
different formats, of which .txt is the most suitable for corpus
creation. |
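Plain-text files from a virtual library usually carry licence text before and after the actual work, which would distort frequency counts if left in. The following Python sketch shows one way this might be removed. It assumes that the file marks the body with lines beginning "*** START OF" and "*** END OF", as Project Gutenberg files typically do. The function name strip_boilerplate and the sample text are hypothetical, so check the markers in your own files before relying on it.

```python
import re

# Assumed marker lines; verify them against the files you downloaded.
START = re.compile(r"^\*\*\* ?START OF .*$", re.MULTILINE)
END = re.compile(r"^\*\*\* ?END OF .*$", re.MULTILINE)

def strip_boilerplate(text):
    """Return only the text between the start and end marker lines.

    If a marker is missing, the text is kept from the beginning
    or up to the end, respectively.
    """
    start = START.search(text)
    end = END.search(text)
    begin = start.end() if start else 0
    stop = end.start() if end else len(text)
    return text[begin:stop].strip()

# Invented sample standing in for a downloaded .txt file.
sample = (
    "Licence preamble\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "A short sample sentence.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "Licence text\n"
)
print(strip_boilerplate(sample))  # A short sample sentence.
```

For a real file, read it with open(path, encoding="utf-8").read() and pass the result to the function.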
|
Concordancer |
|
To search your corpus, you will need a concordancer,
i.e. a programme that calculates frequencies and outputs concordances,
collocations, etc. |
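To give an idea of what a concordancer does internally, here is a minimal Python sketch that counts word frequencies and prints keyword-in-context (KWIC) lines. It is an illustration under simple assumptions, not a description of how AntConc or any other tool is implemented: the tokenizer only recognises lowercase letters and apostrophes, the function names and sample text are invented, and collocation statistics are left out.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def kwic(tokens, keyword, width=3):
    """Return one concordance line per hit, with `width` words of context."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines

# Invented sample text standing in for a corpus file.
text = "The cat sat on the mat. The dog chased the cat."
tokens = tokenize(text)

print(Counter(tokens).most_common(2))  # [('the', 4), ('cat', 2)]
for line in kwic(tokens, "cat"):
    print(line)
# the [cat] sat on the
# dog chased the [cat]
```

A dedicated concordancer adds much more (sorting, wildcards, collocation measures), so this sketch is mainly useful for understanding the output such programmes produce.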
|
|
|
|
A very good free programme is AntConc,
by Laurence Anthony from Waseda University (Japan). It is easy to use and
comes with a detailed instruction
manual. |
|
|
|
|
| Alternatively, you could also use KwicKwic. |