English Language & Linguistics   Christina Sanchez-Stockhammer







General information about corpora
English corpora
English word frequency lists
German corpora
The internet as a corpus
How to create your own corpus



General information about corpora

      My homepage only contains links to corpora that are freely accessible on the internet. 
Information on other corpora can be found on the following sites:


Link collection   David Lee's site http://tiny.cc/corpora contains what is presumably the most comprehensive collection of links on corpora, including links to software, frequency lists, etc.


Gateway to Corpus Linguistics   Yvonne Breyer's Gateway to Corpus Linguistics on the Internet offers a structured overview of the most important corpus sites. Among other things, you will find a list of spoken-language corpora as well as information on the accessibility of the different text collections. 




Corpus Survey   Richard Xiao's Corpus Survey gives you a detailed account of the history of corpus linguistics and the "classical" corpora - together with links to the respective homepages.







English corpora


BNC   The 100-million-word, balanced British National Corpus is one of the most important English language corpora.


For more information about the corpus, take a look at the BNC's official homepage.


You can even search the BNC for free on Mark Davies' BNC web interface. Try this site - it is great!


    For BNC-based word frequency lists, see below.


  Google n-gram viewer   The Google n-gram viewer permits you to discover trends in published books in English and some of its regional varieties between 1800 and 2000. The frequency results for several search items can be displayed simultaneously on a timeline. More information about this exciting corpus can be found here.
  Collins Wordbank   Collins WordbanksOnline offers you the opportunity to search a tagged 56-million-word corpus of contemporary spoken and written English.




Time Corpus   Mark Davies' Time Corpus contains 100 million words of written American English from Time magazine (since 1923). Like the other corpora on his corpus platform http://corpus.byu.edu, the Time Corpus can be searched for free.


  COCA   Even better than that, Mark Davies' 400-million-word Corpus of Contemporary American English is balanced and contains 20 million words for each year from 1990 to 2009. 
It is updated once or twice a year and is therefore well suited to studying current, ongoing changes in American English.
  COHA   The Corpus of Historical American English, by the same author, contains 400 million words from the period between 1810 and 2009.


MICASE   As the name suggests, the Michigan Corpus of Academic Spoken English contains transcriptions of 152 spoken English texts from an academic background (e.g. lectures or seminars). They can be searched according to different variables, such as the speakers' native language or a particular academic discipline.


  MICUSP   The Michigan Corpus of Upper-level Student Papers comprises about 830 grade A papers produced at the University of Michigan, Ann Arbor. This corresponds to roughly 2.6 million words from a range of academic disciplines.


SCoSE   The Saarbrücken Corpus of Spoken English contains transcripts of everyday English conversations. The transcribed jokes are particularly recommended.

English word frequency lists

Both Adam Kilgarriff and Leech, Rayson & Wilson offer you ready-made word frequency lists (the latter on a companion website for their book Word Frequencies in Written and Spoken English).
The most aesthetic account of word frequencies in the BNC can be found at www.wordcount.org, which shows the 86,800 most frequent words with their rank and their neighbours in a very appealing way. 
There is also a querycount that indicates the most frequently sought words and a collection of particularly interesting word sequences - such as ranks 4304-4307, microsoft acquire salary tremendous.
  COCA   Get different frequency lists from the Corpus of Contemporary American English (see above).
      The site www.lextutor.ca offers a number of famous frequency lists for download, such as
Several versions of Michael West's A General Service List of English Words
Several versions of Averil Coxhead's Academic Word List.
There are also other sites about the AWL that offer additional information, such as a very beautiful one by the University of Plymouth.
  Basic English   Another classic that can be found online is Charles K. Ogden's Basic English word list.





German corpora


Cosmas II   The classic among the German-language corpora: Cosmas II, by the Institut für deutsche Sprache (IDS) Mannheim. Use it, for example, to check whether your translations into German are idiomatic.
The corpus contains mainly newspaper texts, but it is possible to compile your own subcorpora.




DWDS-Kernkorpus   With 100 million words, the balanced DWDS-Kernkorpus by the Berlin-Brandenburgische Akademie der Wissenschaften (BBAW) is the German equivalent of the British National Corpus - or at least as close as you can get. Make sure to read the introduction to the search syntax, since it is not entirely intuitive.


  Text messages   Take a look at an interesting but unfortunately poorly documented German corpus of 1500 text messages by German pupils. 





The internet as a corpus


Advantages and disadvantages   Having worked with electronic corpora, you may want to consider the internet as the largest existing collection of texts. However, this has both advantages and disadvantages. Read more on this issue in texts by
Mike Rundell
Thomas Robb.
  Search engines   The most straightforward way to carry out a linguistic internet search is to use a search engine such as Google or Yahoo - or maybe even several, in order to check differences between their results.

Note these tips:

Using quotation marks (" ") around a search term restricts the results to that exact sequence of words.
Using a plus sign (+) between search terms makes the search engine look for instances where the two terms are relatively close to each other in the text.
Sometimes you may want to exclude a particular word, e.g. when you are looking for internet pages that contain A but not B. In this case, place a minus sign (-) in front of B.
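
The quotation-mark and minus-sign tips can be combined in a single query. The following Python sketch only illustrates how such a query string is put together; the function names and the search address used as `base` are assumptions made for this example, and other search engines may expect a different address.

```python
from urllib.parse import quote_plus


def build_query(phrase, exclude=()):
    """Put the phrase in quotation marks and prefix each excluded word with '-'."""
    parts = ['"{}"'.format(phrase)]
    parts.extend("-" + word for word in exclude)
    return " ".join(parts)


def search_url(query, base="https://www.google.com/search?q="):
    # 'base' is an assumed address; replace it for the search engine you use.
    return base + quote_plus(query)


query = build_query("corpus linguistics", exclude=["software"])
print(query)
print(search_url(query))
```

Running the same query on several search engines then only requires changing `base`.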
  Googlefight   To make life easier (and more fun), there is also googlefight. In case you want to compare two similar words with regard to their general frequency (e.g. Bewerbungsbrief vs. Bewerbungsschreiben), just enter them in the two boxes and watch them fight. :-)


  Webcorp   This is the search engine linguists have been waiting for - and its latest version is a lot faster than before: Webcorp allows you to search the internet like a corpus - and returns concordances, word frequency lists for individual pages etc.



How to create your own corpus


    Nothing could be easier: first, determine which texts you need for your research question. Then find them and digitize them - or download them from a virtual library in the first place.
  Virtual libraries  
Project Gutenberg attempts to digitize all important printed works and make them generally accessible online. You can download them in different formats - of which .txt is the most suitable for corpus creation.
There are also other virtual libraries, e.g. the Alex catalogue of electronic texts.
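
Downloaded .txt files usually contain more than the literary text itself, e.g. a licence notice, which would distort your frequency counts. The following Python sketch removes such material before the text goes into your corpus. It assumes that the file marks the beginning and end of the actual text with lines starting with "*** START OF" and "*** END OF"; check your own files, since the exact wording may differ. The function name is made up for this example.

```python
def strip_boilerplate(text):
    """Keep only the lines between the assumed start and end marker lines."""
    lines = text.splitlines()
    start, end = 0, len(lines)
    for i, line in enumerate(lines):
        if line.startswith("*** START OF"):
            start = i + 1          # the text begins after the start marker
        elif line.startswith("*** END OF"):
            end = i                # the text stops before the end marker
            break
    return "\n".join(lines[start:end]).strip()


raw = (
    "Licence and header information\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK SAMPLE ***\n"
    "Call me Ishmael.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK SAMPLE ***\n"
    "More licence information\n"
)
print(strip_boilerplate(raw))
```

If no marker lines are found, the function returns the whole text unchanged (apart from surrounding whitespace).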
  Concordancer   In order to be able to search your corpus, you will need a concordancer, i.e. a programme that calculates frequencies, outputs concordances, collocations etc.
A very good free programme is AntConc, by Laurence Anthony from Waseda University (Japan). It is easy to use and comes with a detailed instruction manual.
Alternatively, you could use KwicKwic.
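
To get an idea of what a concordancer does, here is a minimal Python sketch that produces a word frequency list and a keyword-in-context (KWIC) concordance. It is not how AntConc or KwicKwic work internally; the function names, the very crude tokenization rule and the sample sentence are assumptions made for this example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately crude rule)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def frequency_list(tokens, n=10):
    """Return the n most frequent tokens together with their counts."""
    return Counter(tokens).most_common(n)


def kwic(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for every hit of the keyword."""
    hits = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, token, right))
    return hits


sample = "The cat sat on the mat. The dog saw the cat."
tokens = tokenize(sample)

print(frequency_list(tokens, n=2))
for left, word, right in kwic(tokens, "cat", width=2):
    print("{:>10}  {}  {}".format(left, word, right))
```

For a real corpus, you would read the contents of your .txt files into `sample` instead of typing in a sentence.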