Corpora

Introduction

When Samuel Johnson compiled his Dictionary of the English Language, the only arbiter of which words should or shouldn’t be included was Johnson himself. Nowadays the choice of words for inclusion in a dictionary is far more objective and scientific, because it draws on corpora: large collections of real written and spoken text. The use of corpora is not limited to recording how frequently a word is used; corpora can also reveal patterns of usage, making subtle differences in use and meaning clearer.

With increased access to and use of the Internet, this tool has become even more powerful. It is now possible for all of us to become amateur lexicographers, and the Internet provides us with an extremely useful research and teaching tool.

What we have tried to do below is to provide you with a guide to corpora on the Internet. We have divided the sites into categories including Corpora sites; Articles on corpora; and Materials for teaching.

If you feel we have left out any sites that should be included we would be grateful if you could let us know. Click here to contact us.

Terminology

Don’t understand some of the terminology being used? Don’t worry – help is at hand!

http://donelaitis.vdu.lt/publikacijos/SDoCL.htm
One aspect of Corpus Linguistics that often makes it less appealing than it could be is the large amount of terminology used. This basic online dictionary is devoted to simple explanations of the most common terminology employed in the field.

Corpora sites

The first section of this Web Guide takes a look at a variety of corpora and concordance sites. These sites are designed to give you access, although sometimes limited, to a variety of corpora and concordance facilities. Each site mentioned has a brief description. For a more detailed outline of what each site contains and what can be accessed, click on the link and continue to read.

http://devoted.to/corpora
This meta-site aims to be a comprehensive and up-to-date index to corpora and corpus-related software and references. It contains annotated links which are meant mainly for linguists and language teachers who work with corpora, not computational linguists/NLP (natural language processing) researchers. It concentrates on English-language corpora and associated tools and research, although major corpora and tools for many other languages are also included.

http://www.natcorp.ox.ac.uk/
The British National Corpus online. Follow the on-screen instructions to use this facility. For the full service you can subscribe online at a cost of around £60. If you wish to try out the BNC World Edition and see what you can do with it, then click on the ‘simple search interface’ icon. You will then be given a ‘search’ box in which you can type your query before clicking on the ‘Solve it!’ button. A display of up to 50 hits will be shown.

http://www.titania.bham.ac.uk/docs
Online access to the Bank of English corpus. You can sample the concordance programme free of charge or subscribe to the full version for an annual fee.

http://www.ucl.ac.uk/english-usage/ice-gb/sampler/download.htm
This site gives you a downloadable version of the International Corpus of English (ICE-GB). The sample corpus contains 10 texts (of over 20,000 words) from the ICE-GB Corpus. This is split into spoken texts (e.g. private and public dialogues, scripted and unscripted monologues) and various types of written texts (e.g. student essays, academic writing, correspondence, news reports, instructional writing). To have access to the full ICE-GB Corpus you need to purchase a CD-ROM, which contains a further 490 texts. The Corpus site also includes a number of facilities to explore the corpus in detail. These include a corpus map, text browser, fuzzy tree fragments, and syntactic trees.

For more information on ICE and other research projects, go to http://www.ucl.ac.uk/english-usage/

http://www.liv.ac.uk/~ms2928/
Mike Scott’s Web page includes various ‘Wordsmith’ tools. There is a downloadable demo of the six ‘tools’ available, and a complete version can be acquired for a fee. The six ‘tools’ include:
Wordlist – generates wordlists in either alphabetical or frequency order enabling text comparison at a lexical level.
Concord – a fairly comprehensive concordance program.
Keywords – compares the frequency of words in a text with their general frequency, and therefore helps define texts by genre through lexis.
The other three ‘tools’ are file management tools enabling you to organise texts to make analysis easier.
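
To make the Wordlist and Keywords ideas more concrete, here is a minimal sketch in Python. It is not how the Wordsmith tools work internally: the function names are invented for illustration, the tokenizer is deliberately crude, and a plain frequency ratio with rough smoothing stands in for whatever statistic a real keyword tool applies.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and pull out simple word tokens (hypothetical helper)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def wordlist(text, by="frequency"):
    """Return (word, count) pairs in frequency or alphabetical order."""
    counts = Counter(tokenize(text))
    if by == "alphabetical":
        return sorted(counts.items())
    # Most frequent first; ties are broken alphabetically for a stable result.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

def keywords(text, reference, min_ratio=2.0):
    """Return (word, ratio) pairs for words that are unusually frequent in
    `text` compared with `reference`. The ratio and the add-one smoothing
    are illustrative assumptions, not a standard keyness statistic."""
    target = Counter(tokenize(text))
    ref = Counter(tokenize(reference))
    target_total = sum(target.values())
    ref_total = sum(ref.values())
    found = []
    for word, count in target.items():
        target_rate = count / target_total
        # Add one so that words absent from the reference do not divide by zero.
        ref_rate = (ref[word] + 1) / (ref_total + 1)
        ratio = target_rate / ref_rate
        if ratio >= min_ratio:
            found.append((word, ratio))
    # Strongest keywords first.
    return sorted(found, key=lambda item: -item[1])
```

Calling wordlist() on a text and keywords() on the same text against a larger general reference text gives a rough feel for what the two tools report, although real tools work with far larger reference corpora and proper significance tests.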

http://www.longman.com/dictionaries/corpus/lancaster.html
Lots of information on the Longman/Lancaster corpus, including some good examples of the type of information used to create a corpus. There are also links and information on the Longman Learner’s Corpus, The Longman Written and Spoken Corpora and samples from the BNC. At this time there does not appear to be public access to the corpora.

http://www.cambridge.org/elt/corpus/default.htm
Lots of information on the Cambridge International Corpus and Cambridge Learner Corpus, with examples and explanation of how the corpora are used for language research and dictionary writing. At this time there does not appear to be public access to the corpora.

http://www.eisu.bham.ac.uk/johnstf/timconc.htm
This is the start of an online Virtual DDL (Data-driven Learning) Library being put together by Tim Johns at Birmingham University. It contains samples of concordance-based materials for teaching. The material contains some interesting examples of concordance patterns using corpora, as well as being a basis for practical ideas that can be used in the classroom.

A further 76 samples, focusing predominantly on academic writing, can be found at http://web.bham.ac.uk/johnstf/timeap3.htm#revision

http://vlc.polyu.edu.hk/
This site includes an Internet-based concordance program. Simply type the word or phrase you are looking for into the search box and press ‘Go’. A few seconds later you will be shown examples of the word/phrase from a large corpus of over 2 million texts.
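
For readers curious about what a concordance program does behind the search box, here is a minimal key-word-in-context sketch in Python. It is an illustration only and does not describe how this or any other site is implemented; the function name and the `width` and `max_hits` parameters are assumptions made for the example.

```python
import re

def concordance(text, query, width=30, max_hits=50):
    """Return key-word-in-context lines for each match of `query` in `text`."""
    # Match the query as a whole word or phrase, ignoring case.
    pattern = re.compile(r"\b" + re.escape(query) + r"\b", re.IGNORECASE)
    # Collapse line breaks and runs of spaces so the context reads as one line.
    flat = " ".join(text.split())
    lines = []
    for match in pattern.finditer(flat):
        if len(lines) >= max_hits:
            break
        left = flat[max(0, match.start() - width):match.start()]
        right = flat[match.end():match.end() + width]
        # Right-align the left context so the search term lines up in a column.
        lines.append(f"{left:>{width}} [{match.group()}] {right}")
    return lines
```

Printing the returned lines one under another gives the familiar concordance display, with the search term in a central column and its immediate context on either side.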

http://www.dundee.ac.uk/english/wics/wics.htm
Limited to the work of six of the most famous British poets, this site, from Dundee University, gives you the opportunity to use concordancing for analysis of literary language. You can download a free 30-day trial of the software used for this research from: http://www.concordancesoftware.co.uk/, and can purchase it for approximately £55.

http://nora.hd.uib.no/icame.html
This address links you to the ICAME (International Computer Archive of Modern and Medieval English). There are details here of many aspects of ICAME including how to access the corpora (and how to become a registered user).

http://www.nsknet.or.jp/~peterr-s/
Click on the Concordancing icon to gain access to a wealth of information, from a section on terminology to suggestions on how to make the best use of concordance programs.

http://www.webcorp.org.uk/
This site uses the Internet and the World Wide Web as its database. It takes a little bit of time for the results to come up as it is searching such a vast bank of material, but the results are interesting when you finally see them.

Other languages

Although this Macmillan English Dictionary site is predominantly concerned with English corpora, we felt that some people might well wish to access corpora and concordance tools for other languages. Here are just a few that we have found.

http://tractor.bham.ac.uk/tractor/
Maybe you want to use a concordance program with another European language (not just English). The TRACTOR archive, collected and collated by the Centre of Corpus Linguistics at Birmingham University, provides monolingual and multilingual language resources on-line in the following languages: Bulgarian, Croatian, Czech, Dutch, English, Estonian, Finnish, French, German, Greek, Hungarian, Italian, Latvian, Lithuanian, Polish, Romanian, Russian, Serbian, Slovak, Slovene, Swedish, Turkish, Ukrainian and Uzbek.

To use the TRACTOR archives you need to become a member of the Tractor User Community (TUC). This can be done by either contributing to the database (in which case membership is free) or paying a small fee. Details of all this can be found on the site.

http://www.ldc.upenn.edu/
The Linguistic Data Consortium has a number of corpora for sale including Arabic and Chinese.

http://www.elra.info/
The European Language Resource Association offers corpora resources for a number of European Languages.

Programs

Do you want to create your own corpora, or perhaps make your own concordance program? Well, here is a selection of online programs to help get you started.

http://www.concordancesoftware.co.uk/
If you would like to design your own concordancer then here is a site which allows you to buy the software necessary (at a cost of £55). Simply follow the online instructions.

http://myweb.tiscali.co.uk/wordscape/software-new.html
This site includes a number of freeware programs that can be used for ‘text patterns’ that in many cases complement concordance ideas.

Corpora resources

What do we mean by corpora resources? Well, one of the features of the Programs section was the opportunity for you to design and create your own concordance programs. In order to do this effectively you will need to have access to corpora texts.

http://promo.net/pg/
Project Gutenberg aims to provide online access to as many texts as possible that no longer have copyright restrictions. Use these texts as the basis for any concordancing program you create (note the Dundee University Concordance site mentioned in the Corpora sites section of this Web Guide).
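
If you do build a small corpus from these texts, it helps to remove the licence wording that surrounds each book before you count or search anything. The sketch below assumes the plain-text file marks the body with lines beginning ‘*** START OF’ and ‘*** END OF’; that marker wording is an assumption, since files vary (especially older ones), so check your own download. The function name is invented for illustration.

```python
def strip_gutenberg_boilerplate(raw):
    """Return the text between the assumed '*** START OF' and '*** END OF'
    marker lines. If either marker is missing, return the input unchanged."""
    lines = raw.splitlines()
    start = end = None
    for i, line in enumerate(lines):
        if start is None and line.startswith("*** START OF"):
            start = i + 1          # the body begins on the line after the marker
        elif line.startswith("*** END OF"):
            end = i                # the body stops just before this marker
            break
    if start is None or end is None or end < start:
        return raw
    return "\n".join(lines[start:end]).strip()
```

The cleaned text can then be fed to whatever wordlist or concordance program you are using, so that licence wording does not distort the word counts.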

http://info.ox.ac.uk/ctitext/service/workshop/etext/
This web page gives some practical advice on how to find electronic texts.

Articles

There is a lot to learn about corpora, from the technical side of things to the practical applications. Here are a series of links to articles that should give you lots of food for thought, as well as answering many of your questions. If you thought corpora couldn’t be interesting, just click and read.

http://www.hltmag.co.uk/prev.asp
Lots of fascinating articles on corpora, many of them written by Michael Rundell. To access them, you need to click on the ‘View by Categories’ and click on ‘Corpora ideas’ then ‘Show’. This will bring up a list of all the articles from previous editions of this online ELT magazine.

http://helmer.hit.uib.no/icame.html
The International Computer Archive of Modern and Medieval English Journal.

http://icame.uib.no/journal.html contains lots of articles on every aspect of corpora, available in Acrobat reader format online. If you have registered using the CD-ROM you will also have access to the ICAME corpora online.

http://www.hf.uib.no/i/Engelsk/colt/COLTinfo.html
At the bottom of this page you will find abstracts from seven articles based on the Corpus of London Teenage Language (COLT). Read the articles and compare the information to your expectations of language use, or to examples taken from other corpora. COLT is available on CD-ROM.

http://www.bangkokpost.net/education/site2002/cvap0202.htm
Start off by reading a general article on dictionaries and technology. The article begins by talking about the Cobuild corpus and has a brief interview with Ramesh Krishnamurthy who still works on the Cobuild project. This is followed by an interview with Gwyneth Fox in which she gives a concrete example of how a corpus can be used to check lexical usage. Finally the article discusses the need for more sophisticated software as the size of corpora gets ever bigger.

There are lots of other interesting articles and ideas for teaching on this site – some linked to vocabulary while others cover other aspects of language teaching.

http://www.eisu.bham.ac.uk/johnstf/stevens.htm
An article that looks at three key questions connected with using concordancing with language learners: why, when and what? The article first appeared in CAELL Journal, vol. 6, no. 2, Summer 1995, pp. 2-10.

http://www.ling.lancs.ac.uk/monkey/ihe/linguistics/contents.htm
This site is designed to supplement Corpus Linguistics by McEnery, T. & Wilson, A., Edinburgh University Press (2001). It takes a look at four key areas connected with the topic: Early Corpus Linguistics and the Chomskyan revolution; What is a Corpus and what is in it?; Quantitative data; and The use of Corpora in Language Studies.

Materials for teaching

“I’ve looked at everything so far and I’m still not convinced this is for me. I’m a classroom teacher and I need to be able to apply this stuff to my daily teaching.” Well, here are a few links that bring the use of corpora ever nearer to the chalk face (or computer screen).

http://www.onestopenglish.com
The onestopenglish section on Grammar and Vocabulary offers a set of lessons focusing on British and American English words, metaphor, and phrasal verbs. The lessons are prepared using the Macmillan English Dictionary.

http://www.visualthesaurus.com/
This ‘Visual’ thesaurus gives a new perspective on the relationship between words. Words are displayed in a mind map format with fine lines showing the relationship between them. Click on a word to show words related to it, while unrelated words disappear from the display. Users can search for any word or phrase via a simple text-entry box. You can also search for words based on part of speech, for example, similar nouns or verbs.

http://www.worldwidewords.org/
A really interesting site all about words. You can sign up to receive a weekly newsletter that tells you about weird words, neologisms and answers questions sent in by readers about etymology and usage.

http://rdues.uce.ac.uk/newwords.shtml
Are you interested in new words? Take a look at this site dedicated to neologisms taken from the UK’s Independent newspaper.

For more useful sites on English Language Teaching see this web guide.

Other

Some odds and ends that don’t fit under any of our previous categories, but that we think may still be of some interest.

http://nora.hd.uib.no/fileserv.html
One of the many ‘lists’ concerned with corpora. Follow the instructions to join.

http://www.telenex.hku.hk/telec/mainmenu.htm
Interesting site based in Hong Kong. Some parts have open access, some need registration – which can be done for free – and other parts are only open to teachers working locally.