Macmillan English Dictionary Resource Site – Web Guide

Corpora Web Guide

Introduction

When Samuel Johnson compiled his Dictionary of the English Language, the only arbiter of which words should or shouldn’t be included was Johnson himself. Nowadays the choice of words for inclusion in a dictionary is far more objective and scientific, because it is based on corpora: large collections of real written and spoken language. The use of corpora is not limited to recording how frequently a word is used; corpora can also reveal patterns of usage, making subtle differences in use and meaning clearer.

With increased access to, and use of, the Internet, this tool has become even more powerful. It is now possible for all of us to become amateur lexicographers, and the Internet provides us with an extremely useful research and teaching tool.

What we have tried to do below is provide you with a guide to corpora on the Internet. We have divided the sites into categories including Corpora sites; Articles on corpora; and Materials for teaching. Of course, there is some overlap: for example, the Cobuild site is both a corpora site and a teaching aid, and some of the articles include ideas that can be used in the classroom.

If you feel we have left out any sites that should be included we would be grateful if you could let us know. Click here to contact us.

Terminology

Don’t understand some of the terminology being used? Don’t worry – help is at hand!

http://donelaitis.vdu.lt/publikacijos/SDoCL.htm

One of the aspects of Corpus Linguistics that often makes it less appealing than it could be is the large amount of terminology that is used. This basic online dictionary is devoted to simple explanations of the most common terminology employed in the field of Corpus Linguistics.

Corpora Sites

The first section of this Web Guide takes a look at a variety of corpora and concordance sites. These sites are designed to give you access, although sometimes limited, to a variety of corpora and concordance facilities. Each site mentioned has a brief description. For a more detailed outline of what each site contains and what can be accessed click on the link and continue to read.

http://devoted.to/corpora Added September 2003
This meta-site aims to be a comprehensive and up-to-date index to corpora and corpus-related software and references. It contains annotated links which are meant mainly for linguists and language teachers who work with corpora, not computational linguists/NLP (natural language processing) researchers. It concentrates on English-language corpora and associated tools and research, although major corpora and tools for many other languages are also included.

http://www.hcu.ox.ac.uk/BNC/

The British National Corpus online. Follow the on-screen instructions to be able to use the facility. For the full service you can subscribe online at a cost of around £50. If you wish to try out the BNC World Edition and see what you can do with it then click on the ‘simple search interface’ icon. You will then be given a ‘search’ box in which you can type your query before clicking on the ‘Solve it!’ button. A display of up to 50 hits will be shown.

http://titania.cobuild.collins.co.uk

Online access to the Bank of English corpus. You can sample the concordance programme free of charge or subscribe to the full version for a fee of £500 per year.

For the sampler, click on ‘simple concordance demo’ and then type the word you want to check into the query box. Choose the corpus you would like to search (British books, American books or British transcribed speech) and finally click on the ‘Show Concs’ button to get forty lines of concordance samples.

Another feature well worth visiting is the ‘Wordwatch’ feature which takes a look each week at a word or phrase using the Cobuild concordance program. http://titania.cobuild.collins.co.uk/wordwatch.html

The site also contains an archive feature giving you access to over 300 articles.

http://www.ucl.ac.uk/english-usage/ice-gb/sampler/download.htm

This site gives you a downloadable version of the International Corpus of English. The sample corpus contains 10 texts (of over 20,000 words) from the ICE-GB Corpus. This is split into five spoken texts (dialogues, monologues, scripted and unscripted) and five written texts (one each from student essay, academic writing, correspondence, news report and instructional writing). To have access to the full ICE-GB Corpus you need to purchase a CD-ROM which contains a further 490 texts. The Corpus site also includes a number of facilities to explore the corpus in detail. These include a corpus map, text browser, fuzzy tree fragments, FTF matches and syntactic trees.

For more information on ICE and other research projects, go to http://www.ucl.ac.uk/english-usage/

http://www.liv.ac.uk/~ms2928/

Mike Scott’s Web page includes various ‘Wordsmith’ tools. There is a downloadable demo of the six ‘tools’ available and a complete version can be acquired for a fee. The six ‘tools’ include:

Wordlist – generates wordlists in either alphabetical or frequency order enabling text comparison at a lexical level.

Concord – a fairly comprehensive concordance program.

Keywords – enables you to compare the frequency of words in a text with their general frequency, and therefore helps define texts by genre through lexis.

The other three ‘tools’ are file management tools enabling you to organise texts to make analysis easier.
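
To give a feel for what a wordlist tool and a keywords tool do, here is a minimal sketch in Python. It is not how ‘Wordsmith’ itself works: the function names `tokenize`, `wordlist` and `keywords` are made up for this illustration, and the keyword score is a simple ratio of relative frequencies rather than the statistical tests a full program would offer.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def wordlist(text, by="frequency"):
    """Return (word, count) pairs in frequency or alphabetical order."""
    counts = Counter(tokenize(text))
    if by == "alphabetical":
        return sorted(counts.items())
    # Most frequent first; ties are broken alphabetically.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def keywords(text, reference, top=5):
    """Rank words that are unusually frequent in `text` compared with `reference`.

    Assumption for this sketch: the score is the word's relative frequency
    in the text divided by its (add-one smoothed) relative frequency in the
    reference text.
    """
    target = Counter(tokenize(text))
    ref = Counter(tokenize(reference))
    target_total = sum(target.values())
    ref_total = sum(ref.values())
    scores = {}
    for word, count in target.items():
        target_rate = count / target_total
        ref_rate = (ref[word] + 1) / (ref_total + 1)
        scores[word] = target_rate / ref_rate
    return sorted(scores, key=lambda word: (-scores[word], word))[:top]
```

Calling `wordlist(text)` on a short text gives its words from most to least frequent, while `keywords(text, reference)` picks out the words that set the text apart from a more general sample, which is the idea behind defining a genre through its lexis.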

http://www.longman-elt.com/dictionaries/corpus/lccont.html

Lots of information on the Longman/Lancaster corpus, including some good examples of the type of information used to create a corpus, for example http://www.longman-elt.com/dictionaries/corpus/sound.html. There are also links and information on the Longman Learner’s Corpus, The Longman Written and Spoken Corpora and samples from the BNC. At this time there does not appear to be public access to the corpora.

http://web.bham.ac.uk/johnstf/ddl_lib.htm

This is the start of an online Virtual DDL Library being put together by Tim Johns at Birmingham University. It contains samples of concordance-based materials for teaching. The material contains some interesting examples of concordance patterns drawn from corpora and can serve as a source of practical classroom ideas.

More examples produced by some participants at a workshop run in North Bohemia can be seen at http://web.bham.ac.uk/johnstf/unl_ddl.htm

A further 76 samples, focusing predominantly on academic writing, can be found at http://web.bham.ac.uk/johnstf/timeap3.htm#revision

Also available are two pieces of freeware (Cloze and Context) that can be downloaded and used to generate activities and exercises based on corpus material. These programs can be found at http://web.bham.ac.uk/johnstf/timcall.htm#cloze

http://vlc.polyu.edu.hk/

This site includes an Internet-based concordance program. Simply type the word or phrase you are looking for into the search box and press ‘Go’. A few seconds later you will be shown examples of the word/phrase from a large Corpus of over 2 million texts.

There are also parallel corpora available as well as Chinese, French and Japanese corpora.

The site also contains a selection of other ‘authoring tools’ – one of these, the ‘Cloze Maker’, would be a useful tool to use in combination with the concordance feature.
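
As an illustration of what a cloze-making tool does, here is a minimal sketch in Python. It is not the site’s ‘Cloze Maker’: the function name `make_cloze` and the rule of blanking every fifth word are assumptions chosen for this example.

```python
import re


def make_cloze(text, every=5):
    """Replace every `every`-th word with a numbered gap.

    Returns the gapped text and the list of removed words (the answer key).
    """
    answers = []
    count = 0

    def blank(match):
        nonlocal count
        count += 1
        if count % every == 0:
            answers.append(match.group(0))
            return f"({len(answers)}) ____"
        return match.group(0)

    # re.sub calls blank() once per word, from left to right.
    gapped = re.sub(r"[A-Za-z']+", blank, text)
    return gapped, answers
```

Sentences copied from concordance results can be passed to `make_cloze` to produce a quick gap-fill exercise, with the second returned value serving as the answer key.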

http://www.kamakuranet.ne.jp/~someya/

This site has an online concordance program that is almost entirely devoted to business letters – simply click on the ‘Online BLC KWIC Concordancer (E)’ to gain access to the program. The one-million-word corpus is drawn from a Business Letter Corpus. You can adjust the search type, line width and sort type, and you can also access a few other corpora (including some letters from famous people and a few works of literature).

http://www.dundee.ac.uk/english/wics/wics.htm

Limited to the work of six of the most famous British poets, this site, from Dundee University, gives you the opportunity to use concordancing for analysis of literary language.

http://nora.hd.uib.no/icame.html

This address links you to ICAME (the International Computer Archive of Modern and Medieval English), an international organisation of linguists and information scientists working with machine-readable texts. There are details here of many aspects of ICAME including how to access the corpora (and how to become a registered user).

http://www.nsknet.or.jp/~peterr-s/

Click on the Concordancing icon to gain access to a wealth of information, from a section on terminology to suggestions on how to make the best use of concordance programs. There is also a link to a Java-based concordancer. Click on the relevant icon and then select from a range of print texts from the Guardian and a few classical novels.

Other Languages

Although this site is predominantly concerned with English corpora we felt that some people may well wish to access corpora and concordance tools for other languages. Here are just a few that we have found.

http://www.tractor.de/

Maybe you want to use a concordance program with another European language (not just English). The TRACTOR archive, collected and collated by the Centre of Corpus Linguistics at Birmingham University, provides monolingual and multilingual language resources available on-line in the following languages: Bulgarian, Croatian, Czech, Dutch, English, Estonian, Finnish, French, German, Greek, Hungarian, Italian, Latvian, Lithuanian, Polish, Romanian, Russian, Serbian, Slovak, Slovene, Swedish, Turkish, Ukrainian and Uzbek.

To use the TRACTOR archives you need to become a member of the TRACTOR User Community (TUC). This can be done by either contributing to the database (in which case membership is free) or paying a small fee. Details of all this can be found on the site.

http://www.ldc.upenn.edu/

The Linguistic Data Consortium has a number of corpora for sale including Arabic and Chinese.

http://www.icp.grenet.fr/ELRA/

The European Language Resources Association offers corpora resources for a number of European languages. This association is mainly aimed at institutions and, like many European organisation sites, is not easy to navigate around.

http://www.ruf.rice.edu/~barlow/corpus.html

This site, maintained by Mike Barlow, contains lots of links to various corpora sites for a variety of languages.

Programs

Do you want to create your own corpora, or perhaps make your own concordance program? Well, here is a selection of online programs to help get you started.

http://www.rjcw.freeserve.co.uk/

If you would like to design your own concordancer then here is a site which allows you to buy the software necessary (at a cost of $89). Simply follow the online instructions.

http://www.marlodge.supanet.com/software.html

This site includes a number of freeware programs that can be used for ‘text patterns’ that in many cases complement concordance ideas.

http://ourworld.compuserve.com/homepages/Christopher_Tribble/

This personal homepage includes links to a few articles on using corpora in language learning as well as a few programs that can be downloaded. For the freeware click on the ‘Using Corpora in English Language Education’ icon and then scroll down to the section entitled ‘Word Macros for English Language Teachers’.

Corpora Resources

What do we mean by corpora resources? Well, one of the features of the Programs section was the opportunity for you to design and create your own concordance programs. In order to do this effectively you will need to have access to corpora texts.

http://promo.net/pg/

Project Gutenberg aims to provide online access to as many texts as possible that no longer have copyright restrictions. Use these texts as the basis for any concordancing program you create (note the Dundee University Concordance site mentioned in the Corpora Sites section of this Web Guide).
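
For readers curious about what a home-made concordancing program might look like, here is a minimal keyword-in-context (KWIC) sketch in Python. It is an illustration only, not the method used by any of the sites in this guide; the function name `kwic` and its `width` and `limit` parameters are assumptions made for the example.

```python
import re


def kwic(text, keyword, width=30, limit=40):
    """Return up to `limit` concordance lines for `keyword`.

    Each line shows `width` characters of context on either side, with the
    keyword starting at the same column in every line.
    """
    # Collapse line breaks and repeated spaces so the context reads cleanly.
    flat = " ".join(text.split())
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(flat):
        left = flat[max(0, match.start() - width):match.start()]
        right = flat[match.end():match.end() + width]
        # Right-align the left context so the keywords line up vertically.
        lines.append(f"{left:>{width}}{match.group(0)}{right}")
        if len(lines) >= limit:
            break
    return lines
```

To try it on a downloaded plain-text book, read the file into a string, call `kwic(text, "word")` and print each returned line. Sorting the lines by the words that follow the keyword is a natural next step, since that is what makes recurring patterns stand out.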

http://harvest.rutgers.edu/ceth/etext_directory/

A directory giving you access to hundreds of sites containing electronic texts that could be used as the basis for your own corpus.

http://info.ox.ac.uk/ctitext/service/workshop/etext/

This webpage gives some practical advice on how to find electronic texts.

Articles

There is a lot to learn about corpora, from the technical side of things to the practical applications. Here is a series of links to articles that should give you lots of food for thought as well as answering many of your questions. If you thought corpora couldn’t be interesting just click and read.

http://www.hltmag.co.uk/prev.asp

Lots of fascinating articles on corpora, most of them written by Michael Rundell. To access them you need to click on the ‘View by Categories’ and click on ‘Corpora ideas’ then ‘Show’. This will bring up a list of all the articles from previous editions of this online ELT magazine.

http://helmer.hit.uib.no/icame.html

The International Computer Archive of Modern and Medieval English Journal contains lots of articles on every aspect of corpora, available online in Acrobat Reader format. If you have registered using the CD-ROM you will also have access to the ICAME corpora online.

http://www.hf.uib.no/i/Engelsk/colt/COLTinfo.html

At the bottom of this page you will find seven short articles based on the Corpus of London Teenage Language (COLT). Read the articles and compare the information here to your expectations of language use or with examples taken from other corpora. COLT is available on CD-ROM.

http://www.bangkokpost.net/education/site2002/cvap0202.htm

Start off by reading a general article on dictionaries and technology. The article begins by talking about the Cobuild corpus and has a brief interview with Ramesh Krishnamurthy who still works on the Cobuild project. This is followed by an interview with Gwyneth Fox in which she gives a concrete example of how a corpus can be used to check lexical usage. Finally the article discusses the need for more sophisticated software as corpora get ever bigger.

There are lots of other interesting articles and ideas for teaching on this site – some linked to vocabulary while others cover other aspects of language teaching.

http://www.longman-elt.com/dictionaries/corpus/lrcorpus1.html

An interesting article by Michael Rundell which looks at the implications of corpora for ELT. Rundell mentions some limitations and pitfalls of corpora as well as looking at the benefits.

http://www.ruf.rice.edu/~barlow/stevens.html

An article that looks at three key questions connected with using concordancing with language learners: why, when and what? The article first appeared in CAELL Journal, vol 6 #2, Summer 1995 pp. 2-10.

http://www.ling.lancs.ac.uk/monkey/ihe/linguistics/contents.htm

This site is designed to supplement Corpus Linguistics by McEnery, T. & Wilson, A., Edinburgh University Press (2001). They take a look at four main areas connected with the topic: Early Corpus Linguistics and the Chomskyan revolution; What is a Corpus and what is in it?; Quantitative data; and The use of Corpora in Language Studies.

Materials for teaching

"I’ve looked at everything so far and I’m still not convinced this is for me. I’m a classroom teacher and I need to be able to apply this stuff to my daily teaching." Well, here are a few links that bring the use of corpora ever nearer to the chalk face (or computer screen).

http://www.onestopenglish.com/News/Magazine/Vocab/vocab2.htm

http://www.onestopenglish.com/News/Magazine/Vocab/collocationmain.htm

The Onestopenglish Magazine contains a section on vocabulary teaching. Here is a set of lessons focusing on Metaphor in English and on Phrasal verbs and collocation. The lessons here are prepared using the Macmillan English Dictionary.

http://www.plumbdesign.com/thesaurus/thinkmap.html

This ‘Virtual’ thesaurus gives a new perspective into the relationship between words. Once the homepage has loaded click on the ‘loaded, click to launch’ icon to enter the display. Words will be displayed in a mind map format with fine lines showing the relationship between the words. Click on a word to show words related to that particular one while those unrelated disappear from the display. Users can search for any word or phrase by using a simple text-entry box. You can also search words based on part of speech, for example, similar nouns or verbs.

http://titania.cobuild.collins.co.uk

Click on ‘The Definitions Game’ icon to play a wonderful word game based on sentences from the corpus. You are given a sentence with a word missing (####). Your task is to guess that missing word. Once you’ve tried, click for the answer and then move on to another sentence and another word.

http://www.worldwidewords.org/

A really interesting site all about words.

http://rdues.uce.ac.uk/newwords.shtml Link updated December 2004

Are you interested in new words? Take a look at this site dedicated to neologisms taken from the Independent newspaper from the UK.

Other

Some odds and ends that don’t fit under any of our previous categories, but that we think may still be of some interest.

http://www.marlodge.supanet.com/wscape/index.html

The results of concordancing are designed to show language patterns but can also show other patterns. ‘Wordscapes’ is an intriguing idea of how ‘artistic’ corpora and concordancing can be.

http://nora.hd.uib.no/fileserv.html

One of the many ‘lists’ concerned with corpora. Follow the instructions to join.

http://www.ltg.ed.ac.uk/helpdesk/faq/index.html

This includes some useful links connected to FAQs (Frequently Asked Questions) on corpora and corpus analysis tools and software. Check this site out as it may already contain answers to your questions.

http://www.telenex.hku.hk/telec/mainmenu.htm

Interesting site based in Hong Kong. Some parts have open access, some need registration – which can be done for free – and other parts are only open to teachers working locally.