Corpus

Using innovative software, lexicographers built the Macmillan English Dictionary (MED) on a modern corpus of over 200 million words – the World English Corpus. For the second edition of the MED, this corpus was extended through a collaboration with the Centre for English Corpus Linguistics (CECL) at the Université catholique de Louvain in Belgium.

What does the World English Corpus contain?

  • A general corpus – examples of real spoken and written English, taken from a variety of sources across the English-speaking world, which illustrate how and when a word is used.
  • A learner corpus – examples of the English written by advanced learners from all over the world, which highlight the most common problems that learners experience, and so help us to focus on the information they need.
  • An ELT corpus – examples of the English currently found in English language coursebooks and readers, which help to identify the core vocabulary that teachers and learners need in the classroom.

What is the CECL?

The Centre for English Corpus Linguistics (CECL), under its Director, Sylviane Granger, focuses on the development and exploitation of learner corpora. These corpora, which include data from a worldwide mix of learners, gave the dictionary writers a wealth of information about well-attested learner problems, in particular with words in the Macmillan English Dictionary’s top 7,500 list. This information has been used to help learners by, for example:

  • highlighting and exemplifying correct collocations;
  • presenting alternatives to frequently used core-vocabulary items;
  • providing notes that point out the differences between easily confused words;
  • offering specific warnings to alert learners to typical incorrect usage.

CECL’s research has provided unique material to help learners improve their writing, in the form of the Improve your Writing Skills section, the Get it right boxes at individual headwords, and the exercises on the CD-ROM.

What is unique about the CECL learner corpora?

The data in CECL’s learner corpora comes from untimed argumentative essays rather than from exam scripts. This gives a much fairer picture of what learners are capable of, because the exam situation tends to foster avoidance strategies: learners often use only the structures and vocabulary with which they feel confident. The information about learners’ writing that can be drawn from exam scripts is therefore limited.

The second edition of the Macmillan English Dictionary was the first ELT dictionary to use learner data in such a systematic way.

Would you like to know more?

  • For answers to more specific questions about the corpus, click on the Corpus FAQs link below.
  • For examples of how corpora, concordances and word-querying software are used in dictionary writing, see the How Dictionaries are Written section.

Corpus FAQs