Using innovative software, lexicographers have based the Macmillan English Dictionary on a unique modern corpus of over 200 million words created in the late 1990s – the World English Corpus. The second edition of the MED has added to this corpus through a collaboration with the Centre for English Corpus Linguistics at the Université catholique de Louvain in Belgium.
What does the World English Corpus contain?
A general corpus – examples of real spoken and written English, taken from a variety of sources across the English-speaking world, which illustrate how and when a word is used.
A learner corpus – examples of the English written by advanced learners from all over the world, which highlight the most common problems that learners experience, and so help us to focus on the information they need.
An ELT corpus – examples of the English currently found in English language coursebooks and readers, which help to identify the core vocabulary that teachers and learners need in the classroom.
What is the CECL?
The Centre for English Corpus Linguistics (CECL), under its Director, Sylviane Granger, focuses on the development and exploitation of learner corpora. These corpora, which include data from a worldwide mix of learners, have provided a wealth of information to the dictionary writers about well-attested learners’ problems, in particular with words in MED’s top 7,500 list. This information has been used to provide help to learners by, for example, highlighting and exemplifying correct collocations; presenting alternatives to frequently used core-vocabulary items; providing notes that point out the differences between easily confused words; and offering specific warnings to alert learners to typical incorrect usage. The CECL’s research has provided unique and exciting material to help learners improve their writing, in the form of the Improve your writing skills section and the Get it right boxes at individual headwords.
The Second Edition of the Macmillan English Dictionary is the first ELT dictionary to use learner data in such a systematic way.
How big is the corpus?
The corpus contains a total of around 220 million words of written and spoken text.
What are the major components of the corpus?
The World English Corpus is made up of the following:
What types of texts are included in the different corpora?
Academic discourse, print and broadcast journalism, fiction, recorded conversations (including telephone calls), recorded business meetings, general non-fiction, answerphone messages, emails, legal texts, academic seminars, cultural studies texts, radio documentaries, broadcast interviews, ELT coursebooks, and text written by learners of English, including essays and examination scripts.
What is the ratio of the written and spoken texts in the corpus?
The ratio is about 9:1 (written:spoken).
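As a rough worked example, taking the total of around 220 million words and the 9:1 ratio at face value, the written and spoken shares can be estimated as below. The function name `split_by_ratio` is purely illustrative, and the results are approximations derived from the two rounded figures, not published counts for each component.

```python
def split_by_ratio(total_words, written_parts=9, spoken_parts=1):
    """Estimate written and spoken word counts from a total and a ratio."""
    parts = written_parts + spoken_parts
    # Integer arithmetic keeps the two estimates summing to the total.
    written = total_words * written_parts // parts
    spoken = total_words - written
    return written, spoken


# Assumption: the "around 220 million words" total, treated as exact.
written, spoken = split_by_ratio(220_000_000)
print(f"written: about {written:,} words")
print(f"spoken: about {spoken:,} words")
```

On these assumptions, roughly 198 million words are written text and roughly 22 million are spoken.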
How was the corpus used in creating the dictionary?
The corpus is our primary source of information about the way words behave. It forms the basis of our description of word meanings and of the way words combine with each other (syntactically and collocationally). It also provides information about frequency – of words, meanings, grammatical patterns, and collocations. And finally, it is the main source of the example sentences shown in the dictionary.
What types of computer program were used to get information from the corpus?
Like most dictionary publishers, we used ‘concordancing’ software to investigate word behaviour and word patterning. In addition, the MED team used new, state-of-the-art ‘lexical profiling’ software, which gives us the most detailed and most reliable information about collocations that has ever been available.
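To give a feel for what this kind of software does, here is a minimal sketch in Python. It is not the software used by the MED team: the function names, the tiny sample text, and the simple neighbour counting are all assumptions made for illustration. A concordancer shows each occurrence of a word with its surrounding context, and the collocate count below is only a crude stand-in for lexical profiling, which in real tools relies on grammatical analysis and statistical measures rather than raw neighbour counts.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def concordance(tokens, keyword, width=3):
    """Return keyword-in-context lines: up to `width` tokens either side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines


def collocates(tokens, keyword, window=2):
    """Count the words found within `window` tokens of each keyword hit.

    Sentence boundaries are ignored here, which a real tool would not do.
    """
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == keyword:
            counts.update(tokens[max(0, i - window):i])
            counts.update(tokens[i + 1:i + 1 + window])
    return counts


# Hypothetical sample text standing in for a corpus.
sample = (
    "She made a strong case for reform. He drinks strong tea every morning. "
    "They felt a strong wind and drank strong tea."
)
tokens = tokenize(sample)

for line in concordance(tokens, "strong", width=2):
    print(line)

print(collocates(tokens, "strong", window=1).most_common(3))
```

Even on this toy sample, "tea" turns up next to "strong" more than once, which is the kind of recurring pairing that collocation information in a dictionary is based on, although a real corpus gives evidence on a vastly larger scale.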