From corpus to dictionary

1. Where does the information in dictionaries come from?

A dictionary is a description of the vocabulary of a language. It explains what words mean, and shows how they work together to form sentences. But where do lexicographers – the people who write dictionaries – get their information from?

There are two main sources of information about words: introspection and observation.

  • introspection means ‘looking inside’ your own brain and trying to remember everything you know about a word
  • observation means examining real examples of language in use (in newspapers, novels, blogs, tweets, and so on), so that we can observe how people use words when they are communicating with one another

It’s obvious that a fluent speaker of a language must already know a lot about that language’s vocabulary. So introspection can be a useful source of insights about what words mean and how they are used. But a dictionary has to give a complete and well-balanced account of a word’s behaviour, and introspection alone can never provide enough information for this purpose. Consequently, lexicographers – since the time of Samuel Johnson in the 18th century – have preferred to base their dictionaries on observation. In Johnson’s time, observing language was a laborious business: it meant reading hundreds of books and extracting good examples of words in use. But today’s computer technology makes all this much easier. And it gives us access to so much good language data that we are now able to provide a really reliable account of English vocabulary.

2. Ways of observing language: ‘citations’ and the corpus

For over 250 years, lexicographers have used citations – examples of words in use, taken from books or other sources – as a basis for describing language. This example from our Buzzword archive, explaining the verb ‘to green’, includes citations from two US newspapers:

[Figure: Buzzword article on the verb ‘to green’]

This kind of data is particularly useful for keeping track of changes in the language, and for spotting new words and phrases as they come into use. Our sources have now broadened to include not just books and newspapers, but language used on the Internet too. So when our blog discussed the use of handbags as an adjective, most of the citations came not from ‘traditional’ media but from tweets and other postings on social networks.

Citations still have a useful role to play, but our main source of language data is the corpus. A corpus is a collection of thousands of different ‘texts’ stored on computer. These texts include novels, academic books and papers, newspapers, magazines, recorded conversations and broadcast interviews, blogs, online journals and discussion groups, and much more. The point of using a corpus is that we can’t observe all the English that is being used by millions (or even billions) of people all over the world, so instead we look at a representative sample of English texts. Using intelligent software (see below) we can find every example in the corpus of a particular word, phrase, grammatical pattern, or collocation. It is this information which forms the basis for everything we say about words in the dictionary.

3. Macmillan’s corpus resources

At Macmillan, we have two types of corpus: general and specialized.

Our general corpus includes a wide variety of informative and imaginative texts – ranging from academic books and journals, to popular and literary novels, to national and local newspapers. It now contains almost 1.6 billion words of written and spoken English – which means it is about eight times larger than the corpus we used when we created the first edition of the Macmillan English Dictionary ten years ago. This is the corpus we use most of the time.

The specialized corpora we use include:

  • the Macmillan Curriculum Corpus: a 20-million-word database made up of hundreds of school textbooks and examination syllabuses, covering school subjects from agriculture to zoology. We first used this when producing the Macmillan School Dictionary and Macmillan Study Dictionary
  • a 60-million-word corpus of Environmental Science – the first of a planned series of new corpora for specific domains
  • the Learner Corpora created by the Centre for English Corpus Linguistics (CECL) at the Université catholique de Louvain in Belgium. Macmillan’s collaboration with CECL is described below

4. How do you get information from a corpus?

Lexicographers use powerful computer programs to extract information from language corpora. The best-known type of software for analyzing a corpus is called a ‘concordancer’ – because it produces concordances, like this:

[Figure: A concordance for the word remember]

A concordancer looks through the whole corpus and finds every example of a particular word or phrase, then displays it with its immediate context – the seven or eight words on either side of it. The picture above shows a small sample of all the sentences in our corpus that contain the verb remember. The most important thing for us to identify is recurrent patterns: in other words, any feature which occurs not just once but many times. For example, the first line in this concordance says: I don’t remember seeing Santa come.

This is an example of the grammatical pattern where remember is used with a verb in the –ing form (or gerund). If you look carefully at the rest of the concordance, you can see two more examples of the same construction:  

I don’t remember wearing a suit at all.
I remember querying it a few years ago.

Reading the concordance above, it is easy to identify several other patterns which are typical of the way remember is used. These include:

  • when the verb is followed by a that-clause: Finally, please remember that URLs are case sensitive
  • the expression ‘…is worth remembering’: This is worth remembering if you suffer from cold hands.
  • when the verb is used with an infinitive: Remember to review success as well!
  • typical adverbs that are used with remember: He vaguely remembered a feeling of total happiness and yet now it was gone. | They barely remembered Mum, not like me.

By scanning hundreds (sometimes thousands) of examples like this we gradually build up a picture of the most important facts about a word like remember.  

However, this is very time-consuming. When lexicographers first started using corpus data, in the 1980s, corpora were relatively small, with just 10 or 20 million words. Consequently, the number of examples for a particular word (like remember) would also be fairly small – so it was possible to look at them all. But with today’s billion-word corpora, this is no longer true. The corpus we use at Macmillan contains 232,394 examples of the verb remember, and it would be impossible to study every one of them.
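The concordancing step described in this section can be illustrated with a few lines of Python. This is only a minimal sketch, not the software used at Macmillan: the function name concordance, the SAMPLE sentences (three of the examples quoted above) and the crude prefix matching used in place of proper lemmatization are all assumptions made for illustration.

```python
import re

# Three of the example sentences quoted above, standing in for a corpus.
SAMPLE = [
    "I don’t remember seeing Santa come.",
    "Finally, please remember that URLs are case sensitive",
    "He vaguely remembered a feeling of total happiness and yet now it was gone.",
]

def concordance(texts, keyword, width=8):
    """Return (left context, hit, right context) for every match of keyword.

    A token counts as a hit if, once surrounding punctuation is stripped,
    it starts with the keyword (so 'remembered' matches 'remember').
    """
    pattern = re.compile(rf"^{re.escape(keyword)}\w*$", re.IGNORECASE)
    lines = []
    for text in texts:
        tokens = text.split()
        for i, token in enumerate(tokens):
            if pattern.match(token.strip(".,;:!?\"'()")):
                left = " ".join(tokens[max(0, i - width):i])
                right = " ".join(tokens[i + 1:i + 1 + width])
                lines.append((left, token, right))
    return lines

# Print the hits aligned on the keyword, as a concordancer would display them.
for left, hit, right in concordance(SAMPLE, "remember"):
    print(f"{left:>40}  {hit}  {right}")
```

The width parameter controls how many words of context are kept on either side, matching the ‘seven or eight words’ mentioned above. On a billion-word corpus a real concordancer would rely on a pre-built index rather than scanning every text like this.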

5. Beyond the concordance

Fortunately, intelligent new software solves this problem of ‘information overload’. In addition to concordances we now look at ‘Word Sketches’, which provide an efficient one-page summary of all the key facts about a word. Here is part of a Word Sketch for the noun evidence – another very common word for which our corpus has about 300,000 different examples.

[Figure: Part of a Word Sketch for evidence]

How does this work? The program first collects all the examples of the word being investigated – just as a concordancer does. Then it applies a second stage of analysis. This time, the software looks at particular grammatical relationships. In the case of evidence, it finds all the sentences where evidence is the object of a verb, then identifies the most frequent verbs used in this pattern. These are the verbs listed in the first column of the Word Sketch above: people often talk (or write) about giving evidence, finding evidence, presenting evidence, or gathering evidence. Similarly, the column headed ‘a_modifier’ is a list of the adjectives that most frequently modify this noun: we may say there is little evidence for something, or talk about clear evidence, strong evidence, or scientific evidence. The blue number next to each word tells you how often each combination appears in the corpus: so the combination provide + evidence occurs 10,909 times. And clicking on this number brings up a concordance showing all the sentences in which evidence is the object of provide.
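The two-stage analysis described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the Word Sketch software itself: the TAGGED sentences are hand-made toy data, the part-of-speech tags and the function name word_sketch are hypothetical, and the ‘grammar’ is reduced to one rule (a verb standing directly before the noun phrase is treated as governing it). Only raw frequency is counted here.

```python
from collections import Counter

# Hand-tagged toy sentences as (lemma, part-of-speech) pairs. A real system
# would obtain these by running a tagger or parser over the whole corpus.
TAGGED = [
    [("they", "PRON"), ("provide", "VERB"), ("clear", "ADJ"), ("evidence", "NOUN")],
    [("police", "NOUN"), ("gather", "VERB"), ("the", "DET"), ("evidence", "NOUN")],
    [("she", "PRON"), ("provide", "VERB"), ("scientific", "ADJ"), ("evidence", "NOUN")],
    [("we", "PRON"), ("find", "VERB"), ("little", "ADJ"), ("evidence", "NOUN")],
    [("he", "PRON"), ("give", "VERB"), ("evidence", "NOUN")],
    [("they", "PRON"), ("provide", "VERB"), ("strong", "ADJ"), ("evidence", "NOUN")],
]

def word_sketch(tagged_sentences, noun):
    """Count verbs taking `noun` as object and adjectives modifying it."""
    sketch = {"object_of": Counter(), "a_modifier": Counter()}
    for sentence in tagged_sentences:
        for i, (lemma, tag) in enumerate(sentence):
            if lemma != noun or tag != "NOUN":
                continue
            # Walk leftwards over the noun phrase, collecting adjectives.
            j = i - 1
            while j >= 0 and sentence[j][1] in ("ADJ", "DET"):
                if sentence[j][1] == "ADJ":
                    sketch["a_modifier"][sentence[j][0]] += 1
                j -= 1
            # A verb directly before the noun phrase is counted as its governor.
            if j >= 0 and sentence[j][1] == "VERB":
                sketch["object_of"][sentence[j][0]] += 1
    return sketch

sketch = word_sketch(TAGGED, "evidence")
for relation, counts in sketch.items():
    print(relation, counts.most_common())
```

Each Counter corresponds to one column of the Word Sketch, and each count plays the role of the number shown next to a collocate.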

This software has made lexicographers’ lives easier, while at the same time supplying us with information which is more accurate and more detailed. Programs like this are now standard tools for lexicography, but the Word Sketch software was pioneered by Macmillan and used in producing the first edition of the Macmillan English Dictionary.

6. What kinds of information does the corpus provide?

Dictionaries don’t just tell you what words mean, they also explain how words are used. And the corpus provides us with the evidence to fulfil these two functions.

Meaning
Many words have more than one meaning, but it is almost always clear which meaning the speaker or writer intends. In these four sentences from the corpus, it is easy to see when the word goal is being used in its footballing meaning, or when it means an aim or objective:

But the referee spotted the foul, and disallowed the goal.

African leaders are seeking the support of the international community to achieve these goals.

He has made 137 appearances for United and scored 27 goals.

Teachers may use this information to help students set goals for themselves.

Just as in real conversations, we identify the ‘right’ meaning through the context the word appears in. By studying words in context, we discover how many different meanings they have.

Grammar
We saw how the concordance for remember tells us a lot about the grammatical patterns the verb is used in: with a gerund, a that-clause, an infinitive, and so on. Here again, the Word Sketches provide a useful shortcut by listing the most frequent ‘constructions’ – so we no longer need to scan hundreds of examples. Here is the list of grammar patterns in a Word Sketch for the verb decide:

[Figure: Grammar patterns in a Word Sketch for decide]

This shows that the most frequent pattern with decide is an infinitive clause (‘Vinf_to’: Three months after that they decided to terminate my employment on health grounds). There are 132,188 examples of this in the corpus, which is almost half of all cases where decide is used. The next most common pattern is with a that-clause (‘that_0’: They decided that surrender was the only sensible option), and so on.
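To make the idea of counting grammar patterns concrete, here is a minimal Python sketch. It is an illustration under stated assumptions rather than how the Word Sketch software works: the patterns are plain regular expressions instead of a grammatical analysis, the function name pattern_shares is hypothetical, and two of the four SENTENCES are invented (the other two are the examples quoted above). The labels ‘Vinf_to’ and ‘that_0’ are the ones used in the text.

```python
import re
from collections import Counter

# Surface patterns for two constructions, checked in this order.
PATTERNS = [
    ("Vinf_to", re.compile(r"\bdecid(?:e|es|ed|ing)\s+to\s+\w+", re.IGNORECASE)),
    ("that_0", re.compile(r"\bdecid(?:e|es|ed|ing)\s+that\b", re.IGNORECASE)),
]

SENTENCES = [
    "Three months after that they decided to terminate my employment on health grounds",
    "They decided that surrender was the only sensible option",
    "We decided to leave early",   # invented example
    "She could not decide",        # invented example
]

def pattern_shares(sentences):
    """Return the share of sentences showing each pattern (or 'other')."""
    counts = Counter()
    for sentence in sentences:
        label = next(
            (name for name, regex in PATTERNS if regex.search(sentence)),
            "other",
        )
        counts[label] += 1
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

print(pattern_shares(SENTENCES))
```

In this toy sample the infinitive pattern accounts for half of the sentences, which mirrors the kind of proportion reported above for the full corpus.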

Collocations
The Word Sketch software provides high-quality information about collocations, or words that have a tendency to go together. We can see this (above) in the list of verbs frequently used with evidence, and using this software means we can give a really comprehensive account of collocation for the first time. This is of great value to anyone for whom English is a second language, because collocation is a key to expressing your ideas in ways that sound natural and typical.

In this entry for the word importance, frequent collocations are shown in two ways:

  • in the examples, where there are sentences illustrating collocations like stress and emphasize the importance of something
  • in a Collocation Box, which lists the words (the ‘collocates’) that most often go with importance

[Figure: Dictionary entry for importance]

We used the same software for creating the Macmillan Collocations Dictionary, which provides an even more detailed description of how English words work together to form natural-sounding combinations.

Style and regional varieties
All the words we’ve looked at so far (remember, decide, evidence, importance) can be used in any situation: you might use them in a conversation, read them in a newspaper, or see them in an academic journal. They are what linguists call ‘unmarked’. But there are some words and expressions which are mainly found in one particular type of text: in spoken language, for example, or in newspapers or technical writing. Similarly, most English words are used all over the English-speaking world, but some belong to one particular regional variety of English, such as British English or Indian English.

Look at this sentence from the corpus:

These two distinct eateries say much about why Charleston has become a mecca for food-lovers.

Eatery is another word for ‘restaurant’ – but it is not ‘unmarked’. When we look at all the examples of eatery in the corpus we find that a majority come from newspapers and magazines, and most of these newspapers and magazines are from the U.S. So in the dictionary, the word eatery has two ‘labels’: mainly American and mainly journalism. It is the evidence of the corpus which enables us to apply labels like this with confidence.
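The reasoning behind such labels can be sketched in Python. This is a hedged illustration only: it assumes each corpus example carries metadata about its text type and region, the EATERY_HITS figures are invented, and the function name suggest_labels and the 50% threshold are hypothetical choices, not Macmillan’s actual criteria.

```python
from collections import Counter

# Invented metadata for ten corpus examples of 'eatery': (text type, region).
EATERY_HITS = (
    [("journalism", "American")] * 7
    + [("journalism", "British"), ("fiction", "American"), ("spoken", "British")]
)

def suggest_labels(hits, threshold=0.5):
    """Suggest a 'mainly X' label for each metadata field dominated by one value."""
    if not hits:
        return []
    labels = []
    for position in (0, 1):  # 0 = text type, 1 = region
        value, count = Counter(hit[position] for hit in hits).most_common(1)[0]
        if count / len(hits) > threshold:
            labels.append(f"mainly {value}")
    return labels

print(suggest_labels(EATERY_HITS))
```

A word whose examples are spread evenly across text types and regions gets no label from this function, which corresponds to the ‘unmarked’ case. In practice the lexicographer, not the program, makes the final decision.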

7. Frequency, and why it’s important

In language, the more frequent something is, the more useful it is to learn. The words ameliorate and improve mean more or less the same – but improve is about 250 times more common. It is worth learning improve (its meaning, grammar, and collocations) because it is part of the ‘core’ vocabulary of English: you will see and hear it frequently, and you will probably need to use it quite often too. Ameliorate is not like this: if you happen to come across it (which is unlikely, because it is very rare), you can look it up in a dictionary, but it is not worth wasting any energy on.

With a very large corpus, it is easy to identify not only which words are most frequent, but also which grammar patterns (like decide + infinitive) and which collocations (like crucial + importance). It is these frequent words and combinations which we explain in most detail in all the Macmillan dictionaries, and the distinction we make between ‘red’ words (the most frequent, core vocabulary) and ‘black’ words (everything else) is one of the unique features of our dictionaries.
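Frequency comparisons of this kind come down to simple arithmetic, sketched here in Python. The token list is made up and deliberately built so that improve is 250 times as common as ameliorate, echoing the ratio quoted above. The function name per_million is a hypothetical choice.

```python
from collections import Counter

def per_million(tokens):
    """Map each word to its frequency per million tokens."""
    counts = Counter(token.lower() for token in tokens)
    total = sum(counts.values())
    return {word: count * 1_000_000 / total for word, count in counts.items()}

# Made-up 1,000-token 'corpus': 250 x improve, 1 x ameliorate, 749 x the.
TOY_TOKENS = ["improve"] * 250 + ["ameliorate"] + ["the"] * 749

freq = per_million(TOY_TOKENS)
ratio = freq["improve"] / freq["ameliorate"]
print(f"improve is {ratio:.0f} times more common than ameliorate")
```

Normalizing to a rate per million words makes counts comparable between corpora of different sizes. The same counting can be applied to grammar patterns or collocations instead of single words.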

8. Using the corpus to find examples

Dictionary users appreciate example sentences. A good example is one that shows how a word works in context, and helps to explain what it means. An example for a word in a dictionary should be typical of the way the word is used in real life – so we use the corpus as a source of example sentences.

To see how the selection process works, look back at the entry for importance above. We use the Word Sketch and concordances to identify the facts about the word which are most worth including in the dictionary (and in this case, that includes various common collocations). But notice the first example:

By 1800, the monarchy had declined in importance.

We chose this example because in the corpus we find almost a thousand examples of the expression in importance with a verb in front of it. This means it is one of the typical features of the way importance is used. Further research shows that the verbs which occur in this position are usually words like increase, grow, and gain, or decline, decrease, or diminish.

Among the many examples of this pattern, we find several which illustrate the sequence: By [date] X had declined in importance.

One of these, for example, reads:
By the early 12th century, the monasteries, which had been the focal points of religious life, had declined in importance and the way was ready for the introduction of the diocesan system.

But this sentence is too long for the dictionary, and it contains a lot of unnecessary extra information. So we have changed the sentence a little, and shortened it to what you see in the dictionary:

By 1800, the monarchy had declined in importance.

This example is short and easy to understand – but it faithfully reflects the way importance is used in the corpus.
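The search behind this choice of example can be sketched in Python. This is a rough illustration under stated assumptions: the first sentence is the corpus example quoted above, the other two are invented, and the names words_before and shortlist are hypothetical. The regular expression simply captures whatever word precedes ‘in importance’, without checking that it is a verb.

```python
import re
from collections import Counter

PATTERN = re.compile(r"\b(\w+)\s+in\s+importance\b", re.IGNORECASE)

SENTENCES = [
    "By the early 12th century, the monasteries, which had been the focal "
    "points of religious life, had declined in importance and the way was "
    "ready for the introduction of the diocesan system.",
    "The port grew in importance during the war.",        # invented example
    "Coal has declined in importance since the 1950s.",   # invented example
]

def words_before(sentences):
    """Count the words that occur immediately before 'in importance'."""
    return Counter(
        match.group(1).lower()
        for sentence in sentences
        for match in PATTERN.finditer(sentence)
    )

def shortlist(sentences, max_words):
    """Keep only matching sentences short enough to consider as examples."""
    return [
        sentence
        for sentence in sentences
        if PATTERN.search(sentence) and len(sentence.split()) <= max_words
    ]

print(words_before(SENTENCES).most_common())
print(shortlist(SENTENCES, max_words=10))
```

The counts show which verbs are typical in this position, and the shortlist filters out sentences that are too long. The final editing of a sentence into a dictionary example remains a human task.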

9. Using the ‘learner corpus’: Macmillan and CECL

When we were developing the second edition of the Macmillan English Dictionary (MED), we also used a different type of corpus – the learner corpus – through our research collaboration with the Centre for English Corpus Linguistics (CECL) at the Université catholique de Louvain in Belgium. The MED was originally based on the World English Corpus, a unique modern corpus of over 200 million words; for the second edition, this was supplemented with CECL’s learner data.

Centre for English Corpus Linguistics

CECL, under its Director, Sylviane Granger, focuses on the development and exploitation of learner corpora. The text in a learner corpus consists of speech and writing produced not by native speakers but by people who are learning a language. CECL’s learner corpora include data from a worldwide mix of learners of English, and these provided a wealth of information for our lexicographers about common learners’ problems. This information has been used to provide help for learners by, for example:

  • highlighting and exemplifying correct collocations;
  • offering useful alternatives to over-used vocabulary items;
  • providing notes that point out the differences between easily-confused words;
  • offering specific warnings to alert learners to common errors.

This has led to the development of unique new materials to help learners improve their writing, in the form of the Improve your Writing Skills section in the centre of the dictionary, the Get it right boxes at individual headwords, and the exercises on the CD-ROM. The second edition of the Macmillan English Dictionary was the first dictionary to use learner data in this systematic way.

Would you like to know more?

For a short article about the CECL/Macmillan collaboration, see

M. Rundell & S. Granger, ‘From corpora to confidence’, English Teaching Professional, 50, 2007, pp. 15–18.

For a more detailed article on this subject, see

Gilquin, Gaëtanelle, Granger, Sylviane, & Paquot, Magali, ‘Learner corpora: the missing link in EAP pedagogy’, Journal of English for Academic Purposes, 6, 4, 2007, pp. 319–335.

For an article about Word Sketches and how they were used for the first time in dictionary-making, see

Adam Kilgarriff and Michael Rundell, ‘Lexical profiling software and its lexicographic applications – a case study’, in Proceedings of the Tenth Euralex Congress, Copenhagen, 2002, pp. 807–818. Available at http://kilgarriff.co.uk/publications.htm