by Michael Rundell
As long ago as 1933, British linguist Harold Palmer wrote that:
(Quoted in Cowie 1999: 52-53)
These ‘comings-together-of-words’ encompass the whole area of phraseology, and Palmer – a pioneer in the field of English language teaching – was one of the first people to identify phraseology as a key to fluency. Working in Japan in the 1920s and 1930s, Palmer developed a particular interest in collocation, and his Second Interim Report on English Collocations (1933) was extremely influential. The central importance of collocation became even clearer when the arrival of large language corpora revealed the extent to which writers and speakers depend on ‘prefabricated chunks’ of language. As explained by John Sinclair, the father of corpus linguistics:
(Sinclair 1991: 110)
Collocations are ‘semi-preconstructed phrases’ which allow language users to express their ideas with maximum clarity and economy. Not only that, there is a strong correlation between frequency in a corpus and typicality, which means that the use of common collocations contributes to the naturalness of a text.
You can look at collocation from a purely statistical point of view, using a large corpus to find the most common combinations of words. But frequency alone doesn’t tell the whole story. The most frequent two-word unit in English is ‘of the’, while fairly predictable adverb / adjective combinations (‘very good’, ‘more likely’, ‘most important’, and so on) are also, in statistical terms, very significant. However, just as important as frequency is the semantic function of collocations. Users need to be able to answer questions like: ‘how do I express the idea of ‘doing’ this noun?’ (in other words, which verb should I use with nouns like crime, benchmark, proposal, or research?); or ‘how can I add the idea of ‘very’ to this adjective’ (to create combinations like highly improbable, unbearably hot, or bitterly disappointed)?
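The purely statistical view can be illustrated with a minimal sketch that counts adjacent word pairs. The function name `bigram_counts` and the sample text are invented for illustration; real corpus software works on lemmatized, part-of-speech-tagged text rather than raw strings.

```python
from collections import Counter
import re

def bigram_counts(text):
    """Count adjacent word pairs, ignoring case and punctuation.

    Simplification: pairs are also counted across sentence boundaries.
    """
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(zip(words, words[1:]))

# Invented sample text, not taken from any real corpus.
sample = (
    "The cost of the repairs was very good news for the owner of the house. "
    "Most of the work is more likely to finish early."
)

counts = bigram_counts(sample)
for pair, freq in counts.most_common(3):
    print(" ".join(pair), freq)
```

Even in this tiny sample the top pair is a grammatical one (‘of the’), which is why raw frequency needs to be supplemented by a view of what each combination means.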
Writers like Jennifer Jenkins, who champion the idea of ‘English as a Lingua Franca’ (ELF), often suggest that the more ‘idiomatic’ aspects of English – such as slang, cultural allusions, phrasal verbs, or fixed phrases – are unhelpful and unnecessary in an ELF context and should be avoided. But it is important to recognize that collocation – though an aspect of idiomatic English – is not an ‘optional extra’. Collocation is not a more elaborate aspect of language which can be dispensed with when a simpler form of discourse is required. On the contrary, it is central to meaning. Consider the adjective sick, which has a number of meanings, including:
- 1. vomiting or wanting to vomit
- 2. ill
- 3. fed up with something
Each of these meanings is expressed through different collocations, as these examples from our corpus show:
- 1. I had never drunk alcohol before and I became violently sick.
- 2. Volunteers are trained to provide care for chronically sick people in their homes.
- 3. We are all heartily sick of their negative attitude.
In each case, it is the collocation that indicates to the listener or reader which meaning of sick is intended. As the linguist J.R. Firth famously said, you really do ‘know a word by the company it keeps’. A good grasp of collocation contributes to our understanding of what words really mean.
A lot of information about collocation can be found in good learners’ dictionaries like the Macmillan English Dictionary (MED). There is a great deal of relevant data to be found on websites too. For example, the website wordfrequency.info allows you to download lists with ‘the top 200-300 collocates for each of the [most frequent] 20,000 words’. And the Just the Word website delivers about two pages’ worth of collocates for the word impression.
With all this data available, what is the point of a specialized dictionary of collocations? The simple answer is that general learners’ dictionaries don’t provide enough information, and websites provide far too much. MED includes over 500 ‘collocation boxes’, and highlights strong collocations at hundreds of other entries. But even the best general dictionary cannot provide the range and depth of information that a dedicated collocations dictionary can. As for the websites, what you tend to get is very long, unstructured lists of collocations which only make sense if you read them while consulting another dictionary.
The fact is that identifying collocates is not difficult: with today’s corpus-querying software, applied to a large corpus, it’s easy to find the commonest ways in which a given word tends to combine. But what you get is ‘raw data’, often in overwhelming quantities. In order for it to be genuinely useful, it needs to be categorized, filtered, and interpreted by skilled editors who understand the needs of advanced learners – and this is what we have done in the Macmillan Collocations Dictionary (MCD).
At the end of the 1990s (when the first edition of MED was being compiled), Macmillan lexicographers were the first users of a new software tool which analyzed corpus data to produce ‘Word Sketches’. A Word Sketch is a one-page summary of a word’s collocational behaviour, and this was the program we used for producing the 500 or so ‘collocation boxes’ in MED. Ten years on, the software has become more powerful, and the corpus we are using is 10 times larger. So it was an improved and customized version of the Word Sketch software that was our main resource for identifying collocations for MCD. Click here for a Word Sketch for impression.
Entries in MCD are structured according to the grammatical relations which the headword forms part of. So if the headword is a noun (N), typical relations include v+N (verbs that have the noun as their object) and adj+N (adjectives that frequently modify the noun). As you can see, the columns in the Word Sketches correspond to these relations: the first column (v+N) lists the most frequent verb collocates of impression (words like give, make, create and convey), while the third column lists the adjectives. Notice that some of these collocates already have a tick next to them: this is because the entry in MED lists them in its ‘collocation box’.
To find this information, the software has analyzed a corpus of 1.6 billion words of English, and looked at 44,396 different instances of the word impression. The blue underlined numbers show the frequency of each collocation. So for example, the number 354 shown next to convey in the first column means that the corpus includes 354 sentences in which impression combines with convey. Clicking on this number produces a ‘concordance’ for this particular collocation, as this screenshot shows.
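A concordance of the kind described here can be approximated in a few lines. This is only a sketch: the function `concordance` and the three sentences are invented for illustration, and it matches exact word forms, whereas real corpus tools match lemmas (so that conveys and conveyed would also be found).

```python
import re

def concordance(sentences, node, collocate, width=30):
    """Return keyword-in-context lines for sentences containing both words."""
    lines = []
    for sentence in sentences:
        tokens = set(re.findall(r"[a-z]+", sentence.lower()))
        if collocate not in tokens:
            continue
        match = re.search(rf"\b{re.escape(node)}\b", sentence, flags=re.IGNORECASE)
        if match is None:
            continue
        # Keep a fixed window of context on either side of the node word.
        left = sentence[:match.start()][-width:]
        right = sentence[match.end():][:width]
        lines.append(f"{left:>{width}} [{match.group(0)}] {right}")
    return lines

# Invented example sentences.
sentences = [
    "The report tries to convey the impression that sales are rising.",
    "She made a good impression at the interview.",
    "Photographs rarely convey an accurate impression of scale.",
]

kwic = concordance(sentences, "impression", "convey")
for line in kwic:
    print(line)
```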
Next to the blue underlined numbers in the Word Sketch, a second number is shown in black: for convey, this is 7.77. This is a measure of probability, and it gives an indication of how ‘strong’ the collocation is. The higher the number here, the stronger the collocation. There is some complicated maths behind all this, but the important point is that, although convey+impression (with 354 examples) occurs less often than make+impression (with almost 4,000), it is actually a stronger collocation. Make is such a common word that it is likely to appear frequently with other words anyway – but convey is far less common, so the fact that it occurs so often with impression is significant.
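The reasoning about make and convey can be made concrete with a simple association score. The sketch below uses pointwise mutual information as a stand-in; it is not necessarily the statistic behind the Word Sketch scores. The corpus size, the frequency of impression and the 354 co-occurrences come from the text above; the overall frequencies of convey and make, and the figure of 3,900 for make+impression, are hypothetical values chosen only to illustrate the effect.

```python
import math

def pmi(pair_freq, word_freq, node_freq, corpus_size):
    """Pointwise mutual information (base 2) for a word pair.

    Compares how often the pair occurs with how often it would be
    expected to occur if the two words were independent.
    """
    expected = word_freq * node_freq / corpus_size
    return math.log2(pair_freq / expected)

CORPUS_SIZE = 1_600_000_000   # from the text
IMPRESSION = 44_396           # from the text

# Word frequencies for convey (60,000) and make (3,000,000) are hypothetical,
# as is the pair frequency 3,900 ("almost 4,000" in the text).
convey_score = pmi(354, 60_000, IMPRESSION, CORPUS_SIZE)
make_score = pmi(3_900, 3_000_000, IMPRESSION, CORPUS_SIZE)

print(f"convey + impression: {convey_score:.2f}")
print(f"make + impression:   {make_score:.2f}")
```

Under these assumed figures convey scores higher than make, even though the pair itself is about ten times less frequent, because make is so common that many of its co-occurrences are expected by chance.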
Which words should a collocations dictionary include? Selecting the headwords for a dictionary is a mixture of science and art. In this case, the science is a statistical measure which tells us the ‘collocationality’ of words (for technical details, see Kilgarriff 2006). To explain this, think of the difference between words which have a small number of very strong collocations, and words which combine freely with many other words. At one end of the scale are words like wreak or unmitigated. There are not many things you can ‘wreak’ – the four most frequent objects are destruction, revenge, vengeance, and (10 times more common than all the others) havoc. (Unmitigated collocates strongly with disaster – and not much else.) At the other end of the scale are words like house, buy and good, which have no strong collocates at all: just about any word can (and does) combine with words like these, as long as the combination makes sense.
Using this analysis, we can say that wreak is a ‘very collocational’ word, and when there is such a narrow range of collocates, we tend to treat the combination as a fixed phrase. Words like house, on the other hand, are not ‘collocational’ at all. So we conclude that the words which users of a collocations dictionary need to know about are the ones somewhere in the middle of this spectrum. Consequently, MCD has a smaller (but more carefully selected) headword list than other collocations dictionaries: it describes the collocational behaviour of about 4,500 words – and these are the words that are truly ‘collocational’.
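One way to picture such a spectrum is to measure how concentrated a word’s collocates are. The sketch below uses Shannon entropy for this purpose; it is an illustrative assumption, not a reproduction of the measure used for MCD (see Kilgarriff 2006 for that). All the counts are hypothetical.

```python
import math

def collocate_entropy(counts):
    """Shannon entropy (in bits) of a word's collocate distribution.

    Low entropy: usage is concentrated on a few collocates.
    High entropy: usage is spread across many collocates.
    """
    total = sum(counts.values())
    return -sum(
        (c / total) * math.log2(c / total)
        for c in counts.values()
        if c > 0
    )

# Hypothetical object counts for the verb "wreak".
wreak = {"havoc": 1000, "destruction": 40, "revenge": 35, "vengeance": 25}

# Hypothetical verb counts for the noun "house".
house = {"buy": 120, "build": 110, "sell": 100, "leave": 95,
         "enter": 90, "clean": 85, "paint": 80, "rent": 75}

print(f"wreak: {collocate_entropy(wreak):.2f} bits")
print(f"house: {collocate_entropy(house):.2f} bits")
```

With these invented figures wreak comes out with much lower entropy than house, which matches the intuition that its usage is dominated by a single collocate. A practical measure would also need to allow for differences in overall word frequency and in the number of collocates observed.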
With the customized Word Sketches, huge corpus, and ‘collocationality’ measure, the Macmillan team has continued its record of working with the most advanced computational tools. This ensures that the data we extract from the corpus is of the highest quality. But this is just stage one in the dictionary-making process: the raw data now needs to be analyzed and interpreted, and this is a job for our team of skilled lexicographers – our most important resource.
For us as lexicographers, creating a new collocations dictionary was an exciting challenge. Whereas the ‘advanced learner’s dictionary’ is a well-established category, with a number of standard features (they are all around the same size and all, for example, use a limited ‘defining vocabulary’), there is no fixed model for what a collocations dictionary should look like, or what range of information it should include. So we have the luxury of starting with a blank canvas.
As noted above, MCD focuses on core vocabulary items which have a good range of strong collocates. This means we have fewer entries than other collocations dictionaries: we prefer to give full coverage to a smaller number of words rather than include words like food, house, goalkeeper, or flea which present no real collocational problems. The vocabulary we cover is geared especially to the productive needs of people working in academic or professional environments, who often have to write assignments, reports, or reviews in English. Most of the words in the Academic Word List are included in MCD, and the list of words in A includes items such as:
People whose professional lives require them to operate in English are equally well served, with full collocational descriptions of words such as:
But we also cater for the needs of people learning English at advanced level (as opposed to those using English in their studies or professional life), and MCD includes the kind of vocabulary you might need for writing narrative or descriptive prose. These include words about people and their appearance (such as face, eyes, hair, complexion, smile), the weather (sun, rain, snowfall, mist, breeze), or emotions (sadness, delight, envy, grief).
All of this makes MCD an especially useful resource for anyone studying for the IELTS exam. Indeed, Sam McCarter, IELTS expert and author of Ready for IELTS, has described the dictionary as ‘tailor-made to meet the needs of students preparing for this exam’. With its focus on competence in using language, IELTS requires students to be able to ‘process’ language at speed while engaged in a range of tasks – whether they are writing or speaking, reading or listening. In all cases, a good grasp of collocations brings clear benefits. Recognizing typical collocations, and being able to produce them in speech or writing, is essential to fluency and naturalness, so using MCD in preparing for IELTS will present a real advantage.
The structure of an entry in the Macmillan Collocations Dictionary
- adj+N: adjectives that often collocate with assumption (such as basic, incorrect, or implicit)
- v+N: verbs that often have assumption as their object (such as make, challenge, or reject)
- N+v: verbs where assumption is often the subject (such as underlie or underpin)
But dividing collocates by grammar alone is not sufficiently helpful. It is easy to see that adjective collocates like mistaken, reasonable, and fundamental convey quite different messages. So we make a further division, putting together collocates with similar meanings into separate groupings. The second grouping in the adj+N set, for example, focuses on the idea that an assumption is wrong, and lists six collocates which you can use to express this idea. For maximum clarity, each of these ‘semantic groupings’ gets its own short definition. So if you want a verb collocate to express the idea of ‘questioning’ an assumption, go straight to the v+N grouping with the definition ‘question an assumption’, to find verb collocates like challenge or examine.
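The two-level organization described above (grammatical relation first, then semantic grouping) can be pictured as a small data structure. This is an illustrative sketch only, not the dictionary’s actual data format: the class names, the `find` method and the definition wording ‘wrong’ are hypothetical, and only collocates mentioned in this article are included.

```python
from dataclasses import dataclass, field

@dataclass
class SemanticGroup:
    definition: str          # short definition shown above the group
    collocates: list[str]

@dataclass
class Entry:
    headword: str
    # Maps a grammatical relation such as "v+N" to its semantic groupings.
    relations: dict[str, list[SemanticGroup]] = field(default_factory=dict)

    def find(self, relation, keyword):
        """Return collocates from groups whose definition mentions keyword."""
        return [
            collocate
            for group in self.relations.get(relation, [])
            if keyword in group.definition
            for collocate in group.collocates
        ]

# Partial, illustrative entry; the real entry lists more collocates.
assumption = Entry(
    headword="assumption",
    relations={
        "adj+N": [SemanticGroup("wrong", ["incorrect", "mistaken"])],
        "v+N": [SemanticGroup("question an assumption", ["challenge", "examine"])],
        "N+v": [SemanticGroup("form the basis of something", ["underlie", "underpin"])],
    },
)

print(assumption.find("v+N", "question"))
```

Looking up by relation and then by definition mirrors the way a user would navigate the printed entry: first choose the grammatical pattern, then the meaning to be expressed.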
As explained earlier, today’s corpus software makes it easy for us to identify the most important collocates of a word like assumption. But what makes MCD unique is the way these collocates are organized and presented for maximum ease of use.
A high percentage of useful collocations occur in one of four key grammatical relations:
- v+N (commit a crime, conduct a survey)
- adj+N (detailed analysis, compelling argument)
- adv+V (criticize severely, correlate closely)
- adv+ADJ (highly influential, historically accurate)
But a special feature of MCD is that it includes a number of less obvious (but nevertheless very useful) types of relations too, such as:
- collocates following prepositions, like recipe + for (disaster, confusion, anarchy etc), poised + for (success, growth, expansion etc), regard + with (affection, respect, favour etc)
- collocates in the form of infinitive verbs, in combinations like reasonable + to (assume, conclude, believe), rash + to (argue, predict, deny)
- ‘and / or’ collocates: the corpus shows that some words often occur in combination with other words of the same type, to create collocations like illiteracy and/or (poverty, disease, ignorance) and imaginative and/or (creative, exciting, bold). This is an interesting feature of language, and using combinations like this contributes to the naturalness of any text.
The most striking of these innovations is that MCD includes collocations where verbs or adjectives go with distinctive sets of nouns, to produce combinations like these:
- allocate (resources, funding, budget)
- exercise (caution, restraint, discretion)
- arbitrary (arrest, detention, imprisonment)
- exhaustive (analysis, review, survey)
It could be argued that these combinations are ‘the wrong way around’: that is, that a user will want to know which verb to use with nouns like caution or restraint, but is less likely to ask: ‘what kinds of thing can you exercise?’. But entries like those at exercise and arbitrary have a valuable language-awareness function: it is the collocates that help to explain what words like this really mean. Once again, we are reminded of Firth’s dictum that you ‘know a word by the company it keeps’.
Finally, MCD includes two types of usage note which are designed to provide information about:
- colligation
- alternatives to collocation
‘Colligation’ refers to a tendency some words have to appear in a particular form. For example, there are verbs that have a strong ‘preference’ for appearing in the passive, while some nouns are most often used in the plural. Colligation is explained by Michael Hoey, the head of Macmillan Dictionaries’ board of advisors (see Hoey 2005), and it is an important feature of natural, idiomatic language. So we use special notes, shown with a pink background, to highlight cases like this. There are two notes of this type at the entry for critic, for example.
In the other type of note (which has a grey background) we draw attention to very common phrases which can be used as an alternative to using a collocation. For example, the entry for conclusion includes a set of collocates for saying that someone has reached the wrong conclusion (adjectives like erroneous, false, and incorrect). But the note tells us that writers and speakers are equally likely to express this idea by saying that someone is jumping to conclusions.
The Macmillan Collocations Dictionary is a carefully researched, corpus-based dictionary, powered by leading-edge language-analysis software. Though designed to meet the productive needs of users and learners of English, and particularly valuable for anyone preparing for the IELTS examination, MCD provides a comprehensive and user-friendly account of the collocational behaviour of the core vocabulary of English. As such, it will be of interest to anyone who teaches, studies or enjoys using the English language. With its many innovative features, it is a welcome addition to the Macmillan range of dictionaries.
Lexical Priming: A New Theory of Words and Language, M. Hoey (London: Routledge, 2005)
The Phonology of English as an International Language, J. Jenkins (Oxford: OUP, 2000)
Collocationality (and how to measure it), in Proceedings of the 12th EURALEX International Congress, A. Kilgarriff (Torino: Edizioni Dell’Orso, 2006)
Corpus, Concordance, Collocation, J. Sinclair (Oxford: OUP, 1991)