The Department for Education is shortly to announce plans for a new GCSE in MFL. It has been publicly confirmed that its design is heavily influenced by the 2016 Pedagogy Review. We can therefore expect it to include a specified vocabulary list (i.e. words we should all teach and words according to which the exams will be designed), whose makeup is largely informed by word frequency.
“High frequency” words are widely defined to be the top 2000 words in a language. The words beyond this are known as “content vocabulary” or “lower frequency” words. It is certainly true that high frequency words are very important and very useful. Have a look at this graph:
This graphic shows us that the top 2000 words make up 80-90% of a given piece of text – spoken or written. That’s pretty impressive. In spoken language, it’s probably closer to 90%. So surely, if we know 90% of the words in a text, that gives us pretty decent comprehension and allows us to say most things we would ever want to say.
Well, sort of. The snag is, comprehensibility can’t be measured solely in terms of how much of a text we understand or how many of the words we know. (Not to mention the fact that ‘high frequency words’ aren’t necessarily the ones we might think are useful to our learners – but that’s the subject of another blog.)
Take this simple example: J’étais très heureux hier. It’s a 5 word sentence, but some words are more important than others. If I’m unsure of “très”, I can still get the crucial details: j’étais XXXX heureux hier. If I’m unsure of my tenses – perhaps I don’t know “étais” – I can still get the gist: je XXXX heureux hier. But the word “heureux” is absolutely vital: “j’étais très XXXX hier” tells me almost nothing. So I might know 90% of the words in a text, but still not understand very much of it at all if the 10% which is missing is crucial to the meaning.
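To make the arithmetic concrete, here is a minimal Python sketch of how a coverage figure is worked out. It assumes the sentence is counted as five words (with “j’” written out as “je”), and the `coverage` function is my own illustrative helper, not part of any vocabulary profiling tool.

```python
def coverage(tokens, known):
    """Share of the words in a text that appear in the set of known words."""
    return sum(1 for t in tokens if t in known) / len(tokens)

# Assumption: the sentence is counted as five words, with j' expanded to "je".
sentence = ["je", "étais", "très", "heureux", "hier"]

# Remove one word at a time from the learner's vocabulary and compare.
for missing in ["très", "heureux"]:
    known = set(sentence) - {missing}
    gapped = " ".join("XXXX" if t == missing else t for t in sentence)
    print(f"{gapped}: {coverage(sentence, known):.0%} coverage")
```

Both versions come out at the same coverage figure, even though one is perfectly understandable and the other is not. Coverage counts words; it doesn’t weigh how much meaning each word carries.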
Another good illustration of this is that the top four words in French – le (+ variants), de, à and un/une – account for 22% of all the words that appear in French text. So if you know those four words, you have 22% coverage. But you don’t have 22% understanding, you have 0% understanding. In fact, many of the top words are generic function words which don’t unlock what a text or utterance is really saying. And beyond the 800th or so word, each word is barely more frequent than the ones below it in the list: after about 800, each word achieves less than 0.01% coverage.
The real meaning of a text is often hidden away in the 10-20% of words that are lower frequency. If I only learn the top 2000 words of a language, I will be missing that crucial content vocabulary. That content vocabulary isn’t the optional extra bits, it is the crucial content – hence its name. Have a look at these examples. I’ve taken some from spoken language with people saying things that a learner might plausibly want to say, and some from news media. Have a look at one or two of them, see how much you can understand and then scroll down for my analysis.
Young people describing what job they’d like to do in French:
French speakers describing someone they know:
Germans telling us what they think of Berlin:
Headline news in New Zealand:
Headline news in Switzerland:
Headline news in Senegal:
Headline news in Germany:
Headline news in Germany:
As you can see, the missing 10-20% or so of words from the content vocab (i.e. low frequency ranges) are pretty critical to delivering the meaning. Typically we can find out that someone is doing something or that something is happening – but we are missing the what.
To be clear, I could have chosen passages which are a little more intelligible. As a general rule, the more generic a text, the more you can understand if your vocab is limited to the top 2000 words. For example, in texts where young people say they don’t know what job they want to do, you can understand virtually the whole thing. But as soon as the person says what they want to do, you’re more likely to need content vocabulary. Likewise, generic news about standard governmental affairs is pretty easy to understand if you have the top 2000 words, but with anything less political (such as a story about an individual event) you get stuck.
Cognates start to play a key role here and this becomes tricky for the ‘level playing field’ between languages. Take a text on Covid, for example. “Vaccination” and “déconfinement” are not high frequency words in French, but it’s pretty obvious what they mean. “Lockerung” and “Impfung”, though, are less clear. A strategy for comprehension in one language is not immediately transferable to another language. There’s also an issue with vocabulary profiling tools not being able to properly pick up separable verbs in German (“legen” and “zu” are clear as individual items, for example, but when combined as “zulegen” they form a wholly different word with a very distinct meaning).
When discussing the contribution made by high frequency words, it’s important to remind ourselves how those lists are determined. In other words, how do we know what is high frequency? Essentially what happens is that people ‘sample’ the language – take a bunch of texts from a bunch of sources – and count how often individual words appear, in order to arrive at a final list. Therefore, what you choose to include in your sample is significant – not just linguistically, but also politically.
- how much is written vs how much is spoken?
- which TL countries are included?
- what kinds of speakers and communities are reflected?
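As a rough illustration of why the sample matters, here is a small Python sketch of the counting process described above. The two mini “corpora” are invented for illustration only (they are not taken from any real frequency list), and `frequency_list` is a hypothetical helper rather than the method used by any published list.

```python
from collections import Counter
import re

def frequency_list(texts):
    """Count how often each word appears across the sampled texts,
    returning (word, count) pairs with the most frequent first."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"\w+", text.lower()))
    return counts.most_common()

# Invented mini-samples, purely for illustration.
parliament = ["le gouvernement propose une loi", "le ministre défend la loi"]
teen_chat = ["je veux être vétérinaire", "je veux jouer au foot"]

print(frequency_list(parliament)[:2])
print(frequency_list(teen_chat)[:2])
```

In this toy example, “loi” ranks near the top of the list built from the political sample and doesn’t appear at all in the list built from the young people’s sample. The same thing happens at scale: the list reflects whatever went into the sample.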
One of the reasons that generic political stories are very easily understandable if you learn the top 2000 words in French is that the list is informed by lots of political French in the first place: transcripts from the Canadian parliament, EU debates and press agency newswires. These are among the sources of the 23,000,000 words analysed to make the most commonly cited French list. Does it reflect what we have in mind for our learners when we teach French? That goes to the heart of a wider debate – namely, what our goals and aims are in the first place.
So what does all this mean?
It means that high frequency words are very important. We can’t learn or use a language without them. But it also means that in order to be able to communicate meaning, we need more than just high frequency words. Knowing only the top 2000 words doesn’t actually get us very far. We need the content vocab to actually talk about stuff, as opposed to just generics. 90% coverage might sound like a very impressive level of understanding of a text, but it’s not if the 10% of missing words are actually the ones where the meaning is conveyed – and that is generally the case.
This is perhaps what leads vocabulary expert Professor James Milton (Measuring Second Language Vocabulary Acquisition, 2009) to conclude that the “concern that foreign language course books in the UK contain an excessive quantity of infrequent items appears misplaced” and that an effective course “is probably going to introduce frequent and infrequent vocabulary in roughly equal amount. It will probably be thematically very diverse”. There is no evidence base for learning the high frequency words first, because it means that for a long time learners won’t be able to use or understand the language. Elsewhere (2010), Professor Milton identifies that only learning frequent words leaves learners “inevitably handicapped in terms of communicability and comprehension”. I.e. they can’t say or understand anything.
For curriculum designers this means they need to make sure the frequent stuff is well covered (which, in general, it naturally will be, precisely because it is frequent and therefore comes up all the time). But it also means that a sensible course would define some fields of study or lexical fields in which it is intended learners have a decent amount of the content vocab. How do we decide which lexical fields – i.e. themes – they should be? Well, perhaps there could be lots of optionality (as there is in History). And inevitably we’d have to be clear about what our curriculum goals are in the first place before we reach those decisions.