corpus (pl. corpora)
a collection of written or spoken material stored on a computer and used to find out how language is used
From the Cambridge English Dictionary online
I’ve been interested in corpora for a while now, but never seem to have time to go beyond my very basic understanding of how the Brigham Young University corpus interface works. I’ve always used it for the BNC (British National Corpus), which covers 1980-1993, but discovered a few seconds ago (!) that COCA (Corpus of Contemporary American English) is constantly updated, so I think I’ll be switching to that from now on!
All I knew before was how to do a basic search for a term and how to look for collocates, possibly with a verb or noun near the key word if I was feeling very adventurous. Thanks to three talks I attended on different versions of corpora during the conference, I now feel like I know much more! 🙂
Jennie Wright did a very practical session introducing us to the basic functions of COCA, with three activities you can take straight into the classroom. Mura Nava, the master of corpora, helpfully collected my tweets from the session (and added notes to make it clearer – thanks!) which show all three activities, and Jennie has shared the list of corpora resources on her blog. She particularly recommended COCA Bites, a series of very short YouTube videos designed to introduce you to the corpus.
One thing I particularly like about COCA is the fact that parts of speech are highlighted in different colours. Here’s an example of a KWIC (key word in context) search for ‘conference’, giving concordance lines with the key word in a single column (a function Jennie taught me!).
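If you're curious what a KWIC display is doing under the bonnet, here is a minimal Python sketch. It is not how COCA itself works: the function name `kwic`, the `width` parameter and the sample text are all made up for illustration. It simply finds each occurrence of the key word and pads the left-hand context so that the key word lines up in a single column.

```python
import re


def kwic(text, keyword, width=30):
    """Return concordance lines with the keyword aligned in one column.

    Hypothetical helper for illustration; not part of any corpus tool.
    """
    lines = []
    # Match the keyword as a whole word, ignoring case.
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so every keyword starts in the same column.
        lines.append(f"{left:>{width}} {match.group(0)} {right:<{width}}")
    return lines


# Invented sample text, mixing two senses of 'conference'.
sample = (
    "The conference opened on Monday. I met many teachers at the conference, "
    "and the team won its conference title last year."
)

for line in kwic(sample, "conference"):
    print(line)
```

A real concordancer also sorts the lines by the words to the left or right of the key word and colour-codes parts of speech, which this sketch doesn't attempt.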
James Thomas taught us how to answer language questions from corpora, focussing on the SKELL (Sketch Engine for Language Learning) concordancer (thanks for correcting that James!). I didn’t realise that SKELL was created by the people at Masaryk University, in (one of) my second home(s) Brno 🙂 Again, Mura collected the tweets, this time by me, Leo Selivan (another corpus master) and Dan Ruelle.
What makes SKELL different to many corpora is that it uses algorithms to select 40 sentences from however many the search finds, discarding as many as possible that contain obscure words or are overly long, to make it easier for learners to use. This works well for common words, but not always for slightly more obscure words, like ‘mansplain’ (possibly the word of the conference, thanks to David Crystal’s opening plenary!). You can also use the ‘word sketch’ function on the corpus to show you lots of collocates, a function I think I will now use instead of a collocations dictionary! Michael Houston Brown has a very clear introduction to SKELL on Mura’s eflnotes blog.
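I don't know the details of how SKELL actually chooses its sentences, so the following Python is only a rough sketch of the general idea: skip sentences that are too long or that contain words outside a list of common words, and stop once you have 40. The function `select_examples`, the `common_words` set and the `max_words` threshold are my own assumptions, not anything taken from SKELL.

```python
def select_examples(sentences, common_words, max_words=20, limit=40):
    """Keep short sentences made only of common words, up to `limit` of them.

    Illustrative only: the thresholds and the word list are assumptions.
    """
    selected = []
    for sentence in sentences:
        # Crude tokenisation: split on spaces and strip surrounding punctuation.
        words = [w.strip(".,!?;:'\"").lower() for w in sentence.split()]
        words = [w for w in words if w]
        if len(words) > max_words:
            continue  # overly long sentence
        if any(w not in common_words for w in words):
            continue  # contains a word that isn't on the common list
        selected.append(sentence)
        if len(selected) == limit:
            break
    return selected


# Invented word list and sentences for demonstration.
common_words = {"the", "conference", "was", "very", "good", "i", "enjoyed"}
sentences = [
    "The conference was very good.",
    "I enjoyed the sesquipedalian conference.",
    "The conference was very very very very good.",
]

print(select_examples(sentences, common_words, max_words=6))
```

With a filter like this, a rarer search word has far fewer 'clean' sentences to choose from, which may be part of why the results are less reliable for words like ‘mansplain’.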
One slight problem, as with all corpora, is that it cannot distinguish between different senses of the same word, which may confuse learners. In this example, ‘conference’ is listed both in the sense of the IATEFL conference and in the sense of a sporting league. This could also be seen in the COCA image above, but I think it is easier to spot here.
If you’d like to find out more, James has recently written an article for the Humanising Language Teaching magazine.
Making your own corpus
Chad Langford and Joshua Albair are clearly die-hard corpus fans. They trawled through over one million words from over 8,000 TripAdvisor restaurant reviews to create their own corpus of review language. The findings were very interesting and showed up some clear features of the genre, but I’m not sure how practical it would be for most teachers to do this kind of project as anything other than a hobby. They’re based at Lille University, but they didn’t say how much of their time was dedicated to this project versus teaching, or how many groups they used it with, so it was difficult to work out the return on their investment of time. Nevertheless, it was very interesting to see how you go about building a corpus. Again, thanks to Mura for collating my tweets with more information in them.
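To give a flavour of what building a small corpus involves once the texts are collected, here is a minimal Python sketch. It is not Chad and Joshua's method, which they didn't describe at this level of detail; the two 'reviews' are invented, and the function names `tokenize` and `build_corpus_stats` are hypothetical. It counts how often each word and each pair of neighbouring words appears, which is the raw material for spotting typical features of a genre.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def build_corpus_stats(reviews):
    """Count single words and adjacent word pairs across all the reviews."""
    word_counts = Counter()
    bigram_counts = Counter()
    for review in reviews:
        tokens = tokenize(review)
        word_counts.update(tokens)
        # Pair each token with the one that follows it.
        bigram_counts.update(zip(tokens, tokens[1:]))
    return word_counts, bigram_counts


# Invented examples standing in for real restaurant reviews.
reviews = [
    "The food was delicious and the staff were friendly.",
    "The food was cold but the staff were friendly.",
]

words, bigrams = build_corpus_stats(reviews)
print("Total words:", sum(words.values()))
print("Most frequent pairs:", bigrams.most_common(3))
```

Scaling this up to a million words is mostly a question of collecting and cleaning the texts, which is presumably where the real investment of time goes.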
Mura also collated tweets for one more corpus-related talk at IATEFL, based on the English Grammar Profile. Cambridge have recorded all of their talks from the conference, including this one, so you can watch it at your leisure. Mura also has a free ebook with examples of the BYU-COCA corpus interface.
There are interviews with some of the presenters of corpus talks at this year’s IATEFL, including James, Chad and Josh, on Mura’s blog. This list of talks shows everything connected to corpora from this year’s conference.