Corpus Insider #1: Representativeness



As I was putting together my talk and the follow-up blog post, I realized that over 20 years of using corpora, there are a whole host of factors I’ve learnt to take into account. I touched very briefly on a few of them in my talk, but I thought some might be worth exploring further. So, this is the first in a series of posts about things you might need to bear in mind if you want to use corpus tools to inform your work on ELT materials.

When I explain to people what a corpus is, I usually start off by saying that it’s a large collection of language that we use to represent the way English is used as a whole. It seems a simple premise, but when you dig a bit deeper, it gets more complicated. As with any research, the validity of your results depends on the quality of your data. The data your chosen corpus contains will determine exactly what kind of language it can really be said to represent and so how useful it is for your purpose. To take a simple example, if you were writing for a specifically British English market, using a corpus that contained only American data wouldn’t be very useful. Similarly, if you were working on speaking materials, looking at usage in a corpus of only written data wouldn’t really tell you much about how people typically speak. Understanding a bit about the corpus you plan to use, the data it contains, and what that might represent is absolutely essential before you start doing any corpus research.

Corpus types:
There are two main types of corpus: those which contain data drawn from one type of source or genre, and those which are said to be ‘balanced’ and contain data from a wide variety of different genres. The first type includes purely spoken corpora (like the Spoken BNC2014), corpora of academic writing (either published texts, like the academic part of COCA, or student writing, like BAWE or MICUSP) and the many corpora composed largely of journalism. Journalism is one of the easiest sources of data to collect, especially for those corpora that rely on web-based content (e.g. Monco, NOW, etc.).

Large balanced corpora, containing written and spoken data from a wide range of sources, are much more difficult to put together. For this reason, they’re mainly owned and maintained by large publishers, especially those who produce dictionaries, and aren’t publicly available. The British National Corpus (BNC) is a balanced corpus that’s freely available, but it’s relatively small by modern standards and, perhaps more importantly, it’s becoming increasingly out of date (with data from the 1980s and 90s). The Corpus of Contemporary American English (COCA) sits in a mid-ground with data from spoken sources (although all radio transcripts rather than everyday conversation), fiction, popular magazines, newspapers and academic texts. It’s reasonably balanced, although all American, as the name suggests.

The trouble with media hype:
Data from newspapers, magazines and blogs is very easy to collect and makes up a large proportion of many corpora. It can provide lots of interesting information about language used to talk about a wide range of topics, but it’s important to remember that journalism as a genre has its own quite marked features that don’t necessarily reflect the way that ordinary people use language day to day. It may seem obvious to say that journalists report news, but that means they’re more often than not writing about what’s new, surprising, shocking or problematic. They also want to draw their readers in and keep their attention with colourful language choices and hyperbole. For my recent talk, I demonstrated an example of a query about the language of social media and, in particular, which verbs collocate with the noun ‘newsfeed’. I used the Monco corpus, because I was interested in up-to-date usage, and came up with the following verbs:

scroll through your newsfeed
pop up on your newsfeed
fill/flood/dominate/clog up your newsfeed


The first two feel like expressions you might use in conversation; the others, however, are clearly journalistic in style, bemoaning the way that a particular trend is taking over our online lives. Searching a couple of other news-dominated corpora came up with similar results (enTenTen: spam/clutter/bombard/clog your newsfeed; NOW: scroll through/appear on/pop up on/tweak/flood/clog your newsfeed). They’re all interesting collocations, but they’re probably not the first ones you’d choose to teach an intermediate learner who wants to talk about the way they use social media themselves. That’s not to say you shouldn’t use these corpora when you’re researching ideas for ELT materials, but knowing a corpus contains only or largely data from journalistic sources means that you can be on the lookout for this type of language and be selective about what you use as appropriate for the learners you’re writing for.


Professional and lay writers:
Unsurprisingly, the majority of written corpus data comes from published sources and, as such, it’s written by people who are professional writers: authors, journalists, copy-writers. As we saw with journalism, above, this can mean the language is more colourful and probably more varied than the average lay person typically tends to use. This came out very clearly in a recent study into academic vocabulary (Durrant, 2016*) which looked at how many of the words on the Academic Vocabulary List (based on a corpus of published academic writing) were actually used regularly by student writers (using a corpus of university-level student writing). It turned out that the student essays contained a vastly narrower range of vocabulary than the published academic texts, written by experienced (and edited!) academics. That’s not to say the student writing was in some way lacking – all the papers had got high marks – it’s just a different genre with different expectations.

When you’re using a corpus to search for ideas, it’s all too easy to pick out examples and patterns that are elegant or appealing, but I think it’s always important to ask yourself how typical they are of what the average person might say or write. Is it a writerly flourish? Is it helpful as a model for your target learners?

I’m not saying that as ELT writers and editors we should dismiss all corpus evidence as flawed and unhelpful. Far from it: I think corpus tools can be incredibly helpful in backing up our intuitions and finding patterns of usage we might not have thought of, but they are just that, ‘tools’, and should be used with an element of caution. It’s all too easy to be drawn in by a corpus that’s new or especially large or has a nice interface and great tools, but making sure you know what your corpus represents is vital. If a collocation or pattern feels unlikely or overly fancy, then ask yourself why. Don’t just take the first results that pop up; click through to the examples, scroll down to see where they come from and understand exactly what’s going on.


* There’s a good summary of Durrant’s study on ELT Research Bites.
