Recently, I gave a 10-minute talk at the ELT Freelancers’ Awayday in Oxford about “Simple corpus hacks for ELT editors”. I only had time to look at one corpus and a handful of searches, but I promised to share some of my other favourites in a blog post. So here goes…
1 Monco: In my talk, I looked at the Monco corpus. I chose it because it’s a monitor corpus, so it monitors current usage, updating with new data daily, and as such, I find it useful for answering language questions that haven’t yet made it into conventional reference sources like dictionaries. For example, in my talk, we looked at how wellbeing (spelled as a single word) may be catching up with the more traditional hyphenated form (well-being) that you’ll find in most dictionaries (simply by typing well-being|wellbeing into the search box). The split was 35% – 65% in Monco compared with 17% – 83% in the British National Corpus (with data from the 1980s and 90s). We also turned up some potentially useful verb collocates for newsfeed, including scroll through and pop up, which won’t yet have made it into a collocations dictionary. One of my favourite features of Monco, especially for the corpus novice, is its user-friendly search screen and its nice graphics for results.
On the downside, Monco’s data is drawn only from online news sources, which means that it’s really only reflective of journalism, rather than language usage in general. And although it includes sources from the UK, US, Canada and Australia, it isn’t balanced, so there’s significantly more data from some sources than others – a factor to bear in mind, as it can skew the results.
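For anyone who wants to try the same kind of spelling comparison on their own text files rather than through a corpus site, here is a minimal sketch in Python. It has nothing to do with how Monco works internally. The function name variant_split and the sample sentences are made up purely for illustration, and the percentages it prints describe only that invented sample.

```python
import re
from collections import Counter

def variant_split(text, variants):
    """Return each variant's percentage share of all variant hits in text.

    `variants` is a list of alternative spellings (hypothetical helper,
    not part of any corpus tool).
    """
    # Build one pattern such as \b(wellbeing|well\-being)\b, ignoring case.
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(v) for v in variants) + r")\b",
        re.IGNORECASE,
    )
    # Count every hit, normalised to lower case.
    counts = Counter(m.group(1).lower() for m in pattern.finditer(text))
    total = sum(counts.values())
    if total == 0:
        return {v: 0.0 for v in variants}
    return {v: round(100 * counts[v.lower()] / total, 1) for v in variants}

# Invented sample text, used only to show the calculation.
sample = (
    "Staff wellbeing matters. The well-being of students is key. "
    "A wellbeing survey was sent out. Wellbeing apps are popular."
)
split = variant_split(sample, ["wellbeing", "well-being"])
print(split)
```

As with Monco itself, the result is only as representative as the text you feed in, so a small or unbalanced sample will give a skewed split.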
2 Brigham Young University: Not strictly a single corpus, but a collection of different corpora available via the same site and the go-to source for lots of queries. Personally, I tend to use COCA (the Corpus of Contemporary American English) for checking US usage. It’s a large corpus containing a nice variety of contemporary sources (1990 – present), including radio & TV transcripts, fiction, newspapers, magazines and academic data. Through BYU, you can also find a host of specialized corpora including a corpus of Wikipedia entries and even, slightly weirdly, the Hansard corpus of British parliamentary proceedings, should that happen to suit your purpose!
My main grumble with BYU is that I find the interface clunky and frustrating to use, especially with its rather distracting colour-coding.
3 BAWE and BASE: The British Academic Written English corpus (BAWE) and the British Academic Spoken English corpus (BASE) are composed of written and spoken data collected from university students at a number of British universities. The written corpus contains essays and other coursework which received a good pass grade and the spoken data includes lectures and seminars. I particularly like these corpora because they’re an example of language as it might be used by the peers of the students we’re aiming at, rather than text produced by professional writers, journalists, academics, etc., which doesn’t necessarily provide an appropriate model for the average ELT student. This is obviously university-level language, so is especially relevant for EAP, but I think BAWE could be useful for any advanced students who need to write formal essays (IELTS, CAE, Proficiency). And if you’re looking for US academic equivalents, you could also check out MICUSP and MICASE.
BAWE and BASE are actually available via several sources, but I wanted the excuse to get you to try Sketch Engine: for me, the gold standard when it comes to corpus tools and the interface used by all the major dictionary publishers for their big corpora.
4 Spoken BNC2014: I admit this is the corpus on my list that I’ve probably used least so far, but I’m including it because it’s one I’m quite excited about finding uses for. Slightly contrary to its name, it was only released in 2017 and is the result of a massive project to collect data about current spoken English used in everyday contexts. If you’re working on speaking materials, looking at evidence from written English is not going to tell you anything terribly useful, because we just don’t speak how we write. So I think this could become the go-to corpus for anyone who wants to know how people really say things.
Unfortunately, the Spoken BNC2014 doesn’t have the most user-friendly interface and getting access involves a bit of a faffy sign-up process which could be off-putting for the casual user. If spoken language is your thing though, I think it’s worth investing the time and effort to check it out, not least because some of the content is just really funny!
A note about corpora and copyright: It’s important to remember that, in general, the data that appears in a corpus is subject to all the usual copyright restrictions. That means you can’t just pull a big chunk of language from the corpus and use it in your activity, especially not if it’s for commercial publication. Occasionally, of course, you come across very short, ‘vanilla’ examples which could have come from almost anywhere (A girl opened the door. The traffic was particularly bad.), but to be honest, these are few and far between. Generally, when I search for a particular language item, I’ll scan through the examples and jot down a ‘frame’:
I/you scroll through my/your (Facebook) newsfeed to see/searching for/on the train …
Then I’ll use my notes as the basis for an example that keeps the feel and pattern of the ones I’ve looked at, but fits my teaching purpose … and doesn’t infringe copyright.
There are lots of different corpora out there and corpus fans will have their personal favourites. If you’re new to corpora though, I’d say pick one or two to check out, play around with a few simple searches, use the help to get you started, and see what’s most useful for you. Be warned though, it can be addictive!



