Chapter 1 What is Corpus Linguistics?
1.1 Why do we need corpus data?
- There is an unnecessary dichotomy in linguistics
- “intuiting” linguistic data
- Inventing sentences exemplifying the phenomenon under investigation and then judging their grammaticality
- Corpus data
- Highlight the importance of language use in real context
- Highlight the linguistic tendency in the population (from a sample)
- Strengths of Corpus Data
- Data reliability
- How sure can we be that other people will arrive at the same observations/patterns/conclusions using the same method?
- Can others replicate the same logical reasoning in intuiting data?
- Can others make the same “grammatical judgement”?
- Data validity
- How well do we understand the true mental representation that the linguistic data correspond to?
- Can we know more about the mental representation of grammar based on one person’s constructed sentences and/or grammaticality judgements?
- Can we better generalize our insights from one person’s intuition or from the group mind (population vs. sample vs. one individual)?
Please provide one or two examples that show the differences between one’s grammatical intuition and the corpus-based observations. The examples could be in English or in your native language (e.g., Chinese).
1.2 What is a corpus?
- This can be tricky: different disciplines, different definitions
- Literature
- History
- Sociology
- Field Linguistics
- Linguistic Corpus in corpus linguistics (Stefanowitsch, 2019)
- Authentic
- Representative
- Large
- A few well-received definitions
“a collection of texts which have been selected and brought together so that language can be studied on the computer” (Wynne, 2006)
“A corpus is a collection of pieces of language text in electronic form, selected according to external criteria to represent, as far as possible, a language or language variety as a source of data for linguistic research.” (John Sinclair in (Wynne, 2006))
“A corpus refers to a machine-readable collection of (spoken or written) texts that were produced in a natural communicative setting, and in which the collection of texts is compiled with the intention (1) to be representative and balanced with respect to a particular linguistic language, variety, register, or genre and (2) to be analyzed linguistically.” (Gries, 2016)
1.3 What is a corpus linguistic study?
- CL characteristics
- No general agreement as to what it is
- Not a very homogeneous methodological framework
- Quite new compared to other sub-disciplines in linguistics
- Intertwined with many linguistic fields
- Interactional linguistics, cognitive linguistics, functional syntax, usage-based grammar etc.
- Stylometry, computational linguistics, NLP, digital humanities, text mining, sentiment analysis
- Corpus-based vs. Corpus-driven (cf. Tognini-Bonelli, 2001)
- Corpus-based studies:
- Typically use corpus data in order to explore a theory or hypothesis, aiming to validate it, refute it or refine it.
- Take corpus linguistics as a method
- Corpus-driven studies:
- Typically reject the characterization of corpus linguistics as a method
- Claim instead that the corpus itself should be the sole source of our hypotheses about language
- It is thus claimed that the corpus itself embodies a theory of language (Tognini-Bonelli, 2001, pp. 84–85)
The distinction between corpus-based and corpus-driven studies can be subtle. Both approaches, however, are often connected to the general usage-based grammar framework. I would highly recommend the chapter on the usage-based model (Chapter 11) in Croft & Cruse (2004).
An example of a corpus-driven grammatical framework is pattern grammar. Please see Hunston & Francis (2000) for a comprehensive introduction.
- Stefanowitsch (2019) defines Corpus Linguistics as follows:
“Corpus linguistics is the investigation of linguistic research questions that have been framed in terms of the conditional distribution of linguistic phenomena in a linguistic corpus.”
- More on Conditional Distribution
- An exhaustive investigation
- Text-processing technology
- Retrieval and coding
- Regular expressions
- Common methods
- KWIC concordances
- Collocates
- Frequency lists
- A systematic investigation
- The distribution of a linguistic phenomenon under particular conditions (e.g. lexical, syntactic, social, pragmatic etc. contexts)
- Statistical properties of language
- Examples
- When do English speakers use the complementizer that?
- What are the differences between small and little?
- When do English speakers choose “He picked up the book” vs. “He picked the book up”?
- When do English speakers place the adverbial clauses before the matrix clause?
- Do speakers use different linguistic forms in different genres?
- Is the word “gay” used differently across different time periods?
- Do L2 learners use collocation patterns similar to those of L1 speakers?
- Do speakers of different socio-economic classes talk differently?
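The common methods listed above (regular expressions, KWIC concordances, collocates, frequency lists) can be sketched in a few lines of Python. This is only a minimal illustration using the standard library, not a recommended toolchain: the three sentences in `TEXTS` are invented, the tokenizer is deliberately naive, and the function names `tokenize`, `frequency_list`, `kwic`, and `right_collocates` are hypothetical helpers introduced here.

```python
import re
from collections import Counter

# Toy "corpus" invented for illustration; a real study would load authentic texts.
TEXTS = [
    "He picked up the book and then picked the pen up.",
    "She picked up a little stone near the small house.",
    "The little boy picked it up.",
]

def tokenize(text):
    """Lowercase the text and return word tokens (letters and apostrophes only)."""
    return re.findall(r"[a-z']+", text.lower())

def frequency_list(texts):
    """Count every token in every text (an exhaustive, not selective, tally)."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts

def kwic(texts, pattern, width=2):
    """Return (left context, keyword, right context) for each token that fully matches pattern."""
    regex = re.compile(pattern)
    lines = []
    for text in texts:
        tokens = tokenize(text)
        for i, tok in enumerate(tokens):
            if regex.fullmatch(tok):
                left = " ".join(tokens[max(0, i - width):i])
                right = " ".join(tokens[i + 1:i + 1 + width])
                lines.append((left, tok, right))
    return lines

def right_collocates(concordance):
    """Count the word immediately to the right of the keyword in each concordance line."""
    return Counter(right.split()[0] for _, _, right in concordance if right)

freq = frequency_list(TEXTS)
concordance = kwic(TEXTS, r"pick(s|ed|ing)?")
collocates = right_collocates(concordance)

print(freq.most_common(5))
for left, key, right in concordance:
    print(f"{left:>12} | {key} | {right}")
print(collocates.most_common())
```

The regular expression `pick(s|ed|ing)?` retrieves all inflected forms of the verb at once, which is what makes the search exhaustive rather than dependent on the forms the analyst happens to think of. In practice each concordance line would then be coded manually for the conditions of interest.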
Now please try to think of one or two research questions that may fit into the category of a corpus linguistic study and elaborate on the idea of conditional distribution:
You are interested in the linguistic structure of … and would like to see how it varies/changes/develops depending on …
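As one possible way to make "conditional distribution" concrete, the sketch below takes the particle placement question from the examples (“He picked up the book” vs. “He picked the book up”) and tabulates the proportion of each word order under each condition. The observations are hypothetical and hand-made purely for illustration; they are not results from any corpus, and the conditioning factor (pronoun vs. full NP object) is just one assumed choice among many. The function name `conditional_distribution` is likewise introduced here.

```python
from collections import Counter, defaultdict

# Hypothetical hand-coded observations, invented for illustration only:
# (condition, variant) = (type of direct object, position of the particle).
OBSERVATIONS = [
    ("pronoun", "verb-object-particle"),
    ("pronoun", "verb-object-particle"),
    ("pronoun", "verb-object-particle"),
    ("full NP", "verb-particle-object"),
    ("full NP", "verb-particle-object"),
    ("full NP", "verb-particle-object"),
    ("full NP", "verb-object-particle"),
]

def conditional_distribution(observations):
    """Return {condition: {variant: proportion}}, i.e. P(variant | condition)."""
    counts = defaultdict(Counter)
    for condition, variant in observations:
        counts[condition][variant] += 1
    table = {}
    for condition, variants in counts.items():
        total = sum(variants.values())
        table[condition] = {v: n / total for v, n in variants.items()}
    return table

table = conditional_distribution(OBSERVATIONS)
for condition, dist in table.items():
    for variant, share in dist.items():
        print(f"P({variant} | {condition}) = {share:.2f}")
```

Filling in the template above then amounts to choosing the variants (the linguistic structure) and the conditions (what it is expected to depend on). A real study would go on to test whether the difference between conditions is statistically reliable.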
1.4 Additional Information on CL
- Important Journals in Corpus Linguistics
- Corpus Linguistics and Linguistic Theory
- International Journal of Corpus Linguistics
- Corpora
- Applied Linguistics
- Computational Linguistics
- Digital Scholarship in the Humanities
- Language Teaching
- Language Learning
- Journal of Second Language Writing
- CALL
- Language Teaching Research
- ReCALL
- System
- Important Conferences on Corpus Linguistics
- Corpus Linguistics Conference (usually held in the UK)
- International Association of Applied Linguistics
- American Association for Corpus Linguistics
- American Association for Applied Linguistics
- Teaching and Language Corpora Conference
- International Cognitive Linguistics Conference
- International Conference on Construction Grammar