Chapter 1 What is Corpus Linguistics?

1.1 Why do we need corpus data?

  • Linguistics maintains an unnecessary dichotomy between two sources of data
    • “intuiting” linguistic data
      • Inventing sentences exemplifying the phenomenon under investigation and then judging their grammaticality
    • Corpus data
      • Highlight the importance of language use in real context
      • Highlight linguistic tendencies in the population (inferred from a sample)
  • Strengths of Corpus Data
    • Data reliability
      • How sure can we be that other people will arrive at the same observations/patterns/conclusions using the same method?
        • Can others replicate the same logical reasoning in intuiting data?
        • Can others make the same “grammaticality judgement”?
    • Data validity
      • How well do we understand the true mental representation that the linguistic data correspond to?
        • Can we learn more about the mental representation of grammar from one person’s constructed sentences and/or grammaticality judgements?
        • Do our insights generalize better from one person’s intuition or from the usage of many speakers (population vs. sample vs. one individual)?

Please provide one or two examples that show the differences between one’s grammatical intuition and the corpus-based observations. The examples could be in English or in your native language (e.g., Chinese).

1.2 What is a corpus?

  • This can be tricky: different disciplines, different definitions
    • Literature
    • History
    • Sociology
    • Field Linguistics
  • Linguistic Corpus in corpus linguistics (Stefanowitsch, 2019)
    • Authentic
    • Representative
    • Large
  • A few well-received definitions

“a collection of texts which have been selected and brought together so that language can be studied on the computer” (Wynne, 2006)

“A corpus is a collection of pieces of language text in electronic form, selected according to external criteria to represent, as far as possible, a language or language variety as a source of data for linguistic research.” (John Sinclair, in Wynne, 2006)

“A corpus refers to a machine-readable collection of (spoken or written) texts that were produced in a natural communicative setting, and in which the collection of texts is compiled with the intention (1) to be representative and balanced with respect to a particular linguistic language, variety, register, or genre and (2) to be analyzed linguistically.” (Gries, 2016)

1.3 What is a corpus linguistic study?

  • CL characteristics
    • No general agreement as to what it is
    • Not a very homogeneous methodological framework
    • Quite new compared with other sub-disciplines in linguistics
  • Intertwined with many linguistic fields
    • Interactional linguistics, cognitive linguistics, functional syntax, usage-based grammar etc.
    • Stylometry, computational linguistics, NLP, digital humanities, text mining, sentiment analysis
  • Corpus-based vs. Corpus-driven (cf. Tognini-Bonelli, 2001)
    • Corpus-based studies:
      • Typically use corpus data in order to explore a theory or hypothesis, aiming to validate it, refute it or refine it.
      • Take corpus linguistics as a method
    • Corpus-driven studies:
      • Typically reject the characterization of corpus linguistics as a method
      • Claim instead that the corpus itself should be the sole source of our hypotheses about language
      • It is thus claimed that the corpus itself embodies a theory of language (Tognini-Bonelli, 2001, pp. 84–85)

The distinction between corpus-based and corpus-driven studies can be subtle. Both approaches, however, are often connected to the general usage-based grammar framework. I would highly recommend the chapter on the usage-based model (Chapter 11) in Croft & Cruse (2004).

An example of a corpus-driven grammatical framework is pattern grammar. Please see Hunston & Francis (2000) for a comprehensive introduction.

  • Stefanowitsch (2019) defines Corpus Linguistics as follows:

“Corpus linguistics is the investigation of linguistic research questions that have been framed in terms of the conditional distribution of linguistic phenomena in a linguistic corpus.”

  • More on Conditional Distribution
    • An exhaustive investigation
      • Text-processing technology
        • Retrieval and coding
        • Regular expressions
      • Common methods
        • KWIC concordances
        • Collocates
        • Frequency lists
    • A systematic investigation
      • The distribution of a linguistic phenomenon under particular conditions (e.g. lexical, syntactic, social, pragmatic etc. contexts)
      • Statistical properties of language
  • Examples
    • When do English speakers use the complementizer “that”?
    • What are the differences between “small” and “little”?
    • When do English speakers choose “He picked up the book” vs. “He picked the book up”?
    • When do English speakers place adverbial clauses before the matrix clause?
    • Do speakers use different linguistic forms in different genres?
    • Is the word “gay” used differently across different time periods?
    • Do L2 learners use collocation patterns similar to those of L1 speakers?
    • Do speakers of different socio-economic classes talk differently?
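
The common methods listed above (frequency lists, KWIC concordances, collocates) can be illustrated with a minimal Python sketch. It uses only the standard library and a three-sentence toy text invented for this illustration; the function names (tokenize, frequency_list, kwic, collocates) are hypothetical helpers, not part of any corpus toolkit, and the regular-expression tokenizer is a deliberate simplification of what real corpus software does.

```python
import re
from collections import Counter

# Toy text invented for illustration; a real study would load authentic texts.
TEXT = (
    "He picked up the book. She picked the book up. "
    "They picked up the little box and the small bag."
)

def tokenize(text):
    """Lowercase the text and extract word tokens with a regular expression."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def frequency_list(tokens):
    """Return (word, count) pairs sorted by descending frequency."""
    return Counter(tokens).most_common()

def kwic(tokens, node, window=3):
    """Return (left context, node, right context) for every hit of `node`."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, tok, right))
    return lines

def collocates(tokens, node, window=2):
    """Count the words occurring within `window` tokens on either side of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            counts.update(tokens[max(0, i - window):i])
            counts.update(tokens[i + 1:i + 1 + window])
    return counts

tokens = tokenize(TEXT)

# Frequency list: the three most frequent word forms.
print(frequency_list(tokens)[:3])

# KWIC concordance: each hit of "picked" with three words of context.
for left, node, right in kwic(tokens, "picked"):
    print(f"{left:>20} [{node}] {right}")

# Collocates: raw co-occurrence counts around "picked".
print(collocates(tokens, "picked").most_common(3))
```

Raw co-occurrence counts like these are only a starting point; collocation studies usually go on to weigh them against the overall frequency of each word.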

Now please try to think of one or two research questions that may fit into the category of a corpus linguistic study and elaborate on the idea of conditional distribution:

You are interested in the linguistic structure of … and would like to see how it varies/changes/develops depending on …
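
One way to make the template above concrete is to code each observation as a (condition, variant) pair and compute the proportion of each variant under each condition. The sketch below does this for the particle-placement question. The seven observations, the labels ("pronoun", "full NP", "V-Prt-Obj", "V-Obj-Prt") and the function name conditional_distribution are all invented for illustration; they are not real corpus results and support no empirical claim.

```python
from collections import Counter

# Hypothetical, hand-coded observations invented for illustration:
# (condition, variant) = (type of direct object, verb-particle order).
observations = [
    ("pronoun", "V-Obj-Prt"),   # e.g. "picked it up"
    ("pronoun", "V-Obj-Prt"),
    ("pronoun", "V-Obj-Prt"),
    ("full NP", "V-Prt-Obj"),   # e.g. "picked up the book"
    ("full NP", "V-Prt-Obj"),
    ("full NP", "V-Obj-Prt"),   # e.g. "picked the book up"
    ("full NP", "V-Prt-Obj"),
]

def conditional_distribution(pairs):
    """Return {condition: {variant: proportion}}, i.e. P(variant | condition)."""
    joint = Counter(pairs)                       # counts of each (condition, variant)
    totals = Counter(cond for cond, _ in pairs)  # counts of each condition
    table = {}
    for (cond, variant), n in joint.items():
        table.setdefault(cond, {})[variant] = n / totals[cond]
    return table

dist = conditional_distribution(observations)
for cond, variants in dist.items():
    for variant, p in variants.items():
        print(f"P({variant} | {cond}) = {p:.2f}")
```

In a real study the pairs would come from retrieving and coding every relevant hit in the corpus, and the differences between conditions would then be assessed statistically.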

1.4 Additional Information on CL

  • Important Journals in Corpus Linguistics
    • Corpus Linguistics and Linguistic Theory
    • International Journal of Corpus Linguistics
    • Corpora
    • Applied Linguistics
    • Computational Linguistics
    • Digital Scholarship in the Humanities
    • Language Teaching
    • Language Learning
    • Journal of Second Language Writing
    • Computer Assisted Language Learning (CALL)
    • Language Teaching Research
    • ReCALL
    • System
  • Important Conferences on Corpus Linguistics
    • Corpus Linguistics Conference (usually held in the UK)
    • International Association of Applied Linguistics
    • American Association for Corpus Linguistics
    • American Association for Applied Linguistics
    • Teaching and Language Corpora Conference
    • International Cognitive Linguistics Conference
    • International Conference on Construction Grammar

References

Croft, W., & Cruse, D. A. (2004). Cognitive linguistics. Cambridge University Press.
Gries, S. T. (2016). Quantitative corpus linguistics with R: A practical introduction (2nd ed.). Routledge.
Hunston, S., & Francis, G. (2000). Pattern grammar: A corpus-driven approach to the lexical grammar of English. John Benjamins.
Stefanowitsch, A. (2019). Corpus linguistics: A guide to the methodology. Language Science Press. http://langsci-press.org/catalog/book/148
Tognini-Bonelli, E. (2001). Corpus linguistics at work. John Benjamins.
Wynne, M. (Ed.). (2006). Developing linguistic corpora: A guide to good practice. EADH: The European Association for Digital Humanities. https://users.ox.ac.uk/~martinw/dlc/index.htm