Chapter 1 What is Corpus Linguistics?

1.1 Why do we need corpus data?

  • Linguistics often rests on an unnecessary dichotomy

    • Intuiting linguistic data

      • Researchers invent sentences to exemplify a target phenomenon. They then judge grammaticality based on personal competence. This practice foregrounds individual introspection.
    • Corpus data

      • Corpus evidence foregrounds language use in real contexts.
      • Corpus evidence reveals linguistic tendencies within a population through principled sampling.
      • This contrast marks a shift from isolated impressions toward observable patterns.
  • Strengths of corpus data

    • Data reliability

      • Reliability asks how confidently other researchers can reach the same observations, patterns, and conclusions when they apply the same method.

        • Intuition-based practice limits replication, since others cannot follow the same reasoning path.
        • Grammatical judgments vary across individuals, which weakens shared evaluation.
    • Data validity

      • Validity asks how closely linguistic evidence reflects underlying mental representations.

        • Constructed examples from a single analyst offer limited insight into cognition.
        • Evidence from many speakers supports stronger generalization, since population-level patterns exceed insights drawn from one mind.

Exercise 1.1 Provide two examples that demonstrate a conflict between individual grammatical intuition and evidence from corpus-based observations. You may draw these examples from English or your native language (e.g., Chinese). For each case, clearly explain the discrepancy and include specific citations for the referenced works.

1.2 What is a corpus?

  • The term is tricky: different disciplines define it differently
    • Literature
    • History
    • Sociology
    • Field Linguistics
  • Linguistic Corpus in corpus linguistics (Stefanowitsch, 2019)
    • Authentic
    • Representative
    • Large
  • A few well-received definitions

“a collection of texts which have been selected and brought together so that language can be studied on the computer” (Wynne, 2006)

“A corpus is a collection of pieces of language text in electronic form, selected according to external criteria to represent, as far as possible, a language or language variety as a source of data for linguistic research.” (John Sinclair, in Wynne, 2006)

“A corpus refers to a machine-readable collection of (spoken or written) texts that were produced in a natural communicative setting, and in which the collection of texts is compiled with the intention (1) to be representative and balanced with respect to a particular linguistic language, variety, register, or genre and (2) to be analyzed linguistically.” (Gries, 2016)

1.3 What is a corpus linguistic study?

  • CL characteristics

    • No general consensus defines the field.
    • The methodological framework lacks strong homogeneity.
    • Compared with many subdisciplines in linguistics, the field emerged relatively recently.
  • Intertwined with many linguistic fields

    • Interactional linguistics, cognitive linguistics, functional syntax, usage-based grammar.
    • Stylometry, computational linguistics, NLP, digital humanities, text mining, sentiment analysis.
    • This breadth shows how corpus work connects theory, method, and application across domains.
  • Corpus-based versus corpus-driven (cf. Tognini-Bonelli, 2001)

    • Corpus-based studies

      • Researchers use corpus data to test a theory, evaluate a hypothesis, or refine analytical claims.
      • This approach treats corpus linguistics as a methodological toolkit.
    • Corpus-driven studies

      • Researchers reject the view of corpus linguistics as a mere method.
      • They argue that the corpus itself should function as the sole source of hypotheses about language.
      • This position claims that the corpus itself embodies a theory of language (Tognini-Bonelli, 2001, pp. 84–85).
  • Note on the distinction

    • The boundary between corpus-based and corpus-driven work can appear subtle.
    • Both approaches often connect closely with the broader usage-based grammar framework.
    • Chapter 11 in Croft & Cruse (2004) offers a clear introduction to usage-based models.
    • Pattern grammar provides a well-known example of a corpus-driven framework.
    • Hunston & Francis (2000) offer a comprehensive introduction to pattern grammar.
  • Definition of Corpus Linguistics (Stefanowitsch, 2019)

    Corpus linguistics is the investigation of linguistic research questions that have been framed in terms of the conditional distribution of linguistic phenomena in a linguistic corpus.

  • More on conditional distribution

    • An exhaustive investigation

      • Text-processing technology supports large-scale analysis.

        • Retrieval: corpus toolkit vs. coding
        • Corpus Query Language vs. Regular Expressions
      • Common methods

        • KWIC concordances
        • Collocates
        • Frequency lists
    • A systematic investigation

      • Researchers examine how a phenomenon distributes under specific conditions, such as lexical, syntactic, social, and pragmatic contexts.
      • Statistical properties of language guide interpretation.
  • Examples of corpus linguistic research questions

    • When do English speakers use the complementizer “that”?
    • What differences distinguish “small” from “little”?
    • When do English speakers choose “He picked up the book” in contrast to “He picked the book up”?
    • When do English speakers place adverbial clauses before the matrix clause?
    • Do speakers use distinct linguistic forms across genres?
    • Does the word “gay” show different usage patterns across historical periods?
    • Do L2 learners use collocation patterns similar to L1 speakers?
    • Do speakers from different socio-economic classes use language in distinct ways?
  • Key idea behind conditional distribution

    • A corpus linguistic study does not ask whether a form exists.
    • A corpus linguistic study asks where, when, how often, and under which conditions a form occurs.
    • The explanatory force comes from systematic comparison across clearly defined contexts.
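The common methods listed above (frequency lists, KWIC concordances, collocates) can be sketched in a few lines of Python using regular expressions for retrieval. This is a minimal illustration, not a substitute for a corpus toolkit: the text in `TEXT` is invented, the tokenizer is deliberately naive, and the function names `tokenize`, `frequency_list`, `kwic`, and `collocates` are hypothetical helpers introduced only for this example.

```python
import re
from collections import Counter

# Toy text invented for illustration; a real study would load corpus files.
TEXT = (
    "He picked up the book. She picked the book up quickly. "
    "They picked up the pace, and he picked up the phone."
)


def tokenize(text):
    """Lowercase the text and extract word tokens (naive: letters and apostrophes only)."""
    return re.findall(r"[a-z']+", text.lower())


def frequency_list(tokens):
    """Return (word, count) pairs sorted from most to least frequent."""
    return Counter(tokens).most_common()


def kwic(tokens, pattern, width=3):
    """Return (left, node, right) tuples for every token that fully matches the regex."""
    node_re = re.compile(pattern)
    lines = []
    for i, tok in enumerate(tokens):
        if node_re.fullmatch(tok):
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


def collocates(tokens, node, window=2):
    """Count words occurring within `window` tokens to the left or right of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            counts.update(tokens[max(0, i - window):i])
            counts.update(tokens[i + 1:i + 1 + window])
    return counts


tokens = tokenize(TEXT)

# Frequency list: the three most frequent word forms.
print(frequency_list(tokens)[:3])

# KWIC concordance for forms of "pick", aligned on the node word.
for left, node, right in kwic(tokens, r"pick(s|ed|ing)?"):
    print(f"{left:>25} [{node}] {right}")

# Raw co-occurrence counts around "picked" (no association measure applied).
print(collocates(tokens, "picked").most_common(3))
```

The collocate counts here are raw co-occurrence frequencies. A real study would normally apply an association measure and work with a properly tokenized, annotated corpus.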

Now try to formulate corpus linguistic research questions:

  • Formulate one research question that fits a corpus linguistic study.
  • Use the template below as a guide.
  • Clarify how the question reflects the logic of conditional distribution.

I am interested in the linguistic structure of X. I would like to examine how it varies across context A, context B, and context C (levels). I focus on its distribution in relation to factor Y (factor).

  • Example 1

    • Research interest

      • I am interested in the structure of stance adverbs such as “actually”, “clearly”, and “perhaps”.
    • Research question

      • How does the use of stance adverbs vary across academic writing, newspaper discourse, and spoken conversation?
    • Conditional distribution

      • The analysis compares the distribution of stance adverbs under different discourse registers/genres.
      • The conditioning factor (register) has three levels: academic writing, newspaper discourse, and spoken conversation.
  • Example 2

    • Research interest

      • I am interested in the structure of noun phrase modification in learner English.
    • Research question

      • How does noun phrase complexity change across proficiency levels in learner corpora, in relation to task type?
    • Conditional distribution

      • The analysis traces how one linguistic phenomenon distributes under conditions defined by proficiency level and task context.
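The logic of Example 1 can be sketched as a small conditional distribution table: count the stance adverbs in each register and normalize by the size of that register's subcorpus. Everything below is an assumption made for illustration. The three texts in `CORPUS` are invented, the helper names `tokenize` and `conditional_distribution` are hypothetical, and the resulting figures show only the mechanics, not findings about English.

```python
import re
from collections import Counter

# Hypothetical mini-corpus (register -> text), invented for illustration only.
CORPUS = {
    "academic": "The results clearly show an effect. Perhaps the sample was small.",
    "news": "The minister actually resigned today. Officials clearly expected it.",
    "conversation": "Well, actually, I was perhaps a bit late. Actually, never mind.",
}

# The linguistic phenomenon under study (X in the template).
STANCE_ADVERBS = {"actually", "clearly", "perhaps"}


def tokenize(text):
    """Lowercase the text and extract word tokens (naive tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def conditional_distribution(corpus, targets, per=100):
    """For each register, return raw counts, subcorpus size, and rate per `per` tokens."""
    table = {}
    for register, text in corpus.items():
        tokens = tokenize(text)
        hits = Counter(t for t in tokens if t in targets)
        table[register] = {
            "counts": dict(hits),
            "tokens": len(tokens),
            # Normalizing makes subcorpora of different sizes comparable.
            "rate": per * sum(hits.values()) / len(tokens),
        }
    return table


table = conditional_distribution(CORPUS, STANCE_ADVERBS)
for register, row in table.items():
    total = sum(row["counts"].values())
    print(f"{register:<13} tokens={row['tokens']:>3} hits={total} "
          f"rate={row['rate']:.1f} per 100 tokens")
```

Each row of the table corresponds to one level of the conditioning factor. Comparing normalized rates rather than raw counts is what allows a claim about distribution across registers.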

Exercise 1.2 Select at least two journal papers that utilize a corpus-based investigation as their primary methodology. For each study, identify the core research questions and reformulate them using the following structured template:

“The study investigates the linguistic structure of [X]. It examines how this structure varies across [Context A], [Context B], and [Context C], specifically focusing on its distribution in relation to [Factor Y].”

Ensure your analysis clearly captures the relationship between the linguistic variables and the contextual or social factors examined in the research.

You must provide full and accurate citations for all referenced works following APA (7th edition) format.

1.4 Additional information on corpus linguistics

  • Key journals in corpus linguistics

    • Corpus Linguistics and Linguistic Theory — theory-oriented corpus research.
    • International Journal of Corpus Linguistics — core venue for methodological research.
    • Cognitive Linguistics — focus on cognitive perspectives of language processing.
    • Corpora — focus on corpus design, annotation, and empirical analysis.
    • Applied Linguistics — frequent corpus-based work in applied domains.
    • Computational Linguistics — strong links to large-scale corpus analysis.
    • Digital Scholarship in the Humanities — corpus methods within humanities research.
    • Language Teaching — corpus-informed perspectives on pedagogy.
    • Language Learning — empirical studies using learner corpora.
    • Journal of Second Language Writing — corpus approaches to writing research.
    • Computer Assisted Language Learning (CALL) — technology-supported learning informed by corpus evidence.
    • Language Teaching Research — classroom-focused studies using corpus insights.
    • ReCALL — technology, corpora, and language learning.
    • System — discourse, pedagogy, and corpus-informed analysis.
  • Major conferences relevant to corpus linguistics

    • Corpus Linguistics Conference — flagship meeting, often hosted in the UK.
    • International Association of Applied Linguistics — strong corpus presence across thematic panels.
    • American Association for Corpus Linguistics — dedicated forum for corpus research.
    • American Association for Applied Linguistics — frequent corpus-based strands.
    • Teaching Language Corpora Conference — focus on pedagogy and learner corpora.
    • International Cognitive Linguistics Conference — growing corpus-based cognitive work.
    • International Conference on Construction Grammar — increasing engagement with corpus evidence.

References

Croft, W., & Cruse, D. A. (2004). Cognitive linguistics. Cambridge University Press.
Gries, S. T. (2016). Quantitative corpus linguistics with R: A practical introduction (2nd ed.). Routledge.
Hunston, S., & Francis, G. (2000). Pattern grammar: A corpus-driven approach to the lexical grammar of English. John Benjamins.
Stefanowitsch, A. (2019). Corpus linguistics: A guide to the methodology. Language Science Press. http://langsci-press.org/catalog/book/148
Tognini-Bonelli, E. (2001). Corpus linguistics at work. John Benjamins.
Wynne, M. (Ed.). (2006). Developing linguistic corpora: A guide to good practice. EADH: The European Association for Digital Humanities. https://users.ox.ac.uk/~martinw/dlc/index.htm