Structured Corpus Processing

Many pre-collected corpora are available for linguistic studies. This section demonstrates how to load existing corpora in Python and perform basic corpus analysis on them.

To facilitate the sharing of corpus data, the corpus linguistics community has settled on a few common schemes for storing and exchanging textual data. In particular, I will discuss the following common types of corpus data representation:

  • XML (BNC XML Edition)

  • CHILDES

  • Praat TextGrid (for spoken data)
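To give a first taste of what loading structured corpus data looks like, here is a minimal sketch using only Python's standard library. The XML fragment is a hypothetical, simplified sample modeled on BNC-style word markup (`<w>` elements carrying a `c5` part-of-speech attribute); it is not taken from the actual corpus, and the helper name `tagged_words` is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified sentence in BNC-like XML markup (illustration only).
SAMPLE = """
<s n="1">
  <w c5="AT0" hw="the" pos="ART">The </w>
  <w c5="NN1" hw="corpus" pos="SUBST">corpus </w>
  <w c5="VBZ" hw="be" pos="VERB">is </w>
  <w c5="AJ0" hw="small" pos="ADJ">small</w>
  <c c5="PUN">.</c>
</s>
"""

def tagged_words(xml_text):
    """Return (word, tag) pairs for every <w> element in the XML string."""
    root = ET.fromstring(xml_text.strip())
    # Word text may carry trailing whitespace, so strip it before pairing.
    return [(w.text.strip(), w.get("c5")) for w in root.iter("w")]

tokens = tagged_words(SAMPLE)
for word, tag in tokens:
    print(word, tag)
```

The same idea, walking the element tree and reading attributes, carries over to real corpus files, although those are much larger and usually call for a dedicated corpus reader.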