The CoRaLHis corpus

From all the texts that the various contributing corpora place at CoRaLHis's disposal, a selection will be made consisting of 100,000 tokens per half century and per genre (see below). The number of tokens per cell was chosen after consultation with specialists in corpus linguistics and representativeness.

GENRES

The texts are divided into the following genres:

  1. Legal/juridical & administrative texts
    • Expository, descriptive
    • Procedural (laws)
  2. Religious texts (books of psalms, hagiography, sermons)
  3. Historiography, chronicles
  4. Didactic & scientific literature (such as bestiaries, other treatises, including lexicographic & grammar works)
    • Expository, descriptive
    • Procedural (recipes, falconry, etc.)
  5. Literary narrative texts & fiction (including epic literature)
  6. Poetry, lyrical texts
  7. Theatre
  8. Documents of daily life (personal prose: letters, memoirs, etc.)

LANGUAGES

In the first stage of the project, the selection of 100,000 tokens per half century and genre will be made for the first 7 languages listed below; the languages in parentheses will be covered only as far as the available sources allow:

  1. French
  2. Italian
  3. Catalan
  4. Spanish
  5. Portuguese
  6. Occitan
  7. Gascon
  8. (Sardinian)
  9. (Francoprovençal)
  10. (Rhaeto-Romance)

Thus, 7 languages x 8 genres x 12 periods (half centuries) gives a total of 672 language/genre/period samples, or 840 samples if the genre subdivisions are counted separately (genres 1 and 4 each split into two, giving 10 genre cells). Multiplying these figures by the 100,000 tokens per cell gives a corpus of 67,200,000 or 84,000,000 tokens altogether.

In sum, for each of the languages mentioned (French, Italian, Catalan, Spanish, Portuguese, Occitan, Gascon), we envisage having 9,600,000 tokens (8 genres x 12 periods x 100,000 tokens per cell). It should be borne in mind that certain genres are by their nature restricted to a given period; theatre, for instance, is attested mainly in the most recent centuries.
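The figures above can be checked with a short Python sketch. It assumes the seven languages named in the preceding paragraph and takes the subdivision counts from the GENRES list (genres 1 and 4 have two subdivisions each). The names `GENRE_SUBDIVISIONS` and `grid_size` are illustrative only and are not part of any CoRaLHis tooling.

```python
# Illustrative sketch of the sampling-grid arithmetic; not project code.
LANGUAGES = ["French", "Italian", "Catalan", "Spanish",
             "Portuguese", "Occitan", "Gascon"]

# Subdivisions per genre (1 = not subdivided), following the GENRES list.
GENRE_SUBDIVISIONS = {
    "Legal/juridical & administrative": 2,
    "Religious": 1,
    "Historiography, chronicles": 1,
    "Didactic & scientific": 2,
    "Literary narrative & fiction": 1,
    "Poetry, lyrical": 1,
    "Theatre": 1,
    "Documents of daily life": 1,
}

PERIODS = 12               # half centuries
TOKENS_PER_CELL = 100_000  # target size of each language/genre/period cell


def grid_size(n_languages, n_genre_cells, n_periods=PERIODS):
    """Number of language/genre/period cells in the sampling grid."""
    return n_languages * n_genre_cells * n_periods


cells_plain = grid_size(len(LANGUAGES), len(GENRE_SUBDIVISIONS))
cells_subdivided = grid_size(len(LANGUAGES), sum(GENRE_SUBDIVISIONS.values()))

tokens_plain = cells_plain * TOKENS_PER_CELL
tokens_subdivided = cells_subdivided * TOKENS_PER_CELL
tokens_per_language = grid_size(1, len(GENRE_SUBDIVISIONS)) * TOKENS_PER_CELL

print(cells_plain, cells_subdivided)    # 672 840
print(tokens_plain, tokens_subdivided)  # 67200000 84000000
print(tokens_per_language)              # 9600000
```

These are target sizes for a fully populated grid; cells for which no texts exist in a given period are not modelled here.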

On the other hand, we expect that, for some languages, certain cells cannot be filled completely, at least not at the outset. In this regard, it is important to stress that the project is conceived as an ongoing initiative that will grow over time as more collaborations are established.

Finally, it should be noted that, in addition to the 7 major languages mentioned, the corpus is also intended to include texts in Sardinian, Francoprovençal and Rhaeto-Romance. For these languages, the scarcity of available resources does not allow us to gather texts for all genres and periods, at least not in significant quantities. Although this limits comparative work involving these languages, we think they should be represented in the corpus as fully as possible, so we have opted to include the genres and periods that are available. For Sardinian, for example, we have at our disposal a notable number of legal texts from the 13th and 14th centuries, whereas the number of texts for the remaining genres and centuries is currently being increased.

LIST OF TEXTS

corpus.txt · Last modified: 2021/06/17 08:35 by annapineda