corpus

Differences

This shows you the differences between two versions of the page.

corpus [2020/08/23 17:21]
annapineda
corpus [2021/06/17 08:35] (current)
annapineda
Line 1: Line 1:
-===== The CoRaLHis CORPUS =====
+===== The CoRaLHis corpus =====
  
 From all the texts that the several corpora will put in CoRaLHis’ disposal, a selection will be made, consisting of 100.000 tokens per half century and per genre (see below). The choice of the amount of tokens per cell has been made after consultation with specialists on corpus linguistics and representativity.
Line 23: Line 23:
 ==== LANGUAGES ====
  
-At the first stage of the project, the selection of 100.000 tokens per half century and genre will be made for a total amount of 8 languages:
+At the first stage of the project, the selection of 100,000 tokens per half century and genre will be made for a total amount of 8 languages:
  
   - French
   - Italian
-  - Southern Italian varieties
   - Catalan
   - Spanish
Line 37: Line 36:
   - (Rhaeto-Romance)
  
-Thus, 8 languages x 8 genres x 12 periods (half centuries), gives a total of 768 language/genre/period samples; or 960 language/genre/period samples if genre subdivisions are considered. If we multiply this number for the mentioned 100,000 tokens per cell, this gives us a corpus with 76,800,000 or 96,000,000 tokens altogether.
+Thus, 8 languages x 8 genres x 12 periods (half centuries), gives a total of 672 language/genre/period samples; or 840 language/genre/period samples if genre subdivisions are considered. If we multiply this number for the mentioned 100,000 tokens per cell, this gives us a corpus with 67,200,000 or 84,000,000 tokens altogether.
  
-In sum, for each of the languages mentioned (**French, Italian, Southern Italian varieties, Catalan, Spanish, Portuguese, Occitan, Gascon**), we envisage to have 9,600,000 tokens (8 genres x 12 periods of time x 100,000 tokens for each cell). It has to be taken into account that certain genres, such as theatre, are restricted per se to a given period of time, notably the latest centuries in this case.
+In sum, for each of the languages mentioned (**French, Italian, Catalan, Spanish, Portuguese, Occitan, Gascon**), we envisage to have 9,600,000 tokens (8 genres x 12 periods of time x 100,000 tokens for each cell). It has to be taken into account that certain genres, such as theatre, are restricted per se to a given period of time, notably the latest centuries in this case.
  
 On the other hand, we of course foresee that, for some languages, there will be cells that cannot be entirely filled out, at least from the very beginning. In this regard, it is important to highlight that the project is conceived as a growing and ongoing initiative, which will be enlarged as time passes by and more collaborations are established.
  
-Finally, it should be noted that, in addition to the 8 major languages mentioned, the corpus also envisages to include texts from **Sardinian**, **Francoprovençal** and **Rhaeto-Romance**. For these languages, the scarcity of resources available does not allow us to envisage gathering texts for all the genres and periods of time, at least not in a significant way. Despite this not being possible and therefore comparative work including these languages being affected, we think they need to be represented in the corpus as much as possible, thus we have opted for representing the genres / periods available. For example, for Sardinian we have at our disposal a notable number of legal texts from the 13th and 14th centuries.
+Finally, it should be noted that, in addition to the 8 major languages mentioned, the corpus also envisages to include texts from **Sardinian**, **Francoprovençal** and **Rhaeto-Romance**. For these languages, the scarcity of resources available does not allow us to envisage gathering texts for all the genres and periods of time, at least not in a significant way. Despite this not being possible and therefore comparative work including these languages being affected, we think they need to be represented in the corpus as much as possible, thus we have opted for representing the genres / periods available. For example, for Sardinian we have especially at our disposal a notable number of legal texts from the 13th and 14th centuries, whereas the number of texts for the remaining genres and centuries is being currently increased.
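
The sample and token counts changed in this hunk can be re-derived with a short calculation. The sketch below is only a check of the arithmetic, not part of the wiki page: the function name is hypothetical, and the figure of 10 genre categories "if genre subdivisions are considered" is an assumption inferred from 960 / (8 x 12) and 840 / (7 x 12). It suggests that the new figures (672, 840, 67,200,000, 84,000,000) correspond to 7 languages, which matches the seven languages listed in bold after the removal of Southern Italian varieties, even though the revised sentence still reads "8 languages".

```python
# Sketch: re-derive the cell and token counts quoted in both revisions.
# corpus_size is a hypothetical helper; GENRES_WITH_SUBDIVISIONS = 10 is an
# assumption inferred from the quoted totals, not stated in the page itself.

TOKENS_PER_CELL = 100_000      # tokens per language/genre/period cell
PERIODS = 12                   # half centuries
GENRES = 8
GENRES_WITH_SUBDIVISIONS = 10  # assumed: 960 / (8 * 12) == 840 / (7 * 12) == 10


def corpus_size(languages, genres, periods=PERIODS, tokens_per_cell=TOKENS_PER_CELL):
    """Return (number of cells, total tokens) for the given dimensions."""
    cells = languages * genres * periods
    return cells, cells * tokens_per_cell


# Old revision: 8 languages, Southern Italian varieties still listed.
old_plain = corpus_size(8, GENRES)
old_split = corpus_size(8, GENRES_WITH_SUBDIVISIONS)

# New revision: the quoted totals match 7 languages.
new_plain = corpus_size(7, GENRES)
new_split = corpus_size(7, GENRES_WITH_SUBDIVISIONS)

# Per-language total, identical in both revisions.
per_language = corpus_size(1, GENRES)

for label, (cells, tokens) in [
    ("old, 8 genres", old_plain),
    ("old, with subdivisions", old_split),
    ("new, 8 genres", new_plain),
    ("new, with subdivisions", new_split),
    ("single language", per_language),
]:
    print(f"{label}: {cells} cells, {tokens:,} tokens")
```

Under these assumptions the per-language figure of 9,600,000 tokens is the same in both revisions, since it depends only on the 8 genres and 12 periods.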
  
  
Line 49: Line 48:
 ==== LIST OF TEXTS ====
  
-  - [[https://www.dropbox.com/s/rkfgoypndlngub7/FR%20selection%20of%20texts%20WEB.pdf?dl=0| Proposal for French]].
+  - [[https://www.dropbox.com/s/rkfgoypndlngub7/FR%20selection%20of%20texts%20WEB.pdf?dl=0| Proposal for French version Word]] [[https://www.dropbox.com/s/h49ygvpyeeidx1i/DATABASE%20-%20FRENCH%20-%20selection_2020.06.15--.pdf?dl=0| Version Excel]]
   - Proposal for Italian [not available yet]
-  - Proposal for Southern Italian varieties [not available yet]
-  - [[https://www.dropbox.com/s/ktnwzy47iu8wpsl/CAT%20selection%20of%20texts%20WEB.pdf?dl=0| Proposal for Catalan]]
-  - [[https://www.dropbox.com/s/tul94eslfnkson0/SP%20selection%20of%20texts%20WEB.pdf?dl=0| Proposal for Spanish]]
-  - [[https://www.dropbox.com/s/ljg19wxx8ly2ec2/PORT%20selection%20of%20texts%20WEB.pdf?dl=0| Proposal for Portuguese]]
-  - [[https://www.dropbox.com/s/vdqzwpre95ojj93/OC%20selection%20of%20texts%20WEB.pdf?dl=0| Proposal for Occitan]]
-  - Gascon [not available yet]
-  - [[https://www.dropbox.com/s/oijkemzf3oltclp/SARD%20selection%20of%20texts%20WEB.pdf?dl=0| Proposal for Sardinian]]
-  - [[https://www.dropbox.com/s/rl3li1fdifczbox/FR-PROV%20selection%20of%20texts%20WEB.pdf?dl=0| Proposal for Francoprovençal]]
-  - Proposal for Rhaeto-Romance [not available yet]
+  - [[https://www.dropbox.com/s/ktnwzy47iu8wpsl/CAT%20selection%20of%20texts%20WEB.pdf?dl=0| Proposal for Catalan version Word]] [[https://www.dropbox.com/s/k0fifehag1ptifj/DATABASE%20-%20CATALAN%20-%20selection_2020.06.10..pdf?dl=0 | Version Excel]]
+  - [[https://www.dropbox.com/scl/fi/6nkvr5txatgvr3av6h7gb/SP-selection-of-texts-WEB.docx?dl=0&rlkey=79pg4vag3w3rl0vmrjzpv7k3q| Proposal for Spanish version Word]] [[https://www.dropbox.com/s/jbc7dzsmndwaipg/DATABASE%20-%20SPANISH%20-%20selection_2020.07.09-.pdf?dl=0 | Version Excel]]
+  - [[https://www.dropbox.com/scl/fi/we5qvk2fgprwbzx38nyso/PORT-selection-of-texts-WEB.docx?dl=0&rlkey=b6rn7ksdfh87uikrtkbtmd917| Proposal for Portuguese version Word]] [[https://www.dropbox.com/s/a7gkwkhi84stmc2/DATABASE%20-%20PORTUGUESE%20-%20%20selection_2021.05.pdf?dl=0 | Version Excel]]
+  - [[https://www.dropbox.com/s/vdqzwpre95ojj93/OC%20selection%20of%20texts%20WEB.pdf?dl=0| Proposal for Occitan version Word]] [[https://www.dropbox.com/s/7les6yd27lay4mi/DATABASE%20-%20OCCITAN%20-%20selection_2020.07.09..pdf?dl=0 | Version Excel]]
+  - Proposal for Gascon [not available yet]
+  - [[https://www.dropbox.com/scl/fi/m63gf5bh8vugsqusi8wms/SARD-selection-of-texts-WEB.docx?dl=0&rlkey=wcmpieksj6n3agcpqvp6q82pj| Proposition for Sardinian version Word]] [[https://www.dropbox.com/s/t3amwpfew6su1n8/DATABASE%20-%20SARDINIAN-def..pdf?dl=0| Version Excel]]
+  - [[https://www.dropbox.com/scl/fi/m63gf5bh8vugsqusi8wms/SARD-selection-of-texts-WEB.docx?dl=0&rlkey=wcmpieksj6n3agcpqvp6q82pj| Proposal for Francoprovençal version Word]] [[https://www.dropbox.com/s/3yloh1jhst6c5oq/DATABASE%20-%20FRANCOPROVEN%C3%87AL-d%C3%A9f.pdf?dl=0 | Version Excel]]
+  -  Proposal for Rhaeto-Romance [not available yet]
+
 + 
 + 
 + 
 + 
corpus.1598203279.txt.gz · Last modified: 2020/08/23 17:21 by annapineda