Classification of Corpora

What is a Corpus?

The word corpus comes from the Latin word for ‘body’ (compare the French ‘corps’). Both ‘corpora’ and ‘corpuses’ are used as its plural. A corpus is a large set of texts that is stored and processed electronically; the texts may be written, spoken, or a mixture of the two.

The Oxford Companion to the English Language (1992) defines a corpus as “a body of texts, utterances, or other specimens considered more or less representative of a language and usually stored as an electronic database”.

The British National Corpus (BNC), for example, comprises approximately 100 million words, of which about 90% are written texts and 10% are transcripts of spoken texts. In the same vein, Cook suggests that ‘the word corpus refers to a databank of language which has actually occurred, whether written, spoken or a mixture of the two’. The written texts come from magazines, books, diaries, newspapers, letters and popular fiction, whereas the spoken texts can be any recorded formal or informal speech: telephone conversations, dialogues, radio shows, political meetings…

As mentioned above, the written texts come from books, periodicals and newspapers, while the spoken texts come from telephone conversations, dialogues, classroom interactions, public discussions… These spoken texts are recorded and then transcribed into written form. Generally speaking, a corpus is any electronic set of texts or database that is available on a computer, as software or via the internet. For the sake of simplicity, however, we will adopt McEnery and Wilson’s definition (2001: 197), which considers a corpus “a finite collection of machine-readable texts, sampled to be maximally representative of a language or variety”.

To conclude, all the definitions above suggest, in one way or another, that it is the collection of texts itself that makes a corpus worth studying. In this respect, a corpus can be very helpful for linguists, teachers and students. They simply type the target word or expression and obtain both quantitative results (how often it occurs) and qualitative results (the contexts in which it occurs). Computational technologies allow us to gather and analyse very large bodies of data.

Because corpora are stored as electronic texts, they offer further benefits:
– Accessibility
– Speed
– Accuracy
– Availability
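
The idea of typing a target word and obtaining quantitative and qualitative results can be illustrated with a minimal Python sketch. It is not the method of any particular corpus tool: the function names (`tokenize`, `frequency`, `concordance`) and the two-sentence sample are assumptions made for illustration only. The count stands for the quantitative result and the keyword-in-context lines for the qualitative one.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def frequency(tokens, word):
    """Return how many times `word` occurs in the token list."""
    return Counter(tokens)[word.lower()]


def concordance(tokens, word, width=3):
    """Return keyword-in-context lines with `width` tokens on each side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == word.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines


# Invented sample text; a real corpus would be read from files.
sample = "A corpus is a body of texts. A large corpus can be searched quickly."
tokens = tokenize(sample)

print(frequency(tokens, "corpus"))
for line in concordance(tokens, "corpus"):
    print(line)
```

A real corpus interface would add lemmatisation, sorting of the context lines and much larger data, but the two operations stay the same in principle.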

Classification of Corpora

Nowadays, linguists can find many types of corpora, which differ in the purposes they were created for and in their contents. Among the most important kinds are national corpora, monitor corpora, reference corpora, synchronic corpora, diachronic corpora, multilingual corpora, spoken corpora, and developmental and learner corpora.

National Corpora

National corpora aim to represent the national language of a country; they are mainly built from selections of both written and spoken texts. Generally, they are made freely available on the internet. Well-known English national corpora include:
– The British National Corpus (BNC)
– The American National Corpus (ANC)

Monitor Corpora

Monitor corpora are expandable; that is to say, they are continuously updated with new words, new expressions and new texts in order to keep pace with recent and rapid language change. Examples of monitor corpora:

– The Bank of English Corpus.
– The COBUILD Corpus.

Reference Corpora

Reference corpora have a fixed size: once designed, they are not expanded, at least for a certain period of time. The BNC is an example.

Synchronic Corpora

Synchronic corpora represent a language as it is used at one period of time. They are of paramount importance for comparative studies of English worldwide and of its actual use in different countries. The best-known example is the International Corpus of English (ICE), which is specifically designed for synchronic studies of the English language around the world.

The ICE is composed of many sub-corpora (written and spoken texts) from countries where English is the first language or at least an official language (Australia, Canada, USA, Hong Kong…).

Diachronic Corpora

In contrast to synchronic corpora, diachronic (or historical) corpora contain texts in the same language from different periods of time. They allow linguists and researchers to carry out historical studies of a target language and to find valuable information about how the language changed from one period to another. The Helsinki Corpus of English Texts is the best-known diachronic corpus; it contains about 1.5 million words of English dating from the 8th century to the 18th century (Old, Middle, and Early Modern English).
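
A diachronic comparison of this kind can be sketched in a few lines of Python. The sketch below only illustrates the principle: the two toy sub-corpora and the helper `per_thousand` are hypothetical and are not taken from the Helsinki Corpus. Frequencies are normalised per 1,000 tokens so that periods of different sizes remain comparable.

```python
from collections import Counter


def per_thousand(tokens, word):
    """Relative frequency of `word` per 1,000 tokens (0.0 for an empty list)."""
    if not tokens:
        return 0.0
    return 1000 * Counter(tokens)[word] / len(tokens)


# Hypothetical toy sub-corpora keyed by period; not real historical data.
periods = {
    "earlier": "thou art here and thou art welcome".split(),
    "later": "you are here and you are welcome".split(),
}

for label, tokens in periods.items():
    thou = round(per_thousand(tokens, "thou"), 1)
    you = round(per_thousand(tokens, "you"), 1)
    print(label, "thou:", thou, "you:", you)
```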

Multilingual Corpora

These corpora contain texts in several languages, or in at least two languages in the case of bilingual corpora. The best example is the European Multilingual Corpus, in which texts and their translations are selected from the European Parliament and some European commissions. This corpus covers all the European languages and other non-European languages such as Chinese, Japanese…

When a text is presented by one of the European members, whatever his or her language, the other members read it and/or listen to its translation in all the available official languages of the parliament or commission. In this way, an accurate database is built up in which one can find translations of words, concepts, expressions and texts.
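
Looking up translations in such a database can be sketched as a search over sentence-aligned pairs. The Python example below is a minimal illustration under stated assumptions: the English-French pairs are invented, and the function `translations_of` is a hypothetical helper, not part of any existing parallel-corpus tool.

```python
import re

# Hypothetical sentence-aligned English-French pairs, invented for illustration.
aligned = [
    ("The session is open.", "La séance est ouverte."),
    ("The session is closed.", "La séance est levée."),
    ("I thank the members.", "Je remercie les membres."),
]


def translations_of(pairs, term):
    """Return target sentences aligned with source sentences containing `term`."""
    term = term.lower()
    return [
        target
        for source, target in pairs
        if term in re.findall(r"\w+", source.lower())
    ]


for sentence in translations_of(aligned, "session"):
    print(sentence)
```

The search returns whole aligned sentences rather than single words, so the user still has to read the context to see how the term was rendered.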

Spoken Corpora

These corpora consist purely of spoken material, accompanied by information about the speakers such as:
– Age (the interest may be on teenagers’ speech or adult speech for example)
– Social class (to study how a special social class can affect a language)
– Region (for example to study how some words or expressions are pronounced in a target region)
– Gender

Spoken corpora are built from dialogues, monologues, classroom interactions, lectures, political speeches, sports commentaries, advertisements, business meetings, telephone and mobile conversations, internet chats, conferences, radio and television talks…
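
The speaker information listed above is what makes it possible to select sub-sets of a spoken corpus. As a minimal sketch, assuming each transcribed utterance is stored together with its speaker's metadata, the selection could look as follows in Python. The `Utterance` record, its field names and the three sample records are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    """One transcribed utterance with hypothetical speaker metadata."""
    text: str
    age: int
    social_class: str
    region: str
    gender: str


def select(utterances, **criteria):
    """Keep utterances whose metadata equals every given criterion."""
    return [
        u for u in utterances
        if all(getattr(u, key) == value for key, value in criteria.items())
    ]


def by_age(utterances, low, high):
    """Keep utterances whose speaker is between `low` and `high` years old."""
    return [u for u in utterances if low <= u.age <= high]


# Invented records for illustration only.
samples = [
    Utterance("well I reckon so", 17, "working", "North", "F"),
    Utterance("indeed that is correct", 45, "middle", "South", "M"),
    Utterance("yeah no way", 15, "middle", "North", "M"),
]

teen_speech = by_age(samples, 13, 19)
northern_male = select(samples, region="North", gender="M")
print(len(teen_speech), len(northern_male))
```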

Developmental and Learner Corpora

The material comes mainly from pupils or students acquiring English as their first language (L1) or their second language (L2). The learners’ exam papers and written compositions are collected, corrected and coded anonymously. One can then explore this kind of corpus to find, for example, the most frequent words, expressions and mistakes produced by beginners. Well-known developmental and learner corpora include:
– The Longman Learners’ Corpus: 10 million words.
– The Cambridge Learner Corpus (CLC): 20 million words.
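
Once learner texts have been corrected and coded, finding the most frequent mistakes amounts to counting the codes. The Python sketch below shows the idea only: the sentences and the error codes ("VT" for verb tense, "SP" for spelling, "ART" for article) are invented for illustration and do not reproduce the coding scheme of any of the corpora named above.

```python
from collections import Counter

# Hypothetical error-coded learner sentences: (text, list of error codes).
scripts = [
    ("I have went to school yesterday", ["VT"]),
    ("She is teacher in a scool", ["ART", "SP"]),
    ("He go to the libary", ["VT", "SP"]),
]


def error_counts(coded):
    """Count how often each error code occurs across all coded sentences."""
    return Counter(code for _, codes in coded for code in codes)


counts = error_counts(scripts)
for code, n in counts.most_common():
    print(code, n)
```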

Table of Contents

General Introduction
CHAPTER 1: CORPORA
1. Introduction
2. What is a Corpus?
3. Size of Corpora
4. Classification of Corpora
4.1. National Corpora
4.2. Monitor Corpora
4.3. Reference Corpora
4.4. Synchronic Corpora
4.5. Diachronic Corpora
4.6. Multilingual Corpora
4.7. Spoken Corpora
4.8. Developmental and Learner Corpora
5. Design of a Corpus
6. Transforming a Raw Corpus into a Useful One
6.1. Annotating a Corpus (Tagging and Parsing)
6.2. Tagging a Corpus
6.2.1. Example of Tagging
6.2.2. Tagging a Text
6.3. Parsing a Corpus
7. The Useful Corpus
7.1. Frequency
7.2. Parts of Speech Frequencies
7.3. Concordances
7.4. Collocations
8. The Uses of Corpora in Applied Linguistics
8.1. Computational Linguistics
8.2. Corpus Linguistics
8.3. Corpora and Lexicography
8.4. Corpora and Dictionaries
8.5. Corpora and Register Study
9. Translation
10. Corpora and Historical Studies
11. Corpora and Sociolinguistics
11.1. Lexical Variation by Gender
12. Corpora and ESP
13. Conclusion
CHAPTER 2: EXPLORING TEXTBOOKS
1. Introduction
2. References and Origins Text Study
2.1. Middle School Text References and Origins
2.1.1. The First Year Middle School Text Origins
2.1.1.1. First Year Middle School Reference Text Study Recapitulation
2.1.2. The Second Year Middle School Text Origins
2.1.2.1. Second Year Middle School Reference Text Study Recapitulation
2.1.3. The Third Year Middle School Text Origins
2.1.3.1. Third Year Middle School Reference Text Study Recapitulation
2.1.4. The Fourth Year Middle School Text Origins
2.1.4.1. Fourth Year Middle School Reference Text Study Recapitulation
2.1.5. Middle School Reference Text Study Recapitulation
2.2. Secondary School Text references and Origins
2.2.1. First Year Secondary School Text Origins
2.2.1.1. First Year Secondary School Reference Text Study Recapitulation
2.2.2. Second Year Secondary School Text Origins
2.2.2.1. Second Year Secondary School Reference Text Study
2.2.3. Third Year Secondary School Text Origins
2.2.3.1. Third Year Secondary School Reference Text Study Recapitulation
2.2.4. Secondary School Reference Text Study Recapitulation
3. Content Type Study
3.1. Middle School Files’ Topic Taxonomy
3.2. Secondary School Textbook File Types
3.2.1. Secondary School Files’ Topic Taxonomy
4. Domain Study
4.1. BNC Domains
4.2. Middle School Textbook Domain Study
4.2.1. First Year Middle School Textbook Domain Study
4.2.2. Second Year Middle School Textbook Domain Study
4.2.3. Third Year Middle School Textbook Domain Study
4.2.4. Fourth Year Middle School Textbook Domain Study
4.2.5. Middle School Textbook Domains’ Recapitulative Study
4.3. Secondary School Textbook Domain Study
4.3.1. First Year Secondary School Textbook Domain Study
4.3.2. Second Year Secondary School Textbook Domain Study
4.3.3. Third Year Secondary School Textbook Domain Study
4.3.4. Secondary School Textbook Domains’ Recapitulative Study
5. Conclusion
CHAPTER 3: TEXTBOOK PEDAGOGY RECOMMENDATIONS
1. Introduction
2. The Algerian Educational System
3. The competency based approach through Middle and Secondary school Textbooks
3.1. From Competency Based Approach to Pedagogical Objectives
4. The Fundamental Competencies
5. Middle School and Secondary School Textbook File Study
5.1. Middle school textbook File Study
5.1.1. First year Middle School Textbook File Study
5.1.2. Second Year Middle School Textbook File Study
5.1.3. Third Year Middle School Textbook Study
5.1.4. Fourth Year Middle School Textbook
5.2. Secondary School Textbook File Study
5.2.1. First Year Secondary School Textbook File Study
5.2.2. Second Year Secondary School Textbook File Study
5.2.3. Third Year Secondary School Textbook File Study
6. Pupils’ Expected Outcomes
7. Competency Based Approach and projects
7.1. Textbooks’ Project Pedagogy
7.2. Intelligence Detection by Projects
7.3. Project Procedure
8. Middle School Textbooks’ Projects
8.1. First Year Middle School Textbook Projects
8.2. Second Year Middle School Textbook Projects
8.3. Third Year Middle School Textbook Projects
8.4. Fourth Year Middle School Textbook Projects
9. Secondary School Textbook Projects
9.1. First Year Secondary School Textbook Projects
9.2. Second Year Secondary school Textbook Projects
9.3. Third Year Secondary School Textbook Projects
10. The pedagogical Relationship between File Contents and Projects
10.1. Middle School Textbook Relationship between File Contents and Projects
10.1.1. First Year Middle School Projects
10.1.2. Second Year Middle School Projects
10.1.3. Third Year Middle School Projects
10.1.4. Fourth Year Middle School Projects
10.2. Secondary School Textbook Relationship between File Contents and Projects
10.2.1. First Year Secondary School Projects
10.2.2. Second Year Secondary School Projects
10.2.3. Third Year Secondary School Projects
11. Conclusion
CHAPTER 4: FINDINGS AND DISCUSSIONS
1. Introduction
2. Textbook Glossaries
2.1. Middle School Textbook Glossary
2.2. Secondary School Textbook Glossaries
3. General Processing of Texts
3.1. Middle School Textbooks
3.1.1. Specific Frequencies
3.1.1.1. First Year Middle School Texts and wordlists
3.1.1.2. Second Year Middle School Texts and Wordlists
3.1.1.3. Third Year Middle School Texts and Wordlists
3.1.1.4. Fourth Year Middle School Texts and Wordlists
3.1.1.5. Middle School Recapitulative Study
3.2. Secondary School Textbooks
3.2.1. Text 1
3.2.1.1. Frequency Ranges
3.2.2. Text 2
3.2.2.1. Frequency Ranges
3.2.3. Text 3
3.2.3.1. Frequency Ranges
3.2.4. Text 4
3.2.4.1. Frequency Ranges
3.2.5. Text 5
3.2.5.1. Frequency Ranges
3.2.6. Text 6
3.2.6.1. Frequency Ranges
3.2.7. Text 7
3.2.7.1. Frequency Ranges
3.2.8. Text 8
3.2.8.1. Frequency Ranges
3.2.9. Text 9
3.2.9.1. Frequency Ranges
3.2.10. Text 10
3.2.10.1. Frequency Ranges
3.2.11. Text 11
3.2.11.1. Frequency Ranges
3.2.12. Text 12
3.2.12.1. Frequency Ranges
3.2.13. Text 13
3.2.13.1. Frequency Ranges
3.2.14. Text 14
3.2.14.1. Frequency Ranges
4. Recapitulative Study
4.1. Range Recapitulative Study
4.2. Academic Recapitulative Study
5. Corpus Text Analysis
5.1. “Computers Vs Books” Corpus Study
5.1.1. Corpus Analysis
5.1.2. The Frequency Lists
5.1.3. Advanced Investigations
5.1.3.1. Synonymy Investigation
5.1.3.2. Genre Investigation
5.1.3.3. Related Word Frequencies
5.1.3.4. “Computers Vs Books” Collocation Study
5.1.3.5. “Computers Vs Books” Word Study
5.1.3.6. Text’s Highlights
5.1.3.7. Noun Frequencies
5.1.3.8. Distinctive Parts of the Text Study
5.1.3.9. Words’ Category Study
5.1.3.10. Head Word Frequencies
5.1.3.11. Modality Investigation
5.2. “The Story behind Supermarket Success” Text Analysis
5.2.1. Corpus Results
5.2.1.1. Frequency Ranges
5.2.1.2. The Frequency Lists
5.2.1.3. Specific Word Frequencies
5.2.1.4. Advanced Investigations
5.2.1.4.1. Synonymy Investigation
5.2.1.4.2. Singular Vs Plural Investigation
5.2.1.4.3. Recapitulative Study
5.2.1.5. Genre Investigation
5.2.1.6. Style Investigation
5.2.1.7. Frequent Word Categories
5.2.1.8. Head Word Frequencies
9. Conclusion
General Conclusion