Corpus Linguistics

Corpus and intuition

Corpus linguistics would not have emerged without computing. But computing has also enabled the development of other areas of linguistic research, including natural language processing (Traitement Automatique des Langues, TAL).

TAL (which encompasses the two approaches of « computational linguistics » and « linguistic engineering »; [Bourigault, 2007 : 25–42]) is a field essentially oriented toward building software applications that rely on linguistic knowledge.

Work in this field notably assesses what linguistic information contributes to information access. Among the applications concerned, TAL contributes to information retrieval, information extraction, automatic summarization, question answering systems and machine translation. G. Leech points out that neither LCI nor TAL is a discipline in the strict sense; each is defined instead by the methods and tools it puts in place:

« The only other branch of linguistics which, like corpus linguistics, refers to a tool or methodology rather than a subject-matter is computational linguistics, defined as the investigation of language by means of computers. But nowadays, there is an obvious and growing overlap between corpus linguistics and computational linguistics.

When we talk about corpus linguistics today, of course we assume that the corpus is machine readable, and is to be investigated by means of computers. So in fact the branch of linguistics we are discussing at this Symposium should strictly be labeled “computer corpus linguistics” to distinguish it from the corpus linguistics of the pre-computer age. » [Leech, 1992 : 106]

LCI and TAL share similar objects and use common tools: corpora, computer programs and linguistic resources. N. W. Francis (see also [Laks, 2008]) notes, however, that the use of corpora in linguistics did not begin with the invention of the computer: « I will confine myself to corpora accumulated B.C., i.e. before the use of computers […] Some seem to believe that there were not corpora before that. The truth is that many important corpora of English were assembled long before the computer was invented. » [Nelson W. Francis, 1992]

If corpus-based work already existed, how does the computerization of corpora justify distinguishing a new corpus linguistics?

First of all, it would be reductive to consider that computerizing the linguist's object of study merely makes the work easier, as Leech argues: « In my contribution to the Symposium, I wish to argue that computer corpus linguistics (henceforth CCL) defines not just a newly emerging methodology for studying language, but a new research enterprise, and in fact a new philosophical approach to the subject. The computer, as a uniquely powerful technological tool, has made this new kind of linguistics possible. So technology here (as for centuries in natural science) has taken a more important role than that of supporting and facilitating research: I see it as the essential means to a new kind of knowledge, and as an “open sesame” to a new way of thinking about language. »

The existence of computerized corpora makes it possible to envisage new approaches to language and to formulate new research hypotheses. A major contribution of this synergy between linguistics and computing is the possibility of quantifying linguistic phenomena within the framework provided by the corpus.
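As an illustration of what quantification within a corpus can mean in practice, the sketch below counts word forms in a toy text and normalizes them to a frequency per 100 tokens. It is not drawn from the works cited here: the function name `relative_frequencies`, the regular-expression tokenizer and the sample sentence are all assumptions made for the example, and a real study would rely on a proper tokenizer and a much larger corpus.

```python
import re
from collections import Counter


def relative_frequencies(text, per=100):
    """Return the frequency of each word form per `per` tokens.

    Hypothetical helper: tokenization is a naive split on runs of
    word characters, lowercased, which is only an approximation.
    """
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return {}
    counts = Counter(tokens)
    total = len(tokens)
    return {form: count * per / total for form, count in counts.items()}


# Toy corpus of ten tokens (an invented sentence, not attested data).
toy_corpus = "the corpus is a corpus of texts and texts vary"
frequencies = relative_frequencies(toy_corpus)
print(frequencies["corpus"])
```

Normalizing by corpus size, rather than reporting raw counts, is what allows figures from corpora of different lengths to be compared.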

The kind of scientific enterprise Leech envisages in this article is the building of statistical models of linguistic performance (as product rather than as process; [ibid. : 108]). J. McH. Sinclair sees in the existence of computerized corpora a unique opportunity to develop new techniques for describing language: « The most exciting aspect of long-text data-processing, however, is not the mirroring of intuitive categories of description. It is the possibility of new approaches, new kinds of evidence, and new kinds of description. Here, the objectivity and surface validity of computer techniques become an asset rather than a liability. Without relinquishing our intuitions, of course, we try to find explanations that fit the evidence, rather than adjusting the evidence to fit a pre-set explanation. » [Sinclair, 1991 : 36] Computational tools are an « asset », and attested data