SAMPLING AND REPRESENTATIVENESS IN CORPUS CONSTRUCTION

Authors

  • Gulmira Uralova, 4th-year student, Faculty of Philology, Jizzakh State Pedagogical University. Email: urolovagulmira4@gmail.com
  • Nozima Toshtemirova, 4th-year student, Faculty of Philology, Jizzakh State Pedagogical University
  • Hakima Abdullajonova, teacher, Faculty of Philology, Jizzakh State Pedagogical University (supervisor)

Keywords:

corpus construction, sampling, representativeness, linguistic data, language variation, corpus design, data balance, register, authenticity, digital linguistics.

Abstract

Sampling and representativeness are two foundational principles in the construction of linguistic corpora. Without rigorous attention to them, a corpus risks becoming a distorted mirror of the language it aims to describe. Corpus design has grown from manually assembled text collections into sophisticated digital systems capable of storing and analyzing billions of words. The key challenge, however, remains constant: ensuring that the selected data accurately represent the linguistic varieties, registers, and communicative functions of a target language community. This paper examines theoretical and practical approaches to sampling and representativeness and explores their implications for corpus-based linguistic research. It highlights major frameworks, such as Sinclair’s representativeness model and Biber’s multidimensional approach, along with balanced corpus design principles from projects like the British National Corpus (BNC) and the Corpus of Contemporary American English (COCA). The discussion also evaluates the influence of sociolinguistic diversity, genre selection, and data authenticity in shaping a corpus that mirrors real-world language use. The paper further reflects on emerging challenges in digital linguistics, including the integration of social media texts, multimodal data, and machine-generated language. By synthesizing classical theories and modern methodologies, this work explores how sampling and representativeness support the scientific validity, generalizability, and credibility of corpus-based linguistic findings.

References

Biber, D. (1988). Variation across Speech and Writing. Cambridge University Press.

Leech, G. (1991). The State of the Art in Corpus Linguistics. In K. Aijmer & B. Altenberg (Eds.), English Corpus Linguistics. Longman.

McEnery, T., & Hardie, A. (2012). Corpus Linguistics: Method, Theory and Practice. Cambridge University Press.

Sinclair, J. (1996). EAGLES: Preliminary Recommendations on Corpus Typology. European Commission.

Hunston, S. (2002). Corpora in Applied Linguistics. Cambridge University Press.

Kennedy, G. (1998). An Introduction to Corpus Linguistics. Longman.

Baker, P., Hardie, A., & McEnery, T. (2006). A Glossary of Corpus Linguistics. Edinburgh University Press.

Davies, M. (2008). The Corpus of Contemporary American English (COCA). Brigham Young University.

Davies, M. (2013). Corpus of Global Web-Based English (GloWbE). Brigham Young University.

Tognini-Bonelli, E. (2001). Corpus Linguistics at Work. John Benjamins.

Published

2025-11-01 (updated 2025-11-24)
