BUILDING A GOOD QUALITY BILINGUAL CORPUS FOR A LOW-RESOURCE LANGUAGE PAIR

Authors

  • Nguyen Tien Ha, Hung Vuong University, Vietnam
  • Nguyen Hung Cuong, Hung Vuong University, Vietnam
  • Nguyen Van Vinh, VNU University of Engineering and Technology

DOI:

https://doi.org/10.51453/2354-1431/2023/962

Keywords:

Data mining, Big data, Bilingual corpus, Sentence alignment.

Abstract

In natural language processing (NLP), a good quality bilingual corpus is important for applications such as machine translation, bilingual dictionary construction, and cross-language retrieval. For low-resource language pairs such as Vietnamese-Lao, building such a corpus is very difficult because bilingual resources are rare. In this paper, we present a process for building a good quality bilingual corpus for a low-resource language pair and propose a novel sentence alignment method that takes advantage of modern pre-trained models for rich-resource languages. In our experiments on aligning sentences and building a bilingual corpus for the Vietnamese-Lao language pair, our method achieved higher precision and recall than other strong sentence alignment methods and produced a good quality sentence-aligned Vietnamese-Lao bilingual corpus.
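As an illustration of the precision and recall figures mentioned above, the following is a minimal sketch of how sentence alignment output is commonly scored against a gold standard. It assumes strict exact-match scoring of alignment links; the abstract does not state the evaluation protocol, so the `Link` type, the `precision_recall` function, and the toy data are hypothetical.

```python
from typing import Set, Tuple

# Hypothetical representation: one alignment link pairs a group of source
# sentence indices with a group of target sentence indices, e.g. a 1-2 link
# is ((1,), (1, 2)).
Link = Tuple[Tuple[int, ...], Tuple[int, ...]]


def precision_recall(predicted: Set[Link], gold: Set[Link]) -> Tuple[float, float]:
    """Score predicted links against gold links using exact matches only."""
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Toy data (invented for illustration): the gold standard has three links,
# one of which merges two target sentences.
gold = {((0,), (0,)), ((1,), (1, 2)), ((2,), (3,))}
# The aligner split the 1-2 link into two 1-1 links, so both are counted wrong.
predicted = {((0,), (0,)), ((1,), (1,)), ((1,), (2,)), ((2,), (3,))}

precision, recall = precision_recall(predicted, gold)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

Under strict matching, a partially correct link earns no credit; a more lenient variant would score overlapping links proportionally.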

Published

2023-06-27

How to Cite

Nguyễn, H., Nguyễn, C., & Nguyễn, V. (2023). BUILDING A GOOD QUALITY BILINGUAL CORPUS FOR A LOW-RESOURCE LANGUAGE PAIR. SCIENTIFIC JOURNAL OF TAN TRAO UNIVERSITY, 9(3). https://doi.org/10.51453/2354-1431/2023/962

Section

Natural Science and Technology