BUILDING A GOOD QUALITY BILINGUAL CORPUS FOR A LOW-RESOURCE LANGUAGE PAIR
Keywords: Data mining, Big data, Bilingual corpus, Sentence alignment.
In natural language processing (NLP), a good-quality bilingual corpus is essential for applications such as machine translation, bilingual dictionary construction, and cross-language retrieval. For low-resource language pairs such as Vietnamese-Lao, building such a corpus is very difficult because bilingual resources are rare. In this paper, we present a process for building a good-quality bilingual corpus for a low-resource language pair and propose a novel sentence alignment method that takes advantage of modern pre-trained models for rich-resource languages. In our experiments on aligning sentences and building a bilingual corpus for the Vietnamese-Lao language pair, the method achieved higher precision and recall than other strong sentence alignment methods and yielded a good-quality sentence-aligned Vietnamese-Lao bilingual corpus.
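As an illustration of the precision and recall figures mentioned above, the following is a minimal sketch of how these metrics are commonly computed for sentence alignment, assuming the standard link-level definition (a predicted alignment link counts as correct only if it exactly matches a gold link). The abstract does not state the exact evaluation protocol, so the `Link` representation, the `precision_recall` helper, and the toy data are hypothetical and not taken from the paper.

```python
from typing import Set, Tuple

# Assumption: an alignment link pairs a tuple of source sentence indices
# with a tuple of target sentence indices, e.g. ((1,), (1, 2)) is a 1-2 link.
Link = Tuple[Tuple[int, ...], Tuple[int, ...]]


def precision_recall(predicted: Set[Link], gold: Set[Link]) -> Tuple[float, float]:
    """Return (precision, recall) over exactly matching alignment links."""
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Toy data (hypothetical): three gold links, four predicted links.
gold = {((0,), (0,)), ((1,), (1, 2)), ((2,), (3,))}
predicted = {((0,), (0,)), ((1,), (1,)), ((2,), (3,)), ((3,), (4,))}

p, r = precision_recall(predicted, gold)
print(f"precision={p:.2f} recall={r:.2f}")
```

Exact matching is the strictest choice: the predicted 1-1 link for sentence 1 earns no credit even though it overlaps the gold 1-2 link. Some evaluations use a more lenient variant that rewards partial overlap.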
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
All articles published in SJTTU are licensed under a Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA) license. This means anyone is free to copy, transform, or redistribute articles for any lawful purpose in any medium, provided they give appropriate attribution to the original author(s) and SJTTU, link to the license, indicate if changes were made, and redistribute any derivative work under the same license.
Copyright on articles is retained by the respective author(s), without restrictions. A non-exclusive license is granted to SJTTU to publish the article and identify itself as its original publisher, along with the commercial right to include the article in a hardcopy issue for sale to libraries and individuals.
Although the conditions of the CC BY-SA license do not apply to authors (as copyright holders, they have no restrictions on their rights), by submitting to SJTTU, authors recognize the rights of readers and must grant any third party the right to use their article to the extent provided by the license.