Vietnamese treebank construction and entropy-based error detection
http://repository.vnu.edu.vn/handle/VNU_123/28373
Treebanks, especially the Penn treebank for natural language processing (NLP) in English, play an essential role in both research into and the application of NLP.
However, many languages still lack treebanks and building a treebank can be very complicated and difficult.
This work has a twofold objective.
The first is to share our results in constructing a large Vietnamese treebank (VTB) with three levels of annotation: word segmentation, part-of-speech tagging, and syntactic analysis.
The major steps in the treebank construction process are described with particular regard to properties specific to Vietnamese, such as the lack of word delimiters and its isolating morphology.
Those properties make sentences highly syntactically ambiguous, and therefore it is difficult to ensure a high level of agreement among annotators.
Various studies of Vietnamese syntax were employed not only to define annotations but also to systematically deal with ambiguities.
Title: | Vietnamese treebank construction and entropy-based error detection |
Authors: | Nguyen, Phuong-Thai; Le, Anh-Cuong; Ho, Tu-Bao |
Keywords: | Treebank; Error detection; Entropy |
Issue Date: | 2015 |
Publisher: | Đại học Quốc gia Hà Nội (Vietnam National University, Hanoi) |
Citation: | ISIKNOWLEDGE |
Description: | Language Resources and Evaluation, Volume 49, Issue 3, Pages 487-519, Published September 2015; TNS05625 |
URI: | http://repository.vnu.edu.vn/handle/VNU_123/28373 |
Appears in Collections: | VNU Hanoi (ĐHQGHN) articles in Web of Science |