Bo Li (李博) · Master's Thesis (Computer Science)
Thesis Abstract

Research on Chinese Word Segmentation and Proposals for Improvement

Unlike English and other Western languages, written Chinese has no explicit delimiter, such as a blank space, to mark word boundaries. Chinese Word Segmentation (CWS) is therefore a fundamental task, and an acknowledged problem, in Chinese natural language processing. CWS is defined as the process of transforming Chinese text from sequences of characters into sequences of words.


In this research project, the state of the art of CWS is systematically investigated and the real difficulties in CWS research are analyzed. The work mainly targets segmentation ambiguity and proposes a CWS system that includes two ideas for pre-segmentation, ambiguity detection, and overlapping ambiguity disambiguation. The character-by-character maximum matching method performs better than the bi-directional maximum matching method and the Omni-segmentation method: it is able to detect both maximum overlapping ambiguity strings (MOAS) and combination ambiguity strings (CAS), and at the same time it costs less than the Omni-segmentation method. The web-search and rule-based disambiguation method performs reasonably well for MOAS disambiguation: according to the test results, the precision of MOAS disambiguation is about 89.07% with the web-search method alone, and it increases by 2.2% when two rules are applied.
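
As an illustration of the bi-directional maximum matching baseline mentioned above, the following is a minimal Python sketch. It is not the thesis's character-by-character method, whose details are not given in this abstract. The lexicon, the example sentence, and the function names are assumptions chosen for illustration. The idea is that a sentence is segmented greedily from the left and from the right, and a disagreement between the two results is taken as a sign of segmentation ambiguity.

```python
# Toy lexicon and sentence (hypothetical, for illustration only).
LEXICON = {"研究", "研究生", "生命", "起源", "的", "命"}
SENTENCE = "研究生命的起源"
MAX_WORD_LEN = 3  # longest entry in the toy lexicon


def forward_max_match(text, lexicon, max_len):
    """Greedy left-to-right segmentation: always take the longest lexicon word."""
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is accepted even if it is not in the lexicon.
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


def backward_max_match(text, lexicon, max_len):
    """Greedy right-to-left segmentation: always take the longest lexicon word."""
    words = []
    end = len(text)
    while end > 0:
        for size in range(min(max_len, end), 0, -1):
            candidate = text[end - size:end]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                end -= size
                break
    words.reverse()
    return words


forward = forward_max_match(SENTENCE, LEXICON, MAX_WORD_LEN)
backward = backward_max_match(SENTENCE, LEXICON, MAX_WORD_LEN)
print(forward)   # ['研究生', '命', '的', '起源']
print(backward)  # ['研究', '生命', '的', '起源']
print("ambiguous" if forward != backward else "consistent")
```

Here the two directions disagree on the span 研究生命, where 研究生 and 生命 overlap on the character 生. A known limitation of this baseline is that it misses ambiguities on which both directions happen to agree.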

Keywords: CWS (Chinese Word Segmentation), MOAS (maximum overlapping ambiguity string), CAS (combination ambiguity string), OOV (out-of-vocabulary).
Thesis Supervisor
Henning Christiansen

Professor, Department of Computer Science, Roskilde University
Email: henning@ruc.dk
Phone: (+45) 4674 3832
Homepage: http://akira.ruc.dk/~henning/
Thesis Progress
Defense Examiners
Erik Frøkjær

Associate Professor, Department of Computer Science, University of Copenhagen
Email: erikf@diku.dk
Phone: (+45) 3532 1456
Henning Christiansen

Professor, Department of Computer Science, Roskilde University
Email: henning@ruc.dk
Phone: (+45) 4674 3832

Thesis Evaluation
This master thesis report by Bo Li treats Chinese word segmentation, which is a complex problem in Chinese language processing research. It is of great importance for the construction and use of digital information search and retrieval systems, for digital support for language translation, and for many other fields within text processing of Chinese.


In brief, the word segmentation problem concerns how to break up a string of characters in such a way that each chunk of characters describes the meaningful element that the author wants to designate, for instance a word, a name, a phrase, a symbol, or something else to be described. This word segmentation problem is different and more complex in Chinese because Chinese is so different from Western languages, and because written Chinese does not make use of any kind of delimiters between characters to indicate the boundaries of each intended meaningful element. Typically, many different segmentations of a string of Chinese characters are perfectly meaningful; but different segmentations will represent different meanings, and some may be “syntactically right” but without sense. So the problem is how to pick the right one. For a well-trained Chinese reader this usually does not cause any real problem, especially when the reader is familiar with the subject matter of the text. But in digital text processing of Chinese, word segmentation is a hard research problem.
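
The point that one character string can admit several dictionary-consistent segmentations can be made concrete with a small Python sketch. It is only an illustration under stated assumptions: the lexicon and the example string are hypothetical, and exhaustive enumeration is shown for clarity, not as the technique used in the thesis.

```python
# Toy lexicon (hypothetical, for illustration only).
LEXICON = {"研究", "研究生", "生命", "生", "命"}


def all_segmentations(text, lexicon):
    """Return every way to split `text` into a sequence of lexicon words."""
    if not text:
        return [[]]  # exactly one way to segment the empty string
    results = []
    for end in range(1, len(text) + 1):
        prefix = text[:end]
        if prefix in lexicon:
            # Combine this prefix with every segmentation of the remainder.
            for rest in all_segmentations(text[end:], lexicon):
                results.append([prefix] + rest)
    return results


candidates = all_segmentations("研究生命", LEXICON)
for words in candidates:
    print(" / ".join(words))
# 研究 / 生 / 命
# 研究 / 生命
# 研究生 / 命
```

All three candidates are built from valid lexicon entries, so choosing among them requires information beyond the dictionary. The number of candidates can grow exponentially with the length of the string, which is one reason exhaustive enumeration is costly in practice.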

The main research question that the thesis investigates is: how can Chinese word segmentation be improved in order to achieve a higher precision rate? The author unfolds his research through discussions of three sub-questions: (1) Why is Chinese word segmentation such an important and complex problem? (2) What are the main problems that restrict the development of Chinese word segmentation? (3) What is the state of the art of Chinese word segmentation? Based on this, the author chooses to focus on segmentation ambiguity. Two types of segmentation ambiguity are discussed: overlapping segmentation ambiguity and combinational segmentation ambiguity. Selected techniques for detecting and clearing up ambiguity problems are explained in detail. The importance of the “out of vocabulary” problem as well as “true ambiguity” is also explained.

Against this background, the author contributes to current research on Chinese word segmentation with his own experimental system, an integrated word segmentation system which includes the following facilities: (a) pre-segmentation, (b) ambiguity detection, (c) proper noun detection, and (d) disambiguation. The author proposes a “character-by-character maximum matching segmentation technique” and compares this with two well-known segmentation techniques, the bi-directional maximum matching technique and the Omni-segmentation technique. The test results indicate that the character-by-character maximum matching technique gives better performance than the two well-known techniques. Further, the idea of a web-based technique for “maximum overlapping ambiguity string” (MOAS) disambiguation is proposed. This technique also shows promising results.

We find the thesis very systematically structured and well written. The thesis gives a very good and clear introduction to the word segmentation problem in the context of written Chinese, which is helpful even for Western readers with no background in the Chinese language. Examples from the Chinese language are chosen appropriately so as to illustrate key issues and problems to the readers. The coverage of the literature on the state of the art of Chinese word segmentation techniques is comprehensive and thorough. Furthermore, the author presents contributions to the field which appear to be novel and well thought out. Overall, the author demonstrates his ability to systematically analyze and describe a complex problem and to produce and substantiate his own original ideas.

Among the minor shortcomings of the thesis report we can mention:

♦ The review of the results from the SIGHAN BAKEOFF competitions, section 2.5, could have been improved by a more qualitative discussion to complement the detailed statistics.

♦ Concerning the “out of vocabulary” word problem, it is fully acceptable that the thesis does not try to come up with new solutions to this. However, as “out of vocabulary” is known to be a major problem in any natural language task, a more detailed analysis of the impact of this problem on the techniques for Chinese word segmentation would have been desirable. Thus, section 3.4 could have been extended.

♦ The mathematical precision and layout of formulas in parts of section 3 could have been improved. However, as the mathematical modeling is not a key issue in the thesis, and since the overall understanding of the concepts is demonstrated in the thesis, this issue does not change the overall good impression.

♦ The discussion of the experimental results presented in Chapter 4 could have been improved in order to clarify the specific contributions to the field.

The oral presentation of the master thesis as well as the following discussion were very satisfactory. Bo Li gave an interesting and concise overview of his study with a very clear focus on the important aspects. Further, he responded with insight and openness to all questions by the examiners. The examiners unanimously agree that Bo Li’s thesis work should be given the mark 12 (twelve).


Erik Frøkjær

External examiner
Associate professor
University of Copenhagen, Denmark

Henning Christiansen

Supervisor and internal examiner
Professor
Roskilde University, Denmark