Saturday, October 20, 2012

A Hybrid Method for Extracting Key Terms of Text Documents

Abstract: Key terms are important terms in a document that give the reader a high-level description of its contents. Extracting key terms is a basic step in many natural language processing tasks, such as document classification, document clustering, text summarization, and inferring the general subject of a document. This article proposes a new method for extracting key terms from text documents. An important feature of the method is that its output is a set of groups of key terms, where the terms in each group are semantically related to one of the main topics of the document. The proposed method combines two techniques: a measure of the semantic proximity of terms, computed over the Wikipedia knowledge base, and an algorithm for detecting communities in networks. One advantage of the method is that it requires no preliminary training, because it works directly with the Wikipedia knowledge base. Experimental evaluation showed that the method extracts key terms with high precision and recall.

Key words: Extraction Method, Key Term, Semantic Graph, Text Document

I. Introduction

Key terms (keywords or key phrases) are important terms in a document that give the reader a high-level description of its contents. Extracting key terms is a basic step in many natural language processing tasks, such as document classification, document clustering, text summarization, and inferring the general subject of a document (Manning and Schütze, 1999). In this article we propose a method for extracting document key terms that uses Wikipedia as a rich source of information about the semantic proximity of terms. Wikipedia (www.wikipedia.org) is a freely available encyclopaedia, now the largest encyclopaedia in the world. It contains millions of articles, available in several languages, along with redirect pages that map synonyms to the main title of an article. With its vast network of links between articles, its large number of categories, redirect pages, and disambiguation pages, Wikipedia is an extremely powerful resource for our work and for many other applications of natural language processing and information retrieval. Our method is based on the following two techniques: a measure of semantic proximity computed from Wikipedia, and a network-analysis algorithm, namely the Girvan-Newman algorithm for detecting communities in networks.
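To make the second technique concrete, here is a minimal, stdlib-only Python sketch of the Girvan-Newman idea: repeatedly remove the edge with the highest betweenness until the graph splits into communities. The term graph and its edges are invented for illustration, and edge betweenness is approximated by counting a single BFS shortest path per node pair (a simplification of the exact algorithm), so this is a pedagogical sketch rather than the authors' implementation:

```python
from collections import deque
from itertools import combinations

def components(nodes, edges):
    """Connected components of an undirected graph."""
    adj = {n: set() for n in nodes}
    for a, b in edges:
        adj[a].add(b); adj[b].add(a)
    seen, comps = set(), []
    for n in nodes:
        if n in seen:
            continue
        comp, queue = set(), deque([n])
        while queue:
            u = queue.popleft()
            if u in comp:
                continue
            comp.add(u)
            queue.extend(adj[u] - comp)
        seen |= comp
        comps.append(comp)
    return comps

def edge_betweenness(nodes, edges):
    """Approximate edge betweenness: for every node pair, credit the
    edges on one BFS shortest path between them (a simplification of
    Brandes' exact algorithm, adequate for this toy graph)."""
    adj = {n: set() for n in nodes}
    for a, b in edges:
        adj[a].add(b); adj[b].add(a)
    score = {frozenset(e): 0 for e in edges}
    for s, t in combinations(nodes, 2):
        prev, queue = {s: None}, deque([s])   # BFS from s
        while queue and t not in prev:
            u = queue.popleft()
            for v in adj[u]:
                if v not in prev:
                    prev[v] = u
                    queue.append(v)
        if t not in prev:
            continue                          # s and t are disconnected
        v = t                                 # walk the path back to s
        while prev[v] is not None:
            score[frozenset((v, prev[v]))] += 1
            v = prev[v]
    return score

def girvan_newman_step(nodes, edges):
    """Remove highest-betweenness edges until the graph splits."""
    edges = list(edges)
    start = len(components(nodes, edges))
    while len(components(nodes, edges)) == start:
        score = edge_betweenness(nodes, edges)
        worst = max(score, key=score.get)
        edges = [e for e in edges if frozenset(e) != worst]
    return components(nodes, edges)

# Toy semantic graph: two term clusters joined by one weak link.
nodes = ["python", "programming", "compiler", "football", "goal", "referee"]
edges = [("python", "programming"), ("programming", "compiler"),
         ("python", "compiler"), ("football", "goal"),
         ("goal", "referee"), ("football", "referee"),
         ("compiler", "football")]
print([sorted(c) for c in girvan_newman_step(nodes, edges)])
```

The single cross-cluster edge lies on every shortest path between the two clusters, so it accumulates the highest betweenness and is removed first, splitting the terms into two semantically coherent groups.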
A brief description of these techniques is given below. Establishing the semantic proximity of concepts in Wikipedia is a natural step towards a tool useful for problems of natural language processing and information retrieval. In recent years a number of articles have been published on computing semantic proximity between concepts using different approaches [7, 8, 3, 12]. [7] gives a detailed overview of many existing methods for computing the semantic proximity of concepts using Wikipedia. Although the method described in our article does not impose any requirements on how semantic proximity is determined, the efficiency of the method depends on...
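The article leaves the choice of proximity measure open. One widely used link-based measure from this literature is the Wikipedia Link-based Measure of Milne and Witten, which scores two articles by the overlap of the sets of articles linking to them. The sketch below shows that formula; the inlink sets and article count are invented for illustration, not taken from Wikipedia:

```python
from math import log

def wikipedia_link_proximity(inlinks_a, inlinks_b, total_articles):
    """Milne-Witten link-based relatedness of two Wikipedia articles,
    given the sets of articles that link to each of them.
    Returns a value in [0, 1]; 0 means no shared inlinks."""
    shared = inlinks_a & inlinks_b
    if not shared:
        return 0.0
    a, b, w = len(inlinks_a), len(inlinks_b), total_articles
    # Normalized Google Distance over inlink sets, inverted to a similarity.
    distance = (log(max(a, b)) - log(len(shared))) / (log(w) - log(min(a, b)))
    return max(0.0, 1.0 - distance)

# Invented inlink sets for two related terms (illustrative only).
cat = {"Mammal", "Pet", "Felidae", "Carnivore"}
dog = {"Mammal", "Pet", "Carnivore", "Wolf"}
print(round(wikipedia_link_proximity(cat, dog, total_articles=1_000_000), 3))  # → 0.977
```

Terms whose Wikipedia articles share many incoming links score close to 1, while unrelated terms score near 0, which is exactly the edge weight a semantic graph of candidate terms needs.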

Website: www.ijens.org
No of Page(s): 6
Download A Hybrid Method for Extracting Key Terms of Text Documents - IJENS.pdf
