Text classification systems

Many text classification systems are currently in use. Generally, they fall into two main categories: binary and multiclass systems. Binary classification systems assign each document to one of exactly two classes. As Maranis and Bebenko (2009) explain, these systems provide a Yes/No answer to the question: does this document belong to class X? Such systems are useful, for example, for classifying emails as spam or not, or commercial transactions as fraudulent or not. In such applications a binary classifier is the natural and simpler choice, since only two classes are involved.

Multiclass systems, in turn, divide documents into more than two classes. As the name indicates, these classifiers assign each document or data point to one of many classes, each covering a distinct subject area. Newspaper reports, for example, can be classified into categories such as news, sports, culture, business and money, politics, and science.

This thesis is concerned only with text clustering. That is, it makes no a priori assumptions about the interrelationships of Hardy's prose works. Computational text clustering methods fall into two main categories: linguistic methods, and mathematical and statistical methods (Srivastava and Sahami, 2009; Justo and Torres, 2005). Linguistic methods are based on natural language processing techniques. Methods of this type usually involve morphological and syntactic processing to extract meaning and identify relationships within documents. Mathematical and statistical classification...... half of the article...... including SenseClusters (Purandare and Pedersen, 2004). This and similar programs allow users to group similar contexts such as emails and web pages (Pedersen, 2008).
The working principle of such programs is that documents can be grouped according to their mutual contextual similarities (Purandare and Pedersen, 2004). Programs of this type have in fact proven to be a successful clustering method when applied to web pages, and their merits are more tangible with multimedia material. However, an approach of this type has some limitations. One of these, perhaps the most important, is that it does not deal with the content analysis of documents. A further drawback is that in almost all context classification applications “identical replications of controlled experiments lead to different conclusions” (Martin et al., 2005: 470).
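
The idea of grouping documents by mutual similarity can be illustrated with a minimal sketch. It is not the algorithm used by SenseClusters or any other program cited here. It assumes a plain bag-of-words representation, cosine similarity between word-count vectors, and a simple single-link grouping rule. The function names and the threshold of 0.3 are hypothetical and chosen only for illustration.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its word tokens."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def group_by_similarity(texts, threshold=0.3):
    """Single-link grouping: two texts end up in the same group if a chain
    of pairwise similarities at or above the threshold connects them.
    The threshold value is an illustrative assumption."""
    vectors = [bag_of_words(t) for t in texts]
    parent = list(range(len(texts)))  # union-find forest over text indices

    def find(i):
        # Follow parent links to the group representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(texts)):
        for j in range(i + 1, len(texts)):
            if cosine_similarity(vectors[i], vectors[j]) >= threshold:
                parent[find(j)] = find(i)  # merge the two groups

    groups = {}
    for i in range(len(texts)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


if __name__ == "__main__":
    sample = [
        "cheap loans apply now for cheap loans",
        "apply now for cheap loans and credit",
        "the match ended with a late goal",
        "a late goal decided the football match",
    ]
    # Each group is a list of indices into `sample`.
    print(group_by_similarity(sample))
```

Because the sketch compares surface word counts only, it shares the limitation noted above: texts are grouped by overlapping vocabulary, not by any analysis of their content.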