Comparison of different model’s performances in task of document classification

  • Authors Kristijan Spirovski, Evgenija Stevanoska, Andrea Kulakov, Zaneta Popeska, Goran Velinov
  • Research fields / Keywords Artificial intelligence, machine learning, information systems, natural language processing
  • Publication year 2018
  • Publication WIMS '18, ACM New York

Although the additional resources in Macedonian that can be used for solving information retrieval problems (or Natural Language Processing problems in general) are very limited, there are models general enough that they need no additional knowledge about the language. This paper presents a document classification model that does not rely on any language-specific additional resources. The model is trained and tested on a set of news articles extracted from Macedonian websites; each document is labeled with a class representing one of the twelve category sections from which the documents were extracted. The goal of this paper is to test different methods for feature selection and choice of vocabulary. Furthermore, we choose the model that gives the best accuracy on the document classification task and perform a sensitivity analysis on its architecture in order to further improve its performance. Although similar research already exists, this paper aims to combine different experiments and test them on Macedonian-language documents. The models used in this paper are Random Forest (RF), Support Vector Machines (SVM) and Neural Network (NN). The performed experiments showed that the best accuracy is achieved when each document is represented as a tf-idf vector, the vocabulary contains an equal number of representative words from each class, and a simple Neural Network with 3 hidden layers is used as the model. The main conclusion is that a language-independent model for solving the document classification problem can be successfully built for the Macedonian language, achieving around 80% accuracy on the test set.
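
The document representation described in the abstract (a vocabulary holding an equal number of words from each class, with each document encoded as a tf-idf vector over it) can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not say how "representative" words are chosen, so ranking words by their frequency within a class is an assumption made here, as is the plain idf = log(N / df) variant. The function names `balanced_vocabulary` and `tfidf_vectors` and the toy English tokens are hypothetical. The classifier stage (RF, SVM or NN) is left out.

```python
import math
from collections import Counter


def balanced_vocabulary(docs, labels, k):
    """Pick k words per class for the vocabulary.

    Assumption: a word is 'representative' of a class if it is frequent
    within that class; the paper may use a different criterion.
    """
    per_class = {}
    for tokens, label in zip(docs, labels):
        per_class.setdefault(label, Counter()).update(tokens)

    vocab = []
    for label in sorted(per_class):
        # Highest in-class count first; ties broken alphabetically.
        ranked = sorted(per_class[label].items(), key=lambda kv: (-kv[1], kv[0]))
        added = 0
        for word, _count in ranked:
            if added == k:
                break
            if word not in vocab:  # skip words another class already contributed
                vocab.append(word)
                added += 1
    return vocab


def tfidf_vectors(docs, vocab):
    """Encode each tokenized document as a tf-idf vector over vocab."""
    n_docs = len(docs)
    vocab_set = set(vocab)

    # Document frequency: number of documents containing each vocabulary word.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens) & vocab_set)

    # Plain idf variant (assumed): log(N / df).
    idf = {w: math.log(n_docs / df[w]) for w in vocab if df[w] > 0}

    vectors = []
    for tokens in docs:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty documents
        vectors.append([counts[w] / total * idf.get(w, 0.0) for w in vocab])
    return vectors


# Toy, already tokenized documents (hypothetical data, two classes).
docs = [
    ["goal", "match", "team"],
    ["team", "win", "match"],
    ["vote", "election", "party"],
    ["party", "vote", "law"],
]
labels = ["sport", "sport", "politics", "politics"]

vocab = balanced_vocabulary(docs, labels, k=2)
vectors = tfidf_vectors(docs, vocab)
print(vocab)  # ['party', 'vote', 'match', 'team']
```

Each row of `vectors` could then be passed to any of the three classifiers compared in the paper. Because every class contributes the same number of vocabulary entries, no single category dominates the feature space. Nothing in the sketch depends on Macedonian-specific resources beyond tokenization.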