Scientific journal
Bulletin of Higher Educational Institutions
North Caucasus region

TECHNICAL SCIENCES


UNIV. NEWS. NORTH-CAUCAS. REG. TECHNICAL SCIENCES SERIES. 2022; 3: 33-40

 

http://dx.doi.org/10.17213/1560-3644-2022-3-33-40

 

CLUSTERING A CORPUS OF TEXT DOCUMENTS USING THE K-MEANS ALGORITHM

Bulyga F.S., Kureichik V.M.

Bulyga Filipp S. – Graduate Student, Department «Computer-Aided Design Systems», bulyga@sfedu.ru

Kureichik Viktor M. – Doctor of Technical Sciences, Professor, Department «Computer-Aided Design Systems», vmkureychik@sfedu.ru

 

Abstract

This paper presents a solution to the problem of clustering a corpus of text documents using the K-means algorithm. Data preprocessing relies on two main approaches: a multi-stage algorithm for normalizing the input corpus, and the Information Gain measure for extracting the characteristic features of each document. Applied in sequence, these steps produce the set of features that most fully characterizes the original documents, which in turn improves the performance of the clustering algorithm. To test this hypothesis, a series of comparative experiments was conducted; the results show that the proposed solution outperforms the main classical clustering algorithms by 10-15% on average. The novelty of the proposed solution lies in the use of modernized input data preprocessing approaches together with the Information Gain measure, which is used to extract the set of characteristic features of an input document.
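The pipeline described in the abstract (normalization, Information Gain feature selection, K-means) can be illustrated with a minimal sketch. It is not the authors' implementation. The stop-word list, the sample texts, the provisional labels, and all function names are hypothetical. Information Gain is normally computed against class labels, and the abstract does not say how these are obtained, so the sketch simply assumes a small provisional labelling. The farthest-first centroid initialization is likewise an assumption chosen to keep the example deterministic.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"the", "a", "of", "and", "in", "to", "is"}  # hypothetical tiny stop list

def normalize(text):
    """Lowercase, keep alphabetic tokens, drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]

def entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def information_gain(docs, labels, term):
    """IG(term) = H(labels) - H(labels | term present or absent)."""
    present = [y for d, y in zip(docs, labels) if term in d]
    absent = [y for d, y in zip(docs, labels) if term not in d]
    cond = sum(len(part) / len(labels) * entropy(part)
               for part in (present, absent) if part)
    return entropy(labels) - cond

def select_features(docs, labels, top_n):
    """Keep the top_n terms with the highest Information Gain."""
    vocab = sorted({t for d in docs for t in d})
    return sorted(vocab, key=lambda t: -information_gain(docs, labels, t))[:top_n]

def vectorize(doc, features):
    """Term-frequency vector restricted to the selected features."""
    counts = Counter(doc)
    return [float(counts[t]) for t in features]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=20):
    """Lloyd's K-means with farthest-first initialization (an assumption)."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(list(far))
    assign = []
    for _ in range(iters):
        # Assignment step: nearest centroid for every point.
        assign = [min(range(k), key=lambda j: sq_dist(p, centroids[j]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:  # keep the old centroid if a cluster empties
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return assign

# Hypothetical toy corpus and provisional labels, for illustration only.
raw = [
    "The cat chased the mouse",
    "A cat and a kitten",
    "Stock market prices fell",
    "The market and stock trading",
]
provisional = [0, 0, 1, 1]
docs = [normalize(t) for t in raw]
features = select_features(docs, provisional, top_n=3)
clusters = kmeans([vectorize(d, features) for d in docs], k=2)
```

In this toy setting, restricting the vectors to the highest-scoring terms removes words that appear in only one document, which is the intuition behind selecting features before clustering.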

 

Keywords: clustering, classification, clustering of text documents, k-means, Information Gain, error matrix, Chameleon

 

Full text: [in elibrary.ru]

 
