Internal evaluation criteria for categorical data in hierarchical clustering

Optimal number of clusters determination

Authors

  • Zdeněk Šulc, University of Economics, Department of Statistics and Probability, Prague, Czechia (https://orcid.org/0000-0002-7624-8104)
  • Jana Cibulková, University of Economics, Department of Statistics and Probability, Prague, Czechia
  • Jiří Procházka, University of Economics, Department of Statistics and Probability, Prague, Czechia
  • Hana Řezanková, University of Economics, Department of Statistics and Probability, Prague, Czechia

DOI:

https://doi.org/10.51936/

Abstract

The paper compares 11 internal evaluation criteria for hierarchical clustering of categorical data with respect to how well they determine the correct number of clusters. The criteria are divided into three groups according to how they treat cluster quality: the variability-based criteria use the within-cluster variability, the likelihood-based criteria maximize the likelihood function, and the distance-based criteria use distances within and between clusters. The aim is to determine which evaluation criteria perform well and under what conditions. Different analysis settings, such as the hierarchical clustering method used, and various dataset properties, such as the number of variables or the minimal between-cluster distances, are examined. The experiment is conducted on 810 generated datasets, on which the criteria are assessed on their determination of the optimal number of clusters and on their mean absolute errors. The results indicate that the likelihood-based BIC1 and the variability-based BK criteria perform relatively well in determining the optimal number of clusters and that some criteria, usually the distance-based ones, should be avoided.
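
To make the two ideas in the abstract concrete, the following minimal sketch shows one possible within-cluster variability measure for categorical data and one way to score a criterion against known cluster counts. It is an illustration under stated assumptions, not the paper's method: the entropy-based measure is a generic stand-in (it is not the BK or BIC1 criterion), and the names within_cluster_entropy and assess, as well as the toy data, are hypothetical.

```python
from collections import Counter
from math import log


def within_cluster_entropy(rows, labels):
    """Size-weighted entropy per cluster, averaged over the variables.

    Hypothetical, generic variability measure for categorical data;
    it is NOT the BK or BIC1 criterion named in the abstract.
    """
    n = len(rows)
    n_vars = len(rows[0])
    total = 0.0
    for cluster in set(labels):
        members = [row for row, lab in zip(rows, labels) if lab == cluster]
        size = len(members)
        for j in range(n_vars):
            # Relative frequencies of the categories of variable j in this cluster.
            counts = Counter(row[j] for row in members)
            entropy = -sum((c / size) * log(c / size) for c in counts.values())
            total += (size / n) * entropy
    return total / n_vars


def assess(true_k, chosen_k):
    """Return (share of exact hits, mean absolute error) over datasets."""
    pairs = list(zip(true_k, chosen_k))
    hits = sum(t == c for t, c in pairs) / len(pairs)
    mae = sum(abs(t - c) for t, c in pairs) / len(pairs)
    return hits, mae


# Toy data (assumed for illustration): four objects, two categorical variables.
rows = [("a", "x"), ("a", "x"), ("b", "y"), ("b", "y")]
pure = within_cluster_entropy(rows, [0, 0, 1, 1])   # homogeneous clusters
mixed = within_cluster_entropy(rows, [0, 1, 0, 1])  # heterogeneous clusters

# Made-up cluster counts: the known number per dataset vs. a criterion's choice.
hits, mae = assess([2, 3, 4, 3], [2, 3, 5, 1])
```

On the toy data, the homogeneous partition yields zero variability while the mixed one does not, which is the behavior a variability-based criterion relies on. The assess helper reports both the share of datasets where the chosen number of clusters matches the known one and the mean absolute error of the chosen number.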

Published

2024-12-12

Section

Articles