Internal evaluation criteria for categorical data in hierarchical clustering: Optimal number of clusters determination

Zdeněk Šulc; Jana Cibulková; Jiří Procházka; Hana Řezanková

doi:10.51936/lxut1974

Authors

Zdeněk Šulc, University of Economics, Department of Statistics and Probability, Prague, Czechia, https://orcid.org/0000-0002-7624-8104
Jana Cibulková, University of Economics, Department of Statistics and Probability, Prague, Czechia
Jiří Procházka, University of Economics, Department of Statistics and Probability, Prague, Czechia
Hana Řezanková, University of Economics, Department of Statistics and Probability, Prague, Czechia

Abstract

The paper compares 11 internal evaluation criteria for hierarchical clustering of categorical data with respect to determining the correct number of clusters. The criteria are divided into three groups according to how they treat cluster quality. The variability-based criteria use the within-cluster variability, the likelihood-based criteria maximize the likelihood function, and the distance-based criteria use distances within and between clusters. The aim is to determine which evaluation criteria perform well and under what conditions. Different analysis settings, such as the hierarchical clustering method used, and various dataset properties, such as the number of variables or the minimal between-cluster distances, are examined. The experiment is conducted on 810 generated datasets, on which the evaluation criteria are assessed in terms of their determination of the optimal number of clusters and their mean absolute errors. The results indicate that the likelihood-based BIC1 and variability-based BK criteria perform relatively well in determining the optimal number of clusters and that some criteria, usually the distance-based ones, should be avoided.
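
To make two ideas from the abstract concrete, the following is a minimal sketch, not the authors' implementation. It computes a generic within-cluster variability measure for categorical data (a size-weighted mean entropy, which is an assumption about one reasonable definition) and the mean absolute error between estimated and true numbers of clusters. The function names `within_cluster_entropy` and `mean_absolute_error`, the toy data, and the example cluster counts are all hypothetical and chosen only for illustration.

```python
from collections import Counter
from math import log


def within_cluster_entropy(rows, labels):
    """Size-weighted mean entropy of each variable inside each cluster.

    Assumed illustrative definition: lower values mean more homogeneous
    clusters; 0.0 means every cluster is constant in every variable.
    """
    n = len(rows)
    n_vars = len(rows[0])

    # Group the observations by their cluster label.
    clusters = {}
    for row, label in zip(rows, labels):
        clusters.setdefault(label, []).append(row)

    total = 0.0
    for members in clusters.values():
        size = len(members)
        for j in range(n_vars):
            # Relative frequencies of the categories of variable j.
            counts = Counter(row[j] for row in members)
            entropy = -sum((c / size) * log(c / size) for c in counts.values())
            total += (size / n) * entropy / n_vars
    return total


def mean_absolute_error(estimated_k, true_k):
    """Average absolute gap between estimated and true cluster counts."""
    return sum(abs(e - t) for e, t in zip(estimated_k, true_k)) / len(true_k)


# Toy data: four observations described by two categorical variables.
rows = [("a", "x"), ("a", "x"), ("b", "y"), ("b", "y")]

pure = within_cluster_entropy(rows, [0, 0, 1, 1])   # homogeneous clusters
mixed = within_cluster_entropy(rows, [0, 1, 0, 1])  # heterogeneous clusters

# Hypothetical cluster counts chosen by one criterion on three datasets
# whose true number of clusters is 3.
mae = mean_absolute_error([3, 4, 2], [3, 3, 3])

print(pure, mixed, mae)
```

A measure like this typically shrinks as the number of clusters grows, so it cannot select the number of clusters on its own. Criteria built on it need some additional rule, and that rule is what separates one criterion from another.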
