Internal evaluation criteria for categorical data in hierarchical clustering
Optimal number of clusters determination
DOI: https://doi.org/10.51936/
Abstract
The paper compares 11 internal evaluation criteria for hierarchical clustering of categorical data with respect to how well they determine the correct number of clusters. The criteria are divided into three groups according to how they treat cluster quality: variability-based criteria use the within-cluster variability, likelihood-based criteria maximize the likelihood function, and distance-based criteria use distances within and between clusters. The aim is to determine which evaluation criteria perform well and under what conditions. Both analysis settings, such as the hierarchical clustering method used, and dataset properties, such as the number of variables or the minimal between-cluster distances, are examined. The experiment is conducted on 810 generated datasets, on which the criteria are assessed on their determination of the optimal number of clusters and on mean absolute errors. The results indicate that the likelihood-based BIC1 and the variability-based BK criteria perform relatively well in determining the optimal number of clusters, and that some criteria, usually the distance-based ones, should be avoided.
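To make the idea of a variability-based criterion concrete, the following is a minimal sketch, not the paper's implementation. It assumes that within-cluster variability is measured by the Gini mutability of each categorical variable, averaged over variables and weighted by cluster size; the function names `gini_mutability` and `within_cluster_variability` and the toy data are hypothetical, and the 11 criteria compared in the paper may be defined differently.

```python
from collections import Counter


def gini_mutability(values):
    """Gini mutability 1 - sum(p_u^2) of one categorical variable."""
    n = len(values)
    counts = Counter(values)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def within_cluster_variability(rows, labels):
    """Size-weighted mean mutability over clusters, averaged over variables.

    rows   -- list of tuples, one tuple of category values per object
    labels -- cluster label for each object (same order as rows)
    """
    n = len(rows)
    m = len(rows[0])
    total = 0.0
    for g in set(labels):
        members = [r for r, lab in zip(rows, labels) if lab == g]
        per_var = [gini_mutability([r[j] for r in members]) for j in range(m)]
        total += len(members) / n * sum(per_var) / m
    return total


# Toy data (assumed for illustration): two clearly separated groups.
rows = [("a", "x"), ("a", "x"), ("b", "y"), ("b", "y")]
one_cluster = within_cluster_variability(rows, [0, 0, 0, 0])
two_clusters = within_cluster_variability(rows, [0, 0, 1, 1])
print(one_cluster, two_clusters)
```

In this toy case the variability drops from the one-cluster solution to the two-cluster solution, which is the kind of change a variability-based criterion tracks when comparing candidate numbers of clusters along the hierarchy.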
License
Copyright (c) 2024 Zdeněk Šulc, Jana Cibulková, Jiří Procházka, Hana Řezanková (Author)

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.