Table of Contents
kmeans.h implements a number of algorithm for K-means quantization: Lloyd [16] , an accelerated version by Elkan [7] , and a large scale algorithm based on Approximate Nearest Neighbors (ANN). All algorithms support float
or double
data and can use the \(l^1\) or the \(l^2\) distance for clustering. Furthermore, all algorithms can take advantage of multiple CPU cores.
Please see K-means fundamentals for a technical description of K-means and of the algorithms implemented here.
Getting started
The goal of K-means is to partition a dataset into \(K\) “compact” clusters. The following example demonstrates using kmeans.h in the C programming language to partition numData
float
vectors into compute numCenters
clusters using Lloyd's algorithm:
Once the centers have been obtained, new data points can be assigned to clusters by using the vl_kmeans_quantize function:
Alternatively, one can directly assign new pointers to the closest centers, without bothering with a VlKMeans object.
There are several considerations that may impact the performance of KMeans. First, since K-means is usually based local optimization algorithm, the initialization method is important. The following initialization methods are supported:
Method | Function | Description |
---|---|---|
Random samples | vl_kmeans_init_centers_with_rand_data | Random data points |
K-means++ | vl_kmeans_init_centers_plus_plus | Random selection biased towards diversity |
Custom | vl_kmeans_set_centers | Choose centers (useful to run quantization only) |
See Initialization methods for further details. The initialization methods use a randomized selection of the data points; the random number generator init is controlled by vl_rand_init.
The second important choice is the optimization algorithm. The following optimization algorithms are supported:
Algorithm | Symbol | See | Description |
---|---|---|---|
Lloyd | VlKMeansLloyd | Lloyd's algorithm | Alternate EM-style optimization |
Elkan | VlKMeansElkan | Elkan's algorithm | A speedup using triangular inequalities |
ANN | VlKMeansANN | ANN algorithm | A speedup using approximated nearest neighbors |
See the relative sections for further details. These algorithm are iterative, and stop when either a maximum number of iterations (vl_kmeans_set_max_num_iterations) is reached, or when the energy changes sufficiently slowly in one iteration (vl_kmeans_set_min_energy_variation).
All the three algorithms support multithreaded computations. The number of threads used is usually controlled globally by vl_set_num_threads.