k-means with Python

By leonardo maffi
Version 2.0, 2 Jun 2007


A basic implementation of the k-means algorithm in pure Python: kmeans.zip
Notes about this code: it is almost a toy, it's slow, and probably it's fragile (all basic k-means algorithms are fragile; there are point distributions that make them go bang). When k and/or the number of points is large, using C or ShedSkin can be very useful. This code is very C-like because it's designed to work with ShedSkin (and because it comes from a C version).
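For reference, the core of the algorithm is the classic Lloyd iteration: assign every point to its nearest centroid, then move every centroid to the mean of the points assigned to it, and repeat until nothing moves. The following is only a minimal sketch of that loop with made-up names, not the (C-like) code contained in kmeans.zip:

    from random import sample

    def kmeans_sketch(points, k, max_iter=100):
        # points: list of (r, g, b) tuples; k: number of clusters.
        # Start from k random sample points (they may be equal colors; see the update notes below).
        centroids = sample(points, k)
        for _ in range(max_iter):
            # Assignment step: group each point with its nearest centroid.
            clusters = [[] for _ in range(k)]
            for p in points:
                best = min(range(k), key=lambda i: sum((a - b) ** 2
                           for a, b in zip(p, centroids[i])))
                clusters[best].append(p)
            # Update step: move each centroid to the mean of its cluster.
            new_centroids = []
            for i, cl in enumerate(clusters):
                if cl:
                    new_centroids.append(tuple(float(sum(ch)) / len(cl) for ch in zip(*cl)))
                else:
                    new_centroids.append(centroids[i])  # keep empty clusters in place
            if new_centroids == centroids:
                break  # converged
            centroids = new_centroids
        return centroids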

Info about the algorithm:
http://en.wikipedia.org/wiki/K-means_algorithm

Images and ideas from Alessandro Giusti, <giusti at leet point it>, http://www.leet.it/home/lale/clustering/

k-means program adapted from a C version by Roger Zhang, <rogerz at cs point dal point ca>
http://cs.smu.ca/~r_zhang/code/kmeans.c

(Note: Psyco seems to be useless on this function, while ShedSkin makes it about 40 times faster or more.)

If MatPlotLib is available, then the k_means.py module plots a scatterplot like this one:

Some tests:

All tests are done on truecolor PNG images. Note that k (the number of clusters, that is, the final number of colors) is always 32. No image uses dithering unless otherwise specified.

 
First original test image, colors.png
k-means with 1000 sample points
k-means with 5000 sample points
k-means with 10000 sample points
k-means with 5000 sample points, YIQ color space

 
Paint Shop Pro 7, optimized median cut
PhotoShop Perceptual
k-means with 10000 sample points, then dithered with PhotoShop, 80%
Photoshop Perceptual, dithered 80%

This is a very difficult image, but this algorithm works well (better than an optimized median cut) with just 1000 sample points. The YIQ color space changes the 3D space of the points, making the vector quantization closer to a perceptual one.
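The YIQ trick is just a change of coordinates before clustering: convert the sampled RGB points to YIQ, run the k-means in that space, then convert the resulting centroids back to RGB. A small sketch of the idea with the standard colorsys module (the code in the zip may do the conversion differently):

    import colorsys

    def rgb_points_to_yiq(points):
        # points: iterable of (r, g, b) with channels in 0..255;
        # colorsys works on 0..1 floats, so scale the channels first.
        return [colorsys.rgb_to_yiq(r / 255.0, g / 255.0, b / 255.0)
                for r, g, b in points]

    def yiq_points_to_rgb(points):
        # Convert the YIQ centroids back to 0..255 RGB, clamping rounding errors.
        result = []
        for y, i, q in points:
            r, g, b = colorsys.yiq_to_rgb(y, i, q)
            result.append(tuple(min(255, max(0, int(round(ch * 255))))
                                for ch in (r, g, b)))
        return result

    # Hypothetical usage with the sketch above:
    #   palette = yiq_points_to_rgb(kmeans_sketch(rgb_points_to_yiq(samples), 32))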

 
Second original image, dog.png
k-means with 10000 sample points, YIQ color space
k-means with 10000 sample points, YIQ color space, then dithering (with Fireworks, 100%)
Paint Shop Pro 7, optimized median cut

This image has high local variance and few color shades (grey, gold, white, dark grey), so 32 colors are enough to produce a realistic image even without dithering.

 
Third original test image, sea.png
k-means with 6000 sample points
k-means with 20000 sample points
k-means with 6000 sample points, YIQ color space
Paint Shop Pro 7, optimized median cut
PhotoShop Perceptual

Another difficult image. The sea and sky contain slowly changing colors, and the sharp golden lights have a color quite different from most of the other pixels of the image. The k-means on many samples probably gives the better image, but the image produced by PhotoShop has crisper golden lights.

The results of the program are quite good. The program is slow, but compiling it with ShedSkin makes it fast enough. A fuzzy k-means algorithm may be an improvement, but for most practical images a plain k-means is probably enough.

Possible improvements

I have tried to improve this basic algorithm. In practice the tests show that using the YIQ color space isn't useful for this program.

I have tried to remove duplicated colors from the images before submitting them to the k-means routine, but I have seen that for artificially generated images (which have zones of exactly the same color, like the Banner image below) this removes the higher "weight" of such common colors from the sampling, so the computed palette can show big color shifts.

Some of the quantized images seem to have desaturated colors, so I have tried to compute the palette on an image with slightly increased saturation (while doing the final quantization on the original image), but the results are worse. I have also tried to compute the palette on a sharpened image, and the tests show the results are equally worse.

 
Original sea image
k-means with 20000 sample points
k-means with 5000 sample points, on an up-saturated image
k-means with 5000 sample points, on a sharpened image

So I have tried to compute the k-means using a more perceptual color distance. I have found a simple, fast and probably good one here, which I have adapted (the Python version can be seen in colorDist.py):

http://www.compuphase.com/cmetric.htm
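The distance is a weighted Euclidean distance in RGB, where the red and blue weights depend on the mean red level of the two colors being compared. A Python transcription of that formula looks roughly like this (it may differ in small details from the colorDist.py in the zip):

    from math import sqrt

    def perceptual_color_dist(c1, c2):
        # Weighted RGB distance from http://www.compuphase.com/cmetric.htm :
        # the weights of the red and blue deltas slide with the mean red level.
        rmean = (c1[0] + c2[0]) // 2
        dr = c1[0] - c2[0]
        dg = c1[1] - c2[1]
        db = c1[2] - c2[2]
        return sqrt((((512 + rmean) * dr * dr) >> 8) + 4 * dg * dg
                    + (((767 - rmean) * db * db) >> 8))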

The results are quite similar to the old ones:

Original sea image
k-means with 20000 sample points
k-means with 5000 sample points, perceptual-based color distance

Original Colors image (the second line shows the image at 200% size)
k-means with 5000 sample points
k-means with 5000 sample points, perceptual-based color distance

Original truecolor Wasp image
k-means with 5000 sample points
k-means with 5000 sample points, perceptual-based color distance

Original Banner image (from Scientific American site), 256 colors (the second line shows the image at 200% size)
k-means with 5000 sample points, 32 colors
k-means with 5000 sample points, perceptual-based color distance, 32 colors

I have tried to improve the result by using the same perceptual color distance for the final quantization pass too, but the results aren't conclusive (sometimes better, sometimes not).

I have tried to quantize the images myself (k_means_demo_quant.py in the zip file), both with a Euclidean distance and with the same perceptual-based color distance, but the results are worse than doing the same quantization with Paint Shop Pro; I don't know why.
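The quantization pass itself is conceptually simple: for every pixel of the image, find the nearest palette color and replace the pixel with it. A naive sketch (hypothetical names; the real k_means_demo_quant.py may be organized differently):

    def sq_euclidean(c1, c2):
        # Squared Euclidean distance; squaring is enough when only comparing.
        return sum((a - b) ** 2 for a, b in zip(c1, c2))

    def quantize(pixels, palette, dist=sq_euclidean):
        # pixels: list of (r, g, b) tuples; palette: list of centroid colors;
        # dist: any color distance, e.g. sq_euclidean or perceptual_color_dist.
        return [min(palette, key=lambda c: dist(p, c)) for p in pixels]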

Note that there are quantitative ways to compare images, but I haven't used them here because the purpose of perceptual-based image processing is to improve the visual appearance of the resulting images, not numerical measures of their difference.

Changing k (the number of groups of the k-means, that is, the number of colors), the results are interesting. For k=256 the resulting image is similar to the one produced by PhotoShop, but the algorithm (in Python) is much slower. With k=16 the image produced by PhotoShop is clearly better: it actually contains yellow, green, cyan, etc. colors. Maybe I have to add some heuristics, or run the k-means function many times and take the best solution.
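The "many runs" idea is just a restart strategy: run the k-means several times from different random starting centroids, measure the total quantization error of each run, and keep the best palette. A sketch, using the hypothetical helpers defined in the sketches above:

    def best_of_n_runs(points, k, runs=10):
        # Keep the palette with the lowest within-cluster sum of squared
        # distances (the distortion) over several independent runs.
        best_palette, best_err = None, None
        for _ in range(runs):
            palette = kmeans_sketch(points, k)
            err = sum(min(sq_euclidean(p, c) for c in palette) for p in points)
            if best_err is None or err < best_err:
                best_palette, best_err = palette, err
        return best_palette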

k-means with 20000 samples, 256 colors

PhotoShop 256 colors

PhotoShop 16 colors
k-means with 10000 samples, 16 colors
k-means with 10000 samples, 16 colors, perceptual-based color distance

Using the perceptual-based color distance with so few colors is bad for this image, because the cyan-like colors become over-represented.

k-means with 10000 samples, 16 colors, dithering 80% with PhotoShop

Even on this difficult image, with only 16 colors chosen by the k-means function, a good dithering produces an acceptable image.

Update Feb 17 2006: I have modified the code, following a suggestion by David Locke. My code sometimes produced fewer than the requested k centroids. Choosing k starting centroids that are all different (when possible) solves this problem in all the tests I have done.
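In code terms the fix is small: pick the k starting centroids among the distinct colors of the sample, instead of among raw (possibly repeated) sample points. A sketch of the idea, not necessarily the exact code now in the zip:

    from random import shuffle

    def distinct_start_centroids(points, k):
        # Use each distinct color at most once as a starting centroid;
        # fall back to repetitions only when the sample contains fewer
        # than k distinct colors.
        distinct = list(set(points))
        shuffle(distinct)
        if len(distinct) >= k:
            return distinct[:k]
        return (distinct * k)[:k]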

Update May 6 2007: I have added the k-means++ heuristic for choosing the starting centroids to the kmeans module.
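The k-means++ heuristic chooses the first centroid uniformly at random among the sample points, and every further centroid with probability proportional to its squared distance from the nearest centroid already chosen. A compact sketch of the seeding step (sq_euclidean is the helper from the quantization sketch above):

    from random import random, choice

    def kmeanspp_seeds(points, k):
        centroids = [choice(points)]  # first centroid: uniform random pick
        while len(centroids) < k:
            # Squared distance of every point to its nearest chosen centroid.
            d2 = [min(sq_euclidean(p, c) for c in centroids) for p in points]
            # Pick the next centroid with probability proportional to d2.
            r = random() * sum(d2)
            acc = 0.0
            for p, w in zip(points, d2):
                acc += w
                if acc >= r:
                    centroids.append(p)
                    break
        return centroids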

Update Jun 2 2007: I have added a compiled C version (and improved the C code) plus a Python caller that uses ctypes (ctypes is in the standard library for Python 2.5+, but it can be installed on older versions of Python too; the compiled code in the zip file is for Windows, but it can be compiled with gcc on Linux too). The C routine is about 81 times faster than the CPython V.2.5.1 version, about 8 times faster than the Psyco V.1.5.2 version, and about 2 times faster than the ShedSkin version (but the C code doesn't yet contain the routine to assure the starting k centroids are all different).
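Calling the compiled routine through ctypes boils down to loading the shared library and describing the argument types of the exported function. The sketch below only illustrates the mechanism; the library name, the function name and its signature are hypothetical and don't necessarily match the code in the zip:

    import ctypes

    # Hypothetical library name: kmeans.dll on Windows, libkmeans.so on Linux.
    lib = ctypes.CDLL("./kmeans.dll")

    # Hypothetical C signature:
    #   void k_means(double *data, int n, int dims, int k,
    #                double tolerance, double *centroids_out);
    lib.k_means.argtypes = [ctypes.POINTER(ctypes.c_double), ctypes.c_int,
                            ctypes.c_int, ctypes.c_int, ctypes.c_double,
                            ctypes.POINTER(ctypes.c_double)]
    lib.k_means.restype = None

    def c_kmeans(points, k, tolerance=1e-4):
        # Flatten the (r, g, b) points into a flat C array of doubles.
        n, dims = len(points), 3
        data = (ctypes.c_double * (n * dims))(*[ch for p in points for ch in p])
        out = (ctypes.c_double * (k * dims))()
        lib.k_means(data, n, dims, k, tolerance, out)
        # Repack the returned centroids as (r, g, b) tuples.
        return [tuple(out[i * dims + j] for j in range(dims)) for i in range(k)]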

A basic implementation of the k-means algorithm in pure Python: kmeans.zip
