By Leonardo Maffi
Version 2.0, 2 Jun 2007
A basic implementation of the k-means algorithm in pure Python:
kmeans.zip
Notes about this code: it is almost a toy; it's slow, and probably fragile (all basic k-means algorithms are fragile, there are point distributions that make them go bang). When k and/or the number of points is big, using C or ShedSkin can be very useful. This code is very C-like because it's designed to work with ShedSkin (and because it comes from a C version).
Info about the algorithm:
http://en.wikipedia.org/wiki/K-means_algorithm
Images and ideas from Alessandro Giusti, <giusti at leet point it>, http://www.leet.it/home/lale/clustering/
k-means program adapted from a C version by Roger Zhang, <rogerz at cs point dal point ca>
http://cs.smu.ca/~r_zhang/code/kmeans.c
(Note: it seems that Psyco is useless on this function. ShedSkin makes it about 40+ times faster.)
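The algorithm behind the code is the classic two-step loop: assign every point to its nearest centroid, then move each centroid to the mean of its cluster, until nothing changes. A minimal sketch of the idea (a toy illustration, not the code shipped in kmeans.zip):

```python
import random

def k_means(points, k, max_iter=100):
    """Basic k-means on a list of numeric tuples (e.g. RGB triplets).
    A toy sketch of the algorithm, not the code in kmeans.zip."""
    def dist2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    centroids = random.sample(points, k)
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda i: dist2(p, centroids[i]))
            clusters[j].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its old centroid).
        new_centroids = [tuple(sum(ch) / len(cl) for ch in zip(*cl)) if cl
                         else centroids[j] for j, cl in enumerate(clusters)]
        if new_centroids == centroids:  # converged
            break
        centroids = new_centroids
    return centroids
```

Note that random.sample raises ValueError if k is bigger than the number of points, and real code would also stop on a movement tolerance instead of exact equality.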
If MatPlotLib is available, then the k_means.py module plots a scatterplot like this one:
All tests are done on truecolor PNG images. Note that k (the number of clusters, that is the final number of colors) is always 32. No image has dithering, unless otherwise specified.
First original test image, colors.png

k-means with 1000 sample points

k-means with 5000 sample points

k-means with 10000 sample points

k-means with 5000 sample points, YIQ color space

Paint Shop Pro 7, optimized median cut

PhotoShop Perceptual

k-means with 10000 sample points, then dithered with PhotoShop, 80%

PhotoShop Perceptual, dithered 80%

This is a very difficult image, but this algorithm works well (better than an optimized median cut) with just 1000 sample points. The YIQ color space changes the 3D space of the points, making the vector quantization closer to a perceptual-based one.
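For reference, the RGB to YIQ mapping is just a linear transform of the color cube; this sketch uses the standard NTSC coefficients (the stdlib colorsys.rgb_to_yiq does the same for channels in 0..1, with slightly different constants):

```python
def rgb2yiq(r, g, b):
    """Map an RGB color (channels 0..255) to YIQ: Y is luma, I and Q
    carry the chroma. Standard NTSC coefficients."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    i = 0.596 * r - 0.274 * g - 0.322 * b
    q = 0.211 * r - 0.523 * g + 0.312 * b
    return (y, i, q)
```

Since Y keeps the full 0..255 range while I and Q are squeezed into smaller ranges, Euclidean distances in YIQ space weigh luminance differences more than chroma differences, which is roughly what a perceptual quantizer wants.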
Second original image, dog.png

k-means with 10000 sample points, YIQ color space

k-means with 10000 sample points, YIQ color space, then dithering (with Fireworks, 100%)

Paint Shop Pro 7, optimized median cut

This image has a high local variance and few shades of colors (grey, gold, white, dark grey), so 32 colors are enough to produce a realistic image even without dithering.
Third original test image, sea.png

k-means with 6000 sample points

k-means with 20000 sample points

k-means with 6000 sample points, YIQ color space

Paint Shop Pro 7, optimized median cut

PhotoShop Perceptual

Another difficult image. The sea and sky contain slowly changing colors, and the sharp golden lights have a color quite different from most of the other pixels of the image. The k-means on many samples probably produces the better image, but the image produced by PhotoShop has crisper golden lights.
The results of the program are quite good. The program is quite slow, but compiling it with ShedSkin makes it fast enough. A fuzzy k-means algorithm may be an improvement, but for most practical images a plain k-means is probably enough.
I have tried to improve this basic algorithm. In practice the tests show that using the YIQ color space isn't useful for this program.
I have tried to remove duplicated colors from the images before submitting them to the k-means routine, but I have seen that for artificially generated images (which have zones of exactly the same color, like the Banner image below) this operation removes the higher "weight" of such common colors from the sampling, so the computed palette can show big color shifts.
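The effect can be shown with a toy example: sampling the pixels directly preserves the frequency ("weight") of a dominant flat color, while deduplicating first gives every color the same weight (the pixel counts and colors below are made up for illustration):

```python
import random
from collections import Counter

# A toy "banner-like" image: a large flat background plus two accents.
pixels = [(255, 255, 255)] * 900 + [(200, 0, 0)] * 60 + [(0, 0, 200)] * 40

random.seed(1)
# Sampling the pixels directly keeps the background's weight in the
# sample (white is drawn ~90% of the time)...
direct = Counter(random.choices(pixels, k=1000))
# ...while deduplicating first gives each of the 3 colors equal weight,
# so the k-means centroids drift toward the rare accent colors.
dedup = Counter(random.choices(list(set(pixels)), k=1000))
```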
I have seen that some of the quantized images seem to have low-saturated colors, so I have tried to compute the palette on an image with slightly increased saturation (the final quantization is still done on the original image), but the results are worse. I have then tried to compute the palette on a sharpened image, but from some tests I can see that the results are equally worse.
Original sea image

k-means with 20000 sample points

k-means with 5000 sample points, on an up-saturated image

k-means with 5000 sample points, on a sharpened image

So I have tried to compute the k-means using a more perceptual-based color distance. I have found a simple, fast and probably good one here, that I have adapted (the Python version can be seen in colorDist.py):
http://www.compuphase.com/cmetric.htm
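That page's metric slides the red and blue weights with the mean red level ("redmean"); my adaptation in colorDist.py should be along these lines (a sketch of the formula, channels in 0..255):

```python
from math import sqrt

def color_dist(c1, c2):
    """Perceptual-based RGB distance from compuphase.com/cmetric.htm:
    the red and blue weights depend on the mean red level. A sketch of
    the formula, not necessarily identical to colorDist.py."""
    rmean = (c1[0] + c2[0]) / 2.0
    dr, dg, db = (a - b for a, b in zip(c1, c2))
    return sqrt((2 + rmean / 256.0) * dr * dr
                + 4.0 * dg * dg
                + (2 + (255 - rmean) / 256.0) * db * db)
```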
The results are quite similar to the old ones:
Original sea image

k-means with 20000 sample points

k-means with 5000 sample points, perceptual-based color distance

Original Colors image (the second line shows the image at 200% size)

k-means with 5000 sample points

k-means with 5000 sample points, perceptual-based color distance

Original truecolor Wasp image

k-means with 5000 sample points

k-means with 5000 sample points, perceptual-based color distance

Original Banner image (from the Scientific American site), 256 colors (the second line shows the image at 200% size)

k-means with 5000 sample points, 32 colors

k-means with 5000 sample points, perceptual-based color distance, 32 colors

I have tried to improve the result using the same perceptual color distance for the final quantization pass too, but the results aren't conclusive (sometimes better, sometimes not).
I have tried to quantize the images myself (k_means_demo_quant.py in the zip file), both with a Euclidean distance and with the same perceptual-based color distance, but the results are worse than doing the same quantization with Paint Shop Pro; I don't know why.
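For reference, the final quantization pass itself is simple: replace every pixel with the nearest palette entry. A sketch with the plain Euclidean distance (a toy illustration of the idea, not the code in k_means_demo_quant.py):

```python
def quantize(pixels, palette):
    """Map each pixel to its nearest palette color (Euclidean distance,
    no dithering). A sketch, not the code in k_means_demo_quant.py."""
    def nearest(p):
        return min(palette, key=lambda c: sum((a - b) ** 2
                                              for a, b in zip(p, c)))
    cache = {}  # real images repeat colors a lot, so memoize lookups
    out = []
    for p in pixels:
        if p not in cache:
            cache[p] = nearest(p)
        out.append(cache[p])
    return out
```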
Note that there are quantitative ways to compare images, but I haven't used them here, because the purpose of perceptual-based image processing is to improve the visual appearance of the resulting images, and not numerical measures of their difference.
Changing k (the number of groups of the k-means, that is the number of colors), the results are interesting. For k=256 the resulting image is similar to the one produced by PhotoShop, but the algorithm (in Python) is much slower. With k=16 the image produced by PhotoShop is quite a bit better: it actually contains yellow, green, cyan, etc. colors. Maybe I have to add some heuristics, or do many runs of the k-means function and take the best solution.
k-means with 20000 samples, 256 colors

PhotoShop 256 colors

PhotoShop 16 colors

k-means with 10000 samples, 16 colors

k-means with 10000 samples, 16 colors, perceptual-based color distance

Using the perceptual-based color distance with so few colors is bad for this image, because the cyan-like colors become over-represented.
k-means with 10000 samples, 16 colors, dithering 80% with PhotoShop

It can be seen that even on this difficult image, with only 16 colors chosen by the k-means function, using a good dithering produces a good enough image.
Update Feb 17 2006: I have modified the code, following a suggestion by David Locke. My code sometimes produced fewer than the requested k centroids. Choosing the starting k centroids as all different (when possible) solves this problem in all the tests I have done.
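The fix amounts to sampling the starting centroids without repeating values; something like this (a sketch of the idea, not the exact code in the zip file):

```python
import random

def distinct_starting_centroids(points, k):
    """Choose k starting centroids that are all different, when the data
    allows it, so k-means can't collapse below k clusters at the start."""
    order = list(range(len(points)))
    random.shuffle(order)
    chosen = []
    for i in order:
        if points[i] not in chosen:  # for a big k, a set lookup is faster
            chosen.append(points[i])
            if len(chosen) == k:
                return chosen
    # Fewer than k distinct points exist: fill the rest with repeats.
    chosen += [random.choice(points) for _ in range(k - len(chosen))]
    return chosen
```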
Update May 6 2007: I have added the k-means++ heuristic for choosing the starting centroids to the kmeans module.
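The k-means++ heuristic (Arthur & Vassilvitskii) picks the first centroid uniformly at random, then draws each following centroid with probability proportional to its squared distance from the nearest centroid already chosen, which spreads the seeds out. A sketch of the heuristic, not the exact code in the kmeans module:

```python
import random

def kmeans_pp_init(points, k):
    """k-means++ seeding: spread the k starting centroids by
    D^2-weighted sampling. A sketch, not the module's code."""
    def d2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    centroids = [random.choice(points)]
    while len(centroids) < k:
        # Squared distance of every point to its nearest chosen centroid.
        weights = [min(d2(p, c) for c in centroids) for p in points]
        if sum(weights) == 0:  # all points coincide with chosen centroids
            centroids.append(random.choice(points))
        else:
            centroids.append(random.choices(points, weights=weights, k=1)[0])
    return centroids
```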
Update Jun 2 2007: I have added a compiled C version (and improved the C code) plus a Python caller that uses ctypes (ctypes is in the standard library for Python 2.5+, but it can be installed on older versions of Python too; the compiled code in the zip file is for Windows, but it can be compiled with GCC on Linux too). The C routine is about 81 times faster than the CPython V.2.5.1 version, about 8 times faster than the Psyco V.1.5.2 version, and about 2 times faster than the ShedSkin version (but the C code doesn't yet contain the routine to assure the starting k centroids are all different).
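The ctypes caller boils down to loading the shared library and declaring the argument types before the call; a sketch of the pattern (the library name, the exported function name and its signature here are illustrative, not necessarily those of the code in the zip file):

```python
import ctypes

def kmeans_c(points, k, lib_path="./kmeans.so"):
    """Run a compiled C k-means through ctypes on a list of RGB triplets.
    The exported name 'k_means' and its signature are hypothetical."""
    lib = ctypes.CDLL(lib_path)  # "kmeans.dll" on Windows
    n = len(points)
    flat = [float(ch) for p in points for ch in p]   # flatten n x 3 data
    data = (ctypes.c_double * len(flat))(*flat)
    out = (ctypes.c_double * (k * 3))()              # output centroids
    lib.k_means.argtypes = [ctypes.POINTER(ctypes.c_double), ctypes.c_int,
                            ctypes.c_int, ctypes.POINTER(ctypes.c_double)]
    lib.k_means.restype = None
    lib.k_means(data, n, k, out)
    return [tuple(out[3 * i: 3 * i + 3]) for i in range(k)]
```

Declaring argtypes/restype before the call lets ctypes check and convert the arguments, which avoids silent stack corruption on mismatched signatures.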