Skip to main content

Implementing k-means clustering in Ruby

Rundt EbbesvikvannetInspired by the partitioning problem I set about to implement a well known algorithm, k-means clustering, from memory, for fun! ... and, for science... Interestingly, this is somehow the opposite of the partitioning problem. In the partitioning problem we tried to maximize variation of categorical variables within groups, whereas here we're trying to find groups of elements that are the most "similar" to each other in a n-dimensional continuous space.

The main idea of k-means is the following - if you have a bunch of elements and a given number of clusters:
  1. Create the initial clusters randomly within the space spanned by the elements (typically (always?) you would pick randomly from your elements).
  2. Lob all elements into the cluster with the nearest center (using some Euclidean distance metric typically).
  3. Recenter each cluster on the average of its elements, 
  4. If necessary move the elements to their now nearest clusters. 
  5. Repeat the "re-centering" and moving of elements until we have a "stable" (enough) result.
Simple enough, but I don't think I have ever implemented it first hand. (At least not in Ruby.)

First I implemented some simple data structures, Points:
I use the snazzy method missing as a way to reference named dimensions of points easily, e.g. point.x and point.y (or point.z, point.t, point.w1, point.hausdorff, or point.scary).

Clusters are simply:
The definition of "moved" was the first design choice I could not remember entirely - is it when a certain number of points "jump" from a cluster to another or when the cluster center moves more than a certain distance? (Most of the time I would assume this boils down to the same.) I went for the latter. The clusters themselves keep track of if they have moved or not.

I also implemented a special case of points, 2D points:

Last helper method in here is simply to plot these clusters (in 2D), and I use gnuplot like so:

The real meat is in the k-means method itself of course:

This takes as parameters the points themselves (as an Enumerable of Points), the number of clusters (k), the delta (defines what is a "move"), and if we want to plot each iteration. (Especially interesting while debugging the algorithm... :-)) No real surprises here, I guess. I chose to delete clusters with no points. I'm not entirely sure if that is what's in the original algorithm. (I guess by now I should duckduckgo it. :-))

Then I created a very silly little script that generates 10000 random points in 2D and (tries to) put them into 7 clusters, gnuplot'ing each iteration, to test the algorithm.

Voila, the results are in:
montage -geometry "320"x"240" -tile 7x7 outfiles/clusters-7* clusters-all.png
Or maybe slightly more interesting, as an animated GIF (with sound?):
 convert outfiles/clusters-7* clusters-all.gif

Epiolog
I guess a variant of this could serve as a solution to the partitioning problem if we just map the categorical variables to the continuous space (using the weights?) and sample one element from each resulting (population/number of groups + 1) cluster (where the cluster size would be locked - Lloyd's algorithm?) to form groups.

Moral
It is useful to (try to) implement an algorithm you think you know from scratch from time time to see if you really do know all the details in the implementation of it... Especially in new languages.

Update: Clone this github if you for any reason would want to play around with this yourself: https://github.com/mortenjohs/k-means...

Comments

Popular posts from this blog

Fix your rapid blinking Marantz SR-6004 using nothing but 3 fingers - and a thumb

A couple of years ago my (most of the time excellent) Maranz SR6004 acted up. It did't want to turn itself on. Properly. Just stood there and blinked rapidly. Its little red light that is. At me. The solution was so simple that I didn't bother to write it down as I was sure to remember it. Alas, no. Some weeks ago it did it again. (Can it be the heat?) Just stood there blinking rapidly at me. The manual just said - as it said last time around - that it was time to return the unit to it's maker. Or similar. Some googling led me to this page:  http://www.allquests.com/question/4056803/Marantz-XXX4-Series-Failure-Issues.html  The technical term for what I had experienced seems to be "The Pop of Death". Aïe. But!, humongous letters said: YOU CAN SOMETIMES RESET THE UNIT BY PRESSING SURR MODE, CLEAR AND EXIT SIMULTANEOUSLY And so I did. And so it was fixed. And all was well. (And now I have written it down for the next time.)

Fix upside down Skype video in Ubuntu 12.10 [UPDATED]

When launching Skype in 64-bit Ubuntu 12.10 on my Asus U35J the webcam image was all topsy-turvy. Since I don't live in Australia, or something (tsk-tsk), this was not really cutting it for me.  Some quick googling led me to this forum post:  http://forums.pcpitstop.com/index.php?/topic/198236-why-is-my-skype-video-showing-upside-down/   After making sure that the necessary packages was installed (notably  libv4l-0) I adapted the command from the forum post to: LD_PRELOAD=/usr/lib/i386-linux-gnu/libv4l/v4l1compat.so skype and voila, the image was OK. Next step is for this to be set to default, which seems to be outlined here (in steps 2 and 3):  http://pc-freak.net/blog/how-to-fix-upside-down-inverted-web-camera-laptop-asus-k51ac-issue-on-ubuntu-linux-and-debian-gnu-linux/  (Actually this post seems to cover most of what is useful from the forum post above...) UPDATE (19/04/2013): Since my laptop was working fine, I decided it was about time to fix it. Also I wanted to

Using a Raspberry Pi as a MIDI USB/5-pin bridge

In my constant... need... to get everything music instrument related to communicate with each other, I wanted to look into ways to get some of my keyboards/synths with only MIDI over USB to talk to devices with regular good old-fashioned 5-pin MIDI ports from the eighties. Cables! First I had a quick look at off the shelf solutions. The most interesting one being the Kenton MIDI USB Host – providing MIDI host functionality for USB devices as well as regular MIDI in and out in a small box. Unfortunately it is rather expensive (~125 €) and a reliable online source warned me that it was not entirely stable in collaboration with my OP-1, so I started thinking of more... home-grown solutions. I decided to try to use my old Raspberry Pi and see if that would serve as a USB host with a borrowed MIDI USB adapter. (Thanks Simon.) A cheaper, and, as an added boon, a nerdier solution. Step 1: Get the USB MIDI device up and running This was the easy part. The device I have been lent