
Posts

Showing posts with the label stats

Implementing k-means clustering in Ruby

Inspired by the partitioning problem, I set about implementing a well-known algorithm, k-means clustering, from memory, for fun! ... and, for science... Interestingly, this is in some ways the opposite of the partitioning problem. In the partitioning problem we tried to maximize the variation of categorical variables within groups, whereas here we're trying to find groups of elements that are the most "similar" to each other in an n-dimensional continuous space. The main idea of k-means is the following. If you have a bunch of elements and a given number of clusters:

1. Create the initial clusters randomly within the space spanned by the elements (typically (always?) you would pick randomly from your elements).
2. Lob all elements into the cluster with the nearest center (typically using a Euclidean distance metric).
3. Recenter each cluster on the average of its elements.
4. If necessary, move the elements to their now-nearest clusters.
5. Repeat the "re-centering" and mov...
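To make the loop described above concrete, here is a minimal sketch in Ruby. It is not the script from the post: the method names (`distance`, `centroid`, `k_means`), the `max_iterations` cap, and the choice to leave an empty cluster's center where it was are all assumptions made for illustration. Elements are assumed to be arrays of floats of equal length.

```ruby
# Euclidean distance between two points (arrays of numbers of equal length).
def distance(a, b)
  Math.sqrt(a.zip(b).sum { |x, y| (x - y)**2 })
end

# Component-wise average of a non-empty list of points.
def centroid(points)
  points.transpose.map { |coords| coords.sum / coords.size.to_f }
end

# Returns [centers, clusters]. When the loop converges before max_iterations,
# clusters[i] holds exactly the points whose nearest center is centers[i].
def k_means(points, k, max_iterations: 100, rng: Random.new)
  # Pick the initial centers randomly from the elements themselves.
  centers = points.sample(k, random: rng)
  clusters = []

  max_iterations.times do
    # Lob every element into the cluster with the nearest center.
    clusters = Array.new(k) { [] }
    points.each do |point|
      nearest = (0...k).min_by { |i| distance(point, centers[i]) }
      clusters[nearest] << point
    end

    # Recenter each cluster on the average of its elements.
    # Assumption: an empty cluster simply keeps its previous center.
    new_centers = clusters.each_with_index.map do |cluster, i|
      cluster.empty? ? centers[i] : centroid(cluster)
    end

    # If no center moved, no element will change cluster either, so stop.
    break if new_centers == centers
    centers = new_centers
  end

  [centers, clusters]
end
```

Called as `centers, clusters = k_means(points, 2)`, this returns two centers plus the elements grouped around them. Because the starting centers are random, different runs can settle on different clusterings; passing a seeded `rng` makes a run repeatable.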

Partitioning - or, students into n groups - in Ruby

A while back a friend of mine asked if I could automate the creation of groups in a class of students, maximizing the variation within each group while minimizing the difference between the groups. This sounded like an interesting problem, so I set about solving it in my own naive, experimental, Monte Carlo (inspired) way, without looking into how this has been solved (surely elegantly) before. The result was a little Ruby script. First I just require some code I have previously written. (I guess I really should make it into gems or something instead of copying code around...) names.rb tries to guess sex from a person's name and/or title, and countries.rb maps countries to continents, regions etc. with some fuzzy matching of names. (I'm looking at you, Democratic Republic of the Congo, the Koreas, etc.) Easy-peasy. Then I set some standard variables. (Maybe I'll make this dynamic in the future, why not?) The most interesting entry here is the classifiers ...
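For readers who want a feel for the Monte Carlo idea, here is a rough sketch in Ruby. It is not the script from the post, which is not reproduced here. The `Student` struct, the `variation` and `score` methods, the scoring formula, and the `tries` parameter are all hypothetical stand-ins. It assumes each student carries a few categorical attributes, and it simply tries many random splits and keeps the best-scoring one.

```ruby
# Hypothetical student record: a name plus categorical attributes.
Student = Struct.new(:name, :sex, :continent)

# Variation inside one group: the number of distinct values per classifier, summed.
def variation(group, classifiers)
  classifiers.sum { |c| group.map { |student| student[c] }.uniq.size }
end

# Assumed scoring rule: reward total variation within groups and
# penalize the gap between the most and least varied group.
def score(groups, classifiers)
  variations = groups.map { |group| variation(group, classifiers) }
  variations.sum - (variations.max - variations.min)
end

# Try many random partitions and keep the best-scoring one.
def partition(students, group_count, classifiers, tries: 1000, rng: Random.new)
  best = nil
  best_score = -Float::INFINITY

  tries.times do
    shuffled = students.shuffle(random: rng)
    # Deal students out round-robin so group sizes differ by at most one.
    groups = Array.new(group_count) { [] }
    shuffled.each_with_index { |student, i| groups[i % group_count] << student }

    current = score(groups, classifiers)
    best, best_score = groups, current if current > best_score
  end

  best
end
```

A call such as `partition(students, 4, [:sex, :continent])` would return four arrays of students. More tries give a better chance of a good split, at the cost of run time.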