Partitioning - or, students into n groups - in Ruby

A while back a friend of mine asked if I could automate the creation of groups in a class of students - to maximize the variation within each group, but minimize the difference between them. This sounded like an interesting problem, so I set about to solve it in my own naive experimental Monte Carlo (inspired) way - without looking into ways this has been solved (surely elegantly) before. The result was this little Ruby script:
First I just require some code I have previously written. (I guess I really should make them into gems or something instead of copying code around...) names.rb tries to guess sex from a person's name and/or title, and countries.rb maps countries to continents, regions etc with some fuzzy matching of names. (I'm looking at you Democratic Republic of Kongo, Koreas etc.) Easy-peasy.

Then I set some standard variables. (Maybe I'll make this dynamic in the future, why not?) The most interesting entry here is the classifiers hash. This maps the columns we want to use later on to maximize the difference of to their respective weights. (This should of course be updated to match whatever is in the input file...) Another semi-useful parameter is number_of_runs. This is the number of simulations the partitioner will perform before keeping the best partitioning. And of course number_of_groups - the number of groups that the students will be split over.
Høst i Tete D'Or #1

count is just a variant of my count function already covered here in the blog (link).

I slurp up the list of participants as an array of hashes, from the file name specified above, converting the headers as specified in the variable_names_map hash and generating some standard variables in the process - unless they are already present.

The important function in my naive approach is the scoring_function. This basically looks at the extremes one variable at the time. That is, it checks to see if any value (or category) of a variable is over or underrepresented, based on the expected number of elements with this value. samples is a count of number of elements in the group. I iterate over all categories of the variable to find the worst sinner (squared). So if we are looking at sex, it will check male, and then female and generate a score based on the category that is most off from the expected value (The number of participants with that characteristics divided by number of participants in this group).

The penalty_function simply assigns scores to each group in the current partition using the scoring_function per classifier multiplied by the weight of the respective classifiers.

The main loop of the program runs the simulations number_of_runs times. I clone the participants list and sample from that one so that I can delete participants as I add them to the partitions as I generete them. (Maybe this step is unecessary if Ruby's sample function doesn't loop before all the elements have been exhausted...) Update: I halved the runtime by rather sampling once - a list of the same length as the original list and popping elements from that sampled list. The way I then instansiatiate the empty partitions hash is quite a neat Ruby feature/trick/hack. The default value of a hash (or array) can take blocks as constructors! This means that if you try to pull something non-existant from a hash a new element will be generated based on that block. In this case I generate an empty array when that occurs. (An alternative to that would be to use the ||= operator while adding elements, but that is maybe slightly less elegant...)
Høst i Tete D'Or #2
After this loop I just write out the best partition.

Voila - it takes about 10 5 seconds on my current computer(s) to generate these groups when number of runs is set to 10000. Not sure if the algorithm is scientifically sound (and it is quite possible rather an overkill), but it seems to actually do the job when assigning a flock of students to groups and was the first thing that sprang to my mind when wanting to create something quickly (It took way longer to write up this blogpost than the code) and... dirtily... 

The group work got done and groups seemed fairly balanced.

- Penalizes small groups more heavily than larger groups (if some groups has more participants than other)?
- Variables with more categories dominates the score? (There's probably a useful statistical tool to solve this... (Normalisation?))
- Variables have to be categoric (age has to be grouped).

Potential future work:
- Split out the scoring function (and possible the penalty function) as a stand alone library.
- Let the script take arguments for useful parameters (filename, number_of_runs, group_size, variable_names_map, classifiers etc...)
- Could it be interesting to look at continuos variables (ie. age) as well as categorical ones? 
- Rather than discarding previous partitions one could let new ones evolve from a certain number of branches? (Local minimum problems?)

(BTW; God jul!)