Sorting Cats with Hadoop and psort
[This post will also be published on http://lwsffhs.wordpress.com/]
This is my first “self” tutorial on Hadoop MapReduce streaming. If you are really IT oriented, you probably want to read http://hadoop.apache.org/docs/r0.15.2/streaming.html (or any newer version). This post doesn’t add much to that document with respect to Hadoop MapReduce streaming; here I play a bit with “sort” on the command line. You might want to read my previous notes first: psort: parallel sorting …. I will run these examples on a virtual cluster (libvirt/qemu/KVM) composed of 1 master node with 4 CPUs and 10 computing nodes with 2 CPUs each. The virtual nodes are distributed across two physical machines (I will post some details about this virtual cluster here in the future).
The question I had was: what does Hadoop MapReduce streaming actually do?
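One rough way to picture it, as a sketch under my own assumptions rather than the real thing: a streaming job behaves like a shell pipeline in which the mapper reads lines from stdin, the framework sorts the mapper output by key, and the reducer reads the sorted lines. The snippet below emulates that flow locally on a single machine, using cat as both mapper and reducer. The file name cats.txt and its content are made up for illustration, and nothing here touches Hadoop or a cluster.

```shell
#!/bin/sh
# Local emulation of the streaming data flow: map -> sort by key -> reduce.
# This is only an analogy; no Hadoop command is involved.
set -eu

workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

# Hypothetical input: one record per line.
printf 'tom\nfelix\ngarfield\nfelix\n' > "$workdir/cats.txt"

# Identity mapper (cat), a sort standing in for the shuffle/sort phase,
# and identity reducer (cat). LC_ALL=C makes the ordering byte-wise
# and therefore reproducible across locales.
result=$(cat "$workdir/cats.txt" | cat | LC_ALL=C sort | cat)

printf '%s\n' "$result"
```

With identity mapper and reducer, the only visible effect is the sort in the middle, which is why sorting is a convenient first experiment.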
psort: Parallel sorting on the command line. An example.
[This is a copy of the post on: http://lwsffhs.wordpress.com/ at http://lwsffhs.wordpress.com/2012/08/29/psort-parallel-sorting-on-the-command-line-an-example/ ]
I am in the process of understanding Hadoop and the MapReduce framework.
This introductory line will be clarified in the next post, but keep in mind that in this post I am not looking for the fastest sort; I am looking for a sort that fits within a parallel framework. I need to sort a lot of data 😉
I needed a simple piece of code that would work on my Q6600 processor and also on my 2 nodes with 16×2-core CPUs. Sorting seems to be a good example: it is easy to understand, easy to implement with the sort command, and a pretty typical problem. Moreover, Hadoop recently (maybe years ago on the IT time scale) won one of the sorting competitions (see here or also here; Google it for up-to-date data). It sounded like a good starting point for a simple, naive comparison.
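To make the idea of "a sort within a parallel framework" concrete, here is a minimal sketch of one common pattern: split the input into chunks, sort the chunks concurrently as background jobs, then merge the sorted chunks with sort -m. This is an illustration under my own assumptions and not necessarily the approach developed in the rest of this post; the file names, the chunk size of 250 lines and the generated test data are all hypothetical.

```shell
#!/bin/sh
# Split / sort-in-parallel / merge, using only standard tools.
set -eu

workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

# Hypothetical input: the numbers 0..999 in scrambled order
# (7919 is coprime to 1000, so every value appears exactly once).
awk 'BEGIN { for (i = 1; i <= 1000; i++) print (i * 7919) % 1000 }' \
    > "$workdir/input.txt"

# 1. Split into chunks of 250 lines: chunk.aa, chunk.ab, ...
split -l 250 "$workdir/input.txt" "$workdir/chunk."

# 2. Sort every chunk numerically, each one as a background job.
for f in "$workdir"/chunk.??; do
    sort -n "$f" > "$f.sorted" &
done
wait    # block until all background sorts have finished

# 3. Merge the already sorted chunks; -m merges without re-sorting.
sort -n -m "$workdir"/chunk.??.sorted > "$workdir/sorted.txt"

head -n 5 "$workdir/sorted.txt"
```

The chunk count controls how many sort processes run at once, so on a real machine it would be chosen to match the number of available cores.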