This is a very simple script which I use often to create temporary backups of working directories. As complexity grows, I needed to have a script to save locally the state of my work, so that I can easily revert to a previous state. It came out that I also use it for backups. See it:
There are some simple unix commands that are pretty useful but time to time I forget about. I often use them in handling large data sets and I am always surprised how a good pipe might save time and resources making possible to handle large amount of data within a small resources. It might make the difference between having an answer or not.
[This post will also be published on http://lwsffhs.wordpress.com/]
This is my first “self” tutorial on hadoop mapreduce streaming. If you are really IT oriented you probably want to read http://hadoop.apache.org/docs/r0.15.2/streaming.html (or any newer version). This post doesn’t add much to that document with respect to hadoop mapreduce streaming. Here I play a bit with the “sort” on the command line. Probably you might want to read first my previous notes: psort: parallel sorting …. I will run these examples in a virtual cluster (libvirt/qemu/KVM) composed of 1 master node with 4 CPUs and 10 computing nodes with 2 CPUs each. The virtual nodes are distributed in two physical machines (I will post here in the future some details about this virtual cluster).
The question I had was: what hadoop mapreduce streaming actually does?