There are some simple Unix commands that are pretty useful but that I forget about from time to time. I often use them when handling large data sets, and I am always surprised by how much time and resources a good pipe can save, making it possible to handle a large amount of data with few resources. It might make the difference between having an answer or not.
This is a simple script that I often use to quickly visualize data. It has been in my ~/bin directory for years, so I am giving it a chance to go public 😉
When I finished installing our local cluster (after a few days), I finally got to this video on YouTube. Nice video, cheers masterschema!
But I’ll give my own little report here anyway, for the record and as a more comprehensive written account.
(EDIT: Please note that I do not describe a fast deployment of hadoop, but rather a hands-on approach to a minimal setup that I used in order to learn some details. Tech people will probably find the network configuration across different physical nodes useful.)
Maybe a little intro on what this is about:
- Big data, data analytics, and the like. We want to test a hadoop installation and, not least, eventually use it to simplify some problems with a different approach.
- This is (in this blog) a demo installation and eventually a test case (no production here, even if it might be a starting point).
- You will not actually see any data! This is only about providing the tools.
- We focus on map/reduce and/or software implementation, which means we do not consider the system as a data storage solution.
- Besides this: have fun and check out how we set up a little virtual cluster…
[This post will also be published on http://lwsffhs.wordpress.com/]
This is my first “self” tutorial on hadoop mapreduce streaming. If you are really IT oriented, you probably want to read http://hadoop.apache.org/docs/r0.15.2/streaming.html (or any newer version). This post doesn’t add much to that document as far as hadoop mapreduce streaming is concerned. Here I play a bit with “sort” on the command line. You might want to read my previous notes first: psort: parallel sorting …. I will run these examples in a virtual cluster (libvirt/qemu/KVM) composed of 1 master node with 4 CPUs and 10 computing nodes with 2 CPUs each. The virtual nodes are distributed across two physical machines (I will post some details about this virtual cluster here in the future).
The question I had was: what does hadoop mapreduce streaming actually do?
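As a rough mental model (a sketch, not what the framework literally runs): hadoop streaming feeds input lines to a mapper, sorts the mapper output by key, and passes the sorted pairs to a reducer. The little Python script below imitates that "map | sort | reduce" chain in a single process. The word-count task and the names `mapper`, `reducer` and `run_local` are my own choices for illustration; they do not come from the hadoop documentation.

```python
from itertools import groupby
from operator import itemgetter


def mapper(lines):
    """Map step: emit one (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.split():
            yield word, 1


def reducer(sorted_pairs):
    """Reduce step: sum the values of consecutive pairs sharing a key.

    This only works because the pairs arrive sorted by key, which is
    the job of the sort step between map and reduce.
    """
    for key, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield key, sum(value for _, value in group)


def run_local(lines):
    """Imitate map | sort | reduce locally (hypothetical helper)."""
    pairs = sorted(mapper(lines), key=itemgetter(0))
    return list(reducer(pairs))


if __name__ == "__main__":
    sample = ["b a", "a c a"]
    for key, total in run_local(sample):
        # Tab-separated key and value, one pair per line
        print(f"{key}\t{total}")
```

On a real cluster the mapper and the reducer would be separate executables reading from stdin and writing to stdout, and the sort in the middle would be distributed over the nodes; here everything fits in memory, so a plain `sorted` is enough.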