This is a very simple script which I often use to create temporary backups of working directories. As complexity grew, I needed a script to save the state of my work locally, so that I could easily revert to a previous state. It turned out that I also use it for backups. Here it is:
There are some simple Unix commands that are pretty useful but that I forget about from time to time. I often use them when handling large data sets, and I am always surprised how much time and how many resources a good pipe can save, making it possible to handle large amounts of data with limited resources. It might make the difference between having an answer or not.
This is a simple script which I often use to quickly visualize data. It has been in my ~/bin directory for years, so I am giving it a chance to go public 😉
When I finished installing our local cluster (after a few days), I finally got to this video on YouTube. Nice video, cheers masterschema!
But I’ll give my little report here in any case, for the record and as a more comprehensive written form.
(EDIT: Please note, I do not describe a fast deployment of Hadoop, but rather a hands-on approach to a minimal setup that I used in order to learn some details. Tech people will probably find the network configuration across different hardware nodes useful.)
Maybe a little intro on what this is about:
- Big data, data analytics, and the like. We want to test a Hadoop installation and, not least, eventually use it to simplify some problems with a different approach.
- This is (in this blog) a demo installation and eventually a test case (not production, even if it might be a starting point).
- You will not see any data at all! This is only about setting up the tools.
- We focus on map/reduce and/or software implementation, which means we do not consider the system as a data storage solution.
- Besides this: have fun and check out how we set up a little virtual cluster…
[This post will also be published on http://lwsffhs.wordpress.com/]
This is my first “self” tutorial on Hadoop MapReduce streaming. If you are really IT oriented you probably want to read http://hadoop.apache.org/docs/r0.15.2/streaming.html (or any newer version). This post doesn’t add much to that document with respect to Hadoop MapReduce streaming. Here I play a bit with “sort” on the command line. You might want to read my previous notes first: psort: parallel sorting …. I will run these examples in a virtual cluster (libvirt/qemu/KVM) composed of 1 master node with 4 CPUs and 10 computing nodes with 2 CPUs each. The virtual nodes are distributed across two physical machines (I will post some details about this virtual cluster here in the future).
The question I had was: what does Hadoop MapReduce streaming actually do?
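As a rough answer to that question, here is a minimal local sketch in Python. It assumes the usual streaming convention: the mapper and the reducer exchange plain text lines, the key is the text before the first tab, and the framework sorts the map output by key before the reducer sees it. The word-count task and the names `mapper`, `reducer` and `run_streaming_locally` are made up for illustration; nothing here talks to a real cluster.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line per word, like a streaming mapper writing to stdout."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum the counts of consecutive lines sharing a key (input must be sorted by key)."""
    pairs = (line.split("\t", 1) for line in sorted_lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(value) for _, value in group)}"


def run_streaming_locally(lines):
    """Hypothetical helper: map, sort, reduce, all in one local process."""
    # sorted() stands in for the shuffle/sort step done between map and reduce.
    return list(reducer(sorted(mapper(lines))))


result = run_streaming_locally(["b a", "a c a"])
for line in result:
    print(line)
```

On a single machine this is the same idea as piping the input through a mapper, then `sort`, then a reducer; the cluster version spreads the map and reduce steps over many nodes.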
[This is a copy of the post on: http://lwsffhs.wordpress.com/ at http://lwsffhs.wordpress.com/2012/08/29/psort-parallel-sorting-on-the-command-line-an-example/ ]
I am in the process of understanding Hadoop and the map-reduce framework.
This introductory line will be clarified in the next post, but keep in mind that in this post I am not looking for the fastest sort but rather for a sort within a parallel framework. I need to sort a lot of data 😉
I needed some simple code that would work on my Q6600 processor and also on my two nodes with 16×2 cores. Sorting seems to be a good example: easy to understand, easy to implement with the sort command, and a pretty typical problem. Moreover, Hadoop recently (maybe years ago on the IT time scale) won one of the sorting competitions (see here or also here; Google it for up-to-date data). It sounded like a good starting point for a simple, naive comparison.
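To give an idea of what "a sort within a parallel framework" can look like, here is a minimal Python sketch of the split, sort, merge pattern. It is not the psort script from the post: the names `chunked` and `psort_sketch` are hypothetical, and threads stand in for the separate sort processes, so in CPython this illustrates the structure rather than a real speedup.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def chunked(items, n_chunks):
    """Split items into at most n_chunks contiguous pieces of nearly equal size."""
    size = max(1, -(-len(items) // n_chunks))  # ceiling division, at least 1
    return [items[i:i + size] for i in range(0, len(items), size)]


def psort_sketch(items, n_chunks=4):
    """Sort each chunk in its own worker, then merge the sorted runs."""
    chunks = chunked(items, n_chunks)
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        # Each worker sorts one chunk independently of the others.
        sorted_runs = list(pool.map(sorted, chunks))
    # heapq.merge combines already sorted runs without re-sorting everything.
    return list(heapq.merge(*sorted_runs))


print(psort_sketch([5, 3, 9, 1, 4, 8, 2, 7, 6, 0], n_chunks=3))
```

The final merge plays the role of a merge-only pass over pre-sorted files: each run is read once, in order, and no chunk has to be sorted again.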