hd bbc – our lab local virtual hadoop cluster

When I finished (after few days) to install our local cluster I could finally get on this video on you tube. Nice video, cheers masterschema!

But I’ll give here my little report in any case for the records and for some more comprehensive written form.

(EDIT: Please note, I do not describe a fast deployment of hadoop, but indeed an hand-on approach to a minimal set up that I used in order to learn some details. For tech people you probably find useful the network configuration on different hard nodes)

Maybe a little intro, what is this about:

  • Big data, data analytics, and the like. We want to test an hadoop installation. Not last, eventually use it to simplify some problems with a different approach.
  • This is (in this blog) a demo installation and eventually a test case. (no production here, even if it might be a starting point)
  • You will not see any data indeed! This is only about serving the tools.
  • We focus on map/reduce and/or software implementation, which means we do not consider the system for any data storage solution.
  • Beside this: have fun and check how we set up a little virtual cluster…

Lets start from the hosts machines first. At the lab we have a two nodes “cluster”, each one with:

  •  OS: CentOS 6.3 (6.0 upgraded to 6.3 with the exception of the kernel, we needed a specific disk driver)
  • CPU: 2 x AMD Opteron(TM) Processor 6276. 2.3 GHz (16 Cores each)
  • RAM: 64Gb (8 x 8 Gb)
  • DISK: 2 x 600 Gb SAS-II 15000 rpm + RAID Controller Adaptec 6405

BTW: we called the two machines buffalo and bill, and the full cluster is now bbc!

We wanted to test hadoop to see if, even in such small environment, it might become useful. Actually we had two targets in mind: the first is simply to have a local hadoop for development before production, the second was to test if it might solve some problems which are normally, in other codes, limited by the RAM. The previous post about sorting cats was running on this small cluster.

We tried to install hadoop on our two nodes but we found pretty soon problems with a satisfactory configuration. We realized that it would be difficult to control the resources in particular if hadoop had to share the nodes with other jobs (EDITED: Namely: You need dedicated machines). So the idea was to create a virtual cluster dedicated only to hadoop.

I found this combination comfortable: libvirtd/qemu/KVM. I used in the past VMware and VirtualBOX but libvirtd seems easier and straightforward to use for our purpose. You can just type virt-manager and the rest goes very smooth.

The first thing to do is to define a first virtual machine which we will clone to create the cluster. You can download one qcow2 CentOS image from the internet including cloudera. I did play a bit with Oz-Image-Buid. Getting a ready qcow2 file saves you the installation time, in particular choosing the main packages to install.

At the creation of a virtual machine, virt-manager does a lot for you: defining the network and changing on the fly the iptables. In my case I use a system connection, you can read more on the official pages from libvirt. If you right click on the connection you can configure some parameters. In particular this is the virtual network tab:

In my case the subnetwork is

Once you have your starting qcow2 disk image you can add it using virt-manager and start it right away. In this case you use the qcow2 image as main disk so set it as a boot disk in with virt-manager. Start the virtual machine, click on the console tab and login as root with the given password.

The first thing I did was to set up the network. (In the console of the virtual machine!) edit these files:

  • /etc/sysconfig/network
  • /etc/sysconfig/network-scripts/ifcfg-eth0
  • /etc/hosts

In my case they look like this:

[mariotti@master01 ~]$ cat /etc/sysconfig/network

[user@master01 ~]$ cat /etc/sysconfig/network-scripts/ifcfg-eth0

[user@master01 ~]$ cat /etc/hosts localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6 master00.hddomain master00 master01.hddomain master01 node02.hddomain node02 node03.hddomain node03
# bnode02.bhddomain bnode02 bnode03.bhddomain bnode03 bnode04.bhddomain bnode04 bnode05.bhddomain bnode05

The second network data in the /etc/hosts files will become clear later. This is a minimum networking. Verify that a line similar to this:


is NOT present in  /etc/sysconfig/network-scripts/ifcfg-eth0.

You also need to allow passwordless ssh. So just create the authorized_keys file:

ssh-keygen (# type few times the enter key)
cp .ssh/id_rsa.pub .ssh/authorized_keys

Type: shutdown -r now. It is nice to see that a virtual machine has a very low hardware latency and will boot in few seconds.

IMPORTANT: Now you have the new virtual machine running and most probably with a default NAT type of network. This is the right time to install all the software you need. For example in my case I needed python with numpy and scipy which are not on the default repositories. In my case (I installed the software after closing the network) I had to install by hand all the RPMs by coping the rpm files via few machines. The target in our case was hadoop with some streaming capabilities using python. In other words: think about all your needs on the cluster at this stage or it becomes time consuming if you have to do it on a later stage.

Now it become a bit tricky. I did a totally different procedure with trials and errors but I will try to give you a simple version. The virtual networking seems to be fast because it is implemented with some direct kernel hooks. You will need to document yourself on this. Having in mind hadoop I was checking about virtual disks performances, it seems that a raw virtual disk format has the best performance. The problem of the raw format is that the disk space is fully allocated in advance in the host machine. It means that if I allocate hadoop disks, I need to reserve disk space in advance on the host even if I do not use it. We agreed that this might be a good compromise, in particular we have an external file server and the disk on the bbc cluster is dedicated to computation.

For this purpose you will add a second disk on the virtual machine defined as raw disk and without any cashing, you can do this quickly again via the virt-manager (we defined it as a 10Gb disk). You will configure the disk to be mounted on /hadoop. After adding the disk and rebooting, as root on the virtual machine, create the directory /hadoop (mkdir /hadoop), create the filesystem (mkfs.ext4 /dev/vdb1) and add this line in the file /etc/fstab:

/dev/vdb1               /hadoop                 ext4    defaults        0 0

This mount point will be used by cloudera by default, so pretty convenient. Last things to do are to verify that selinux is disabled (SELINUX=disabled in /etc/selinux/config) and to switch off the firewall. This of course if you plan to close completely the network. Shutdown the virtual machine.

At this point you have a configured virtual machine and two files in the images directory of libvirt:

[user@buffalo ~]$ ls -al /var/lib/libvirt/images/
total [...]
drwxr-xr-x 3 root libvirt 4096 Oct 2 12:35 .
drwxr-xr-x 3 root root 4096 Sep 24 14:28 ..
-rw-r--r-- 1 qemu qemu 3843686400 Oct 30 06:35 centos60_x86_64-01master.qcow2
-rw------- 1 root root 10737418240 Sep 26 16:07 hd01master.img

We now create the other nodes. Symply copy the disks:

cp centos60_x86_64-01master.qcow2 centos60_x86_64-02node.qcow2
cp centos60_x86_64-01master.qcow2 centos60_x86_64-03node.qcow2
cp centos60_x86_64-01master.qcow2 centos60_x86_64-04node.qcow2
cp  hd01master.img  hd02node.img
cp  hd01master.img  hd03node.img
cp  hd01master.img  hd04node.img

You can of course use your naming conventions. Again via virt-manager create the (in this case three) virtual machines using the newly “created” disks (the qcow2 and the img). When you add the disks do it in order: first the qcow2 then the img or they might get the device name switched (/dev/vdb1 instead of /dev/vda1 for example for the qcow2 disk). Boot each node, change the network files (/etc/sysconfig/network, /etc/sysconfig/network-scripts/ifcfg-eth0) as appropriate. For example node02 will have IP: and FQDN:node02.hddomain. Copy also the hosts file (after a network restart): scp /etc/hosts. To be sure everything is ok reboot the cluster and verify the network (just ssh around 😉 ).

At this point we are ready to install the cloudera hadoop distribution. I used the cloudera manager and simply followed the instructions for the CDH4 version. The actual installation is just like for a normal cluster and I will not report details, I can add the it went on without any problem. I saved now, as a template, the two disks (qcow2 and img) of the a node with a different name for later use.

At this point you might consider to close the network, you might want to change the virtual network definition. In my case it was enough to route the virtual network to the internal interface of the bbc cluster (i.e. the direct connection between our two nodes which has no way out to the internet!). See the picture above, where I defined a new virtual network routed to eth1.

We have a second machine: bill. The idea was to use also part of this machine for hadoop. I created a second connection within my usual virt-manager named qemu+ssh://root@bill/system. The node creation is totally equivalent as described above but I changed the network to be in the subnetwork. I guess you figure out how. The main problem was now to “connect” the two virtual network between the two nodes. I played a bit with the routing tables and this is how it looks like:

[user@buffalo ~]$ route
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface     *        U     0      0        0 eth0        *        U     0      0        0 eth1   *        U     0      0        0 virbr0   bill.clusterdom   UG    0      0        0 eth1
link-local      *          U     1002   0        0 eth0
link-local      *          U     1003   0        0 eth1
default         UG    0      0        0 eth0

The two “real” nodes buffalo and bill have two network cards: eth0 attached to our office network and eth1 for the internal communication between the two nodes. From this routing table we see that we route the network to the virtual network bridge device virbr0 and we route the network (for the nodes on bill) to the internal (between real nodes) eth1 device. On the bill machine we have an equivalente routing table:

[user@bill ~]$ route
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface     *        U     0      0        0 eth0        *        U     0      0        0 eth1   buffalo.cluster   UG    0      0        0 eth1   *        U     0      0        0 virbr1
link-local      *          U     1002   0        0 eth0
link-local      *          U     1003   0        0 eth1
default         UG    0      0        0 eth0

Here the two subnetworks have an exchanged routing: routes to the eth1 and routes to the virbr1. You can now test your distributed virtual cluster: again ssh around! If everything is working you can then move to the cloudera manager interface and add the additional nodes.

In our case we have now a virtual cluster with 1 master node and 10 computing nodes. The examples in my previous post sorting cats.. were running on this cluster. The last things I did were to add the users (it is required only on the master node) and to mount via NFS the real cluster /home and /data directories for an easy access to the data and users personal files.

The advantages for us are:

  • We can test our codes in an environment which mimics a bit better a real production environment.
  • We can control also a bit better the resources used by hadoop. Consider that beside adding virtual nodes to the cluster it is also very easy to change the number of CPUs and the RAM available to each node. This makes this little cluster pretty dynamic! (thought, you still need to reboot the nodes for this operation).
  • We can take the advantages of the hadoop framework even for smaller tasks, as we have seen, for example in sorting relatively large files, including the copying overload, we can be considerably faster then quickly tring on the command line.

To test next

How task now is, beside working time to time ;), to test if we can take advantage of hadoop for large files which typically will not fit our relatively small RAM. It is of course an algorithm design problem but we hope that this framework helps in this process. Which of course was already partly demonstrated in the sorting example…

Not to forget: it will be probably needed a bit of fine tuning for the performances..

Help for this post needed

In this post I gave a description for the installation which does not accurately matches the procedure I actually followed. The reason is simple: I had to do few trial and tests for the configuration. So there is the possibility that some steps are not accurately described or a bit unclear. Please if you find mistakes or not too clear procedures, let me know by dropping a comment. Of course questions and simple clarifications request are welcome and I will be happy to update and answer!

[This post will also appear on our official group blog.]

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.