Some unix commands I want to remember and an example from recsys
There are some simple Unix commands that are pretty useful but that from time to time I forget about. I often use them when handling large data sets, and I am always surprised how a good pipe can save time and resources, making it possible to handle large amounts of data with limited resources. It can make the difference between having an answer or not.
Here is the quick list:
nl, xargs, paste, cut, shuf, rename, tee, wc
There are definitely many others, but I forgot about them 😉
Here I just do a very small, simple tutorial on using some of these commands. I will use examples of typical problems in recommender systems, where I usually deal with users, items and ratings.
I tend to program what I need myself, but keeping these Unix commands in mind helps me use what already exists. For example, awk is almost my second shell, but most of the time a plain Unix command already has the solution. But let's go back to our commands. This is a little extract from their man pages:
xargs - build and execute command lines from standard input: a long SYNOPSIS...
rename - rename files: rename [options] expression replacement file...
cut - remove sections from each line of files: cut OPTION... [FILE]...
wc - print newline, word, and byte counts for each file: wc [OPTION]... [FILE]...
shuf - generate random permutations: shuf [OPTION]... [FILE]
tee - read from standard input and write to standard output and files: tee [OPTION]... [FILE]...
nl - number lines of files: nl [OPTION]... [FILE]...
paste - merge lines of files: paste [OPTION]... [FILE]...
I used to substitute the xargs and rename commands with small shell scripts using a for or foreach loop, or with complex awk scripts. This is often a waste of my mental resources 😉. These commands can be included almost unchanged in any type of shell script, be it sh, bash, csh, or ksh. Again, I use awk a lot instead of cut. See also this forum discussion for alternatives. Another well-known command is wc and its options. But the most useful ones, for which a substitution might be difficult, are the last three: tee, nl, paste.
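As a quick taste of that first point (a fuller treatment of xargs and rename is left for the future post mentioned at the end), here is a minimal sketch of replacing a loop with xargs; the .dat files to gzip are just a made-up example:
~> # instead of a loop such as: for f in *.dat; do gzip "$f"; done
~> find . -name '*.dat' -print0 | xargs -0 gzip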
But let's see some examples. The files I had to deal with in the last few months were often ASCII files containing a user_ID, an item_ID and a rating_VALUE on each line. In some cases there could be millions of lines.
The first couple of things to do is to verify the data integrity and structure (note: these are just simple examples to put the Unix commands to use, not a guide to data integrity checks). The data set I use is the MovieLens data from the grouplens.org research lab. Let's start.
Let's look at the file:
~> head -4 u.data
196 242 3 881250949
186 302 3 891717742
22 377 1 878887116
244 51 2 880606923
The file contains a timestamp in column 4 in addition to user_ID, item_ID and rating_VALUE.
Count the number of ratings:
~> wc -l u.data
100000
~> wc u.data
100000 400000 1979173 u.data
All four fields are present in the file: the word count (400000) is exactly four times the line count (100000).
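A stricter check is to verify that every single line has exactly four fields; a small sketch with awk:
~> awk 'NF != 4' u.data | wc -l
# awk prints the lines whose field count differs from 4; a count of 0 means there are none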
Verify that there are no duplicated ratings (i.e. no user has multiple ratings for the same item):
~> cat u.data | awk '{print $1,$2;}' | sort -n -k 1,1 -k 2,2 | uniq -c | sort -n -k 1,1 -k 2,2 -k 3,3 | tail -4
1 943 1074
1 943 1188
1 943 1228
1 943 1330
The last lines must have a 1 as the first element. If it is different from 1, it means that there are duplicated ratings. This simple pipe first selects only the first two fields (user_ID, item_ID) with awk, then uses sort to sort numerically (-n) by field 1 and then by field 2 (-k 1,1 -k 2,2). uniq is used to count (-c) the duplicated lines. Because we expect only one rating per user/item pair, all the lines should start with 1. The second sort moves any non-unique lines to the bottom of the output. The by-field sorting (-k 1,1 -k 2,2 -k 3,3) is redundant but gives a better-looking output ;). Probably a "grep -v '^ *1'" would be faster, but here I just want to give plain examples.
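For completeness, a sketch of that grep variant (assuming the usual uniq -c output, where the count is right-aligned and followed by a space; the trailing space in the pattern matters, otherwise counts like 10 or 11 would also be filtered out):
~> cat u.data | awk '{print $1,$2;}' | sort -n -k 1,1 -k 2,2 | uniq -c | grep -v '^ *1 ' | wc -l
# any surviving lines are duplicated user/item pairs, so a result of 0 means no duplicates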
Count the number of unique users:
~> cat u.data | awk '{print $1;}' | sort -n | uniq | wc -l
943
Get the distribution of ratings per user and plot it:
~> cat u.data | awk '{print $1;}' | sort -n | uniq -c | awk '{print $1;}' | sort -n | uniq -c | awk '{print $2,$1;}' > x.u.dist
~> ./gplot.csh x.u.dist
The plot looks like:
This shows, for a given number of ratings (x axis), how many users gave that many ratings (y axis). The gplot.csh command is described in this previous post.
The command starts by extracting the user_IDs from the rating list and counting the number of ratings for each user (the first uniq -c); the rest of the pipe determines the frequency of each of these counts.
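To see what each stage produces, the pipe can be run in pieces; a sketch (output omitted):
~> cat u.data | awk '{print $1;}' | sort -n | uniq -c | head -3
# one line per user: "number_of_ratings user_ID"
~> cat u.data | awk '{print $1;}' | sort -n | uniq -c | awk '{print $1;}' | sort -n | uniq -c | head -3
# one line per distinct count: "number_of_users number_of_ratings"
The final awk simply swaps the two columns so that the number of ratings ends up in the first (x) column for plotting.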
For the items we can proceed with the same commands; we just have to use $2 instead of $1 in the first awk command.
~> cat u.data | awk '{print $2;}' | sort -n | uniq | wc -l
1682
Get the distribution of ratings per item and plot it:
~> cat u.data | awk '{print $2;}' | sort -n | uniq -c | awk '{print $1;}' | sort -n | uniq -c | awk '{print $2,$1;}' > x.i.dist
~> ./gplot.csh x.i.dist
To see the use of the shuf and nl commands, we can consider, for example, the problem of anonymizing or indexing the user_IDs or the item_IDs. This command, for example:
~> cat u.data | awk '{print $1;}' | sort -n | uniq | nl > x.u.index
gives an index table (user_INDEX, user_ID) where user_INDEX is guaranteed to be sequential and continuous. With the MovieLens file this is probably redundant, but consider these lines:
~> cat u.data | shuf | head -10000 > x.10kset
~> cat x.10kset | awk '{print $1;}' | sort -n | uniq | wc -l
921
~> cat x.10kset | awk '{print $2;}' | sort -n | uniq | wc -l
1237
~> cat x.10kset | awk '{print $1;}' | sort -n | uniq | nl > x.10kset.index
~> tail -4 x.10kset.index
918 940
919 941
920 942
921 943
In this case we work on a subset of 10000 ratings, and the user_IDs (and item_IDs) are no longer sequential in the file. Please note: while shuf is handy, for scientifically consistent random sampling I would consider its usage carefully.
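To actually rewrite the subset with the new sequential indices, the index table can be joined back onto the ratings with awk; a minimal sketch (x.10kset.anon is just a hypothetical output name):
~> awk 'NR==FNR {idx[$2]=$1; next} {print idx[$1], $2, $3, $4;}' x.10kset.index x.10kset > x.10kset.anon
# the first file (NR==FNR) loads the user_ID -> user_INDEX map,
# the second pass prints each rating with the index in place of the original user_ID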
I will leave space in a future post for the use of rename, cut, paste and xargs.
I hope this little list of Unix commands with examples comes in handy. It is already working for me: now it will be more difficult to forget about them 😉
Appendix
Switches note. For portability, I suggest verifying the availability of the commands' switches across different Unix flavours and versions.
sort notes. Modern sort implementations have additional options which might help. For example, as Ole also reports in a comment on this post, sort has a --parallel switch to sort in parallel. More related to this post, you can also consider the -u (--unique) switch.
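For instance, a sketch combining these switches to count the unique users (GNU coreutils sort; availability of the switches may vary):
~> cat u.data | awk '{print $1;}' | sort -n -u --parallel=4 | wc -l
# -u drops the separate uniq step, --parallel=4 lets sort use up to 4 threads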