Parallel Grep and Awk

tl;dr

Get GNU parallel (e.g. brew install parallel, apt-get install parallel, etc.).
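
A quick sanity check that the install worked is to print the version line:

parallel --version | head -n 1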

Run grep in parallel blocks on a single file.

parallel --pipepart --block 10M -a <filename> -k grep <grep-args>
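
As a concrete instantiation (access.log and ERROR are hypothetical stand-ins for your file and pattern), here is the same idea used to count matching lines; grep -c prints one count per 10M block, so the per-block counts still need to be summed.

# access.log and ERROR are made up; sum the per-block counts with awk
parallel --pipepart --block 10M -a access.log -k grep -c ERROR | awk '{s += $1} END {print s}'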

Run grep on multiple files in parallel, in this case all files in a directory and its subdirectories. Add /dev/null to force grep to prepend the filename to the matching line.

find . -type f | xargs -n 1 -P 4 grep <grep-args> /dev/null
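
If your grep supports -H (GNU and BSD grep both do), you can skip the /dev/null trick and force the filename prefix directly:

find . -type f | xargs -n 1 -P 4 grep -H <grep-args>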

Run grep in parallel blocks on multiple files in serial. Manually prepend the filename since grep can’t do it in this case.

# for-loop
for filename in `find . -type f`
do 
  parallel --pipepart --block 10M -a $filename -k "grep <grep-args> | awk -v OFS=: '{print \"$filename\",\$0}'"
done

# using xargs
find . -type f | xargs -I filename parallel --pipepart --block 10M -a filename -k "grep <grep-args> | awk -v OFS=: '{print \"filename\",\$0}'"

Run grep in parallel blocks on multiple files in parallel. Take care to prepend the filename since grep can't do it in this case. Warning: this nests parallelism and may use your cores inefficiently.

find . -type f | xargs -n 1 -P 4 -I filename parallel --pipepart --block 10M -a filename -k "grep <grep-args> | awk -v OFS=: '{print \"filename\",\$0}'"
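
If the nested approach oversubscribes your CPU, one way to rein it in is to cap the inner parallel with its -j flag. This sketch assumes an 8-core machine, so 4 outer workers times -j 2 inner jobs keeps roughly 8 grep processes running; tune both numbers to your hardware.

find . -type f | xargs -n 1 -P 4 -I filename parallel -j 2 --pipepart --block 10M -a filename -k "grep <grep-args> | awk -v OFS=: '{print \"filename\",\$0}'"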

Parallel grep on one file

Say I want to know how many times feature “15577606” appears in the KDD CUP 2010 kddb LIBSVM machine learning benchmark training set. This is a binary classification data set containing 19 million lines (each line is a feature vector) and 30 million features – a large grep task.

wget http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/kddb.bz2
bunzip2 kddb.bz2
time grep 15577606 kddb > /dev/null

A standard grep takes me 1m24s. Grep picks out just 199 lines containing that feature.

The GNU parallel utility gives me a nice 5.6x speedup, at 15s, by splitting the file into blocks and running grep on them in separate processes.

time \
  parallel --pipepart --block 10M -a kddb -k grep 15577606 \
  > /dev/null
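
Because of the -k flag the block outputs come back in file order, so the parallel run should produce the same output as the serial grep. A quick way to check:

diff <(grep 15577606 kddb) \
     <(parallel --pipepart --block 10M -a kddb -k grep 15577606)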

Parallel feature cardinality with awk on one file

Now I'll use this to do something useful: count the occurrences of each of the $O(10^7)$ features in the training file. I'll use a map-reduce pattern. In the map phase I'll run an awk script, "feature_count.awk", with the following contents.

#!/usr/bin/awk -f 

{
  # loop over each feature but skip the label
  for (i = 2; i <= NF; i++) {
    # split the feature at the ':' character
    split($i, a, ":")
    # count the number of times the feature appears
    n[a[1]]++
  }
}
END {
  for (i in n) {
    # print out the feature and its count
    print i, n[i]
  }
}
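
Each input line is in LIBSVM format: a label followed by index:value pairs. On a made-up line like the one below, the script bumps the count for feature indices 5, 15 and 77 by one each.

# hypothetical one-line input; output order is arbitrary
echo "0 5:1 15:2 77:1" | ./feature_count.awk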

The reduce stage is an awk one-liner that sums the counts by feature. Naively, you would run it like this:

time ./feature_count.awk kddb | \
  awk '{n[$1] += $2} END {for (i in n) {print i, n[i]}}' > \
  kddb_feature_count_naive.txt

This takes me 19m50s. With parallel you could do this instead:

time parallel --pipepart --block 10M -a kddb -k ./feature_count.awk | \
  awk '{n[$1] += $2} END {for (i in n) {print i, n[i]}}' > \
  kddb_feature_count.txt

GNU parallel gives me 4m50s – a 4.1x speedup.
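
Since awk's for-in loop emits features in arbitrary order, the easiest way to confirm the parallel run agrees with the naive one is to compare the sorted outputs:

diff <(sort kddb_feature_count_naive.txt) <(sort kddb_feature_count.txt)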

What I learned about the data

A quick analysis of the feature statistics follows.

The first thing I learned: 72% of the features appear in the training set exactly once, meaning they show up in just a single feature vector. This is a red flag, because normally you'd want features to appear many times for the model to learn something generalizable from them.
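
That figure comes straight out of the count file; a one-liner along these lines computes the share of single-occurrence features:

awk '$2 == 1 {ones++} END {printf "%.0f%%\n", 100 * ones / NR}' kddb_feature_count.txt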

I’ll do a separate feature cardinality run on the test data set.

wget http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/kddb.t.bz2
bunzip2 kddb.t.bz2
./feature_count.awk kddb.t > kddb.t_feature_count.txt

There are 2,990,384 features in the test set.
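
That is just the line count of the new count file, since each line holds one distinct feature:

wc -l < kddb.t_feature_count.txt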

The union of all features in the training and test sets contains 29,890,095 distinct features.

# set union for first column of two files 
awk '
!($1 in n) {m++; n[$1]} 
END {print m}
' kddb.t_feature_count.txt kddb_feature_count.txt

But only 7% of the test set features appear in the training set.

# set intersection for first column of two files 
awk '
NR == FNR {n[$1]} 
NR > FNR && ($1 in n) {m++} 
END {print m}
' kddb.t_feature_count.txt kddb_feature_count.txt
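
To get the percentage directly, the same pass can also track the test-set total (a sketch):

# intersection as a share of the test-set features
awk '
NR == FNR {n[$1]; t++}
NR > FNR && ($1 in n) {m++}
END {printf "%.1f%%\n", 100 * m / t}
' kddb.t_feature_count.txt kddb_feature_count.txt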

Even fewer of the test set features (4%) appear more than 10 times in the training set.

# set intersection plus filter for first column of two files 
awk -v min_occurrences=10 '
NR == FNR && $2 > min_occurrences {n[$1]} 
NR > FNR && ($1 in n) {m++} 
END {print m}
' kddb_feature_count.txt kddb.t_feature_count.txt

If I were to build an ML model for this task, I would remove features that appear 10 or fewer times in the training set as a preprocessing step. Here's a quick way to generate a feature include-list.

# features that appear more than min_occurrences times
awk -v min_occurrences=10 '
$2 > min_occurrences {print $1} 
' kddb_feature_count.txt > kddb_feature_include_list.txt

That's 3,814,194 eligible features in the training set, 13% of the original dimensionality. This would bring nice speedups to model training and prediction, at little or no cost to accuracy.
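
To actually apply the include-list, a filtering pass over the training file could look like this (a sketch; kddb_filtered is a hypothetical output name):

# drop index:value pairs whose index is not in the include-list
awk '
NR == FNR {keep[$1]; next}
{
  line = $1
  for (i = 2; i <= NF; i++) {
    split($i, a, ":")
    if (a[1] in keep) line = line " " $i
  }
  print line
}
' kddb_feature_include_list.txt kddb > kddb_filtered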
