Yuhang He's Blog

Some birds are not meant to be caged, their feathers are just too bright.

Delve deeper into AWK

As a deep learning researcher, I often have to process files in ways that come with intricate requirements.

awk is usually the tool I turn to, but it is too complicated to master in a short time. So I am recording some error-prone awk usages here.

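  • The general form of an awk command is as follows: -F sets the field separator, the BEGIN block runs once before any input is read, the middle block runs for every input line, and the END block runs once after the last line.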
awk -F 'xxx' 'BEGIN {print "xxx"} {print "xxx"} END {print "xxx"}' input_file
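To make the three blocks concrete, here is a minimal sketch on a made-up file. The file name scores.csv and its contents are assumptions for illustration only.

```shell
# Hypothetical sample data: comma-separated "name,score" lines.
tmpdir=$(mktemp -d)
printf 'alice,90\nbob,85\n' > "$tmpdir/scores.csv"

# BEGIN runs once before any input is read, the middle block runs once
# per input line, and END runs once after the last line.
result=$(awk -F ',' 'BEGIN {print "start"} {sum += $2; print $1} END {print "total", sum}' "$tmpdir/scores.csv")
echo "$result"

rm -rf "$tmpdir"
```

The command prints "start", then the first field of every line, then the sum of the second field.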
  • Using awk to extract the lines that contain a keyword. Note that the keyword, which awk treats as a regular expression, must be enclosed in slashes (/).
awk '/key_words/{print xxx }' input_file
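As a small sketch of the pattern form, the following keeps only the lines that mention "cat" and prints their second field. The file labels.txt and its contents are hypothetical.

```shell
# Hypothetical label file: "<class> <image>" per line.
tmpdir=$(mktemp -d)
printf 'cat 1.jpg\ndog 2.jpg\ncat 3.jpg\n' > "$tmpdir/labels.txt"

# The action runs only on lines matching the regular expression /cat/.
matched=$(awk '/cat/ {print $2}' "$tmpdir/labels.txt")
echo "$matched"

rm -rf "$tmpdir"
```

Since the pattern is a regular expression, it matches anywhere in the line; to test a single field, a form such as $1 == "cat" would be needed instead.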
  • Tricks of the trade: combining FNR and NR. Both FNR and NR record the number of the line awk is currently reading. The difference is that FNR counts lines within the current file, so it resets to 1 whenever a new file starts, while NR keeps counting across all input files. One useful application is stitching or splitting two files. An obvious example:
file1:   name 12345
file2:   12345 salary
result:  name 12345 salary

file1 and file2 are the two original files, and we want to create the third file (result), which combines the first two. To achieve this, FNR and NR come to help:

awk -F ' ' 'NR==FNR{a[$2]=$0; next}{print a[$1]" "$2}' file1 file2

where NR==FNR is true only while file1 is being read. During that pass, the array a is built with $2 of each file1 line as the index and the whole line ($0) as the value, and next skips the second block. Once NR==FNR no longer holds, that is, while file2 is being read, {print a[$1]" "$2} is executed, which outputs the expected file.
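The sketch below first prints NR and FNR side by side so the reset is visible, then runs the same join on two small files. The file contents are made up for illustration.

```shell
# Hypothetical inputs: file1 holds "name id", file2 holds "id salary".
tmpdir=$(mktemp -d)
printf 'alice 12345\nbob 67890\n' > "$tmpdir/file1"
printf '12345 5000\n67890 6000\n' > "$tmpdir/file2"

# NR keeps counting across files, FNR restarts at 1 for file2.
counters=$(awk '{print NR, FNR}' "$tmpdir/file1" "$tmpdir/file2")
echo "$counters"

# First pass (NR==FNR): index each file1 line by its id.
# Second pass: look up the id from file2 and append the salary.
joined=$(awk 'NR==FNR {a[$2]=$0; next} {print a[$1]" "$2}' "$tmpdir/file1" "$tmpdir/file2")
echo "$joined"

rm -rf "$tmpdir"
```

One caveat: if file1 is empty, NR==FNR also holds while file2 is being read, so the lookup block never runs.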

  • Combining awk and eval to dispatch a dataset. For example, the file img_list.txt lists a large number of image paths, one per line, and we have created empty directories named 1, 2, …, n (the cp targets must be directories; if they were regular files, each copy would overwrite the previous one). Our goal is to copy the images into these directories in batches of 10,000. The relevant shell script can be written as:
awk 'BEGIN{cnt=0}{if((NR-1)%10000==0){cnt+=1} print "cp",$1,cnt}' img_list.txt |
while read -r line; do eval "$line"; done
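A variant without eval is sketched below, in case the image paths might contain characters the shell would interpret. It assumes paths without whitespace and uses a batch size of 2 on five empty stand-in files so that the result is easy to inspect. The variable names batch, img and dir are illustrative.

```shell
# Hypothetical setup: five empty stand-in "images" and a list of their paths.
tmpdir=$(mktemp -d)
for i in 1 2 3 4 5; do
    : > "$tmpdir/img$i.jpg"
    echo "$tmpdir/img$i.jpg"
done > "$tmpdir/img_list.txt"

batch=2   # the text above uses 10000; kept small here for the demo

# awk emits "<path> <batch number>"; the loop copies each file directly,
# so no generated command line has to go through eval.
awk -v n="$batch" '{print $1, int((NR-1)/n) + 1}' "$tmpdir/img_list.txt" |
while read -r img dir; do
    mkdir -p "$tmpdir/$dir"
    cp "$img" "$tmpdir/$dir/"
done

# List what ended up in each batch directory.
layout=$(cd "$tmpdir" && find 1 2 3 -type f | sort)
echo "$layout"

rm -rf "$tmpdir"
```

Here int((NR-1)/n) + 1 yields the same batch number as the counter in the one-liner above, and mkdir -p creates any target directory that does not exist yet.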
  • Prepending the line number to every line of a file.
awk '$0 = NR" "$0' file
  • Finding the lines that appear in both of two files.
awk 'FNR==NR {a[$0];next} $0 in a' file1 file2
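A short sketch of this idiom on two made-up lists follows, together with the negated test, which gives the lines of file2 that are missing from file1. The file contents and the array name seen are illustrative.

```shell
# Hypothetical inputs: two lists of file names.
tmpdir=$(mktemp -d)
printf 'a.jpg\nb.jpg\nc.jpg\n' > "$tmpdir/file1"
printf 'b.jpg\nc.jpg\nd.jpg\n' > "$tmpdir/file2"

# First pass: remember every line of file1 as an array key.
# Second pass: print the lines of file2 that are among those keys.
common=$(awk 'FNR==NR {seen[$0]; next} $0 in seen' "$tmpdir/file1" "$tmpdir/file2")
echo "$common"

# Negating the test keeps the lines found only in file2.
only_in_file2=$(awk 'FNR==NR {seen[$0]; next} !($0 in seen)' "$tmpdir/file1" "$tmpdir/file2")
echo "$only_in_file2"

rm -rf "$tmpdir"
```

Merely referencing seen[$0] creates the key, so no value needs to be assigned. The output follows the line order of file2.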