Command Line Magic: Advanced Text Processing with awk and sed
While Python and Ruby are excellent for complex text processing, nothing beats the speed and ubiquity of awk and sed. These tools are installed on virtually every Unix-like system, and mastering them lets you parse gigabytes of log files and transform data on the fly.
sed: The Stream Editor
sed is designed to parse and transform text. It reads text line by line, applies a set of rules, and outputs the result.
1. Simple Substitutions
The most common use of sed is finding and replacing text using regular expressions.
sed 's/ERROR/WARNING/g' server.log
The s command stands for substitute: ERROR is the pattern to find, WARNING is the replacement, and the trailing g flag means global (replace every occurrence on a line, not just the first).
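A handy detail: the delimiter after s does not have to be a slash. sed accepts almost any character there, which keeps substitutions involving file paths readable (the paths below are just illustrative):
sed 's|/var/log/old|/var/log/new|g' server.log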
2. Deleting Lines
You can easily delete lines matching a specific pattern. To delete all lines containing the word “debug”:
sed '/debug/d' application.log
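Two related idioms are worth knowing: the address /^$/ matches blank lines, and appending ! inverts an address. The first command below strips blank lines; the second deletes everything except lines containing ERROR:
sed '/^$/d' application.log
sed '/ERROR/!d' application.log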
3. In-Place Editing
By default, sed writes to standard output. To modify the file directly, use the -i (in-place) flag. Note that . is a regex metacharacter that matches any character, so literal dots in the pattern should be escaped:
sed -i 's/127\.0\.0\.1/0.0.0.0/g' /etc/nginx/nginx.conf
Tip: Use sed -i.bak to automatically create a backup file before making changes.
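For example, assuming the same nginx.conf edit as above, this leaves the original behind as nginx.conf.bak:
sed -i.bak 's/127\.0\.0\.1/0.0.0.0/g' /etc/nginx/nginx.conf
(One portability caveat: on BSD/macOS sed the suffix argument to -i is mandatory, so use -i '' to edit in place without a backup.)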
4. Printing Specific Line Ranges
To extract lines 100 through 150 from a massive log file without opening it in less:
sed -n '100,150p' huge_log.txt
(The -n flag suppresses sed's default output, and p explicitly prints the addressed lines.)
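On a genuinely huge file, it is worth adding a q (quit) command so sed stops reading once the range is done instead of scanning the remaining gigabytes:
sed -n '100,150p; 150q' huge_log.txt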
awk: The Data Extraction Language
While sed is great for altering text, awk is a full programming language specifically designed for processing columnar data and extracting fields.
By default, awk splits every line into fields based on whitespace.
$0 represents the entire line.
$1 represents the first column, $2 the second, and so on.
NF represents the Number of Fields (columns) in the line.
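A quick way to see these variables in action (the echo input here is a throwaway example):
echo "alpha beta gamma" | awk '{print NF, $1, $NF}'
This prints 3 alpha gamma: the field count, the first field, and (via $NF) the last field. NR, the current record (line) number, is another built-in you will reach for often.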
1. Extracting Columns
Print only the User (1st column) and Home Directory (6th column) from /etc/passwd. (We tell awk to use : as the field separator using -F):
awk -F':' '{print $1 " lives at " $6}' /etc/passwd
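On a typical Linux system, the line for the root account comes out something like this (the exact accounts and paths vary by machine):
root lives at /root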
2. Filtering by Condition
Print lines where the 3rd column (e.g., an HTTP status code) is 400 or greater:
awk '$3 >= 400 {print $0}' access.log
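In fact, {print $0} is awk's default action, so it can be dropped entirely. Conditions also combine with the usual && and || operators; for example, to keep only client errors (4xx) and skip server errors:
awk '$3 >= 400 && $3 < 500' access.log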
3. Data Aggregation
awk really shines at aggregating data. Let’s calculate the total sum of numbers in the 5th column (e.g., bytes transferred):
awk '{sum += $5} END {print "Total Bytes: " sum}' access.log
This script adds the 5th column of every line to the sum variable. The END block runs after all lines have been processed, printing the final total.
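The same idea extends to grouping with awk's associative arrays. Sticking with the assumption that the 1st column is a client IP and the 5th is bytes transferred, this totals traffic per client:
awk '{bytes[$1] += $5} END {for (ip in bytes) print ip, bytes[ip]}' access.log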
4. Finding Unique IP Addresses
Extract the first column (IP addresses), sort them, and count occurrences using a combination of tools:
awk '{print $1}' access.log | sort | uniq -c | sort -nr
This classic one-liner produces a list of IP addresses from an Nginx access log, sorted by how many times each appears, most frequent first.
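The counting step can also be done entirely inside awk, so the full log is never sorted; only the much smaller summary is:
awk '{count[$1]++} END {for (ip in count) print count[ip], ip}' access.log | sort -nr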
Combining Them
The true UNIX philosophy is piping simple tools together, and learning sed and awk transforms the terminal from a basic interface into a powerful data analytics engine. As a parting example, you can use sed to isolate a block of time and awk to average the response times within it.
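Here is a minimal sketch, assuming log lines whose timestamps contain the hour (e.g., :10:00:) and whose last field is a response time in milliseconds; both are assumptions about your log format, so adjust the patterns and field accordingly:
sed -n '/:10:00:/,/:11:00:/p' access.log | awk '{sum += $NF; n++} END {if (n) print "Average:", sum/n, "ms"}'
(Note that a sed address range is inclusive, so the first line matching the end pattern is also passed through.)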