What Just Broke?
Part of Day One
This is part of Day One: Python for Platform Engineers.
QA says the login flow is broken. You ssh to the app server and tail -f the log — it's moving fast, there are hundreds of ERROR lines, and you can't tell what's actually happening versus noise. grep ERROR app.log | head -20 gives you 20 lines that could be anything.
You need to understand the failure, not just find it. That's the gap between grep and Python.
The Concept: Finding vs. Understanding
```mermaid
flowchart TD
subgraph G [The grep Way: Finding]
G1([Log File]) --> G2[grep ERROR]
G2 --> G3([List of Lines])
end
subgraph P [The Python Way: Understanding]
P1([Log File]) --> P2{Parse & Filter}
P2 --> P3[Count by Component]
P2 --> P4[Find Time Window]
P2 --> P5[Calculate Error Rate]
P3 & P4 & P5 --> P6([Actionable Summary])
end
style G1 fill:#1a202c,stroke:#cbd5e0,stroke-width:2px,color:#fff
style G2 fill:#c53030,stroke:#cbd5e0,stroke-width:2px,color:#fff
style G3 fill:#2d3748,stroke:#cbd5e0,stroke-width:2px,color:#fff
style P1 fill:#1a202c,stroke:#cbd5e0,stroke-width:2px,color:#fff
style P2 fill:#4a5568,stroke:#cbd5e0,stroke-width:2px,color:#fff
style P3 fill:#2d3748,stroke:#cbd5e0,stroke-width:2px,color:#fff
style P4 fill:#2d3748,stroke:#cbd5e0,stroke-width:2px,color:#fff
style P5 fill:#2d3748,stroke:#cbd5e0,stroke-width:2px,color:#fff
style P6 fill:#2f855a,stroke:#cbd5e0,stroke-width:2px,color:#fff
style G fill:transparent,stroke:#cbd5e0,stroke-dasharray: 5 5,color:#fff
style P fill:transparent,stroke:#cbd5e0,stroke-dasharray: 5 5,color:#fff
```
What grep Does Well (and Where It Stops)
grep finds every line containing "ERROR". That's useful. But it doesn't answer:
- How many errors are there versus total log volume?
- Which component is producing them?
- When did they start? Did the count spike at a specific time?
- Is it one recurring error or twenty different ones?
To answer those questions in bash you're writing awk, accumulating into temp files, or running five separate commands. Python does it in one script you can keep and reuse.
Start Simple: Count the Errors
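A minimal sketch of that first script. The filename `app.log` and the literal string `ERROR` are assumptions — adjust both for your log:

```python
def count_errors(path):
    """Count ERROR lines versus total lines, reading one line at a time."""
    total = errors = 0
    with open(path) as f:      # closed automatically when the block exits
        for line in f:         # streams the file, safe for multi-GB logs
            total += 1
            if "ERROR" in line:
                errors += 1
    return errors, total

# Usage:
#   errors, total = count_errors("app.log")
#   print(f"{errors} errors in {total} lines ({errors / total:.1%})")
```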
- `with open(...)` automatically closes the file when the block exits — no need to call `f.close()`.
- Memory efficient: we iterate over the file one line at a time. This works on a 1KB file or a 10GB file without crashing your server.
Scaling to Giant Logs
Many tutorials show `f.readlines()`, which reads the entire file into memory at once. For an SRE, this is a trap: a 5GB production log will get your script killed by the kernel's OOM (out-of-memory) killer on the app server. Always use `for line in f:` to process logs safely.
Running it prints one number: the error count and its rate as a percentage of total log lines.
That one number already tells you something. 6.9% error rate on login is a problem. 0.01% might be acceptable noise.
Find Which Component Is Failing
Most application logs follow a pattern. A common one:
```text
2024-01-15 14:23:01 ERROR [AuthService] Token validation failed for user 4821
2024-01-15 14:23:01 INFO [RequestRouter] GET /api/login 401
2024-01-15 14:23:02 ERROR [AuthService] Token validation failed for user 9334
2024-01-15 14:23:02 ERROR [DatabasePool] Connection timeout after 5000ms
```
The component name is in brackets. Pull it out:
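A sketch of the extraction, assuming the bracketed format above (the filename is again a placeholder):

```python
from collections import Counter

def count_by_component(path):
    """Tally ERROR lines by the [Component] name embedded in each line."""
    counts = Counter()
    with open(path) as f:
        for line in f:
            if "ERROR" not in line or "[" not in line:
                continue  # skip non-errors and lines without a [Component]
            component = line.split("[")[1].split("]")[0]
            counts[component] += 1
    return counts

# Usage:
#   for component, n in count_by_component("app.log").most_common():
#       print(f"{n:6d}  {component}")
```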
- `Counter` is a dictionary that counts things. `Counter()` starts empty; `counter[key] += 1` increments the count for that key.
- `line.split("[")` splits the string on `[`, giving a list. `[1]` takes the second element (the part after the first `[`). `.split("]")[0]` then takes the part before `]`. Brittle if the format changes, but effective for a consistent log format.
- `most_common()` returns `(key, count)` pairs sorted by count descending — the worst offender is first.
The output ranks components by error count, worst first.
Now you know: it's almost entirely AuthService. The database pool errors are probably downstream. You know where to look.
Find When the Errors Started
If your log lines start with a timestamp, you can find the first and last error window:
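One way to sketch it, relying on the fixed-width 19-character `YYYY-MM-DD HH:MM:SS` prefix (an assumption about your format):

```python
def error_window(path):
    """Return the timestamps of the first and last ERROR lines."""
    first = last = None
    with open(path) as f:
        for line in f:
            if "ERROR" in line:
                ts = line[:19]      # fixed-width timestamp prefix
                if first is None:
                    first = ts
                last = ts
    return first, last

# Usage:
#   first, last = error_window("app.log")
#   print(f"errors span {first} -> {last}")
```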
Cross that timestamp against your deploy log and you know immediately whether the errors started with the deploy.
Putting It Together
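A sketch that combines the three pieces — error rate, per-component counts, and the time window. The format assumptions (the `ERROR` keyword, `[Component]` brackets, a 19-character timestamp prefix) are the same as above:

```python
#!/usr/bin/env python3
"""Summarize ERROR lines in a structured log file."""
import sys
from collections import Counter

def summarize(path):
    """Return (total, errors, component_counts, first_ts, last_ts)."""
    total = errors = 0
    components = Counter()
    first = last = None
    with open(path) as f:
        for line in f:
            total += 1
            if "ERROR" not in line:
                continue
            errors += 1
            ts = line[:19]                  # fixed-width timestamp prefix
            if first is None:
                first = ts
            last = ts
            if "[" in line and "]" in line:
                components[line.split("[")[1].split("]")[0]] += 1
    return total, errors, components, first, last

def main(path):
    total, errors, components, first, last = summarize(path)
    rate = errors / total if total else 0.0
    print(f"{errors}/{total} lines are errors ({rate:.1%})")
    print(f"window: {first} -> {last}")
    for name, count in components.most_common(5):
        print(f"{count:6d}  {name}")

if __name__ == "__main__" and len(sys.argv) > 1:   # e.g. python logsum.py app.log
    main(sys.argv[1])
```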
This is a script you keep. Add it to your dotfiles or internal tooling repo. Run it on any structured log file.
Different log format?
Not every log uses [ComponentName]. If yours uses key=value pairs instead:
```python
# Format: "2024-01-15T14:23:01Z level=ERROR component=AuthService message=..."
parts = dict(p.split("=", 1) for p in line.split() if "=" in p)
component = parts.get("component", "unknown")
```
The extraction changes; the rest of the script stays the same.
Practice Exercises
Exercise 1: Find the most common error message
Instead of grouping by component, group by the error message text. Use the same Counter approach, but extract the part of the line after the component name.
Answer
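One possible answer, under the same format assumptions: take everything after the closing bracket as the message.

```python
from collections import Counter

def count_error_messages(path):
    """Tally ERROR lines by message text (everything after the ])."""
    messages = Counter()
    with open(path) as f:
        for line in f:
            if "ERROR" not in line or "]" not in line:
                continue
            message = line.split("]", 1)[1].strip()
            messages[message] += 1
    return messages
```

One caveat: if messages embed unique values like user IDs, every message counts as distinct — you may want to strip or normalize those first.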
Exercise 2: Filter by time range
Modify the script to only count errors that occurred after a specific timestamp. For example: only errors after 2024-01-15 14:00:00.
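A possible answer: because the timestamp prefix is fixed-width and sorts lexicographically, a plain string comparison works — no datetime parsing needed. The cutoff value below is just an example.

```python
def count_errors_after(path, cutoff):
    """Count ERROR lines whose 19-char timestamp prefix is at or after cutoff."""
    errors = 0
    with open(path) as f:
        for line in f:
            if "ERROR" in line and line[:19] >= cutoff:
                errors += 1
    return errors

# Usage:
#   count_errors_after("app.log", "2024-01-15 14:00:00")
```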
Quick Recap
| Concept | What It Does |
|---|---|
| `open(file)` + `for line in f:` | Read file one line at a time (memory efficient) |
| `line[:19]` | Extract fixed-width timestamp from start of line |
| `Counter` | Count occurrences of any key |
| `most_common()` | Return items sorted by count, highest first |
| `line.split("[")[1].split("]")[0]` | Extract text between brackets |
What's Next
- Did the Config Change? — When you suspect the error is configuration drift, not application code
Further Reading
Official Documentation
- `collections.Counter` — the counting dictionary
- Python file I/O — `open()`, `readlines()`, the `with` statement
Deep Dives
- Regular expressions in Python — For parsing log formats that aren't cleanly delimited by brackets or spaces (covered in depth in the Efficiency section)
Exploring Linux
- Reading Logs — the Linux side: `journalctl`, `tail -f`, and navigating log files from the command line
- grep — when your log parsing can stay in bash, `grep` is what you reach for first