Is It Still Up?

This is part of Day One: Python for Platform Engineers.

You're deploying. Traffic is cut over to the new pods, the old ones are terminating, and you need to know the moment your API is healthy again before you proceed. Your options: sit there hitting refresh, write a curl loop in bash, or do it properly.

This is where Python earns its place. Not because curl can't poll — it can. Because you need to know why the check failed, not just that it did.


The Bash Version and Its Problem

Bash health poller
while ! curl -sf http://api.internal/health; do
  sleep 5
done
echo "API is up"

This works for interactive use. In a deployment pipeline, it has problems:

  • No timeout — it'll run forever if the API never comes back
  • No distinction between "connection refused" (server not started yet) and "HTTP 503" (server started, app not ready)
  • No failure exit code: the loop can only end in success, so the pipeline never gets a non-zero status to act on
  • No useful output about how long it waited

Here's what the Python version does at each step:

flowchart TD
    A([Start polling]) --> B[Send HTTP GET]
    B --> C{Response?}
    C -->|HTTP 200| D([✓ Healthy — continue deploy])
    C -->|HTTP 503| E[Print status, wait 5s]
    C -->|Connection refused| E
    C -->|Request timeout| E
    E --> F{Elapsed ≥ timeout?}
    F -->|No| B
    F -->|Yes| G([✗ Did not recover — exit 1])

    style A fill:#1a202c,stroke:#cbd5e0,stroke-width:2px,color:#fff
    style B fill:#2d3748,stroke:#cbd5e0,stroke-width:2px,color:#fff
    style C fill:#4a5568,stroke:#cbd5e0,stroke-width:2px,color:#fff
    style D fill:#2f855a,stroke:#cbd5e0,stroke-width:2px,color:#fff
    style E fill:#2d3748,stroke:#cbd5e0,stroke-width:2px,color:#fff
    style F fill:#4a5568,stroke:#cbd5e0,stroke-width:2px,color:#fff
    style G fill:#c53030,stroke:#cbd5e0,stroke-width:2px,color:#fff

The Python Version

health_check.py
import requests
import time
import sys


def wait_for_health(url, timeout=120, interval=5):
    """Poll url until it returns HTTP 200 or timeout expires.

    Returns True if healthy, False if timed out.
    """
    start = time.monotonic()  # monotonic clock: unaffected by system clock changes

    while True:
        elapsed = time.monotonic() - start

        if elapsed >= timeout:
            return False

        try:
            resp = requests.get(url, timeout=3)  # (1)!
            if resp.status_code == 200:
                print(f"✓ {url} is healthy ({elapsed:.0f}s)")
                return True
            else:
                print(f"  HTTP {resp.status_code} ({elapsed:.0f}s / {timeout}s)")

        except requests.exceptions.ConnectionError:
            print(f"  Connection refused ({elapsed:.0f}s / {timeout}s)")  # (2)!

        except requests.exceptions.Timeout:
            print(f"  No response within 3s ({elapsed:.0f}s / {timeout}s)")  # (3)!

        time.sleep(interval)


if __name__ == "__main__":
    url = "http://api.internal/health"

    if not wait_for_health(url, timeout=120, interval=5):
        print(f"✗ {url} did not recover within 120s")
        sys.exit(1)  # (4)!

  1. timeout=3 is the per-request timeout: how long to wait for a single HTTP response. It is separate from the outer timeout=120, which bounds the whole polling loop.
  2. ConnectionError means the connection could not be established. Most often the server isn't listening yet (connection refused), so the application hasn't started; this is different from an HTTP error. DNS failures and connect timeouts also land in this branch, so the "Connection refused" label is a simplification.
  3. The server accepted the connection but didn't send a response in time. This usually means the app is starting but not yet ready.
  4. sys.exit(1) signals failure to whatever called this script: your CI/CD pipeline, a Makefile, or a parent script.
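The two timeouts in annotation 1 interact: a loop that sleeps a fixed interval can run past the outer budget by up to one sleep plus one request. A minimal sketch of one way to bound that, assuming a hypothetical poll_until helper that takes the health check as a callable. The names poll_until, check, clock, and sleep are illustrative and not part of health_check.py.

```python
import time


def poll_until(check, timeout=120, interval=5, clock=time.monotonic, sleep=time.sleep):
    """Call check() until it returns True or `timeout` seconds have passed.

    Hypothetical helper: `check` is any zero-argument callable that returns
    True when the service is healthy. `clock` and `sleep` are injectable so
    the loop can be exercised without real waiting.

    Returns (healthy, attempts).
    """
    deadline = clock() + timeout
    attempts = 0
    while True:
        attempts += 1
        if check():
            return True, attempts
        remaining = deadline - clock()
        if remaining <= 0:
            return False, attempts
        # Never sleep past the deadline.
        sleep(min(interval, remaining))


# Demo with a stub check that succeeds on the third call and a no-op sleep.
outcomes = iter([False, False, True])
print(poll_until(lambda: next(outcomes), sleep=lambda s: None))  # (True, 3)
```

Passing the clock and sleep function as parameters is a design choice for testability; in a real run the defaults apply and the function waits for real.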

Running It

Running the health check
python health_check.py
#   Connection refused (0s / 120s)
#   Connection refused (5s / 120s)
#   HTTP 503 (10s / 120s)
#   HTTP 503 (15s / 120s)
# ✓ http://api.internal/health is healthy (20s)

The output tells you exactly what the server was doing during the wait. In a CI log, that's useful information. "It spent 10 seconds failing to connect, then 10 seconds returning 503s before coming healthy" is a different story than "it came up immediately."
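If you want that story as data rather than as log lines, one option is to record each attempt and collapse consecutive identical outcomes into phases. This is a sketch under assumptions: summarize_phases and the outcome labels are hypothetical, and the sample attempts mirror the run shown above.

```python
from itertools import groupby


def summarize_phases(attempts):
    """Collapse consecutive attempts with the same outcome into phases.

    attempts: list of (elapsed_seconds, outcome) tuples in time order.
    Returns a list of (outcome, first_seen_seconds, attempt_count) tuples.
    """
    phases = []
    for outcome, group in groupby(attempts, key=lambda attempt: attempt[1]):
        group = list(group)
        phases.append((outcome, group[0][0], len(group)))
    return phases


# Sample data mirroring the run above (labels are illustrative).
attempts = [
    (0, "connection refused"),
    (5, "connection refused"),
    (10, "HTTP 503"),
    (15, "HTTP 503"),
    (20, "HTTP 200"),
]
for outcome, first_seen, count in summarize_phases(attempts):
    print(f"{outcome}: first seen at {first_seen}s, {count} attempt(s)")
```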


Checking a JSON Field, Not Just Status Code

Your /health endpoint might return 200 with a body that indicates partial readiness:

{"status": "degraded", "database": false, "cache": true}

A status-code-only check would pass this. Look at the body itself:

Checking the response body
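resp = requests.get(url, timeout=3)
if resp.status_code == 200:
    try:
        data = resp.json()
    except ValueError:  # body was not valid JSON
        data = {}
    if data.get("status") == "healthy":
        return True
    print(f"  Status: {data.get('status')}")
else:
    print(f"  HTTP {resp.status_code} ({elapsed:.0f}s / {timeout}s)")

This snippet replaces the requests.get() call and the if/else on resp.status_code inside the wait_for_health() loop. The try/except guards against a 200 response whose body is not JSON, which would otherwise make resp.json() raise.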

This is where Python genuinely beats a curl loop — parsing JSON inline without calling jq or juggling subshells.
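The body check can also be pulled into a small pure function, which makes it easy to test without a running server. A minimal sketch, assuming the endpoint reports "status": "healthy" when ready, as in the snippet above. The function name body_reports_healthy is hypothetical.

```python
import json


def body_reports_healthy(status_code, body_text, expected="healthy"):
    """Return True only for HTTP 200 with a JSON object whose "status" matches."""
    if status_code != 200:
        return False
    try:
        data = json.loads(body_text)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False
    # Guard against valid JSON that is not an object (e.g. a list or string).
    return isinstance(data, dict) and data.get("status") == expected


print(body_reports_healthy(200, '{"status": "degraded", "database": false, "cache": true}'))  # False
print(body_reports_healthy(200, '{"status": "healthy"}'))  # True
print(body_reports_healthy(200, "<html>Bad Gateway</html>"))  # False
```

Inside the polling loop this would be called as body_reports_healthy(resp.status_code, resp.text).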


Making It Reusable Across a Deploy Script

A health check that lives in a function can be called from a larger deployment script:

Integrated into a deploy script
from health_check import wait_for_health
import sys

def deploy():
    print("→ Applying manifests")
    # ... kubectl apply ...

    print("→ Waiting for rollout")
    # ... kubectl rollout status ...

    print("→ Verifying health")
    if not wait_for_health("http://api.internal/health", timeout=120):
        print("✗ Deploy failed — API did not become healthy")
        sys.exit(1)

    print("✓ Deploy complete")

if __name__ == "__main__":
    deploy()

Bash functions exist, but sharing logic across files and integrating cleanly with exit codes gets awkward fast. The moment a health check needs to live inside a deploy pipeline — not a terminal — you've outgrown the curl loop.
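When the caller can't import wait_for_health() directly, the exit code is the interface. The sketch below shows a parent process branching on a child's exit status. The child commands are stand-ins for python health_check.py, and run_step is a hypothetical helper name.

```python
import subprocess
import sys


def run_step(argv):
    """Run a command and return True if it exited with status 0."""
    result = subprocess.run(argv)
    return result.returncode == 0


# Stand-ins for `python health_check.py`: child processes that exit 0 or 1.
ok = run_step([sys.executable, "-c", "import sys; sys.exit(0)"])
failed = run_step([sys.executable, "-c", "import sys; sys.exit(1)"])
print(ok, failed)  # True False
```

A CI runner or Makefile does the same thing implicitly: a non-zero exit status from the script stops the step.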


Practice Exercises

Exercise 1: Add exponential backoff

The current poller waits exactly 5 seconds between each attempt. Modify it so the interval doubles after each failed attempt, up to a maximum of 30 seconds. (This reduces load on a recovering service while still catching a fast recovery.)

Answer
Exponential backoff
interval = 5
max_interval = 30

while True:
    # ... attempt ...
    time.sleep(interval)
    interval = min(interval * 2, max_interval)
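To see the schedule this produces, the doubling rule can be written as a generator. This is a sketch only; backoff_intervals and its parameter names are hypothetical, with defaults taken from the exercise.

```python
import itertools


def backoff_intervals(initial=5, maximum=30, factor=2):
    """Yield sleep intervals: initial, initial * factor, ... capped at maximum."""
    interval = initial
    while True:
        yield min(interval, maximum)
        interval = min(interval * factor, maximum)


print(list(itertools.islice(backoff_intervals(), 5)))  # [5, 10, 20, 30, 30]
```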

Exercise 2: Accept the URL as a command-line argument

Hardcoding the URL makes the script less reusable. Modify health_check.py so the URL is passed as the first argument: python health_check.py http://api.internal/health

Answer

URL from command line
import sys

if len(sys.argv) < 2:
    print(f"Usage: {sys.argv[0]} <url>")
    sys.exit(1)

url = sys.argv[1]

For a full CLI with flags, the Efficiency section covers click.
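If you want flags before reaching that section, the standard library's argparse is enough for a small script. A minimal sketch, assuming hypothetical --timeout and --interval flags that map onto the wait_for_health() parameters:

```python
import argparse


def parse_args(argv=None):
    """Parse the URL plus optional timing flags (flag names are illustrative)."""
    parser = argparse.ArgumentParser(description="Poll a health endpoint until it is healthy.")
    parser.add_argument("url", help="health endpoint to poll")
    parser.add_argument("--timeout", type=float, default=120.0, help="overall time budget in seconds")
    parser.add_argument("--interval", type=float, default=5.0, help="seconds between attempts")
    return parser.parse_args(argv)


# Passing a list makes the parser testable; with argv=None it reads sys.argv.
args = parse_args(["http://api.internal/health", "--timeout", "60"])
print(args.url, args.timeout, args.interval)  # http://api.internal/health 60.0 5.0
```

With no URL on the command line, argparse prints a usage message and exits with a non-zero status, so the manual len(sys.argv) check is no longer needed.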


Quick Recap

| Concept | What It Does |
| --- | --- |
| requests.get(url, timeout=3) | HTTP GET with a per-request timeout |
| ConnectionError | Connection could not be established (typically the server isn't listening yet) |
| Timeout | Server accepted the connection but didn't respond in time |
| resp.status_code | HTTP status (200, 503, etc.) |
| resp.json() | Parse the response body as JSON |
| sys.exit(1) | Signal failure to the calling process |

What's Next

  • What Just Broke? — When the API came back but something still isn't right and you need to read the logs

Further Reading

Official Documentation

Exploring Kubernetes

  • kubectl Commands — When health checking is part of a larger deploy: kubectl rollout status and related commands