
Operations-Friendly Observability: A FastAPI Implementation Guide for Logs, Metrics, and Traces — Request ID, JSON Logs, Prometheus, OpenTelemetry, and Dashboard Design


Summary (Inverted Pyramid)

  • The goal is to establish “visibility” for running FastAPI in production so you can respond to incidents and improve performance quickly.
  • Start with adding a Request ID, structured (JSON) logs, and liveness/readiness health checks.
  • Next, expose app metrics (latency, throughput, error rate — the RED metrics) in Prometheus format and prepare alert conditions.
  • Finally, introduce distributed tracing with OpenTelemetry to visualize latency breakdowns across external APIs and databases and pinpoint bottlenecks.

Who Benefits (Concrete Personas)

  • Learner A (college senior, solo developer):
    The API works but occasionally slows down for unknown reasons. Wants to build visibility step by step, starting with logs and metrics.
  • Small Team B (3-person contract shop):
    Root cause analysis takes too long during incidents. Wants Request IDs for tracking and an auto-notification flow when error rate spikes.
  • SaaS Team C (startup):
    Latency propagates across multiple services. Wants OpenTelemetry-based distributed tracing to quickly see where time is being spent.

1. The Minimum Observability Set: Start with Just This

Observability stands on three pillars: logs, metrics, and traces. The minimum set to introduce first:

  1. Request ID (correlation ID)
  2. Structured (JSON) logs
  3. Health checks (liveness/readiness)

1.1 Middleware to Attach a Request ID

# app/obsv/request_id.py
import uuid
from starlette.middleware.base import BaseHTTPMiddleware

REQUEST_ID_HEADER = "X-Request-ID"

class RequestIdMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request, call_next):
        rid = request.headers.get(REQUEST_ID_HEADER) or str(uuid.uuid4())
        # Put it on request.state *before* calling the handler so that
        # endpoints and downstream middleware can read it during the request.
        request.state.request_id = rid
        response = await call_next(request)
        # Echo the ID on the response for cross-referencing with upstream logs.
        response.headers[REQUEST_ID_HEADER] = rid
        return response

1.2 JSON Logging (Standard logging Only)

# app/obsv/logging.py
import json, logging, sys
from typing import Any

class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        payload: dict[str, Any] = {
            "time": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
        }
        if hasattr(record, "request_id"):
            payload["request_id"] = record.request_id
        if record.exc_info:
            payload["exc"] = self.formatException(record.exc_info)
        return json.dumps(payload, ensure_ascii=False)

def setup_logging(level: str = "INFO"):
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(JsonFormatter())
    root = logging.getLogger()
    root.handlers[:] = [handler]
    root.setLevel(level.upper())

1.3 App Integration (with Health Checks)

# app/main.py
from fastapi import FastAPI, Request
from app.obsv.request_id import RequestIdMiddleware
from app.obsv.logging import setup_logging
import logging

app = FastAPI(title="Observability-Ready API")
app.add_middleware(RequestIdMiddleware)
setup_logging("INFO")
log = logging.getLogger("app")

@app.get("/health/liveness")
def liveness():
    return {"status": "alive"}

@app.get("/health/readiness")
def readiness():
    # Check DB connectivity, etc. in production
    return {"status": "ready"}

@app.get("/")
def root(request: Request):
    # Include Request ID in logs
    extra = {"request_id": getattr(request.state, "request_id", None)}
    log.info("hello request", extra=extra)
    return {"ok": True, "request_id": request.headers.get(REQUEST_ID_HEADER)}

Decision Points

  • Accept a Request ID from the incoming request header when present, and always echo it back on the response header, to make cross-referencing with upstream (Nginx/ALB) logs easier.
  • JSON logs are easy to aggregate and search—once you switch, you won’t want to go back to plain text.

2. Metrics: Publish RED via Prometheus

Metrics are numerical series to observe trends over time. Start by covering RED:

  • Rate: number of requests
  • Errors: number of errors (4xx/5xx)
  • Duration: latency distribution (histogram)

2.1 Add a Prometheus Endpoint

# app/obsv/metrics.py
import time
from typing import Callable
from starlette.middleware.base import BaseHTTPMiddleware
from prometheus_client import Counter, Histogram, generate_latest, CONTENT_TYPE_LATEST
from starlette.responses import Response

REQ_COUNTER = Counter("http_requests_total", "request count", ["method", "path", "code"])
REQ_ERRORS = Counter("http_requests_error_total", "error count", ["method", "path", "code"])
REQ_LATENCY = Histogram(
    "http_request_duration_seconds",
    "request latency",
    ["method", "path", "code"],
    buckets=[0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2, 5]
)

class PrometheusMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request, call_next: Callable):
        start = time.perf_counter()
        response = await call_next(request)
        dur = time.perf_counter() - start
        labels = {
            "method": request.method,
            "path": request.url.path,  # In prod, normalize path params to template names
            "code": str(response.status_code),
        }
        REQ_COUNTER.labels(**labels).inc()
        REQ_LATENCY.labels(**labels).observe(dur)
        if response.status_code >= 400:
            REQ_ERRORS.labels(**labels).inc()
        return response

def metrics_endpoint():
    return Response(generate_latest(), media_type=CONTENT_TYPE_LATEST)

2.2 Integrate into the App

# app/main.py (additions)
from app.obsv.metrics import PrometheusMiddleware, metrics_endpoint

app.add_middleware(PrometheusMiddleware)

@app.get("/metrics")
def metrics():
    return metrics_endpoint()

2.3 Minimal Dashboard (PromQL)

  • Latency P90
    histogram_quantile(
      0.90,
      sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
    )
    
  • Error Rate
    sum(rate(http_requests_error_total[5m])) /
    sum(rate(http_requests_total[5m]))
    
  • QPS
    sum(rate(http_requests_total[1m]))
    

Decision Points

  • Normalize route parameters (e.g., /users/123) to prevent cardinality explosions. Use Starlette route templates or your own mapping (see the sketch below).
  • Tune buckets so P50/P90/P99 are stable and useful.
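
One common way to normalize, used by several Starlette/FastAPI exporters, is to match the request against the app's route table and label with the route template instead of the raw path. A minimal sketch (the helper name path_template is our own):

# app/obsv/metrics.py (path normalization, illustrative sketch)
from starlette.routing import Match

def path_template(request) -> str:
    # Return the matched route template (e.g. "/users/{user_id}")
    # so the "path" label stays low-cardinality.
    for route in request.app.routes:
        match, _ = route.matches(request.scope)
        if match == Match.FULL:
            return route.path
    return "unmatched"  # group unmatched paths (404s, scans) under one label

Inside PrometheusMiddleware.dispatch, use "path": path_template(request) in place of request.url.path.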

3. Make App Logs More Useful: Request Summary Logs

Put latency, status, path, Request ID, and (if applicable) user ID all in one line for easy searching.

# app/obsv/access_log.py
import time, logging
from starlette.middleware.base import BaseHTTPMiddleware
from typing import Callable

log = logging.getLogger("access")

class AccessLogMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request, call_next: Callable):
        start = time.perf_counter()
        response = await call_next(request)
        dur_ms = int((time.perf_counter() - start) * 1000)
        rid = getattr(request.state, "request_id", None)
        extra = {"request_id": rid}
        client = request.client.host if request.client else "-"  # client can be None (e.g., under test clients)
        log.info(
            f'{client} {request.method} {request.url.path} '
            f'{response.status_code} {dur_ms}ms',
            extra=extra
        )
        return response

App integration:

# app/main.py (additions)
from app.obsv.access_log import AccessLogMiddleware
app.add_middleware(AccessLogMiddleware)

Decision Points

  • Access logs should be one line per request.
  • When in doubt, NCSA-style + extra fields (request_id/bytes, etc.) is plenty.

4. Distributed Tracing with OpenTelemetry

Distributed tracing visualizes each processing segment (span) of a request over time. It’s powerful for finding bottlenecks.

4.1 Dependencies (Example)

opentelemetry-distro
opentelemetry-exporter-otlp
opentelemetry-instrumentation-fastapi
opentelemetry-instrumentation-requests

4.2 One-Liner Run (Minimal)

opentelemetry-instrument \
  --traces_exporter otlp \
  --service_name fastapi-app \
  uvicorn app.main:app --host 0.0.0.0 --port 8000

Configure endpoint and sampling via environment variables (examples):

OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
OTEL_TRACES_SAMPLER=parentbased_traceidratio
OTEL_TRACES_SAMPLER_ARG=0.1  # 10% sampling

4.3 Manually Creating a Span in Code

# app/obsv/trace.py
from opentelemetry import trace
tracer = trace.get_tracer("app")

def heavy_calc(x: int) -> int:
    with tracer.start_as_current_span("heavy_calc"):
        # Real work (example)
        s = 0
        for i in range(x):
            s += i * i
        return s

4.4 Visualize External HTTP/DB Too

  • With opentelemetry-instrumentation-requests, HTTP calls via requests get spans automatically.
  • Add DB instrumentation (e.g., SQLAlchemy) to capture query timings too (see the sketch below).
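
A minimal sketch of SQLAlchemy instrumentation, assuming opentelemetry-instrumentation-sqlalchemy is installed (the DSN is hypothetical):

# app/db.py (DB spans, illustrative sketch)
from sqlalchemy import create_engine
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor

engine = create_engine("postgresql://user:pass@db/app")  # hypothetical DSN
# Each query executed on this engine now produces a span with its timing.
SQLAlchemyInstrumentor().instrument(engine=engine)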

Decision Points

  • Use sampling in production (fixed-rate or tail sampling).
  • You don’t have to make Request ID and Trace ID identical, but do provide cross-links for log↔trace navigation (one approach is sketched below).
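
One way to cross-link is to stamp the current trace ID onto every JSON log line. A minimal sketch using the OpenTelemetry API (the helper name current_trace_id is our own):

# app/obsv/trace.py (log↔trace correlation, illustrative sketch)
from typing import Optional
from opentelemetry import trace

def current_trace_id() -> Optional[str]:
    # Returns the active trace ID as hex, or None outside a recorded trace.
    ctx = trace.get_current_span().get_span_context()
    if not ctx.is_valid:
        return None
    return format(ctx.trace_id, "032x")  # hex form shown by Jaeger/Tempo etc.

In JsonFormatter.format, add the result as payload["trace_id"] when it is not None, and your log search UI can link straight to the corresponding trace.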

5. Dashboards & Alerts: What to Watch

High-frequency views:

  • RED: QPS, error rate, P90 latency
  • Resources: CPU, memory, GC count, FD count
  • DB: connections, slow query count, query P95
  • External APIs: failure rate, retry count
  • Queues (e.g., Celery): queue length, delay, failure rate

Alert Examples

  • Error rate > 2% sustained for 5 minutes
  • P90 > 300 ms sustained for 10 minutes
  • Readiness check failures (halt the rollout of new versions)
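
The first two expressed as Prometheus alerting rules, an illustrative sketch assuming the metric names from section 2 (file name and thresholds are examples; align them with your SLOs):

# prometheus/alerts.yml (illustrative sketch)
groups:
  - name: fastapi-app
    rules:
      - alert: HighErrorRate
        expr: sum(rate(http_requests_error_total[5m])) / sum(rate(http_requests_total[5m])) > 0.02
        for: 5m
      - alert: HighLatencyP90
        expr: histogram_quantile(0.90, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 0.3
        for: 10m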

Decision Points

  • Add durations so brief spikes don’t page you constantly.
  • Start with a small, curated alert set to reduce false positives.

6. End-to-End Template (Minimal)

# app/main.py (summary)
from fastapi import FastAPI, Request
from app.obsv.request_id import RequestIdMiddleware
from app.obsv.logging import setup_logging
from app.obsv.metrics import PrometheusMiddleware, metrics_endpoint
from app.obsv.access_log import AccessLogMiddleware
import logging

setup_logging("INFO")
app = FastAPI(title="Obs API")
app.add_middleware(RequestIdMiddleware)
app.add_middleware(PrometheusMiddleware)
app.add_middleware(AccessLogMiddleware)
log = logging.getLogger("app")

@app.get("/health/liveness")
def liveness():
    return {"status": "alive"}

@app.get("/health/readiness")
def readiness():
    return {"status": "ready"}

@app.get("/metrics")
def metrics():
    return metrics_endpoint()

@app.get("/calc")
def calc(n: int, request: Request):
    s = sum(i*i for i in range(n))
    log.info("calc done", extra={"request_id": getattr(request.state, "request_id", None)})
    return {"n": n, "sum": s}

7. Operational Tips: Small Designs That Matter in Production

  • Do not log PII. Stick to messages and IDs.
  • Log levels: start at INFO, use ERROR for exceptions, and WARN for expected validation failures.
  • Use different sampling rates and log verbosity for staging vs production.
  • Log critical business events (e.g., order creation) in both audit logs and metrics (a sketch follows this list).
  • During deploys, base external monitoring on the readiness endpoint so health checks don’t interfere with rolling updates.
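
A minimal sketch of the audit-log-plus-metric pattern (the names audit, ORDERS_CREATED, and record_order_created are our own):

# app/obsv/business.py (business events, illustrative sketch)
import logging
from typing import Optional
from prometheus_client import Counter

audit = logging.getLogger("audit")
ORDERS_CREATED = Counter("orders_created_total", "orders created")

def record_order_created(order_id: str, request_id: Optional[str]) -> None:
    # One audit line per business event, correlated via request_id.
    audit.info(f"order created order_id={order_id}", extra={"request_id": request_id})
    ORDERS_CREATED.inc()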

8. Security & Privacy

  • Never include tokens or passwords in traces or logs. Mask header values before logging (see the sketch after this list).
  • Logging request/response bodies automatically is risky (PII leakage). Prefer summaries only.
  • Do not include personally identifying info in metrics either (careful label design).
  • Protect monitoring UIs (Grafana/Jaeger, etc.) with auth and network restrictions.
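
A minimal sketch of header masking before logging (the set of sensitive names is an example; extend it for your stack):

# app/obsv/redact.py (header masking, illustrative sketch)
SENSITIVE_HEADERS = {"authorization", "cookie", "x-api-key"}

def safe_headers(headers) -> dict[str, str]:
    # Replace sensitive values with a placeholder before logging.
    return {
        k: ("***" if k.lower() in SENSITIVE_HEADERS else v)
        for k, v in headers.items()
    }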

9. Common Pitfalls & Remedies

  • Metrics are heavy → Cause: too many label values (high cardinality). Fix: normalize paths; don’t use user IDs as labels.
  • Logs are hard to read → Cause: inconsistent formats. Fix: standardize on JSON; always emit one-line request summaries.
  • Can’t trace requests → Cause: no Request ID, or one that isn’t propagated. Fix: add it via middleware; cross-reference upstream logs.
  • Tracing is expensive → Cause: full (100%) sampling. Fix: use fixed-rate sampling; always-sample only critical routes.
  • Alert fatigue → Cause: bad thresholds/durations. Fix: align with business SLOs; review and prune periodically.

10. Sample Adoption Plans by Audience

  • Solo dev:
    1) Request ID + JSON logs → 2) /metrics → 3) Dashboard with just P90 & error rate
  • Small team:
    1) Monitoring stack (Prometheus/Grafana) → 2) Two alerts (error rate/P90) → 3) Tracing at 10% sampling
  • Growing SaaS:
    1) Cross-service correlation ID policy → 2) Custom spans for key domains → 3) Visualize business KPI metrics

11. Roadmap (Phased Rollout)

  1. Minimum set: Request ID, JSON logs, health checks.
  2. Metrics: Publish RED via Prometheus with a simple dashboard.
  3. Alerts: Start with two — error rate and P90.
  4. Tracing: Introduce OpenTelemetry at 10% sampling; visualize key external APIs and DB.
  5. Tuning: Path normalization & label hygiene, dashboard refinement, audit logging for key business events.


Conclusion

  • The fastest path to successful observability is starting with the minimum set: Request ID, JSON logs, and health checks.
  • Next, publish the RED metrics via Prometheus and prepare dashboards/alerts for P90 and error rate.
  • Finally, adopt OpenTelemetry for distributed tracing to understand latency breakdowns across external APIs and databases and fix the most impactful issues first.
  • Keep an eye on cardinality, PII handling, and alert maintenance, and make “visibility” a shared language across the team. Operations settle down and improvement cycles speed up.

By greeden
