
Operations-Friendly Observability: A FastAPI Implementation Guide for Logs, Metrics, and Traces — Request ID, JSON Logs, Prometheus, OpenTelemetry, and Dashboard Design


Summary (Inverted Pyramid)

  • The goal is to establish “visibility” for running FastAPI in production so you can respond to incidents and improve performance quickly.
  • Start with adding a Request ID, structured (JSON) logs, and liveness/readiness health checks.
  • Next, expose app metrics (latency, throughput, error rate — the RED metrics) in Prometheus format and prepare alert conditions.
  • Finally, introduce distributed tracing with OpenTelemetry to visualize latency breakdowns across external APIs and databases and pinpoint bottlenecks.

Who Benefits (Concrete Personas)

  • Learner A (college senior, solo developer):
    The API works but occasionally slows down for unknown reasons. Wants to build visibility step by step, starting with logs and metrics.
  • Small Team B (3-person contract shop):
    Root cause analysis takes too long during incidents. Wants Request IDs for tracking and an auto-notification flow when error rate spikes.
  • SaaS Team C (startup):
    Latency propagates across multiple services. Wants OpenTelemetry-based distributed tracing to quickly see where time is being spent.

1. The Minimum Observability Set: Start with Just This

Observability stands on three pillars: logs, metrics, and traces. The minimum set to introduce first:

  1. Request ID (correlation ID)
  2. Structured (JSON) logs
  3. Health checks (liveness/readiness)

1.1 Middleware to Attach a Request ID

# app/obsv/request_id.py
import uuid
from starlette.middleware.base import BaseHTTPMiddleware

REQUEST_ID_HEADER = "X-Request-ID"

class RequestIdMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request, call_next):
        rid = request.headers.get(REQUEST_ID_HEADER) or str(uuid.uuid4())
        # Put it on request.state *before* calling the handler so that
        # endpoints and downstream middleware can read it during the request.
        request.state.request_id = rid
        response = await call_next(request)
        # Echo the ID on the response for cross-referencing with upstream logs.
        response.headers[REQUEST_ID_HEADER] = rid
        return response

1.2 JSON Logging (Standard logging Only)

# app/obsv/logging.py
import json, logging, sys
from typing import Any

class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        payload: dict[str, Any] = {
            "time": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
        }
        if hasattr(record, "request_id"):
            payload["request_id"] = record.request_id
        if record.exc_info:
            payload["exc"] = self.formatException(record.exc_info)
        return json.dumps(payload, ensure_ascii=False)

def setup_logging(level: str = "INFO"):
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(JsonFormatter())
    root = logging.getLogger()
    root.handlers[:] = [handler]
    root.setLevel(level.upper())

1.3 App Integration (with Health Checks)

# app/main.py
from fastapi import FastAPI, Request
from app.obsv.request_id import RequestIdMiddleware
from app.obsv.logging import setup_logging
import logging

app = FastAPI(title="Observability-Ready API")
app.add_middleware(RequestIdMiddleware)
setup_logging("INFO")
log = logging.getLogger("app")

@app.get("/health/liveness")
def liveness():
    return {"status": "alive"}

@app.get("/health/readiness")
def readiness():
    # Check DB connectivity, etc. in production
    return {"status": "ready"}

@app.get("/")
def root(request: Request):
    # Include Request ID in logs
    extra = {"request_id": getattr(request.state, "request_id", None)}
    log.info("hello request", extra=extra)
    return {"ok": True, "request_id": request.headers.get(REQUEST_ID_HEADER)}

Decision Points

  • Accept a Request ID from the incoming request header when present, and always echo it back on the response header, to make cross-referencing with upstream (Nginx/ALB) logs easier.
  • JSON logs are easy to aggregate and search—once you switch, you won’t want to go back to plain text.

2. Metrics: Publish RED via Prometheus

Metrics are numerical series to observe trends over time. Start by covering RED:

  • Rate: number of requests
  • Errors: number of errors (4xx/5xx)
  • Duration: latency distribution (histogram)

2.1 Add a Prometheus Endpoint

# app/obsv/metrics.py
import time
from typing import Callable
from starlette.middleware.base import BaseHTTPMiddleware
from prometheus_client import Counter, Histogram, generate_latest, CONTENT_TYPE_LATEST
from starlette.responses import Response

REQ_COUNTER = Counter("http_requests_total", "request count", ["method", "path", "code"])
REQ_ERRORS = Counter("http_requests_error_total", "error count", ["method", "path", "code"])
REQ_LATENCY = Histogram(
    "http_request_duration_seconds",
    "request latency",
    ["method", "path", "code"],
    buckets=[0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2, 5]
)

class PrometheusMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request, call_next: Callable):
        start = time.perf_counter()
        response = await call_next(request)
        dur = time.perf_counter() - start
        labels = {
            "method": request.method,
            "path": request.url.path,  # In prod, normalize path params to template names
            "code": str(response.status_code),
        }
        REQ_COUNTER.labels(**labels).inc()
        REQ_LATENCY.labels(**labels).observe(dur)
        if response.status_code >= 400:
            REQ_ERRORS.labels(**labels).inc()
        return response

def metrics_endpoint():
    return Response(generate_latest(), media_type=CONTENT_TYPE_LATEST)

2.2 Integrate into the App

# app/main.py (additions)
from app.obsv.metrics import PrometheusMiddleware, metrics_endpoint

app.add_middleware(PrometheusMiddleware)

@app.get("/metrics")
def metrics():
    return metrics_endpoint()

2.3 Minimal Dashboard (PromQL)

  • Latency P90
    histogram_quantile(
      0.90,
      sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
    )
    
  • Error Rate
    sum(rate(http_requests_error_total[5m])) /
    sum(rate(http_requests_total[5m]))
    
  • QPS
    sum(rate(http_requests_total[1m]))
    

Decision Points

  • Normalize route parameters (e.g., /users/123) to prevent cardinality explosions. Use Starlette route templates or your own mapping (see the sketch below).
  • Tune buckets so P50/P90/P99 are stable and useful.
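
One common way to normalize, used by several Starlette/FastAPI exporters, is to match the request against the app's route table and label with the route template instead of the raw path. A minimal sketch (the helper name path_template is our own):

# app/obsv/metrics.py (path normalization, illustrative sketch)
from starlette.routing import Match

def path_template(request) -> str:
    # Return the matched route template (e.g. "/users/{user_id}")
    # so the "path" label stays low-cardinality.
    for route in request.app.routes:
        match, _ = route.matches(request.scope)
        if match == Match.FULL:
            return route.path
    return "unmatched"  # group unmatched paths (404s, scans) under one label

Inside PrometheusMiddleware.dispatch, use "path": path_template(request) in place of request.url.path.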

3. Make App Logs More Useful: Request Summary Logs

Put latency, status, path, Request ID, and (if applicable) user ID all in one line for easy searching.

# app/obsv/access_log.py
import time, logging
from starlette.middleware.base import BaseHTTPMiddleware
from typing import Callable

log = logging.getLogger("access")

class AccessLogMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request, call_next: Callable):
        start = time.perf_counter()
        response = await call_next(request)
        dur_ms = int((time.perf_counter() - start) * 1000)
        rid = getattr(request.state, "request_id", None)
        extra = {"request_id": rid}
        client = request.client.host if request.client else "-"  # client can be None (e.g., under test clients)
        log.info(
            f'{client} {request.method} {request.url.path} '
            f'{response.status_code} {dur_ms}ms',
            extra=extra
        )
        return response

App integration:

# app/main.py (additions)
from app.obsv.access_log import AccessLogMiddleware
app.add_middleware(AccessLogMiddleware)

Decision Points

  • Access logs should be one line per request.
  • When in doubt, NCSA-style + extra fields (request_id/bytes, etc.) is plenty.

4. Distributed Tracing with OpenTelemetry

Distributed tracing visualizes each processing segment (span) of a request over time. It’s powerful for finding bottlenecks.

4.1 Dependencies (Example)

opentelemetry-distro
opentelemetry-exporter-otlp
opentelemetry-instrumentation-fastapi
opentelemetry-instrumentation-requests

4.2 One-Liner Run (Minimal)

opentelemetry-instrument \
  --traces_exporter otlp \
  --service_name fastapi-app \
  uvicorn app.main:app --host 0.0.0.0 --port 8000

Configure endpoint and sampling via environment variables (examples):

OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
OTEL_TRACES_SAMPLER=parentbased_traceidratio
OTEL_TRACES_SAMPLER_ARG=0.1  # 10% sampling

4.3 Manually Creating a Span in Code

# app/obsv/trace.py
from opentelemetry import trace
tracer = trace.get_tracer("app")

def heavy_calc(x: int) -> int:
    with tracer.start_as_current_span("heavy_calc"):
        # Real work (example)
        s = 0
        for i in range(x):
            s += i * i
        return s

4.4 Visualize External HTTP/DB Too

  • With opentelemetry-instrumentation-requests, HTTP calls via requests get spans automatically.
  • Add DB instrumentation (e.g., SQLAlchemy) to capture query timings too (see the sketch below).
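
A minimal sketch of SQLAlchemy instrumentation, assuming opentelemetry-instrumentation-sqlalchemy is installed (the DSN is hypothetical):

# app/db.py (DB spans, illustrative sketch)
from sqlalchemy import create_engine
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor

engine = create_engine("postgresql://user:pass@db/app")  # hypothetical DSN
# Each query executed on this engine now produces a span with its timing.
SQLAlchemyInstrumentor().instrument(engine=engine)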

Decision Points

  • Use sampling in production (fixed-rate or tail sampling).
  • You don’t have to make Request ID and Trace ID identical, but do provide cross-links for log↔trace navigation (one approach is sketched below).
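
One way to cross-link is to stamp the current trace ID onto every JSON log line. A minimal sketch using the OpenTelemetry API (the helper name current_trace_id is our own):

# app/obsv/trace.py (log↔trace correlation, illustrative sketch)
from typing import Optional
from opentelemetry import trace

def current_trace_id() -> Optional[str]:
    # Returns the active trace ID as hex, or None outside a recorded trace.
    ctx = trace.get_current_span().get_span_context()
    if not ctx.is_valid:
        return None
    return format(ctx.trace_id, "032x")  # hex form shown by Jaeger/Tempo etc.

In JsonFormatter.format, add the result as payload["trace_id"] when it is not None, and your log search UI can link straight to the corresponding trace.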

5. Dashboards & Alerts: What to Watch

High-frequency views:

  • RED: QPS, error rate, P90 latency
  • Resources: CPU, memory, GC count, FD count
  • DB: connections, slow query count, query P95
  • External APIs: failure rate, retry count
  • Queues (e.g., Celery): queue length, delay, failure rate

Alert Examples

  • Error rate > 2% sustained for 5 minutes
  • P90 > 300 ms sustained for 10 minutes
  • Readiness check failures (halt the rollout of new versions)
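
The first two expressed as Prometheus alerting rules, an illustrative sketch assuming the metric names from section 2 (file name and thresholds are examples; align them with your SLOs):

# prometheus/alerts.yml (illustrative sketch)
groups:
  - name: fastapi-app
    rules:
      - alert: HighErrorRate
        expr: sum(rate(http_requests_error_total[5m])) / sum(rate(http_requests_total[5m])) > 0.02
        for: 5m
      - alert: HighLatencyP90
        expr: histogram_quantile(0.90, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 0.3
        for: 10m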

Decision Points

  • Add durations so brief spikes don’t page you constantly.
  • Start with a small, curated alert set to reduce false positives.

6. End-to-End Template (Minimal)

# app/main.py (summary)
from fastapi import FastAPI, Request
from app.obsv.request_id import RequestIdMiddleware
from app.obsv.logging import setup_logging
from app.obsv.metrics import PrometheusMiddleware, metrics_endpoint
from app.obsv.access_log import AccessLogMiddleware
import logging

setup_logging("INFO")
app = FastAPI(title="Obs API")
app.add_middleware(RequestIdMiddleware)
app.add_middleware(PrometheusMiddleware)
app.add_middleware(AccessLogMiddleware)
log = logging.getLogger("app")

@app.get("/health/liveness")
def liveness():
    return {"status": "alive"}

@app.get("/health/readiness")
def readiness():
    return {"status": "ready"}

@app.get("/metrics")
def metrics():
    return metrics_endpoint()

@app.get("/calc")
def calc(n: int, request: Request):
    s = sum(i*i for i in range(n))
    log.info("calc done", extra={"request_id": getattr(request.state, "request_id", None)})
    return {"n": n, "sum": s}

7. Operational Tips: Small Designs That Matter in Production

  • Do not log PII. Stick to messages and IDs.
  • Log levels: start at INFO, use ERROR for exceptions, and WARN for expected validation failures.
  • Use different sampling rates and log verbosity for staging vs production.
  • Log critical business events (e.g., order creation) in both audit logs and metrics (a sketch follows this list).
  • During deploys, base external monitoring on the readiness endpoint so health checks don’t interfere with rolling updates.
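
A minimal sketch of the audit-log-plus-metric pattern (the names audit, ORDERS_CREATED, and record_order_created are our own):

# app/obsv/business.py (business events, illustrative sketch)
import logging
from typing import Optional
from prometheus_client import Counter

audit = logging.getLogger("audit")
ORDERS_CREATED = Counter("orders_created_total", "orders created")

def record_order_created(order_id: str, request_id: Optional[str]) -> None:
    # One audit line per business event, correlated via request_id.
    audit.info(f"order created order_id={order_id}", extra={"request_id": request_id})
    ORDERS_CREATED.inc()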

8. Security & Privacy

  • Never include tokens or passwords in traces or logs. Mask header values before logging (see the sketch after this list).
  • Logging request/response bodies automatically is risky (PII leakage). Prefer summaries only.
  • Do not include personally identifying info in metrics either (careful label design).
  • Protect monitoring UIs (Grafana/Jaeger, etc.) with auth and network restrictions.
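
A minimal sketch of header masking before logging (the set of sensitive names is an example; extend it for your stack):

# app/obsv/redact.py (header masking, illustrative sketch)
SENSITIVE_HEADERS = {"authorization", "cookie", "x-api-key"}

def safe_headers(headers) -> dict[str, str]:
    # Replace sensitive values with a placeholder before logging.
    return {
        k: ("***" if k.lower() in SENSITIVE_HEADERS else v)
        for k, v in headers.items()
    }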

9. Common Pitfalls & Remedies

  • Metrics are heavy → Cause: too many label values (high cardinality). Fix: normalize paths; don’t use user IDs as labels.
  • Logs are hard to read → Cause: inconsistent formats. Fix: standardize on JSON; always emit one-line request summaries.
  • Can’t trace requests → Cause: no Request ID, or one that isn’t propagated. Fix: add it via middleware; cross-reference upstream logs.
  • Tracing is expensive → Cause: full (100%) sampling. Fix: use fixed-rate sampling; always-sample only critical routes.
  • Alert fatigue → Cause: bad thresholds/durations. Fix: align with business SLOs; review and prune periodically.

10. Sample Adoption Plans by Audience

  • Solo dev:
    1) Request ID + JSON logs → 2) /metrics → 3) Dashboard with just P90 & error rate
  • Small team:
    1) Monitoring stack (Prometheus/Grafana) → 2) Two alerts (error rate/P90) → 3) Tracing at 10% sampling
  • Growing SaaS:
    1) Cross-service correlation ID policy → 2) Custom spans for key domains → 3) Visualize business KPI metrics

11. Roadmap (Phased Rollout)

  1. Minimum set: Request ID, JSON logs, health checks.
  2. Metrics: Publish RED via Prometheus with a simple dashboard.
  3. Alerts: Start with two — error rate and P90.
  4. Tracing: Introduce OpenTelemetry at 10% sampling; visualize key external APIs and DB.
  5. Tuning: Path normalization & label hygiene, dashboard refinement, audit logging for key business events.


Conclusion

  • The fastest path to successful observability is starting with the minimum set: Request ID, JSON logs, and health checks.
  • Next, publish the RED metrics via Prometheus and prepare dashboards/alerts for P90 and error rate.
  • Finally, adopt OpenTelemetry for distributed tracing to understand latency breakdowns across external APIs and databases and fix the most impactful issues first.
  • Keep an eye on cardinality, PII handling, and alert maintenance, and make “visibility” a shared language across the team. Operations settle down and improvement cycles speed up.

By greeden
