Operations-Friendly Observability: A FastAPI Implementation Guide for Logs, Metrics, and Traces — Request ID, JSON Logs, Prometheus, OpenTelemetry, and Dashboard Design
Summary (Inverted Pyramid)
- The goal is to establish “visibility” for running FastAPI in production so you can respond to incidents and improve performance quickly.
- Start with adding a Request ID, structured (JSON) logs, and liveness/readiness health checks.
- Next, expose app metrics (latency, throughput, error rate — the RED metrics) in Prometheus format and prepare alert conditions.
- Finally, introduce distributed tracing with OpenTelemetry to visualize latency breakdowns across external APIs and databases and pinpoint bottlenecks.
Who Benefits (Concrete Personas)
- Learner A (college senior, solo developer): The API works but occasionally slows down for unknown reasons. Wants to build visibility step by step, starting with logs and metrics.
- Small Team B (3-person contract shop): Root cause analysis takes too long during incidents. Wants Request IDs for tracking and an auto-notification flow when the error rate spikes.
- SaaS Team C (startup): Latency propagates across multiple services. Wants OpenTelemetry-based distributed tracing to quickly see where time is being spent.
1. The Minimum Observability Set: Start with Just This
Observability stands on three pillars: logs, metrics, and traces. The minimum set to introduce first:
- Request ID (correlation ID)
- Structured (JSON) logs
- Health checks (liveness/readiness)
1.1 Middleware to Attach a Request ID
```python
# app/obsv/request_id.py
import uuid

from starlette.middleware.base import BaseHTTPMiddleware
from starlette.types import ASGIApp

REQUEST_ID_HEADER = "X-Request-ID"


class RequestIdMiddleware(BaseHTTPMiddleware):
    def __init__(self, app: ASGIApp) -> None:
        super().__init__(app)

    async def dispatch(self, request, call_next):
        rid = request.headers.get(REQUEST_ID_HEADER) or str(uuid.uuid4())
        # Put it on request.state *before* calling the handler so route code
        # (and downstream middleware) can reference it, e.g., in logs.
        request.state.request_id = rid
        response = await call_next(request)
        response.headers[REQUEST_ID_HEADER] = rid
        return response
```
1.2 JSON Logging (Standard Library logging Only)
```python
# app/obsv/logging.py
import json
import logging
import sys
from typing import Any


class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        payload: dict[str, Any] = {
            "time": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
        }
        if hasattr(record, "request_id"):
            payload["request_id"] = record.request_id
        if record.exc_info:
            payload["exc"] = self.formatException(record.exc_info)
        return json.dumps(payload, ensure_ascii=False)


def setup_logging(level: str = "INFO") -> None:
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(JsonFormatter())
    root = logging.getLogger()
    root.handlers[:] = [handler]
    root.setLevel(level.upper())
```
1.3 App Integration (with Health Checks)
```python
# app/main.py
import logging

from fastapi import FastAPI, Request

from app.obsv.logging import setup_logging
from app.obsv.request_id import RequestIdMiddleware

app = FastAPI(title="Observability-Ready API")
app.add_middleware(RequestIdMiddleware)
setup_logging("INFO")
log = logging.getLogger("app")


@app.get("/health/liveness")
def liveness():
    return {"status": "alive"}


@app.get("/health/readiness")
def readiness():
    # In production, check DB connectivity and other dependencies here
    return {"status": "ready"}


@app.get("/")
def root(request: Request):
    # Include the Request ID in the log record via `extra`
    extra = {"request_id": getattr(request.state, "request_id", None)}
    log.info("hello request", extra=extra)
    # Return the ID the middleware assigned (generated if the client sent none)
    return {"ok": True, "request_id": getattr(request.state, "request_id", None)}
```
Decision Points
- Accept the Request ID from the incoming request and echo it on the response header; that makes cross-referencing with upstream (Nginx/ALB) logs easier.
- JSON logs are easy to aggregate and search; once you switch, you won't want to go back to plain text.
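If you would rather not pass extra={"request_id": ...} by hand at every log call, a contextvars-based log filter can inject it automatically. The following is a minimal sketch, not part of the code above: it assumes you add request_id_var.set(rid) to RequestIdMiddleware.dispatch before call_next and register the filter in setup_logging with handler.addFilter(RequestIdLogFilter()). The module path and names are illustrative.

```python
# app/obsv/context.py (sketch, hypothetical module)
import logging
from contextvars import ContextVar

# Holds the current request's ID for the duration of that request.
request_id_var: ContextVar[str | None] = ContextVar("request_id", default=None)


class RequestIdLogFilter(logging.Filter):
    """Copy the current request ID onto every log record so JsonFormatter picks it up."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Outside of a request this is None; drop it in the formatter if you prefer.
        record.request_id = request_id_var.get()
        return True
```

With this in place, any logger in the app emits the Request ID without handlers having to thread it through manually.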
2. Metrics: Publish RED via Prometheus
Metrics are numerical series to observe trends over time. Start by covering RED:
- Rate: number of requests
- Errors: number of errors (4xx/5xx)
- Duration: latency distribution (histogram)
2.1 Add a Prometheus Endpoint
```python
# app/obsv/metrics.py
import time
from typing import Callable

from prometheus_client import CONTENT_TYPE_LATEST, Counter, Histogram, generate_latest
from starlette.middleware.base import BaseHTTPMiddleware
from starlette.responses import Response

REQ_COUNTER = Counter("http_requests_total", "request count", ["method", "path", "code"])
REQ_ERRORS = Counter("http_requests_error_total", "error count", ["method", "path", "code"])
REQ_LATENCY = Histogram(
    "http_request_duration_seconds",
    "request latency",
    ["method", "path", "code"],
    buckets=[0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2, 5],
)


class PrometheusMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request, call_next: Callable):
        start = time.perf_counter()
        response = await call_next(request)
        dur = time.perf_counter() - start
        labels = {
            "method": request.method,
            "path": request.url.path,  # In prod, normalize path params to route templates
            "code": str(response.status_code),
        }
        REQ_COUNTER.labels(**labels).inc()
        REQ_LATENCY.labels(**labels).observe(dur)
        if response.status_code >= 400:
            REQ_ERRORS.labels(**labels).inc()
        return response


def metrics_endpoint() -> Response:
    return Response(generate_latest(), media_type=CONTENT_TYPE_LATEST)
```
2.2 Integrate into the App
```python
# app/main.py (additions)
from app.obsv.metrics import PrometheusMiddleware, metrics_endpoint

app.add_middleware(PrometheusMiddleware)


@app.get("/metrics")
def metrics():
    return metrics_endpoint()
```
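One caveat if you run several worker processes (e.g., gunicorn -w 4 -k uvicorn.workers.UvicornWorker): each worker keeps its own in-memory registry, so /metrics only reflects the worker that happened to answer. prometheus_client has a multiprocess mode for this case. The sketch below is one possible variant of metrics_endpoint, assuming the PROMETHEUS_MULTIPROC_DIR environment variable points at a writable directory shared by the workers.

```python
# app/obsv/metrics.py (sketch: multi-worker deployments)
import os

from prometheus_client import CONTENT_TYPE_LATEST, CollectorRegistry, generate_latest, multiprocess
from starlette.responses import Response


def metrics_endpoint() -> Response:
    if "PROMETHEUS_MULTIPROC_DIR" in os.environ:
        # Aggregate the samples written to disk by all worker processes.
        registry = CollectorRegistry()
        multiprocess.MultiProcessCollector(registry)
        return Response(generate_latest(registry), media_type=CONTENT_TYPE_LATEST)
    # Single-process fallback: use the default global registry.
    return Response(generate_latest(), media_type=CONTENT_TYPE_LATEST)
```

Set the environment variable before the workers start, and check the prometheus_client documentation for the metric types that need extra configuration in this mode (gauges in particular).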
2.3 Minimal Dashboard (PromQL)
- Latency P90
histogram_quantile(0.90, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
- Error Rate
sum(rate(http_requests_error_total[5m])) / sum(rate(http_requests_total[5m]))
- QPS
sum(rate(http_requests_total[1m]))
Decision Points
- Normalize route parameters (e.g., /users/123) to route templates to prevent cardinality explosions. Use Starlette route matching or your own mapping; one approach is sketched below.
- Tune buckets so P50/P90/P99 are stable and useful.
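One way to normalize is to resolve the matched route template via Starlette's public route.matches() API and use that as the path label. Treat the helper below as a sketch of one possible approach (the function name is illustrative); falling back to the raw path keeps unmatched routes such as 404s visible.

```python
# app/obsv/metrics.py (sketch: label by route template instead of raw path)
from starlette.requests import Request
from starlette.routing import Match


def path_template(request: Request) -> str:
    # Return e.g. "/items/{item_id}" instead of "/items/123" to keep label cardinality low.
    for route in request.app.routes:
        match, _ = route.matches(request.scope)
        if match == Match.FULL:
            return route.path
    return request.url.path  # fallback: unmatched (404) or static paths
```

In PrometheusMiddleware.dispatch, use "path": path_template(request) instead of request.url.path.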
3. Make App Logs More Useful: Request Summary Logs
Put latency, status, path, Request ID, and (if applicable) user ID all in one line for easy searching.
```python
# app/obsv/access_log.py
import logging
import time
from typing import Callable

from starlette.middleware.base import BaseHTTPMiddleware

log = logging.getLogger("access")


class AccessLogMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request, call_next: Callable):
        start = time.perf_counter()
        response = await call_next(request)
        dur_ms = int((time.perf_counter() - start) * 1000)
        rid = getattr(request.state, "request_id", None)
        # request.client can be None (e.g., under some test clients)
        client = request.client.host if request.client else "-"
        log.info(
            f"{client} {request.method} {request.url.path} "
            f"{response.status_code} {dur_ms}ms",
            extra={"request_id": rid},
        )
        return response
```
App integration:
```python
# app/main.py (additions)
from app.obsv.access_log import AccessLogMiddleware

app.add_middleware(AccessLogMiddleware)
```
Decision Points
- Access logs should be one line per request.
- When in doubt, NCSA-style + extra fields (request_id/bytes, etc.) is plenty; a sketch for carrying such extras as JSON fields follows this list.
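If you want those extra fields to land as proper JSON keys rather than being baked into the message string, one option is a formatter variant that copies a whitelist of extra attributes into the payload. This is a minimal sketch building on the JsonFormatter from section 1.2; the class name and field list are illustrative.

```python
# app/obsv/access_json.py (sketch, hypothetical module)
import json
import logging

from app.obsv.logging import JsonFormatter

# Illustrative whitelist; add whatever fields your team actually searches on.
EXTRA_FIELDS = ("request_id", "client", "bytes", "user_agent")


class JsonFormatterWithExtras(JsonFormatter):
    """Copy whitelisted attributes passed via extra={...} into the JSON payload."""

    def format(self, record: logging.LogRecord) -> str:
        payload = json.loads(super().format(record))
        for name in EXTRA_FIELDS:
            value = getattr(record, name, None)
            if value is not None:
                payload[name] = value
        return json.dumps(payload, ensure_ascii=False)
```

Register it on the handler that serves the "access" logger and pass the additional values via extra={...} from AccessLogMiddleware.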
4. Distributed Tracing with OpenTelemetry
Distributed tracing visualizes each processing segment (span) of a request over time. It’s powerful for finding bottlenecks.
4.1 Dependencies (Example)
```
opentelemetry-distro
opentelemetry-exporter-otlp
opentelemetry-instrumentation-fastapi
opentelemetry-instrumentation-requests
```
4.2 One-Liner Run (Minimal)
```bash
opentelemetry-instrument \
  --traces_exporter otlp \
  --service_name fastapi-app \
  uvicorn app.main:app --host 0.0.0.0 --port 8000
```
Configure endpoint and sampling via environment variables (examples):
```bash
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
OTEL_TRACES_SAMPLER=parentbased_traceidratio
OTEL_TRACES_SAMPLER_ARG=0.1  # 10% sampling
```
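If you prefer explicit wiring over the opentelemetry-instrument launcher (for example, to keep all configuration in code), the SDK can also be set up programmatically. The following is a minimal sketch, assuming the opentelemetry-sdk, opentelemetry-exporter-otlp, and opentelemetry-instrumentation-fastapi packages are installed; the endpoint, service name, and sampling rate mirror the environment-variable example above, and the module/function names are illustrative.

```python
# app/obsv/otel.py (sketch: programmatic setup instead of opentelemetry-instrument)
from fastapi import FastAPI
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased


def setup_tracing(app: FastAPI, service_name: str = "fastapi-app") -> None:
    provider = TracerProvider(
        resource=Resource.create({"service.name": service_name}),
        sampler=ParentBased(TraceIdRatioBased(0.1)),  # 10% sampling, as in the env example
    )
    exporter = OTLPSpanExporter(endpoint="http://otel-collector:4317", insecure=True)
    provider.add_span_processor(BatchSpanProcessor(exporter))
    trace.set_tracer_provider(provider)
    # Creates a server span for every request handled by this FastAPI app.
    FastAPIInstrumentor.instrument_app(app)
```

Call setup_tracing(app) once at startup, before the app begins serving traffic.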
4.3 Manually Creating a Span in Code
```python
# app/obsv/trace.py
from opentelemetry import trace

tracer = trace.get_tracer("app")


def heavy_calc(x: int) -> int:
    with tracer.start_as_current_span("heavy_calc"):
        # Real work (example)
        s = 0
        for i in range(x):
            s += i * i
        return s
```
4.4 Visualize External HTTP/DB Too
- With opentelemetry-instrumentation-requests, HTTP calls made via requests get spans automatically.
- Add DB instrumentation (e.g., SQLAlchemy) to capture query timings too; a sketch follows below.
Decision Points
- Use sampling in production (fixed-rate or tail sampling).
- You don’t have to make the Request ID and Trace ID identical, but do provide cross-links for log↔trace navigation; one way to surface the trace ID in logs is sketched below.
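One way to build that cross-link is to stamp the active trace ID onto log records, so a log line found by Request ID leads straight to the corresponding trace and vice versa. A minimal sketch using the OpenTelemetry API; the helper name is illustrative, and your formatter needs to forward the extra field (see the whitelist sketch in section 3).

```python
# app/obsv/trace_log.py (sketch: expose the current trace ID to logs)
from opentelemetry import trace


def current_trace_id() -> str | None:
    ctx = trace.get_current_span().get_span_context()
    # Format as a 32-character hex string, the form tracing UIs display.
    return format(ctx.trace_id, "032x") if ctx.is_valid else None
```

For example: log.info("order created", extra={"request_id": rid, "trace_id": current_trace_id()}).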
5. Dashboards & Alerts: What to Watch
High-frequency views:
- RED: QPS, error rate, P90 latency
- Resources: CPU, memory, GC count, FD count
- DB: connections, slow query count, query P95
- External APIs: failure rate, retry count
- Queues (e.g., Celery): queue length, delay, failure rate
Alert Examples
- Error rate > 2% sustained for 5 minutes
- P90 > 300 ms sustained for 10 minutes
- Readiness failures (stop rolling new versions)
Decision Points
- Add durations so brief spikes don’t page you constantly.
- Start with a small, curated alert set to reduce false positives.
6. End-to-End Template (Minimal)
```python
# app/main.py (summary)
import logging

from fastapi import FastAPI, Request

from app.obsv.access_log import AccessLogMiddleware
from app.obsv.logging import setup_logging
from app.obsv.metrics import PrometheusMiddleware, metrics_endpoint
from app.obsv.request_id import RequestIdMiddleware

setup_logging("INFO")

app = FastAPI(title="Obs API")
app.add_middleware(RequestIdMiddleware)
app.add_middleware(PrometheusMiddleware)
app.add_middleware(AccessLogMiddleware)

log = logging.getLogger("app")


@app.get("/health/liveness")
def liveness():
    return {"status": "alive"}


@app.get("/health/readiness")
def readiness():
    return {"status": "ready"}


@app.get("/metrics")
def metrics():
    return metrics_endpoint()


@app.get("/calc")
def calc(n: int, request: Request):
    s = sum(i * i for i in range(n))
    log.info("calc done", extra={"request_id": getattr(request.state, "request_id", None)})
    return {"n": n, "sum": s}
```
7. Operational Tips: Small Designs That Matter in Production
- Do not log PII. Stick to messages and IDs.
- Log levels: start at INFO, use ERROR for exceptions, and WARN for expected validation failures.
- Use different sampling rates and log verbosity for staging vs production.
- Log critical business events (e.g., order creation) in both audit logs and metrics.
- During deploys, rely on the readiness endpoint for external monitoring to avoid stepping on rolling updates; a readiness sketch that actually checks a dependency follows this list.
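In that spirit, the stub readiness handler from section 1 can be upgraded to verify dependencies and return 503 when one is down, so load balancers and rollout tooling stop routing traffic to the instance. This is a sketch with a hypothetical check_database() helper; wire in a real ping (e.g., SELECT 1 against your database) for your own stack.

```python
# app/main.py (sketch: readiness that really checks dependencies)
from fastapi.responses import JSONResponse


async def check_database() -> bool:
    # Hypothetical helper: replace with a real dependency check for your stack.
    return True


@app.get("/health/readiness")
async def readiness():
    if not await check_database():
        # 503 tells the load balancer / orchestrator to keep traffic away from this instance.
        return JSONResponse(status_code=503, content={"status": "not ready"})
    return {"status": "ready"}
```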
8. Security & Privacy
- Never include tokens or passwords in traces or logs; mask sensitive headers before logging them (a sketch follows after this list).
- Logging request/response bodies automatically is risky (PII leakage). Prefer summaries only.
- Do not include personally identifying info in metrics either (careful label design).
- Protect monitoring UIs (Grafana/Jaeger, etc.) with auth and network restrictions.
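If you ever do log headers (for example, while debugging a client integration), redact the sensitive ones first. A small sketch; the deny-list contents and module name are illustrative.

```python
# app/obsv/redact.py (sketch)
from typing import Mapping

# Illustrative deny-list; extend it for your own auth scheme.
SENSITIVE_HEADERS = {"authorization", "cookie", "set-cookie", "x-api-key"}


def safe_headers(headers: Mapping[str, str]) -> dict[str, str]:
    """Return a copy of the headers with sensitive values replaced by a placeholder."""
    return {k: ("***" if k.lower() in SENSITIVE_HEADERS else v) for k, v in headers.items()}
```

Pass the result of safe_headers(request.headers) to your logger instead of the raw headers.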
9. Common Pitfalls & Remedies
| Symptom | Cause | Fix |
|---|---|---|
| Metrics are heavy | Too many label values (high cardinality) | Normalize paths; don’t use user IDs as labels |
| Logs are hard to read | Inconsistent formats | Standardize on JSON; always emit 1-line request summaries |
| Can’t trace requests | Request ID missing or not propagated | Add via middleware; cross-reference upstream logs |
| Tracing is expensive | Full sampling | Use fixed-rate sampling; always-sample only critical routes |
| Alert fatigue | Bad thresholds/durations | Align with business SLOs; review and suppress periodically |
10. Sample Adoption Plans by Audience
- Solo dev: 1) Request ID + JSON logs → 2) /metrics → 3) dashboard with just P90 & error rate
- Small team: 1) monitoring stack (Prometheus/Grafana) → 2) two alerts (error rate/P90) → 3) tracing at 10% sampling
- Growing SaaS: 1) cross-service correlation ID policy → 2) custom spans for key domains → 3) visualize business KPI metrics
11. Roadmap (Phased Rollout)
- Minimum set: Request ID, JSON logs, health checks.
- Metrics: Publish RED via Prometheus with a simple dashboard.
- Alerts: Start with two — error rate and P90.
- Tracing: Introduce OpenTelemetry at 10% sampling; visualize key external APIs and DB.
- Tuning: Path normalization & label hygiene, dashboard refinement, audit logging for key business events.
12. References
- FastAPI
- Starlette
- Prometheus / Grafana
- OpenTelemetry
- Python standard library logging
Conclusion
- The fastest path to successful observability is starting with the minimum set: Request ID, JSON logs, and health checks.
- Next, publish the RED metrics via Prometheus and prepare dashboards/alerts for P90 and error rate.
- Finally, adopt OpenTelemetry for distributed tracing to understand latency breakdowns across external APIs and databases and fix the most impactful issues first.
- Keep an eye on cardinality, PII handling, and alert maintenance, and make “visibility” a shared language across the team; operations calm down and improvement cycles speed up.