
Become Robust with Large Files: Designing FastAPI Upload/Download — Streaming, Range Support, Signed URLs, Virus Scanning, and Integrity Verification


Summary (the big picture up front)

  • Split files into Small (~10 MB), Medium (~100 MB), and Large (GB-class), and switch between handling on the API server vs. streaming directly to external storage.
  • Keep memory usage steady with asynchronous streaming, and ensure integrity with Content-Length / ETag.
  • Offload Range (partial fetches) and compression/transcoding to storage or the CDN. Let the API focus on metadata management and issuing signed URLs.
  • For security, layer defenses with both extension and MIME validation, file-size ceilings, virus scanning, and pre-save hashing.

Who benefits from this

  • Learner A (graduation project: a proto video platform)
    Wants to learn the basics to safely handle large files in a small API.
  • Small Team B (3-person agency)
    Needs to accept bulk image uploads from clients without hogging server memory.
  • SaaS Dev C (startup)
    Plans to use S3-compatible storage, centering on signed URLs for direct PUT/GET.

1. Strategy: switch by size and path

1.1 Three classes

  • Small (~10 MB): Receive on the API server → save to storage. Validation is simple.
  • Medium (~100 MB): The API server streams end-to-end. Must watch CPU/memory/I/O.
  • Large (GB-class): Use signed URLs (S3-compatible, etc.) so the client uploads/downloads directly. API only handles metadata and authorization.

1.2 Role separation

  • API: authorization, metadata, pre/post hooks (virus scan, trigger thumbnail jobs)
  • Storage/CDN: persistence, range delivery, compression, caching
  • Worker: heavy transforms and virus scans

Key point

  • Design the route first. Assume heavy I/O is pushed out of the API.

2. For small files: basic upload via form-data

# app/main.py
from fastapi import FastAPI, UploadFile, File, HTTPException
from pathlib import Path
import magic  # python-magic: MIME sniffing from file content (pip install python-magic)
import hashlib

app = FastAPI(title="File Upload Basics")
BASE = Path("uploads")
BASE.mkdir(exist_ok=True)

MAX_MB = 10

def _digest_and_validate(file_bytes: bytes):
    # MIME detection (don’t rely on extension alone)
    mime = magic.from_buffer(file_bytes[:4096], mime=True)
    # Example: allow images only
    if not mime.startswith("image/"):
        raise HTTPException(400, "unsupported media type")
    # Content hash (can serve as an ETag basis)
    etag = hashlib.sha256(file_bytes).hexdigest()
    # Extend here for checks like disguised image formats, etc.
    return mime, etag

@app.post("/upload")
async def upload(file: UploadFile = File(...)):
    size_hint = 0
    chunks = []
    while True:
        chunk = await file.read(1024 * 1024)  # 1MB
        if not chunk:
            break
        chunks.append(chunk)
        size_hint += len(chunk)
        if size_hint > MAX_MB * 1024 * 1024:
            raise HTTPException(413, "file too large")
    content = b"".join(chunks)
    mime, etag = _digest_and_validate(content)

    dest = BASE / Path(file.filename).name  # drop any client-supplied path parts
    dest.write_bytes(content)
    return {"filename": file.filename, "size": size_hint, "mime": mime, "etag": etag}
  • Dual validation (extension + MIME) reduces spoofing.
  • Reject early with 413 Payload Too Large.
  • For small files, reading all at once can be acceptable—still watch memory usage.
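
To exercise the endpoint, a small client script works (a sketch using httpx; the URL and file path are placeholders):

# client_upload.py (illustrative; assumes the API runs on localhost:8000)
import httpx

with open("sample.png", "rb") as fh:
    resp = httpx.post(
        "http://localhost:8000/upload",
        files={"file": ("sample.png", fh, "image/png")},
    )
resp.raise_for_status()
print(resp.json())  # {"filename": ..., "size": ..., "mime": ..., "etag": ...}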

Key point

  • For small files, prioritize correctness and safety first. MIME checks and size limits are step one.

3. For medium files: async streaming keeps memory steady

# app/stream.py
from fastapi import APIRouter, UploadFile, File, HTTPException
from pathlib import Path
import aiofiles
import hashlib

router = APIRouter(prefix="/stream", tags=["stream"])

CHUNK = 1024 * 1024  # 1MB
LIMIT = 100 * 1024 * 1024  # 100MB
DEST = Path("streams"); DEST.mkdir(exist_ok=True)

@router.post("/upload")
async def upload_stream(file: UploadFile = File(...)):
    size = 0
    sha256 = hashlib.sha256()
    dest = DEST / Path(file.filename).name  # drop any client-supplied path parts
    try:
        async with aiofiles.open(dest, "wb") as f:
            while True:
                chunk = await file.read(CHUNK)
                if not chunk:
                    break
                size += len(chunk)
                if size > LIMIT:
                    raise HTTPException(413, "file too large")
                sha256.update(chunk)
                await f.write(chunk)
    except HTTPException:
        dest.unlink(missing_ok=True)  # don't leave a partial file behind
        raise
    return {"filename": file.filename, "size": size, "etag": sha256.hexdigest()}
  • Use aiofiles for non-blocking writes.
  • Compute the hash on the stream, usable as the integrity/ETag reference.
  • Also set your web server’s client_max_body_size (Nginx, etc.).
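
As an extra guard, an oversize declared Content-Length can be rejected before the body is read (a sketch; the header is client-supplied and absent for chunked bodies, so the streamed limit above stays the real enforcement):

# app/stream.py (addition): fail fast on an oversize declared length
from fastapi import Depends, Request

async def reject_oversize(request: Request) -> None:
    declared = request.headers.get("content-length")
    if declared and declared.isdigit() and int(declared) > LIMIT:
        raise HTTPException(413, "file too large")

# usage: @router.post("/upload", dependencies=[Depends(reject_oversize)])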

Key point

  • Read/write in chunks to keep peak memory constant. Hash concurrently as you stream.

4. For large files: direct upload/download via signed URLs

4.1 Why signed URLs

  • Traffic goes straight to storage instead of through the API server, saving its bandwidth and CPU.
  • Issue short-lived URLs (minutes) with scoped permissions + expiration for safety.

4.2 Flow (upload)

  1. Client: POST /files/init with filename/size/MIME
  2. API: generate and return a signed PUT URL for storage
  3. Client: PUT directly to the signed URL
  4. API: receive completion notification, finalize metadata (persist to DB)

Pseudo-code (issuing the signed URL):

# app/signed.py
from fastapi import APIRouter, HTTPException
from pydantic import BaseModel, Field
from pathlib import Path
from uuid import uuid4

router = APIRouter(prefix="/files", tags=["signed"])

class InitReq(BaseModel):
    filename: str = Field(..., max_length=256)
    size: int = Field(..., ge=1)
    mime: str

@router.post("/init")
async def init_upload(req: InitReq):
    if req.size > 10 * 1024 * 1024 * 1024:  # 10GB
        raise HTTPException(413, "file too large")
    # Perform detailed validation here: MIME, extension, forbidden patterns, etc.
    # Generate the signed URL via your storage SDK (e.g., expiration = 5 minutes)
    signed_put_url = "<signed-put-url>"
    object_key = f"user-uploads/{uuid4().hex}-{Path(req.filename).name}"  # never build keys from raw user paths
    return {"object_key": object_key, "put_url": signed_put_url, "expires_in": 300}
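
Concretely, an S3-compatible SDK produces these URLs (a minimal sketch assuming boto3 and a bucket named "my-bucket"; endpoint and credentials come from your environment):

# app/signed_s3.py (illustrative)
import boto3

s3 = boto3.client("s3")  # pass endpoint_url=... for non-AWS S3-compatible stores

def make_put_url(object_key: str, mime: str, expires: int = 300) -> str:
    # Short-lived URL scoped to a single PUT of a single key
    return s3.generate_presigned_url(
        "put_object",
        Params={"Bucket": "my-bucket", "Key": object_key, "ContentType": mime},
        ExpiresIn=expires,
    )

def make_get_url(object_key: str, expires: int = 300) -> str:
    return s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": "my-bucket", "Key": object_key},
        ExpiresIn=expires,
    )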

For downloads, issue a signed GET URL. Let storage/CDN handle Range and compression.
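
On the client side, step 3 of the flow is a plain HTTP PUT to the issued URL (a sketch with httpx; it reads the file into memory for brevity, while a real client would stream or use the store's multipart upload):

# client_put.py (illustrative)
import httpx

def direct_put(put_url: str, path: str, mime: str) -> None:
    with open(path, "rb") as fh:
        resp = httpx.put(put_url, content=fh.read(), headers={"Content-Type": mime})
    resp.raise_for_status()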

Key point

  • The API issues URLs and focuses on metadata & auth. It doesn’t touch giant payloads.

5. Downloads: Range support and conditional responses

5.1 Direct serving (small/medium)

# app/download.py
from fastapi import APIRouter, HTTPException
from fastapi.responses import FileResponse
from pathlib import Path

router = APIRouter(prefix="/download", tags=["download"])

BASE = Path("streams")

@router.get("/file/{name}")
async def get_file(name: str):
    path = BASE / Path(name).name  # drop any client-supplied path parts
    if not path.exists():
        raise HTTPException(404, "not found")
    # For small files, FileResponse is fine
    return FileResponse(path, media_type="application/octet-stream", filename=name)

5.2 Range (partial fetch)

Prefer delegating Range to the web server/storage/CDN; implement yourself only if necessary. For large files, serve directly from storage.
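
If you do need it in the app, a minimal single-range handler looks roughly like this (a sketch; it handles only `bytes=start-end`, with no multi-range or If-Range support, and buffers the slice in memory):

# app/range.py (illustrative minimal Range handling)
from pathlib import Path

from fastapi import APIRouter, Header, HTTPException
from fastapi.responses import Response

router = APIRouter(prefix="/range", tags=["range"])
BASE = Path("streams")

@router.get("/file/{name}")
async def get_range(name: str,
                    range_header: str | None = Header(default=None, alias="Range")):
    path = BASE / Path(name).name  # drop any client-supplied path parts
    if not path.exists():
        raise HTTPException(404, "not found")
    size = path.stat().st_size
    if range_header is None:
        return Response(path.read_bytes(), media_type="application/octet-stream",
                        headers={"Accept-Ranges": "bytes"})
    try:
        unit, _, spec = range_header.partition("=")
        start_s, _, end_s = spec.partition("-")
        start = int(start_s)
        end = int(end_s) if end_s else size - 1
    except ValueError:
        raise HTTPException(416, "invalid range")
    if unit != "bytes" or start > end or end >= size:
        raise HTTPException(416, "invalid range")
    with path.open("rb") as f:
        f.seek(start)
        body = f.read(end - start + 1)
    return Response(body, status_code=206, media_type="application/octet-stream",
                    headers={"Content-Range": f"bytes {start}-{end}/{size}",
                             "Accept-Ranges": "bytes"})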

Key point

  • Only small files should be served by the API. Push Range down to Nginx/CDN/storage.

6. Security: a layered-defense checklist

  • Max size limits: enforce at app, Nginx, and reverse-proxy layers.
  • MIME + extension dual checks: reduce spoofing.
  • Fixed output extensions/paths: never use user input raw (prevents directory traversal; a helper sketch follows this list).
  • Virus scan: cloud AV/external scanning API/worker-based async scans.
  • Hash verification: record ETag (sha256, etc.) in metadata at receipt.
  • Short-lived, scope-limited signed URLs: least-privilege principle.
  • Logging & traceability: capture uploader, object key, hash, IP, timestamps.
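
One way to pin down output paths (a sketch of the fixed-name idea; the extension allow-list and key format are illustrative):

# app/safename.py (illustrative)
from pathlib import Path
from uuid import uuid4

ALLOWED_EXT = {".png", ".jpg", ".jpeg", ".gif", ".pdf"}

def safe_object_key(user_filename: str, prefix: str = "user-uploads") -> str:
    # Use only the extension of the final path component; discard the rest
    ext = Path(user_filename).suffix.lower()
    if ext not in ALLOWED_EXT:
        raise ValueError("unsupported extension")
    # Server-generated name: user input never reaches the filesystem or object key
    return f"{prefix}/{uuid4().hex}{ext}"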

Key point

  • Place defenses at input → storage → delivery. Short lifetimes and least privilege are foundational.

7. Metadata and integrity management

7.1 Suggested schema

  • id (internal ID)
  • object_key (storage key)
  • filename (display)
  • mime
  • size
  • etag (sha256, etc.)
  • owner_id (for authorization)
  • state (init / uploaded / verified / ready)
  • created_at, updated_at
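
In code, the record might look like this (a Pydantic sketch mirroring the list above; field types are assumptions):

# app/models.py (illustrative)
from datetime import datetime
from typing import Literal
from pydantic import BaseModel

State = Literal["init", "uploaded", "verified", "ready"]

class FileMeta(BaseModel):
    id: int
    object_key: str
    filename: str   # display name only; never used as a path
    mime: str
    size: int
    etag: str       # e.g., sha256 hex recorded at receipt
    owner_id: int
    state: State
    created_at: datetime
    updated_at: datetime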

7.2 State transitions

  • init (URL issued) → client PUTs → uploaded
  • Worker runs virus scan + hash verification → verified
  • Transcoding / thumbnail generation → ready
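
A small guard keeps workers from applying transitions out of order (a sketch that pairs with the states above):

# app/states.py (illustrative)
ALLOWED = {
    "init": {"uploaded"},
    "uploaded": {"verified"},
    "verified": {"ready"},
    "ready": set(),
}

def transition(current: str, target: str) -> str:
    # Raise on anything outside the init → uploaded → verified → ready chain
    if target not in ALLOWED.get(current, set()):
        raise ValueError(f"illegal transition: {current} -> {target}")
    return target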

Key point

  • Manage with a state machine. It stays traceable even with async jobs.

8. Working with async jobs (offload heavy work)

  • Virus scanning, thumbnailing, video transcoding belong to workers.
  • On failure, enable retries (exponential backoff) and manual requeue.
  • On success, webhook/notify to update the front end.
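
For the retry side, a generic exponential-backoff wrapper covers many cases (a sketch; queues like Celery or RQ ship their own retry settings):

# app/retry.py (illustrative)
import asyncio
import random

async def retry_async(fn, attempts: int = 5, base: float = 1.0):
    # Retry fn() with exponential backoff plus jitter
    for i in range(attempts):
        try:
            return await fn()
        except Exception:
            if i == attempts - 1:
                raise
            await asyncio.sleep(base * (2 ** i) + random.random())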

Key point

  • The API handles intake and notifications; the jobs absorb the heavy lifting.

9. Monitoring & operations

  • Metrics: upload counts, failure rate, avg size, signed-URL issuances, virus detections.
  • Alerts: spikes in 413s, signed-URL expiry ratio, streaks of scan failures.
  • Dashboards: capacity trends, extension distribution, MIME distribution.
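
As one option, these can be exported with prometheus_client (a sketch; metric names are illustrative):

# app/metrics.py (illustrative)
from prometheus_client import Counter, Histogram

UPLOADS = Counter("uploads_total", "Completed uploads")
UPLOAD_FAILURES = Counter("upload_failures_total", "Failed uploads")
SIGNED_URLS = Counter("signed_urls_issued_total", "Signed URLs issued")
VIRUS_HITS = Counter("virus_detections_total", "Files flagged by AV scan")
UPLOAD_BYTES = Histogram("upload_size_bytes", "Upload size distribution",
                         buckets=(1e6, 1e7, 1e8, 1e9))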

Key point

  • Use numbers to spot growth and bottlenecks. Early warning is vital.

10. Sample: a minimal app wired together

# app/main.py
from fastapi import FastAPI
from app.stream import router as stream_router
from app.download import router as download_router
from app.signed import router as signed_router

app = FastAPI(title="File Handling Best Practices")
app.include_router(stream_router)
app.include_router(download_router)
app.include_router(signed_router)

@app.get("/health")
def health():
    return {"ok": True}

11. Common pitfalls and fixes

Symptom → cause → fix:

  • Memory spikes → reading everything at once → switch to streaming; tune chunk size
  • Spoofed files slip in → extension-only checks → dual-check MIME + extension; reject risky MIME
  • Range delivery is costly → the API is serving content → deliver directly from CDN/storage via signed URLs
  • URL leaks & abuse → long-lived, overly broad permissions → short-lived, scoped signed URLs
  • Poor incident forensics → thin logging → store structured metadata + audit logs

12. Adoption roadmap

  1. Small files: size caps + MIME checks, fixed directory storage.
  2. Medium files: async streaming, hashing, use FileResponse appropriately.
  3. Large files: signed URLs for direct PUT/GET; API specializes in metadata.
  4. Security: AV scans, short-lived URLs, least privilege.
  5. Operations: metrics, alerts, dashboards.

13. Search keywords (good entry points)

  • FastAPI UploadFile, StreamingResponse, FileResponse
  • Content-Range / Range requests
  • S3 pre-signed URL / signed URL generation
  • python-magic MIME detection
  • Virus scanning API / clamav
  • ETag / Content-Length integrity
  • Nginx client_max_body_size / proxy_read_timeout
  • CDN Range / cache control

Wrap-up

  • Switch the path by size and purpose: small files safely via API, stream medium files, and use signed URLs for direct transfer of large files.
  • Layer security: size caps, dual MIME/extension checks, short-lived URLs, AV scans, hashes.
  • Delegate delivery to storage/CDN while the API focuses on auth, metadata, notifications. That’s the shortest path to fast and resilient implementations.

By greeden
