Become Robust with Large Files: Designing FastAPI Upload/Download — Streaming, Range Support, Signed URLs, Virus Scanning, and Integrity Verification
Summary (the big picture up front)
- Split files into Small (~10 MB), Medium (~100 MB), and Large (GB-class), and switch between handling on the API server vs. streaming directly to external storage.
- Keep memory usage steady with asynchronous streaming, and ensure integrity with Content-Length / ETag.
- Offload Range (partial fetches) and compression/transcoding to storage or the CDN. Let the API focus on metadata management and issuing signed URLs.
- For security, layer defenses with both extension and MIME validation, file-size ceilings, virus scanning, and pre-save hashing.
Who benefits from this
- Learner A (graduation project: a proto video platform)
  Wants to learn the basics of safely handling large files in a small API.
- Small Team B (3-person agency)
  Needs to accept bulk image uploads from clients without hogging server memory.
- SaaS Dev C (startup)
  Plans to use S3-compatible storage, centering on signed URLs for direct PUT/GET.
1. Strategy: switch by size and path
1.1 Three classes
- Small (~10 MB): Receive on the API server → save to storage. Validation is simple.
- Medium (~100 MB): The API server streams end-to-end. Must watch CPU/memory/I/O.
- Large (GB-class): Use signed URLs (S3-compatible, etc.) so the client uploads/downloads directly. API only handles metadata and authorization.
1.2 Role separation
- API: authorization, metadata, pre/post hooks (virus scan, trigger thumbnail jobs)
- Storage/CDN: persistence, range delivery, compression, caching
- Worker: heavy transforms and virus scans
Key point
- Design the route first. Assume heavy I/O is pushed out of the API.
2. For small files: basic upload via form-data
```python
# app/main.py
from fastapi import FastAPI, UploadFile, File, HTTPException
from pathlib import Path
import magic  # python-magic; for MIME sniffing (optional to install)
import hashlib

app = FastAPI(title="File Upload Basics")
BASE = Path("uploads")
BASE.mkdir(exist_ok=True)
MAX_MB = 10

def _digest_and_validate(file_bytes: bytes):
    # MIME detection (don't rely on extension alone)
    mime = magic.from_buffer(file_bytes[:4096], mime=True)
    # Example: allow images only
    if not mime.startswith("image/"):
        raise HTTPException(400, "unsupported media type")
    # Content hash (can serve as an ETag basis)
    etag = hashlib.sha256(file_bytes).hexdigest()
    # Extend here for checks like disguised image formats, etc.
    return mime, etag

@app.post("/upload")
async def upload(file: UploadFile = File(...)):
    size_hint = 0
    chunks = []
    while True:
        chunk = await file.read(1024 * 1024)  # 1MB
        if not chunk:
            break
        chunks.append(chunk)
        size_hint += len(chunk)
        if size_hint > MAX_MB * 1024 * 1024:
            raise HTTPException(413, "file too large")
    content = b"".join(chunks)
    mime, etag = _digest_and_validate(content)
    # Use only the basename; never trust a client-supplied path
    safe_name = Path(file.filename).name
    dest = BASE / safe_name
    dest.write_bytes(content)
    return {"filename": safe_name, "size": size_hint, "mime": mime, "etag": etag}
```
- Dual validation (extension + MIME) reduces spoofing.
- Reject early with 413 Payload Too Large.
- For small files, reading all at once can be acceptable—still watch memory usage.
Key point
- For small files, prioritize correctness and safety first. MIME checks and size limits are step one.
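The "extension + MIME" dual check can be sketched as a small allow-list helper. The `ALLOWED` map and the function name here are illustrative, not part of the handler above; the idea is simply that the extension must be allow-listed *and* agree with the sniffed MIME type:

```python
from pathlib import Path

# Illustrative allow-list: extension -> expected sniffed MIME type
ALLOWED = {".png": "image/png", ".jpg": "image/jpeg", ".jpeg": "image/jpeg", ".gif": "image/gif"}

def ext_and_mime_agree(filename: str, sniffed_mime: str) -> bool:
    # Accept only when the extension is allow-listed AND matches the sniffed MIME
    expected = ALLOWED.get(Path(filename).suffix.lower())
    return expected is not None and expected == sniffed_mime
```

A renamed `.exe` fails because its sniffed MIME won't match the extension's expected type, and an unknown extension fails outright.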
3. For medium files: async streaming keeps memory steady
```python
# app/stream.py
from fastapi import APIRouter, UploadFile, File, HTTPException
from pathlib import Path
import aiofiles
import hashlib

router = APIRouter(prefix="/stream", tags=["stream"])
CHUNK = 1024 * 1024          # 1MB
LIMIT = 100 * 1024 * 1024    # 100MB
DEST = Path("streams")
DEST.mkdir(exist_ok=True)

@router.post("/upload")
async def upload_stream(file: UploadFile = File(...)):
    size = 0
    sha256 = hashlib.sha256()
    safe_name = Path(file.filename).name  # basename only; never trust client paths
    dest = DEST / safe_name
    try:
        async with aiofiles.open(dest, "wb") as f:
            while True:
                chunk = await file.read(CHUNK)
                if not chunk:
                    break
                size += len(chunk)
                if size > LIMIT:
                    raise HTTPException(413, "file too large")
                sha256.update(chunk)
                await f.write(chunk)
    except HTTPException:
        dest.unlink(missing_ok=True)  # discard the partial file on rejection
        raise
    return {"filename": safe_name, "size": size, "etag": sha256.hexdigest()}
```
- Use aiofiles for non-blocking writes.
- Compute the hash on the stream, usable as the integrity/ETag reference.
- Also set your web server's body-size limit (e.g., client_max_body_size in Nginx).
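As a reference, a minimal Nginx excerpt; the values are examples to tune per deployment, not recommendations:

```nginx
# nginx.conf (excerpt) — example values
client_max_body_size 100m;   # reject oversized bodies before they reach FastAPI
proxy_read_timeout   300s;   # allow slow uploads/downloads to complete
```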
Key point
- Read/write in chunks to keep peak memory constant. Hash concurrently as you stream.
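The incremental hash is byte-for-byte identical to hashing the whole file at once, which is what makes it safe to compute while streaming. A quick self-contained check (the data and chunk size are arbitrary):

```python
import hashlib

def stream_sha256(chunks) -> str:
    # Fold chunks into one digest, exactly as the upload loop does
    h = hashlib.sha256()
    for chunk in chunks:
        h.update(chunk)
    return h.hexdigest()

data = bytes(range(256)) * 10_000                           # ~2.5 MB of sample data
chunks = [data[i:i + 65536] for i in range(0, len(data), 65536)]
assert stream_sha256(chunks) == hashlib.sha256(data).hexdigest()
```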
4. For large files: direct upload/download via signed URLs
4.1 Why signed URLs
- Avoid routing giant payloads through the API server, sparing its bandwidth and CPU.
- Issue short-lived URLs (minutes) with scoped permissions + expiration for safety.
4.2 Flow (upload)
- Client: POST /files/init with filename/size/MIME
- API: generate and return a signed PUT URL for storage
- Client: PUT directly to the signed URL
- API: receive the completion notification and finalize metadata (persist to DB)
Pseudo-code (issuing the signed URL):
```python
# app/signed.py
from fastapi import APIRouter, HTTPException
from pydantic import BaseModel, Field

router = APIRouter(prefix="/files", tags=["signed"])

class InitReq(BaseModel):
    filename: str = Field(..., max_length=256)
    size: int = Field(..., ge=1)
    mime: str

@router.post("/init")
async def init_upload(req: InitReq):
    if req.size > 10 * 1024 * 1024 * 1024:  # 10GB
        raise HTTPException(413, "file too large")
    # Perform detailed validation here: MIME, extension, forbidden patterns, etc.
    # Generate the signed URL via your storage SDK (e.g., expiration = 5 minutes)
    signed_put_url = "<signed-put-url>"
    object_key = f"user-uploads/{req.filename}"  # sanitize or replace the name in production
    return {"object_key": object_key, "put_url": signed_put_url, "expires_in": 300}
```
For downloads, issue a signed GET URL. Let storage/CDN handle Range and compression.
Key point
- The API issues URLs and focuses on metadata & auth. It doesn’t touch giant payloads.
5. Downloads: Range support and conditional responses
5.1 Direct serving (small/medium)
```python
# app/download.py
from fastapi import APIRouter, HTTPException
from fastapi.responses import FileResponse
from pathlib import Path

router = APIRouter(prefix="/download", tags=["download"])
BASE = Path("streams")

@router.get("/file/{name}")
async def get_file(name: str):
    path = BASE / Path(name).name  # basename only: blocks path traversal
    if not path.exists():
        raise HTTPException(404, "not found")
    # For small files, FileResponse is fine
    return FileResponse(path, media_type="application/octet-stream", filename=path.name)
```
5.2 Range (partial fetch)
Prefer delegating Range to the web server/storage/CDN; implement yourself only if necessary. For large files, serve directly from storage.
Key point
- Only small files should be served by the API. Push Range down to Nginx/CDN/storage.
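If you do have to answer Range yourself, the parsing step looks roughly like this. This sketch handles only a single range (the header also allows comma-separated multi-ranges, ignored here), returning an inclusive byte pair or None when the caller should answer 416 or fall back to a full 200:

```python
def parse_range(header: str, size: int):
    """Parse a single 'bytes=start-end' Range header against a file of `size` bytes."""
    if not header.startswith("bytes=") or size <= 0:
        return None
    spec = header[len("bytes="):]
    if "," in spec:                      # multiple ranges: out of scope for this sketch
        return None
    start_s, _, end_s = spec.partition("-")
    if start_s == "":                    # suffix form: "bytes=-N" means the last N bytes
        if not end_s.isdigit() or int(end_s) == 0:
            return None
        return (max(size - int(end_s), 0), size - 1)
    if not start_s.isdigit():
        return None
    start = int(start_s)
    end = int(end_s) if end_s.isdigit() else size - 1   # open-ended "start-"
    if start >= size or start > end:
        return None
    return (start, min(end, size - 1))
```

The resulting pair maps directly onto a 206 response with `Content-Range: bytes {start}-{end}/{size}` — but again, prefer letting Nginx/storage/CDN do this.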
6. Security: a layered-defense checklist
- Max size limits: enforce at app, Nginx, and reverse-proxy layers.
- MIME + extension dual checks: reduce spoofing.
- Fixed output extensions/paths: never use user input raw (prevents directory traversal).
- Virus scan: cloud AV/external scanning API/worker-based async scans.
- Hash verification: record ETag (sha256, etc.) in metadata at receipt.
- Short-lived, scope-limited signed URLs: least-privilege principle.
- Logging & traceability: capture uploader, object key, hash, IP, timestamps.
Key point
- Place defenses at input → storage → delivery. Short lifetimes and least privilege are foundational.
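The "fixed output paths" item can be made concrete by generating object keys server-side and keeping at most a validated extension from user input. The helper name and prefix below are illustrative:

```python
import re
import uuid
from pathlib import Path

def safe_object_key(filename: str, prefix: str = "user-uploads") -> str:
    # Keep only a sanitized extension from user input; generate the basename server-side
    ext = Path(filename).suffix.lower()
    if not re.fullmatch(r"\.[a-z0-9]{1,8}", ext):
        ext = ""  # drop missing or suspicious extensions
    return f"{prefix}/{uuid.uuid4().hex}{ext}"
```

Because the basename is a fresh UUID, traversal sequences and duplicate names from clients can never reach the filesystem or the bucket layout.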
7. Metadata and integrity management
7.1 Suggested schema
- id (internal ID)
- object_key (storage key)
- filename (display)
- mime
- size
- etag (sha256, etc.)
- owner_id (for authorization)
- state (init / uploaded / verified / ready)
- created_at, updated_at
7.2 State transitions
- init (URL issued) → client PUTs → uploaded
- Worker runs virus scan + hash verification → verified
- Transcoding / thumbnail generation → ready
Key point
- Manage with a state machine. It stays traceable even with async jobs.
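The transition table above can be enforced in a few lines. The state names follow the schema in 7.1; the helper name is made up:

```python
# Legal transitions from 7.2; anything else is a bug or a replayed event
ALLOWED_TRANSITIONS = {
    "init": {"uploaded"},
    "uploaded": {"verified"},
    "verified": {"ready"},
    "ready": set(),
}

def transition(current: str, target: str) -> str:
    # Refuse skips and replays so async workers can't corrupt the record
    if target not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition: {current} -> {target}")
    return target
```

Rejecting illegal moves (e.g., init → ready) is what keeps duplicate webhook deliveries and out-of-order worker results harmless.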
8. Working with async jobs (offload heavy work)
- Virus scanning, thumbnailing, video transcoding belong to workers.
- On failure, enable retries (exponential backoff) and manual requeue.
- On success, webhook/notify to update the front end.
Key point
- The API handles intake and notifications; the jobs absorb the heavy lifting.
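The retry policy can be made concrete with a tiny delay schedule. The base and cap values are examples; real schedulers usually add random jitter, omitted here so the output stays deterministic:

```python
def backoff_delays(retries: int, base: float = 1.0, cap: float = 60.0) -> list[float]:
    # Exponential backoff: base * 2**attempt, clamped to cap seconds
    return [min(base * (2 ** attempt), cap) for attempt in range(retries)]
```

The cap keeps late retries from drifting into hour-long waits while still spacing them out enough to let a flaky scanner or storage endpoint recover.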
9. Monitoring & operations
- Metrics: upload counts, failure rate, avg size, signed-URL issuances, virus detections.
- Alerts: spikes in 413s, signed-URL expiry ratio, streaks of scan failures.
- Dashboards: capacity trends, extension distribution, MIME distribution.
Key point
- Use numbers to spot growth and bottlenecks. Early warning is vital.
10. Sample: a minimal app wired together
```python
# app/main.py
from fastapi import FastAPI
from app.stream import router as stream_router
from app.download import router as download_router
from app.signed import router as signed_router

app = FastAPI(title="File Handling Best Practices")
app.include_router(stream_router)
app.include_router(download_router)
app.include_router(signed_router)

@app.get("/health")
def health():
    return {"ok": True}
```
11. Common pitfalls and fixes
| Symptom | Cause | Fix |
|---|---|---|
| Memory spikes | Read-all-at-once | Switch to streaming; tune chunk size |
| Spoofed files slip in | Extension-only checks | Dual-check MIME + extension; reject risky MIME |
| Range delivery is costly | API is serving content | CDN/storage direct delivery with signed URLs |
| URL leaks & abuse | Long-lived, overly broad permissions | Short-lived, scoped signed URLs |
| Poor incident forensics | Thin logging | Store structured metadata + audit logs |
12. Adoption roadmap
- Small files: size caps + MIME checks, fixed directory storage.
- Medium files: async streaming, hashing, use FileResponse appropriately.
- Large files: signed URLs for direct PUT/GET; API specializes in metadata.
- Security: AV scans, short-lived URLs, least privilege.
- Operations: metrics, alerts, dashboards.
13. Search keywords (good entry points)
- FastAPI UploadFile, StreamingResponse, FileResponse
- Content-Range / Range requests
- S3 pre-signed URL / signed URL generation
- python-magic MIME detection
- Virus scanning API / clamav
- ETag / Content-Length integrity
- Nginx client_max_body_size / proxy_read_timeout
- CDN Range / cache control
Wrap-up
- Switch the path by size and purpose: small files safely via API, stream medium files, and use signed URLs for direct transfer of large files.
- Layer security: size caps, dual MIME/extension checks, short-lived URLs, AV scans, hashes.
- Delegate delivery to storage/CDN while the API focuses on auth, metadata, notifications. That’s the shortest path to fast and resilient implementations.