Become Robust with Large Files: Designing FastAPI Upload/Download — Streaming, Range Support, Signed URLs, Virus Scanning, and Integrity Verification
Summary (the big picture up front)
- Split files into Small (~10 MB), Medium (~100 MB), and Large (GB-class), and switch between handling on the API server vs. streaming directly to external storage.
- Keep memory usage steady with asynchronous streaming, and ensure integrity with Content-Length / ETag.
- Offload Range (partial fetches) and compression/transcoding to storage or the CDN. Let the API focus on metadata management and issuing signed URLs.
- For security, layer defenses with both extension and MIME validation, file-size ceilings, virus scanning, and pre-save hashing.
Who benefits from this
- Learner A (graduation project: a proto video platform)
  Wants to learn the basics of safely handling large files in a small API.
- Small Team B (3-person agency)
  Needs to accept bulk image uploads from clients without hogging server memory.
- SaaS Dev C (startup)
  Plans to use S3-compatible storage, centering on signed URLs for direct PUT/GET.
1. Strategy: switch by size and path
1.1 Three classes
- Small (~10 MB): Receive on the API server → save to storage. Validation is simple.
- Medium (~100 MB): The API server streams end-to-end. Must watch CPU/memory/I/O.
- Large (GB-class): Use signed URLs (S3-compatible, etc.) so the client uploads/downloads directly. API only handles metadata and authorization.
1.2 Role separation
- API: authorization, metadata, pre/post hooks (virus scan, trigger thumbnail jobs)
- Storage/CDN: persistence, range delivery, compression, caching
- Worker: heavy transforms and virus scans
Key point
- Design the route first. Assume heavy I/O is pushed out of the API.
2. For small files: basic upload via form-data
```python
# app/main.py
from fastapi import FastAPI, UploadFile, File, HTTPException
from pathlib import Path
import magic  # python-magic; for MIME sniffing (optional to install)
import hashlib

app = FastAPI(title="File Upload Basics")
BASE = Path("uploads")
BASE.mkdir(exist_ok=True)
MAX_MB = 10

def _digest_and_validate(file_bytes: bytes):
    # MIME detection (don't rely on extension alone)
    mime = magic.from_buffer(file_bytes[:4096], mime=True)
    # Example: allow images only
    if not mime.startswith("image/"):
        raise HTTPException(400, "unsupported media type")
    # Content hash (can serve as an ETag basis)
    etag = hashlib.sha256(file_bytes).hexdigest()
    # Extend here for checks like disguised image formats, etc.
    return mime, etag

@app.post("/upload")
async def upload(file: UploadFile = File(...)):
    size_hint = 0
    chunks = []
    while True:
        chunk = await file.read(1024 * 1024)  # 1MB
        if not chunk:
            break
        chunks.append(chunk)
        size_hint += len(chunk)
        if size_hint > MAX_MB * 1024 * 1024:
            raise HTTPException(413, "file too large")
    content = b"".join(chunks)
    mime, etag = _digest_and_validate(content)
    # Use only the basename; never trust a client-supplied path
    safe_name = Path(file.filename).name
    dest = BASE / safe_name
    dest.write_bytes(content)
    return {"filename": safe_name, "size": size_hint, "mime": mime, "etag": etag}
```
- Dual validation (extension + MIME) reduces spoofing.
- Reject early with 413 Payload Too Large.
- For small files, reading all at once can be acceptable—still watch memory usage.
Key point
- For small files, prioritize correctness and safety first. MIME checks and size limits are step one.
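The "extension + MIME" dual check can be sketched as a small allow-list helper. The `ALLOWED` map and the function name here are illustrative, not part of the handler above; the idea is simply that the extension must be allow-listed *and* agree with the sniffed MIME type:

```python
from pathlib import Path

# Illustrative allow-list: extension -> expected sniffed MIME type
ALLOWED = {".png": "image/png", ".jpg": "image/jpeg", ".jpeg": "image/jpeg", ".gif": "image/gif"}

def ext_and_mime_agree(filename: str, sniffed_mime: str) -> bool:
    # Accept only when the extension is allow-listed AND matches the sniffed MIME
    expected = ALLOWED.get(Path(filename).suffix.lower())
    return expected is not None and expected == sniffed_mime
```

A renamed `.exe` fails because its sniffed MIME won't match the extension's expected type, and an unknown extension fails outright.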
3. For medium files: async streaming keeps memory steady
```python
# app/stream.py
from fastapi import APIRouter, UploadFile, File, HTTPException
from pathlib import Path
import aiofiles
import hashlib

router = APIRouter(prefix="/stream", tags=["stream"])
CHUNK = 1024 * 1024          # 1MB
LIMIT = 100 * 1024 * 1024    # 100MB
DEST = Path("streams")
DEST.mkdir(exist_ok=True)

@router.post("/upload")
async def upload_stream(file: UploadFile = File(...)):
    size = 0
    sha256 = hashlib.sha256()
    safe_name = Path(file.filename).name  # basename only; never trust client paths
    dest = DEST / safe_name
    try:
        async with aiofiles.open(dest, "wb") as f:
            while True:
                chunk = await file.read(CHUNK)
                if not chunk:
                    break
                size += len(chunk)
                if size > LIMIT:
                    raise HTTPException(413, "file too large")
                sha256.update(chunk)
                await f.write(chunk)
    except HTTPException:
        dest.unlink(missing_ok=True)  # discard the partial file on rejection
        raise
    return {"filename": safe_name, "size": size, "etag": sha256.hexdigest()}
```
- Use aiofiles for non-blocking writes.
- Compute the hash on the stream, usable as the integrity/ETag reference.
- Also set your web server's body-size limit (e.g., client_max_body_size in Nginx).
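As a reference, a minimal Nginx excerpt; the values are examples to tune per deployment, not recommendations:

```nginx
# nginx.conf (excerpt) — example values
client_max_body_size 100m;   # reject oversized bodies before they reach FastAPI
proxy_read_timeout   300s;   # allow slow uploads/downloads to complete
```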
Key point
- Read/write in chunks to keep peak memory constant. Hash concurrently as you stream.
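The incremental hash is byte-for-byte identical to hashing the whole file at once, which is what makes it safe to compute while streaming. A quick self-contained check (the data and chunk size are arbitrary):

```python
import hashlib

def stream_sha256(chunks) -> str:
    # Fold chunks into one digest, exactly as the upload loop does
    h = hashlib.sha256()
    for chunk in chunks:
        h.update(chunk)
    return h.hexdigest()

data = bytes(range(256)) * 10_000                           # ~2.5 MB of sample data
chunks = [data[i:i + 65536] for i in range(0, len(data), 65536)]
assert stream_sha256(chunks) == hashlib.sha256(data).hexdigest()
```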
4. For large files: direct upload/download via signed URLs
4.1 Why signed URLs
- Avoid routing giant payloads through the API server, sparing its bandwidth and CPU.
- Issue short-lived URLs (minutes) with scoped permissions + expiration for safety.
4.2 Flow (upload)
- Client: POST /files/init with filename/size/MIME
- API: generate and return a signed PUT URL for storage
- Client: PUT directly to the signed URL
- API: receive the completion notification and finalize metadata (persist to DB)
Pseudo-code (issuing the signed URL):
```python
# app/signed.py
from fastapi import APIRouter, HTTPException
from pydantic import BaseModel, Field

router = APIRouter(prefix="/files", tags=["signed"])

class InitReq(BaseModel):
    filename: str = Field(..., max_length=256)
    size: int = Field(..., ge=1)
    mime: str

@router.post("/init")
async def init_upload(req: InitReq):
    if req.size > 10 * 1024 * 1024 * 1024:  # 10GB
        raise HTTPException(413, "file too large")
    # Perform detailed validation here: MIME, extension, forbidden patterns, etc.
    # Generate the signed URL via your storage SDK (e.g., expiration = 5 minutes)
    signed_put_url = "<signed-put-url>"
    object_key = f"user-uploads/{req.filename}"  # sanitize or replace the name in production
    return {"object_key": object_key, "put_url": signed_put_url, "expires_in": 300}
```
For downloads, issue a signed GET URL. Let storage/CDN handle Range and compression.
Key point
- The API issues URLs and focuses on metadata & auth. It doesn’t touch giant payloads.
5. Downloads: Range support and conditional responses
5.1 Direct serving (small/medium)
```python
# app/download.py
from fastapi import APIRouter, HTTPException
from fastapi.responses import FileResponse
from pathlib import Path

router = APIRouter(prefix="/download", tags=["download"])
BASE = Path("streams")

@router.get("/file/{name}")
async def get_file(name: str):
    path = BASE / Path(name).name  # basename only: blocks path traversal
    if not path.exists():
        raise HTTPException(404, "not found")
    # For small files, FileResponse is fine
    return FileResponse(path, media_type="application/octet-stream", filename=path.name)
```
5.2 Range (partial fetch)
Prefer delegating Range to the web server/storage/CDN; implement yourself only if necessary. For large files, serve directly from storage.
Key point
- Only small files should be served by the API. Push Range down to Nginx/CDN/storage.
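If you do have to answer Range yourself, the parsing step looks roughly like this. This sketch handles only a single range (the header also allows comma-separated multi-ranges, ignored here), returning an inclusive byte pair or None when the caller should answer 416 or fall back to a full 200:

```python
def parse_range(header: str, size: int):
    """Parse a single 'bytes=start-end' Range header against a file of `size` bytes."""
    if not header.startswith("bytes=") or size <= 0:
        return None
    spec = header[len("bytes="):]
    if "," in spec:                      # multiple ranges: out of scope for this sketch
        return None
    start_s, _, end_s = spec.partition("-")
    if start_s == "":                    # suffix form: "bytes=-N" means the last N bytes
        if not end_s.isdigit() or int(end_s) == 0:
            return None
        return (max(size - int(end_s), 0), size - 1)
    if not start_s.isdigit():
        return None
    start = int(start_s)
    end = int(end_s) if end_s.isdigit() else size - 1   # open-ended "start-"
    if start >= size or start > end:
        return None
    return (start, min(end, size - 1))
```

The resulting pair maps directly onto a 206 response with `Content-Range: bytes {start}-{end}/{size}` — but again, prefer letting Nginx/storage/CDN do this.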
6. Security: a layered-defense checklist
- Max size limits: enforce at app, Nginx, and reverse-proxy layers.
- MIME + extension dual checks: reduce spoofing.
- Fixed output extensions/paths: never use user input raw (prevents directory traversal).
- Virus scan: cloud AV/external scanning API/worker-based async scans.
- Hash verification: record ETag (sha256, etc.) in metadata at receipt.
- Short-lived, scope-limited signed URLs: least-privilege principle.
- Logging & traceability: capture uploader, object key, hash, IP, timestamps.
Key point
- Place defenses at input → storage → delivery. Short lifetimes and least privilege are foundational.
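The "fixed output paths" item can be made concrete by generating object keys server-side and keeping at most a validated extension from user input. The helper name and prefix below are illustrative:

```python
import re
import uuid
from pathlib import Path

def safe_object_key(filename: str, prefix: str = "user-uploads") -> str:
    # Keep only a sanitized extension from user input; generate the basename server-side
    ext = Path(filename).suffix.lower()
    if not re.fullmatch(r"\.[a-z0-9]{1,8}", ext):
        ext = ""  # drop missing or suspicious extensions
    return f"{prefix}/{uuid.uuid4().hex}{ext}"
```

Because the basename is a fresh UUID, traversal sequences and duplicate names from clients can never reach the filesystem or the bucket layout.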
7. Metadata and integrity management
7.1 Suggested schema
- id (internal ID)
- object_key (storage key)
- filename (display)
- mime
- size
- etag (sha256, etc.)
- owner_id (for authorization)
- state (init / uploaded / verified / ready)
- created_at, updated_at
7.2 State transitions
- init (URL issued) → client PUTs → uploaded
- Worker runs virus scan + hash verification → verified
- Transcoding / thumbnail generation → ready
Key point
- Manage with a state machine. It stays traceable even with async jobs.
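The transition table above can be enforced in a few lines. The state names follow the schema in 7.1; the helper name is made up:

```python
# Legal transitions from 7.2; anything else is a bug or a replayed event
ALLOWED_TRANSITIONS = {
    "init": {"uploaded"},
    "uploaded": {"verified"},
    "verified": {"ready"},
    "ready": set(),
}

def transition(current: str, target: str) -> str:
    # Refuse skips and replays so async workers can't corrupt the record
    if target not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition: {current} -> {target}")
    return target
```

Rejecting illegal moves (e.g., init → ready) is what keeps duplicate webhook deliveries and out-of-order worker results harmless.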
8. Working with async jobs (offload heavy work)
- Virus scanning, thumbnailing, video transcoding belong to workers.
- On failure, enable retries (exponential backoff) and manual requeue.
- On success, webhook/notify to update the front end.
Key point
- The API handles intake and notifications; the jobs absorb the heavy lifting.
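The retry policy can be made concrete with a tiny delay schedule. The base and cap values are examples; real schedulers usually add random jitter, omitted here so the output stays deterministic:

```python
def backoff_delays(retries: int, base: float = 1.0, cap: float = 60.0) -> list[float]:
    # Exponential backoff: base * 2**attempt, clamped to cap seconds
    return [min(base * (2 ** attempt), cap) for attempt in range(retries)]
```

The cap keeps late retries from drifting into hour-long waits while still spacing them out enough to let a flaky scanner or storage endpoint recover.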
9. Monitoring & operations
- Metrics: upload counts, failure rate, avg size, signed-URL issuances, virus detections.
- Alerts: spikes in 413s, signed-URL expiry ratio, streaks of scan failures.
- Dashboards: capacity trends, extension distribution, MIME distribution.
Key point
- Use numbers to spot growth and bottlenecks. Early warning is vital.
10. Sample: a minimal app wired together
```python
# app/main.py
from fastapi import FastAPI
from app.stream import router as stream_router
from app.download import router as download_router
from app.signed import router as signed_router

app = FastAPI(title="File Handling Best Practices")
app.include_router(stream_router)
app.include_router(download_router)
app.include_router(signed_router)

@app.get("/health")
def health():
    return {"ok": True}
```
11. Common pitfalls and fixes
| Symptom | Cause | Fix |
|---|---|---|
| Memory spikes | Read-all-at-once | Switch to streaming; tune chunk size |
| Spoofed files slip in | Extension-only checks | Dual-check MIME + extension; reject risky MIME |
| Range delivery is costly | API is serving content | CDN/storage direct delivery with signed URLs |
| URL leaks & abuse | Long-lived, overly broad permissions | Short-lived, scoped signed URLs |
| Poor incident forensics | Thin logging | Store structured metadata + audit logs |
12. Adoption roadmap
- Small files: size caps + MIME checks, fixed directory storage.
- Medium files: async streaming, hashing, use FileResponse appropriately.
- Large files: signed URLs for direct PUT/GET; API specializes in metadata.
- Security: AV scans, short-lived URLs, least privilege.
- Operations: metrics, alerts, dashboards.
13. Search keywords (good entry points)
- FastAPI UploadFile, StreamingResponse, FileResponse
- Content-Range / Range requests
- S3 pre-signed URL / signed URL generation
- python-magic MIME detection
- Virus scanning API / clamav
- ETag / Content-Length integrity
- Nginx client_max_body_size / proxy_read_timeout
- CDN Range / cache control
Wrap-up
- Switch the path by size and purpose: small files safely via API, stream medium files, and use signed URLs for direct transfer of large files.
- Layer security: size caps, dual MIME/extension checks, short-lived URLs, AV scans, hashes.
- Delegate delivery to storage/CDN while the API focuses on auth, metadata, notifications. That’s the shortest path to fast and resilient implementations.