[Class Report] Introduction to System Development, Week 28 — Intro to Operational Testing & Log Analysis
In Week 28, we worked on operational testing (load and stability checks) of the generative-AI–integrated system we implemented last week, and on visualizing its behavior through log analysis. The aim was to learn validation that assumes real-world operation, and the skills to go from finding issues to making improvements.
■ Instructor’s Opening: “Testing isn’t a one-time task; it’s work you should repeat”
Prof. Tanaka: “Once real people start using your system, unexpected operations and input combinations will appear. It’s crucial to use operational testing to discover what kinds of failures might occur in advance, and to build mechanisms to verify them with logs.”
The class proceeded with the perspective that “reliable services are backed by tests and logs.”
■ Today’s Goals
- Run a simple load test and measure response time and success rate.
- Standardize the log format (timestamp, status, prompt hash, etc.) so it’s aggregatable.
- Visualize “error rate,” “frequent input patterns,” “regeneration frequency,” etc., through log analysis and extract improvement points.
■ Exercise ①: Designing and Running Load Tests (safely in the learning environment)
We began with load-test design. Each team created scenarios like the following:
- Light load: Continuous requests at 1 req/sec for 5 minutes
- Peak load: A burst of 10 req/sec for 30 seconds
- Spike and recovery: Repeated cycles of surge → wait → return to normal
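One rough way to keep these scenarios comparable is to write them down as plain parameters and translate them into arguments for the load script shown below. The field names and helper here are illustrative, not the teams’ actual files:

# Illustrative scenario parameters (rate in req/sec, duration in seconds)
SCENARIOS = {
    "light": {"rate": 1, "duration": 300},                           # 1 req/sec for 5 minutes
    "peak": {"rate": 10, "duration": 30},                            # burst of 10 req/sec for 30 s
    "spike": {"rate": 10, "duration": 10, "rest": 60, "cycles": 3},  # surge -> wait -> normal, repeated
}

def to_script_args(scenario):
    # Translate a scenario into the (n, interval) arguments of send_requests() below
    n = scenario["rate"] * scenario["duration"]
    interval = 1 / scenario["rate"]
    return n, interval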
Then, within the constraints of the learning environment, we ran a simple script to send simulated requests and measured:
- Average response time (ms)
- 95th percentile response time (ms)
- Success rate (HTTP 200 / total requests)
- Number of timeouts and exceptions
Simple load script used in class (simulated, for learning)
import time
import requests

def send_requests(url, n, interval):
    # Send n simulated POST requests at a fixed interval and record (status, detail, latency_ms)
    results = []
    for i in range(n):
        start = time.time()
        try:
            r = requests.post(url, json={"prompt": "test"}, timeout=5)
            latency = (time.time() - start) * 1000
            results.append(("ok", r.status_code, latency))
        except Exception as e:
            latency = (time.time() - start) * 1000
            results.append(("error", str(e), latency))
        time.sleep(interval)
    return results
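To turn the raw results into the four numbers measured above, a small summary helper could look like the following. This is a sketch that assumes the tuple layout of send_requests; the “timed out” string match is a rough heuristic for counting timeout exceptions, and the example URL is a placeholder:

from statistics import mean, quantiles

def summarize(results):
    # results: list of (status, detail, latency_ms) tuples from send_requests()
    latencies = [lat for _, _, lat in results]
    ok = sum(1 for status, detail, _ in results if status == "ok" and detail == 200)
    error_msgs = [str(detail) for status, detail, _ in results if status == "error"]
    # Rough heuristic: requests' timeout exceptions usually mention "timed out" in their message
    timeouts = sum(1 for msg in error_msgs if "timed out" in msg.lower())
    p95 = quantiles(latencies, n=20)[18] if len(latencies) >= 2 else latencies[0]
    return {
        "avg_ms": round(mean(latencies), 1),
        "p95_ms": round(p95, 1),
        "success_rate": ok / len(results),
        "timeouts": timeouts,
        "exceptions": len(error_msgs),
    }

# Example (placeholder URL): light-load scenario, 1 req/sec for 5 minutes
# print(summarize(send_requests("http://localhost:8000/generate", n=300, interval=1.0)))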
Student reaction: “Timeouts increased during peaks. Caching and retry strategies look promising!”
■ Exercise ②: Deciding on a Log Format and Ensuring Consistent Output
We decided on a log format useful for operations and integrated it into the app. The minimal fields adopted this time were:
- timestamp (ISO 8601)
- request_id (unique)
- prompt_hash (for aggregating identical prompts)
- user_id (anonymized or masked)
- status (ok / retry / fallback / error)
- latency_ms
- error_type (if present)
- note (flags for review, etc.)
Logs were output in JSONL (1 line = 1 event) for easier post-processing.
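As a minimal sketch of what writing one such event could look like (the helper name, hashing choice, and file path are ours, not the exact class code):

import hashlib
import json
import uuid
from datetime import datetime, timezone

def log_event(status, prompt, latency_ms, user_id, error_type=None, note=None,
              path="app_logs.jsonl"):
    # Append one JSONL event using the fields adopted in class (illustrative helper)
    event = {
        "timestamp": datetime.now(timezone.utc).isoformat(),  # ISO 8601
        "request_id": str(uuid.uuid4()),                      # unique per request
        "prompt_hash": hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:16],
        "user_id": user_id,                                    # pass an already anonymized/masked id
        "status": status,                                      # ok / retry / fallback / error
        "latency_ms": latency_ms,
        "error_type": error_type,                              # None if not applicable
        "note": note,                                          # flags for review, etc.
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(event, ensure_ascii=False) + "\n")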
■ Exercise ③: Log Analysis — Basic Aggregation with Python
We read collected logs and performed simple analyses to check frequent errors and response-time distributions. Here’s a snippet used in class (learning example):
import json
from collections import Counter
from statistics import mean, median
logs = []
with open("app_logs.jsonl", "r", encoding="utf-8") as f:
for line in f:
logs.append(json.loads(line))
# Success rate
total = len(logs)
ok = sum(1 for l in logs if l["status"] == "ok")
error = total - ok
print(f"total={total}, ok={ok}, error={error}, error_rate={error/total:.2%}")
# Response times (ms)
latencies = [l["latency_ms"] for l in logs if isinstance(l.get("latency_ms"), (int,float))]
print(f"avg={mean(latencies):.1f}ms, median={median(latencies):.1f}ms, max={max(latencies):.1f}ms")
# Error-type counts
error_types = Counter(l.get("error_type", "none") for l in logs)
print("error types:", error_types.most_common(10))
# Top prompt hashes
prompt_counts = Counter(l.get("prompt_hash") for l in logs)
print("top prompts:", prompt_counts.most_common(10))
Student insight: “Failures cluster around a specific prompt hash. We should separate whether it’s an input-pattern problem or a model-side constraint.”
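One simple way to follow up on that observation, reusing the logs list loaded above, is a per-prompt-hash error-rate table. This is a sketch; the minimum-sample cutoff of 5 is an arbitrary choice to reduce noise:

from collections import defaultdict

per_prompt = defaultdict(lambda: {"total": 0, "errors": 0})
for l in logs:
    h = l.get("prompt_hash")
    per_prompt[h]["total"] += 1
    if l.get("status") != "ok":
        per_prompt[h]["errors"] += 1

# Hashes with the worst error rates (ignore hashes seen fewer than 5 times)
worst = sorted(
    ((h, c["errors"] / c["total"], c["total"]) for h, c in per_prompt.items() if c["total"] >= 5),
    key=lambda item: item[1],
    reverse=True,
)
for h, rate, total in worst[:5]:
    print(f"{h}: error_rate={rate:.1%} (n={total})")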
■ Exercise ④: A Quick Intro to Visualization (prototype-level in class)
Based on the analysis results, we made quick charts to grasp trends (prototype displays with matplotlib in class):
- Time series: transitions in request count and error count
- Histogram: response-time distribution
- Bar chart: frequency of error types
(In class we used the default style without specifying colors.)
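For reference, the histogram and bar chart could be reproduced roughly like this, reusing latencies and error_types from the analysis snippet (default matplotlib style, no colors specified; a sketch rather than the exact in-class code):

import matplotlib.pyplot as plt

# Histogram: response-time distribution
plt.figure()
plt.hist(latencies, bins=30)
plt.xlabel("latency (ms)")
plt.ylabel("requests")
plt.title("Response-time distribution")

# Bar chart: frequency of error types
labels, counts = zip(*error_types.most_common(10))
plt.figure()
plt.bar(labels, counts)
plt.xlabel("error type")
plt.ylabel("count")
plt.title("Error types")
plt.xticks(rotation=45, ha="right")

plt.show()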
■ Improvement Cycle — Extracting Issues from Logs and Proposing Countermeasures
Representative issues found via log analysis and example countermeasures proposed by the class:
- Issue: Many timeouts with specific prompts. Countermeasures: limit input length (trim in preprocessing); shorten the model-call timeout and strengthen fallbacks.
- Issue: Success rate drops at peak times. Countermeasures: extend the cache TTL; implement backoff retries (see the sketch after this list); add a simple rate limiter.
- Issue: Users frequently request regeneration. Countermeasures: tighten the prompt specification; strengthen output schema checks; automate regeneration conditions.
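As one concrete illustration of the backoff-retry idea above (a sketch: call_model stands in for whatever function wraps the model API, and the retry limits are placeholders, not the class implementation):

import random
import time

def call_with_backoff(call_model, prompt, max_retries=3, base_delay=0.5):
    # Retry a model call with exponential backoff plus a little jitter (illustrative)
    for attempt in range(max_retries + 1):
        try:
            return call_model(prompt)
        except Exception:
            if attempt == max_retries:
                raise  # give up; the caller can fall back and log status="error"
            # 0.5 s, 1 s, 2 s, ... plus jitter so clients don't retry in lockstep
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))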
■ Instructor’s Comment
“Logs are voices from the past. The numbers show you where things are off. What matters is breaking discovered issues into small improvements and trying them right away.”
■ Students’ Reflections
- “By looking at logs, I can discuss problems objectively now.”
- “Load testing made me notice inconveniences from the user’s perspective for the first time.”
- “I was happy we could block the ‘frequent input patterns’ found in analysis via preprocessing.”
■ Next Week’s Preview: Implementing Operational Improvements & A/B Testing Basics
Next week we’ll implement the improvements identified this week and re-test to quantify the effects. We’ll also learn the basics of A/B testing (comparing two improvement plans) and evaluate which is more effective.
Through operational testing and log analysis, first-year students gained experience closer to real-world practice—not just “building,” but operating and improving. A system isn’t finished when it’s built; it’s only complete once it’s being used—that realization grew in Week 28.