[Class Report] Introduction to System Development, Week 28 — Intro to Operational Testing & Log Analysis
In Week 28, we worked on operational testing (load and stability checks) of the generative-AI–integrated system we implemented last week, and on visualizing its behavior through log analysis. The aim was to learn validation that assumes real-world operation, and the skills to go from finding issues to making improvements.
■ Instructor’s Opening: “Testing isn’t a one-time task; it’s work you should repeat”
Prof. Tanaka: “Once real people start using your system, unexpected operations and input combinations will appear. It’s crucial to use operational testing to discover what kinds of failures might occur in advance, and to build mechanisms to verify them with logs.”
The class proceeded with the perspective that “reliable services are backed by tests and logs.”
■ Today’s Goals
- Run a simple load test and measure response time and success rate.
- Standardize the log format (timestamp, status, prompt hash, etc.) so it’s aggregatable.
- Visualize “error rate,” “frequent input patterns,” “regeneration frequency,” etc., through log analysis and extract improvement points.
■ Exercise ①: Designing and Running Load Tests (safely in the learning environment)
We began with load-test design. Each team created scenarios like the following:
- Light load: Continuous requests at 1 req/sec for 5 minutes
- Peak load: A burst of 10 req/sec for 30 seconds
- Spike and recovery: Repeated cycles of surge → wait → return to normal
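One rough way to keep these scenarios comparable is to write them down as plain parameters and translate them into arguments for the load script shown below. The field names and helper here are illustrative, not the teams’ actual files:

# Illustrative scenario parameters (rate in req/sec, duration in seconds)
SCENARIOS = {
    "light": {"rate": 1, "duration": 300},                           # 1 req/sec for 5 minutes
    "peak": {"rate": 10, "duration": 30},                            # burst of 10 req/sec for 30 s
    "spike": {"rate": 10, "duration": 10, "rest": 60, "cycles": 3},  # surge -> wait -> normal, repeated
}

def to_script_args(scenario):
    # Translate a scenario into the (n, interval) arguments of send_requests() below
    n = scenario["rate"] * scenario["duration"]
    interval = 1 / scenario["rate"]
    return n, interval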
Then, within the constraints of the learning environment, we ran a simple script to send simulated requests and measured:
- Average response time (ms)
- 95th percentile response time (ms)
- Success rate (HTTP 200 / total requests)
- Number of timeouts and exceptions
Simple load script used in class (simulated, for learning)
import time
import requests

def send_requests(url, n, interval):
    # Send n simulated POST requests at a fixed interval and record (status, detail, latency_ms)
    results = []
    for i in range(n):
        start = time.time()
        try:
            r = requests.post(url, json={"prompt": "test"}, timeout=5)
            latency = (time.time() - start) * 1000
            results.append(("ok", r.status_code, latency))
        except Exception as e:
            latency = (time.time() - start) * 1000
            results.append(("error", str(e), latency))
        time.sleep(interval)
    return results
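To turn the raw results into the four numbers measured above, a small summary helper could look like the following. This is a sketch that assumes the tuple layout of send_requests; the “timed out” string match is a rough heuristic for counting timeout exceptions, and the example URL is a placeholder:

from statistics import mean, quantiles

def summarize(results):
    # results: list of (status, detail, latency_ms) tuples from send_requests()
    latencies = [lat for _, _, lat in results]
    ok = sum(1 for status, detail, _ in results if status == "ok" and detail == 200)
    error_msgs = [str(detail) for status, detail, _ in results if status == "error"]
    # Rough heuristic: requests' timeout exceptions usually mention "timed out" in their message
    timeouts = sum(1 for msg in error_msgs if "timed out" in msg.lower())
    p95 = quantiles(latencies, n=20)[18] if len(latencies) >= 2 else latencies[0]
    return {
        "avg_ms": round(mean(latencies), 1),
        "p95_ms": round(p95, 1),
        "success_rate": ok / len(results),
        "timeouts": timeouts,
        "exceptions": len(error_msgs),
    }

# Example (placeholder URL): light-load scenario, 1 req/sec for 5 minutes
# print(summarize(send_requests("http://localhost:8000/generate", n=300, interval=1.0)))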
Student reaction: “Timeouts increased during peaks. Caching and retry strategies look promising!”
■ Exercise ②: Deciding on a Log Format and Ensuring Consistent Output
We decided on a log format useful for operations and integrated it into the app. The minimal fields adopted this time were:
- timestamp (ISO 8601)
- request_id (unique)
- prompt_hash (for aggregating identical prompts)
- user_id (anonymized or masked)
- status (ok / retry / fallback / error)
- latency_ms
- error_type (if present)
- note (flags for review, etc.)
Logs were output in JSONL (1 line = 1 event) for easier post-processing.
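As a minimal sketch of what writing one such event could look like (the helper name, hashing choice, and file path are ours, not the exact class code):

import hashlib
import json
import uuid
from datetime import datetime, timezone

def log_event(status, prompt, latency_ms, user_id, error_type=None, note=None,
              path="app_logs.jsonl"):
    # Append one JSONL event using the fields adopted in class (illustrative helper)
    event = {
        "timestamp": datetime.now(timezone.utc).isoformat(),  # ISO 8601
        "request_id": str(uuid.uuid4()),                      # unique per request
        "prompt_hash": hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:16],
        "user_id": user_id,                                    # pass an already anonymized/masked id
        "status": status,                                      # ok / retry / fallback / error
        "latency_ms": latency_ms,
        "error_type": error_type,                              # None if not applicable
        "note": note,                                          # flags for review, etc.
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(event, ensure_ascii=False) + "\n")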
■ Exercise ③: Log Analysis — Basic Aggregation with Python
We read collected logs and performed simple analyses to check frequent errors and response-time distributions. Here’s a snippet used in class (learning example):
import json
from collections import Counter
from statistics import mean, median
logs = []
with open("app_logs.jsonl", "r", encoding="utf-8") as f:
for line in f:
logs.append(json.loads(line))
# Success rate
total = len(logs)
ok = sum(1 for l in logs if l["status"] == "ok")
error = total - ok
print(f"total={total}, ok={ok}, error={error}, error_rate={error/total:.2%}")
# Response times (ms)
latencies = [l["latency_ms"] for l in logs if isinstance(l.get("latency_ms"), (int,float))]
print(f"avg={mean(latencies):.1f}ms, median={median(latencies):.1f}ms, max={max(latencies):.1f}ms")
# Error-type counts
error_types = Counter(l.get("error_type", "none") for l in logs)
print("error types:", error_types.most_common(10))
# Top prompt hashes
prompt_counts = Counter(l.get("prompt_hash") for l in logs)
print("top prompts:", prompt_counts.most_common(10))
Student insight: “Failures cluster around a specific prompt hash. We should separate whether it’s an input-pattern problem or a model-side constraint.”
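One simple way to follow up on that observation, reusing the logs list loaded above, is a per-prompt-hash error-rate table. This is a sketch; the minimum-sample cutoff of 5 is an arbitrary choice to reduce noise:

from collections import defaultdict

per_prompt = defaultdict(lambda: {"total": 0, "errors": 0})
for l in logs:
    h = l.get("prompt_hash")
    per_prompt[h]["total"] += 1
    if l.get("status") != "ok":
        per_prompt[h]["errors"] += 1

# Hashes with the worst error rates (ignore hashes seen fewer than 5 times)
worst = sorted(
    ((h, c["errors"] / c["total"], c["total"]) for h, c in per_prompt.items() if c["total"] >= 5),
    key=lambda item: item[1],
    reverse=True,
)
for h, rate, total in worst[:5]:
    print(f"{h}: error_rate={rate:.1%} (n={total})")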
■ Exercise ④: A Quick Intro to Visualization (prototype-level in class)
Based on the analysis results, we made quick charts to grasp trends (prototype displays with matplotlib in class):
- Time series: transitions in request count and error count
- Histogram: response-time distribution
- Bar chart: frequency of error types
(In class we used the default style without specifying colors.)
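For reference, the histogram and bar chart could be reproduced roughly like this, reusing latencies and error_types from the analysis snippet (default matplotlib style, no colors specified; a sketch rather than the exact in-class code):

import matplotlib.pyplot as plt

# Histogram: response-time distribution
plt.figure()
plt.hist(latencies, bins=30)
plt.xlabel("latency (ms)")
plt.ylabel("requests")
plt.title("Response-time distribution")

# Bar chart: frequency of error types
labels, counts = zip(*error_types.most_common(10))
plt.figure()
plt.bar(labels, counts)
plt.xlabel("error type")
plt.ylabel("count")
plt.title("Error types")
plt.xticks(rotation=45, ha="right")

plt.show()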
■ Improvement Cycle — Extracting Issues from Logs and Proposing Countermeasures
Representative issues found via log analysis and example countermeasures proposed by the class:
- Issue: Many timeouts with specific prompts. Countermeasures: limit input length (trim in preprocessing); shorten the model-call timeout and strengthen fallbacks.
- Issue: Success rate drops at peak times. Countermeasures: extend the cache TTL; implement backoff retries (see the sketch after this list); add a simple rate limiter.
- Issue: Users frequently request regeneration. Countermeasures: tighten the prompt specification; strengthen output schema checks; automate regeneration conditions.
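As one concrete illustration of the backoff-retry idea above (a sketch: call_model stands in for whatever function wraps the model API, and the retry limits are placeholders, not the class implementation):

import random
import time

def call_with_backoff(call_model, prompt, max_retries=3, base_delay=0.5):
    # Retry a model call with exponential backoff plus a little jitter (illustrative)
    for attempt in range(max_retries + 1):
        try:
            return call_model(prompt)
        except Exception:
            if attempt == max_retries:
                raise  # give up; the caller can fall back and log status="error"
            # 0.5 s, 1 s, 2 s, ... plus jitter so clients don't retry in lockstep
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))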
■ Instructor’s Comment
“Logs are voices from the past. The numbers show you where things are off. What matters is breaking discovered issues into small improvements and trying them right away.”
■ Students’ Reflections
- “By looking at logs, I can discuss problems objectively now.”
- “Load testing made me notice inconveniences from the user’s perspective for the first time.”
- “I was happy we could block the ‘frequent input patterns’ found in analysis via preprocessing.”
■ Next Week’s Preview: Implementing Operational Improvements & A/B Testing Basics
Next week we’ll implement the improvements identified this week and re-test to quantify the effects. We’ll also learn the basics of A/B testing (comparing two improvement plans) and evaluate which is more effective.
Through operational testing and log analysis, first-year students gained experience closer to real-world practice—not just “building,” but operating and improving. A system isn’t finished when it’s built; it’s only complete once it’s being used—that realization grew in Week 28.