[Class Report] Intro to System Development – Week 29: Verifying Improvements & Introduction to A/B Testing
In Week 29, we implemented improvements addressing the issues found in last week’s log analysis and learned the basics of retesting and A/B testing (comparative evaluation) to measure their impact. Working through the implement → verify → compare cycle fostered the mindset that “improvement isn’t done until the numbers prove it.”
■ Instructor’s Kickoff: “Form a hypothesis, verify it, then refine again”
Mr. Tanaka: “Even if you think something got better, you won’t know without numbers. A/B testing is a handy method to objectively decide which version is truly better.”
■ Today’s Goals
- Implement last week’s improvements (e.g., extending cache TTL, input length limits, fallback adjustments).
- Compare key metrics like response time, success rate, and regenerate rate before vs. after improvements.
- Design a simple A/B test and measure which of two variants is more effective.
■ Exercise ①: Implementing and Deploying Improvements (within the lab environment)
Each team selected two to three high-priority improvements from the prior analysis, implemented them, and rolled them out to the test environment. Representative examples:
- Cache TTL extension: 300 seconds → 900 seconds (reduces duplicate requests in short windows)
- Stronger input pre-processing: Replace overly long inputs with a summarization prompt
- Improved fallback copy: Provide concrete next actions to users (e.g., “Check the official site”)
After implementation, we re-ran load tests and collected logs with the same load script.
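To make the cache change concrete, here is a minimal sketch of an in-memory TTL cache, assuming a simple key → (timestamp, value) store; the names (CACHE_TTL_SECONDS, get_cached, fetch_fn) are illustrative and not the lab code:
import time

CACHE_TTL_SECONDS = 900  # raised from 300 s as part of the improvement

_cache = {}  # key -> (stored_at, value)

def get_cached(key, fetch_fn):
    """Return a cached value while it is still fresh; otherwise fetch and store it."""
    entry = _cache.get(key)
    if entry is not None:
        stored_at, value = entry
        if time.time() - stored_at < CACHE_TTL_SECONDS:
            return value  # cache hit: duplicate requests in this window skip the upstream call
    value = fetch_fn()  # cache miss or stale entry: call upstream once
    _cache[key] = (time.time(), value)
    return value
A longer TTL means more requests land in the “fresh” window, which is exactly the duplicate-request reduction listed above; the trade-off in content freshness comes up again in the discussion section.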
■ Exercise ②: Measuring and Comparing Key KPIs
From the collected logs, we compared key metrics of the pre-improvement (control) vs. post-improvement (test) versions. Example metrics covered in class:
- Average response time (ms)
- Success rate (ok / total requests)
- Regenerate request rate (ratio of users who clicked “regenerate”)
- Fallback occurrence rate (ratio of fallback status)
Simple aggregation snippet (for class use):
# logs_control / logs_test are the respective log lists
def summarize(logs):
    """Aggregate a list of request-log dicts into the KPIs above (assumes logs is non-empty)."""
    total = len(logs)
    ok = sum(1 for l in logs if l["status"] == "ok")
    return {
        "total": total,
        "ok_rate": ok / total,                                      # success rate
        "avg_latency": sum(l["latency_ms"] for l in logs) / total,  # mean response time in ms
    }
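A sketch of how the two log sets might then be compared; the field names for regenerate clicks and fallback status ("regenerated", "fallback") are assumptions about the log format, not the exact class script:
def extra_rates(logs):
    """Rates for the remaining KPIs; field names are assumed for illustration."""
    total = len(logs)
    return {
        "regenerate_rate": sum(1 for l in logs if l.get("regenerated")) / total,
        "fallback_rate": sum(1 for l in logs if l["status"] == "fallback") / total,
    }

control = summarize(logs_control)   # pre-improvement logs
test = summarize(logs_test)         # post-improvement logs
latency_change = (test["avg_latency"] - control["avg_latency"]) / control["avg_latency"]
print(f"avg latency change: {latency_change:+.1%}")  # negative means faster after the change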
Student observations (examples):
- Extending the cache TTL improved average response time by about 15%.
- Teams that added input summarization saw fewer timeouts and improved success rates.
(Numbers are sample measurements from class and varied by team.)
■ Exercise ③: Designing and Running a Simple A/B Test
We learned the purpose and design steps of A/B testing and ran a simple in-class A/B test.
Basic A/B Testing Steps
- Form a hypothesis: e.g., “Extending cache TTL reduces average response time and lowers regenerate rate.”
- Create variants: A = current (TTL = 300), B = improved (TTL = 900)
- Split traffic: Route half of the requests to each variant (in the lab, we simulated this with random assignment or time-based splits; see the sketch below)
- Collect enough samples: Short windows create high variance
- Compare on metrics: Evaluate differences in key KPIs and discuss statistical significance (we only introduced significance testing in class)
- Conclude and feed into the next improvement cycle
In class, we used a lightweight method: run A and B in separate one-hour windows and then compare metrics.
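For the random-assignment alternative, a minimal sketch (the request IDs and the ttl_by_variant mapping are illustrative, not the class harness):
import random

ttl_by_variant = {"A": 300, "B": 900}  # A = control TTL, B = improved TTL

def assign_variant(request_id, split=0.5):
    """Route roughly half of the requests to variant B, the rest to A."""
    # Seeding with the request/user ID keeps a given caller on the same variant,
    # so retries don't mix the two experiences.
    rng = random.Random(request_id)
    return "B" if rng.random() < split else "A"

variant = assign_variant("user-123")
cache_ttl = ttl_by_variant[variant]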
■ Discussion: How to Read Results and What to Watch Out For
Key points we confirmed as a class when interpreting results:
- Small sample sizes can lead to misjudgments
- Seasonality and network conditions can cause variance
- Consider multiple metrics for a holistic view (not just response time; weigh success rate and UX metrics, too)
- Consider side effects of improvements (e.g., longer TTL could reduce content freshness)
Student takeaway: “Variant A is faster, but Variant B has a lower regenerate rate… which to prioritize depends on our objective.”
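To make the small-sample caution concrete, here is a minimal sketch of the two-proportion z-test that was only mentioned in passing, applied to the success rate of A vs. B; the counts in the example calls are invented for illustration:
import math

def two_proportion_z(ok_a, n_a, ok_b, n_b):
    """Z statistic and two-sided p-value for the difference in success rates."""
    p_a, p_b = ok_a / n_a, ok_b / n_b
    p_pool = (ok_a + ok_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))  # normal CDF
    return z, p_value

# The same observed gap (90% vs. 96%) is not significant with 50 requests per variant...
print(two_proportion_z(45, 50, 48, 50))
# ...but is clearly significant with 500 requests per variant.
print(two_proportion_z(450, 500, 480, 500))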
■ Instructor’s Closing Comment
“A/B testing builds a data-driven culture. Form a small hypothesis, verify it, and connect the results to the next hypothesis—this is a good habit for engineers. The key is to face the results honestly.”
■ Student Reflections
- “Actually comparing showed that outcomes often differ from our expectations.”
- “A/B design seems simple but is hard—avoiding bias is the crux.”
- “The fastest way to learn is to run the improvement → verification loop in short cycles.”
■ Next Week’s Preview: Operationalizing Improvements & Documentation
Next week, we will operationalize these improvements for production-like use (create checklists) and document change history and impact reports. We’ll work on the “systems” that sustain continuous improvement.
By implementing improvements and verifying them with numbers, Week 29 let students experience the cycle of hypothesis → implementation → verification → re-hypothesis, taking a solid step toward practical improvement skills.