
[Class Report] Introduction to System Development — Week 30: Institutionalizing Improvements and Documentation

In Week 30, we embedded the improvements from Weeks 1–29 into day-to-day operations and wrote the supporting documentation (runbooks, change logs, monitoring design) so that the results persist. The theme was to avoid “build-and-forget” and instead package the work so that teammates and successors can take it over.


■ Teacher’s Introduction: “Knowledge doesn’t live in code alone. Procedures and records are everything.”

Mr. Tanaka: “To keep improvements from being one-offs, it’s crucial to write procedures, assign ownership, and build a mechanism for measuring impact. What we’re making today is a blueprint for ‘future you’ and for ‘the next person who joins.’”


■ Today’s Goals

  1. Create a Runbook (operational checklist) to embed improvements into operations.
  2. Establish a format and template for the Changelog.
  3. Decide on SLOs/KPIs for monitoring and sketch a simple dashboard.
  4. Draft a first-response playbook for incidents.

■ Exercise ①: Create a Runbook (operations procedure)

In teams, students translated their improvements (e.g., extending cache TTL, input sanitization, fallback tuning) into step-by-step procedures anyone can execute. Main sections included in the Runbook:

  • Purpose (what improvement to expect)
  • Target environments (test / staging / production)
  • Prerequisites (required env vars, permissions, dependent services)
  • Execution steps (commands and file diffs)
  • Rollback steps (how to revert)
  • Verification checklist (which KPIs/logs to confirm)
  • Fields to record executor, approver, and execution timestamp

Short Runbook Example (excerpt)

Purpose: Change cache TTL from 300 to 900 to improve peak-time response
Prereq: Write access to config/cache_config.py
Steps:
  1. In staging, change TTL = 300 to TTL = 900 in config/cache_config.py
  2. Restart the service: systemctl restart myapp
  3. Run a 10-minute load test (script: load_test.py)
Verification:
  - Average response time has improved compared with the previous run
  - Error rate (status != ok) has not increased
Rollback:
  - Revert TTL to 300 and restart the service
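
The two verification items above can be checked mechanically. The following is a minimal sketch, assuming each load-test run is available as a list of (latency in ms, status) pairs; the function names and the sample numbers are hypothetical and are not taken from load_test.py.

```python
from statistics import mean

def avg_latency(run):
    """Average latency in ms over a run of (latency_ms, status) pairs."""
    return mean(latency for latency, _status in run)

def error_rate(run):
    """Share of requests whose status is anything other than "ok"."""
    return sum(1 for _latency, status in run if status != "ok") / len(run)

def verify_run(previous, current):
    """Apply the two checks from the runbook's Verification section."""
    return {
        "latency_improved": avg_latency(current) < avg_latency(previous),
        "error_rate_not_increased": error_rate(current) <= error_rate(previous),
    }

# Hypothetical sample data: (latency_ms, status)
previous_run = [(420, "ok"), (510, "ok"), (480, "error"), (450, "ok")]
current_run = [(350, "ok"), (400, "ok"), (390, "ok"), (410, "error")]

checks = verify_run(previous_run, current_run)
print(checks)
```

If either value is False, the Rollback steps apply.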

■ Exercise ②: Create a Changelog Template

We created a concise template for recording changes. These records directly support later root-cause analysis and evaluation.

Changelog Entry Example

  • Date: 2025-10-01
  • Author: Yamada (Team A)
  • Summary: Extended cache TTL from 300 to 900
  • Purpose: Improve peak-time response (reduce average latency)
  • Impact scope: API responses (all users)
  • Environments: staging → production (gradual rollout)
  • Result: Average latency −15% (load test), no issues (ops check)
  • Reference: runbooks/cache_ttl_update.md
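
One way to keep entries consistent is to generate them from a small data structure. This is a minimal sketch, assuming the eight fields of the example above; the ChangelogEntry class and its methods are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass, fields

@dataclass
class ChangelogEntry:
    date: str
    author: str
    summary: str
    purpose: str
    impact_scope: str
    environments: str
    result: str
    reference: str

    def missing_fields(self):
        """Names of fields that were left blank."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]

    def render(self):
        """Render the entry in the same bullet style as the template."""
        lines = []
        for f in fields(self):
            label = f.name.replace("_", " ").capitalize()
            lines.append(f"  • {label}: {getattr(self, f.name)}")
        return "\n".join(lines)

entry = ChangelogEntry(
    date="2025-10-01",
    author="Yamada (Team A)",
    summary="Extended cache TTL from 300 to 900",
    purpose="Improve peak-time response (reduce average latency)",
    impact_scope="API responses (all users)",
    environments="staging → production (gradual rollout)",
    result="Average latency −15% (load test), no issues (ops check)",
    reference="runbooks/cache_ttl_update.md",
)

rendered = entry.render()
print(rendered)
```

Calling missing_fields() before saving makes it harder to file an entry without a result or a runbook reference.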

■ Exercise ③: Monitoring Design and Dashboard Sketch

To continuously observe improvements, we chose monitoring metrics (KPI/SLO) and designed a simple dashboard layout.

Recommended KPIs (examples)

  • Average response time (avg latency, ms)
  • 95th percentile response time (p95, ms)
  • Success rate (share of requests with status == ok)
  • Fallback rate (fallback / total)
  • Regeneration request rate (share of users who pressed “regenerate”)
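
The first four KPIs can be computed from per-request records. The following is a minimal sketch, assuming each record is a dict with latency_ms, status, and fallback keys (a hypothetical log shape); p95 uses the nearest-rank method, which is one of several common percentile definitions. The regeneration rate is left out because it needs per-user data.

```python
def percentile(values, pct):
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # ceiling of pct/100 * n
    return ordered[max(rank, 1) - 1]

def compute_kpis(records):
    """Compute avg latency, p95 latency, success rate, and fallback rate."""
    total = len(records)
    latencies = [r["latency_ms"] for r in records]
    return {
        "avg_latency_ms": sum(latencies) / total,
        "p95_latency_ms": percentile(latencies, 95),
        "success_rate": sum(r["status"] == "ok" for r in records) / total,
        "fallback_rate": sum(r["fallback"] for r in records) / total,
    }

# Hypothetical sample: 20 requests with latencies 100, 200, ..., 2000 ms.
# The slowest request failed; the two slowest used the fallback path.
records = [
    {"latency_ms": 100 * i, "status": "ok" if i < 20 else "error", "fallback": i >= 19}
    for i in range(1, 21)
]

kpis = compute_kpis(records)
print(kpis)
```

The average (1050 ms) sits far below the p95 (1900 ms) in this sample, which is why the list tracks both.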

Each team also set thresholds and alert conditions:

  • Example: p95 > 2000 ms sustained for 5 minutes → alert on Slack
  • Example: success rate < 95% → email on-call
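
The two example conditions can be written as plain functions before wiring them to any notification tool. This is a minimal sketch, assuming one p95 sample per minute, oldest first; the function names are hypothetical, and the sketch only returns messages rather than actually posting to Slack or sending email.

```python
def sustained_breach(p95_per_minute, threshold_ms=2000, window_minutes=5):
    """True if the most recent window_minutes samples all exceed the threshold."""
    if len(p95_per_minute) < window_minutes:
        return False  # not enough data yet to call it sustained
    return all(v > threshold_ms for v in p95_per_minute[-window_minutes:])

def evaluate_alerts(p95_per_minute, success_rate):
    """Return the list of alert messages that should fire."""
    alerts = []
    if sustained_breach(p95_per_minute):
        alerts.append("slack: p95 > 2000 ms sustained for 5 minutes")
    if success_rate < 0.95:
        alerts.append("email on-call: success rate < 95%")
    return alerts

# Hypothetical samples
slow_but_successful = evaluate_alerts([1800, 2100, 2200, 2300, 2150, 2400], 0.97)
brief_spike_with_errors = evaluate_alerts([2100, 1900, 2200, 2300, 2150], 0.90)
print(slow_but_successful)
print(brief_spike_with_errors)
```

Requiring every sample in the window to exceed the threshold keeps a single slow minute from paging anyone.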

For class, dashboards were shared as simple tables plus time-series sketches.


■ Exercise ④: Incident First-Response Playbook (lightweight)

To speed recovery when problems occur, we drafted a first-response flow.

First-Response Flow (summary)

  1. Detection (monitoring alert received)
  2. Scope assessment (which endpoints/users are affected)
  3. Immediate mitigation (split traffic / enable fallbacks)
  4. Root-cause data collection (extract logs for the relevant window)
  5. Escalation (notify the owner; if needed, report to ops via the instructor)
  6. Post-recovery review (RCA) and Changelog entry
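
Step 4 (extracting logs for the relevant window) is easy to script ahead of time. The following is a minimal sketch, assuming each log line starts with an ISO 8601 timestamp followed by a space; that line format, the function name, and the default window sizes are all hypothetical.

```python
from datetime import datetime, timedelta

def extract_window(log_lines, alert_time, before_minutes=10, after_minutes=5):
    """Return the log lines whose timestamp falls inside the window around the alert."""
    start = alert_time - timedelta(minutes=before_minutes)
    end = alert_time + timedelta(minutes=after_minutes)
    selected = []
    for line in log_lines:
        stamp, _sep, _rest = line.partition(" ")
        try:
            ts = datetime.fromisoformat(stamp)
        except ValueError:
            continue  # skip lines that do not start with a timestamp
        if start <= ts <= end:
            selected.append(line)
    return selected

# Hypothetical log lines
logs = [
    "2025-10-01T11:45:00 status=ok latency_ms=310",
    "2025-10-01T12:03:00 status=ok latency_ms=1800",
    "2025-10-01T12:09:30 status=error latency_ms=2400",
    "malformed line without timestamp",
    "2025-10-01T12:20:00 status=ok latency_ms=320",
]

window = extract_window(logs, alert_time=datetime(2025, 10, 1, 12, 10))
print("\n".join(window))
```

Saving the extracted lines alongside the Changelog entry gives the later RCA a fixed record to work from.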

■ The Importance of a Documentation Culture (discussion)

We wrapped up by discussing operating rules for documentation.

  • Don’t stop at “writing”: define an update workflow and assign responsible owners.
  • Always record changes in the Changelog (automate where possible).
  • For runbooks, the top priority is “can the reader reproduce it?” Use bullet points and command examples.
  • Standardize naming and file layout (e.g., docs/runbooks/, docs/changelog.md).

Student takeaway: “When you imagine who’s reading next, you naturally write more carefully!”


■ Teacher’s Closing Comment

“Beyond tech itself, building mechanisms for communicating is the secret to team results. The runbooks and changelogs you made today will make someone else’s job dramatically easier. Small habits compound into big trust.”


■ Next Week: Term Wrap-Up and a Preview of Year-2 Topics

Next week we’ll wrap up the term (submit reflection sheets and organize portfolios) and preview System Design (UML, DB design, architecture) for the coming semester. Let’s build momentum for second-year learning!


After Week 30, first-year students have gained the ability not only to “build” but also to operate and communicate. The mechanism for institutionalizing small improvements will be a powerful asset in future development work. Great job, everyone!

By greeden
