Monitor Phase Documentation


What Happens in This Phase

The Monitor phase provides continuous visibility into system health, performance, and security. Unlike other phases where documentation consists of static artifacts, monitoring "documentation" is fundamentally different—it includes live dashboards, real-time metrics, and automated alerts that function as observable infrastructure.

This distinction matters: the dashboards and alert configurations themselves ARE documentation. They're version-controlled code (Grafana JSON, Prometheus YAML, Terraform modules) that should be automatically deployed and continuously validated against actual system behavior.
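
As a concrete pattern, Grafana can provision dashboards from version-controlled JSON files rather than UI edits. A minimal sketch of a provisioning config, with hypothetical folder and path names:

# grafana/provisioning/dashboards/logistics.yaml (hypothetical paths)
apiVersion: 1
providers:
  - name: logistics-dashboards   # hypothetical provider name
    folder: Logistics
    type: file
    disableDeletion: true        # UI deletes won't stick; git stays the source of truth
    options:
      path: /var/lib/grafana/dashboards/logistics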


Two Types of Monitor Phase Artifacts

Dynamic Monitoring Artifacts (Observable Infrastructure)

These are the configurations that define what gets monitored and how. They're code, not prose.

| Artifact Type | Examples | Format |
|---------------|----------|--------|
| Dashboard definitions | Service health, security events, SLO tracking | Grafana JSON, Datadog YAML |
| Alert rules | Threshold alerts, anomaly detection | Prometheus YAML, PagerDuty configs |
| SLO/SLI definitions | Availability targets, latency budgets | OpenSLO YAML (sketch below), service configs |
| Log aggregation configs | What gets collected, retention policies | Fluentd configs, CloudWatch rules |
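
As an illustration of the SLO/SLI row, a minimal OpenSLO definition might look like this sketch (hypothetical names; the indicator block that binds the SLO to a concrete metric is omitted):

# logistics-api-slo.yaml (hypothetical names)
apiVersion: openslo/v1
kind: SLO
metadata:
  name: logistics-api-availability
spec:
  description: Availability SLO for the logistics API
  service: logistics-api
  # indicator (the SLI metric source) omitted for brevity
  timeWindow:
    - duration: 30d
      isRolling: true
  budgetingMethod: Occurrences
  objectives:
    - displayName: availability
      target: 0.999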

AI's role: Generate and maintain these configurations directly—dashboard-as-code, alerts-from-SLOs, observability-as-code.

Supporting Documentation (Traditional Docs)

These explain the "why" behind monitoring decisions and guide human response.

| Artifact Type | Purpose |
|---------------|---------|
| Monitoring strategy | Why these metrics matter, coverage philosophy |
| Alert response procedures | What to do when alerts fire |
| Threshold rationale | Why specific values were chosen |
| Runbook procedures | Step-by-step incident response |
| Compliance mapping | Which metrics satisfy which requirements |

AI's role: Generate and maintain these docs, keeping them synchronized with the dynamic artifacts.


Where AI Helps

For Dynamic Artifacts (Monitoring-as-Code)

  • Generating dashboard configurations from SLO definitions and service metadata
  • Creating alert rules from SLOs and operational requirements
  • Maintaining consistency across dashboard definitions as services evolve
  • Validating configurations against actual system behavior

For Supporting Documentation

  • Generating runbook procedures when alert thresholds are defined
  • Documenting threshold rationale from operational history and SLO targets
  • Creating compliance mappings from metric definitions to requirements
  • Keeping docs synchronized when monitoring configs change

Example: Prometheus Alert Rules (Dynamic Artifact)

This example demonstrates GenAI Completion—AI generates alert configurations directly from SLO definitions, producing deterministic monitoring-as-code that requires validation against operational experience.

Context Setup

Before prompting, the SRE provided:

  • Service SLO definitions (99.9% availability, p99 < 500ms, error rate < 0.1%)
  • Prometheus metric naming conventions for the service
  • Runbook URL patterns for alert annotations
  • Severity level standards from on-call policy

Prompt

"Generate Prometheus alerting rules for the logistics API based on these SLOs: 99.9% availability, p99 latency under 500ms, error rate below 0.1%. Include appropriate severity levels and annotations for runbook links."

Output

# logistics-api-alerts.yaml
groups:
  - name: logistics-api-slos
    rules:
      - alert: LogisticsAPIAvailabilityBreach
        expr: |
          (1 - (sum(rate(http_requests_total{service="logistics-api",code=~"5.."}[5m]))
          / sum(rate(http_requests_total{service="logistics-api"}[5m])))) < 0.999
        for: 5m
        labels:
          severity: critical
          service: logistics-api
        annotations:
          summary: "Logistics API availability below 99.9% SLO"
          runbook_url: "https://runbooks.example.com/logistics-api/availability"

      - alert: LogisticsAPILatencyBreach
        expr: |
          histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket{service="logistics-api"}[5m]))) > 0.5
        for: 5m
        labels:
          severity: warning
          service: logistics-api
        annotations:
          summary: "Logistics API p99 latency exceeds 500ms SLO"
          runbook_url: "https://runbooks.example.com/logistics-api/latency"
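
If the cluster runs the Prometheus Operator (an assumption; the example doesn't specify a deployment mechanism), the generated group can be wrapped in a PrometheusRule resource and applied through the same pipeline as application manifests:

# logistics-api-prometheusrule.yaml (assumes the Prometheus Operator CRDs)
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: logistics-api-alerts
  namespace: monitoring
spec:
  groups:
    - name: logistics-api-slos
      rules:
        - alert: LogisticsAPIAvailabilityBreach
          expr: |
            (1 - (sum(rate(http_requests_total{service="logistics-api",code=~"5.."}[5m]))
            / sum(rate(http_requests_total{service="logistics-api"}[5m])))) < 0.999
          for: 5m
          labels:
            severity: critical
            service: logistics-api
          annotations:
            summary: "Logistics API availability below 99.9% SLO"
            runbook_url: "https://runbooks.example.com/logistics-api/availability"
        # ...remaining rules from the generated group go here unchanged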

Example: Alert Runbook (Supporting Documentation)

This example demonstrates GenAI Generation—AI synthesizes alert context and service architecture into operational procedures that require SRE validation but follow a predictable structure.

Context Setup

Before prompting, the SRE provided:

  • Alert definition with threshold and severity
  • Service architecture diagram showing dependencies
  • Historical incident data for common failure patterns
  • Escalation policy from on-call rotation
  • Existing remediation scripts and kubectl commands

Prompt

"Generate a runbook for the LogisticsAPIAvailabilityBreach alert. Include triage steps, common causes, escalation paths, and recovery procedures based on our service architecture."

Output

# Runbook: LogisticsAPIAvailabilityBreach

## Alert Context
- **Severity**: Critical
- **SLO**: 99.9% availability
- **Threshold**: Fires when availability drops below 99.9% for 5 minutes

## Immediate Triage (First 5 minutes)
1. Check service health dashboard: [link]
2. Verify alert is not a false positive (check upstream dependencies)
3. Identify error pattern in logs: `kubectl logs -l app=logistics-api --tail=100`

## Common Causes
| Symptom | Likely Cause | Quick Fix |
|---------|--------------|-----------|
| 503 errors spike | Pod crashes | Check pod restarts, scale up |
| Connection timeouts | Database overload | Check DB connections, consider read replica |
| Intermittent 500s | Dependency failure | Check downstream service health |

## Escalation
- **15 min unresolved**: Page on-call SRE
- **30 min unresolved**: Engage service owner
- **Production impact confirmed**: Incident commander protocol

Observable Infrastructure Principles

Monitor phase artifacts should be treated as infrastructure:

  • Version controlled: Dashboard JSON and alert YAML live in git alongside application code
  • Automatically deployed: CI/CD pipelines deploy monitoring configs, not manual UI changes
  • Continuously validated: Tests verify alerts fire correctly, dashboards load, metrics exist
  • Self-documenting: Well-structured configs with clear naming reduce need for separate docs

This approach means monitoring documentation stays current because it IS the monitoring system—not a separate description that drifts out of sync.
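
The "continuously validated" principle above can be enforced with Prometheus's built-in rule unit tests. A minimal sketch against the alert file from the earlier example, using hypothetical synthetic series:

# logistics-api-alerts_test.yaml (run with: promtool test rules logistics-api-alerts_test.yaml)
rule_files:
  - logistics-api-alerts.yaml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # Synthetic traffic where half of all requests fail (availability ~50%)
      - series: 'http_requests_total{service="logistics-api",code="500"}'
        values: '0+10x15'
      - series: 'http_requests_total{service="logistics-api",code="200"}'
        values: '0+10x15'
    alert_rule_test:
      - eval_time: 10m
        alertname: LogisticsAPIAvailabilityBreach
        exp_alerts:
          - exp_labels:
              severity: critical
              service: logistics-api

A CI step that runs promtool test rules fails the build when an edit silently breaks an alert, which is what keeps these configs trustworthy as documentation.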


What AI-Generated Monitor Artifacts Often Miss

For Dynamic Artifacts

  • Operational context - Alert thresholds that work in theory but cause alert fatigue in practice
  • Cross-service dependencies - Metrics that matter for end-to-end flows, not individual services
  • Environment differences - Thresholds that vary between dev, staging, and production
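
One mitigation for the environment gap is to template thresholds per environment rather than hard-code them. A sketch using Helm-style values files (hypothetical keys, assuming alert rules are rendered from a chart template):

# values-staging.yaml (hypothetical)
alertThresholds:
  availabilityTarget: 0.99    # looser than production to cut noise
  latencyP99Seconds: 1.0

# values-production.yaml (hypothetical)
alertThresholds:
  availabilityTarget: 0.999   # matches the published SLO
  latencyP99Seconds: 0.5

A chart template then interpolates these values into the expr fields, so staging noise doesn't erode trust in production pages.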

For Supporting Docs

  • Tribal knowledge - Why certain alerts exist based on past incidents
  • Team-specific escalation - Who actually responds vs. documented paths
  • Compliance nuances - Which metrics satisfy specific audit requirements

Human reviewers must validate that monitoring artifacts reflect operational reality, not theoretical ideals.


Governance Checklist

Before accepting AI-assisted Monitor phase artifacts:

Dynamic Artifacts

  • [ ] Alert thresholds validated against operational experience (not just SLO math)
  • [ ] Dashboard configs tested in staging before production deployment
  • [ ] Monitoring-as-code integrated into CI/CD pipeline (see the sketch after this checklist)
  • [ ] Rollback procedures exist for monitoring config changes
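
The CI/CD integration item above can start as a single lint step. A sketch in GitHub Actions syntax, with hypothetical repository paths:

# .github/workflows/validate-monitoring.yaml (hypothetical pipeline)
name: validate-monitoring
on:
  pull_request:
    paths:
      - 'monitoring/**'
jobs:
  lint-rules:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Lint alert rules with promtool
        run: |
          docker run --rm -v "$PWD/monitoring:/rules" \
            --entrypoint promtool prom/prometheus \
            check rules /rules/logistics-api-alerts.yaml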

Supporting Documentation

  • [ ] Runbooks tested during incident simulations
  • [ ] Escalation paths current with team structure
  • [ ] Compliance metrics mapped to specific ATO requirements
  • [ ] Threshold rationale documented for future maintainers

Brownfield Additions

For modernization efforts, the Monitor phase requires additional focus:

  • Legacy vs. modern performance comparison - Side-by-side dashboards demonstrating modernization success
  • Retirement validation metrics - Observable evidence that legacy system can be safely decommissioned
  • Unified monitoring - Single observability layer across legacy and modern systems during transition (see the sketch after this list)
  • Migration of monitoring-as-code - Converting legacy monitoring (manual UI configs) to version-controlled definitions
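
Unified monitoring during a transition is often achieved with Prometheus federation. A sketch, assuming the legacy stack already exposes a Prometheus endpoint (hypothetical hostname and job names):

# prometheus.yaml scrape fragment (hypothetical legacy endpoint)
scrape_configs:
  - job_name: 'legacy-federation'
    honor_labels: true
    metrics_path: /federate
    params:
      'match[]':
        - '{job=~"legacy-.*"}'
    static_configs:
      - targets: ['legacy-prometheus.internal:9090']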
