Engineering & Platform Metrics

Metrics for measuring platform health, reliability, and engineering productivity.

DORA Framework: Delivery metrics follow the DORA (DevOps Research and Assessment) framework, the industry standard for measuring software delivery performance. DORA research shows that high-performing teams are 2x more likely to exceed profitability targets.


Reliability Metrics

Uptime (Availability)

Definition: The percentage of time the service is operational.

Formula:

Uptime = (Total time - Downtime) / Total time × 100

Benchmarks (Monthly):

  • 99.0%: ~7.3 hours downtime (unacceptable for most SaaS)
  • 99.9%: ~43 minutes downtime (three nines)
  • 99.95%: ~22 minutes downtime (typical SLA)
  • 99.99%: ~4.3 minutes downtime (four nines, high reliability)

What it tells you: Basic platform reliability.

Common mistakes:

  • Excluding “scheduled maintenance” (customers don’t care why it’s down)
  • Measuring only core service, not dependencies
  • Not weighting by traffic/usage
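
The uptime formula and monthly downtime budgets above can be sketched in a few lines. A minimal illustration (the 43,200-minute figure assumes a 30-day month, so the budgets differ slightly from benchmarks computed on an average ~730-hour month):

```python
def uptime_pct(total_minutes: float, downtime_minutes: float) -> float:
    """Uptime = (Total time - Downtime) / Total time × 100."""
    return (total_minutes - downtime_minutes) / total_minutes * 100

def downtime_budget_minutes(target_pct: float, total_minutes: float = 30 * 24 * 60) -> float:
    """Maximum downtime allowed per period for a given uptime target."""
    return (1 - target_pct / 100) * total_minutes

# A 30-day month has 43,200 minutes.
print(round(uptime_pct(43_200, 43.2), 2))        # 99.9 (three nines)
print(round(downtime_budget_minutes(99.95), 1))  # 21.6 minutes
```

Count downtime from the user's perspective: scheduled maintenance and dependency outages belong in downtime_minutes too.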

SLA Adherence

Definition: Percentage of time SLA commitments were met.

Formula:

SLA Adherence = Periods meeting SLA / Total periods × 100

Target: 100% (SLA breaches have contractual and trust implications)
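
As a quick sketch (hypothetical counts; a "period" here could be a calendar month per customer contract):

```python
def sla_adherence(periods_meeting_sla: int, total_periods: int) -> float:
    """SLA Adherence = Periods meeting SLA / Total periods × 100."""
    if total_periods == 0:
        return 100.0  # nothing measured yet; treat as compliant
    return periods_meeting_sla / total_periods * 100

# 11 of 12 monthly periods met the SLA -- anything under 100% is a breach to investigate.
print(round(sla_adherence(11, 12), 2))  # 91.67
```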


Error Rate

Definition: Percentage of requests that result in errors.

Formula:

Error Rate = Error responses (5xx) / Total requests × 100

Benchmarks:

  • Below 0.1%: Excellent
  • 0.1-0.5%: Good
  • 0.5-1%: Acceptable
  • Above 1%: Needs attention

What it tells you: Service reliability from user perspective.
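
Computed from raw status codes, this might look like the following (illustrative data; note that 4xx client errors are deliberately excluded):

```python
def error_rate(status_codes: list[int]) -> float:
    """Error Rate = 5xx responses / Total requests × 100."""
    if not status_codes:
        return 0.0
    errors = sum(1 for code in status_codes if 500 <= code <= 599)
    return errors / len(status_codes) * 100

# 1,000 requests: four 5xx errors; the 404 does not count toward the rate.
codes = [200] * 995 + [500, 502, 404, 503, 504]
print(round(error_rate(codes), 2))  # 0.4 -> "Good" band
```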


Latency (Response Time)

Definition: Time taken to respond to requests.

Formula:

P50 Latency = Median response time
P95 Latency = 95th percentile response time
P99 Latency = 99th percentile response time

Always report P95 or P99, not averages (averages hide tail latency).

Benchmarks (API):

  • P50: < 100ms
  • P95: < 500ms
  • P99: < 1000ms

What it tells you: User experience. Slow responses impact satisfaction and conversion.
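
The percentile definitions can be computed with the standard library alone. The sketch below uses synthetic data with a slow tail to show why averages mislead (all numbers illustrative):

```python
import random
import statistics

def latency_percentiles(samples_ms: list[float]) -> dict[str, float]:
    """P50/P95/P99 from raw response-time samples (milliseconds)."""
    # quantiles(n=100) returns 99 cut points; index k-1 is the k-th percentile.
    cuts = statistics.quantiles(samples_ms, n=100)
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}

random.seed(7)
# 98% fast responses plus a 2% slow tail.
samples = [random.uniform(20, 80) for _ in range(980)] + \
          [random.uniform(800, 1500) for _ in range(20)]
p = latency_percentiles(samples)
print({k: round(v) for k, v in p.items()}, "mean:", round(statistics.mean(samples)))
```

Here the mean lands well above the median but far below P99, misrepresenting both the typical request and the worst-case tail at once.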


Incident Metrics

Incident Count

Definition: Number of incidents in a period.

Formula:

Incident Count = Total incidents reported

Categorize by severity:

  • SEV1/P1: Critical, full outage
  • SEV2/P2: Major, significant degradation
  • SEV3/P3: Minor, limited impact

What it tells you: System stability. The trend over time matters more than the raw count in any single period.


Mean Time to Detect (MTTD)

Definition: Time from incident start to detection.

Formula:

MTTD = Average of (Detection time - Incident start time)

Target: < 5 minutes for critical services

What it tells you: Monitoring and alerting effectiveness.


Mean Time to Acknowledge (MTTA)

Definition: Time from alert to human acknowledgment.

Formula:

MTTA = Average of (Acknowledgment time - Alert time)

Target: < 15 minutes

What it tells you: On-call responsiveness.


Mean Time to Resolve (MTTR)

Definition: Time from incident start to resolution.

Formula:

MTTR = Average of (Resolution time - Incident start time)

Benchmarks:

  • SEV1: < 1 hour
  • SEV2: < 4 hours
  • SEV3: < 24 hours

What it tells you: Incident response capability.
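
MTTD, MTTA, and MTTR all follow the same averaging pattern, so they can be derived from one set of incident timestamps. A sketch over hypothetical records:

```python
from datetime import datetime, timedelta
from statistics import mean

# Hypothetical incident records with the four timestamps these metrics need.
incidents = [
    {"start": datetime(2024, 5, 1, 10, 0), "alert": datetime(2024, 5, 1, 10, 3),
     "ack": datetime(2024, 5, 1, 10, 10), "resolved": datetime(2024, 5, 1, 10, 45)},
    {"start": datetime(2024, 5, 9, 22, 0), "alert": datetime(2024, 5, 9, 22, 5),
     "ack": datetime(2024, 5, 9, 22, 12), "resolved": datetime(2024, 5, 9, 23, 30)},
]

def avg_minutes(deltas: list[timedelta]) -> float:
    return mean(d.total_seconds() for d in deltas) / 60

mttd = avg_minutes([i["alert"] - i["start"] for i in incidents])     # detection
mtta = avg_minutes([i["ack"] - i["alert"] for i in incidents])       # acknowledgment
mttr = avg_minutes([i["resolved"] - i["start"] for i in incidents])  # resolution
print(f"MTTD={mttd:.0f}m MTTA={mtta:.0f}m MTTR={mttr:.1f}m")  # MTTD=4m MTTA=7m MTTR=67.5m
```

In practice, compute these per severity level: a healthy SEV3 average can mask a slow SEV1 response.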


Change Failure Rate

Definition: Percentage of deployments that cause incidents.

Formula:

Change Failure Rate = Deployments causing incidents / Total deployments × 100

Benchmarks (DORA):

  • Elite: 0-15%
  • High: 16-30%
  • Medium: 31-45%
  • Low: 46-60%

What it tells you: Deployment quality and release process health.
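
A sketch that pairs the formula with the benchmark bands above (deploy counts hypothetical):

```python
def change_failure_rate(failed_deploys: int, total_deploys: int) -> float:
    """Change Failure Rate = Deployments causing incidents / Total deployments × 100."""
    if total_deploys == 0:
        return 0.0
    return failed_deploys / total_deploys * 100

def dora_band(cfr_pct: float) -> str:
    # Thresholds from the DORA benchmark table above.
    if cfr_pct <= 15: return "Elite"
    if cfr_pct <= 30: return "High"
    if cfr_pct <= 45: return "Medium"
    return "Low"

cfr = change_failure_rate(6, 50)
print(f"{cfr:.0f}% -> {dora_band(cfr)}")  # 12% -> Elite
```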


Delivery Metrics (DORA)

The four key DORA metrics measure software delivery performance. In 2024, DORA added a fifth metric (Rework Rate), though benchmarks are still emerging.

Deployment Frequency

Definition: How often code is deployed to production.

Formula:

Deployment Frequency = Deployments / Time period

Benchmarks (DORA):

  • Elite: On-demand or multiple times per day
  • High: Once per day to once per week
  • Medium: Once per week to once per month
  • Low: Less than once per month

What it tells you: Ability to deliver value quickly.


Lead Time for Changes

Definition: Time from code commit to production deployment.

Formula:

Lead Time = Median of (Deploy time - Commit time)

Benchmarks (DORA):

  • Elite: Less than 1 day
  • High: 1 day to 1 week
  • Medium: 1 week to 1 month
  • Low: 1 month to 6 months

What it tells you: Development and release pipeline efficiency.
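
Note the formula uses the median, not the mean, so a single stuck change does not distort the picture. A minimal sketch with hypothetical commit/deploy pairs:

```python
from datetime import datetime
from statistics import median

# Hypothetical (commit time, production deploy time) pairs for recent changes.
changes = [
    (datetime(2024, 6, 3, 9, 0),  datetime(2024, 6, 3, 15, 0)),  # 6 h
    (datetime(2024, 6, 4, 11, 0), datetime(2024, 6, 5, 11, 0)),  # 24 h
    (datetime(2024, 6, 5, 8, 0),  datetime(2024, 6, 7, 8, 0)),   # 48 h
]

lead_time_hours = median(
    (deploy - commit).total_seconds() / 3600 for commit, deploy in changes
)
print(lead_time_hours)  # 24.0 -> "High" band (1 day to 1 week)
```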


Mean Time to Recovery (MTTR)

Definition: Time to restore service after a failure.

Formula:

MTTR = Average of (Recovery time - Failure time)

Benchmarks (DORA):

  • Elite: Less than 1 hour
  • High: Less than 1 day
  • Medium: 1 day to 1 week
  • Low: More than 1 week

What it tells you: Resilience and recovery capability.


Capacity Metrics

Infrastructure Utilization

Definition: Percentage of provisioned capacity being used.

Formula:

Utilization = Actual usage / Provisioned capacity × 100

Measure for CPU, memory, storage, network.

Targets:

  • Below 40%: Over-provisioned, wasting money
  • 40-70%: Healthy headroom
  • 70-85%: Efficient, monitoring needed
  • Above 85%: At risk, scale soon

Infrastructure Cost per Customer

Definition: Total infrastructure cost divided by active customers.

Formula:

Cost per Customer = Total infrastructure cost / Active customers

What it tells you: Unit economics of delivery. Should decrease or stay flat as you scale.


Headroom

Definition: Available capacity before scaling is required.

Formula:

Headroom = (Max capacity - Current usage) / Max capacity × 100

Target: Maintain 20-30% headroom for traffic spikes.
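
The three capacity formulas above are simple ratios; a combined sketch with hypothetical fleet numbers:

```python
def utilization_pct(actual: float, provisioned: float) -> float:
    """Utilization = Actual usage / Provisioned capacity × 100."""
    return actual / provisioned * 100

def headroom_pct(current: float, max_capacity: float) -> float:
    """Headroom = (Max capacity - Current usage) / Max capacity × 100."""
    return (max_capacity - current) / max_capacity * 100

def cost_per_customer(total_cost: float, active_customers: int) -> float:
    """Cost per Customer = Total infrastructure cost / Active customers."""
    return total_cost / active_customers

# Hypothetical month: 52 of 80 provisioned cores busy, $42,000 spend, 1,400 customers.
print(round(utilization_pct(52, 80), 1))           # 65.0 -> healthy band
print(round(headroom_pct(52, 80), 1))              # 35.0 -> above the 20-30% target
print(round(cost_per_customer(42_000, 1_400), 2))  # 30.0 per customer
```

Utilization and headroom are complementary views of the same gap; track both per resource type (CPU, memory, storage, network) rather than fleet-wide.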


Security Metrics

Vulnerability Count

Definition: Number of known vulnerabilities in systems.

Categorize by severity: Critical, High, Medium, Low

Targets:

  • Critical: 0 (fix immediately)
  • High: < 10 (fix within days)
  • Medium: < 50 (fix within weeks)

Time to Patch

Definition: Time from vulnerability disclosure to patch deployment.

Formula:

Time to Patch = Average of (Patch deploy time - Disclosure time)

Targets:

  • Critical: < 24 hours
  • High: < 7 days
  • Medium: < 30 days
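
The patch-time targets lend themselves to a simple per-severity SLA check (records and timestamps hypothetical):

```python
from datetime import datetime

# Patch-time targets from the table above, in hours.
SLA_HOURS = {"critical": 24, "high": 7 * 24, "medium": 30 * 24}

def hours_to_patch(disclosed: datetime, patched: datetime) -> float:
    return (patched - disclosed).total_seconds() / 3600

def within_sla(severity: str, disclosed: datetime, patched: datetime) -> bool:
    return hours_to_patch(disclosed, patched) <= SLA_HOURS[severity]

# Hypothetical records: (severity, disclosure time, patch deploy time).
patches = [
    ("critical", datetime(2024, 7, 1, 8, 0), datetime(2024, 7, 1, 20, 0)),  # 12 h
    ("high",     datetime(2024, 7, 2, 0, 0), datetime(2024, 7, 6, 0, 0)),   # 96 h
]

for severity, disclosed, patched in patches:
    ok = within_sla(severity, disclosed, patched)
    print(severity, f"{hours_to_patch(disclosed, patched):.0f}h",
          "within SLA" if ok else "BREACH")
```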

Security Incident Count

Definition: Number of security incidents in a period.

Target: Zero breaches. Track and reduce attempted attacks.


Summary Table

Metric                           | Type        | Primary Indicator Of
-------------------------------- | ----------- | ---------------------
Uptime                           | Reliability | Platform availability
Error Rate                       | Reliability | Service quality
P95 Latency                      | Reliability | User experience
Incident Count                   | Incidents   | System stability
MTTR                             | Incidents   | Recovery capability
Deployment Frequency             | Delivery    | Release velocity
Lead Time for Changes            | Delivery    | Pipeline efficiency
Change Failure Rate              | Delivery    | Release quality
Infrastructure Cost per Customer | Capacity    | Delivery economics
