April 8, 2026
10 min read
Ian Lintner

🔬 Weekly k6 Benchmarks in CI: Automated Performance Trend Detection for OAuth2 Servers

Weekly k6 benchmark CI trend chart comparing Rust, Go, Java, Node.js, and Python OAuth2 servers
Tags: performance, k6, github-actions, rust, oauth2, benchmarks, QA, CI/CD

Performance bugs are the kind of bugs that don't throw exceptions. They creep in through an innocent dependency bump, a subtle query regression, or a new feature that adds 2ms to your hot path. By the time anyone notices, the degradation has been baked into a dozen releases and nobody knows which commit caused it.

The fix is not "test harder before release." The fix is continuous performance measurement with trend awareness, and GitHub Actions makes this surprisingly practical.

This post breaks down the Weekly Benchmarks workflow from the rust-oauth2-server project: a CI-native system that load-tests a Rust OAuth2 server against four major open-source alternatives (Keycloak, Ory Hydra, Authentik, and node-oidc-provider) every week, detects performance regressions, and creates follow-up issues automatically.


🎯 TL;DR

| What | How |
| --- | --- |
| Tool | k6 (Grafana): Go-based load generator with JS test scripts |
| Servers | Rust, Rust+Mongo, Keycloak (Java), Ory Hydra (Go), Authentik (Python), node-oidc-provider (Node.js) |
| Scenarios | client-credentials, token-introspect, discovery, health |
| Schedule | Every Monday 03:17 UTC, plus manual dispatch |
| Intelligence | Git-aware planner auto-selects which servers to re-benchmark based on what changed |
| Output | Comparison report, CSV export, Mermaid throughput charts, CI manifest, GitHub Issues for version bumps |

🏗️ Why Weekly Benchmarks in CI?

Most teams benchmark reactively: someone reports "the API feels slow," and an engineer spends a day reproducing it on a laptop. The numbers are unreproducible because the laptop was running Slack, Docker Desktop, and 47 Chrome tabs.

A scheduled CI benchmark solves several problems at once:

  1. Consistent environment: same runner hardware, same Docker resource limits, same network topology every week.
  2. Historical trend line: you can compare this week's numbers to last week's, last month's, and the baseline from when the project was "fast."
  3. Automated regression detection: if a Rust server's p99 latency jumps 40% after a dependency update, the workflow artifact tells you which run introduced it.
  4. Cross-server comparison: benchmarking against Keycloak, Hydra, Authentik, and node-oidc isn't just for bragging rights. It validates that your server's performance ratios haven't shifted. If your Rust server was 3x faster than Keycloak and now it's 1.5x, something regressed.

The honesty caveat

The workflow itself documents its own limitations:

"GitHub-hosted runners are good for directional trend checks, not lab-grade benchmark reproducibility."

This is the right framing. CI benchmarks are a smoke detector, not a calibration lab. You're looking for relative changes and trend breaks, not absolute throughput numbers you'd put in a sales deck.


🧠 The Smart Planner: Git-Aware Server Selection

This is the most technically interesting piece. The workflow doesn't blindly re-run all six servers every week. It uses a planner (plan_benchmark_run.sh) that inspects the last 7 days of commits on main and decides which servers actually need re-benchmarking.

Change classification rules

The planner categorizes every changed file:

| File pattern | Classification | Servers selected |
| --- | --- | --- |
| src/*, crates/*, Cargo.toml, Cargo.lock, Dockerfile* | First-party runtime change | rust, rust-mongo |
| benchmarks/k6/*, benchmarks/setup/*, benchmarks/run-benchmarks.sh | Benchmark harness change | rust, rust-mongo |
| benchmarks/docker-compose.yml (Keycloak image tag change) | Third-party version bump | keycloak + rust baseline |
| Docs, README, .github/* (non-benchmark) | No benchmark impact | Skip entirely |
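
The path-based rules in the table above can be sketched as a small classifier. This is an illustrative JavaScript sketch, not the project's plan_benchmark_run.sh: the names classifyFile and selectServers are hypothetical, the patterns are a literal reading of the table, and the third-party version bump row is deliberately left out because it depends on comparing image tags rather than on paths alone.

```javascript
// Hypothetical sketch of the planner's path-based classification.
// Patterns are a literal reading of the table, not the real script.
const FIRST_PARTY = [/^src\//, /^crates\//, /^Cargo\.(toml|lock)$/, /^Dockerfile/];
const HARNESS = [
  /^benchmarks\/k6\//,
  /^benchmarks\/setup\//,
  /^benchmarks\/run-benchmarks\.sh$/,
];

function classifyFile(path) {
  if (FIRST_PARTY.some((re) => re.test(path))) return "first-party";
  if (HARNESS.some((re) => re.test(path))) return "harness";
  return "no-impact";
}

function selectServers(changedPaths) {
  const classes = new Set(changedPaths.map(classifyFile));
  // Runtime and harness changes both select the two Rust variants.
  if (classes.has("first-party") || classes.has("harness")) {
    return ["rust", "rust-mongo"];
  }
  return []; // nothing relevant changed, so the run can be skipped
}
```

Under these assumptions, selectServers(["README.md"]) returns an empty list, which corresponds to skipping the run entirely.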

The version bump detection is particularly clever. For each third-party server, the planner:

  1. Extracts the current pinned version from docker-compose.yml (or package.json for node-oidc).
  2. Extracts the previous version from the same file at the base_sha (the commit 7 days ago).
  3. If they differ, it flags that server for re-benchmarking and creates a follow-up GitHub Issue assigned to Copilot.
# Simplified version extraction for Keycloak
extract_version_from_stream() {
  local server="$1"
  case "$server" in
    keycloak) sed -n "s#.*quay.io/keycloak/keycloak:\([^[:space:]]*\).*#\1#p" | head -n1 ;;
    hydra)    sed -n "s#.*oryd/hydra:\([^[:space:]]*\).*#\1#p" | head -n1 ;;
    # ...
  esac
}
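
Steps 1 to 3 can be sketched as a comparison of two snapshots of the compose file. The code below is a hypothetical JavaScript port of the sed expressions above, not the planner's actual code: detectBumps and its two-string interface are assumptions, and only the two images shown in the snippet are covered.

```javascript
// Illustrative port of the sed-based extraction; image names come from
// the shell snippet above, everything else is an assumption.
const IMAGE_PATTERNS = {
  keycloak: /quay\.io\/keycloak\/keycloak:(\S+)/,
  hydra: /oryd\/hydra:(\S+)/,
};

function extractVersion(server, composeText) {
  const match = IMAGE_PATTERNS[server].exec(composeText);
  return match ? match[1] : null;
}

// Compare the compose file at the base commit with the current one and
// report every server whose pinned tag differs.
function detectBumps(previousCompose, currentCompose) {
  const bumps = [];
  for (const server of Object.keys(IMAGE_PATTERNS)) {
    const from = extractVersion(server, previousCompose);
    const to = extractVersion(server, currentCompose);
    if (from !== null && to !== null && from !== to) {
      bumps.push({ server, from, to });
    }
  }
  return bumps;
}
```

Each entry in the returned list carries the server name plus the old and new tag, which is the information a follow-up issue would need.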

Why this matters for QA

Traditional QA approaches treat performance testing as a phase: you do it before a release. The planner inverts this: performance testing happens in response to change, on a weekly cadence. If nothing changed that could affect performance, the workflow skips entirely (saving CI minutes). If a third-party dependency bumped, the workflow re-runs just that server plus your baseline.

This is shift-left for performance, but with the intelligence to not waste resources on irrelevant runs.


🧪 The k6 Test Harness: Apples-to-Apples Load Testing

Why k6?

k6 is a load testing tool written in Go with JavaScript test scripts. It's an excellent fit for CI because:

  • No JVM warmup in the test tool: unlike JMeter or Gatling, k6 does not run on the JVM, so the load generator itself needs no JIT warmup that could skew early measurements.
  • Structured JSON output: k6 can log every request with timing data, tags, and metadata, which makes post-hoc analysis straightforward.
  • Docker-native: grafana/k6:0.50.0 runs as a container on the same Docker network as the servers under test.
  • Custom metrics: k6 lets you define Rate and Trend metrics alongside the built-in HTTP metrics, so you can track domain-specific things like token_success rate.

Scenario design

The benchmark suite runs four scenarios against each server: client-credentials, token-introspect, discovery, and health.

The client credentials scenario is the primary benchmark because it exercises the full OAuth2 token issuance path (client authentication, JWT signing, token persistence) in a single HTTP POST, with no browser interaction, no redirects, and no user session state. It's the closest thing to an "apples-to-apples" comparison across servers that have very different architectures.

Fair comparison controls

This is where the harness gets serious about fairness:

| Control | Implementation |
| --- | --- |
| CPU/memory limits | Every server gets exactly 2 CPU cores and 512MB RAM via Docker deploy.resources.limits |
| Same database | All SQL-backed servers share a PostgreSQL 16 instance (separate databases) |
| Sequential execution | Only one server runs at a time, so there is no resource contention |
| Multiple iterations | Each scenario runs 3 times; results are averaged to smooth runner variance |
| JVM warmup | Java-based servers (Keycloak) get warmup requests before measurement begins |
| Same client config | Identical client_id, client_secret, and scope across all servers |
| Same network | All containers on a single Docker bridge network |

The resource limits are particularly important. Without them, Keycloak (Java) would happily consume 2GB of RAM and look great, while Rust's 30MB footprint would be an unfair advantage in a different way. By capping everyone at 512MB, you're testing "how well does this server perform under realistic, constrained conditions?"
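
To make the "multiple iterations" control concrete, here is a minimal sketch of averaging per-iteration summaries. The field names reqPerSec and p99Ms and the function averageRuns are assumptions for illustration; the project's analysis script may aggregate differently. Note that the mean of several p99 values is a smoothing heuristic, not the p99 of the pooled requests.

```javascript
// Hypothetical helper: average each numeric field across iterations.
// Assumes every run object has the same numeric keys.
function averageRuns(runs) {
  if (runs.length === 0) throw new Error("need at least one iteration");
  const averaged = {};
  for (const key of Object.keys(runs[0])) {
    const total = runs.reduce((sum, run) => sum + run[key], 0);
    averaged[key] = total / runs.length;
  }
  return averaged;
}

// Made-up numbers for three iterations of one scenario.
const iterations = [
  { reqPerSec: 1800, p99Ms: 12 },
  { reqPerSec: 1900, p99Ms: 14 },
  { reqPerSec: 1850, p99Ms: 13 },
];
const smoothed = averageRuns(iterations);
// smoothed is { reqPerSec: 1850, p99Ms: 13 }
```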

Load profiles

const profiles = {
  light: [
    { duration: "15s", target: 10 }, // warm-up ramp to 10 VUs
    { duration: "30s", target: 50 }, // ramp from 10 to 50 VUs
    { duration: "15s", target: 50 }, // hold at 50 VUs
    { duration: "10s", target: 0 }, // ramp down
  ],
  medium: [
    { duration: "15s", target: 50 },
    { duration: "30s", target: 200 },
    { duration: "30s", target: 200 },
    { duration: "15s", target: 0 },
  ],
  heavy: [
    { duration: "20s", target: 100 },
    { duration: "30s", target: 500 },
    { duration: "60s", target: 500 },
    { duration: "20s", target: 0 },
  ],
};

The weekly cron runs light by default (10→50 VUs over 70 seconds). Manual dispatch can escalate to medium or heavy to validate scaling behavior. The ramp-up / steady-state / ramp-down pattern is critical: it lets you observe how each server handles increasing load, not just steady-state throughput.
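
As a quick check on the numbers quoted for each profile, the sketch below totals the stage durations and finds the peak VU target. summarizeStages is a hypothetical helper, and it assumes every duration is written in whole seconds (for example "30s"), as in the profiles above.

```javascript
// Stage lists copied from the profiles above.
const stageProfiles = {
  light: [
    { duration: "15s", target: 10 },
    { duration: "30s", target: 50 },
    { duration: "15s", target: 50 },
    { duration: "10s", target: 0 },
  ],
  medium: [
    { duration: "15s", target: 50 },
    { duration: "30s", target: 200 },
    { duration: "30s", target: 200 },
    { duration: "15s", target: 0 },
  ],
  heavy: [
    { duration: "20s", target: 100 },
    { duration: "30s", target: 500 },
    { duration: "60s", target: 500 },
    { duration: "20s", target: 0 },
  ],
};

// Hypothetical helper: total run time and peak virtual users of a profile.
function summarizeStages(stages) {
  let totalSeconds = 0;
  let peakVus = 0;
  for (const { duration, target } of stages) {
    totalSeconds += parseInt(duration, 10); // "30s" parses to 30
    peakVus = Math.max(peakVus, target);
  }
  return { totalSeconds, peakVus };
}

summarizeStages(stageProfiles.light);  // { totalSeconds: 70, peakVus: 50 }
summarizeStages(stageProfiles.medium); // { totalSeconds: 90, peakVus: 200 }
summarizeStages(stageProfiles.heavy);  // { totalSeconds: 130, peakVus: 500 }
```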


📊 Results Analysis and Trend Detection

After k6 finishes, analyze-results.sh parses the JSON summaries and generates:

  1. A Markdown comparison report with per-scenario tables (req/s, avg/median/p95/p99 latency, error rate)
  2. Visual bar charts (ASCII/Unicode) showing relative throughput and latency
  3. Mermaid pie charts showing throughput share per scenario
  4. A performance multiplier table โ€” "how does each server compare to the Rust baseline?"
  5. CSV export for external analysis

The metrics that matter for trend detection

The analysis captures six key metrics per server per scenario:

| Metric | What it tells you | Trend signal |
| --- | --- | --- |
| Req/s | Raw throughput | Week-over-week drop = regression |
| Avg latency | Mean response time | Gradual increase = creeping degradation |
| p95 latency | Tail performance | Spike = new contention point |
| p99 latency | Worst-case behavior | Sudden jump = new code path hitting edge case |
| Error rate | Reliability under load | Any increase = functional regression |
| Throughput ratio vs baseline | Relative positioning | Ratio change = server-specific regression |

The throughput ratio is the most powerful trend signal. If your Rust server typically handles 4x the requests of Keycloak for client-credentials, and that ratio drops to 2.5x while Keycloak's absolute numbers hold steady, the regression is most likely in your code rather than environmental noise. Noise from the runner tends to hit every server in the same run, so it largely cancels out of the ratio.
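
The ratio signal is simple arithmetic, sketched here under assumed inputs. The function name rustAdvantage, the flat map from server name to req/s, and the numbers are all made up for illustration; they are not measurements from the project.

```javascript
// Hypothetical helper: how many times more requests per second the
// baseline server handles compared with each other server.
function rustAdvantage(reqPerSec, baseline = "rust") {
  const ratios = {};
  for (const [server, value] of Object.entries(reqPerSec)) {
    if (server !== baseline) ratios[server] = reqPerSec[baseline] / value;
  }
  return ratios;
}

// Invented numbers: Keycloak is flat, Rust throughput falls.
const lastWeek = rustAdvantage({ rust: 2000, keycloak: 500 });
const thisWeek = rustAdvantage({ rust: 1250, keycloak: 500 });
// lastWeek.keycloak is 4, thisWeek.keycloak is 2.5
```

Because Keycloak's own number did not move in this example, the shrinking ratio points at the Rust side.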

CI manifest for traceability

Every run writes a ci-run-manifest.json that captures:

{
  "generated_at": "2026-04-08T12:30:34Z",
  "event_name": "workflow_dispatch",
  "since_date": "2026-04-01T12:30:34Z",
  "recent_commit_count": 5,
  "selected_servers": ["rust", "rust-mongo"],
  "profile": "light",
  "scenarios": [
    "client-credentials",
    "token-introspect",
    "discovery",
    "health"
  ],
  "iterations": 1,
  "github_run_id": "24135411134",
  "github_sha": "455c3491241395664c1d6c116af9d28c8c7a1682"
}

This is your audit trail. When you're investigating a regression three months from now, you can look at the manifest to see exactly what was tested, which servers were selected, what the commit SHA was, and how many commits had landed since the last run.
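
One use of the manifest is to confirm that two runs are comparable before reading anything into their difference. The sketch below is an assumption about how that could look: manifestMismatches is a hypothetical helper, and the choice of fields to compare is illustrative. The field names themselves come from the manifest shown above.

```javascript
// Hypothetical helper: list the manifest fields on which two runs differ.
function manifestMismatches(a, b, fields = ["profile", "scenarios", "iterations"]) {
  // JSON.stringify is enough for a deep comparison of strings, numbers, and arrays.
  return fields.filter(
    (field) => JSON.stringify(a[field]) !== JSON.stringify(b[field])
  );
}

// Trimmed, invented manifests for two runs.
const manualRun = {
  profile: "light",
  scenarios: ["client-credentials", "token-introspect", "discovery", "health"],
  iterations: 1,
};
const scheduledRun = {
  profile: "light",
  scenarios: ["client-credentials", "token-introspect", "discovery", "health"],
  iterations: 3,
};
const differingFields = manifestMismatches(manualRun, scheduledRun);
// differingFields is ["iterations"]
```

A non-empty result suggests the two runs were configured differently, so their numbers should not be compared directly.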


🔄 The Full Pipeline: From Cron to Issue

The version bump issue workflow

When the planner detects that a third-party server's pinned version changed (say Keycloak went from 24.0 to 25.0), it doesn't just re-run benchmarks. It also creates a GitHub Issue with:

  • Which versions changed and in which direction
  • A checklist of follow-up tasks (review compatibility, refresh baselines, update docs)
  • Assignment to copilot for automated triage

This closes the loop between "something changed" and "someone needs to act on it."


🤔 What Makes This Technically Interesting

1. Partial runs with baseline reuse

Most benchmark CI setups are all-or-nothing: every server is re-run, or the job is skipped. This workflow introduces partial runs with baseline reuse. If only the Rust server changed, the workflow runs k6 against rust and rust-mongo, then merges those fresh results with the checked-in baseline files for Keycloak, Hydra, Authentik, and node-oidc that already exist in benchmarks/results/.

The result: a merged comparison artifact that covers all six servers, even though only two were actually benchmarked this week. This keeps weekly runs under 30 minutes instead of 3+ hours.
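
A minimal sketch of the merge step, assuming each result set is an object keyed by server name. The function mergeWithBaseline, the source label, and the sample numbers are hypothetical; the real workflow merges files under benchmarks/results/, and their exact format is not shown here.

```javascript
// Hypothetical helper: fresh results override the checked-in baseline,
// and every entry records where it came from.
function mergeWithBaseline(baseline, fresh) {
  const merged = {};
  for (const [server, result] of Object.entries(baseline)) {
    merged[server] = { ...result, source: "baseline" };
  }
  for (const [server, result] of Object.entries(fresh)) {
    merged[server] = { ...result, source: "fresh" };
  }
  return merged;
}

// Invented numbers: only the Rust server was re-benchmarked this week.
const mergedReport = mergeWithBaseline(
  { rust: { reqPerSec: 1800 }, keycloak: { reqPerSec: 480 } },
  { rust: { reqPerSec: 1850 } }
);
// mergedReport.rust.source is "fresh", mergedReport.keycloak.source is "baseline"
```

Tagging each row with its source keeps the merged report honest about which numbers were actually measured in the current run.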

2. The cron offset trick

cron: "17 3 * * 1"

The cron fires at 03:17 UTC on Mondays, not :00. This is a deliberate choice. GitHub Actions cron scheduling is best-effort, and workflows scheduled at the top of the hour compete for runner allocation. Offsetting by 17 minutes reduces scheduler contention and makes the run more likely to start on time.

3. Language-diverse comparison as a regression control

Benchmarking Rust against Keycloak (Java), Hydra (Go), Authentik (Python), and node-oidc (Node.js) isn't just about language comparisons. It's a control group. If all five servers show a throughput drop in the same week, the regression is environmental (runner hardware, Docker version, PostgreSQL update). If only the Rust server drops, the regression is in your code.

This is the same principle as running a positive and negative control in a science experiment. The third-party servers are your control group.
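
The control-group reasoning can be written down as a small decision rule. This is a sketch under stated assumptions: classifyDrop is a hypothetical helper, the 10% threshold is an arbitrary illustrative cut-off, and the req/s values in the usage example are invented.

```javascript
// Hypothetical helper: decide whether a throughput drop looks environmental
// or specific to the first-party servers. prev and curr map server -> req/s
// and must contain the same servers, including the third-party controls.
function classifyDrop(prev, curr, firstParty, threshold = 0.1) {
  const servers = Object.keys(prev);
  const dropped = servers.filter(
    (server) => (prev[server] - curr[server]) / prev[server] > threshold
  );
  if (dropped.length === 0) return "no-regression";
  if (dropped.length === servers.length) return "environmental";
  if (dropped.every((server) => firstParty.includes(server))) return "first-party";
  return "mixed";
}

const previousWeek = { rust: 1850, keycloak: 480, hydra: 900 };
const currentWeek = { rust: 1420, keycloak: 478, hydra: 905 };
const verdict = classifyDrop(previousWeek, currentWeek, ["rust", "rust-mongo"]);
// verdict is "first-party": only the Rust server dropped by more than 10%
```

The rule only works when the control servers are part of the comparison; with first-party servers alone, every drop would look environmental.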

4. Docker resource limits as an equalizer

deploy:
  resources:
    limits:
      cpus: "2"
      memory: 512M

By enforcing identical resource constraints, the benchmark measures efficiency, not capacity. A JVM-based server that needs 2GB to perform well will hit OOM or GC thrashing at 512MB, which is exactly what happens in production when you're running multiple services on the same node. These constraints simulate a realistic, resource-constrained deployment.

5. Structured output for downstream automation

Every benchmark run produces:

  • JSON summaries per server/scenario/iteration
  • Raw k6 JSON stream with per-request timing data
  • CSV for spreadsheet analysis
  • Markdown report with Mermaid charts for PR comments
  • CI manifest for audit trail

This structured output means you can build downstream automation: a Grafana dashboard that ingests the CSV, a Slack bot that posts the Markdown report, or a custom script that compares this week's manifest to last week's and flags regressions.


🔍 Spotting Regressions: A Practical Framework

Here's how to use weekly benchmark data to catch performance problems before users do:

Week-over-week comparison

Week N:   Rust client-credentials → 1,850 req/s, p99 = 12ms
Week N+1: Rust client-credentials → 1,420 req/s, p99 = 28ms
                                  ↓ 23% throughput drop, 133% p99 increase

A 23% throughput drop is well outside normal CI runner variance (typically ±5-10%). This warrants investigation.
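
The percentages above come from a plain relative-change calculation, sketched below. percentChange and flagRegression are hypothetical helpers, and the 10% noise band is an assumption taken from the upper end of the variance range quoted above.

```javascript
// Relative change from one week to the next, in percent.
function percentChange(before, after) {
  return ((after - before) / before) * 100;
}

// Hypothetical helper: flag a run when throughput falls or p99 rises
// by more than the assumed noise band.
function flagRegression(prev, curr, noisePct = 10) {
  const throughputPct = percentChange(prev.reqPerSec, curr.reqPerSec);
  const p99Pct = percentChange(prev.p99Ms, curr.p99Ms);
  return {
    throughputPct,
    p99Pct,
    regressed: throughputPct < -noisePct || p99Pct > noisePct,
  };
}

const weekOverWeek = flagRegression(
  { reqPerSec: 1850, p99Ms: 12 },
  { reqPerSec: 1420, p99Ms: 28 }
);
// Math.round(weekOverWeek.throughputPct) is -23
// Math.round(weekOverWeek.p99Pct) is 133
// weekOverWeek.regressed is true
```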

Ratio stability check

Week N:   Rust/Keycloak ratio = 3.8x throughput
Week N+1: Rust/Keycloak ratio = 2.1x throughput
          Keycloak absolute numbers unchanged
          → Regression is in Rust server code

Latency distribution shifts

Watch for p95/p99 diverging from median. If median stays flat but p99 doubles, you have a new slow path that only triggers under specific conditions, often a cache miss, a new database query, or lock contention.
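
One way to quantify this divergence is the ratio of p99 to median, compared across weeks. The sketch below is illustrative: tailRatio, tailDiverged, the field names, and the 1.5x trigger are assumptions, not values from the project.

```javascript
// How far the tail sits above the typical request.
function tailRatio(sample) {
  return sample.p99Ms / sample.medianMs;
}

// Hypothetical check: true when the tail ratio grew by at least `factor`
// between two weeks, even if the median itself did not move.
function tailDiverged(prev, curr, factor = 1.5) {
  return tailRatio(curr) / tailRatio(prev) >= factor;
}

// Invented latencies: the median is flat while p99 doubles.
const newSlowPath = tailDiverged(
  { medianMs: 4, p99Ms: 12 },
  { medianMs: 4, p99Ms: 24 }
);
// newSlowPath is true

// Invented latencies: everything got uniformly slower, so the shape is unchanged.
const uniformSlowdown = tailDiverged(
  { medianMs: 4, p99Ms: 12 },
  { medianMs: 8, p99Ms: 24 }
);
// uniformSlowdown is false
```

A uniform slowdown still deserves attention, but it shows up in the week-over-week comparison rather than in this shape check.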

Error rate as a canary

Any non-zero error rate under light load is a red flag. The thresholds in the k6 scripts are looser than that and act as a hard backstop:

thresholds: {
  http_req_duration: ["p(95)<2000", "p(99)<5000"],
  http_req_failed: ["rate<0.05"],
  http_reqs: ["rate>0"],
}

If the http_req_failed rate reaches 5%, k6 marks the test as failed and exits with a non-zero status. Combined with the GitHub Actions failure status, this becomes an automatic regression gate.


🛠️ Adapting This Pattern for Your Projects

The architecture is transferable to any project that has HTTP endpoints worth benchmarking. Here's what you'd need:

  1. k6 scenario scripts: one per endpoint or workflow you want to track.
  2. A Docker Compose file: your server + dependencies with resource limits.
  3. A planner script: optional but valuable if you have multiple services to benchmark selectively.
  4. An analysis script: parse k6 JSON output, compute trends, generate reports.
  5. A GitHub Actions workflow: cron schedule + workflow_dispatch for manual runs.

The key insight is that you don't need a dedicated performance testing infrastructure. GitHub Actions runners are noisy, but they're consistently noisy. The same runner variance that makes absolute numbers unreliable makes relative trends surprisingly stable.


🏁 Key Takeaways

  • Performance testing belongs in CI, not in a pre-release phase. Weekly cadence with smart selection keeps it affordable.
  • Trend detection beats point-in-time measurement. A single benchmark number is noise. A trendline over 12 weeks is signal.
  • Cross-server comparison is a regression control group. If only your server slowed down, it's your bug. If everyone slowed down, it's the environment.
  • k6 is an excellent CI-native load testing tool: lightweight, structured output, Docker-friendly, no JVM overhead in the test harness itself.
  • Partial runs with baseline reuse make weekly cross-server benchmarks practical without burning hours of CI time.
  • Automated issue creation for version bumps closes the loop between "dependency changed" and "someone needs to check performance impact."

The Weekly Benchmarks workflow and the full benchmark harness are open source. Fork it, adapt the k6 scenarios to your endpoints, and start building your own performance trendline.



Ian Lintner

Full Stack Developer
