Performance Analysis (Theoretical): 30-Second Check-ins on 4 vCPU / 8 GB / SSD
These projections are based on benchmarked per-request costs from the codebase, not measured end-to-end under production load. Actual performance will vary with hardware, network conditions, policy complexity, and fleet composition.
This document analyzes Vigo server performance under 30-second check-in intervals on modest hardware. 30-second intervals are the recommended setting for fleets where responsiveness matters — drift correction within 30 seconds, stale detection within 90 seconds, and a dashboard that feels alive.
All numbers reference benchmarked values from the codebase or are derived from measured operations. See performance-1m.md for the per-request cost derivations.
Hardware Assumptions
- 4 vCPU (modern x86_64, e.g., Hetzner CPX21, DigitalOcean c-4)
- 8 GB RAM
- SSD storage (NVMe or SATA SSD, ~50k random IOPS)
- 1 Gbps network
Responsiveness Profile
| Metric | Value |
|---|---|
| Drift correction latency | 0-30 seconds (avg ~15s, half the interval) |
| Stale detection (3x interval) | 90 seconds |
| Config publish to full fleet convergence | ~30 seconds |
| New envoy visible on dashboard | ~30 seconds |
| Force-convergence effect | Within next check-in (~15s avg) |
| Dashboard compliance counter freshness | Updates every few seconds |
This feels like a live system. An operator pushes config and sees envoys converge before they can switch browser tabs.
Request Rates
With 10% jitter (default), agents spread check-ins across a 27-33 second window (30s +/- 10%):
| Fleet Size | Avg RPS | Peak RPS (burst) | Status |
|---|---|---|---|
| 100 | 3.3 | ~5 | Trivial |
| 500 | 17 | ~25 | Trivial |
| 1,000 | 33 | ~50 | Trivial |
| 5,000 | 167 | ~250 | Comfortable |
| 10,000 | 333 | ~500 | Comfortable |
| 15,000 | 500 | ~750 | Tuning recommended |
| 25,000 | 833 | ~1,200 | Spanner or upgrade recommended |
CPU Usage
Realistic per-check-in CPU cost including gRPC framing, protobuf deserialization, goroutine scheduling, and TLS record layer: 200-300 us (no-change) and 500-800 us (full bundle).
| Fleet Size | Avg RPS | CPU (sustained no-change) | CPU (config change storm) |
|---|---|---|---|
| 1,000 | 33 | ~1% of 1 core | ~2% of 1 core |
| 5,000 | 167 | ~5% of 1 core | ~12% of 1 core |
| 10,000 | 333 | ~10% of 1 core | ~24% of 1 core |
| 15,000 | 500 | ~15% of 1 core | ~36% of 1 core |
| 25,000 | 833 | ~25% of 1 core | ~60% of 1 core |
With 4 cores available, CPU is not a concern up to 25,000 envoys. The theoretical single-core limit is ~3,000-5,000 no-change check-ins/sec; with 4 cores, ~12,000-20,000 RPS before saturation.
Config change storm: After vigocli config publish, all envoys receive a full bundle on their next check-in. At 10,000 envoys, this means ~333 full-bundle responses/sec sustained over ~30 seconds. CPU impact: ~24% of one core for 30 seconds. Noticeable but well within headroom.
Memory Usage
Memory is dominated by gRPC connection buffers (64 KiB per persistent mTLS connection), not by the FleetIndex or check-in frequency. The interval does not change per-connection memory — only the number of concurrent connections matters.
| Component | 1,000 envoys | 5,000 envoys | 10,000 envoys | 25,000 envoys |
|---|---|---|---|---|
| Go runtime + gRPC server | ~150 MiB | ~180 MiB | ~200 MiB | ~300 MiB |
| FleetIndex (state + indexes) | 1.6 MiB | 8 MiB | 16 MiB | 39 MiB |
| Policy cache | <1 MiB | <1 MiB | <1 MiB | <1 MiB |
| gRPC connection buffers | 64 MiB | 320 MiB | 640 MiB | 1.6 GiB |
| Goroutine stacks | <1 MiB | <1 MiB | ~1.5 MiB | ~4 MiB |
| SQLite page cache | 100-300 MiB | 100-300 MiB | 100-500 MiB | 200-500 MiB |
| Total | ~350 MiB | ~850 MiB | ~1.4 GiB | ~2.6 GiB |
At 10,000 envoys, memory sits at ~1.4 GiB of 8 GB — comfortable, with roughly 6.6 GiB of headroom. The constraint arrives around 25,000 envoys, where gRPC buffers alone consume 1.6 GiB.
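The dominant term, connection-buffer memory, is easy to recompute. A sketch using the per-connection figure quoted above (the table rounds 625 MiB up to 640; the tuned figure assumes the 16 KiB buffers from the tuning section, halving the per-connection cost):

```go
package main

import "fmt"

// connBufMiB: persistent mTLS connections times per-connection buffer
// size. Default gRPC buffers cost ~64 KiB per connection; tuned
// 16 KiB read/write buffers cut that to ~32 KiB.
func connBufMiB(fleet, perConnKiB int) float64 {
	return float64(fleet*perConnKiB) / 1024
}

func main() {
	fmt.Printf("10k envoys, default buffers: %.1f MiB\n", connBufMiB(10000, 64)) // 625.0
	fmt.Printf("10k envoys, tuned buffers:   %.1f MiB\n", connBufMiB(10000, 32)) // 312.5
}
```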
Network Bandwidth
Request sizes: ~3 KiB check-in request (includes traits JSON + signature), ~100 bytes no-change response, ~30 KiB full bundle (with stub optimization).
| Fleet Size | Steady-state (bidirectional) | Config change burst (30s) |
|---|---|---|
| 1,000 | ~100 KB/s | ~1 MB/s |
| 5,000 | ~500 KB/s | ~5 MB/s |
| 10,000 | ~1 MB/s | ~10 MB/s |
| 25,000 | ~2.5 MB/s | ~25 MB/s |
Steady-state includes both request (3 KiB) and response (100 bytes) traffic. Even at 25,000 envoys, sustained bandwidth is 2.5 MB/s — 2% of a 1 Gbps link.
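These rows follow directly from rate times message size. A sketch using the sizes quoted above (rounding differs slightly from the table, which rounds the burst figure down):

```go
package main

import "fmt"

// mbps: requests/sec times total bytes moved per exchange, in MB/s.
func mbps(rps float64, reqBytes, respBytes int) float64 {
	return rps * float64(reqBytes+respBytes) / 1e6
}

func main() {
	// 10,000 envoys at 30s -> ~333 RPS.
	fmt.Printf("steady state: %.2f MB/s\n", mbps(333, 3*1024, 100))     // ~1 MB/s
	fmt.Printf("change burst: %.1f MB/s\n", mbps(333, 3*1024, 30*1024)) // ~10-11 MB/s
}
```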
SQLite Write Load
The flusher runs every 10 seconds. At 30-second check-in intervals, each envoy checks in once per three flush windows, and the flusher writes only the latest state per envoy — so the number of dirty envoys per flush is capped at the fleet size, the same ceiling as at 1-minute intervals. The table below assumes the worst case, with every envoy dirty.
| Fleet Size | Dirty envoys per flush (worst case) | Batch transactions | SQLite write time |
|---|---|---|---|
| 1,000 | ~1,000 | 2 | ~5 ms |
| 5,000 | ~5,000 | 10 | ~25 ms |
| 10,000 | ~10,000 | 20 | ~50 ms |
| 25,000 | ~25,000 | 50 | ~125 ms |
All well within the 10-second flush window. WAL mode ensures check-in handlers are never blocked by writes.
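The write-coalescing behavior described above can be sketched as a dirty map that later check-ins overwrite (illustrative only, not Vigo's actual flusher):

```go
package main

import (
	"fmt"
	"sync"
)

// flusher coalesces writes: each check-in overwrites the dirty entry
// for its envoy, so a flush writes at most one row per envoy no
// matter how often the envoy checked in since the last flush.
type flusher struct {
	mu    sync.Mutex
	dirty map[string]string // envoy ID -> latest serialized state
}

func newFlusher() *flusher {
	return &flusher{dirty: make(map[string]string)}
}

func (f *flusher) markDirty(id, state string) {
	f.mu.Lock()
	defer f.mu.Unlock()
	f.dirty[id] = state // overwrite, never append
}

// flush swaps out the dirty set; the real flusher would write the
// batch in chunked SQLite transactions. Returns rows written.
func (f *flusher) flush() int {
	f.mu.Lock()
	batch := f.dirty
	f.dirty = make(map[string]string)
	f.mu.Unlock()
	return len(batch)
}

func main() {
	fl := newFlusher()
	fl.markDirty("envoy-1", "state-a")
	fl.markDirty("envoy-1", "state-b") // coalesced with the line above
	fl.markDirty("envoy-2", "state-a")
	fmt.Println("rows this flush:", fl.flush()) // 2
}
```

Because the map swap happens under a short lock and the writes go to a separate batch, check-in handlers only contend for the mark, never for the SQLite transaction itself.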
Traits writes: At 30-second intervals, agents send traits twice as often as at 1-minute intervals — but the hash-dedup still filters 95%+ of check-ins (traits don't change between 30-second cycles). No meaningful increase in traits write volume.
Comparison to 1-Minute and 5-Minute Intervals
| Metric (10k fleet) | 5-min (default) | 1-min | 30-sec |
|---|---|---|---|
| Avg RPS | 33 | 167 | 333 |
| Drift correction (avg) | 2.5 min | 30s | 15s |
| Stale detection | 15 min | 3 min | 90s |
| CPU (sustained) | ~1% | ~5% | ~10% |
| Network (steady-state) | ~100 KB/s | ~500 KB/s | ~1 MB/s |
| Memory | ~1.4 GiB | ~1.4 GiB | ~1.4 GiB |
| SQLite flush | ~50 ms | ~50 ms | ~50 ms |
Memory and SQLite load are identical across intervals because they scale by fleet size, not check-in frequency. CPU and network scale linearly with frequency.
Capacity Recommendations
No Tuning: Up to 10,000 Envoys
Default settings handle 10,000 envoys at 30-second check-ins:
- ~10% of one core (sustained), ~24% on config change burst
- ~1.4 GiB memory
- ~1 MB/s network
- 50 ms SQLite flush every 10 seconds
With Tuning: 10,000-15,000 Envoys
```yaml
checkin:
  interval: "30s"
  jitter_percent: 10
tuning:
  gogc: 200
  memory_limit: "6GiB"
  grpc_read_buffer: 16384
  grpc_write_buffer: 16384
```
Reduces gRPC buffer memory by half (320 MiB at 10k instead of 640 MiB). Comfortable up to 15,000 envoys on 8 GB.
Beyond 15,000 Envoys
Options:
- Upgrade to 8 vCPU / 16 GB — handles ~30,000 envoys at 30s intervals with tuning
- Spanner (hub-spoke) — shard fleet across spoke servers, each handling its own subset
Streaming vs. Polling at 30 Seconds
At 30-second polling, the latency for ad-hoc tasks is "up to 30 seconds" (average 15 seconds). For most operational workflows, this is fast enough. Streaming is only worth enabling if you need sub-second task dispatch or if the fleet runs frequent live queries.
Bottleneck Hierarchy (30-Second Intervals)
- Memory (gRPC buffers) — 640 MiB at 10k connections. First hard limit at ~20k envoys on 8 GB (with tuned 16 KiB buffers, ~25k).
- CPU (ED25519 verify) — 10% of one core at 10k envoys. Saturates all 4 cores at ~40k-60k envoys.
- SQLite flusher — 125 ms at 25k envoys. Not a concern until ~80k+ envoys per flush window.
- Network — 2.5 MB/s at 25k envoys. Under 3% of 1 Gbps.
Confidential -- Alexander4, LLC. Not for redistribution. See documentation-license.