server.yaml Reference
Complete reference for all server.yaml configuration sections. The server looks for server.yaml in the current working directory. Override with VIGO_CONFIG_FILE=/path/to/server.yaml.
Fields that accept secret:path are resolved through the secrets backend at startup.
server
Ports and TLS configuration.
server:
  grpc_listen: ":1530"  # gRPC port for agent check-ins
  api_listen: ":8443"  # REST API + web UI port
  hostname: "myserver.example.com"  # server's enrolled envoy hostname
  tls:
    cert_file: "tls/cert"  # server TLS certificate
    key_file: "tls/key"  # server TLS private key
    ca_file: "tls/ca"  # certificate authority (verifies agents)
    ca_key_file: "tls/ca_key"  # CA private key (enables bootstrap CSR signing)
  tls_sans:  # extra SANs for auto-generated certs
    - "192.168.1.2"  # host LAN IP (needed in Docker)
    - "vigo.example.com"  # DNS name agents use
| Field | Default | Description |
|---|---|---|
| grpc_listen | :1530 | gRPC listen address for agent communication (mTLS) |
| api_listen | :8443 | REST API, web UI, and metrics listen address (TLS) |
| hostname | (auto) | Server's enrolled envoy hostname, and the address agents are told to dial (GET /bootstrap advertises it). Set this when running in Docker: the container's own hostname is its container ID, not a name agents can reach. It is automatically added to the auto-generated server cert's SANs, so you do not also need to list it under tls_sans. |
| tls.cert_file | tls/cert | Path to server TLS certificate |
| tls.key_file | tls/key | Path to server TLS private key |
| tls.ca_file | tls/ca | Path to CA certificate for mTLS verification |
| tls.ca_key_file | (optional) | CA private key. Enables bootstrap enrollment and CSR signing |
| tls_sans | (empty) | Additional Subject Alternative Names for auto-generated certs. localhost, the host's interface IPs, and server.hostname are already included automatically; list only names/IPs beyond those. |
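As a minimal sketch of the Docker case described in the hostname row, the fragment below sets hostname explicitly and lists only the host LAN IP under tls_sans, since hostname is added to the certificate SANs automatically. The hostname and IP are placeholder values, not defaults.

```yaml
server:
  grpc_listen: ":1530"
  api_listen: ":8443"
  hostname: "vigo.example.com"   # placeholder: the name agents are told to dial
  tls_sans:
    - "192.168.1.2"              # placeholder host LAN IP; hostname above is already a SAN
```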
database
Storage backend. SQLite only — postgres is rejected at startup in this release.
database:
  dsn: "secret:vigo/db/dsn"  # SQLite file path
  driver: sqlite3  # sqlite3 (default); only supported driver
  max_open_conns: 0  # read-pool size (0 = auto: CPU count, min 4)
  retention: "30d"  # auto-prune old runs/task runs/workflows
| Field | Default | Description |
|---|---|---|
| dsn | (required) | SQLite database file path. Supports secret: prefix |
| driver | sqlite3 | Database driver. Only sqlite3 is supported; the server rejects postgres at startup (the Postgres migration set is incomplete). |
| max_open_conns | 0 | Maximum concurrent database connections. SQLite runs two pools: writes always serialize on a dedicated single-writer connection, and this knob sizes the concurrent read pool, so it caps reader concurrency and bounds the CGO worker threads the driver pins per in-flight query. An unbounded pool is what lets a write storm exhaust the process's OS threads. 0 = auto (CPU count, min 4). |
| retention | 30d | Auto-prune data older than this duration. Applies to runs (cascades run_results), task_runs, workflow_runs, file_snapshots, convergence_history, compliance_history. audit_entries is never trimmed (tamper chain). |
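To make the auto-sizing rule concrete, here is a hedged example that leaves max_open_conns on auto and lengthens retention. The file path and the 90d value are illustrative assumptions; the resolved pool sizes in the comment follow from the "CPU count, min 4" rule stated in the table.

```yaml
database:
  dsn: "/srv/vigo/vigo.db"   # hypothetical path; a secret: reference works as well
  driver: sqlite3
  max_open_conns: 0          # auto: a 2-CPU host resolves to 4 (the minimum), a 16-CPU host to 16
  retention: "90d"           # keep 90 days instead of the 30d default
```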
secrets
How secret: references in config values are resolved.
# Local backend (default): encrypted files on disk
secrets:
  backend: local
  key_file: "/srv/vigo/master.key"  # AES-256-GCM key (0600); omit to auto-generate
  secrets_dir: "/srv/vigo/secrets"  # encrypted secret files
# Isopass backend: external secrets API
secrets:
  backend: isopass
  url: "https://isopass.internal:8443"
  token_file: "/srv/vigo/isopass-token"
  tls_skip_verify: false
| Field | Default | Description |
|---|---|---|
| backend | local | local or isopass |
| key_file | (auto-generated) | AES-256-GCM encryption key for local backend |
| secrets_dir | /srv/vigo/secrets | Directory for encrypted secret files |
| url | (required for isopass) | Isopass API base URL |
| token_file | (required for isopass) | Bearer token file path (0600) |
| tls_skip_verify | false | Isopass: skip TLS verification (dev only) |
checkin
Agent polling behavior.
checkin:
  interval: "1s"  # agent poll frequency (default 1s; sets the fleet cadence)
  jitter_percent: 20  # randomize timing by 0-N%
  bundle_max_age: "24h"  # signed bundle TTL on agents (0 = forever)
  max_concurrent: 0  # cap on concurrent CheckIn handlers (0 = CPU-scaled default; always on)
  report_max_concurrent: 0  # cap on concurrent ReportResult handlers (0 = built-in 2048; always on)
  max_connections: 0  # cap on concurrently-held agent streams (0 = auto-derive safe ceiling from host RAM; -1 = uncapped)
  stream_signature: all  # held-stream per-message verification: all (default) | handshake-only (security relaxation)
| Field | Default | Description |
|---|---|---|
| interval | 1s | How often agents check in. The server pushes this to agents, so it sets the effective fleet poll cadence. The staleness threshold is not a fixed multiple of this: it is 2.5 × max(interval, observed-cycle), floored at 30s, so a fast cadence still tolerates a few missed beats (see offline_threshold to override) |
| jitter_percent | 20 | Random jitter to prevent thundering herd |
| bundle_max_age | 24h | How long agents trust their cached policy bundle |
| max_concurrent | 0 | Cap on concurrent CheckIn handlers. The cap gates the policy-bundle build (CPU-heavy) inside the handler, so a fleet-wide stale-checkin wave (post-publish, mass re-enroll) sheds via ResourceExhausted → agent backoff instead of CPU-cascading. Always on: 0 = a CPU-scaled built-in (max(128, GOMAXPROCS*16); GOMAXPROCS tracks the container's cgroup CPU limit on Go 1.25+), not disabled. Higher admits more before shedding (more CPU thrash under a wave); lower sheds sooner. Watch vigo_checkins_total{status="shed"} |
| report_max_concurrent | 0 | Cap on concurrent ReportResult handlers in front of the single SQLite writer. Overflow returns ResourceExhausted so the agent backs off. Always on: 0 = the built-in 2048 (a fixed value, where max_concurrent's default is CPU-scaled), not disabled. Higher admits more before shedding (deeper queue, higher tail latency under sustained overload); lower sheds sooner (tighter tail). Watch vigo_reportresults_total{status="shed"} |
| max_connections | 0 | Cap on concurrently-held agent streams. A connected envoy costs ~623 KB of live server heap with a realistic ~150 KB inventory (FleetIndex + inventory cache + the held stream; see the sizing guide), so connection count, not request rate, is what drives the server toward its memory ceiling. Past this cap, new streams are refused with ResourceExhausted (the agent backs off and retries) so the server sheds gracefully instead of climbing toward the wall. 0 (default) = auto-derive a cap from host RAM; a positive value sets an explicit cap; -1 disables it (legacy uncapped). Caveat: the auto-derived cap is built from a connection-buffer estimate (~220 KB) and is optimistic against the real ~623 KB inventory-aware cost. On a memory-bound box it can sit above the actual safe ceiling (e.g. it auto-derives ~59,746 on a 32 GiB host, but the measured safe ceiling there is ~30,000). Don't treat the cap alone as your limit: size from the sizing guide and set an explicit value if needed. Watch vigo_streams_active against the effective value, and vigo_checkins_total{status="stream_shed"} for rejections |
| stream_signature | all | Per-message ed25519 verification on held agent streams. all (default) verifies every check-in/report, exactly as a unary request. handshake-only verifies the handshake plus any proxied (foreign-envoy) message, but skips re-verifying the stream owner's own subsequent check-ins, relying on the verified handshake + mTLS session integrity (the same handshake-binding model Delta already uses). Per-message verify continuously proves possession of the envoy's ed25519 key on a held stream. Independently, since 0.76.35 the server pins each envoy's mTLS client cert: the verify path rejects any connection whose TLS peer-cert SubjectPublicKeyInfo ≠ the SPKI captured at enrollment (cert_spki_sha256), so a TLS-terminating proxy/MITM presenting a different CA-signed cert, even one whose CN/SAN matches the hostname, is refused before any check-in is accepted. That closes the relay/forgery path for the realistic threat in all modes, which is what makes handshake-only safe by construction here rather than a trust-the-operator relaxation. Unary requests are always verified regardless. handshake-only saves ~0.10 ms/checkin (~13% of gated per-checkin CPU; measured 0.76.29), the largest remaining steady-state app cost once persist is gated (ADR-033 Option C). The gRPC path must be direct or L4 pass-through mTLS: a TLS-terminating proxy is rejected by the cert pin (its cert isn't the enrolled one), vigo_sig_verify_failed_total{reason="cert_pin_mismatch"} increments, and a throttled WARN names the envoy. Grandfathered envoys enrolled before 0.76.35 carry no pin and skip enforcement until they re-enroll (a once-per-envoy WARN names them). The only residual the pin can't catch is an attacker who stole the envoy's cert private key but not its ed25519 key (same host filesystem, so unrealistic); closing even that would need RFC 5705 channel-binding on the signature (tracked in pending). Prefer raising interval or sharding via spanner first. When set, the server logs a startup WARN, records a security.stream_signature_relaxed audit event, and exposes vigo_checkin_signature_mode{mode="handshake-only"}; a detected proxy increments vigo_checkin_proxy_detected_total and logs a WARN once per process |

The all mode does not close the proxy gap for Delta: direct end-to-end mTLS is the content-integrity boundary in every mode. The proxy heuristic above runs in all modes (not just handshake-only): any held stream whose TLS peer-cert CN/SAN ≠ the envoy hostname increments vigo_checkin_proxy_detected_total and logs a WARN once per process. This matters because Delta run-result events (the lightweight convergence-report path) carry no per-message signature in any mode. They ride the same handshake-binding model, so the ed25519 layer proves authorship of the handshake, not the content of each Delta. The all mode re-verifies every check-in, but it does not sign Delta, so a TLS-terminating proxy on the gRPC path can forge an envoy's convergence/run-status reports even under the default. The integrity guarantee for Delta content is direct agent↔server mTLS end-to-end, and since 0.76.35 the cert pin enforces it: a terminating proxy (an L7 LB or service mesh on :1530) presents a non-enrolled cert and is rejected, so it can no longer establish the stream to forge Delta, which closes this for pinned (post-0.76.35-enrolled) envoys. Grandfathered envoys stay on the warn-only model until they re-enroll. (Cryptographic per-Delta content integrity that survives a terminating proxy would be a separate signed-convergence-evidence feature, not a stream_signature setting.)
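The staleness formula and the per-connection heap figure from the table can be worked through in a config fragment. This is a sketch under stated assumptions: the 60s interval is an arbitrary illustration, the observed cycle is assumed equal to the interval, and the explicit cap simply reuses the ~30,000 figure quoted above for a 32 GiB host.

```yaml
checkin:
  interval: "60s"          # illustrative; slower than the 1s default
  # Staleness threshold = 2.5 x max(interval, observed cycle), floored at 30s:
  #   interval 1s,  observed cycle 1s  -> 2.5s, raised to the 30s floor
  #   interval 60s, observed cycle 60s -> 2.5 x 60s = 150s
  max_connections: 30000   # explicit cap instead of the optimistic auto-derived value
  # Rough heap check at ~623 KB per connected envoy:
  #   30,000 x 623 KB = 18,690,000 KB, about 18.7 GB (taking 1 KB as 1,000 bytes)
  stream_signature: all    # default; every check-in/report is verified
```

Setting the cap explicitly follows the table's advice not to treat the auto-derived value as the real limit on a memory-bound host.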
Sizing max_concurrent
The cap defends against the cascading-overload regime: when offered load exceeds the server's serving rate, queued requests pile up faster than they drain, the client ticker re-fires on top, and throughput collapses (measured: 70 req/s on a 5K-capable host at 10K offered). Bounding concurrency caps throughput at max_concurrent / latency ≈ the host's actual capacity, gracefully degrading rather than collapsing.
Watch vigo_grpc_checkin_in_flight as the pre-cascade signal — a sustained climb under burst is the leading edge of trouble. Size the cap to roughly your host's measured req/s capacity × p99 latency in seconds (e.g. a 5K req/s host with 100 ms p99 → max_concurrent: 500).
Default 0 resolves to a CPU-scaled built-in (max(128, GOMAXPROCS*16), where GOMAXPROCS tracks the container's cgroup CPU limit on Go 1.25+) — the cap is always on. Operators tune the knob; the burstwave test (2026-05-29) showed that an unbounded read path will CPU-cascade under a fleet-wide stale-checkin wave (post-publish, mass re-enroll), so the always-on default is the safety net.
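The sizing rule above (measured req/s capacity × p99 latency in seconds) can be sketched as a worked config. The first line of arithmetic is the example from the text; the second host is a hypothetical illustration showing that a slower host with higher latency can land on the same cap.

```yaml
checkin:
  # cap ~= measured req/s capacity x p99 latency (seconds)
  #   5,000 req/s x 0.100 s = 500   (example from the text)
  #   2,000 req/s x 0.250 s = 500   (hypothetical slower host)
  max_concurrent: 500
```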
For per-host envoy capacity (how many envoys one vigosrv host can comfortably hold at a given check-in interval, and when to federate via spanner instead of scaling up), see How-to: Size a vigosrv host.
bootstrap
Agent enrollment configuration. Requires server.tls.ca_key_file.
bootstrap:
  cert_validity: "8760h"  # agent cert lifetime (default: 1 year)
  trusted_enrollment:  # token-free enrollment by pattern + source IP
    - pattern: "*"
      cidrs: ["192.168.0.0/16", "10.0.0.0/8"]
| Field | Default | Description |
|---|---|---|
| cert_validity | 8760h | Lifetime of agent TLS certificates |
| trusted_enrollment | (empty) | Patterns + CIDRs for token-free enrollment |
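A narrower enrollment policy might look like the sketch below. It assumes pattern is a hostname glob (the reference only shows "*"), and the pattern, subnet, and 90-day lifetime are hypothetical values. The server.tls.ca_key_file line is included because bootstrap requires it.

```yaml
server:
  tls:
    ca_key_file: "tls/ca_key"    # required for bootstrap CSR signing
bootstrap:
  cert_validity: "2160h"         # 90 days instead of the 1-year default
  trusted_enrollment:
    - pattern: "web-*"           # hypothetical hostname pattern (glob syntax assumed)
      cidrs: ["10.20.0.0/16"]    # hypothetical provisioning subnet
```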
auth
Web UI and REST API authentication.
# Basic auth (default)
auth:
  method: basic
  session_idle_timeout: "15m"
# OIDC (OpenID Connect SSO)
auth:
  method: oidc
  oidc:
    issuer: "https://accounts.google.com"
    client_id: "your-client-id"
    client_secret: "secret:vigo/auth/oidc_client_secret"
    redirect_url: "https://localhost:8443/auth/callback"
    scopes: [openid, profile, email]
    # Optional: provision exactly one identity as admin on its first login.
    # Leave both unset (default) and all new OIDC users are viewers; grant
    # admin explicitly with `vigocli webusers set-role --role admin`.
    bootstrap_admin_email: "admin@example.com"
# Disable auth (development only)
auth:
  method: none
| Field | Default | Description |
|---|---|---|
| method | basic | basic, oidc, isowebauth, or none |
| session_idle_timeout | 15m | Idle timeout before session expires |
| oidc.bootstrap_admin_email | (none) | Email of the OIDC identity to provision as admin on first login. Empty disables auto-admin. |
| oidc.bootstrap_admin_subject | (none) | OIDC sub claim of the identity to provision as admin (use when email is not stable). |
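The example config only shows bootstrap_admin_email. For providers where email is not stable, a variant using bootstrap_admin_subject could look like this sketch; the redirect host and the sub value are placeholders, not real identifiers.

```yaml
auth:
  method: oidc
  oidc:
    issuer: "https://accounts.google.com"
    client_id: "your-client-id"
    client_secret: "secret:vigo/auth/oidc_client_secret"
    redirect_url: "https://vigo.example.com:8443/auth/callback"   # placeholder host
    scopes: [openid, profile, email]
    bootstrap_admin_subject: "1234567890"   # placeholder sub claim of the first admin
```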
smtp
Email notifications. Disabled when host is empty.
smtp:
  host: "smtp.example.com"
  port: 587
  from: "vigo@example.com"
  username: "vigo"
  password: "secret:vigo/smtp/password"
  tls: true
  recipients: ["admin@example.com"]
  events:
    drift.detected: {}
    run.failure:
      recipients: ["ops@example.com"]
    convergence.threshold:
      threshold: 90
    security.rootkit: {}
  digest:
    interval: "1h"
    events: ["run.success", "drift.detected"]
  evidence:
    schedule: "weekly"  # "weekly" or "monthly"
    recipients: ["compliance@example.com"]
| Field | Default | Description |
|---|---|---|
| host | (disabled) | SMTP relay hostname |
| port | 587 | SMTP port |
| from | (required) | Sender email address |
| tls | true | Use STARTTLS |
| recipients | (required) | Default recipient list |
| events | (all enabled) | Per-event toggles and recipient overrides |
| digest.interval | (disabled) | Batch window for digest emails |
| evidence.schedule | (disabled) | weekly or monthly evidence delivery |
integrations
External platform integrations. Each forwards events asynchronously. All API keys support secret: prefix. Empty events list = forward all events.
Alerting
integrations:
  slack:
    enabled: true
    webhook_url: "https://hooks.slack.com/services/T.../B.../xxx"
    events: ["drift.detected", "run.failure"]
  pagerduty:
    enabled: true
    routing_key: "secret:vigo/pagerduty/routing-key"
    events: ["security.rootkit", "run.failure"]
  opsgenie:
    enabled: true
    api_key: "secret:vigo/opsgenie/api-key"
    events: ["security.rootkit", "convergence.threshold"]
  teams:
    enabled: true
    webhook_url: "https://outlook.office.com/webhook/..."
    events: ["drift.detected", "run.failure"]
PSA / RMM
integrations:
  connectwise:
    enabled: true
    url: "https://api-na.myconnectwise.net/v4_6_release/apis/3.0"
    company_id: "mycompany"
    public_key: "publickey"
    private_key: "secret:vigo/connectwise/private-key"
    board_id: 1
    events: ["drift.detected", "run.failure", "envoy.stale"]
  autotask:
    enabled: true
    url: "https://webservices.autotask.net/ATServicesRest/v1.0"
    username: "apiuser"
    password: "secret:vigo/autotask/password"
    queue_id: 8
    events: ["drift.detected", "run.failure"]
SIEM
integrations:
  splunk:
    enabled: true
    url: "https://splunk.example.com:8088"
    token: "secret:vigo/splunk/hec-token"
    index: "vigo"
    source: "vigo"
    events: []
  elastic:
    enabled: true
    url: "https://elasticsearch.example.com:9200"
    index: "vigo-events"
    api_key: "secret:vigo/elastic/api-key"
    events: []
  datadog:
    enabled: true
    api_key: "secret:vigo/datadog/api-key"
    site: "datadoghq.com"
    events: []
  loki:
    enabled: true
    url: "https://logs-prod-us-central1.grafana.net"
    tenant_id: "123456"  # X-Scope-OrgID for multi-tenant Loki; omit for single-tenant
    username: "123456"  # basic-auth username (Grafana Cloud: instance ID)
    password: "secret:vigo/loki/password"
    # bearer_token: "secret:vigo/loki/token"  # alternative to basic-auth
    events: []
Loki receives events as labeled log streams. Labels stay low-cardinality (service=vigo, severity, event_name) per Loki best practice — high-cardinality fields (hostname, envoy_id) ride in the JSON log line body. Audit-trail rows arrive as audit.<eventType> events alongside envoy.offline / run.failure / security.* etc., so an operator can query the audit chain in LogQL: {service="vigo", event_name=~"audit\\..*"}.
CMDB
integrations:
  servicenow:
    enabled: true
    instance: "mycompany.service-now.com"
    username: "vigo-integration"
    password: "secret:vigo/servicenow/password"
    events: ["drift.detected", "run.failure"]
grc
Push compliance evidence to GRC platforms on a schedule.
grc:
  integrations:
    - name: vanta
      enabled: true
      endpoint: "https://api.vanta.com/v1/evidence"
      api_key: "secret:vigo/grc/vanta-api-key"
      push_interval: "6h"
      frameworks:
        hipaa: "hipaa-2013"
        soc2: "soc2-2017"
| Field | Default | Description |
|---|---|---|
| name | (required) | Integration name for logging |
| endpoint | (required) | API URL to POST evidence to |
| api_key | (required) | Bearer token. Supports secret: prefix |
| push_interval | 6h | How often to push evidence |
| frameworks | (required) | Vigo standard key → GRC platform framework ID mapping |
spanner
Peer-equal control-plane federation (ADR-026). See Set up Spanner for the walkthrough.
# A spanner bolt: a peer-equal vigosrv that owns a hostname-pattern
# partition and shares the admissions roster with every other bolt via
# CRDT gossip. Every bolt's config has the same shape: one operator
# founds the spanner with `vigocli spanner init`, the rest join with
# `vigocli spanner join`.
spanner:
  mode: spanner
  spanner_id: "alexander4-prod"
  patterns: ["*.us-west.*"]
  snapshot:
    tier1_interval: 30s
    tier2_interval: 5m
  transport:
    multicast_group: "224.0.0.45"
    multicast_port: 1534
  fallback_admin: local
| Field | Default | Description |
|---|---|---|
| mode | standalone | standalone (single vigosrv) or spanner (a peer-equal bolt). The pre-Phase-3 founder/joiner values are retired; the loader hard-errors on them. |
| spanner_id | (required when mode = spanner) | Federation identifier: operator-chosen, immutable, DNS-safe (1-64 ASCII alphanumeric + - _ ., no leading dot). Must match across every bolt in the spanner; embedded in every signed admission row. |
| patterns | (required when mode = spanner) | Hostname glob patterns this bolt owns. First-match-wins across the spanner; loader refuses overlap with another bolt's patterns at publish time + boot time. |
| transport.multicast_group | 224.0.0.45 | IPv4 multicast group for Tier-1 counter gossip. |
| transport.multicast_port | 1534 | UDP port for Tier-1 counter gossip. |
| snapshot.tier1_interval | 30s | Tier-1 counter-gossip cadence. |
| snapshot.tier2_interval | 5m | Tier-2 roster-snapshot cadence. |
| fallback_admin | local | Fleet-wide admin-read source: local aggregates ListEnvoys / GetConvergenceSummary from the gossip-backed observability cache (every bolt's signed /fleet-snapshot); fanout queries every roster bolt's admin gRPC live and merges. Both produce identically-shaped responses. ListRuns is unaffected by this switch: runs are never gossiped, so a per-envoy query always routes to the owning bolt and a global runs list always fans out to every bolt. |
Peer discovery is gossip-driven — there is no bootstrap_peers seed list. A joining bolt learns the roster through the vigocli spanner join admission ceremony, and the CRDT-replicated roster is the peer list from then on. Bolts are identified by their Ed25519 pubkey (server/bolt/identity.wrapped), not a YAML-configured ID.
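To illustrate the two cross-bolt constraints from the table (a shared spanner_id and non-overlapping patterns), here is a sketch of two bolts' spanner sections side by side. Each block belongs in a different host's server.yaml; the `---` separator and the us-east pattern are illustrative assumptions.

```yaml
# Bolt A: hypothetical owner of the us-west partition
spanner:
  mode: spanner
  spanner_id: "alexander4-prod"   # identical on every bolt
  patterns: ["*.us-west.*"]
---
# Bolt B: hypothetical owner of the us-east partition
spanner:
  mode: spanner
  spanner_id: "alexander4-prod"
  patterns: ["*.us-east.*"]       # must not overlap bolt A's patterns
```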
Bolt identity
Each spanner-mode vigosrv generates a per-host Ed25519 keypair at /srv/vigo/bolt/identity.wrapped on first boot (HKDF purpose vigo-bolt-wrap-v1, mirrors the service-account puddle pattern). The bolt signs its admission row and every gossiped roster snapshot with this key; peers verify signatures end-to-end and pin each bolt's pubkey at admission time. vigocli spanner status shows the local bolt's first-16-char pubkey for operator cross-check.
The retired hub-spoke keys (role, bolt_id, hub_addr, hub_fallback_addrs, bolts, auto_failover, transport.bootstrap_peers) and the retired mode values founder/joiner are rejected at startup with a one-line migration pointer.
peers
Primary/peer server redundancy for HA.
peers:
  role: primary
  servers:
    - addr: "peer1.example.com:1530"
      pubkey: "ed25519:<peer1-pubkey-hex>"
    - addr: "peer2.example.com:1530"
      pubkey: "ed25519:<peer2-pubkey-hex>"
| Field | Default | Description |
|---|---|---|
| role | primary | primary or peer |
| servers | (empty) | List of peer servers |
| servers[].addr | (required) | Peer gRPC address (host:port) |
| servers[].sync_interval | 10s | Push interval to this peer |
| servers[].pubkey | (required) | Peer's service-account (puddle) Ed25519 key, from its startup puddle service-account identity ready log line. Authorizes HA peer RPCs (SyncStream/Promote/Status); replication fails closed without it. Local secrets backend only. Configure the same peers block on every node. |
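The table lists servers[].sync_interval, which the example config omits. A sketch of a per-peer override, assuming it sits alongside addr and pubkey in each list entry as the servers[] notation suggests (the 30s value is illustrative):

```yaml
peers:
  role: primary
  servers:
    - addr: "peer1.example.com:1530"
      sync_interval: "30s"                   # slower than the 10s default
      pubkey: "ed25519:<peer1-pubkey-hex>"   # placeholder, as in the example config
```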
license
License enforcement behavior.
license:
  mode: "local"  # "local" or "aggregate"
| Field | Default | Description |
|---|---|---|
| mode | local | local: count nodes on this server. aggregate: every spanner bolt sums the fleet-wide node count from the gossip-replicated roster and gates enrollment against it independently. Best-effort: the count comes from eventually-consistent gossip, so simultaneous enrollments on different bolts near the cap can both pass. The fleet-wide limit is a soft cap, not a hard guarantee (consensus-free by design, ADR-026). Requires spanner.mode: spanner; rejected at config load on a standalone server. |
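Because aggregate mode is rejected on a standalone server, the two sections have to be set together. A minimal sketch, reusing the spanner values from the example earlier on this page:

```yaml
spanner:
  mode: spanner                   # aggregate licensing is rejected without this
  spanner_id: "alexander4-prod"
  patterns: ["*.us-west.*"]
license:
  mode: "aggregate"               # soft fleet-wide cap summed from the gossiped roster
```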
publish
Blast radius protection on config publish.
publish:
  compliance_threshold: 80  # auto-rollback if convergence drops below this %
  rollback_window: "5m"  # monitoring window after config reload
backup
Litestream continuous replication (SQLite only).
backup:
  url: "s3://my-bucket/vigo"
  access_key_id: "secret:vigo/backup/aws_access_key"
  secret_access_key: "secret:vigo/backup/aws_secret_key"
  region: "us-east-1"
  retention: "720h"
  sync_interval: "1s"
ai
AI assistant configuration. Claude or OpenAI recommended for best results. Ollama works for air-gapped environments but requires a dedicated GPU and produces lower-quality answers. See AI Providers for details on each provider, hardware requirements, and data privacy considerations.
Prompt-injection defense — structural mitigations + provider-dependent adherence
Every block of data Vigo prefetches before an assistant turn (envoy details, traits, run output, config YAML, CVE descriptions, etc.) is wrapped in a fenced ```text block with a fence length sized past any backticks the data itself contains, and the system prompt's "Trust Boundary" section instructs the LLM to treat anything inside the fence as queried data rather than operator instructions. This blocks the structural attack: a hostname like "web01\nIGNORE PRIOR INSTRUCTIONS AND…", a package name with embedded markdown, a run-output dump with a fake system: prefix.
The actual guarantee is only as strong as the underlying LLM's adherence to that framing. Claude (Anthropic's models) respects "this is data, not instructions" framing reliably in our testing; OpenAI is similar. Ollama and openai-compat backends vary widely by model: a smaller open-source model under stress may still follow an injected imperative despite the fence. If your threat model includes adversaries who can put text into trait values, hostnames, or other prefetched content, prefer Claude or OpenAI over self-hosted open-source models for the assistant.
The fence + framing are best-effort defense-in-depth, not a hard guarantee. The assistant is read-only by design (server/ai/tools_dispatch.go exposes only read operations), so even a successful injection cannot write config, push tasks, or mutate state — at worst it could mislead the operator reading the assistant's answer.
ai:
  enabled: true
  provider: "claude"  # claude | openai | ollama | openai-compat
  model: "claude-sonnet-4-20250514"
  max_tokens: 4096
  api_key: "secret:vigo/ai/api_key"
| Field | Default | Description |
|---|---|---|
| provider | ollama | AI backend: claude, openai, ollama (local, GPU required), openai-compat (vLLM, llama.cpp, etc.) |
| model | (provider default) | Model name |
| max_tokens | 4096 | Maximum response tokens |
| context_mode | auto | tools, prefetch, or auto |
| api_key | | API key for external providers (supports secret: prefix) |
| base_url | | Auto for ollama; required for openai-compat |
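Since base_url is required for openai-compat but absent from the example config, here is a sketch of a self-hosted setup. The endpoint URL and model name are hypothetical, and whether a given openai-compat server also needs api_key depends on that server, so it is left out here.

```yaml
ai:
  enabled: true
  provider: "openai-compat"
  base_url: "http://vllm.internal:8000/v1"   # hypothetical endpoint; required for openai-compat
  model: "my-local-model"                    # hypothetical model name
  max_tokens: 4096
  context_mode: auto                         # tools, prefetch, or auto
```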
branding
Custom web UI branding.
branding:
  org_name: "ACME Corp"
  logo_path: "/srv/vigo/branding/logo.svg"
  favicon_path: "/srv/vigo/branding/icon.svg"
  theme: "light"  # "light" (default), "dark", or "auto"
| Field | Default | Description |
|---|---|---|
| org_name | (none) | Organization name shown below the sidebar logo |
| logo_path | embedded default | Sidebar logo (SVG or PNG) |
| favicon_path | embedded default | Browser tab icon |
| theme | light | Fleet-wide default color theme: light (Catppuccin Latte), dark (Catppuccin Mocha), or auto (follow the user's OS prefers-color-scheme). Individual operators can override via the top-navbar toggle; their choice persists per-browser in localStorage. |
grafana
Optional. Points the web UI at an external Grafana instance — when set, the /health page links to the bundled Vigo dashboards. Vigo neither runs nor manages Grafana; this is purely a link target.
grafana:
  url: "https://grafana.example.com:3000"
| Field | Default | Description |
|---|---|---|
| url | (none) | Base URL of a Grafana instance. When set, the /health page links to each bundled dashboard at <url>/d/<uid>. Unset = no links shown. |
sandgorgon
Optional. Gates the sandgorgon bare-metal lifecycle REST surface (Redfish BMC power/disk control, the NIST 800-88 decommission workflow, CSV/NetBox asset import). The subsystem is deferred until after Vigo 1.0 and exposes destructive BMC operations against plaintext-stored BMC credentials, so its routes are not mounted unless explicitly enabled. Leave off.
sandgorgon:
  enabled: false
| Field | Default | Description |
|---|---|---|
| enabled | false | Mount the sandgorgon REST routes. Deferred post-v1; off by default. |
i18n
Optional web UI localization. Off by default: with i18n disabled the UI renders in English with zero overhead and no language picker. When enabled, each request's locale is resolved cookie → Accept-Language → default_locale: the operator's saved choice (a cookie set from the top-navbar language picker, like the theme toggle) wins, then the browser's Accept-Language header (matched on the primary subtag, so fr-CA matches fr), then the configured default. Translations are key-based with English fallback — a missing string degrades to English rather than a blank.
Operator-authored data (envoy names, run results, audit entries) and all commands, flags, and config keys stay English regardless of locale; localization covers UI chrome (nav, labels, buttons) and the in-app docs.
i18n:
  enabled: false
  default_locale: "en"
  supported_locales: ["en", "fr", "de", "es", "pl", "pt", "it"]
| Field | Default | Description |
|---|---|---|
| enabled | false | Enable non-English locales and the top-navbar language picker. When false the UI is English-only with no per-request locale work. |
| default_locale | en | Locale used when no cookie or Accept-Language match applies, and the fallback for any untranslated string. |
| supported_locales | all bundled | Locales offered in the picker and matched against Accept-Language. Empty = every locale with a bundled catalog. Always include default_locale. |
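As an illustration of the resolution order (cookie, then Accept-Language, then default_locale), the sketch below enables a three-locale subset with a French default. The locale choice is arbitrary; the comments trace how the rules described above would apply.

```yaml
i18n:
  enabled: true
  default_locale: "fr"
  supported_locales: ["fr", "en", "de"]   # includes default_locale, as the table advises
  # No cookie, Accept-Language: fr-CA -> fr (matched on the primary subtag)
  # No cookie, Accept-Language: pt-BR -> fr (pt is not offered, so default_locale applies)
```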
swarm
Peer-to-peer content distribution (six content subsystems: filecast, gitback, longdrawer, lockbox, curator, poolq). Each subsystem documents the full field set in its howto — the section below covers poolq (ADR-029); see server.yaml.example for the full template plus the other subsystems.
swarm:
  enabled: ["*"]  # substrate-level gate (pattern list)
  poolq:
    enabled: ["*"]  # which envoys may run poolq (founders + log-holders)
    retention: "7d"  # per-topic age window; older messages prune
    max_msg_bytes: 16384  # per-message body cap (default 16 KiB; hard ceiling 64 KiB)
swarm.poolq
| Field | Default | Description |
|---|---|---|
| enabled | [] | Hostname pattern list (first-match-wins; - prefix denies; empty list disables) of envoys allowed to run poolq: publishers (founders) and log-holders. |
| retention | "7d" | Per-topic message-age retention window. Messages older than this are pruned by the mesh aggregator, and stale-on-ingest messages are refused; together that is flap-stable pruning without tombstones. Accepts Nd / standard Go duration strings. Empty / unparseable falls back to poolqmesh.DefaultRetention (7d). |
| max_msg_bytes | 16384 | Per-message body size cap on the publish path, in bytes. 0 (or unset) = the 16 KiB compile-time default. Hard ceiling 65536 (64 KiB); bigger payloads belong in a curator artifact that the message references by id. |
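The first-match-wins and deny-prefix rules can be combined to carve out an exclusion. This sketch assumes the deny form is a leading - directly on the pattern (quoted so YAML does not read it as a list marker); the DMZ pattern is hypothetical.

```yaml
swarm:
  enabled: ["*"]
  poolq:
    enabled: ["-*.dmz.*", "*"]   # hypothetical: deny DMZ hosts first, then allow everything else
    retention: "72h"             # Go duration string (3 days)
    max_msg_bytes: 32768         # 32 KiB, under the 64 KiB hard ceiling
```

Order matters here: with "*" listed first, every host would match it before the deny entry is reached.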
Publishing additionally requires poolq: true on the user's usercrate AND an unlocked puddle (a heavier grant than gitback:). Reading is ungated — messages are fleet-readable.
The admin moderation backstop is vigocli swarm poolq block <topic_id>: the server stops serving the topic's /range and /log endpoints fleet-wide. See the poolq howto for the publish + consume flow.
tuning
Advanced performance tuning. All fields optional.
tuning:
  signature_window: "5m"  # signature verification time window
  last_seen_flush: "10s"  # batch cadence for last_seen + trait insert/prune
  run_store: "database"  # "database" or "memory"
  run_store_capacity: 20  # per-envoy run depth in memory mode
  max_concurrent_streams: 5000  # gRPC concurrent stream limit
  grpc_read_buffer: 32768  # per-conn read buffer bytes; 0 = gRPC default (32 KiB)
  grpc_write_buffer: 32768  # per-conn write buffer bytes; 0 = gRPC default (32 KiB)
  keepalive_time: "30s"  # how often the server pings each conn to check liveness
  keepalive_timeout: "30s"  # deadline for the ping ACK before the conn is closed
  gogc: 100  # Go GC target percentage
  # GOMEMLIMIT has no key here: it is auto-derived to ~80% of the cgroup-aware RAM,
  # or set the GOMEMLIMIT env var to override (GOMEMLIMIT=off for no limit).
gRPC connection buffers — grpc_read_buffer + grpc_write_buffer
The held stream connection itself costs ~200 KB of the ~623 KB total per-envoy heap (the rest is the FleetIndex entry + cached inventory — see the sizing guide), of which gRPC's default per-connection read+write staging buffers are ~64 KiB (32 KiB each). On large fleets that buffer footprint is the biggest operator-tunable slice of per-connection RAM: halving both to 16384 cuts ~32 KiB/conn (~48 MB across 1,500 connections). Both default to 0 → gRPC's 32 KiB; the option is only applied when set > 0. The trade-off is memory vs. syscall efficiency — smaller buffers mean more read/write syscalls per byte moved — so lower them only when connection-count RAM is the binding constraint, and re-check that large config publish bundle bursts still deliver promptly (full policy bundles ride the held stream as the only high-throughput payload).
There is intentionally no initial_window / HTTP/2 flow-control-window knob: gRPC floors the window at 64 KiB (a smaller value is ignored), so it cannot shrink per-connection RAM, and pinning it would only disable gRPC's BDP auto-tuning and throttle bundle delivery. The read/write buffers above are the connection-RAM dial.
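The buffer-halving arithmetic described above can be written out as a config sketch. The values are the ones named in the text; the comments only restate its calculation.

```yaml
tuning:
  grpc_read_buffer: 16384    # 16 KiB, half the gRPC default
  grpc_write_buffer: 16384   # 16 KiB, half the gRPC default
  # Saving: (32 KiB - 16 KiB) x 2 buffers = 32 KiB per connection
  #   32 KiB x 1,500 connections = 48,000 KiB (the ~48 MB quoted above)
```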
gRPC keepalive — keepalive_time + keepalive_timeout
The server sends an HTTP/2 PING to every conn every keepalive_time (default 30s). If the PING isn't ACKed within keepalive_timeout (default 30s), gRPC closes the conn with GOAWAY and any in-flight RPC on it returns Unavailable to the agent. gRPC also pushes keepalive_timeout to the kernel's TCP_USER_TIMEOUT for the conn, so kernel-level read timeouts use the same deadline.
The trade-off is dead-client detection latency vs. tolerance for transient Go scheduler queueing delays. At high concurrent connection counts the runqueue depth grows (docs/howto/sizing.md measured ~16k runnable goroutines at 20k conns), and the deepest tail of waiting goroutines can take ~20s to dispatch, so tight keepalive_timeout values kill conns under that transient pressure. The 30s default gives ~50% margin over the observed worst-case dispatch latency at the comfortable conn-count envelope. Operators of large fleets who still see Unavailable errors growing should raise keepalive_timeout to 60s before chasing other tunables. Operators who need faster dead-conn reaping (small fleets where every connection matters) can drop it to 20s, gRPC's own default, without re-introducing the scheduler-queueing failure mode at sub-15k conn counts.
The pre-0.69.20 default was 10s, which scaling work on 2026-05-29 showed was the actual mechanism behind the matrix's Unavailable errors at 20k+ conns regardless of check-in interval. See How-to: Size a vigosrv host for the supporting evidence.
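The large-fleet adjustment described above can be sketched as the fragment below, which raises only keepalive_timeout and leaves keepalive_time at its default. As with the other tuning keys, it assumes the same parent section as the block shown earlier; the parent key is not repeated here.

```yaml
# Fragment only: same parent section as the other tuning keys shown above.
keepalive_time: "30s"     # unchanged: ping each conn every 30s
keepalive_timeout: "60s"  # raised from the 30s default to tolerate scheduler queueing
```

For a small fleet that prefers faster dead-conn reaping, the same fragment with keepalive_timeout: "20s" matches the lower bound discussed above.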
paths
Directory paths for server subsystems.
```yaml
paths:
  custom_traits: "custom-traits"
  docs: "docs"
  tasks: "stacks/tasks"
  workflows: "stacks/workflows"
  agent_dist: "dist/agent"
  license_dir: "/srv/vigo/license"
```
watcher
Polls the secrets provider on a schedule and force-pushes affected envoys when a value changes — automatic secret rotation without waiting for the next check-in. See Secrets.
```yaml
watcher:
  enabled: true
  poll_interval: "5m"
```
rate_limit
Per-envoy and global ceilings on check-in RPCs. Requests over the limit get a retryable error so agents back off on their own. Note that connection count, not request rate, drives the memory ceiling — see Size a vigosrv host; this section only caps offered request rate.
```yaml
rate_limit:
  enabled: true
  checkin_per_envoy: 6   # max check-ins per envoy per window
  checkin_global: 500    # max check-ins fleet-wide per window
```
maintenance
Global change freeze — vigocli config publish is blocked until freeze_until passes. For code freezes, customer maintenance windows, or incident response. Bypass requires editing the file and reloading, so the freeze is real.
```yaml
maintenance:
  freeze_until: "2026-07-01T00:00:00Z"  # RFC3339; unset = no freeze
```
task
Ad-hoc task-dispatch safety. With require_definition: true, every task sent via vigocli task dispatch must reference a reviewed definition under stacks/tasks/; direct shell commands are refused. Turn this on in production to force review of anything that can fan out to the fleet.
```yaml
task:
  require_definition: true
```
export
One-way outbound compliance feeds — SIEM audit-event ingestion, CMDB inventory sync, and OSCAL-format framework reports — each toggled independently. See Compliance reporting.
```yaml
export:
  siem: true
  cmdb: false
  oscal: true
```
compliance
Scopes which compliance frameworks this deployment reports against. An empty standards list activates every framework Vigo knows; an explicit list restricts dashboards, reports, and waiver tooling to that subset. See the Compliance matrix for the framework catalog.
```yaml
compliance:
  standards: ["hipaa", "soc2"]  # empty/unset = every framework
```
risk
Per-envoy risk scoring. An optional NVD CVE API key enriches scoring with CVSS metrics and vendor-attributed advisories; without it, CVE severity comes from scanner output alone. Risk scores drive the dashboard Risk Posture column and the cyber-insurance attestation PDF.
```yaml
risk:
  nvd:
    api_key: "secret:vigo/risk/nvd_api_key"  # optional CVSS enrichment
```
stream_edit
Guardrails on the file resource's stream_edit: attribute, which pipes content through agent-local scripts before it's written. Controls whether stream-edits are allowed, the permitted script paths, and the per-transform timeout. See the configcrate language for the attribute itself.
```yaml
stream_edit:
  enabled: true
  allowed_paths: ["/srv/vigo/scripts"]
  default_timeout: "10s"
```
scrier
Browser-based remote access (SSH, RDP, VNC) to any enrolled envoy, routed out through the envoy's outbound agent stream so no inbound ports are required on the target. Every session is logged in scrier_sessions and audited on disconnect. See Set up Scrier and Scrier.
```yaml
scrier:
  enabled: true
  guacd_addr: "127.0.0.1:4822"
  allowed_ports: [22, 3389, 5900]
  max_sessions: 10
  recording_enabled: false
```
Event Types Reference
All event types available for SMTP, webhook, and integration subscriptions:
| Event | Trigger |
|---|---|
| envoy.enrolled | New envoy enrolled |
| run.success | Successful convergence run |
| run.failure | Failed convergence run |
| drift.detected | Configuration drift corrected |
| convergence.drift | Drift affecting compliance controls |
| convergence.threshold | Fleet convergence drops below threshold |
| compliance.evidence | Scheduled evidence email delivery |
| envoy.stale | Envoy hasn't checked in within expected interval |
| secret.rotated | Secret value rotated |
| config.reload.failure | Config publish/reload failed |
| security.rootkit | Rootkit detected |
| security.malware | Malware detected |
| security.integrity | File integrity breach |
| security.cve_critical | New critical CVE found |
| security.hardening_drop | Hardening score dropped significantly |