Kattila.status - Design Specification
Kattila.status is a virtual network topology monitor designed for multi-layer, multi-network environments (VPN meshes, WireGuard, OpenVPN). It follows an Agent-Manager architecture with a pure push-based messaging model.
Architecture Overview
```mermaid
graph TD
    subgraph "Agents (Debian/Linux)"
        A1[Agent 1]
        A2[Agent 2]
        A3[Agent 3]
    end
    subgraph "Manager (Python/Flask)"
        M[Manager API]
        DB[(SQLite WAL)]
        UI[Web UI / Vis-network]
    end
    A1 -->|HTTP/JSON| M
    A2 -->|Relay| A1
    A3 -->|Relay| A2
    M <--> DB
    UI <--> M
```
API Endpoints
Agent API (Listen: 5087)
| Endpoint | Method | Description |
|---|---|---|
| `/status/healthcheck` | GET | Returns simple health status. |
| `/status/reset` | POST | Wipes local SQLite state and triggers re-registration. |
| `/status/peer` | GET | Returns local interface/route info (for relay peers). |
| `/status/relay` | POST | Accepts an enveloped report for forwarding to the Manager. |
Manager API (Listen: 5086)
| Endpoint | Method | Description |
|---|---|---|
| `/status/updates` | POST | Receives periodic reports from agents. |
| `/status/register` | POST | First contact; issues a unique `agent_id`. |
| `/status/healthcheck` | GET | Manager heartbeat check. |
| `/status/alarms` | GET | Fetches active network anomalies. |
| `/status/agents` | GET | Lists all known agents and their status. |
| `/status/admin/reset` | POST | Resets specific agent or fleet state. |
Data Model
Manager DB (kattila_manager.db)
agents
Tracks the fleet registry and presence.
```sql
CREATE TABLE agents (
    agent_id TEXT PRIMARY KEY,
    hostname TEXT NOT NULL,
    agent_version INTEGER NOT NULL,
    fleet_id TEXT NOT NULL,
    registered_at INTEGER NOT NULL,
    last_seen_at INTEGER NOT NULL,
    last_tick INTEGER NOT NULL DEFAULT 0,
    status TEXT NOT NULL DEFAULT 'online' -- online, offline, warning
);
CREATE INDEX idx_agents_last_seen ON agents(last_seen_at);
```
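As a rough illustration of presence tracking against this table, the sketch below flips agents to `offline` once `last_seen_at` falls behind a cutoff. The `mark_stale_agents` helper and the 90-second threshold (three missed 30-second reports) are assumptions made for the example, not values from this specification.

```python
import sqlite3
import time

# Assumption: an agent counts as offline after three missed 30-second reports.
OFFLINE_AFTER_SECONDS = 90


def mark_stale_agents(conn, now=None):
    """Set status='offline' for agents not seen recently; return rows changed."""
    now = int(time.time()) if now is None else now
    cutoff = now - OFFLINE_AFTER_SECONDS
    cur = conn.execute(
        "UPDATE agents SET status = 'offline' "
        "WHERE last_seen_at < ? AND status != 'offline'",
        (cutoff,),
    )
    conn.commit()
    return cur.rowcount
```

The `idx_agents_last_seen` index lets this kind of range scan avoid reading the whole table.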
reports
Auditing and replay protection.
```sql
CREATE TABLE reports (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    agent_id TEXT NOT NULL,
    tick INTEGER NOT NULL,
    timestamp INTEGER NOT NULL,
    report_type TEXT NOT NULL, -- 'report', 'relay', 'register'
    report_json TEXT NOT NULL,
    received_at INTEGER NOT NULL DEFAULT (strftime('%s', 'now')),
    FOREIGN KEY (agent_id) REFERENCES agents(agent_id) ON DELETE CASCADE
);
CREATE UNIQUE INDEX idx_reports_agent_tick ON reports(agent_id, tick);
```
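The unique index on `(agent_id, tick)` is what gives this table its replay-protection role. A minimal sketch of how an insert might lean on it follows; the `store_report` helper is hypothetical, and it treats any `sqlite3.IntegrityError` as a duplicate, which would also catch a foreign-key failure if foreign keys are enabled.

```python
import json
import sqlite3


def store_report(conn, agent_id, tick, timestamp, payload):
    """Insert a report; return False if (agent_id, tick) was already stored."""
    try:
        conn.execute(
            "INSERT INTO reports (agent_id, tick, timestamp, report_type, report_json) "
            "VALUES (?, ?, ?, 'report', ?)",
            (agent_id, tick, timestamp, json.dumps(payload)),
        )
        conn.commit()
        return True
    except sqlite3.IntegrityError:
        # idx_reports_agent_tick rejects a second row with the same agent and tick.
        conn.rollback()
        return False
```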
topology_edges
Inferred links between agents.
```sql
CREATE TABLE topology_edges (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    from_agent_id TEXT NOT NULL,
    to_agent_id TEXT NOT NULL,
    edge_type TEXT NOT NULL, -- 'wireguard', 'openvpn', 'physical', 'relay'
    metadata TEXT DEFAULT '{}', -- JSON for pubkeys, RTT, etc.
    last_seen INTEGER NOT NULL,
    is_active INTEGER NOT NULL DEFAULT 1
);
CREATE UNIQUE INDEX idx_edges_pair ON topology_edges(from_agent_id, to_agent_id, edge_type);
```
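Because `idx_edges_pair` makes each `(from_agent_id, to_agent_id, edge_type)` triple unique, an inferred edge can be refreshed in place with an upsert. The sketch below assumes SQLite 3.24 or newer (for `ON CONFLICT ... DO UPDATE`); the `upsert_edge` name and the `rtt_ms` metadata key used in the test data are illustrative only.

```python
import json
import sqlite3


def upsert_edge(conn, from_id, to_id, edge_type, metadata, seen_at):
    """Insert an edge, or refresh metadata/last_seen if the triple already exists."""
    conn.execute(
        """
        INSERT INTO topology_edges
            (from_agent_id, to_agent_id, edge_type, metadata, last_seen, is_active)
        VALUES (?, ?, ?, ?, ?, 1)
        ON CONFLICT(from_agent_id, to_agent_id, edge_type) DO UPDATE SET
            metadata = excluded.metadata,
            last_seen = excluded.last_seen,
            is_active = 1
        """,
        (from_id, to_id, edge_type, json.dumps(metadata), seen_at),
    )
    conn.commit()
```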
agent_interfaces
Tracks network interfaces an agent reports, allowing detection of when they come and go.
```sql
CREATE TABLE agent_interfaces (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    agent_id TEXT NOT NULL,
    interface_name TEXT NOT NULL,
    mac_address TEXT,
    addresses_json TEXT,
    is_virtual INTEGER NOT NULL DEFAULT 0,
    vpn_type TEXT,
    last_seen_at INTEGER NOT NULL,
    FOREIGN KEY (agent_id) REFERENCES agents(agent_id) ON DELETE CASCADE
);
CREATE UNIQUE INDEX idx_agent_interfaces ON agent_interfaces(agent_id, interface_name);
```
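Detecting interfaces that come and go amounts to a set difference between the names already stored for an agent and the names in its latest report. A small sketch, assuming the stored names have already been loaded into a list and the report uses the `interfaces` shape from the payload section; `diff_interfaces` is a hypothetical helper.

```python
def diff_interfaces(known_names, reported_interfaces):
    """Return (appeared, disappeared) interface names, each sorted."""
    known = set(known_names)
    reported = {iface["name"] for iface in reported_interfaces}
    appeared = sorted(reported - known)
    disappeared = sorted(known - reported)
    return appeared, disappeared
```

For example, with `["eth0", "wg0"]` stored and a report listing `eth0` and `tun0`, the result is `(["tun0"], ["wg0"])`.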
alarms
Event log for network changes and issues, tracking state and timestamps.
```sql
CREATE TABLE alarms (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    agent_id TEXT NOT NULL,
    alarm_type TEXT NOT NULL, -- e.g., 'link_down', 'new_peer'
    status TEXT NOT NULL DEFAULT 'active', -- 'active', 'dismissed'
    details_json TEXT DEFAULT '{}',
    created_at INTEGER NOT NULL DEFAULT (strftime('%s', 'now')),
    dismissed_at INTEGER,
    FOREIGN KEY (agent_id) REFERENCES agents(agent_id) ON DELETE CASCADE
);
CREATE INDEX idx_alarms_agent_status ON alarms(agent_id, status);
```
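The alarm lifecycle implied by the `status` and `dismissed_at` columns could look roughly like the sketch below. The helper names (`raise_alarm`, `active_alarms`, `dismiss_alarm`) are invented for illustration; the specification does not define them.

```python
import json
import sqlite3
import time


def raise_alarm(conn, agent_id, alarm_type, details):
    """Insert an active alarm and return its row id."""
    cur = conn.execute(
        "INSERT INTO alarms (agent_id, alarm_type, details_json) VALUES (?, ?, ?)",
        (agent_id, alarm_type, json.dumps(details)),
    )
    conn.commit()
    return cur.lastrowid


def active_alarms(conn):
    """Return (id, agent_id, alarm_type) for every alarm still active."""
    return conn.execute(
        "SELECT id, agent_id, alarm_type FROM alarms WHERE status = 'active' ORDER BY id"
    ).fetchall()


def dismiss_alarm(conn, alarm_id, now=None):
    """Mark an active alarm dismissed; return False if it was not active."""
    now = int(time.time()) if now is None else now
    cur = conn.execute(
        "UPDATE alarms SET status = 'dismissed', dismissed_at = ? "
        "WHERE id = ? AND status = 'active'",
        (now, alarm_id),
    )
    conn.commit()
    return cur.rowcount == 1
```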
Communication Protocol
Security & Hardware
- Authentication: HMAC-SHA256 verification using a fleet-wide Bootstrap PSK.
- Key Discovery & Transition: The PSK is retrieved via DNS TXT, HTTP(S) URL, or local file and checked for changes hourly. The manager should accept the current and the 2 previous bootstrap keys to handle propagation delays, returning a specific error if an agent connects with an outdated key.
- Replay Protection: Monotonic "ticks" and a sliding window nonce cache (120 entries).
- Time Sync: 10-minute maximum clock skew allowance.
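One way to read the authentication and key-transition rules above is sketched below. Several details are assumptions, since the specification does not fix them: the HMAC is computed over canonical JSON (sorted keys, compact separators) with the `hmac` field removed, and the manager keeps a separate list of retired keys so that it can tell an outdated key apart from a forged signature. `compute_hmac` and `verify_envelope` are hypothetical names.

```python
import hashlib
import hmac
import json


def compute_hmac(psk: bytes, envelope: dict) -> str:
    """Hex HMAC-SHA256 over the envelope without its 'hmac' field.

    Assumption: canonical JSON (sorted keys, compact separators) is what gets signed.
    """
    body = {k: v for k, v in envelope.items() if k != "hmac"}
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hmac.new(psk, canonical, hashlib.sha256).hexdigest()


def _matches(psk: bytes, envelope: dict, claimed: str) -> bool:
    # compare_digest avoids leaking the match position through timing.
    expected = compute_hmac(psk, envelope).encode("ascii")
    return hmac.compare_digest(expected, claimed.encode("utf-8"))


def verify_envelope(envelope: dict, accepted_keys: list, retired_keys: list) -> str:
    """Return 'ok', 'outdated_key', or 'invalid'.

    accepted_keys: the current key followed by up to two previous keys.
    retired_keys: older keys, kept only to produce the specific outdated-key error.
    """
    claimed = str(envelope.get("hmac", ""))
    if any(_matches(k, envelope, claimed) for k in accepted_keys[:3]):
        return "ok"
    if any(_matches(k, envelope, claimed) for k in retired_keys):
        return "outdated_key"
    return "invalid"
```

A caller would map `outdated_key` to the specific error response and `invalid` to a logged security failure.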
Report Payload (Agent -> Manager)
Agents send a report every 30 seconds (with randomized jitter).
```json
{
  "version": 1,
  "tick": 42,
  "type": "report",
  "nonce": "base64-random-nonce",
  "timestamp": 1744569900,
  "agent_id": "agent-7f3a9b2c1d",
  "agent_version": 5,
  "fleet_id": "sha256-psk-hash",
  "hmac": "hex-hmac-sha256",
  "data": {
    "hostname": "node-01",
    "uptime_seconds": 123456,
    "loadavg": [0.12, 0.34, 0.56],
    "interfaces": [
      {
        "name": "eth0",
        "mac": "aa:bb:cc:dd:ee:ff",
        "addresses": ["192.168.1.10/24"],
        "is_virtual": false,
        "vpn_type": null
      }
    ],
    "routes": [
      { "dst": "0.0.0.0/0", "via": "192.168.1.1", "dev": "eth0" }
    ],
    "wg_peers": [
      {
        "public_key": "base64-key",
        "endpoint": "1.2.3.4:51820",
        "allowed_ips": ["10.0.0.2/32"],
        "last_handshake": 1744569800
      }
    ]
  }
}
```
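To make the timing rules concrete, here is a small sketch of an agent-side jittered send interval and a manager-side freshness check on `timestamp` and `tick`. The 30-second interval and 10-minute skew come from this specification; the 5-second jitter bound and both function names are assumptions for illustration.

```python
import random

REPORT_INTERVAL = 30    # seconds, from the specification
MAX_JITTER = 5          # seconds; assumed value, the specification gives no bound
MAX_CLOCK_SKEW = 600    # 10 minutes, from the specification


def next_report_delay(rng=random):
    """Seconds until the next report: the base interval plus or minus jitter."""
    return REPORT_INTERVAL + rng.uniform(-MAX_JITTER, MAX_JITTER)


def is_fresh(report, last_tick, now):
    """True if the timestamp is within the skew allowance and the tick advanced."""
    within_skew = abs(now - report["timestamp"]) <= MAX_CLOCK_SKEW
    return within_skew and report["tick"] > last_tick
```

Here `last_tick` would be the value stored in the `agents` table for the reporting agent.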
Relay Mechanism
Used when an agent cannot reach the manager directly via the configured URL.
- Discovery: The agent scans its connected WireGuard networks for other agents (checking port 5087). It queries their `/status/peer` endpoint to find a forward path to the manager.
- Supports up to 3 hops.
Important
Loop Detection: Agents must check the `relay_path` array. If their own `agent_id` is present, the message is dropped to prevent infinite recursion.
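The loop check and hop limit can be sketched as a single guard that runs before an agent forwards an envelope. This assumes `relay_path` is a list of the agent IDs that have already relayed the message and that each entry counts as one hop; `prepare_relay` is a hypothetical helper.

```python
MAX_HOPS = 3  # from the specification


def prepare_relay(envelope, own_agent_id):
    """Return a copy of the envelope with own_agent_id appended to relay_path.

    Returns None when the message must be dropped: either this agent is
    already in the path (loop) or the hop limit has been reached.
    """
    path = list(envelope.get("relay_path", []))
    if own_agent_id in path:
        return None  # loop detected
    if len(path) >= MAX_HOPS:
        return None  # hop limit reached
    forwarded = dict(envelope)
    forwarded["relay_path"] = path + [own_agent_id]
    return forwarded
```

Returning a copy leaves the received envelope untouched, which keeps logging of the original message straightforward.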
Data Model (Manager)
The Manager maintains the network state and inferred topology.
| Table | Purpose |
|---|---|
| `agents` | Fleet registry and presence tracking (heartbeat). |
| `agent_interfaces` | Historical snapshot of network interfaces. |
| `topology_edges` | Inferred links between agents (Physical, VPN, Relay). |
| `alarms` | Event log for changes (link down, new peer, etc.). |
Visualization & UI
The network is visualized using Vis-network.min.js in a layered approach:
- Layer 1 (Public): Servers with direct public IPs (masked as SHA fingerprints).
- Layer 2 (Linked): Servers behind NAT but directly connected to Layer 1.
- Layer 3 (Private): Isolated nodes reachable only via multi-hop paths.
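The layering rule above can be expressed as a short classification over the inferred edges. The sketch assumes edges are treated as undirected and that the set of nodes with public IPs is already known; truncating the fingerprint to 12 hex characters is an arbitrary choice for display, and both function names are hypothetical.

```python
import hashlib


def mask_ip(ip: str) -> str:
    """Short SHA-256 fingerprint shown in place of the raw public IP."""
    return hashlib.sha256(ip.encode("utf-8")).hexdigest()[:12]


def assign_layers(public_nodes, edges):
    """Map each node to layer 1 (public), 2 (linked to public), or 3 (private)."""
    public = set(public_nodes)
    nodes = set(public)
    neighbours = {}
    for a, b in edges:
        nodes.update((a, b))
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    layers = {}
    for node in nodes:
        if node in public:
            layers[node] = 1
        elif neighbours.get(node, set()) & public:
            layers[node] = 2
        else:
            layers[node] = 3
    return layers
```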
Operational Considerations
Logging & Monitoring
- Agents: Should log to `journald` at INFO level. Critical errors (e.g., SQLite corruption, no PSK) should be logged at ERROR.
- Manager: Log each incoming report and security failure (HMAC mismatch) with the source agent IP and ID.
Maintenance
- Database Vacuum: Periodic `VACUUM` on the manager DB is recommended if tracking many historical reports.
- Relay Cleanup: The `nonce_cache` should be cleaned every 10 minutes to prevent memory/storage bloat.
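A possible in-memory shape for the `nonce_cache` is sketched below: a bounded, insertion-ordered map that evicts the oldest nonce past 120 entries, plus a `cleanup` method intended to be called from a 10-minute timer. The 600-second maximum age is an assumption (chosen to match the clock-skew allowance), as are the class and method names; whether the cache is kept per agent or globally is not specified.

```python
from collections import OrderedDict


class NonceCache:
    """Sliding-window cache of recently seen nonces."""

    def __init__(self, max_entries=120, max_age=600):
        self.max_entries = max_entries
        self.max_age = max_age          # seconds; assumed value
        self._seen = OrderedDict()      # nonce -> first-seen time, oldest first

    def check_and_add(self, nonce, now):
        """Return True and record the nonce if it is new, False if replayed."""
        if nonce in self._seen:
            return False
        self._seen[nonce] = now
        while len(self._seen) > self.max_entries:
            self._seen.popitem(last=False)  # evict the oldest entry
        return True

    def cleanup(self, now):
        """Drop entries older than max_age; return how many were removed."""
        stale = [n for n, ts in self._seen.items() if now - ts > self.max_age]
        for n in stale:
            del self._seen[n]
        return len(stale)

    def __len__(self):
        return len(self._seen)
```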
Future Enhancements & Proposals
1. Alerting Integrations
- Webhooks: Simple HTTP POST to external services (Slack, Discord) when an `alarm` is created.
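Since this is only a proposal, the following is a speculative sketch of what building such a webhook call might look like with the standard library. The payload shape is an assumption; real services each expect their own schema, so the body would need adapting per target. `build_webhook_request` is a hypothetical helper, and the request is constructed but not sent.

```python
import json
import urllib.request


def build_webhook_request(url, alarm):
    """Build (but do not send) a JSON POST describing a newly created alarm."""
    body = json.dumps(
        {
            "text": f"[{alarm['alarm_type']}] agent {alarm['agent_id']}",
            "alarm": alarm,  # assumed: the full alarm row is included as-is
        }
    ).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Sending would then be a matter of `urllib.request.urlopen(request, timeout=5)` inside error handling, so that a slow or failing webhook cannot block alarm creation.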
2. Historical Topology "Time-Travel"
- Store topology snapshots every hour.
- Allow the UI to "scrub" through history to see when a specific link was added or lost.
3. Advanced Visualization
- Geographic Map Overlay: If agents provide coordinates (or inferred via GeoIP), display nodes on a world map.
- Link Bandwidth Visualization: Thicker lines for higher capacity links (e.g., physical vs. relay).