# Kattila.status - Design Specification

Kattila.status is a virtual network topology monitor designed for multi-layer, multi-network environments (VPN meshes, WireGuard, OpenVPN). It follows an **Agent-Manager** architecture with a pure push-based messaging model.

## Architecture Overview

```mermaid
graph TD
    subgraph "Agents (Debian/Linux)"
        A1[Agent 1]
        A2[Agent 2]
        A3[Agent 3]
    end
    subgraph "Manager (Python/Flask)"
        M[Manager API]
        DB[(SQLite WAL)]
        UI[Web UI / Vis-network]
    end
    A1 -->|HTTP/JSON| M
    A2 -->|Relay| A1
    A3 -->|Relay| A2
    M <--> DB
    UI <--> M
```

### API Endpoints

#### Agent API (Listen: 5087)

| Endpoint | Method | Description |
| :--- | :--- | :--- |
| `/status/healthcheck` | GET | Returns a simple health status. |
| `/status/reset` | POST | Wipes local SQLite state and triggers re-registration. |
| `/status/peer` | GET | Returns local interface/route info (for relay peers). |
| `/status/relay` | POST | Accepts an enveloped report for forwarding to the Manager. |

#### Manager API (Listen: 5086)

| Endpoint | Method | Description |
| :--- | :--- | :--- |
| `/status/updates` | POST | Receives periodic reports from agents. |
| `/status/register` | POST | First contact; issues a unique `agent_id`. |
| `/status/healthcheck` | GET | Manager heartbeat check. |
| `/status/alarms` | GET | Fetches active network anomalies. |
| `/status/agents` | GET | Lists all known agents and their status. |
| `/status/admin/reset` | POST | Resets the state of a specific agent or the whole fleet. |

---

## Data Model

### Manager DB (`kattila_manager.db`)

#### `agents`

Tracks the fleet registry and presence.
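As a non-normative illustration, the presence upsert the Manager might run against this table on each incoming report could look like the following (Python stdlib `sqlite3` assumed; the function name is hypothetical, and the columns follow the schema below):

```python
# Sketch of a presence upsert against the agents table defined below.
# Illustrative only; the SQL is not normative.
import sqlite3
import time

def touch_agent(db: sqlite3.Connection, agent_id: str, hostname: str,
                agent_version: int, fleet_id: str, tick: int) -> None:
    """Register the agent on first contact, refresh presence afterwards."""
    now = int(time.time())
    db.execute(
        """INSERT INTO agents (agent_id, hostname, agent_version, fleet_id,
                               registered_at, last_seen_at, last_tick, status)
           VALUES (?, ?, ?, ?, ?, ?, ?, 'online')
           ON CONFLICT(agent_id) DO UPDATE SET
               hostname = excluded.hostname,
               agent_version = excluded.agent_version,
               last_seen_at = excluded.last_seen_at,
               -- ticks are monotonic; never move the high-water mark back
               last_tick = MAX(last_tick, excluded.last_tick),
               status = 'online'""",
        (agent_id, hostname, agent_version, fleet_id, now, now, tick))
    db.commit()
```

The `ON CONFLICT ... DO UPDATE` form keeps `registered_at` from the first contact while refreshing `last_seen_at` on every report.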
```sql
CREATE TABLE agents (
    agent_id TEXT PRIMARY KEY,
    hostname TEXT NOT NULL,
    agent_version INTEGER NOT NULL,
    fleet_id TEXT NOT NULL,
    registered_at INTEGER NOT NULL,
    last_seen_at INTEGER NOT NULL,
    last_tick INTEGER NOT NULL DEFAULT 0,
    status TEXT NOT NULL DEFAULT 'online' -- online, offline, warning
);
CREATE INDEX idx_agents_last_seen ON agents(last_seen_at);
```

#### `reports`

Auditing and replay protection.

```sql
CREATE TABLE reports (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    agent_id TEXT NOT NULL,
    tick INTEGER NOT NULL,
    timestamp INTEGER NOT NULL,
    report_type TEXT NOT NULL, -- 'report', 'relay', 'register'
    report_json TEXT NOT NULL,
    received_at INTEGER NOT NULL DEFAULT (strftime('%s', 'now')),
    FOREIGN KEY (agent_id) REFERENCES agents(agent_id) ON DELETE CASCADE
);
CREATE UNIQUE INDEX idx_reports_agent_tick ON reports(agent_id, tick);
```

#### `topology_edges`

Inferred links between agents.

```sql
CREATE TABLE topology_edges (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    from_agent_id TEXT NOT NULL,
    to_agent_id TEXT NOT NULL,
    edge_type TEXT NOT NULL, -- 'wireguard', 'openvpn', 'physical', 'relay'
    metadata TEXT DEFAULT '{}', -- JSON for pubkeys, RTT, etc.
    last_seen INTEGER NOT NULL,
    is_active INTEGER NOT NULL DEFAULT 1
);
CREATE UNIQUE INDEX idx_edges_pair ON topology_edges(from_agent_id, to_agent_id, edge_type);
```

#### `agent_interfaces`

Tracks the network interfaces an agent reports, allowing detection of when they come and go.

```sql
CREATE TABLE agent_interfaces (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    agent_id TEXT NOT NULL,
    interface_name TEXT NOT NULL,
    mac_address TEXT,
    addresses_json TEXT,
    is_virtual INTEGER NOT NULL DEFAULT 0,
    vpn_type TEXT,
    last_seen_at INTEGER NOT NULL,
    FOREIGN KEY (agent_id) REFERENCES agents(agent_id) ON DELETE CASCADE
);
CREATE UNIQUE INDEX idx_agent_interfaces ON agent_interfaces(agent_id, interface_name);
```

#### `alarms`

Event log for network changes and issues, tracking state and timestamps.
```sql
CREATE TABLE alarms (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    agent_id TEXT NOT NULL,
    alarm_type TEXT NOT NULL, -- e.g., 'link_down', 'new_peer'
    status TEXT NOT NULL DEFAULT 'active', -- 'active', 'dismissed'
    details_json TEXT DEFAULT '{}',
    created_at INTEGER NOT NULL DEFAULT (strftime('%s', 'now')),
    dismissed_at INTEGER,
    FOREIGN KEY (agent_id) REFERENCES agents(agent_id) ON DELETE CASCADE
);
CREATE INDEX idx_alarms_agent_status ON alarms(agent_id, status);
```

## Communication Protocol

### Security & Hardening

- **Authentication**: HMAC-SHA256 verification using a fleet-wide bootstrap PSK.
- **Key Discovery & Rotation**: The PSK is retrieved via DNS TXT record, HTTP(S) URL, or local file, and checked for changes hourly. The Manager should accept the current and the two previous bootstrap keys to handle propagation delays, returning a specific error if an agent connects with an outdated key.
- **Replay Protection**: Monotonic "ticks" and a sliding-window nonce cache (120 entries).
- **Time Sync**: A maximum clock skew of 10 minutes is allowed.

### Report Payload (Agent -> Manager)

Agents send a `report` every 30 seconds (with randomized jitter).

```json
{
  "version": 1,
  "tick": 42,
  "type": "report",
  "nonce": "base64-random-nonce",
  "timestamp": 1744569900,
  "agent_id": "agent-7f3a9b2c1d",
  "agent_version": 5,
  "fleet_id": "sha256-psk-hash",
  "hmac": "hex-hmac-sha256",
  "data": {
    "hostname": "node-01",
    "uptime_seconds": 123456,
    "loadavg": [0.12, 0.34, 0.56],
    "interfaces": [
      {
        "name": "eth0",
        "mac": "aa:bb:cc:dd:ee:ff",
        "addresses": ["192.168.1.10/24"],
        "is_virtual": false,
        "vpn_type": null
      }
    ],
    "routes": [
      { "dst": "0.0.0.0/0", "via": "192.168.1.1", "dev": "eth0" }
    ],
    "wg_peers": [
      {
        "public_key": "base64-key",
        "endpoint": "1.2.3.4:51820",
        "allowed_ips": ["10.0.0.2/32"],
        "last_handshake": 1744569800
      }
    ]
  }
}
```

### Relay Mechanism

Used when an agent cannot reach the Manager directly via the configured URL.
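The forwarding and loop rules of this mechanism can be sketched on the agent side as follows (a non-normative sketch; the function names are hypothetical, while `relay_path` and the 3-hop limit come from this section):

```python
# Sketch of the agent-side relay forwarding check: enforce the 3-hop
# limit and drop looping envelopes whose relay_path already contains
# our own agent_id.
MAX_HOPS = 3

def should_forward(envelope: dict, my_agent_id: str) -> bool:
    path = envelope.get("relay_path", [])
    if my_agent_id in path:    # loop: we already relayed this report
        return False
    if len(path) >= MAX_HOPS:  # hop budget exhausted
        return False
    return True

def forward(envelope: dict, my_agent_id: str) -> dict:
    """Append ourselves to the path before POSTing to the next hop's
    /status/relay endpoint (transport omitted in this sketch)."""
    out = dict(envelope)
    out["relay_path"] = envelope.get("relay_path", []) + [my_agent_id]
    return out
```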
- **Discovery**: The agent scans its connected WireGuard networks for other agents (checking port 5087) and queries their `/status/peer` endpoint to find a forward path to the Manager.
- **Hop Limit**: Up to 3 relay hops are supported.

> [!IMPORTANT]
> **Loop Detection**: Agents must check the `relay_path` array. If their own `agent_id` is present, the message is dropped to prevent infinite recursion.

---

## Data Model (Manager)

The Manager maintains the network state and inferred topology (full schemas above):

| Table | Purpose |
| :--- | :--- |
| `agents` | Fleet registry and presence tracking (heartbeat). |
| `agent_interfaces` | Historical snapshot of network interfaces. |
| `topology_edges` | Inferred links between agents (physical, VPN, relay). |
| `alarms` | Event log for changes (link down, new peer, etc.). |

---

## Visualization & UI

The network is visualized using **vis-network.min.js** in a layered approach:

1. **Layer 1 (Public)**: Servers with direct public IPs (masked as SHA fingerprints).
2. **Layer 2 (Linked)**: Servers behind NAT but directly connected to Layer 1.
3. **Layer 3 (Private)**: Isolated nodes reachable only via multi-hop paths.

---

## Operational Considerations

### Logging & Monitoring

- **Agents**: Log to `journald` at INFO level; critical errors (e.g., SQLite corruption, missing PSK) are logged at ERROR.
- **Manager**: Log each incoming report and every security failure (HMAC mismatch) with the source agent's IP and ID.

### Maintenance

- **Database Vacuum**: A periodic `VACUUM` on the Manager DB is recommended if many historical reports are retained.
- **Nonce Cache Cleanup**: The `nonce_cache` should be pruned every 10 minutes to prevent memory/storage bloat.

---

## Future Enhancements & Proposals

### 1. Alerting Integrations

- **Webhooks**: A simple HTTP POST to external services (Slack, Discord) when an `alarm` is created.

### 2. Historical Topology "Time-Travel"

- Store topology snapshots every hour.
- Allow the UI to "scrub" through history to see when a specific link was added or lost.

### 3. Advanced Visualization

- **Geographic Map Overlay**: If agents provide coordinates (or they are inferred via GeoIP), display nodes on a world map.
- **Link Bandwidth Visualization**: Thicker lines for higher-capacity links (e.g., physical vs. relay).
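As a closing illustration, the envelope authentication described in the security section of the Communication Protocol can be sketched as follows. The field names follow the example payload above; the exact canonical byte form that is HMAC'd is an assumption, since this spec does not pin one:

```python
# Sketch of signing and verifying the report envelope: HMAC-SHA256 over
# the JSON body (minus the "hmac" field), acceptance of current and
# previous bootstrap PSKs, and the 10-minute clock-skew allowance.
import base64
import hashlib
import hmac
import json
import os
import time

def _body_bytes(report: dict) -> bytes:
    # Canonicalize by sorting keys so agent and Manager hash identical bytes.
    return json.dumps({k: v for k, v in report.items() if k != "hmac"},
                      sort_keys=True, separators=(",", ":")).encode()

def sign_report(report: dict, psk: bytes) -> dict:
    """Attach nonce, timestamp, and HMAC to an outgoing report."""
    report = dict(report)
    report["nonce"] = base64.b64encode(os.urandom(16)).decode()
    report["timestamp"] = int(time.time())
    report["hmac"] = hmac.new(psk, _body_bytes(report), hashlib.sha256).hexdigest()
    return report

def verify_report(report: dict, psks: list[bytes], max_skew: int = 600) -> bool:
    """Accept if any of the supplied bootstrap keys (current plus the two
    previous ones) validates the HMAC and the timestamp is within skew."""
    if abs(int(time.time()) - report.get("timestamp", 0)) > max_skew:
        return False
    body = _body_bytes(report)
    return any(
        hmac.compare_digest(hmac.new(k, body, hashlib.sha256).hexdigest(),
                            report.get("hmac", ""))
        for k in psks)
```

Nonce-cache replay checks and the monotonic-tick comparison would sit on top of this, after the HMAC has been accepted.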