kattila.status/DESIGN.md (2026-04-17)

Kattila.status - Design Specification

Kattila.status is a virtual network topology monitor designed for multi-layer, multi-network environments (VPN meshes, WireGuard, OpenVPN). It follows an Agent-Manager architecture with a pure push-based messaging model.

Architecture Overview

```mermaid
graph TD
    subgraph "Agents (Debian/Linux)"
        A1[Agent 1]
        A2[Agent 2]
        A3[Agent 3]
    end

    subgraph "Manager (Python/Flask)"
        M[Manager API]
        DB[(SQLite WAL)]
        UI[Web UI / Vis-network]
    end

    A1 -->|HTTP/JSON| M
    A2 -->|Relay| A1
    A3 -->|Relay| A2
    M <--> DB
    UI <--> M
```

API Endpoints

Agent API (Listen: 5087)

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/status/healthcheck` | GET | Returns simple health status. |
| `/status/reset` | POST | Wipes local SQLite state and triggers re-registration. |
| `/status/peer` | GET | Returns local interface/route info (for relay peers). |
| `/status/relay` | POST | Accepts an enveloped report for forwarding to the Manager. |

Manager API (Listen: 5086)

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/status/updates` | POST | Receives periodic reports from agents. |
| `/status/register` | POST | First contact; issues a unique agent_id. |
| `/status/healthcheck` | GET | Manager heartbeat check. |
| `/status/alarms` | GET | Fetches active network anomalies. |
| `/status/agents` | GET | Lists all known agents and their status. |
| `/status/admin/reset` | POST | Resets specific agent or fleet state. |
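As a sketch of the manager-side handling behind `/status/updates`, independent of the Flask wiring (`handle_update` and the in-memory `AGENTS` registry are illustrative names, not part of the spec):

```python
import time

# In-memory presence view; the real manager persists this in the agents table.
AGENTS = {}  # agent_id -> {"last_seen_at": ..., "last_tick": ..., "status": ...}

def handle_update(report):
    """Validate a report envelope and refresh presence.
    Returns (http_status, response_body)."""
    agent_id = report.get("agent_id")
    tick = report.get("tick", 0)
    entry = AGENTS.get(agent_id)
    if entry is None:
        # Unknown agents must go through /status/register first.
        return 404, {"error": "unknown agent; call /status/register first"}
    if tick <= entry["last_tick"]:
        # Monotonic ticks: equal or older ticks are treated as replays.
        return 409, {"error": "stale tick (possible replay)"}
    entry.update(last_seen_at=int(time.time()), last_tick=tick, status="online")
    return 200, {"ok": True}
```

The 404/409 response codes are assumptions; the spec only defines the endpoints themselves.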

Data Model

Manager DB (kattila_manager.db)

agents

Tracks the fleet registry and presence.

```sql
CREATE TABLE agents (
    agent_id TEXT PRIMARY KEY,
    hostname TEXT NOT NULL,
    agent_version INTEGER NOT NULL,
    fleet_id TEXT NOT NULL,
    registered_at INTEGER NOT NULL,
    last_seen_at INTEGER NOT NULL,
    last_tick INTEGER NOT NULL DEFAULT 0,
    status TEXT NOT NULL DEFAULT 'online' -- online, offline, warning
);
CREATE INDEX idx_agents_last_seen ON agents(last_seen_at);
```
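The `last_seen_at` index supports a periodic presence sweep. A minimal sketch, assuming a 90-second threshold (three missed 30-second reports; the actual cutoff is not fixed by the spec):

```python
import sqlite3

OFFLINE_AFTER = 90  # seconds; assumed threshold, not spec'd

AGENTS_DDL = """
CREATE TABLE IF NOT EXISTS agents (
    agent_id TEXT PRIMARY KEY,
    hostname TEXT NOT NULL,
    agent_version INTEGER NOT NULL,
    fleet_id TEXT NOT NULL,
    registered_at INTEGER NOT NULL,
    last_seen_at INTEGER NOT NULL,
    last_tick INTEGER NOT NULL DEFAULT 0,
    status TEXT NOT NULL DEFAULT 'online'
)"""

def mark_stale_agents(conn, now):
    """Flip agents to 'offline' once last_seen_at falls outside the window.
    Returns the number of agents transitioned."""
    cur = conn.execute(
        "UPDATE agents SET status = 'offline' "
        "WHERE status != 'offline' AND last_seen_at < ?",
        (now - OFFLINE_AFTER,),
    )
    return cur.rowcount
```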

reports

Auditing and replay protection.

```sql
CREATE TABLE reports (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    agent_id TEXT NOT NULL,
    tick INTEGER NOT NULL,
    timestamp INTEGER NOT NULL,
    report_type TEXT NOT NULL, -- 'report', 'relay', 'register'
    report_json TEXT NOT NULL,
    received_at INTEGER NOT NULL DEFAULT (strftime('%s', 'now')),
    FOREIGN KEY (agent_id) REFERENCES agents(agent_id) ON DELETE CASCADE
);
CREATE UNIQUE INDEX idx_reports_agent_tick ON reports(agent_id, tick);
```
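The unique `(agent_id, tick)` index doubles as a database-level replay guard: a duplicate insert raises a constraint error that the manager can translate into a rejection. A sketch (`store_report` is an illustrative name):

```python
import sqlite3

REPORTS_DDL = """
CREATE TABLE reports (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    agent_id TEXT NOT NULL,
    tick INTEGER NOT NULL,
    timestamp INTEGER NOT NULL,
    report_type TEXT NOT NULL,
    report_json TEXT NOT NULL,
    received_at INTEGER NOT NULL
);
CREATE UNIQUE INDEX idx_reports_agent_tick ON reports(agent_id, tick);
"""

def store_report(conn, agent_id, tick, ts, rtype, payload_json):
    """Insert a report row; returns False if this (agent_id, tick) was
    already stored, i.e. the report is a duplicate/replay."""
    try:
        conn.execute(
            "INSERT INTO reports (agent_id, tick, timestamp, report_type, "
            "report_json, received_at) VALUES (?, ?, ?, ?, ?, ?)",
            (agent_id, tick, ts, rtype, payload_json, ts),
        )
        return True
    except sqlite3.IntegrityError:
        return False  # duplicate tick for this agent
```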

topology_edges

Inferred links between agents.

```sql
CREATE TABLE topology_edges (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    from_agent_id TEXT NOT NULL,
    to_agent_id TEXT NOT NULL,
    edge_type TEXT NOT NULL, -- 'wireguard', 'openvpn', 'physical', 'relay'
    metadata TEXT DEFAULT '{}', -- JSON for pubkeys, RTT, etc.
    last_seen INTEGER NOT NULL,
    is_active INTEGER NOT NULL DEFAULT 1
);
CREATE UNIQUE INDEX idx_edges_pair ON topology_edges(from_agent_id, to_agent_id, edge_type);
```
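Because `(from, to, type)` is unique, edge refreshes can be a single SQLite UPSERT. A sketch of how the manager might maintain the table as reports come in (the `ON CONFLICT` form requires SQLite 3.24+):

```python
import sqlite3

EDGES_DDL = """
CREATE TABLE topology_edges (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    from_agent_id TEXT NOT NULL,
    to_agent_id TEXT NOT NULL,
    edge_type TEXT NOT NULL,
    metadata TEXT DEFAULT '{}',
    last_seen INTEGER NOT NULL,
    is_active INTEGER NOT NULL DEFAULT 1
);
CREATE UNIQUE INDEX idx_edges_pair
    ON topology_edges(from_agent_id, to_agent_id, edge_type);
"""

def upsert_edge(conn, src, dst, edge_type, now, metadata="{}"):
    """Create or refresh an edge; re-activation of a dormant edge is implicit."""
    conn.execute(
        "INSERT INTO topology_edges "
        "(from_agent_id, to_agent_id, edge_type, metadata, last_seen, is_active) "
        "VALUES (?, ?, ?, ?, ?, 1) "
        "ON CONFLICT(from_agent_id, to_agent_id, edge_type) DO UPDATE SET "
        "last_seen = excluded.last_seen, metadata = excluded.metadata, is_active = 1",
        (src, dst, edge_type, metadata, now),
    )
```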

agent_interfaces

Tracks network interfaces an agent reports, allowing detection of when they come and go.

```sql
CREATE TABLE agent_interfaces (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    agent_id TEXT NOT NULL,
    interface_name TEXT NOT NULL,
    mac_address TEXT,
    addresses_json TEXT,
    is_virtual INTEGER NOT NULL DEFAULT 0,
    vpn_type TEXT,
    last_seen_at INTEGER NOT NULL,
    FOREIGN KEY (agent_id) REFERENCES agents(agent_id) ON DELETE CASCADE
);
CREATE UNIQUE INDEX idx_agent_interfaces ON agent_interfaces(agent_id, interface_name);
```

alarms

Event log for network changes and issues, tracking state and timestamps.

```sql
CREATE TABLE alarms (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    agent_id TEXT NOT NULL,
    alarm_type TEXT NOT NULL, -- e.g., 'link_down', 'new_peer'
    status TEXT NOT NULL DEFAULT 'active', -- 'active', 'dismissed'
    details_json TEXT DEFAULT '{}',
    created_at INTEGER NOT NULL DEFAULT (strftime('%s', 'now')),
    dismissed_at INTEGER,
    FOREIGN KEY (agent_id) REFERENCES agents(agent_id) ON DELETE CASCADE
);
CREATE INDEX idx_alarms_agent_status ON alarms(agent_id, status);
```

Communication Protocol

Security & Hardware

  • Authentication: HMAC-SHA256 verification using a fleet-wide Bootstrap PSK.
  • Key Discovery & Transition: The PSK is retrieved via DNS TXT, HTTP(S) URL, or local file and checked for changes hourly. The manager should accept the current and the 2 previous bootstrap keys to handle propagation delays, returning a specific error if an agent connects with an outdated key.
  • Replay Protection: Monotonic "ticks" and a sliding window nonce cache (120 entries).
  • Time Sync: 10-minute maximum clock skew allowance.
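The verification steps above can be sketched as follows. Signing over the canonical sorted-keys JSON of the envelope (with the `hmac` field excluded) is an assumption; the spec does not fix the exact byte representation that is signed:

```python
import hashlib
import hmac
import json
import time

MAX_SKEW = 600  # 10-minute clock-skew allowance

def sign(payload, psk):
    """HMAC-SHA256 over canonical JSON of the payload, hmac field excluded."""
    body = {k: v for k, v in payload.items() if k != "hmac"}
    msg = json.dumps(body, sort_keys=True, separators=(",", ":")).encode()
    return hmac.new(psk, msg, hashlib.sha256).hexdigest()

def verify(payload, psks, now=None):
    """Check skew, then try the current PSK plus up to two previous ones.
    Returns 'ok', 'outdated_key', 'clock_skew', or 'bad_hmac'."""
    now = int(time.time()) if now is None else now
    if abs(now - payload.get("timestamp", 0)) > MAX_SKEW:
        return "clock_skew"
    for i, psk in enumerate(psks[:3]):
        if hmac.compare_digest(sign(payload, psk), payload.get("hmac", "")):
            # A previous key verifying is the "outdated key" error case.
            return "ok" if i == 0 else "outdated_key"
    return "bad_hmac"
```

The distinct `outdated_key` result is what lets the manager return the specific "outdated key" error the transition scheme calls for.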

Report Payload (Agent -> Manager)

Agents send a report every 30 seconds (with randomized jitter).

```json
{
  "version": 1,
  "tick": 42,
  "type": "report",
  "nonce": "base64-random-nonce",
  "timestamp": 1744569900,
  "agent_id": "agent-7f3a9b2c1d",
  "agent_version": 5,
  "fleet_id": "sha256-psk-hash",
  "hmac": "hex-hmac-sha256",
  "data": {
    "hostname": "node-01",
    "uptime_seconds": 123456,
    "loadavg": [0.12, 0.34, 0.56],
    "interfaces": [
      {
        "name": "eth0",
        "mac": "aa:bb:cc:dd:ee:ff",
        "addresses": ["192.168.1.10/24"],
        "is_virtual": false,
        "vpn_type": null
      }
    ],
    "routes": [
      { "dst": "0.0.0.0/0", "via": "192.168.1.1", "dev": "eth0" }
    ],
    "wg_peers": [
      {
        "public_key": "base64-key",
        "endpoint": "1.2.3.4:51820",
        "allowed_ips": ["10.0.0.2/32"],
        "last_handshake": 1744569800
      }
    ]
  }
}
```
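On the agent side, the envelope fields above can be produced like this (a sketch; the ±10% jitter range and 16-byte nonce length are assumptions, as the spec only says "randomized jitter"):

```python
import base64
import os
import random
import time

REPORT_INTERVAL = 30  # seconds, per the spec

def next_envelope(state):
    """Build the next report envelope skeleton; `state` carries the agent
    identity and the monotonic tick. Field names follow the payload above;
    the hmac field is added after signing."""
    state["tick"] += 1
    return {
        "version": 1,
        "tick": state["tick"],
        "type": "report",
        "nonce": base64.b64encode(os.urandom(16)).decode(),
        "timestamp": int(time.time()),
        "agent_id": state["agent_id"],
        "agent_version": state["agent_version"],
        "fleet_id": state["fleet_id"],
    }

def next_delay():
    """30 s base interval with randomized jitter (±10% assumed)."""
    return REPORT_INTERVAL * random.uniform(0.9, 1.1)
```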

Relay Mechanism

Used when an agent cannot reach the manager directly via the configured URL.

  • Discovery: The agent will scan its connected WireGuard networks for other agents (checking port 5087). It queries their /status/peer endpoint to find a forward path to the manager.
  • Supports up to 3 hops.

Important

Loop Detection: Agents must check the relay_path array; if their own agent_id is already present, the message is dropped to prevent forwarding loops.
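The forwarding decision combining loop detection with the 3-hop limit can be sketched as a single check (`should_forward` is an illustrative name):

```python
def should_forward(envelope, my_agent_id, max_hops=3):
    """Return True and append ourselves to relay_path if the envelope may be
    forwarded; drop it (False) on a loop or an exhausted hop budget."""
    path = envelope.setdefault("relay_path", [])
    if my_agent_id in path:
        return False  # loop: we already relayed this envelope
    if len(path) >= max_hops:
        return False  # hop budget exhausted
    path.append(my_agent_id)
    return True
```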


Data Model Summary (Manager)

The Manager maintains the network state and inferred topology.

| Table | Purpose |
|-------|---------|
| `agents` | Fleet registry and presence tracking (heartbeat). |
| `agent_interfaces` | Historical snapshot of network interfaces. |
| `topology_edges` | Inferred links between agents (physical, VPN, relay). |
| `alarms` | Event log for changes (link down, new peer, etc.). |

Visualization & UI

The network is visualized using the vis-network library in a layered approach:

  1. Layer 1 (Public): Servers with direct public IPs (masked as SHA fingerprints).
  2. Layer 2 (Linked): Servers behind NAT but directly connected to Layer 1.
  3. Layer 3 (Private): Isolated nodes reachable only via multi-hop paths.
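A rough sketch of how nodes might be assigned to these layers; the heuristic (globally routable address → Layer 1, otherwise use a precomputed `linked_to_public` flag from the edge table) is an assumption, as the spec only names the layers:

```python
import ipaddress

def classify_layer(agent):
    """Return 1, 2, or 3 per the layering above. `agent` is a dict with the
    reported CIDR addresses and an assumed precomputed linked_to_public flag."""
    for addr in agent.get("addresses", []):
        if ipaddress.ip_interface(addr).ip.is_global:
            return 1  # direct public IP
    if agent.get("linked_to_public"):
        return 2  # behind NAT, but directly connected to a Layer 1 node
    return 3  # reachable only via multi-hop paths
```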

Operational Considerations

Logging & Monitoring

  • Agents: Should log to journald at INFO level. Critical errors (e.g., SQLite corruption, no PSK) should be logged at ERROR.
  • Manager: Log each incoming report and security failure (HMAC mismatch) with the source agent IP and ID.

Maintenance

  • Database Vacuum: Periodic VACUUM on the manager DB is recommended if tracking many historical reports.
  • Nonce Cache Cleanup: The nonce cache should be pruned every 10 minutes to prevent memory/storage bloat.
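A sketch of a nonce cache combining the 120-entry sliding window from the replay-protection spec with the age-based cleanup above (the `NonceCache` class is illustrative):

```python
import time
from collections import OrderedDict

class NonceCache:
    """Sliding-window nonce cache: bounded to 120 entries (per the
    replay-protection spec) and aged out by the periodic cleanup."""
    def __init__(self, max_entries=120, max_age=600):
        self.max_entries = max_entries
        self.max_age = max_age
        self._seen = OrderedDict()  # nonce -> insertion time (oldest first)

    def check_and_add(self, nonce, now=None):
        """Return False if the nonce was already seen (replay), else record it."""
        now = time.time() if now is None else now
        self.cleanup(now)
        if nonce in self._seen:
            return False
        self._seen[nonce] = now
        while len(self._seen) > self.max_entries:
            self._seen.popitem(last=False)  # evict the oldest entry
        return True

    def cleanup(self, now=None):
        """Drop entries older than max_age; insertion order makes this O(evicted)."""
        now = time.time() if now is None else now
        while self._seen and next(iter(self._seen.values())) < now - self.max_age:
            self._seen.popitem(last=False)
```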

Future Enhancements & Proposals

1. Alerting Integrations

  • Webhooks: Simple HTTP POST to external services (Slack, Discord) when an alarm is created.
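A minimal sketch of such a webhook sender, using only the standard library; the message format and the injectable `opener` parameter (which keeps the function testable without a network) are assumptions:

```python
import json
import urllib.request

def alarm_webhook_payload(alarm):
    """Translate an alarms row into a simple chat-friendly message.
    Field names mirror the alarms table; the text format is an assumption."""
    return {
        "text": f"[kattila.status] {alarm['alarm_type']} on agent "
                f"{alarm['agent_id']} (alarm #{alarm['id']})"
    }

def post_alarm(url, alarm, opener=urllib.request.urlopen):
    """Fire a simple HTTP POST with a JSON body when an alarm is created."""
    req = urllib.request.Request(
        url,
        data=json.dumps(alarm_webhook_payload(alarm)).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    return opener(req, timeout=5)
```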

2. Historical Topology "Time-Travel"

  • Store topology snapshots every hour.
  • Allow the UI to "scrub" through history to see when a specific link was added or lost.

3. Advanced Visualization

  • Geographic Map Overlay: If agents provide coordinates (or inferred via GeoIP), display nodes on a world map.
  • Link Bandwidth Visualization: Thicker lines for higher capacity links (e.g., physical vs. relay).