kattila.status/DESIGN.md (2026-04-17)

Kattila.status - Design Specification

Kattila.status is a virtual network topology monitor designed for multi-layer, multi-network environments (VPN meshes, WireGuard, OpenVPN). It follows an Agent-Manager architecture with a pure push-based messaging model.

Architecture Overview

```mermaid
graph TD
    subgraph "Agents (Debian/Linux)"
        A1[Agent 1]
        A2[Agent 2]
        A3[Agent 3]
    end

    subgraph "Manager (Python/Flask)"
        M[Manager API]
        DB[(SQLite WAL)]
        UI[Web UI / Vis-network]
    end

    A1 -->|HTTP/JSON| M
    A2 -->|Relay| A1
    A3 -->|Relay| A2
    M <--> DB
    UI <--> M
```

API Endpoints

Agent API (Listen: 5087)

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/status/healthcheck` | GET | Returns simple health status. |
| `/status/reset` | POST | Wipes local SQLite state and triggers re-registration. |
| `/status/peer` | GET | Returns local interface/route info (for relay peers). |
| `/status/relay` | POST | Accepts an enveloped report for forwarding to the Manager. |

Manager API (Listen: 5086)

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/status/updates` | POST | Receives periodic reports from agents. |
| `/status/register` | POST | First contact; issues a unique agent_id. |
| `/status/healthcheck` | GET | Manager heartbeat check. |
| `/status/alarms` | GET | Fetches active network anomalies. |
| `/status/agents` | GET | Lists all known agents and their status. |
| `/status/admin/reset` | POST | Resets specific agent or fleet state. |
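As a sketch of the manager-side handling behind `/status/updates`, independent of the Flask wiring (`handle_update` and the in-memory `AGENTS` registry are illustrative names, not part of the spec):

```python
import time

# In-memory presence view; the real manager persists this in the agents table.
AGENTS = {}  # agent_id -> {"last_seen_at": ..., "last_tick": ..., "status": ...}

def handle_update(report):
    """Validate a report envelope and refresh presence.
    Returns (http_status, response_body)."""
    agent_id = report.get("agent_id")
    tick = report.get("tick", 0)
    entry = AGENTS.get(agent_id)
    if entry is None:
        # Unknown agents must go through /status/register first.
        return 404, {"error": "unknown agent; call /status/register first"}
    if tick <= entry["last_tick"]:
        # Monotonic ticks: equal or older ticks are treated as replays.
        return 409, {"error": "stale tick (possible replay)"}
    entry.update(last_seen_at=int(time.time()), last_tick=tick, status="online")
    return 200, {"ok": True}
```

The 404/409 response codes are assumptions; the spec only defines the endpoints themselves.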

Data Model

Manager DB (kattila_manager.db)

agents

Tracks the fleet registry and presence.

```sql
CREATE TABLE agents (
    agent_id TEXT PRIMARY KEY,
    hostname TEXT NOT NULL,
    agent_version INTEGER NOT NULL,
    fleet_id TEXT NOT NULL,
    registered_at INTEGER NOT NULL,
    last_seen_at INTEGER NOT NULL,
    last_tick INTEGER NOT NULL DEFAULT 0,
    status TEXT NOT NULL DEFAULT 'online' -- online, offline, warning
);
CREATE INDEX idx_agents_last_seen ON agents(last_seen_at);
```
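The `last_seen_at` index supports a periodic presence sweep. A minimal sketch, assuming a 90-second threshold (three missed 30-second reports; the actual cutoff is not fixed by the spec):

```python
import sqlite3

OFFLINE_AFTER = 90  # seconds; assumed threshold, not spec'd

AGENTS_DDL = """
CREATE TABLE IF NOT EXISTS agents (
    agent_id TEXT PRIMARY KEY,
    hostname TEXT NOT NULL,
    agent_version INTEGER NOT NULL,
    fleet_id TEXT NOT NULL,
    registered_at INTEGER NOT NULL,
    last_seen_at INTEGER NOT NULL,
    last_tick INTEGER NOT NULL DEFAULT 0,
    status TEXT NOT NULL DEFAULT 'online'
)"""

def mark_stale_agents(conn, now):
    """Flip agents to 'offline' once last_seen_at falls outside the window.
    Returns the number of agents transitioned."""
    cur = conn.execute(
        "UPDATE agents SET status = 'offline' "
        "WHERE status != 'offline' AND last_seen_at < ?",
        (now - OFFLINE_AFTER,),
    )
    return cur.rowcount
```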

reports

Auditing and replay protection.

```sql
CREATE TABLE reports (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    agent_id TEXT NOT NULL,
    tick INTEGER NOT NULL,
    timestamp INTEGER NOT NULL,
    report_type TEXT NOT NULL, -- 'report', 'relay', 'register'
    report_json TEXT NOT NULL,
    received_at INTEGER NOT NULL DEFAULT (strftime('%s', 'now')),
    FOREIGN KEY (agent_id) REFERENCES agents(agent_id) ON DELETE CASCADE
);
CREATE UNIQUE INDEX idx_reports_agent_tick ON reports(agent_id, tick);
```
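The unique `(agent_id, tick)` index doubles as a database-level replay guard: a duplicate insert raises a constraint error that the manager can translate into a rejection. A sketch (`store_report` is an illustrative name):

```python
import sqlite3

REPORTS_DDL = """
CREATE TABLE reports (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    agent_id TEXT NOT NULL,
    tick INTEGER NOT NULL,
    timestamp INTEGER NOT NULL,
    report_type TEXT NOT NULL,
    report_json TEXT NOT NULL,
    received_at INTEGER NOT NULL
);
CREATE UNIQUE INDEX idx_reports_agent_tick ON reports(agent_id, tick);
"""

def store_report(conn, agent_id, tick, ts, rtype, payload_json):
    """Insert a report row; returns False if this (agent_id, tick) was
    already stored, i.e. the report is a duplicate/replay."""
    try:
        conn.execute(
            "INSERT INTO reports (agent_id, tick, timestamp, report_type, "
            "report_json, received_at) VALUES (?, ?, ?, ?, ?, ?)",
            (agent_id, tick, ts, rtype, payload_json, ts),
        )
        return True
    except sqlite3.IntegrityError:
        return False  # duplicate tick for this agent
```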

topology_edges

Inferred links between agents.

```sql
CREATE TABLE topology_edges (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    from_agent_id TEXT NOT NULL,
    to_agent_id TEXT NOT NULL,
    edge_type TEXT NOT NULL, -- 'wireguard', 'openvpn', 'physical', 'relay'
    metadata TEXT DEFAULT '{}', -- JSON for pubkeys, RTT, etc.
    last_seen INTEGER NOT NULL,
    is_active INTEGER NOT NULL DEFAULT 1
);
CREATE UNIQUE INDEX idx_edges_pair ON topology_edges(from_agent_id, to_agent_id, edge_type);
```
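Because `(from, to, type)` is unique, edge refreshes can be a single SQLite UPSERT. A sketch of how the manager might maintain the table as reports come in (the `ON CONFLICT` form requires SQLite 3.24+):

```python
import sqlite3

EDGES_DDL = """
CREATE TABLE topology_edges (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    from_agent_id TEXT NOT NULL,
    to_agent_id TEXT NOT NULL,
    edge_type TEXT NOT NULL,
    metadata TEXT DEFAULT '{}',
    last_seen INTEGER NOT NULL,
    is_active INTEGER NOT NULL DEFAULT 1
);
CREATE UNIQUE INDEX idx_edges_pair
    ON topology_edges(from_agent_id, to_agent_id, edge_type);
"""

def upsert_edge(conn, src, dst, edge_type, now, metadata="{}"):
    """Create or refresh an edge; re-activation of a dormant edge is implicit."""
    conn.execute(
        "INSERT INTO topology_edges "
        "(from_agent_id, to_agent_id, edge_type, metadata, last_seen, is_active) "
        "VALUES (?, ?, ?, ?, ?, 1) "
        "ON CONFLICT(from_agent_id, to_agent_id, edge_type) DO UPDATE SET "
        "last_seen = excluded.last_seen, metadata = excluded.metadata, is_active = 1",
        (src, dst, edge_type, metadata, now),
    )
```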

agent_interfaces

Tracks network interfaces an agent reports, allowing detection of when they come and go.

```sql
CREATE TABLE agent_interfaces (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    agent_id TEXT NOT NULL,
    interface_name TEXT NOT NULL,
    mac_address TEXT,
    addresses_json TEXT,
    is_virtual INTEGER NOT NULL DEFAULT 0,
    vpn_type TEXT,
    last_seen_at INTEGER NOT NULL,
    FOREIGN KEY (agent_id) REFERENCES agents(agent_id) ON DELETE CASCADE
);
CREATE UNIQUE INDEX idx_agent_interfaces ON agent_interfaces(agent_id, interface_name);
```

alarms

Event log for network changes and issues, tracking state and timestamps.

```sql
CREATE TABLE alarms (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    agent_id TEXT NOT NULL,
    alarm_type TEXT NOT NULL, -- e.g., 'link_down', 'new_peer'
    status TEXT NOT NULL DEFAULT 'active', -- 'active', 'dismissed'
    details_json TEXT DEFAULT '{}',
    created_at INTEGER NOT NULL DEFAULT (strftime('%s', 'now')),
    dismissed_at INTEGER,
    FOREIGN KEY (agent_id) REFERENCES agents(agent_id) ON DELETE CASCADE
);
CREATE INDEX idx_alarms_agent_status ON alarms(agent_id, status);
```

Communication Protocol

Security & Hardware

  • Authentication: HMAC-SHA256 verification using a fleet-wide Bootstrap PSK.
  • Key Discovery & Transition: The PSK is retrieved via DNS TXT, HTTP(S) URL, or local file and checked for changes hourly. The manager should accept the current and the 2 previous bootstrap keys to handle propagation delays, returning a specific error if an agent connects with an outdated key.
  • Replay Protection: Monotonic "ticks" and a sliding window nonce cache (120 entries).
  • Time Sync: 10-minute maximum clock skew allowance.
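The verification steps above can be sketched as follows. Signing over the canonical sorted-keys JSON of the envelope (with the `hmac` field excluded) is an assumption; the spec does not fix the exact byte representation that is signed:

```python
import hashlib
import hmac
import json
import time

MAX_SKEW = 600  # 10-minute clock-skew allowance

def sign(payload, psk):
    """HMAC-SHA256 over canonical JSON of the payload, hmac field excluded."""
    body = {k: v for k, v in payload.items() if k != "hmac"}
    msg = json.dumps(body, sort_keys=True, separators=(",", ":")).encode()
    return hmac.new(psk, msg, hashlib.sha256).hexdigest()

def verify(payload, psks, now=None):
    """Check skew, then try the current PSK plus up to two previous ones.
    Returns 'ok', 'outdated_key', 'clock_skew', or 'bad_hmac'."""
    now = int(time.time()) if now is None else now
    if abs(now - payload.get("timestamp", 0)) > MAX_SKEW:
        return "clock_skew"
    for i, psk in enumerate(psks[:3]):
        if hmac.compare_digest(sign(payload, psk), payload.get("hmac", "")):
            # A previous key verifying is the "outdated key" error case.
            return "ok" if i == 0 else "outdated_key"
    return "bad_hmac"
```

The distinct `outdated_key` result is what lets the manager return the specific "outdated key" error the transition scheme calls for.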

Report Payload (Agent -> Manager)

Agents send a report every 30 seconds (with randomized jitter).

```json
{
  "version": 1,
  "tick": 42,
  "type": "report",
  "nonce": "base64-random-nonce",
  "timestamp": 1744569900,
  "agent_id": "agent-7f3a9b2c1d",
  "agent_version": 5,
  "fleet_id": "sha256-psk-hash",
  "hmac": "hex-hmac-sha256",
  "data": {
    "hostname": "node-01",
    "uptime_seconds": 123456,
    "loadavg": [0.12, 0.34, 0.56],
    "interfaces": [
      {
        "name": "eth0",
        "mac": "aa:bb:cc:dd:ee:ff",
        "addresses": ["192.168.1.10/24"],
        "is_virtual": false,
        "vpn_type": null
      }
    ],
    "routes": [
      { "dst": "0.0.0.0/0", "via": "192.168.1.1", "dev": "eth0" }
    ],
    "wg_peers": [
      {
        "public_key": "base64-key",
        "endpoint": "1.2.3.4:51820",
        "allowed_ips": ["10.0.0.2/32"],
        "last_handshake": 1744569800
      }
    ]
  }
}
```
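On the agent side, the envelope fields above can be produced like this (a sketch; the ±10% jitter range and 16-byte nonce length are assumptions, as the spec only says "randomized jitter"):

```python
import base64
import os
import random
import time

REPORT_INTERVAL = 30  # seconds, per the spec

def next_envelope(state):
    """Build the next report envelope skeleton; `state` carries the agent
    identity and the monotonic tick. Field names follow the payload above;
    the hmac field is added after signing."""
    state["tick"] += 1
    return {
        "version": 1,
        "tick": state["tick"],
        "type": "report",
        "nonce": base64.b64encode(os.urandom(16)).decode(),
        "timestamp": int(time.time()),
        "agent_id": state["agent_id"],
        "agent_version": state["agent_version"],
        "fleet_id": state["fleet_id"],
    }

def next_delay():
    """30 s base interval with randomized jitter (±10% assumed)."""
    return REPORT_INTERVAL * random.uniform(0.9, 1.1)
```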

Relay Mechanism

Used when an agent cannot reach the manager directly via the configured URL.

  • Discovery: The agent will scan its connected WireGuard networks for other agents (checking port 5087). It queries their /status/peer endpoint to find a forward path to the manager.
  • Supports up to 3 hops.

Important

Loop Detection: Agents must check the relay_path array; if their own agent_id is already present, the message is dropped to prevent forwarding loops.
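The forwarding decision combining loop detection with the 3-hop limit can be sketched as a single check (`should_forward` is an illustrative name):

```python
def should_forward(envelope, my_agent_id, max_hops=3):
    """Return True and append ourselves to relay_path if the envelope may be
    forwarded; drop it (False) on a loop or an exhausted hop budget."""
    path = envelope.setdefault("relay_path", [])
    if my_agent_id in path:
        return False  # loop: we already relayed this envelope
    if len(path) >= max_hops:
        return False  # hop budget exhausted
    path.append(my_agent_id)
    return True
```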


Data Model Summary (Manager)

The Manager maintains the network state and inferred topology.

| Table | Purpose |
|-------|---------|
| `agents` | Fleet registry and presence tracking (heartbeat). |
| `agent_interfaces` | Historical snapshot of network interfaces. |
| `topology_edges` | Inferred links between agents (physical, VPN, relay). |
| `alarms` | Event log for changes (link down, new peer, etc.). |

Visualization & UI

The network is visualized using the vis-network library in a layered approach:

  1. Layer 1 (Public): Servers with direct public IPs (masked as SHA fingerprints).
  2. Layer 2 (Linked): Servers behind NAT but directly connected to Layer 1.
  3. Layer 3 (Private): Isolated nodes reachable only via multi-hop paths.
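A rough sketch of how nodes might be assigned to these layers; the heuristic (globally routable address → Layer 1, otherwise use a precomputed `linked_to_public` flag from the edge table) is an assumption, as the spec only names the layers:

```python
import ipaddress

def classify_layer(agent):
    """Return 1, 2, or 3 per the layering above. `agent` is a dict with the
    reported CIDR addresses and an assumed precomputed linked_to_public flag."""
    for addr in agent.get("addresses", []):
        if ipaddress.ip_interface(addr).ip.is_global:
            return 1  # direct public IP
    if agent.get("linked_to_public"):
        return 2  # behind NAT, but directly connected to a Layer 1 node
    return 3  # reachable only via multi-hop paths
```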

Operational Considerations

Logging & Monitoring

  • Agents: Should log to journald at INFO level. Critical errors (e.g., SQLite corruption, no PSK) should be logged at ERROR.
  • Manager: Log each incoming report and security failure (HMAC mismatch) with the source agent IP and ID.

Maintenance

  • Database Vacuum: Periodic VACUUM on the manager DB is recommended if tracking many historical reports.
  • Nonce Cache Cleanup: The nonce cache should be pruned every 10 minutes to prevent memory/storage bloat.
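A sketch of a nonce cache combining the 120-entry sliding window from the replay-protection spec with the age-based cleanup above (the `NonceCache` class is illustrative):

```python
import time
from collections import OrderedDict

class NonceCache:
    """Sliding-window nonce cache: bounded to 120 entries (per the
    replay-protection spec) and aged out by the periodic cleanup."""
    def __init__(self, max_entries=120, max_age=600):
        self.max_entries = max_entries
        self.max_age = max_age
        self._seen = OrderedDict()  # nonce -> insertion time (oldest first)

    def check_and_add(self, nonce, now=None):
        """Return False if the nonce was already seen (replay), else record it."""
        now = time.time() if now is None else now
        self.cleanup(now)
        if nonce in self._seen:
            return False
        self._seen[nonce] = now
        while len(self._seen) > self.max_entries:
            self._seen.popitem(last=False)  # evict the oldest entry
        return True

    def cleanup(self, now=None):
        """Drop entries older than max_age; insertion order makes this O(evicted)."""
        now = time.time() if now is None else now
        while self._seen and next(iter(self._seen.values())) < now - self.max_age:
            self._seen.popitem(last=False)
```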

Future Enhancements & Proposals

1. Alerting Integrations

  • Webhooks: Simple HTTP POST to external services (Slack, Discord) when an alarm is created.
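A minimal sketch of such a webhook sender, using only the standard library; the message format and the injectable `opener` parameter (which keeps the function testable without a network) are assumptions:

```python
import json
import urllib.request

def alarm_webhook_payload(alarm):
    """Translate an alarms row into a simple chat-friendly message.
    Field names mirror the alarms table; the text format is an assumption."""
    return {
        "text": f"[kattila.status] {alarm['alarm_type']} on agent "
                f"{alarm['agent_id']} (alarm #{alarm['id']})"
    }

def post_alarm(url, alarm, opener=urllib.request.urlopen):
    """Fire a simple HTTP POST with a JSON body when an alarm is created."""
    req = urllib.request.Request(
        url,
        data=json.dumps(alarm_webhook_payload(alarm)).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    return opener(req, timeout=5)
```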

2. Historical Topology "Time-Travel"

  • Store topology snapshots every hour.
  • Allow the UI to "scrub" through history to see when a specific link was added or lost.

3. Advanced Visualization

  • Geographic Map Overlay: If agents provide coordinates (or inferred via GeoIP), display nodes on a world map.
  • Link Bandwidth Visualization: Thicker lines for higher capacity links (e.g., physical vs. relay).