Kattila Manager Implementation Plan

This document outlines the detailed architecture and implementation steps for the Python-based Kattila Manager.

Overview

The Manager is a Python/Flask application that maintains a centralized SQLite (WAL mode) database. It provides an HTTP API to receive pushed reports from the agents, verifies their HMAC-SHA256 signatures, prevents replay attacks using a sliding-window nonce cache, and updates the local network topology and alarm states based on the received data.

Look at kattila.poc for ideas on how to implement IP address anonymization, and for tips on how the map should be drawn.

Proposed Architecture / Modules

1. Database Layer (db.py)

  • Initializes an sqlite3 connection with PRAGMA journal_mode=WAL;.
  • Automatically executes the CREATE TABLE and CREATE INDEX SQL schemas defined in the DESIGN document on startup.
  • Exposes structured data access methods for other modules (e.g., upsert_agent, insert_report, update_interfaces, update_edges, create_alarm).
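
A minimal sketch of this layer follows. The kattila_manager.db filename is taken from the verification plan below, but the table and column names here are illustrative; the authoritative schema comes from the DESIGN document.

```python
# db.py -- minimal sketch; table and column names are illustrative,
# the authoritative schema comes from the DESIGN document.
import sqlite3
import time

DB_PATH = "kattila_manager.db"

def get_connection(path: str = DB_PATH) -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL;")  # enable WAL on every connection
    conn.row_factory = sqlite3.Row
    return conn

def init_schema(conn: sqlite3.Connection) -> None:
    # Placeholder: the real CREATE TABLE / CREATE INDEX statements
    # are copied from the DESIGN document at startup.
    conn.execute(
        """CREATE TABLE IF NOT EXISTS agents (
               agent_id TEXT PRIMARY KEY,
               last_seen_at INTEGER
           )"""
    )
    conn.commit()

def upsert_agent(conn: sqlite3.Connection, agent_id: str) -> None:
    # Insert the agent if unknown, otherwise refresh its heartbeat.
    conn.execute(
        """INSERT INTO agents (agent_id, last_seen_at) VALUES (?, ?)
           ON CONFLICT(agent_id) DO UPDATE
           SET last_seen_at = excluded.last_seen_at""",
        (agent_id, int(time.time())),
    )
    conn.commit()
```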

2. Security Layer (security.py)

  • Key Fetching: A background thread or periodic polling function that uses a DNS resolver to fetch the Bootstrap PSK from the TXT record, keeping track of the current PSK and the two previous PSKs.
  • HMAC Verification: Parses incoming JSON, re-serializes the data payload identically, and checks if the provided HMAC matches one of the known PSKs.
  • Nonce Cache: Maintains a bounded in-memory cache (e.g., collections.OrderedDict) of the last 120 nonces to prevent replay attacks.
  • Time Skew: Rejects reports whose timestamp deviates by more than 10 minutes from the Manager's local clock.
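
A sketch of the verification path under these rules. The envelope field names (data, nonce, timestamp, hmac) are assumptions; the 10-minute skew limit, 120-nonce window, and multi-PSK check come from the bullets above.

```python
# security.py -- sketch only; envelope field names (data, nonce,
# timestamp, hmac) are assumptions, while the 10-minute skew limit,
# 120-nonce window, and multi-PSK check come from the design above.
import hashlib
import hmac
import json
import time
from collections import OrderedDict

MAX_SKEW_SECONDS = 10 * 60   # reject reports more than 10 minutes off
NONCE_CACHE_SIZE = 120       # sliding window of recently seen nonces

_nonce_cache: "OrderedDict[str, float]" = OrderedDict()

def verify_report(envelope: dict, known_psks: list[bytes]) -> bool:
    """Return True if the envelope passes skew, nonce, and HMAC checks."""
    # 1. Time skew: reject stale or future-dated reports.
    if abs(time.time() - envelope["timestamp"]) > MAX_SKEW_SECONDS:
        return False

    # 2. Replay protection: reject nonces seen inside the window.
    nonce = envelope["nonce"]
    if nonce in _nonce_cache:
        return False

    # 3. HMAC: re-serialize the payload deterministically and compare
    #    against every known PSK (current plus the two previous ones).
    payload = json.dumps(envelope["data"], sort_keys=True, separators=(",", ":"))
    for psk in known_psks:
        expected = hmac.new(psk, payload.encode(), hashlib.sha256).hexdigest()
        if hmac.compare_digest(expected, envelope["hmac"]):
            _nonce_cache[nonce] = time.time()
            if len(_nonce_cache) > NONCE_CACHE_SIZE:
                _nonce_cache.popitem(last=False)  # evict the oldest nonce
            return True
    return False
```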

3. Data Processor (processor.py)

This is the core business logic engine invoked whenever a valid /status/updates payload hits the API:

  • Agents: Upsert the agent_id into the agents table and update the last_seen_at heartbeat.
  • Reports: Store the raw envelope in reports for auditing.
  • Interfaces: Compare the payload's interfaces against agent_interfaces. If new interfaces appear or old ones disappear, update the DB and potentially trigger an alarm (e.g., "Interface eth0 went down").
  • Topology Edges: Iterate over wg_peers. For each peer, create or update a link in topology_edges specifying edge_type='wireguard'.
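
A sketch of that flow, reusing the db.py helpers named above. The get_interfaces helper is an assumed addition, and payload field names such as peer_agent_id are assumptions.

```python
# processor.py -- sketch; the db helpers are the ones named in the
# db.py section (get_interfaces is an assumed addition), and payload
# field names such as peer_agent_id are assumptions.
import db

def process_report(conn, agent_id: str, payload: dict) -> None:
    db.upsert_agent(conn, agent_id)            # refresh last_seen_at heartbeat
    db.insert_report(conn, agent_id, payload)  # raw envelope for auditing

    # Diff reported interfaces against stored state and alarm on changes.
    current = set(payload.get("interfaces", []))
    previous = set(db.get_interfaces(conn, agent_id))
    for iface in previous - current:
        db.create_alarm(conn, agent_id, f"Interface {iface} disappeared")
    db.update_interfaces(conn, agent_id, sorted(current))

    # Each WireGuard peer becomes (or refreshes) a topology edge.
    for peer in payload.get("wg_peers", []):
        db.update_edges(conn, agent_id, peer["peer_agent_id"],
                        edge_type="wireguard")
```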

4. API Layer (api.py or app.py)

  • A Flask Blueprint or App defining:
    • POST /status/updates: Main ingress. Parses JSON -> Verifies HMAC & Nonce -> Calls Processor -> Returns OK. Unwraps relay_path envelopes iteratively if needed.
    • POST /status/register: Allows new agents to announce their generated ID.
    • GET /status/healthcheck: Returns {"status": "ok"}.
    • GET /status/alarms: JSON list of active alarms.
    • GET /status/agents: JSON dump of the fleet matrix.
    • POST /status/admin/reset: Clears specific agent topology state.
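
A sketch of the main ingress route wired to the layers above. The error code, the relay_path unwrapping structure, and the KNOWN_PSKS stand-in for the key-fetching thread's state are all assumptions.

```python
# api.py -- sketch; verify_report and process_report follow the earlier
# sketches, KNOWN_PSKS stands in for the key-fetching thread's state,
# and the relay_path unwrapping structure is an assumption.
from flask import Blueprint, jsonify, request

import db
import processor
import security

bp = Blueprint("status", __name__, url_prefix="/status")
KNOWN_PSKS: list[bytes] = []  # maintained by the security layer's PSK poller

@bp.route("/healthcheck", methods=["GET"])
def healthcheck():
    return jsonify({"status": "ok"})

@bp.route("/updates", methods=["POST"])
def updates():
    envelope = request.get_json(force=True)
    # Unwrap nested relay_path envelopes iteratively before verification.
    while isinstance(envelope, dict) and "relay_path" in envelope:
        envelope = envelope["relay_path"]
    if not security.verify_report(envelope, KNOWN_PSKS):
        return jsonify({"error": "invalid signature or replay"}), 403
    processor.process_report(db.get_connection(), envelope["agent_id"],
                             envelope["data"])
    return jsonify({"status": "ok"})
```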

User Review Required

Important

  • Since Python's standard library doesn't natively support fetching DNS TXT records, I plan to add dnspython to requirements.txt. Is this acceptable? (A sketch of the fetch follows this list.)
  • The agent already generates its own secure hexadecimal agent_id locally. Instead of the Manager strictly requiring /status/register before anything else, is it acceptable for the Manager to "auto-register" (upsert) unknown agent_ids directly when they push a valid /status/updates report? (This simplifies bootstrapping considerably.)
  • When generating alarms, should we just log simple messages like "Interface X disappeared" and keep the alarm active until a human clears it, or should the alarms auto-dismiss when the issue resolves (e.g., interface comes back)?
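
If dnspython is approved, the TXT fetch could look roughly like the following; the record name is a placeholder, not the real deployment value.

```python
# Sketch of fetching the Bootstrap PSK via dnspython; the record
# name is a placeholder, not the real deployment value.
import dns.resolver

def fetch_bootstrap_psk(record: str = "psk.kattila.example") -> str:
    answer = dns.resolver.resolve(record, "TXT")
    # TXT record data arrives as tuples of byte strings; join and decode.
    return "".join(
        part.decode() for rdata in answer for part in rdata.strings
    )
```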

Verification Plan

Automated testing

  • Run basic pytest tests (if available) or dummy scripts that push forged payloads, ensuring the security layer rejects invalid HMACs and duplicate nonces; a test sketch follows below.
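
A sketch of such tests, written against the security-layer sketch above; the envelope field names remain assumptions.

```python
# test_security.py -- sketch against the security-layer sketch above;
# envelope field names remain assumptions.
import hashlib
import hmac
import json
import time

import security

PSK = b"test-psk"

def make_envelope(nonce: str) -> dict:
    data = {"interfaces": ["eth0"], "wg_peers": []}
    payload = json.dumps(data, sort_keys=True, separators=(",", ":"))
    return {
        "data": data,
        "nonce": nonce,
        "timestamp": time.time(),
        "hmac": hmac.new(PSK, payload.encode(), hashlib.sha256).hexdigest(),
    }

def test_rejects_forged_hmac():
    envelope = make_envelope("nonce-forged")
    envelope["hmac"] = "0" * 64  # forged signature
    assert not security.verify_report(envelope, [PSK])

def test_rejects_duplicate_nonce():
    envelope = make_envelope("nonce-replayed")
    assert security.verify_report(envelope, [PSK])
    assert not security.verify_report(envelope, [PSK])  # replayed nonce
```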

Manual Verification

  • Start the Flask app.
  • Hit /status/healthcheck with curl.
  • Send a mock JSON report containing wg_peers and interfaces, signed with exactly the PSK from the test .env, and check that kattila_manager.db correctly built the relational graph. A scripted version of this check follows.
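
A scripted version of this smoke test using requests; the host, port, envelope field names, and KATTILA_PSK environment variable are assumptions.

```python
# Manual smoke test sketch; host, port, envelope field names, and the
# KATTILA_PSK environment variable are assumptions.
import hashlib
import hmac
import json
import os
import time
import uuid

import requests

BASE = "http://127.0.0.1:5000"
PSK = os.environ["KATTILA_PSK"].encode()  # PSK from the test .env

print(requests.get(f"{BASE}/status/healthcheck").json())

data = {"interfaces": ["eth0", "wg0"], "wg_peers": [{"peer_agent_id": "abc123"}]}
payload = json.dumps(data, sort_keys=True, separators=(",", ":"))
envelope = {
    "agent_id": "deadbeef",
    "data": data,
    "nonce": uuid.uuid4().hex,
    "timestamp": time.time(),
    "hmac": hmac.new(PSK, payload.encode(), hashlib.sha256).hexdigest(),
}
print(requests.post(f"{BASE}/status/updates", json=envelope).status_code)
```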