Kattila Manager Implementation Plan
This document outlines the detailed architecture and implementation steps for the Python-based Kattila Manager.
Overview
The Manager is a Python/Flask application that maintains a centralized SQLite (WAL mode) database. It provides an HTTP API to receive pushed reports from the agents, securely verifies their HMAC-SHA256 signatures, prevents replay attacks using a nonce sliding window cache, and updates the local network topology and alarm states based on the received data.
Look at kattila.poc for ideas on how to implement IP address anonymization, and for tips on how the map could be drawn.
Proposed Architecture / Modules
1. Database Layer (db.py)
- Initializes an sqlite3 connection with PRAGMA journal_mode=WAL;.
- Automatically executes the CREATE TABLE and CREATE INDEX SQL schemas defined in the DESIGN document on startup.
- Exposes structured data access methods for other modules (e.g., upsert_agent, insert_report, update_interfaces, update_edges, create_alarm).
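As an illustration of the points above, here is a minimal sketch of what db.py could look like. The agents table shown is a hypothetical stand-in, since the real schema lives in the DESIGN document, and init_db is an assumed helper name; only upsert_agent is named in this plan. The upsert syntax assumes SQLite 3.24 or newer.

```python
import sqlite3
import time

# Hypothetical stand-in schema; the real CREATE TABLE / CREATE INDEX
# statements come from the DESIGN document.
SCHEMA = """
CREATE TABLE IF NOT EXISTS agents (
    agent_id     TEXT PRIMARY KEY,
    last_seen_at INTEGER NOT NULL
);
"""


def init_db(path):
    """Open the database, switch it to WAL mode, and apply the schema."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL;")
    conn.executescript(SCHEMA)
    return conn


def upsert_agent(conn, agent_id, now=None):
    """Insert the agent if unknown, otherwise refresh its heartbeat."""
    now = int(time.time()) if now is None else now
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO agents (agent_id, last_seen_at) VALUES (?, ?) "
            "ON CONFLICT(agent_id) DO UPDATE "
            "SET last_seen_at = excluded.last_seen_at",
            (agent_id, now),
        )
```

Note that WAL mode only takes effect for file-backed databases; an in-memory database reports its journal mode as memory instead.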
2. Security Layer (security.py)
- Key Fetching: A background thread or periodic polling function that utilizes Python's DNS resolver to get the Bootstrap PSK from the TXT record, keeping track of the current PSK and the two previous PSKs.
- HMAC Verification: Parses incoming JSON, re-serializes the data payload identically, and checks if the provided HMAC matches one of the known PSKs.
- Nonce Cache: Maintains a memory-bound cache (e.g., collections.OrderedDict) of the last 120 nonces to prevent replay attacks.
- Time Skew: Rejects reports whose timestamp deviates by more than 10 minutes from the Manager's local clock.
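The three checks above (HMAC, nonce, time skew) could be sketched roughly as follows. The canonical serialization (compact JSON with sorted keys), the hex-encoded digest, and all function and class names are assumptions for illustration; the agent and Manager must agree on the actual encoding defined in the DESIGN document. Key fetching over DNS is left out here.

```python
import hashlib
import hmac
import json
import time
from collections import OrderedDict

MAX_NONCES = 120         # window size from this plan
MAX_SKEW_SECONDS = 600   # 10 minutes, from this plan


def canonical_bytes(data):
    # Assumption: both sides sign compact JSON with sorted keys.
    return json.dumps(data, sort_keys=True, separators=(",", ":")).encode("utf-8")


def verify_hmac(data, provided_hex, psks):
    """Return True if the signature matches any known PSK (current or previous)."""
    if not isinstance(provided_hex, str):
        return False
    body = canonical_bytes(data)
    provided = provided_hex.encode("utf-8")
    for psk in psks:
        expected = hmac.new(psk.encode("utf-8"), body, hashlib.sha256).hexdigest()
        # compare_digest avoids leaking match position through timing.
        if hmac.compare_digest(expected.encode("ascii"), provided):
            return True
    return False


class NonceCache:
    """Remembers the most recent nonces; the oldest entry is evicted first."""

    def __init__(self, max_size=MAX_NONCES):
        self._seen = OrderedDict()
        self._max_size = max_size

    def check_and_add(self, nonce):
        """Return False for a replayed nonce, True (and record it) otherwise."""
        if nonce in self._seen:
            return False
        self._seen[nonce] = True
        if len(self._seen) > self._max_size:
            self._seen.popitem(last=False)  # drop the oldest nonce
        return True


def timestamp_ok(timestamp, now=None):
    """Accept reports within MAX_SKEW_SECONDS of the local clock."""
    now = time.time() if now is None else now
    return abs(now - timestamp) <= MAX_SKEW_SECONDS
```

One consequence of a fixed-size window: a nonce older than the last 120 can be replayed as far as the cache is concerned, so the time skew check is what bounds how long a captured report stays usable.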
3. Data Processor (processor.py)
This is the core business logic engine invoked whenever a valid /status/updates payload hits the API:
- Agents: Upsert the agent_id into the agents table and update the last_seen_at heartbeat.
- Reports: Store the raw envelope in reports for auditing.
- Interfaces: Compare the payload's interfaces against agent_interfaces. If new interfaces appear or old ones disappear, update the DB and potentially trigger an alarm (e.g., "Interface eth0 went down").
- Topology Edges: Iterate over wg_peers. For each peer, create or update a link in topology_edges specifying edge_type='wireguard'.
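The interface comparison and edge derivation could be sketched as pure functions, which keeps them easy to test before wiring them to the database. The function names, the alarm wording, and the shape of a wg_peers entry (a dict with a peer_id key) are hypothetical; the real payload layout is defined in the DESIGN document.

```python
def diff_interfaces(stored, reported):
    """Return (added, removed) interface names, each sorted for stable output."""
    stored_set, reported_set = set(stored), set(reported)
    added = sorted(reported_set - stored_set)
    removed = sorted(stored_set - reported_set)
    return added, removed


def alarms_for_removed(agent_id, removed):
    """Build one alarm message per interface that disappeared."""
    return [f"Interface {name} disappeared on agent {agent_id}" for name in removed]


def edges_from_peers(agent_id, wg_peers):
    """Build (source, target, edge_type) rows for topology_edges.

    Assumption: each peer is a dict carrying a 'peer_id' key.
    """
    return [(agent_id, peer["peer_id"], "wireguard") for peer in wg_peers]
```

For example, diff_interfaces(["eth0", "wg0"], ["wg0", "wg1"]) reports wg1 as added and eth0 as removed, and the removed list can then feed create_alarm.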
4. API Layer (api.py or app.py)
- A Flask Blueprint or App defining:
  - POST /status/updates: Main ingress. Parses JSON -> Verifies HMAC & Nonce -> Calls Processor -> Returns OK. Unwraps relay_path envelopes iteratively if needed.
  - POST /status/register: Allows new agents to announce their generated ID.
  - GET /status/healthcheck: Returns {status: ok}.
  - GET /status/alarms: JSON list of active alarms.
  - GET /status/agents: JSON dump of the fleet matrix.
  - POST /status/admin/reset: Clears specific agent topology state.
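The ordering of the /status/updates pipeline can be sketched independently of Flask, so that the route itself stays a thin wrapper passing the request body in and turning the result into a response. The handle_update name, the injected verify and process callables, and the 400/401 status codes are assumptions, not decisions taken in this plan; relay_path unwrapping is omitted.

```python
import json


def handle_update(raw_body, verify, process):
    """Run parse -> verify -> process and return (http_status, response_dict).

    verify(envelope) -> bool and process(envelope) -> None are injected so
    the security and processor modules can be swapped out in tests.
    """
    try:
        envelope = json.loads(raw_body)
    except (ValueError, TypeError):
        return 400, {"error": "invalid JSON"}
    if not isinstance(envelope, dict):
        return 400, {"error": "envelope must be a JSON object"}
    # Reject before touching the database if HMAC, nonce, or timestamp fail.
    if not verify(envelope):
        return 401, {"error": "verification failed"}
    process(envelope)
    return 200, {"status": "ok"}
```

Keeping verification strictly before processing means an unauthenticated request never reaches the database layer.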
User Review Required
Important
- Since Python's standard library has no convenient way to fetch DNS TXT records, I plan to add dnspython to requirements.txt. Is this acceptable?
- The agent successfully generates its own secure hexadecimal agent_id locally. Instead of the Manager strictly mandating /status/register before everything else, is it acceptable for the Manager to dynamically "auto-register" (upsert) unknown agent_ids directly when they push a valid /status/updates report? (It simplifies bootstrapping considerably.)
- When generating alarms, should we just log simple messages like "Interface X disappeared" and keep the alarm active until a human clears it, or should the alarms auto-dismiss when the issue resolves (e.g., interface comes back)?
Verification Plan
Automated testing
- Run basic pytest (if available) or dummy scripts pushing forged payloads and ensuring the security layer rejects invalid HMACs and duplicate Nonces.
Manual Verification
- Start the Flask app.
- Hit /status/healthcheck with curl.
- Send a mock successful JSON representation of the wg_peers and interfaces using exactly the PSK from the test .env. Check that kattila_manager.db correctly generated the relational graph.
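For the mock report step, a small helper along these lines could produce a signed body to send with curl. The envelope field names (nonce, timestamp, data, hmac), their placement, and the canonical JSON encoding are assumptions made for illustration; they must match whatever the agent and the DESIGN document actually specify, and the PSK must be the one from the test environment.

```python
import hashlib
import hmac
import json
import secrets
import time


def build_signed_report(psk, agent_id, interfaces, wg_peers):
    """Build a mock report envelope signed with HMAC-SHA256 over the data payload."""
    data = {
        "agent_id": agent_id,
        "interfaces": interfaces,
        "wg_peers": wg_peers,
    }
    # Assumption: the signature covers compact JSON with sorted keys.
    body = json.dumps(data, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return {
        "nonce": secrets.token_hex(16),   # fresh random nonce per report
        "timestamp": int(time.time()),
        "data": data,
        "hmac": hmac.new(psk.encode("utf-8"), body, hashlib.sha256).hexdigest(),
    }


if __name__ == "__main__":
    # Placeholder values; substitute the PSK from the test environment.
    report = build_signed_report("test-psk", "deadbeef", ["eth0", "wg0"], [])
    print(json.dumps(report))
```

The printed JSON can be saved to a file and posted to /status/updates; sending the same file twice should then exercise the nonce rejection path.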