kattila.status/manager/# Kattila Manager Implementation Plan.md
2026-04-17 20:15:24 +03:00

# Kattila Manager Implementation Plan
This document outlines the detailed architecture and implementation steps for the Python-based Kattila Manager.
## Overview
The Manager is a Python/Flask application that maintains a centralized SQLite (WAL mode) database. It provides an HTTP API to receive pushed reports from the agents, securely verifies their HMAC-SHA256 signatures, prevents replay attacks using a nonce sliding window cache, and updates the local network topology and alarm states based on the received data.
Look at kattila.poc for ideas on how to implement IP address anonymization, and for tips on how the map should be drawn.
## Proposed Architecture / Modules
### 1. Database Layer (`db.py`)
- Initializes an `sqlite3` connection with `PRAGMA journal_mode=WAL;`.
- Automatically executes the `CREATE TABLE` and `CREATE INDEX` SQL schemas defined in the DESIGN document on startup.
- Exposes structured data access methods for other modules (e.g., `upsert_agent`, `insert_report`, `update_interfaces`, `update_edges`, `create_alarm`).
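A minimal sketch of the database layer, assuming a simplified `agents` table (the real schema lives in the DESIGN document) and illustrating the WAL pragma, idempotent schema setup, and an `upsert_agent` helper:

```python
import sqlite3

# Hypothetical schema fragment for illustration; the authoritative
# CREATE TABLE / CREATE INDEX statements come from the DESIGN document.
SCHEMA = """
CREATE TABLE IF NOT EXISTS agents (
    agent_id     TEXT PRIMARY KEY,
    last_seen_at TEXT
);
CREATE INDEX IF NOT EXISTS idx_agents_seen ON agents(last_seen_at);
"""

def connect(path="kattila_manager.db"):
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL;")  # enable write-ahead logging
    conn.executescript(SCHEMA)                # safe to run on every startup
    return conn

def upsert_agent(conn, agent_id, seen_at):
    # Insert the agent, or refresh its heartbeat if it already exists.
    conn.execute(
        "INSERT INTO agents (agent_id, last_seen_at) VALUES (?, ?) "
        "ON CONFLICT(agent_id) DO UPDATE SET last_seen_at=excluded.last_seen_at",
        (agent_id, seen_at),
    )
    conn.commit()
```

The `ON CONFLICT ... DO UPDATE` form requires SQLite >= 3.24, which ships with all currently supported Python versions.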
### 2. Security Layer (`security.py`)
- **Key Fetching**: A background thread or periodic polling function that uses a DNS resolver to fetch the Bootstrap PSK from the TXT record, keeping track of the current PSK and the two previous ones.
- **HMAC Verification**: Parses incoming JSON, re-serializes the `data` payload identically, and checks if the provided HMAC matches one of the known PSKs.
- **Nonce Cache**: Maintains a memory-bound cache (e.g., `collections.OrderedDict`) of the last 120 nonces to prevent replay attacks.
- **Time Skew**: Rejects reports whose `timestamp` deviates by more than 10 minutes from the Manager's local clock.
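The verification, nonce, and skew checks above can be sketched as a single class. The envelope field names (`timestamp`, `nonce`, `data`, `hmac`) are assumptions for illustration; the canonical serialization must match whatever the agent signs:

```python
import hashlib
import hmac
import json
import time
from collections import OrderedDict

NONCE_CACHE_SIZE = 120
MAX_SKEW_SECONDS = 600  # 10 minutes

class Verifier:
    def __init__(self, psks):
        self.psks = list(psks)       # current PSK plus the two previous ones
        self.nonces = OrderedDict()  # memory-bound sliding replay window

    def _hmac_ok(self, data, signature):
        # Re-serialize the data payload deterministically before hashing,
        # so the digest matches the one the agent computed.
        canonical = json.dumps(data, sort_keys=True, separators=(",", ":")).encode()
        for psk in self.psks:
            digest = hmac.new(psk.encode(), canonical, hashlib.sha256).hexdigest()
            if hmac.compare_digest(digest, signature):
                return True
        return False

    def verify(self, envelope, now=None):
        now = time.time() if now is None else now
        if abs(now - envelope["timestamp"]) > MAX_SKEW_SECONDS:
            return False  # clock skew too large
        if envelope["nonce"] in self.nonces:
            return False  # replay attempt
        if not self._hmac_ok(envelope["data"], envelope["hmac"]):
            return False
        self.nonces[envelope["nonce"]] = True
        if len(self.nonces) > NONCE_CACHE_SIZE:
            self.nonces.popitem(last=False)  # evict the oldest nonce
        return True
```

`hmac.compare_digest` is used instead of `==` to avoid timing side channels.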
### 3. Data Processor (`processor.py`)
This is the core business logic engine invoked whenever a valid `/status/updates` payload hits the API:
- **Agents**: Upsert the `agent_id` into the `agents` table and update the `last_seen_at` heartbeat.
- **Reports**: Store the raw envelope in `reports` for auditing.
- **Interfaces**: Compare the payload's `interfaces` against `agent_interfaces`. If new interfaces appear or old ones disappear, update the DB and potentially trigger an alarm (e.g., "Interface eth0 went down").
- **Topology Edges**: Iterate over `wg_peers`. For each peer, create or update a link in `topology_edges` specifying `edge_type='wireguard'`.
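The interface comparison step reduces to a set difference. A minimal helper sketch (name and shape are assumptions, not the final API):

```python
def diff_interfaces(previous, current):
    """Compare stored interface names against a fresh payload.

    Returns (appeared, disappeared) so the caller can update
    `agent_interfaces` and raise alarms such as
    "Interface eth0 went down".
    """
    prev, curr = set(previous), set(current)
    return sorted(curr - prev), sorted(prev - curr)
```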
### 4. API Layer (`api.py` or `app.py`)
- A Flask Blueprint or App defining:
- `POST /status/updates`: Main ingress. Parses JSON -> Verifies HMAC & Nonce -> Calls Processor -> Returns OK. Unwraps `relay_path` envelopes iteratively if needed.
- `POST /status/register`: Allows new agents to announce their generated ID.
- `GET /status/healthcheck`: Returns `{status: ok}`.
- `GET /status/alarms`: JSON list of active alarms.
- `GET /status/agents`: JSON dump of the fleet matrix.
- `POST /status/admin/reset`: Clears specific agent topology state.
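A skeleton of the Flask Blueprint wiring for two of the routes. The `verify` and `process_report` stubs stand in for the security layer and data processor, and the `relay_path` unwrapping is a hypothetical shape pending the DESIGN document:

```python
from flask import Blueprint, Flask, jsonify, request

status = Blueprint("status", __name__, url_prefix="/status")

def verify(envelope):
    # Placeholder: delegate to the security layer (HMAC, nonce, skew).
    return bool(envelope.get("hmac"))

def process_report(data):
    # Placeholder: delegate to the data processor.
    pass

@status.route("/healthcheck", methods=["GET"])
def healthcheck():
    return jsonify(status="ok")

@status.route("/updates", methods=["POST"])
def updates():
    envelope = request.get_json(force=True)
    # Hypothetical iterative unwrap of nested relay_path envelopes;
    # the real structure comes from the DESIGN document.
    while "relay_path" in envelope:
        envelope = envelope["relay_path"]
    if not verify(envelope):
        return jsonify(error="rejected"), 403
    process_report(envelope.get("data"))
    return jsonify(status="ok")

app = Flask(__name__)
app.register_blueprint(status)
```

Registering the Blueprint with `url_prefix="/status"` keeps all routes under one namespace, so the remaining endpoints slot in the same way.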
## User Review Required
> [!IMPORTANT]
> - Since Python's standard library doesn't natively support fetching DNS TXT records, I plan to add `dnspython` to `requirements.txt`. Is this acceptable?
> - The agent successfully generates its own secure hexadecimal `agent_id` locally. Instead of the Manager strictly mandating `/status/register` before everything else, is it acceptable for the Manager to dynamically "auto-register" (upsert) unknown `agent_id`s directly when they push a valid `/status/updates` report? (It simplifies bootstrapping considerably).
> - When generating alarms, should we just log simple messages like "Interface X disappeared" and keep the alarm `active` until a human clears it, or should the alarms auto-dismiss when the issue resolves (e.g., interface comes back)?
## Verification Plan
### Automated testing
- Run basic `pytest` suites (if available) or dummy scripts that push forged payloads, verifying that the security layer rejects invalid HMACs and duplicate nonces.
### Manual Verification
- Start the Flask app.
- Hit `/status/healthcheck` with curl.
- Send a well-formed mock JSON report containing `wg_peers` and `interfaces`, signed with exactly the PSK from the test `.env`. Check that `kattila_manager.db` contains the expected relational graph.