kattila.status

A lightweight virtual network topology monitor for multi-layer, multi-network environments — WireGuard meshes, VPN overlays, and hybrid physical/virtual networks.

Follows a push-based Agent → Manager architecture. Agents run on each node, gather system and network telemetry, and push it to a central Manager. If the Manager is unreachable, agents relay reports through other agents on the same WireGuard subnet.


Architecture

┌───────────────────────────────────────────────────────┐
│  Agents (Go, Linux)                                   │
│                                                       │
│   Agent A ──── HTTP/JSON ─────────────────────┐       │
│   Agent B ──── relay → Agent A → Manager ─────┤       │
│   Agent C ──── relay → Agent B → Agent A ─────┘       │
└───────────────────────────────────────────────┬───────┘
                                                │
                                       ┌────────▼────────┐
                                       │  Manager        │
                                       │  (Python/Flask) │
                                       │  SQLite WAL DB  │
                                       └─────────────────┘

Each agent reports every 30 seconds. Reports are authenticated with HMAC-SHA256 using a fleet-wide Pre-Shared Key (PSK) fetched via a DNS TXT record. The relay mechanism supports up to 3 hops with loop detection.
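The signing step can be sketched in shell. This is an illustration only: the payload below is a placeholder, and the exact envelope layout (which fields are covered by the signature) is specified in DESIGN.md.

```shell
# Hypothetical sketch of report signing: HMAC-SHA256 over the JSON
# payload, keyed with the fleet PSK. Payload and PSK are examples.
PSK='your-secret-psk-value'
PAYLOAD='{"agent_id":"node-a","tick":42}'

# openssl prints "SHA2-256(stdin)= <hex>"; awk keeps the hex digest.
SIG=$(printf '%s' "$PAYLOAD" | openssl dgst -sha256 -hmac "$PSK" | awk '{print $NF}')
echo "$SIG"
```

The digest travels alongside the payload, and the manager recomputes it with its own copy of the PSK.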


Repository Structure

kattila.status/
├── agent/               # Go agent
│   ├── main.go          # Entry point + CLI flags
│   ├── config/          # .env / env var loading, AgentID persistence
│   ├── network/         # System data collection (interfaces, routes, WG peers)
│   ├── reporter/        # Report building, push to manager, relay logic
│   ├── security/        # PSK via DNS, HMAC signing, nonce generation
│   ├── api/             # Agent HTTP server (peer/relay/healthcheck endpoints)
│   ├── models/          # Shared data types (Report, SystemData, WGPeer, …)
│   └── bin/             # Compiled binaries (gitignored)
├── manager/             # Python manager
│   ├── app.py           # Flask app and API endpoints
│   ├── db.py            # SQLite schema, queries
│   ├── processor.py     # Report ingestion + topology inference
│   ├── security.py      # PSK history, HMAC verification, nonce/timestamp checks
│   └── requirements.txt
├── Makefile
├── .env                 # Local config (gitignored)
└── DESIGN.md            # Full architecture and protocol specification

Getting Started

Prerequisites

Component  Requirement
---------  -----------
Agent      Go 1.21+, Linux
Manager    Python 3.11+, pip
Both       A DNS TXT record for PSK distribution

1. Configuration

Copy or create a .env file in the repo root (it is gitignored):

DNS=kattila.example.com      # DNS TXT record holding the fleet PSK
MANAGER_URL=http://10.0.0.1:5086  # Agent: where to push reports

Both the agent and manager load this file automatically on startup. Environment variables override .env values.
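The precedence rule can be demonstrated with a few lines of POSIX shell. This is a stand-in for the real loader: the file path and contents below are examples only.

```shell
# Illustration of the precedence rule: a variable already exported in the
# environment wins over the value found in .env.
cat > /tmp/kattila-demo.env <<'EOF'
MANAGER_URL=http://10.0.0.1:5086
EOF

# Apply each .env entry only if the variable is not already set.
while IFS='=' read -r key val; do
  eval ": \"\${$key:=$val}\""
done < /tmp/kattila-demo.env

echo "$MANAGER_URL"
```

Running this with `MANAGER_URL` already exported leaves the exported value untouched; without it, the `.env` value is used.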

2. PSK Setup

The fleet PSK is discovered via a DNS TXT record. Set a TXT record on your domain:

kattila.example.com. 300 IN TXT "your-secret-psk-value"

Both the agent and manager must be able to resolve this record. During key rotation, the manager also accepts signatures made with the two previous PSKs, so agents that have not yet picked up the new record keep working while DNS propagates.
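You can check the record by hand with `dig +short TXT kattila.example.com`. Note that TXT answers arrive wrapped in double quotes, which must be stripped before the value is usable; a minimal sketch of that step, using a stand-in string instead of a live query:

```shell
# Stand-in for the output of: dig +short TXT kattila.example.com
TXT_ANSWER='"your-secret-psk-value"'

# TXT answers are quoted on the wire; strip the quotes to recover the PSK.
PSK=$(printf '%s' "$TXT_ANSWER" | tr -d '"')
echo "$PSK"
```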

3. Build the Agent

make build-agent

This cross-compiles for both amd64 and arm64:

agent/bin/agent-amd64
agent/bin/agent-arm64

Note: Requires Go in your $PATH. If Go is installed in a non-standard location (e.g. ~/.local/go/bin/go), run:

PATH="$HOME/.local/go/bin:$PATH" make build-agent

4. Run the Manager

make setup-manager   # Create venv and install dependencies (once)
make run-manager     # Start the Flask server on port 5086

5. Deploy the Agent

Copy the binary and .env to each node, then run:

./agent-amd64

The agent will generate and persist its agent_id.txt on first run.


Debug Tooling

The agent binary supports several CLI flags for diagnosing issues without running the full daemon:

-sysinfo

Collect and print all system telemetry as formatted JSON. Useful for verifying what the agent sees — interfaces, WireGuard peers, routes, load average:

./agent -sysinfo

-dump <file>

Run a single full data collection cycle, build a complete signed report payload (including HMAC, Nonce, AgentID), and write it to a file. This is the exact JSON that would be sent to the manager:

./agent -dump /tmp/report.json
cat /tmp/report.json

-discover

Actively probe all IPs from WireGuard AllowedIPs on port 5087 to find other live Kattila agents on the same mesh — the same discovery logic used by the relay mechanism:

./agent -discover
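The same probe can be done by hand with curl against the agent's healthcheck endpoint. The IP list below is illustrative; a real run would take the candidates from WireGuard AllowedIPs.

```shell
# Manual equivalent of -discover: probe candidate mesh IPs on the agent
# port and record which ones respond within the timeout.
LIVE=""
for ip in 10.0.0.2 10.0.0.3; do
  if curl -fsS --max-time 2 "http://$ip:5087/status/healthcheck" >/dev/null 2>&1; then
    LIVE="$LIVE $ip"
  fi
done
echo "live agents:${LIVE:- none}"
```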

Agent API

The agent exposes a small HTTP server on port 5087 for peer communication:

Endpoint             Method  Description
-------------------  ------  -----------
/status/healthcheck  GET     Agent liveness probe
/status/peer         GET     Returns local interface/route info (used by relay discovery)
/status/relay        POST    Accepts an enveloped report to forward toward the manager
/status/reset        POST    Wipes local state and generates a new agent_id

Manager API

The manager listens on port 5086:

Endpoint             Method  Description
-------------------  ------  -----------
/status/updates      POST    Receive periodic reports from agents
/status/register     POST    First-contact endpoint; issues an agent_id
/status/healthcheck  GET     Manager liveness probe
/status/agents       GET     List all known agents and their status
/status/alarms       GET     Fetch active network anomalies
/status/admin/reset  POST    Reset a specific agent or fleet state

Security Model

  • Authentication: HMAC-SHA256 over the data payload, signed with the fleet PSK.
  • Key distribution: PSK fetched from a DNS TXT record, refreshed hourly.
  • Key rotation: Manager accepts current + 2 previous PSKs to allow propagation time.
  • Replay protection: Monotonic tick counter + 120-entry nonce sliding window.
  • Clock skew: Maximum 10-minute allowance between agent and manager timestamps.
  • Relay loop detection: Agents check relay_path for their own agent_id and drop looping messages.
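The rotation-tolerant verification can be sketched in shell. All PSK values and the payload below are placeholders; the real manager implements this in security.py.

```shell
# Sketch of rotation-tolerant verification: accept a signature made with
# the current PSK or either of the two previous ones.
MSG='{"agent_id":"node-a","tick":42}'

# Simulate an agent that is still signing with the previous key.
SIG=$(printf '%s' "$MSG" | openssl dgst -sha256 -hmac "previous-psk-1" | awk '{print $NF}')

VERIFIED=no
for psk in current-psk previous-psk-1 previous-psk-2; do
  CANDIDATE=$(printf '%s' "$MSG" | openssl dgst -sha256 -hmac "$psk" | awk '{print $NF}')
  [ "$CANDIDATE" = "$SIG" ] && VERIFIED=yes
done
echo "verified: $VERIFIED"
```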

Makefile Reference

make build-agent     # Cross-compile agent for amd64 + arm64
make setup-manager   # Create Python venv and install dependencies
make run-manager     # Start the manager Flask server
make clean           # Remove built binaries, venv, and manager DB

Database

The manager uses a SQLite database (kattila_manager.db) with WAL mode. Key tables:

Table             Purpose
----------------  -------
agents            Fleet registry — presence, hostname, last seen
reports           Full report audit log
agent_interfaces  Network interface snapshots per agent
topology_edges    Inferred links between agents (WireGuard, relay, physical)
alarms            Event log for topology changes and anomalies

See DESIGN.md for the full schema.
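As a quick sanity check, the tables can be inspected directly with the sqlite3 CLI. The sketch below runs against a scratch database rather than the live one, and the column names are assumptions for illustration; DESIGN.md has the authoritative schema.

```shell
# Build a tiny stand-in for the agents table and query it the same way
# you would query kattila_manager.db. Column names are illustrative.
DB=/tmp/kattila-demo.db
rm -f "$DB"
sqlite3 "$DB" "CREATE TABLE agents (agent_id TEXT, hostname TEXT, last_seen TEXT);"
sqlite3 "$DB" "INSERT INTO agents VALUES ('node-a-id', 'node-a', '2026-04-18T21:00:00Z');"

# Most recently seen agent first.
LATEST=$(sqlite3 "$DB" "SELECT hostname FROM agents ORDER BY last_seen DESC LIMIT 1;")
echo "$LATEST"
```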