diff --git a/README.md b/README.md
new file mode 100644
index 0000000..26693cb
--- /dev/null
+++ b/README.md
@@ -0,0 +1,210 @@
+# kattila.status
+
+A lightweight virtual network topology monitor for multi-layer, multi-network environments: WireGuard meshes, VPN overlays, and hybrid physical/virtual networks.
+
+Follows a **push-based Agent → Manager** architecture. Agents run on each node, gather system and network telemetry, and push it to a central Manager. If the Manager is unreachable, agents relay reports through other agents on the same WireGuard subnet.
+
+---
+
+## Architecture
+
+```
+┌─────────────────────────────────────────────────────────┐
+│                    Agents (Go, Linux)                   │
+│                                                         │
+│  Agent A ──── HTTP/JSON ──────────────────────┐         │
+│  Agent B ──── relay → Agent A → Manager ─────┤          │
+│  Agent C ──── relay → Agent B → Agent A ──────┘         │
+└──────────────────────────────────────┬──────────────────┘
+                                       │
+                              ┌────────▼────────┐
+                              │     Manager     │
+                              │ (Python/Flask)  │
+                              │  SQLite WAL DB  │
+                              └─────────────────┘
+```
+
+Each agent reports every **30 seconds**. Reports are authenticated with **HMAC-SHA256** using a fleet-wide Pre-Shared Key (PSK) fetched via a DNS TXT record. The relay mechanism supports up to **3 hops** with loop detection.
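The hop limit and loop detection mentioned above can be sketched as follows. This is a minimal Python illustration, not the agent's actual Go code: the names `should_forward` and `forward` are hypothetical, and it assumes the relayed envelope carries a `relay_path` list holding only the IDs of agents that have already relayed the report, so its length equals the hop count.

```python
MAX_HOPS = 3  # the README states relays support up to 3 hops


def should_forward(envelope: dict, own_agent_id: str) -> bool:
    """Decide whether a relayed report may travel one more hop.

    Assumption: envelope["relay_path"] lists the agents that have
    already relayed this report, in order.
    """
    relay_path = envelope.get("relay_path", [])
    if own_agent_id in relay_path:
        return False  # loop: this agent has already handled the report
    if len(relay_path) >= MAX_HOPS:
        return False  # hop budget exhausted
    return True


def forward(envelope: dict, own_agent_id: str) -> dict:
    """Return a copy of the envelope with this agent appended to relay_path."""
    path = [*envelope.get("relay_path", []), own_agent_id]
    return {**envelope, "relay_path": path}
```

An agent would call `should_forward` before `forward`, dropping the report when it returns `False`. Copying the envelope rather than mutating it keeps the received message intact for logging.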
+
+---
+
+## Repository Structure
+
+```
+kattila.status/
+├── agent/                  # Go agent
+│   ├── main.go             # Entry point + CLI flags
+│   ├── config/             # .env / env var loading, AgentID persistence
+│   ├── network/            # System data collection (interfaces, routes, WG peers)
+│   ├── reporter/           # Report building, push to manager, relay logic
+│   ├── security/           # PSK via DNS, HMAC signing, nonce generation
+│   ├── api/                # Agent HTTP server (peer/relay/healthcheck endpoints)
+│   ├── models/             # Shared data types (Report, SystemData, WGPeer, …)
+│   └── bin/                # Compiled binaries (gitignored)
+├── manager/                # Python manager
+│   ├── app.py              # Flask app and API endpoints
+│   ├── db.py               # SQLite schema, queries
+│   ├── processor.py        # Report ingestion + topology inference
+│   ├── security.py         # PSK history, HMAC verification, nonce/timestamp checks
+│   └── requirements.txt
+├── Makefile
+├── .env                    # Local config (gitignored)
+└── DESIGN.md               # Full architecture and protocol specification
+```
+
+---
+
+## Getting Started
+
+### Prerequisites
+
+| Component | Requirement |
+|-----------|-------------|
+| Agent | Go 1.21+, Linux |
+| Manager | Python 3.11+, pip |
+| Both | A DNS TXT record for PSK distribution |
+
+### 1. Configuration
+
+Copy or create a `.env` file in the repo root (it is gitignored):
+
+```env
+DNS=kattila.example.com             # DNS TXT record holding the fleet PSK
+MANAGER_URL=http://10.0.0.1:5086    # Agent: where to push reports
+```
+
+Both the agent and manager load this file automatically on startup. Environment variables override `.env` values.
+
+### 2. PSK Setup
+
+The fleet PSK is discovered via a **DNS TXT record**. Set a TXT record on your domain:
+
+```
+kattila.example.com. 300 IN TXT "your-secret-psk-value"
+```
+
+Both the agent and manager must be able to resolve this record. The manager retries verification against the **current + 2 previous** PSKs to handle propagation delays during key rotation.
+
+### 3. Build the Agent
+
+```bash
+make build-agent
+```
+
+This cross-compiles for both `amd64` and `arm64`:
+
+```
+agent/bin/agent-amd64
+agent/bin/agent-arm64
+```
+
+> **Note**: Requires Go in your `$PATH`. If installed to a non-standard location (e.g. `~/.local/go/bin/go`), run: `PATH="$HOME/.local/go/bin:$PATH" make build-agent`
+
+### 4. Run the Manager
+
+```bash
+make setup-manager   # Create venv and install dependencies (once)
+make run-manager     # Start the Flask server on port 5086
+```
+
+### 5. Deploy the Agent
+
+Copy the binary and `.env` to each node, then run:
+
+```bash
+./agent-amd64
+```
+
+The agent will generate and persist its `agent_id.txt` on first run.
+
+---
+
+## Debug Tooling
+
+The agent binary supports several CLI flags for diagnosing issues without running the full daemon:
+
+### `-sysinfo`
+Collect and print all system telemetry as formatted JSON. Useful for verifying what the agent sees (interfaces, WireGuard peers, routes, load average):
+
+```bash
+./agent -sysinfo
+```
+
+### `-dump `
+Run a single full data collection cycle, build a complete signed report payload (including HMAC, Nonce, AgentID), and write it to a file. This is the exact JSON that would be sent to the manager:
+
+```bash
+./agent -dump /tmp/report.json
+cat /tmp/report.json
+```
+
+### `-discover`
+Actively probe all IPs from WireGuard `AllowedIPs` on port 5087 to find other live Kattila agents on the same mesh (the same discovery logic used by the relay mechanism):
+
+```bash
+./agent -discover
+```
+
+---
+
+## Agent API
+
+The agent exposes a small HTTP server on port **5087** for peer communication:
+
+| Endpoint | Method | Description |
+|----------|--------|-------------|
+| `/status/healthcheck` | GET | Agent liveness probe |
+| `/status/peer` | GET | Returns local interface/route info (used by relay discovery) |
+| `/status/relay` | POST | Accepts an enveloped report to forward toward the manager |
+| `/status/reset` | POST | Wipes local state and generates a new `agent_id` |
+
+## Manager API
+
+The manager listens on port **5086**:
+
+| Endpoint | Method | Description |
+|----------|--------|-------------|
+| `/status/updates` | POST | Receive periodic reports from agents |
+| `/status/register` | POST | First-contact endpoint; issues an `agent_id` |
+| `/status/healthcheck` | GET | Manager liveness probe |
+| `/status/agents` | GET | List all known agents and their status |
+| `/status/alarms` | GET | Fetch active network anomalies |
+| `/status/admin/reset` | POST | Reset a specific agent or fleet state |
+
+---
+
+## Security Model
+
+- **Authentication**: HMAC-SHA256 over the `data` payload, signed with the fleet PSK.
+- **Key distribution**: PSK fetched from a DNS TXT record, refreshed hourly.
+- **Key rotation**: Manager accepts current + 2 previous PSKs to allow propagation time.
+- **Replay protection**: Monotonic tick counter + 120-entry nonce sliding window.
+- **Clock skew**: Maximum 10-minute allowance between agent and manager timestamps.
+- **Relay loop detection**: Agents check `relay_path` for their own `agent_id` and drop looping messages.
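The authentication, key-rotation, replay, and clock-skew checks listed above could fit together roughly as sketched below. This is an illustration under stated assumptions, not the contents of `manager/security.py`: the field names `timestamp`, `nonce`, and `hmac`, the `Verifier` class, and the canonical JSON encoding used for signing are all hypothetical, and the monotonic tick counter is omitted for brevity.

```python
import hashlib
import hmac
import json
import time
from collections import deque

MAX_SKEW_SECONDS = 600  # 10-minute clock-skew allowance
NONCE_WINDOW = 120      # size of the nonce sliding window


def sign(data: dict, psk: str) -> str:
    """HMAC-SHA256 over the data payload (canonical JSON is an assumption)."""
    body = json.dumps(data, sort_keys=True, separators=(",", ":")).encode()
    return hmac.new(psk.encode(), body, hashlib.sha256).hexdigest()


class Verifier:
    def __init__(self, psk_history):
        # psk_history: current PSK first, followed by older ones;
        # only the current key and two previous keys are accepted.
        self.psk_history = list(psk_history)[:3]
        # deque(maxlen=...) discards the oldest nonce once the window is full.
        self.seen_nonces = deque(maxlen=NONCE_WINDOW)

    def verify(self, report: dict, now=None) -> bool:
        now = time.time() if now is None else now
        if abs(now - report["timestamp"]) > MAX_SKEW_SECONDS:
            return False  # timestamp outside the allowed skew
        if report["nonce"] in self.seen_nonces:
            return False  # replayed nonce
        expected = report["hmac"]
        if not any(
            hmac.compare_digest(sign(report["data"], psk), expected)
            for psk in self.psk_history
        ):
            return False  # no accepted PSK produced this signature
        self.seen_nonces.append(report["nonce"])
        return True
```

The nonce is recorded only after the signature check passes, so unauthenticated traffic cannot fill the window, and `hmac.compare_digest` keeps the comparison constant-time.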
+
+---
+
+## Makefile Reference
+
+```bash
+make build-agent     # Cross-compile agent for amd64 + arm64
+make setup-manager   # Create Python venv and install dependencies
+make run-manager     # Start the manager Flask server
+make clean           # Remove built binaries, venv, and manager DB
+```
+
+---
+
+## Database
+
+The manager uses a SQLite database (`kattila_manager.db`) with WAL mode. Key tables:
+
+| Table | Purpose |
+|-------|---------|
+| `agents` | Fleet registry: presence, hostname, last seen |
+| `reports` | Full report audit log |
+| `agent_interfaces` | Network interface snapshots per agent |
+| `topology_edges` | Inferred links between agents (WireGuard, relay, physical) |
+| `alarms` | Event log for topology changes and anomalies |
+
+See [`DESIGN.md`](DESIGN.md) for the full schema.
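For inspecting the manager database by hand, a small sketch like the one below may be useful. It relies only on Python's standard `sqlite3` module; the helper names `open_manager_db` and `list_tables` are hypothetical and do not come from `manager/db.py`, and the table layout is deliberately left to `DESIGN.md`.

```python
import sqlite3


def open_manager_db(path: str = "kattila_manager.db") -> sqlite3.Connection:
    """Open the database and request WAL journaling.

    WAL only takes effect for on-disk databases; an in-memory
    database keeps its own journal mode.
    """
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL")
    return conn


def list_tables(conn: sqlite3.Connection) -> list:
    """Return the names of all tables, sorted alphabetically."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]
```

Calling `list_tables(open_manager_db())` against a populated database should show the tables listed above, which is a quick way to confirm the manager has initialized its schema.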