# kattila.status

A lightweight virtual network topology monitor for multi-layer, multi-network environments: WireGuard meshes, VPN overlays, and hybrid physical/virtual networks.

It follows a **push-based Agent → Manager** architecture. Agents run on each node, gather system and network telemetry, and push it to a central Manager. If the Manager is unreachable, an agent relays its reports through other agents on the same WireGuard subnet.

---

## Architecture

```
┌─────────────────────────────────────────────────────────┐
│                   Agents (Go, Linux)                    │
│                                                         │
│  Agent A ──── HTTP/JSON ──────────────────────┐         │
│  Agent B ──── relay → Agent A → Manager ──────┤         │
│  Agent C ──── relay → Agent B → Agent A ──────┘         │
└──────────────────────────────────────┬──────────────────┘
                                       │
                              ┌────────▼────────┐
                              │     Manager     │
                              │ (Python/Flask)  │
                              │  SQLite WAL DB  │
                              └─────────────────┘
```

Each agent reports every **30 seconds**. Reports are authenticated with **HMAC-SHA256** using a fleet-wide Pre-Shared Key (PSK) fetched via a DNS TXT record. The relay mechanism supports up to **3 hops** with loop detection.
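
The hop limit and loop check can be pictured with a short sketch. It is written in Python for brevity even though the agent itself is Go. The `relay_path` field name comes from the Security Model section; `MAX_HOPS`, `should_forward`, `forward`, and the envelope layout are assumptions made for illustration, not the project's actual code.

```python
MAX_HOPS = 3  # relay limit stated in this README


def should_forward(envelope: dict, my_agent_id: str) -> bool:
    """Decide whether this agent may relay the envelope onward.

    Assumes (hypothetically) that `relay_path` lists the IDs of the agents
    that have already relayed the report, in order.
    """
    relay_path = envelope.get("relay_path", [])
    if my_agent_id in relay_path:
        return False  # loop: this agent already handled the report
    if len(relay_path) >= MAX_HOPS:
        return False  # hop budget exhausted
    return True


def forward(envelope: dict, my_agent_id: str) -> dict:
    """Return a copy of the envelope with this agent appended to the path."""
    path = [*envelope.get("relay_path", []), my_agent_id]
    return {**envelope, "relay_path": path}
```

How hops are counted (for example, whether the originating agent is part of the path) is a guess here; `DESIGN.md` is the authority on the real envelope format.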

---

## Repository Structure

```
kattila.status/
├── agent/                  # Go agent
│   ├── main.go             # Entry point + CLI flags
│   ├── config/             # .env / env var loading, AgentID persistence
│   ├── network/            # System data collection (interfaces, routes, WG peers)
│   ├── reporter/           # Report building, push to manager, relay logic
│   ├── security/           # PSK via DNS, HMAC signing, nonce generation
│   ├── api/                # Agent HTTP server (peer/relay/healthcheck endpoints)
│   ├── models/             # Shared data types (Report, SystemData, WGPeer, …)
│   └── bin/                # Compiled binaries (gitignored)
├── manager/                # Python manager
│   ├── app.py              # Flask app and API endpoints
│   ├── db.py               # SQLite schema, queries
│   ├── processor.py        # Report ingestion + topology inference
│   ├── security.py         # PSK history, HMAC verification, nonce/timestamp checks
│   └── requirements.txt
├── Makefile
├── .env                    # Local config (gitignored)
└── DESIGN.md               # Full architecture and protocol specification
```

---

## Getting Started

### Prerequisites

| Component | Requirement |
|-----------|-------------|
| Agent | Go 1.21+, Linux |
| Manager | Python 3.11+, pip |
| Both | A DNS TXT record for PSK distribution |

### 1. Configuration

Copy or create a `.env` file in the repo root (it is gitignored):

```env
DNS=kattila.example.com           # DNS TXT record holding the fleet PSK
MANAGER_URL=http://10.0.0.1:5086  # Agent: where to push reports
```

Both the agent and manager load this file automatically on startup. Environment variables override `.env` values.
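
As a rough illustration of that precedence rule, the sketch below parses a `.env` file and then lets the process environment win. The function names and the handling of inline `#` comments are assumptions for this example; the actual loaders in `agent/config/` and the manager may behave differently.

```python
import os


def load_env_file(path: str = ".env") -> dict[str, str]:
    """Parse KEY=VALUE lines; text after `#` is treated as a comment (assumed)."""
    values: dict[str, str] = {}
    try:
        with open(path, encoding="utf-8") as fh:
            for raw in fh:
                line = raw.split("#", 1)[0].strip()  # drop inline comments
                if not line or "=" not in line:
                    continue
                key, value = line.split("=", 1)
                values[key.strip()] = value.strip()
    except FileNotFoundError:
        pass  # a missing .env is fine; rely on the environment alone
    return values


def load_config(path: str = ".env") -> dict[str, str]:
    """Merge file values with os.environ, letting the environment win."""
    config = load_env_file(path)
    for key in ("DNS", "MANAGER_URL"):
        if key in os.environ:
            config[key] = os.environ[key]
    return config
```

Splitting on `#` is a simplification: a value that legitimately contains `#` (such as a URL fragment) would be truncated by this sketch.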

### 2. PSK Setup

The fleet PSK is discovered via a **DNS TXT record**. Set a TXT record on your domain:

```
kattila.example.com. 300 IN TXT "your-secret-psk-value"
```

Both the agent and manager must be able to resolve this record. The manager retries verification against the **current + 2 previous** PSKs to handle propagation delays during key rotation.
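
A minimal sketch of the "current + 2 previous" rule follows, using only the Python standard library. The class name, the canonical JSON encoding, and the hex signature format are assumptions for illustration; the real logic lives in `manager/security.py` and the wire format is specified in `DESIGN.md`.

```python
import hashlib
import hmac
import json
from collections import deque


def canonical_bytes(data: dict) -> bytes:
    # Assumption: sorted-key compact JSON stands in for the real encoding.
    return json.dumps(data, sort_keys=True, separators=(",", ":")).encode()


def sign(psk: str, message: bytes) -> str:
    """HMAC-SHA256 of the message, hex-encoded."""
    return hmac.new(psk.encode(), message, hashlib.sha256).hexdigest()


class PSKHistory:
    """Keep the current PSK plus the two before it (three keys in total)."""

    def __init__(self) -> None:
        self._keys = deque(maxlen=3)

    def update(self, psk: str) -> None:
        """Record a freshly resolved PSK; repeats of the current key are ignored."""
        if not self._keys or self._keys[0] != psk:
            self._keys.appendleft(psk)  # the oldest key falls off the far end

    def verify(self, data: dict, signature_hex: str) -> bool:
        """Return True if any retained PSK produces the given signature."""
        message = canonical_bytes(data)
        return any(
            hmac.compare_digest(sign(psk, message), signature_hex)
            for psk in self._keys
        )
```

`hmac.compare_digest` is used instead of `==` so that the comparison time does not leak how many leading characters of a forged signature were correct.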

### 3. Build the Agent

```bash
make build-agent
```

This cross-compiles for both `amd64` and `arm64`:

```
agent/bin/agent-amd64
agent/bin/agent-arm64
```

> **Note**: Requires Go in your `$PATH`. If installed to a non-standard location (e.g. `~/.local/go/bin/go`), run: `PATH="$HOME/.local/go/bin:$PATH" make build-agent`

### 4. Run the Manager

```bash
make setup-manager   # Create venv and install dependencies (once)
make run-manager     # Start the Flask server on port 5086
```

### 5. Deploy the Agent

Copy the binary and `.env` to each node, then run:

```bash
./agent-amd64
```

The agent will generate and persist its `agent_id.txt` on first run.

---

## Debug Tooling

The agent binary supports several CLI flags for diagnosing issues without running the full daemon:

### `-sysinfo`
Collect and print all system telemetry as formatted JSON. Useful for verifying what the agent sees, including interfaces, WireGuard peers, routes, and load average:

```bash
./agent -sysinfo
```

### `-dump <file>`
Run a single full data collection cycle, build a complete signed report payload (including HMAC, Nonce, AgentID), and write it to a file. This is the exact JSON that would be sent to the manager:

```bash
./agent -dump /tmp/report.json
cat /tmp/report.json
```

### `-discover`
Actively probe all IPs from WireGuard `AllowedIPs` on port 5087 to find other live Kattila agents on the same mesh. This is the same discovery logic used by the relay mechanism:

```bash
./agent -discover
```
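
To make the discovery step more concrete, here is a hedged Python sketch of how a list of `AllowedIPs` entries could be expanded into probe targets. The agent does this in Go; the function name, the use of `/status/peer` as the probe path, the IPv4-only restriction, and the `max_hosts` cap are all assumptions of this example rather than documented behavior.

```python
import ipaddress

AGENT_PORT = 5087  # agent HTTP port from this README


def candidate_urls(allowed_ips: list[str], max_hosts: int = 256) -> list[str]:
    """Expand AllowedIPs CIDRs into peer URLs worth probing.

    Networks with more than `max_hosts` addresses are skipped, so that an
    entry such as 0.0.0.0/0 cannot trigger a scan of the whole address space.
    IPv6 entries are skipped to keep the URL formatting simple.
    """
    urls: list[str] = []
    for cidr in allowed_ips:
        net = ipaddress.ip_network(cidr, strict=False)
        if net.version != 4 or net.num_addresses > max_hosts:
            continue
        if net.num_addresses == 1:
            hosts = [net.network_address]  # a /32 names a single peer
        else:
            hosts = list(net.hosts())      # usable addresses in the subnet
        urls.extend(f"http://{host}:{AGENT_PORT}/status/peer" for host in hosts)
    return urls
```

Each returned URL would then be requested with a short timeout, and any address that answers is treated as a live agent.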

---

## Agent API

The agent exposes a small HTTP server on port **5087** for peer communication:

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/status/healthcheck` | GET | Agent liveness probe |
| `/status/peer` | GET | Returns local interface/route info (used by relay discovery) |
| `/status/relay` | POST | Accepts an enveloped report to forward toward the manager |
| `/status/reset` | POST | Wipes local state and generates a new `agent_id` |

## Manager API

The manager listens on port **5086**:

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/status/updates` | POST | Receive periodic reports from agents |
| `/status/register` | POST | First-contact endpoint; issues an `agent_id` |
| `/status/healthcheck` | GET | Manager liveness probe |
| `/status/agents` | GET | List all known agents and their status |
| `/status/alarms` | GET | Fetch active network anomalies |
| `/status/admin/reset` | POST | Reset a specific agent or fleet state |

---

## Security Model

- **Authentication**: HMAC-SHA256 over the `data` payload, signed with the fleet PSK.
- **Key distribution**: PSK fetched from a DNS TXT record, refreshed hourly.
- **Key rotation**: Manager accepts current + 2 previous PSKs to allow propagation time.
- **Replay protection**: Monotonic tick counter + 120-entry nonce sliding window.
- **Clock skew**: Maximum 10-minute allowance between agent and manager timestamps.
- **Relay loop detection**: Agents check `relay_path` for their own `agent_id` and drop looping messages.
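
The replay and clock-skew rules above can be combined into one acceptance check. The following is a sketch under stated assumptions: the class name, the per-agent state layout, and the order of the checks are hypothetical, and only the 120-entry window and the 10-minute allowance are taken from this README.

```python
import time
from collections import deque
from typing import Optional

NONCE_WINDOW = 120          # entries, per this README
MAX_SKEW_SECONDS = 10 * 60  # 10-minute allowance, per this README


class ReplayGuard:
    """Replay state for one agent (illustrative only)."""

    def __init__(self) -> None:
        self.last_tick = -1
        self.nonces = deque(maxlen=NONCE_WINDOW)

    def accept(self, tick: int, nonce: str, timestamp: float,
               now: Optional[float] = None) -> bool:
        """Return True and record the report if it passes every check."""
        now = time.time() if now is None else now
        if abs(now - timestamp) > MAX_SKEW_SECONDS:
            return False  # agent and manager clocks disagree too much
        if tick <= self.last_tick:
            return False  # tick counter must strictly increase
        if nonce in self.nonces:
            return False  # nonce already seen inside the window
        self.last_tick = tick
        self.nonces.append(nonce)  # the oldest nonce is evicted automatically
        return True
```

Rejected reports leave the state untouched, so a forged or stale message cannot advance the tick counter or push legitimate nonces out of the window.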

---

## Makefile Reference

```bash
make build-agent     # Cross-compile agent for amd64 + arm64
make setup-manager   # Create Python venv and install dependencies
make run-manager     # Start the manager Flask server
make clean           # Remove built binaries, venv, and manager DB
```

---

## Database

The manager uses a SQLite database (`kattila_manager.db`) with WAL mode. Key tables:

| Table | Purpose |
|-------|---------|
| `agents` | Fleet registry: presence, hostname, last seen |
| `reports` | Full report audit log |
| `agent_interfaces` | Network interface snapshots per agent |
| `topology_edges` | Inferred links between agents (WireGuard, relay, physical) |
| `alarms` | Event log for topology changes and anomalies |

See [`DESIGN.md`](DESIGN.md) for the full schema.
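
For readers unfamiliar with WAL mode, the sketch below shows how a SQLite file can be opened with it from Python and how an `agents` row might be upserted. The column names are invented for this example and are not the real schema, which is defined in `DESIGN.md` and `manager/db.py`.

```python
import sqlite3


def open_db(path: str = "kattila_manager.db") -> sqlite3.Connection:
    """Open the database, switch it to WAL, and ensure a demo table exists."""
    conn = sqlite3.connect(path)
    # WAL lets readers proceed while a writer is active.
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS agents (
            agent_id  TEXT PRIMARY KEY,   -- hypothetical columns
            hostname  TEXT,
            last_seen INTEGER             -- Unix timestamp
        )
        """
    )
    return conn


def touch_agent(conn: sqlite3.Connection, agent_id: str,
                hostname: str, now: int) -> None:
    """Insert the agent or refresh its hostname and last-seen time."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO agents (agent_id, hostname, last_seen) "
            "VALUES (?, ?, ?) "
            "ON CONFLICT(agent_id) DO UPDATE SET "
            "hostname = excluded.hostname, last_seen = excluded.last_seen",
            (agent_id, hostname, now),
        )
```

WAL mode is a property of the database file, so it only needs to be set once, but issuing the pragma on every open is harmless.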