Makefile and agent start.
239
DESIGN.md
Normal file
@@ -0,0 +1,239 @@
# Kattila.status - Design Specification

Kattila.status is a virtual network topology monitor designed for multi-layer, multi-network environments (VPN meshes, WireGuard, OpenVPN). It follows an **Agent-Manager** architecture with a pure push-based messaging model.

## Architecture Overview

```mermaid
graph TD
    subgraph "Agents (Debian/Linux)"
        A1[Agent 1]
        A2[Agent 2]
        A3[Agent 3]
    end

    subgraph "Manager (Python/Flask)"
        M[Manager API]
        DB[(SQLite WAL)]
        UI[Web UI / Vis-network]
    end

    A1 -->|HTTP/JSON| M
    A2 -->|Relay| A1
    A3 -->|Relay| A2
    M <--> DB
    UI <--> M
```

### API Endpoints

#### Agent API (Listen: 5087)
| Endpoint | Method | Description |
| :--- | :--- | :--- |
| `/status/healthcheck` | GET | Returns simple health status. |
| `/status/reset` | POST | Wipes local SQLite state and triggers re-registration. |
| `/status/peer` | GET | Returns local interface/route info (for relay peers). |
| `/status/relay` | POST | Accepts an enveloped report for forwarding to the Manager. |

#### Manager API (Listen: 5086)
| Endpoint | Method | Description |
| :--- | :--- | :--- |
| `/status/updates` | POST | Receives periodic reports from agents. |
| `/status/register` | POST | First contact; issues a unique `agent_id`. |
| `/status/healthcheck` | GET | Manager heartbeat check. |
| `/status/alarms` | GET | Fetches active network anomalies. |
| `/status/agents` | GET | Lists all known agents and their status. |
| `/status/admin/reset` | POST | Resets specific agent or fleet state. |

---

## Data Model

### Manager DB (`kattila_manager.db`)

#### `agents`
Tracks the fleet registry and presence.
```sql
CREATE TABLE agents (
    agent_id TEXT PRIMARY KEY,
    hostname TEXT NOT NULL,
    agent_version INTEGER NOT NULL,
    fleet_id TEXT NOT NULL,
    registered_at INTEGER NOT NULL,
    last_seen_at INTEGER NOT NULL,
    last_tick INTEGER NOT NULL DEFAULT 0,
    status TEXT NOT NULL DEFAULT 'online' -- online, offline, warning
);
CREATE INDEX idx_agents_last_seen ON agents(last_seen_at);
```

#### `reports`
Auditing and replay protection.
```sql
CREATE TABLE reports (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    agent_id TEXT NOT NULL,
    tick INTEGER NOT NULL,
    timestamp INTEGER NOT NULL,
    report_type TEXT NOT NULL, -- 'report', 'relay', 'register'
    report_json TEXT NOT NULL,
    received_at INTEGER NOT NULL DEFAULT (strftime('%s', 'now')),
    FOREIGN KEY (agent_id) REFERENCES agents(agent_id) ON DELETE CASCADE
);
CREATE UNIQUE INDEX idx_reports_agent_tick ON reports(agent_id, tick);
```
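
The unique index on `(agent_id, tick)` means SQLite itself can reject a replayed tick. The following is a minimal sketch using Python's standard `sqlite3` module and an in-memory database. The helper name `store_report` and the trimmed-down `agents` table are illustrative assumptions, not part of the design.

```python
import json
import sqlite3
import time

SCHEMA = """
CREATE TABLE agents (agent_id TEXT PRIMARY KEY);  -- trimmed for the example
CREATE TABLE reports (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    agent_id TEXT NOT NULL,
    tick INTEGER NOT NULL,
    timestamp INTEGER NOT NULL,
    report_type TEXT NOT NULL,
    report_json TEXT NOT NULL,
    received_at INTEGER NOT NULL DEFAULT (strftime('%s', 'now')),
    FOREIGN KEY (agent_id) REFERENCES agents(agent_id) ON DELETE CASCADE
);
CREATE UNIQUE INDEX idx_reports_agent_tick ON reports(agent_id, tick);
"""


def store_report(conn, report):
    """Insert a report; return False if SQLite rejects it (hypothetical helper)."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO reports"
                " (agent_id, tick, timestamp, report_type, report_json)"
                " VALUES (?, ?, ?, ?, ?)",
                (report["agent_id"], report["tick"], report["timestamp"],
                 report["type"], json.dumps(report)),
            )
        return True
    except sqlite3.IntegrityError:
        # Duplicate (agent_id, tick) or unknown agent_id.
        return False


conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # FK enforcement is off by default
conn.executescript(SCHEMA)
conn.execute("INSERT INTO agents (agent_id) VALUES ('agent-7f3a9b2c1d')")
conn.commit()

report = {"agent_id": "agent-7f3a9b2c1d", "tick": 42, "type": "report",
          "timestamp": int(time.time())}
first = store_report(conn, report)     # stored
replayed = store_report(conn, report)  # same tick again: rejected
stored_count = conn.execute("SELECT COUNT(*) FROM reports").fetchone()[0]
```

Relying on the index keeps the replay check atomic with the insert, so no separate lookup is needed before writing.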

#### `topology_edges`
Inferred links between agents.
```sql
CREATE TABLE topology_edges (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    from_agent_id TEXT NOT NULL,
    to_agent_id TEXT NOT NULL,
    edge_type TEXT NOT NULL, -- 'wireguard', 'openvpn', 'physical', 'relay'
    metadata TEXT DEFAULT '{}', -- JSON for pubkeys, RTT, etc.
    last_seen INTEGER NOT NULL,
    is_active INTEGER NOT NULL DEFAULT 1
);
CREATE UNIQUE INDEX idx_edges_pair ON topology_edges(from_agent_id, to_agent_id, edge_type);
```
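
Because `idx_edges_pair` is unique per `(from_agent_id, to_agent_id, edge_type)`, an edge can be refreshed with a single upsert rather than a select followed by an insert or update. This sketch assumes SQLite 3.24 or newer (for `ON CONFLICT ... DO UPDATE`); `upsert_edge` and the `rtt_ms` metadata key are hypothetical names used only for illustration.

```python
import json
import sqlite3

SCHEMA = """
CREATE TABLE topology_edges (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    from_agent_id TEXT NOT NULL,
    to_agent_id TEXT NOT NULL,
    edge_type TEXT NOT NULL,
    metadata TEXT DEFAULT '{}',
    last_seen INTEGER NOT NULL,
    is_active INTEGER NOT NULL DEFAULT 1
);
CREATE UNIQUE INDEX idx_edges_pair
    ON topology_edges(from_agent_id, to_agent_id, edge_type);
"""


def upsert_edge(conn, src, dst, edge_type, metadata, seen_at):
    """Create the edge, or refresh it if the same triple already exists."""
    with conn:
        conn.execute(
            """
            INSERT INTO topology_edges
                (from_agent_id, to_agent_id, edge_type, metadata, last_seen)
            VALUES (?, ?, ?, ?, ?)
            ON CONFLICT (from_agent_id, to_agent_id, edge_type) DO UPDATE SET
                metadata = excluded.metadata,
                last_seen = excluded.last_seen,
                is_active = 1
            """,
            (src, dst, edge_type, json.dumps(metadata), seen_at),
        )


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
upsert_edge(conn, "agent-a", "agent-b", "wireguard", {"rtt_ms": 12}, 1744569900)
upsert_edge(conn, "agent-a", "agent-b", "wireguard", {"rtt_ms": 9}, 1744569930)
rows = conn.execute("SELECT metadata, last_seen FROM topology_edges").fetchall()
```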

#### `agent_interfaces`
Tracks network interfaces an agent reports, allowing detection of when they come and go.
```sql
CREATE TABLE agent_interfaces (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    agent_id TEXT NOT NULL,
    interface_name TEXT NOT NULL,
    mac_address TEXT,
    addresses_json TEXT,
    is_virtual INTEGER NOT NULL DEFAULT 0,
    vpn_type TEXT,
    last_seen_at INTEGER NOT NULL,
    FOREIGN KEY (agent_id) REFERENCES agents(agent_id) ON DELETE CASCADE
);
CREATE UNIQUE INDEX idx_agent_interfaces ON agent_interfaces(agent_id, interface_name);
```
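
One possible way to detect interfaces that come and go is to compare the interface names already stored for an agent with the `interfaces` array of its latest report. The sketch below shows only the comparison step; `diff_interfaces` is a hypothetical helper, and loading the known names from `agent_interfaces` is left out.

```python
def diff_interfaces(known_names, reported):
    """Return (appeared, disappeared) interface names, each sorted.

    known_names: interface_name values currently stored for the agent.
    reported:    the "interfaces" list from the agent's latest report.
    """
    known = set(known_names)
    current = {iface["name"] for iface in reported}
    return sorted(current - known), sorted(known - current)


known_names = ["eth0", "wg0"]
reported = [
    {"name": "eth0", "is_virtual": False, "vpn_type": None},
    {"name": "tun0", "is_virtual": True, "vpn_type": "openvpn"},
]
appeared, disappeared = diff_interfaces(known_names, reported)
```

Each name in `appeared` or `disappeared` could then feed an entry in the `alarms` table, although the design does not say which alarm types apply.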

#### `alarms`
Event log for network changes and issues, tracking state and timestamps.
```sql
CREATE TABLE alarms (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    agent_id TEXT NOT NULL,
    alarm_type TEXT NOT NULL, -- e.g., 'link_down', 'new_peer'
    status TEXT NOT NULL DEFAULT 'active', -- 'active', 'dismissed'
    details_json TEXT DEFAULT '{}',
    created_at INTEGER NOT NULL DEFAULT (strftime('%s', 'now')),
    dismissed_at INTEGER,
    FOREIGN KEY (agent_id) REFERENCES agents(agent_id) ON DELETE CASCADE
);
CREATE INDEX idx_alarms_agent_status ON alarms(agent_id, status);
```

## Communication Protocol

### Security & Hardware
- **Authentication**: HMAC-SHA256 verification using a fleet-wide Bootstrap PSK.
- **Key Discovery & Transition**: The PSK is retrieved via DNS TXT, HTTP(S) URL, or local file and checked for changes hourly. The manager should accept the current and the 2 previous bootstrap keys to handle propagation delays, returning a specific error if an agent connects with an outdated key.
- **Replay Protection**: Monotonic "ticks" and a sliding window nonce cache (120 entries).
- **Time Sync**: 10-minute maximum clock skew allowance.
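
The checks above could be combined roughly as follows. This is a sketch under stated assumptions: the design does not say which bytes are signed, so the example signs a canonical JSON encoding of the report without its `hmac` field. It also does not define how an outdated key is recognised, so the example only distinguishes the current key from the two previous ones. The function names and return strings are hypothetical.

```python
import hashlib
import hmac
import json
import time

MAX_SKEW_SECONDS = 600  # 10-minute clock skew allowance


def sign(report, psk):
    """Hex HMAC-SHA256 over the report minus its 'hmac' field (assumed form)."""
    body = {k: v for k, v in report.items() if k != "hmac"}
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hmac.new(psk, canonical.encode(), hashlib.sha256).hexdigest()


def verify(report, keys, now=None):
    """keys[0] is the current PSK; keys[1:3] are the two previous ones."""
    now = time.time() if now is None else now
    if abs(now - report["timestamp"]) > MAX_SKEW_SECONDS:
        return "clock_skew"
    supplied = str(report.get("hmac", "")).encode()
    for index, psk in enumerate(keys[:3]):
        expected = sign(report, psk).encode()
        if hmac.compare_digest(expected, supplied):  # constant-time compare
            return "ok" if index == 0 else "ok_previous_key"
    return "bad_hmac"


keys = [b"psk-current", b"psk-previous-1", b"psk-previous-2"]
report = {"agent_id": "agent-7f3a9b2c1d", "tick": 42, "timestamp": 1744569900}

report["hmac"] = sign(report, keys[0])
result_current = verify(report, keys, now=1744569910)

report["hmac"] = sign(report, keys[2])
result_previous = verify(report, keys, now=1744569910)

report["hmac"] = sign(report, b"psk-unknown")
result_unknown = verify(report, keys, now=1744569910)

result_skewed = verify(report, keys, now=1744569900 + 601)
```

A manager could map `ok_previous_key` to a response that tells the agent to refresh its PSK while still accepting the report.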

### Report Payload (Agent -> Manager)
Agents send a `report` every 30 seconds (with randomized jitter).

```json
{
  "version": 1,
  "tick": 42,
  "type": "report",
  "nonce": "base64-random-nonce",
  "timestamp": 1744569900,
  "agent_id": "agent-7f3a9b2c1d",
  "agent_version": 5,
  "fleet_id": "sha256-psk-hash",
  "hmac": "hex-hmac-sha256",
  "data": {
    "hostname": "node-01",
    "uptime_seconds": 123456,
    "loadavg": [0.12, 0.34, 0.56],
    "interfaces": [
      {
        "name": "eth0",
        "mac": "aa:bb:cc:dd:ee:ff",
        "addresses": ["192.168.1.10/24"],
        "is_virtual": false,
        "vpn_type": null
      }
    ],
    "routes": [
      { "dst": "0.0.0.0/0", "via": "192.168.1.1", "dev": "eth0" }
    ],
    "wg_peers": [
      {
        "public_key": "base64-key",
        "endpoint": "1.2.3.4:51820",
        "allowed_ips": ["10.0.0.2/32"],
        "last_handshake": 1744569800
      }
    ]
  }
}
```
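
An agent-side sketch of the reporting cadence and envelope might look like the following. Several details are assumptions rather than part of the design: the jitter range of ±5 seconds, the 16-byte nonce length, and deriving `fleet_id` as the SHA-256 hex digest of the PSK (suggested only by the `sha256-psk-hash` placeholder). The `hmac` field is omitted here and would be added once the envelope is signed.

```python
import base64
import hashlib
import os
import random
import time

REPORT_INTERVAL = 30.0  # seconds, from the design
JITTER = 5.0            # assumed spread; the design gives no range


def next_delay():
    """Seconds to wait before the next report, with randomized jitter."""
    return REPORT_INTERVAL + random.uniform(-JITTER, JITTER)


def build_envelope(agent_id, tick, psk, data, agent_version=5):
    """Assemble an unsigned report envelope matching the payload layout."""
    return {
        "version": 1,
        "tick": tick,
        "type": "report",
        "nonce": base64.b64encode(os.urandom(16)).decode(),
        "timestamp": int(time.time()),
        "agent_id": agent_id,
        "agent_version": agent_version,
        "fleet_id": hashlib.sha256(psk).hexdigest(),  # assumed derivation
        "data": data,
    }


envelope = build_envelope("agent-7f3a9b2c1d", 42, b"example-psk",
                          {"hostname": "node-01"})
delay = next_delay()
```

Jitter spreads reports from many agents across the interval so that they do not all reach the manager at the same instant.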

### Relay Mechanism
Used when an agent cannot reach the manager directly via the configured URL.
- **Discovery**: The agent scans its connected WireGuard networks for other agents (checking port 5087) and queries their `/status/peer` endpoint to find a forward path to the manager.
- **Hop Limit**: A report may be relayed through at most 3 hops.

> [!IMPORTANT]
> **Loop Detection**: Agents must check the `relay_path` array. If their own `agent_id` is present, the message is dropped to prevent infinite recursion.
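
The loop check and the 3-hop limit could be applied together when a relay receives an envelope. The sketch assumes that `relay_path` is a list of the `agent_id` values of relays that have already forwarded the message and that each relay appends its own ID before forwarding. The design names the array but does not define its contents, and `accept_for_relay` is a hypothetical helper.

```python
MAX_HOPS = 3  # relay hop limit from the design


def accept_for_relay(envelope, my_agent_id):
    """Return the envelope to forward, or None if it must be dropped."""
    path = list(envelope.get("relay_path", []))
    if my_agent_id in path:
        return None  # loop: this agent has already relayed the message
    if len(path) >= MAX_HOPS:
        return None  # hop limit reached
    # Copy the envelope and record this agent as the newest hop.
    return {**envelope, "relay_path": path + [my_agent_id]}


fresh = accept_for_relay({"tick": 42, "relay_path": []}, "agent-b")
looped = accept_for_relay({"tick": 42, "relay_path": ["agent-b"]}, "agent-b")
too_far = accept_for_relay(
    {"tick": 42, "relay_path": ["agent-c", "agent-d", "agent-e"]}, "agent-b")
```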

---

## Data Model (Manager)

The Manager maintains the network state and inferred topology.

| Table | Purpose |
| :--- | :--- |
| `agents` | Fleet registry and presence tracking (heartbeat). |
| `agent_interfaces` | Historical snapshot of network interfaces. |
| `topology_edges` | Inferred links between agents (Physical, VPN, Relay). |
| `alarms` | Event log for changes (link down, new peer, etc.). |

---

## Visualization & UI

The network is visualized using **Vis-network.min.js** in a layered approach:
1. **Layer 1 (Public)**: Servers with direct public IPs (masked as SHA fingerprints).
2. **Layer 2 (Linked)**: Servers behind NAT but directly connected to Layer 1.
3. **Layer 3 (Private)**: Isolated nodes reachable only via multi-hop paths.
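
The three layers above could be derived from reported addresses and inferred edges along these lines. This is only a sketch: treating edges as undirected, using `ipaddress`'s `is_global` as the test for a public IP, and truncating a SHA-256 digest to 12 hex characters for the fingerprint are all assumptions, and the function names are hypothetical.

```python
import hashlib
import ipaddress


def has_public_ip(addresses):
    """True if any CIDR address (e.g. '192.168.1.10/24') is globally routable."""
    return any(ipaddress.ip_interface(a).ip.is_global for a in addresses)


def mask_ip(address):
    """Short SHA-256 fingerprint shown in place of a public IP."""
    return hashlib.sha256(address.encode()).hexdigest()[:12]


def assign_layers(addresses_by_agent, edges):
    """Map each agent_id to layer 1, 2 or 3; edges are (agent_id, agent_id)."""
    layer1 = {agent for agent, addrs in addresses_by_agent.items()
              if has_public_ip(addrs)}
    layer2 = set()
    for a, b in edges:
        # An agent directly linked to a Layer 1 node belongs to Layer 2.
        if a in layer1 and b not in layer1:
            layer2.add(b)
        elif b in layer1 and a not in layer1:
            layer2.add(a)
    return {agent: 1 if agent in layer1 else 2 if agent in layer2 else 3
            for agent in addresses_by_agent}


addresses_by_agent = {
    "agent-a": ["8.8.8.8/32"],
    "agent-b": ["192.168.1.10/24"],
    "agent-c": ["10.0.0.2/32"],
}
edges = [("agent-b", "agent-a"), ("agent-c", "agent-b")]
layers = assign_layers(addresses_by_agent, edges)
fingerprint = mask_ip("8.8.8.8")
```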

---

## Operational Considerations

### Logging & Monitoring
- **Agents**: Should log to `journald` at INFO level. Critical errors (e.g., SQLite corruption, no PSK) should be logged at ERROR.
- **Manager**: Log each incoming report and security failure (HMAC mismatch) with the source agent IP and ID.

### Maintenance
- **Database Vacuum**: Periodic `VACUUM` on the manager DB is recommended if tracking many historical reports.
- **Relay Cleanup**: The `nonce_cache` should be cleaned every 10 minutes to prevent memory/storage bloat.
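
A bounded nonce cache with periodic cleanup might be sketched as below. The 120-entry window and the 10-minute cleanup come from this document; the 600-second maximum age, the in-memory `OrderedDict` storage, and the `NonceCache` class itself are assumptions made for illustration. The sketch also assumes timestamps are passed in non-decreasing order so that the oldest entry is always first.

```python
import time
from collections import OrderedDict


class NonceCache:
    """Remembers recently seen nonces so that replays can be rejected."""

    def __init__(self, max_entries=120, max_age=600):
        self.max_entries = max_entries
        self.max_age = max_age      # seconds; assumed value
        self._seen = OrderedDict()  # nonce -> first-seen time, oldest first

    def __len__(self):
        return len(self._seen)

    def check_and_add(self, nonce, now=None):
        """Return True for a new nonce, False for a replay."""
        now = time.time() if now is None else now
        if nonce in self._seen:
            return False
        self._seen[nonce] = now
        while len(self._seen) > self.max_entries:
            self._seen.popitem(last=False)  # slide the window: drop oldest
        return True

    def purge(self, now=None):
        """Drop entries older than max_age; intended to run every 10 minutes."""
        now = time.time() if now is None else now
        cutoff = now - self.max_age
        while self._seen and next(iter(self._seen.values())) < cutoff:
            self._seen.popitem(last=False)


cache = NonceCache(max_entries=2)
new_nonce = cache.check_and_add("n1", now=0)
replay = cache.check_and_add("n1", now=1)
cache.check_and_add("n2", now=2)
cache.check_and_add("n3", now=3)  # window is full, so "n1" is evicted
size_before_purge = len(cache)
cache.purge(now=700)              # everything is older than 600 seconds
size_after_purge = len(cache)
```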

---

## Future Enhancements & Proposals

### 1. Alerting Integrations
- **Webhooks**: Simple HTTP POST to external services (Slack, Discord) when an `alarm` is created.
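
As a rough illustration of the webhook proposal, the request could be built with the standard library alone. The URL, the `build_webhook_request` helper, and the single-field `{"text": ...}` body are assumptions; receivers differ in the JSON shape they expect, so the body would need adapting per service.

```python
import json
import urllib.request


def build_webhook_request(url, alarm):
    """Build (but do not send) an HTTP POST announcing a new alarm."""
    message = "[{}] agent {}".format(alarm["alarm_type"], alarm["agent_id"])
    body = json.dumps({"text": message}).encode()
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


alarm = {"agent_id": "agent-7f3a9b2c1d", "alarm_type": "link_down"}
request = build_webhook_request("https://hooks.example.invalid/kattila", alarm)
```

Sending is then a matter of calling `urllib.request.urlopen(request, timeout=5)`, ideally outside the report-handling path so that a slow receiver cannot delay ingestion.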

### 2. Historical Topology "Time-Travel"
- Store topology snapshots every hour.
- Allow the UI to "scrub" through history to see when a specific link was added or lost.

### 3. Advanced Visualization
- **Geographic Map Overlay**: If agents provide coordinates (or coordinates can be inferred via GeoIP), display nodes on a world map.
- **Link Bandwidth Visualization**: Thicker lines for higher-capacity links (e.g., physical vs. relay).