# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

This is a **distributed internet network mapping system** that performs pings and traceroutes across geographically diverse nodes to build a continuously evolving map of internet routes. The system is designed to be resilient to node failures, network instability, and imperfect infrastructure (Raspberry Pis, consumer NAT, 4G/LTE connections).

Core concept: Bootstrap with ~19,000 cloud provider IPs → ping targets → traceroute responders → extract intermediate hops → feed hops back as new targets → build organic graph of internet routes over time.

## Multi-Instance Production Deployment

**CRITICAL**: All services are designed to run with **multiple instances in production**. This architectural constraint must be considered in all design decisions:
### State Management

- **Avoid local in-memory state** for coordination or shared data
- Use external stores (files, databases, shared storage) for state that must persist across instances
- Current input_service uses per-consumer file-based state tracking - each instance maintains its own consumer mappings
- Current ping_service uses an in-memory cooldown cache - acceptable because workers are distributed and some overlap is tolerable (see the sketch after this list)
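
To illustrate the kind of per-IP cooldown cache ping_service keeps in memory, here is a minimal sketch; the names (`cooldownCache`, `Allow`) are illustrative and not taken from the actual code.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// cooldownCache records when each IP was last pinged so a worker can skip
// targets it has touched within the cooldown window. This is purely local
// state: other workers keep their own caches, so some overlap is expected.
type cooldownCache struct {
	mu       sync.Mutex
	lastPing map[string]time.Time
	cooldown time.Duration
}

func newCooldownCache(cooldown time.Duration) *cooldownCache {
	return &cooldownCache{lastPing: make(map[string]time.Time), cooldown: cooldown}
}

// Allow reports whether ip may be pinged now and, if so, records the attempt.
func (c *cooldownCache) Allow(ip string, now time.Time) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if last, ok := c.lastPing[ip]; ok && now.Sub(last) < c.cooldown {
		return false // still cooling down
	}
	c.lastPing[ip] = now
	return true
}

func main() {
	c := newCooldownCache(15 * time.Minute)
	fmt.Println(c.Allow("192.0.2.1", time.Now())) // true: first contact
	fmt.Println(c.Allow("192.0.2.1", time.Now())) // false: within cooldown
}
```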
### Coordination Requirements

- **ping_service**: Multiple workers can ping the same targets (cooldown prevents excessive frequency)
- **input_service**: Multiple instances serve different consumers independently; per-consumer state prevents duplicate work for the same client
- **output_service**: Must handle concurrent writes from multiple ping_service instances safely
- **manager**: Session management currently in-memory - needs external session store for multi-instance deployment

### Design Implications

- Services must be stateless where possible, or use shared external state
- Database/storage layer must handle concurrent access correctly
- Load balancing between instances should be connection-based for input_service (maintains per-consumer state)
- Race conditions and distributed coordination must be considered for shared resources

### Current Implementation Status

- **input_service**: Partially multi-instance ready (per-consumer state is instance-local; hop deduplication requires session affinity or a broadcast strategy - see MULTI_INSTANCE.md)
- **ping_service**: Fully multi-instance ready (distributed workers by design)
- **output_service**: Fully multi-instance ready (each instance maintains its own SQLite database with TTL-based sentHops cleanup)
- **manager**: Requires configuration for multi-instance (in-memory sessions; user store now uses file locking for safe concurrent access - see MULTI_INSTANCE.md)
## Architecture Components

### 1. `ping_service` (Root Directory)

The worker agent that runs on each distributed node.

- **Language**: Go
- **Main file**: `ping_service.go`
- **Responsibilities**: Execute ICMP/TCP pings, apply per-IP cooldowns, run traceroute on successes, output structured JSON results, expose health/metrics endpoints
- **Configuration**: `config.yaml` - supports file/HTTP/Unix socket for input/output
- **Deployment**: Designed to run unattended under systemd on Debian-based systems

### 2. `input_service/`

HTTP service that feeds IP addresses to ping workers with subnet interleaving.

- **Main file**: `http_input_service.go`
- **Responsibilities**: Serve individual IPs with subnet interleaving (avoids consecutive IPs from the same subnet), maintain per-consumer state, accept discovered hops from output_service via the `/hops` endpoint
- **Data source**: Expects a `./cloud-provider-ip-addresses/` directory with `.txt` files containing CIDR ranges
- **Features**: 10-CIDR interleaving, per-consumer + global deduplication, hop discovery feedback loop, lazy CIDR expansion, persistent state (export/import), IPv4 filtering, graceful shutdown
- **API Endpoints**: `/` (GET - serve IP), `/hops` (POST - accept discovered hops), `/status`, `/export`, `/import`
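
As a rough illustration of how a worker might consume these endpoints, the sketch below fetches one target from `/` and reports discovered hops to `/hops`. It assumes `/` returns a single plain-text IP and `/hops` accepts a JSON array of IP strings; the actual wire formats are defined in `http_input_service.go` and may differ.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"strings"
	"time"
)

var client = &http.Client{Timeout: 30 * time.Second} // bounded, as ping_service now does

// nextTarget asks input_service for one IP to ping.
// Assumes the response body is a single plain-text IP address.
func nextTarget(base string) (string, error) {
	resp, err := client.Get(base + "/")
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	return strings.TrimSpace(string(body)), nil
}

// reportHops submits discovered hop IPs back to input_service.
// Assumes /hops accepts a JSON array of IP strings.
func reportHops(base string, hops []string) error {
	payload, err := json.Marshal(hops)
	if err != nil {
		return err
	}
	resp, err := client.Post(base+"/hops", "application/json", bytes.NewReader(payload))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("hops rejected: %s", resp.Status)
	}
	return nil
}

func main() {
	ip, err := nextTarget("http://localhost:8080")
	if err != nil {
		fmt.Println("fetch failed:", err)
		return
	}
	fmt.Println("next target:", ip)
}
```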
### 3. `output_service/`

HTTP service that receives and stores ping/traceroute results.

- **Main file**: `main.go`
- **Responsibilities**: Store ping/traceroute results in SQLite, extract intermediate hops, forward discovered hops to input_service, provide reporting/metrics API
- **Database**: SQLite with automatic rotation (weekly OR 100MB, keep 5 files)
- **Features**: Hop deduplication, remote database dumps, Prometheus metrics, health checks
- **Multi-instance**: Each instance maintains its own database; databases can be aggregated later
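
The rotation policy (rotate when the database is older than N days or larger than M MB, whichever comes first, keeping only the newest K files) can be sketched roughly as below; the function names and file layout are illustrative, not the actual ones in `main.go`.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"sort"
	"time"
)

// shouldRotate applies the documented policy: rotate weekly OR at the size cap.
func shouldRotate(dbPath string, createdAt time.Time, maxSizeMB int64, maxAge time.Duration) (bool, error) {
	info, err := os.Stat(dbPath)
	if err != nil {
		return false, err
	}
	tooBig := info.Size() >= maxSizeMB*1024*1024
	tooOld := time.Since(createdAt) >= maxAge
	return tooBig || tooOld, nil
}

// pruneOldDatabases keeps only the newest `keep` rotated files in dir.
func pruneOldDatabases(dir string, keep int) error {
	files, err := filepath.Glob(filepath.Join(dir, "*.db"))
	if err != nil {
		return err
	}
	sort.Strings(files) // assumes timestamped names that sort oldest-first
	for len(files) > keep {
		if err := os.Remove(files[0]); err != nil {
			return err
		}
		files = files[1:]
	}
	return nil
}

func main() {
	rotate, err := shouldRotate("./output_data/results.db", time.Now().Add(-8*24*time.Hour), 100, 7*24*time.Hour)
	fmt.Println(rotate, err) // true if the file exists: it is older than 7 days
}
```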
### 4. `manager/`

Centralized web UI and control plane with TOTP authentication.

- **Main file**: `main.go`
- **Responsibilities**: Web UI for system observation, control/coordination, certificate/crypto handling (AES-GCM double encryption), Dynamic DNS (dy.fi) integration, fail2ban-ready security logging, worker registration and monitoring, optional gateway/proxy for external workers
- **Security**: TOTP two-factor auth, Let's Encrypt ACME support, encrypted user store, rate limiting, API key management (for gateway)
- **Additional modules**: `store.go`, `logger.go`, `template.go`, `crypto.go`, `cert.go`, `dyfi.go`, `gr.go`, `workers.go`, `handlers.go`, `security.go`, `proxy.go`, `apikeys.go`
- **Features**: Worker auto-discovery, health polling (60s), dashboard UI, gateway mode (optional), multi-instance dy.fi failover
## Service Discovery

All services (input, ping, output) expose a `/service-info` endpoint that returns:

```json
{
  "service_type": "input|ping|output",
  "version": "1.0.0",
  "name": "service_name",
  "instance_id": "hostname",
  "capabilities": ["feature1", "feature2"]
}
```
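
A minimal sketch of serving this document from a Go service follows; the struct name, handler wiring, and capability values are illustrative, not copied from the actual implementation.

```go
package main

import (
	"encoding/json"
	"net/http"
	"os"
)

// serviceInfo mirrors the documented /service-info response fields.
type serviceInfo struct {
	ServiceType  string   `json:"service_type"`
	Version      string   `json:"version"`
	Name         string   `json:"name"`
	InstanceID   string   `json:"instance_id"`
	Capabilities []string `json:"capabilities"`
}

func main() {
	hostname, _ := os.Hostname()
	info := serviceInfo{
		ServiceType:  "ping",
		Version:      "1.0.0",
		Name:         "ping_service",
		InstanceID:   hostname,
		Capabilities: []string{"icmp", "traceroute"}, // illustrative capability names
	}
	http.HandleFunc("/service-info", func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(info)
	})
	http.ListenAndServe(":8090", nil) // ping_service exposes this on its health check port
}
```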
**Purpose**: Enables automatic worker type detection in the manager. When registering a worker, you only need to provide the URL - the manager queries `/service-info` to determine:

- **Service type** (input/ping/output)
- **Suggested name** (generated from service name + instance ID)

**Location of endpoint**:

- **input_service**: `http://host:8080/service-info`
- **ping_service**: `http://host:PORT/service-info` (on health check port)
- **output_service**: `http://host:HEALTH_PORT/service-info` (on health check server)
**Manager behavior**:

- If worker registration omits `type`, the manager calls `/service-info` to auto-detect (see the sketch after this list)
- If auto-detection fails, registration fails with a helpful error message
- Manual type override is always available
- Auto-generated names can be overridden during registration
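
A rough sketch of the auto-detection step, under the assumption that the manager simply GETs `/service-info` and decodes the JSON shown above; the helper name `detectWorkerType` and the suggested-name format are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

type serviceInfo struct {
	ServiceType string `json:"service_type"`
	Name        string `json:"name"`
	InstanceID  string `json:"instance_id"`
}

// detectWorkerType queries a worker's /service-info endpoint and returns its
// type plus a suggested display name. Registration would fail with a helpful
// error if this call fails and no manual type was provided.
func detectWorkerType(baseURL string) (typ, suggestedName string, err error) {
	client := &http.Client{Timeout: 10 * time.Second}
	resp, err := client.Get(baseURL + "/service-info")
	if err != nil {
		return "", "", fmt.Errorf("auto-detection failed, specify type manually: %w", err)
	}
	defer resp.Body.Close()
	var info serviceInfo
	if err := json.NewDecoder(resp.Body).Decode(&info); err != nil {
		return "", "", fmt.Errorf("auto-detection failed, specify type manually: %w", err)
	}
	return info.ServiceType, info.Name + "-" + info.InstanceID, nil
}

func main() {
	typ, name, err := detectWorkerType("http://10.0.0.5:8090")
	fmt.Println(typ, name, err)
}
```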
**Note**: This only works for **internal workers** that the manager can reach (e.g., on WireGuard). External workers behind NAT use the gateway with API keys (see `GATEWAY.md`).
## Common Commands

### Building Components

```bash
# Build ping_service (root)
go build -o ping_service

# Build input_service
cd input_service
go build -ldflags="-s -w" -o http_input_service http_input_service.go

# Build output_service
cd output_service
go build -o output_service main.go

# Build manager
cd manager
go mod tidy
go build -o manager
```
### Running Services

```bash
# Run ping_service with verbose logging
./ping_service -config config.yaml -verbose

# Run input_service (serves on :8080)
cd input_service
./http_input_service

# Run output_service (serves on :8081 for results, :8091 for health)
cd output_service
./output_service --verbose

# Run manager in development (self-signed certs)
cd manager
go run . --port=8080

# Run manager in production (Let's Encrypt)
sudo go run . --port=443 --domain=example.dy.fi --email=admin@example.com
```
### Installing ping_service as systemd Service

```bash
chmod +x install.sh
sudo ./install.sh
sudo systemctl start ping-service
sudo systemctl status ping-service
sudo journalctl -u ping-service -f
```

### Manager User Management

```bash
# Add new user (generates TOTP QR code)
cd manager
go run . --add-user=username
```
## Configuration

### ping_service (`config.yaml`)

- `input_file`: IP source - HTTP endpoint, file path, or Unix socket
- `output_file`: Results destination - HTTP endpoint, file path, or Unix socket
- `interval_seconds`: Poll interval between runs
- `cooldown_minutes`: Minimum time between pinging the same IP
- `enable_traceroute`: Enable traceroute on successful pings
- `traceroute_max_hops`: Maximum TTL for traceroute
- `health_check_port`: Port for `/health`, `/ready`, `/metrics` endpoints
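
For orientation, the keys above would map onto a `gopkg.in/yaml.v3` config struct roughly as sketched below; the struct name and loader are illustrative, not necessarily what `ping_service.go` uses.

```go
package main

import (
	"fmt"
	"os"

	"gopkg.in/yaml.v3"
)

// Config mirrors the documented config.yaml keys.
type Config struct {
	InputFile         string `yaml:"input_file"`
	OutputFile        string `yaml:"output_file"`
	IntervalSeconds   int    `yaml:"interval_seconds"`
	CooldownMinutes   int    `yaml:"cooldown_minutes"`
	EnableTraceroute  bool   `yaml:"enable_traceroute"`
	TracerouteMaxHops int    `yaml:"traceroute_max_hops"`
	HealthCheckPort   int    `yaml:"health_check_port"`
}

func loadConfig(path string) (*Config, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	var cfg Config
	if err := yaml.Unmarshal(data, &cfg); err != nil {
		return nil, err
	}
	return &cfg, nil
}

func main() {
	cfg, err := loadConfig("config.yaml")
	if err != nil {
		fmt.Println("config error:", err)
		return
	}
	fmt.Printf("polling every %ds, cooldown %dm\n", cfg.IntervalSeconds, cfg.CooldownMinutes)
}
```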
### output_service (CLI Flags)

- `--port`: Port for receiving results (default 8081)
- `--health-port`: Port for health/metrics (default 8091)
- `--input-url`: Input service URL for hop submission (default `http://localhost:8080/hops`)
- `--db-dir`: Directory for database files (default `./output_data`)
- `--max-size-mb`: Max DB size in MB before rotation (default 100)
- `--rotation-days`: Rotate DB after N days (default 7)
- `--keep-files`: Number of DB files to keep (default 5)
- `-v, --verbose`: Enable verbose logging
### manager (Environment Variables)

- `SERVER_KEY`: 32-byte base64 key for encryption (auto-generated if missing)
- `DYFI_DOMAIN`, `DYFI_USER`, `DYFI_PASS`: Dynamic DNS configuration
- `ACME_EMAIL`: Email for Let's Encrypt notifications
- `LOG_FILE`: Path for fail2ban-ready authentication logs
- `MANAGER_PORT`: HTTP/HTTPS port (default from flag)
## Key Design Principles

1. **Fault Tolerance**: Nodes can join and leave freely; partial failures are expected
2. **Network Reality**: Designed for imperfect infrastructure (NAT, 4G, low-end hardware)
3. **No Time Guarantees**: Latency variations are normal; workers are not assumed to be always online
4. **Organic Growth**: System learns by discovering hops and feeding them back as targets
5. **Security**: Manager requires TOTP auth, double-encrypted storage, fail2ban integration
## Dependencies

### ping_service

- `github.com/go-ping/ping` - ICMP ping library
- `gopkg.in/yaml.v3` - YAML config parsing
- Go 1.25.0

### output_service

- `github.com/mattn/go-sqlite3` - SQLite driver (requires CGO)
- Go 1.25.0

### manager

- `github.com/pquerna/otp` - TOTP authentication
- `golang.org/x/crypto/acme/autocert` - Let's Encrypt integration
## Data Flow

1. `input_service` serves IPs from CIDR ranges (or accepts discovered hops)
2. `ping_service` nodes poll input_service and ping targets with cooldown enforcement
3. Successful pings trigger an optional traceroute (ICMP/TCP)
4. Results (JSON) are sent to `output_service` (HTTP/file/socket)
5. `output_service` extracts intermediate hops from traceroute data
6. New hops are fed back into the `input_service` target pool
7. `manager` provides visibility and control over the system
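
The loop that steps 1-6 imply on a worker looks roughly like the sketch below. The result payload fields are purely illustrative placeholders; the real JSON schema lives in `ping_service.go` and `output_service/main.go`.

```go
package main

import (
	"bytes"
	"encoding/json"
	"log"
	"net/http"
	"time"
)

// result is an illustrative placeholder for one ping/traceroute outcome;
// the actual field names are defined by ping_service and output_service.
type result struct {
	Target    string   `json:"target"`
	Success   bool     `json:"success"`
	RTTMillis float64  `json:"rtt_ms"`
	Hops      []string `json:"hops,omitempty"`
}

func main() {
	client := &http.Client{Timeout: 30 * time.Second}
	for {
		// Steps 1-3: fetch the next target from input_service, ping it, and on
		// success run traceroute — omitted here for brevity.
		r := result{Target: "203.0.113.7", Success: true, RTTMillis: 12.4}

		// Step 4: ship the JSON result to output_service.
		payload, _ := json.Marshal(r)
		resp, err := client.Post("http://localhost:8081/results", "application/json", bytes.NewReader(payload))
		if err != nil {
			log.Println("output_service unreachable:", err)
		} else {
			resp.Body.Close()
		}

		// Poll interval between runs (interval_seconds in config.yaml).
		time.Sleep(60 * time.Second)
	}
}
```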
## Health Endpoints

### ping_service (port 8090)

- `GET /health` - Status, uptime, ping statistics
- `GET /ready` - Readiness check
- `GET /metrics` - Prometheus-compatible metrics

### output_service (port 8091)

- `GET /health` - Status, uptime, processing statistics
- `GET /ready` - Readiness check (verifies database connectivity)
- `GET /metrics` - Prometheus-compatible metrics
- `GET /stats` - Detailed statistics in JSON format
- `GET /recent?limit=100&ip=8.8.8.8` - Query recent ping results

### output_service API endpoints (port 8081)

- `POST /results` - Receive ping results from ping_service nodes
- `POST /rotate` - Manually trigger database rotation
- `GET /dump` - Download current SQLite database file
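
"Prometheus-compatible metrics" here means the plain-text exposition format. A minimal hand-rolled `/metrics` and `/ready` handler might look like the sketch below; the counter names are illustrative, not the ones the services actually export.

```go
package main

import (
	"fmt"
	"net/http"
	"sync/atomic"
)

// Illustrative counters; the real services track their own statistics.
var pingsTotal, pingFailuresTotal atomic.Int64

func metricsHandler(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Type", "text/plain; version=0.0.4")
	fmt.Fprintf(w, "# HELP pings_total Total pings attempted.\n")
	fmt.Fprintf(w, "# TYPE pings_total counter\n")
	fmt.Fprintf(w, "pings_total %d\n", pingsTotal.Load())
	fmt.Fprintf(w, "# HELP ping_failures_total Total failed pings.\n")
	fmt.Fprintf(w, "# TYPE ping_failures_total counter\n")
	fmt.Fprintf(w, "ping_failures_total %d\n", pingFailuresTotal.Load())
}

func main() {
	http.HandleFunc("/metrics", metricsHandler)
	http.HandleFunc("/ready", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK) // a real check would verify dependencies (e.g., DB connectivity)
	})
	http.ListenAndServe(":8090", nil)
}
```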
## Project Status

- Functional distributed ping + traceroute workers
- Input service with persistent state and lazy CIDR expansion
- Output service with SQLite storage, rotation, hop extraction, and feedback loop
- Manager with TOTP auth, encryption, Let's Encrypt, dy.fi integration
- Mapping and visualization still exploratory
## Important Notes

- Visualization strategy is an open problem (no finalized design)
- System currently bootstrapped with ~19,000 cloud provider IPs
- Traceroute supports both ICMP and TCP methods
- Manager logs `AUTH_FAILURE` events with IP for fail2ban filtering
- **Input service interleaving**: Maintains 10 active CIDR generators and rotates between them to avoid serving consecutive IPs from the same /24 or /29 subnet (see the sketch after this list)
- **Input service deduplication**: Per-consumer (prevents re-serving) and global (prevents re-adding from hops)
- **Hop feedback loop**: Output service extracts hops → POSTs to input service `/hops` → input service adds to all consumer pools → organic target growth
- Input service maintains per-consumer progress state (can be exported/imported)
- Output service rotates databases weekly OR at 100MB (whichever comes first), keeping 5 files
- Each output_service instance maintains its own database; use `/dump` for central aggregation
- For multi-instance input_service, use session affinity or call `/hops` on all instances
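
A simplified sketch of the interleaving idea referenced above: keep a fixed-size pool of active CIDR generators and serve IPs round-robin across them, dropping a generator when it is exhausted. Names here are illustrative; the real implementation also handles lazy expansion, the 10-slot pool, deduplication, and persistence.

```go
package main

import (
	"fmt"
	"net/netip"
)

// cidrGen lazily walks the addresses of one CIDR.
type cidrGen struct {
	next netip.Addr
	pfx  netip.Prefix
}

func newCIDRGen(cidr string) (*cidrGen, error) {
	pfx, err := netip.ParsePrefix(cidr) // ParsePrefix with error handling, not a panicking Must* variant
	if err != nil {
		return nil, err
	}
	return &cidrGen{next: pfx.Addr(), pfx: pfx}, nil
}

// ip returns the next address in the CIDR, or false when exhausted.
func (g *cidrGen) ip() (string, bool) {
	if !g.pfx.Contains(g.next) {
		return "", false
	}
	addr := g.next
	g.next = g.next.Next()
	return addr.String(), true
}

func main() {
	// Pool of active generators; the real service keeps 10 and refills from pending CIDRs.
	var active []*cidrGen
	for _, c := range []string{"198.51.100.0/30", "203.0.113.0/30"} {
		g, err := newCIDRGen(c)
		if err != nil {
			continue
		}
		active = append(active, g)
	}

	// Round-robin across generators so consecutive responses come from different subnets.
	for i := 0; len(active) > 0; i++ {
		idx := i % len(active)
		ip, ok := active[idx].ip()
		if !ok {
			active = append(active[:idx], active[idx+1:]...) // drop exhausted generator
			i--
			continue
		}
		fmt.Println(ip)
	}
}
```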
## Multi-Instance Deployment

All services support multi-instance deployment with varying degrees of readiness. See **MULTI_INSTANCE.md** for comprehensive deployment guidance including:

- Session affinity strategies for input_service
- Database aggregation for output_service
- File locking for manager user store
- Load balancing recommendations
- Known limitations and workarounds
## Recent Critical Fixes

- **Fixed panic risk**: input_service now uses `ParseAddr()` with error handling instead of `MustParseAddr()`
- **Added HTTP timeouts**: ping_service uses a 30-second timeout to prevent indefinite hangs
- **Fixed state serialization**: input_service now preserves the activeGens array so interleaving resumes correctly after a reload
- **Implemented sentHops eviction**: output_service uses TTL-based cleanup (24h) to prevent unbounded memory growth
- **Added file locking**: manager user store uses flock for safe concurrent access in multi-instance deployments (see the sketch after this list)
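
For reference, advisory file locking with flock in Go can be done directly via the syscall package. The sketch below is Linux-oriented and illustrates the technique rather than the manager's actual `store.go` code; the file name is also illustrative.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// withFileLock opens path, takes an exclusive advisory lock (flock), runs fn,
// and releases the lock. Other cooperating processes using flock on the same
// file block until the lock is released.
func withFileLock(path string, fn func(f *os.File) error) error {
	f, err := os.OpenFile(path, os.O_RDWR|os.O_CREATE, 0o600)
	if err != nil {
		return err
	}
	defer f.Close()

	if err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX); err != nil {
		return err
	}
	defer syscall.Flock(int(f.Fd()), syscall.LOCK_UN)

	return fn(f)
}

func main() {
	err := withFileLock("users.enc", func(f *os.File) error {
		// Read-modify-write the user store here; the lock serializes
		// concurrent manager instances touching the same file.
		fmt.Println("holding exclusive lock on", f.Name())
		return nil
	})
	if err != nil {
		fmt.Println("lock error:", err)
	}
}
```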