# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Project Overview
This is a distributed internet network mapping system that performs pings and traceroutes across geographically diverse nodes to build a continuously evolving map of internet routes. The system is designed to be resilient to node failures, network instability, and imperfect infrastructure (Raspberry Pis, consumer NAT, 4G/LTE connections).
Core concept: Bootstrap with ~19,000 cloud provider IPs → ping targets → traceroute responders → extract intermediate hops → feed hops back as new targets → build organic graph of internet routes over time.
## Multi-Instance Production Deployment
CRITICAL: All services are designed to run with multiple instances in production. This architectural constraint must be considered in all design decisions:
### State Management
- Avoid local in-memory state for coordination or shared data
- Use external stores (files, databases, shared storage) for state that must persist across instances
- Current input_service uses per-consumer file-based state tracking - each instance maintains its own consumer mappings
- Current ping_service uses in-memory cooldown cache - acceptable because workers are distributed and some overlap is tolerable
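As an illustration of the file-based approach, a per-instance state store can be as small as a JSON file that is loaded on startup and rewritten on change. This is a minimal sketch; the type and field names are hypothetical, not the real input_service state format:

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// ConsumerState is a hypothetical shape for per-consumer progress tracking;
// the field names are illustrative, not the actual input_service format.
type ConsumerState struct {
	Cursor map[string]int      `json:"cursor"` // position in the target stream per consumer
	Served map[string][]string `json:"served"` // IPs already handed to each consumer
}

// saveState persists the state to disk so it survives restarts of this instance.
func saveState(path string, s *ConsumerState) error {
	data, err := json.MarshalIndent(s, "", "  ")
	if err != nil {
		return err
	}
	tmp := path + ".tmp"
	if err := os.WriteFile(tmp, data, 0o644); err != nil {
		return err
	}
	return os.Rename(tmp, path) // write-then-rename avoids torn files on crash
}

// loadState restores state on startup; a missing file just means a fresh instance.
func loadState(path string) (*ConsumerState, error) {
	s := &ConsumerState{Cursor: map[string]int{}, Served: map[string][]string{}}
	data, err := os.ReadFile(path)
	if os.IsNotExist(err) {
		return s, nil
	}
	if err != nil {
		return nil, err
	}
	return s, json.Unmarshal(data, s)
}

func main() {
	s, err := loadState("input_state.json")
	if err != nil {
		panic(err)
	}
	s.Cursor["consumer-a"]++
	if err := saveState("input_state.json", s); err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", *s)
}
```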
### Coordination Requirements
- ping_service: Multiple workers can ping the same targets (cooldown prevents excessive frequency)
- input_service: Multiple instances serve different consumers independently; per-consumer state prevents duplicate work for the same client
- output_service: Must handle concurrent writes from multiple ping_service instances safely
- manager: Session management currently in-memory - needs external session store for multi-instance deployment
### Design Implications
- Services must be stateless where possible, or use shared external state
- Database/storage layer must handle concurrent access correctly
- Load balancing between instances should be connection-based for input_service (maintains per-consumer state)
- Race conditions and distributed coordination must be considered for shared resources
### Current Implementation Status
- input_service: Partially multi-instance ready (per-consumer state is instance-local, which works if clients stick to one instance)
- ping_service: Fully multi-instance ready (distributed workers by design)
- output_service: Fully multi-instance ready (each instance maintains its own SQLite database)
- manager: Not multi-instance ready (in-memory sessions, user store reload assumes single instance)
## Architecture Components
### 1. ping_service (Root Directory)
The worker agent that runs on each distributed node.
- Language: Go
- Main file: `ping_service.go`
- Responsibilities: Execute ICMP/TCP pings, apply per-IP cooldowns, run traceroute on successes, output structured JSON results, expose health/metrics endpoints
- Configuration: `config.yaml` - supports file/HTTP/Unix socket for input/output
- Deployment: Designed to run unattended under systemd on Debian-based systems
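The per-IP cooldown can be pictured as an instance-local map guarded by a mutex. The following is a minimal sketch of that technique, not the actual ping_service implementation, and the 15-minute value is just an example:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// cooldownCache remembers when each IP was last pinged so a worker does not
// hit the same target more often than the configured cooldown allows.
// It is instance-local; overlap between distributed workers is tolerated.
type cooldownCache struct {
	mu       sync.Mutex
	lastSeen map[string]time.Time
	cooldown time.Duration
}

func newCooldownCache(cooldown time.Duration) *cooldownCache {
	return &cooldownCache{lastSeen: make(map[string]time.Time), cooldown: cooldown}
}

// Allow reports whether the IP may be pinged now and, if so, records the attempt.
func (c *cooldownCache) Allow(ip string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if t, ok := c.lastSeen[ip]; ok && time.Since(t) < c.cooldown {
		return false
	}
	c.lastSeen[ip] = time.Now()
	return true
}

func main() {
	cache := newCooldownCache(15 * time.Minute) // cooldown_minutes: 15 (illustrative)
	fmt.Println(cache.Allow("192.0.2.1"))       // true: first contact
	fmt.Println(cache.Allow("192.0.2.1"))       // false: still cooling down
}
```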
### 2. input_service/
HTTP service that feeds IP addresses to ping workers with subnet interleaving.
- Main file: `http_input_service.go`
- Responsibilities: Serve individual IPs with subnet interleaving (avoids consecutive IPs from the same subnet), maintain per-consumer state, accept discovered hops from output_service via the `/hops` endpoint
- Data source: Expects a `./cloud-provider-ip-addresses/` directory with `.txt` files containing CIDR ranges
- Features: 10-CIDR interleaving, per-consumer + global deduplication, hop discovery feedback loop, lazy CIDR expansion, persistent state (export/import), IPv4 filtering, graceful shutdown
- API Endpoints: `/` (GET - serve IP), `/hops` (POST - accept discovered hops), `/status`, `/export`, `/import`
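A hedged sketch of the interleaving idea: keep a pool of lazily expanded CIDR generators and rotate round-robin between them, so consecutive answers come from different subnets. Names, pool handling, and the example CIDRs are illustrative, not the actual http_input_service code:

```go
package main

import (
	"fmt"
	"net"
)

// cidrGen lazily yields addresses from one CIDR without expanding it up front.
type cidrGen struct {
	ipnet *net.IPNet
	cur   net.IP
}

func newCIDRGen(cidr string) (*cidrGen, error) {
	_, ipnet, err := net.ParseCIDR(cidr)
	if err != nil {
		return nil, err
	}
	// To4() is nil for IPv6, which makes the generator yield nothing (IPv4-only).
	return &cidrGen{ipnet: ipnet, cur: ipnet.IP.To4()}, nil
}

// Next returns the next IPv4 address in the range, or ok=false when exhausted.
func (g *cidrGen) Next() (string, bool) {
	if g.cur == nil || !g.ipnet.Contains(g.cur) {
		return "", false
	}
	out := g.cur.String()
	next := make(net.IP, len(g.cur))
	copy(next, g.cur)
	for i := len(next) - 1; i >= 0; i-- { // increment the address by one
		next[i]++
		if next[i] != 0 {
			break
		}
	}
	g.cur = next
	return out, true
}

// interleave walks round-robin across the active generators so consecutive
// answers come from different subnets; exhausted generators are dropped.
func interleave(gens []*cidrGen, n int) []string {
	var out []string
	for i := 0; len(out) < n && len(gens) > 0; i++ {
		idx := i % len(gens)
		ip, ok := gens[idx].Next()
		if !ok {
			gens = append(gens[:idx], gens[idx+1:]...) // a real service would refill the pool here
			i--
			continue
		}
		out = append(out, ip)
	}
	return out
}

func main() {
	var gens []*cidrGen
	for _, c := range []string{"198.51.100.0/30", "203.0.113.0/30", "192.0.2.0/30"} {
		g, err := newCIDRGen(c)
		if err != nil {
			panic(err)
		}
		gens = append(gens, g)
	}
	fmt.Println(interleave(gens, 9)) // alternates between the three test subnets
}
```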
### 3. output_service/
HTTP service that receives and stores ping/traceroute results.
- Main file: `main.go`
- Responsibilities: Store ping/traceroute results in SQLite, extract intermediate hops, forward discovered hops to input_service, provide reporting/metrics API
- Database: SQLite with automatic rotation (weekly OR 100MB, keep 5 files)
- Features: Hop deduplication, remote database dumps, Prometheus metrics, health checks
- Multi-instance: Each instance maintains its own database, can be aggregated later
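The hop feedback loop amounts to collecting the unique intermediate hop IPs from a traceroute result and POSTing them to the input service's `/hops` endpoint (default `http://localhost:8080/hops`). The JSON shapes below are assumptions for illustration only; the real output_service schema may differ:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// tracerouteResult is an assumed shape for a worker result; the real schema
// used by ping_service/output_service may differ.
type tracerouteResult struct {
	Target string   `json:"target"`
	Hops   []string `json:"hops"` // intermediate hop IPs in TTL order
}

// extractHops returns the unique, answered hop IPs, excluding the target itself.
func extractHops(res tracerouteResult) []string {
	seen := map[string]bool{res.Target: true}
	var hops []string
	for _, h := range res.Hops {
		if h == "" || h == "*" || seen[h] {
			continue
		}
		seen[h] = true
		hops = append(hops, h)
	}
	return hops
}

// forwardHops POSTs discovered hops to the input service so they become new
// targets. The request body shape here is an assumption.
func forwardHops(inputURL string, hops []string) error {
	body, err := json.Marshal(map[string][]string{"ips": hops})
	if err != nil {
		return err
	}
	resp, err := http.Post(inputURL, "application/json", bytes.NewReader(body))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 300 {
		return fmt.Errorf("input service returned %s", resp.Status)
	}
	return nil
}

func main() {
	res := tracerouteResult{
		Target: "93.184.216.34",
		Hops:   []string{"192.168.1.1", "*", "62.115.136.1", "93.184.216.34"},
	}
	hops := extractHops(res)
	fmt.Println("discovered hops:", hops)
	// Default --input-url from the output_service flags.
	_ = forwardHops("http://localhost:8080/hops", hops)
}
```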
### 4. manager/
Centralized web UI and control plane with TOTP authentication.
- Main file: `main.go`
- Responsibilities: Web UI for system observation, control/coordination, certificate/crypto handling (AES-GCM double encryption), Dynamic DNS (dy.fi) integration, fail2ban-ready security logging, worker registration and monitoring, optional gateway/proxy for external workers
- Security: TOTP two-factor auth, Let's Encrypt ACME support, encrypted user store, rate limiting, API key management (for gateway)
- Additional modules: `store.go`, `logger.go`, `template.go`, `crypto.go`, `cert.go`, `dyfi.go`, `gr.go`, `workers.go`, `handlers.go`, `security.go`, `proxy.go`, `apikeys.go`
- Features: Worker auto-discovery, health polling (60s), dashboard UI, gateway mode (optional), multi-instance dy.fi failover
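The details of `crypto.go`'s double-encryption scheme are not documented here; the snippet below is only a generic AES-GCM seal/open sketch using Go's standard library, to illustrate the primitive the manager builds on (deriving the key from `SERVER_KEY` is an assumption):

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"fmt"
)

// seal encrypts plaintext with AES-GCM, prepending the random nonce to the output.
func seal(key, plaintext []byte) ([]byte, error) {
	block, err := aes.NewCipher(key) // key must be 16, 24, or 32 bytes
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	return gcm.Seal(nonce, nonce, plaintext, nil), nil
}

// open reverses seal: splits off the nonce, then authenticates and decrypts the rest.
func open(key, sealed []byte) ([]byte, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	if len(sealed) < gcm.NonceSize() {
		return nil, fmt.Errorf("ciphertext too short")
	}
	nonce, ct := sealed[:gcm.NonceSize()], sealed[gcm.NonceSize():]
	return gcm.Open(nil, nonce, ct, nil)
}

func main() {
	key := make([]byte, 32) // e.g. decoded from the 32-byte SERVER_KEY (assumption)
	if _, err := rand.Read(key); err != nil {
		panic(err)
	}
	ct, _ := seal(key, []byte("user store entry"))
	pt, _ := open(key, ct)
	fmt.Println(string(pt))
}
```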
## Service Discovery
All services (input, ping, output) expose a `/service-info` endpoint that returns:
```json
{
  "service_type": "input|ping|output",
  "version": "1.0.0",
  "name": "service_name",
  "instance_id": "hostname",
  "capabilities": ["feature1", "feature2"]
}
```
Purpose: Enables automatic worker type detection in the manager. When registering a worker, you only need to provide the URL - the manager queries `/service-info` to determine:
- Service type (input/ping/output)
- Suggested name (generated from service name + instance ID)
Location of endpoint:
- input_service: `http://host:8080/service-info`
- ping_service: `http://host:PORT/service-info` (on health check port)
- output_service: `http://host:HEALTH_PORT/service-info` (on health check server)
Manager behavior:
- If worker registration omits `type`, the manager calls `/service-info` to auto-detect it
- If auto-detection fails, registration fails with a helpful error message
- Manual type override is always available
- Auto-generated names can be overridden during registration
Note: This only works for internal workers that the manager can reach (e.g., on WireGuard). External workers behind NAT use the gateway with API keys (see GATEWAY.md).
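For reference, a minimal `/service-info` handler returning the shape above might look like the sketch below; the capability strings and port are illustrative, not the exact values the services emit:

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
	"os"
)

// serviceInfo mirrors the /service-info response shape documented above.
type serviceInfo struct {
	ServiceType  string   `json:"service_type"`
	Version      string   `json:"version"`
	Name         string   `json:"name"`
	InstanceID   string   `json:"instance_id"`
	Capabilities []string `json:"capabilities"`
}

func main() {
	host, _ := os.Hostname()
	info := serviceInfo{
		ServiceType:  "output", // one of input|ping|output
		Version:      "1.0.0",
		Name:         "output_service",
		InstanceID:   host,
		Capabilities: []string{"sqlite", "hop-extraction"}, // illustrative values
	}
	http.HandleFunc("/service-info", func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(info)
	})
	// Served on the health-check port in the real services; 8091 is the output_service default.
	log.Fatal(http.ListenAndServe(":8091", nil))
}
```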
## Common Commands
### Building Components
```bash
# Build ping_service (root)
go build -o ping_service

# Build input_service
cd input_service
go build -ldflags="-s -w" -o http_input_service http_input_service.go

# Build output_service
cd output_service
go build -o output_service main.go

# Build manager
cd manager
go mod tidy
go build -o manager
```
### Running Services
```bash
# Run ping_service with verbose logging
./ping_service -config config.yaml -verbose

# Run input_service (serves on :8080)
cd input_service
./http_input_service

# Run output_service (serves on :8081 for results, :8091 for health)
cd output_service
./output_service --verbose

# Run manager in development (self-signed certs)
cd manager
go run . --port=8080

# Run manager in production (Let's Encrypt)
sudo go run . --port=443 --domain=example.dy.fi --email=admin@example.com
```
### Installing ping_service as a systemd Service
```bash
chmod +x install.sh
sudo ./install.sh
sudo systemctl start ping-service
sudo systemctl status ping-service
sudo journalctl -u ping-service -f
```
### Manager User Management
```bash
# Add a new user (generates TOTP QR code)
cd manager
go run . --add-user=username
```
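The TOTP flow behind this command relies on the `github.com/pquerna/otp` dependency. A rough sketch of enrollment and code verification with that library follows; the issuer/account values are placeholders, and this is not the manager's actual user-store code:

```go
package main

import (
	"fmt"

	"github.com/pquerna/otp/totp"
)

func main() {
	// Enrollment: generate a secret and an otpauth:// URL the user can load
	// into an authenticator app (the manager renders this as a QR code).
	key, err := totp.Generate(totp.GenerateOpts{
		Issuer:      "manager.example", // placeholder issuer
		AccountName: "username",        // placeholder account
	})
	if err != nil {
		panic(err)
	}
	fmt.Println("secret:", key.Secret())
	fmt.Println("provisioning URL:", key.String())

	// Login: validate a 6-digit code the user typed against the stored secret.
	code := "123456" // would come from the login form
	ok := totp.Validate(code, key.Secret())
	fmt.Println("code accepted:", ok)
}
```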
## Configuration
### ping_service (config.yaml)
- `input_file`: IP source - HTTP endpoint, file path, or Unix socket
- `output_file`: Results destination - HTTP endpoint, file path, or Unix socket
- `interval_seconds`: Poll interval between runs
- `cooldown_minutes`: Minimum time between pinging the same IP
- `enable_traceroute`: Enable traceroute on successful pings
- `traceroute_max_hops`: Maximum TTL for traceroute
- `health_check_port`: Port for `/health`, `/ready`, `/metrics` endpoints
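These keys map naturally onto a small struct parsed with `gopkg.in/yaml.v3` (the listed config dependency). The sketch below shows that mapping; the Go field names are assumptions, only the yaml tags come from the keys above:

```go
package main

import (
	"fmt"
	"os"

	"gopkg.in/yaml.v3"
)

// Config mirrors the config.yaml keys documented above.
type Config struct {
	InputFile         string `yaml:"input_file"`
	OutputFile        string `yaml:"output_file"`
	IntervalSeconds   int    `yaml:"interval_seconds"`
	CooldownMinutes   int    `yaml:"cooldown_minutes"`
	EnableTraceroute  bool   `yaml:"enable_traceroute"`
	TracerouteMaxHops int    `yaml:"traceroute_max_hops"`
	HealthCheckPort   int    `yaml:"health_check_port"`
}

func main() {
	data, err := os.ReadFile("config.yaml")
	if err != nil {
		panic(err)
	}
	var cfg Config
	if err := yaml.Unmarshal(data, &cfg); err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", cfg)
}
```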
### output_service (CLI Flags)
- `--port`: Port for receiving results (default 8081)
- `--health-port`: Port for health/metrics (default 8091)
- `--input-url`: Input service URL for hop submission (default `http://localhost:8080/hops`)
- `--db-dir`: Directory for database files (default `./output_data`)
- `--max-size-mb`: Max DB size in MB before rotation (default 100)
- `--rotation-days`: Rotate DB after N days (default 7)
- `--keep-files`: Number of DB files to keep (default 5)
- `-v, --verbose`: Enable verbose logging
### manager (Environment Variables)
- `SERVER_KEY`: 32-byte base64 key for encryption (auto-generated if missing)
- `DYFI_DOMAIN`, `DYFI_USER`, `DYFI_PASS`: Dynamic DNS configuration
- `ACME_EMAIL`: Email for Let's Encrypt notifications
- `LOG_FILE`: Path for fail2ban-ready authentication logs
- `MANAGER_PORT`: HTTP/HTTPS port (default from flag)
## Key Design Principles
- Fault Tolerance: Nodes can join/leave freely, partial failures expected
- Network Reality: Designed for imperfect infrastructure (NAT, 4G, low-end hardware)
- No Time Guarantees: Latency variations normal, no assumption of always-online workers
- Organic Growth: System learns by discovering hops and feeding them back as targets
- Security: Manager requires TOTP auth, double-encrypted storage, fail2ban integration
## Dependencies
### ping_service
- `github.com/go-ping/ping` - ICMP ping library
- `gopkg.in/yaml.v3` - YAML config parsing
- Go 1.25.0
### output_service
- `github.com/mattn/go-sqlite3` - SQLite driver (requires CGO)
- Go 1.25.0
### manager
- `github.com/pquerna/otp` - TOTP authentication
- `golang.org/x/crypto/acme/autocert` - Let's Encrypt integration
## Data Flow
1. `input_service` serves IPs from CIDR ranges (or accepts discovered hops)
2. `ping_service` nodes poll input_service, ping targets with cooldown enforcement
3. Successful pings trigger optional traceroute (ICMP/TCP)
4. Results (JSON) sent to `output_service` (HTTP/file/socket)
5. `output_service` extracts intermediate hops from traceroute data
6. New hops fed back into `input_service` target pool
7. `manager` provides visibility and control over the system
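Seen from a single worker, steps 1-4 form a poll-ping-report loop. The sketch below shows only that skeleton: the ping/traceroute step is stubbed out, the result payload shape is an assumption, and the endpoints are the defaults documented in this file:

```go
package main

import (
	"bytes"
	"encoding/json"
	"io"
	"log"
	"net/http"
	"time"
)

// pollTarget asks the input service for the next IP to probe.
func pollTarget(inputURL string) (string, error) {
	resp, err := http.Get(inputURL)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	ip, err := io.ReadAll(resp.Body)
	return string(bytes.TrimSpace(ip)), err
}

// report sends a JSON result to the output service.
func report(outputURL string, payload []byte) error {
	resp, err := http.Post(outputURL, "application/json", bytes.NewReader(payload))
	if err != nil {
		return err
	}
	return resp.Body.Close()
}

func main() {
	const (
		inputURL  = "http://localhost:8080/"        // input_service: GET serves one IP
		outputURL = "http://localhost:8081/results" // output_service: POST results
		interval  = 10 * time.Second                // interval_seconds (illustrative value)
	)
	for {
		ip, err := pollTarget(inputURL)
		if err != nil || ip == "" {
			time.Sleep(interval)
			continue
		}
		// A real worker applies the cooldown, runs the ICMP/TCP ping and the
		// optional traceroute here; the payload shape below is an assumption.
		payload, _ := json.Marshal(map[string]any{"target": ip, "reachable": true})
		if err := report(outputURL, payload); err != nil {
			log.Println("report failed:", err)
		}
		time.Sleep(interval)
	}
}
```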
## Health Endpoints
### ping_service (port 8090)
- `GET /health` - Status, uptime, ping statistics
- `GET /ready` - Readiness check
- `GET /metrics` - Prometheus-compatible metrics
### output_service (port 8091)
- `GET /health` - Status, uptime, processing statistics
- `GET /ready` - Readiness check (verifies database connectivity)
- `GET /metrics` - Prometheus-compatible metrics
- `GET /stats` - Detailed statistics in JSON format
- `GET /recent?limit=100&ip=8.8.8.8` - Query recent ping results
### output_service API endpoints (port 8081)
- `POST /results` - Receive ping results from ping_service nodes
- `POST /rotate` - Manually trigger database rotation
- `GET /dump` - Download current SQLite database file
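Because each instance keeps its own SQLite file, central aggregation can be as simple as pulling `/dump` from every instance. A minimal sketch, where the instance URLs are placeholders:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
)

// fetchDump downloads one instance's current SQLite database to a local file.
func fetchDump(baseURL, outPath string) error {
	resp, err := http.Get(baseURL + "/dump")
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("dump failed: %s", resp.Status)
	}
	f, err := os.Create(outPath)
	if err != nil {
		return err
	}
	defer f.Close()
	_, err = io.Copy(f, resp.Body)
	return err
}

func main() {
	// Placeholder instance URLs; 8081 is the default results/API port.
	instances := []string{"http://node-a:8081", "http://node-b:8081"}
	for i, base := range instances {
		out := fmt.Sprintf("output_%d.sqlite", i)
		if err := fetchDump(base, out); err != nil {
			fmt.Println("skip", base, ":", err)
			continue
		}
		fmt.Println("saved", out)
	}
}
```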
## Project Status
- Functional distributed ping + traceroute workers
- Input service with persistent state and lazy CIDR expansion
- Output service with SQLite storage, rotation, hop extraction, and feedback loop
- Manager with TOTP auth, encryption, Let's Encrypt, dy.fi integration
- Mapping and visualization still exploratory
## Important Notes
- Visualization strategy is an open problem (no finalized design)
- System currently bootstrapped with ~19,000 cloud provider IPs
- Traceroute supports both ICMP and TCP methods
- Manager logs `AUTH_FAILURE` events with IP for fail2ban filtering
- Input service interleaving: Maintains 10 active CIDR generators, rotates between them to avoid consecutive IPs from the same /24 or /29 subnet
- Input service deduplication: Per-consumer (prevents re-serving) and global (prevents re-adding from hops)
- Hop feedback loop: Output service extracts hops → POSTs to input service `/hops` → input service adds to all consumer pools → organic target growth
- Input service maintains per-consumer progress state (can be exported/imported)
- Output service rotates databases weekly OR at 100MB (whichever first), keeping 5 files
- Each output_service instance maintains its own database; use `/dump` for central aggregation
- For multi-instance input_service, use session affinity or call `/hops` on all instances