CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview

This is a distributed internet network mapping system that performs pings and traceroutes across geographically diverse nodes to build a continuously evolving map of internet routes. The system is designed to be resilient to node failures, network instability, and imperfect infrastructure (Raspberry Pis, consumer NAT, 4G/LTE connections).

Core concept: Bootstrap with ~19,000 cloud provider IPs → ping targets → traceroute responders → extract intermediate hops → feed hops back as new targets → build organic graph of internet routes over time.

Multi-Instance Production Deployment

CRITICAL: All services are designed to run with multiple instances in production. This architectural constraint must be considered in all design decisions:

State Management

  • Avoid local in-memory state for coordination or shared data
  • Use external stores (files, databases, shared storage) for state that must persist across instances
  • Current input_service uses per-consumer file-based state tracking - each instance maintains its own consumer mappings
  • Current ping_service uses an in-memory cooldown cache - acceptable because workers are distributed and some overlap is tolerable (a minimal sketch follows this list)
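
As a concrete illustration of the instance-local cooldown mentioned above, here is a minimal Go sketch of a per-IP cooldown cache. The type and method names are illustrative, not the actual ping_service.go implementation.

package cooldown

import (
    "sync"
    "time"
)

// cache tracks the last time each IP was pinged by this instance.
// It is deliberately in-memory: other instances keep their own caches,
// so the same IP may occasionally be pinged by more than one worker.
type cache struct {
    mu       sync.Mutex
    lastSeen map[string]time.Time
    cooldown time.Duration
}

func newCache(cooldown time.Duration) *cache {
    return &cache{lastSeen: make(map[string]time.Time), cooldown: cooldown}
}

// Allow reports whether ip may be pinged now and, if so, records the attempt.
func (c *cache) Allow(ip string) bool {
    c.mu.Lock()
    defer c.mu.Unlock()
    if t, ok := c.lastSeen[ip]; ok && time.Since(t) < c.cooldown {
        return false // still cooling down
    }
    c.lastSeen[ip] = time.Now()
    return true
}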

Coordination Requirements

  • ping_service: Multiple workers can ping the same targets (cooldown prevents excessive frequency)
  • input_service: Multiple instances serve different consumers independently; per-consumer state prevents duplicate work for the same client
  • output_service: Must handle concurrent writes from multiple ping_service instances safely
  • manager: Session management currently in-memory - needs external session store for multi-instance deployment

Design Implications

  • Services must be stateless where possible, or use shared external state
  • Database/storage layer must handle concurrent access correctly
  • Load balancing for input_service should use session affinity (sticky connections), since each instance maintains per-consumer state
  • Race conditions and distributed coordination must be considered for shared resources

Current Implementation Status

  • input_service: Partially multi-instance ready (per-consumer state is instance-local, which works if clients stick to one instance)
  • ping_service: Fully multi-instance ready (distributed workers by design)
  • output_service: Fully multi-instance ready (each instance maintains its own SQLite database)
  • manager: Not multi-instance ready (in-memory sessions, user store reload assumes single instance)

Architecture Components

1. ping_service (Root Directory)

The worker agent that runs on each distributed node.

  • Language: Go
  • Main file: ping_service.go
  • Responsibilities: Execute ICMP/TCP pings, apply per-IP cooldowns, run traceroute on successes, output structured JSON results, expose health/metrics endpoints
  • Configuration: config.yaml - supports file/HTTP/Unix socket for input/output
  • Deployment: Designed to run unattended under systemd on Debian-based systems

2. input_service/

HTTP service that feeds IP addresses to ping workers with subnet interleaving.

  • Main file: http_input_service.go
  • Responsibilities: Serve individual IPs with subnet interleaving (avoids consecutive IPs from same subnet), maintain per-consumer state, accept discovered hops from output_service via /hops endpoint
  • Data source: Expects ./cloud-provider-ip-addresses/ directory with .txt files containing CIDR ranges
  • Features: 10-CIDR interleaving (see the sketch after this list), per-consumer + global deduplication, hop discovery feedback loop, lazy CIDR expansion, persistent state (export/import), IPv4 filtering, graceful shutdown
  • API Endpoints: / (GET - serve IP), /hops (POST - accept discovered hops), /status, /export, /import
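
A minimal Go sketch of the 10-generator interleaving idea, assuming lazy expansion from a pending CIDR queue. Type and method names are illustrative, not the actual http_input_service.go code, and this sketch does not skip network/broadcast addresses.

package interleave

import "net/netip"

// cidrGen lazily yields addresses from a single prefix.
type cidrGen struct {
    next netip.Addr
    pfx  netip.Prefix
}

func newCIDRGen(p netip.Prefix) *cidrGen {
    p = p.Masked() // start at the network address
    return &cidrGen{next: p.Addr(), pfx: p}
}

// Next returns the next address in the prefix, or false when exhausted.
func (g *cidrGen) Next() (netip.Addr, bool) {
    if !g.pfx.Contains(g.next) {
        return netip.Addr{}, false
    }
    a := g.next
    g.next = g.next.Next()
    return a, true
}

// Interleaver round-robins across up to 10 active generators so that
// consecutive IPs come from different subnets; exhausted generators are
// replaced from the pending CIDR queue (lazy expansion).
type Interleaver struct {
    active  []*cidrGen
    pending []netip.Prefix
    idx     int
}

func NewInterleaver(prefixes []netip.Prefix) *Interleaver {
    return &Interleaver{pending: prefixes}
}

func (it *Interleaver) refill() {
    for len(it.active) < 10 && len(it.pending) > 0 {
        it.active = append(it.active, newCIDRGen(it.pending[0]))
        it.pending = it.pending[1:]
    }
}

// NextIP returns the next interleaved address, or false when all CIDRs are drained.
func (it *Interleaver) NextIP() (netip.Addr, bool) {
    it.refill()
    for len(it.active) > 0 {
        it.idx %= len(it.active)
        if a, ok := it.active[it.idx].Next(); ok {
            it.idx++ // rotate to a different subnet on the next call
            return a, true
        }
        // drop the exhausted generator and pull in the next pending CIDR
        it.active = append(it.active[:it.idx], it.active[it.idx+1:]...)
        it.refill()
    }
    return netip.Addr{}, false
}

Because exhausted generators are replaced from the pending queue, the service keeps roughly 10 subnets in rotation at any time, which is what prevents consecutive IPs from landing in the same subnet.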

3. output_service/

HTTP service that receives and stores ping/traceroute results.

  • Main file: main.go
  • Responsibilities: Store ping/traceroute results in SQLite, extract intermediate hops, forward discovered hops to input_service (see the sketch after this list), provide reporting/metrics API
  • Database: SQLite with automatic rotation (weekly OR 100MB, keep 5 files)
  • Features: Hop deduplication, remote database dumps, Prometheus metrics, health checks
  • Multi-instance: Each instance maintains its own database, can be aggregated later
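
A hedged sketch of the hop extraction and forwarding step. The result shape and the {"hops": [...]} request body are assumptions for illustration; the real JSON schemas are defined by ping_service and the input service's /hops handler.

package hops

import (
    "bytes"
    "encoding/json"
    "net/http"
)

// tracerouteResult is an illustrative shape for an incoming result;
// the real schema is defined by ping_service, not this sketch.
type tracerouteResult struct {
    Target string   `json:"target"`
    Hops   []string `json:"hops"` // intermediate hop IPs, "" for timeouts
}

// extractHops returns the unique, non-empty intermediate hops,
// excluding the final target itself.
func extractHops(r tracerouteResult) []string {
    seen := make(map[string]bool)
    var out []string
    for _, h := range r.Hops {
        if h == "" || h == r.Target || seen[h] {
            continue
        }
        seen[h] = true
        out = append(out, h)
    }
    return out
}

// forwardHops POSTs discovered hops to the input service's /hops endpoint
// (default http://localhost:8080/hops, configurable via --input-url).
// The payload shape here is an assumption, not the documented wire format.
func forwardHops(inputURL string, hops []string) error {
    body, err := json.Marshal(map[string][]string{"hops": hops})
    if err != nil {
        return err
    }
    resp, err := http.Post(inputURL, "application/json", bytes.NewReader(body))
    if err != nil {
        return err
    }
    defer resp.Body.Close()
    return nil
}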

4. manager/

Centralized web UI and control plane with TOTP authentication.

  • Main file: main.go
  • Responsibilities: Web UI for system observation, control/coordination, certificate/crypto handling (AES-GCM double encryption), Dynamic DNS (dy.fi) integration, fail2ban-ready security logging, worker registration and monitoring, optional gateway/proxy for external workers
  • Security: TOTP two-factor auth, Let's Encrypt ACME support, encrypted user store, rate limiting, API key management (for gateway)
  • Additional modules: store.go, logger.go, template.go, crypto.go, cert.go, dyfi.go, gr.go, workers.go, handlers.go, security.go, proxy.go, apikeys.go
  • Features: Worker auto-discovery, health polling (60s; see the sketch after this list), dashboard UI, gateway mode (optional), multi-instance dy.fi failover
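
A minimal sketch of the 60-second health polling loop, assuming a callback-based worker list; the manager's actual types in workers.go will differ.

package manager

import (
    "context"
    "net/http"
    "time"
)

// pollWorkers checks each registered worker's /health endpoint once per
// minute, matching the 60s polling interval described above.
func pollWorkers(ctx context.Context, workerURLs func() []string, report func(url string, healthy bool)) {
    client := &http.Client{Timeout: 10 * time.Second}
    ticker := time.NewTicker(60 * time.Second)
    defer ticker.Stop()
    for {
        select {
        case <-ctx.Done():
            return
        case <-ticker.C:
            for _, url := range workerURLs() {
                resp, err := client.Get(url + "/health")
                healthy := err == nil && resp.StatusCode == http.StatusOK
                if resp != nil {
                    resp.Body.Close()
                }
                report(url, healthy)
            }
        }
    }
}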

Service Discovery

All services (input, ping, output) expose a /service-info endpoint that returns:

{
  "service_type": "input|ping|output",
  "version": "1.0.0",
  "name": "service_name",
  "instance_id": "hostname",
  "capabilities": ["feature1", "feature2"]
}

Purpose: Enables automatic worker type detection in the manager. When registering a worker, you only need to provide the URL - the manager queries /service-info to determine:

  • Service type (input/ping/output)
  • Suggested name (generated from service name + instance ID)

Location of endpoint:

  • input_service: http://host:8080/service-info
  • ping_service: http://host:PORT/service-info (on the health check port, default 8090)
  • output_service: http://host:HEALTH_PORT/service-info (on the health check server, default 8091)

Manager behavior:

  • If worker registration omits the type, the manager calls /service-info to auto-detect it (a minimal sketch follows the note below)
  • If auto-detection fails, registration fails with a helpful error message
  • Manual type override is always available
  • Auto-generated names can be overridden during registration

Note: This only works for internal workers that the manager can reach (e.g., on WireGuard). External workers behind NAT use the gateway with API keys (see GATEWAY.md).
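
A sketch of the auto-detection call, using the /service-info response fields documented above; the function and type names are illustrative, not the manager's actual code.

package manager

import (
    "encoding/json"
    "fmt"
    "net/http"
    "time"
)

// serviceInfo mirrors the /service-info response documented above.
type serviceInfo struct {
    ServiceType  string   `json:"service_type"`
    Version      string   `json:"version"`
    Name         string   `json:"name"`
    InstanceID   string   `json:"instance_id"`
    Capabilities []string `json:"capabilities"`
}

// detectServiceType queries a worker's /service-info endpoint so registration
// can omit the type. baseURL must point at the port that serves service-info
// (the health check port for ping/output workers).
func detectServiceType(baseURL string) (*serviceInfo, error) {
    client := &http.Client{Timeout: 5 * time.Second}
    resp, err := client.Get(baseURL + "/service-info")
    if err != nil {
        return nil, fmt.Errorf("service-info unreachable: %w", err)
    }
    defer resp.Body.Close()
    if resp.StatusCode != http.StatusOK {
        return nil, fmt.Errorf("service-info returned %s", resp.Status)
    }
    var info serviceInfo
    if err := json.NewDecoder(resp.Body).Decode(&info); err != nil {
        return nil, fmt.Errorf("invalid service-info payload: %w", err)
    }
    return &info, nil
}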

Common Commands

Building Components

# Build ping_service (root)
go build -o ping_service

# Build input_service
cd input_service
go build -ldflags="-s -w" -o http_input_service http_input_service.go

# Build output_service
cd output_service
go build -o output_service main.go

# Build manager
cd manager
go mod tidy
go build -o manager

Running Services

# Run ping_service with verbose logging
./ping_service -config config.yaml -verbose

# Run input_service (serves on :8080)
cd input_service
./http_input_service

# Run output_service (serves on :8081 for results, :8091 for health)
cd output_service
./output_service --verbose

# Run manager in development (self-signed certs)
cd manager
go run . --port=8080

# Run manager in production (Let's Encrypt)
sudo go run . --port=443 --domain=example.dy.fi --email=admin@example.com

Installing ping_service as systemd Service

chmod +x install.sh
sudo ./install.sh
sudo systemctl start ping-service
sudo systemctl status ping-service
sudo journalctl -u ping-service -f

Manager User Management

# Add new user (generates TOTP QR code)
cd manager
go run . --add-user=username

Configuration

ping_service (config.yaml)

  • input_file: IP source - HTTP endpoint, file path, or Unix socket
  • output_file: Results destination - HTTP endpoint, file path, or Unix socket
  • interval_seconds: Poll interval between runs
  • cooldown_minutes: Minimum time between pinging the same IP
  • enable_traceroute: Enable traceroute on successful pings
  • traceroute_max_hops: Maximum TTL for traceroute
  • health_check_port: Port for the /health, /ready, and /metrics endpoints (a config struct sketch follows this list)
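
For reference, a Go struct mirroring the keys listed above, with a small loader; the actual struct and field names in ping_service.go may differ.

package config

import (
    "os"

    "gopkg.in/yaml.v3"
)

// Config mirrors the documented config.yaml keys.
type Config struct {
    InputFile         string `yaml:"input_file"`
    OutputFile        string `yaml:"output_file"`
    IntervalSeconds   int    `yaml:"interval_seconds"`
    CooldownMinutes   int    `yaml:"cooldown_minutes"`
    EnableTraceroute  bool   `yaml:"enable_traceroute"`
    TracerouteMaxHops int    `yaml:"traceroute_max_hops"`
    HealthCheckPort   int    `yaml:"health_check_port"`
}

// Load reads and parses a config.yaml file.
func Load(path string) (*Config, error) {
    data, err := os.ReadFile(path)
    if err != nil {
        return nil, err
    }
    var cfg Config
    if err := yaml.Unmarshal(data, &cfg); err != nil {
        return nil, err
    }
    return &cfg, nil
}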

output_service (CLI Flags)

  • --port: Port for receiving results (default 8081)
  • --health-port: Port for health/metrics (default 8091)
  • --input-url: Input service URL for hop submission (default http://localhost:8080/hops)
  • --db-dir: Directory for database files (default ./output_data)
  • --max-size-mb: Max DB size in MB before rotation (default 100)
  • --rotation-days: Rotate DB after N days (default 7)
  • --keep-files: Number of DB files to keep (default 5)
  • -v, --verbose: Enable verbose logging

manager (Environment Variables)

  • SERVER_KEY: Base64-encoded 32-byte key for encryption (auto-generated if missing)
  • DYFI_DOMAIN, DYFI_USER, DYFI_PASS: Dynamic DNS configuration
  • ACME_EMAIL: Email for Let's Encrypt notifications
  • LOG_FILE: Path for fail2ban-ready authentication logs
  • MANAGER_PORT: HTTP/HTTPS port (default from flag)

Key Design Principles

  1. Fault Tolerance: Nodes can join/leave freely, partial failures expected
  2. Network Reality: Designed for imperfect infrastructure (NAT, 4G, low-end hardware)
  3. No Time Guarantees: Latency variations normal, no assumption of always-online workers
  4. Organic Growth: System learns by discovering hops and feeding them back as targets
  5. Security: Manager requires TOTP auth, double-encrypted storage, fail2ban integration

Dependencies

ping_service

  • github.com/go-ping/ping - ICMP ping library
  • gopkg.in/yaml.v3 - YAML config parsing
  • Go 1.25.0

output_service

  • github.com/mattn/go-sqlite3 - SQLite driver (requires CGO)
  • Go 1.25.0

manager

  • github.com/pquerna/otp - TOTP authentication
  • golang.org/x/crypto/acme/autocert - Let's Encrypt integration

Data Flow

  1. input_service serves IPs from CIDR ranges (or accepts discovered hops)
  2. ping_service nodes poll input_service, ping targets with cooldown enforcement
  3. Successful pings trigger optional traceroute (ICMP/TCP)
  4. Results (JSON) sent to output_service (HTTP/file/socket; see the payload sketch after this list)
  5. output_service extracts intermediate hops from traceroute data
  6. New hops fed back into input_service target pool
  7. manager provides visibility and control over the system
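
An illustrative sketch of step 4, posting one result to an output_service instance's /results endpoint. The field names in pingResult are assumptions for illustration, not the actual wire format used by ping_service.

package worker

import (
    "bytes"
    "encoding/json"
    "net/http"
    "time"
)

// pingResult is an illustrative payload; the real JSON schema is defined by
// ping_service and consumed by output_service's POST /results handler.
type pingResult struct {
    Target     string    `json:"target"`
    Success    bool      `json:"success"`
    RTTMillis  float64   `json:"rtt_ms"`
    Timestamp  time.Time `json:"timestamp"`
    Traceroute []string  `json:"traceroute,omitempty"` // hop IPs when traceroute ran
}

// postResult sends one result to an output_service instance,
// e.g. http://localhost:8081.
func postResult(outputURL string, r pingResult) error {
    body, err := json.Marshal(r)
    if err != nil {
        return err
    }
    resp, err := http.Post(outputURL+"/results", "application/json", bytes.NewReader(body))
    if err != nil {
        return err
    }
    defer resp.Body.Close()
    return nil
}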

Health Endpoints

ping_service (port 8090)

  • GET /health - Status, uptime, ping statistics
  • GET /ready - Readiness check
  • GET /metrics - Prometheus-compatible metrics

output_service (port 8091)

  • GET /health - Status, uptime, processing statistics
  • GET /ready - Readiness check (verifies database connectivity)
  • GET /metrics - Prometheus-compatible metrics
  • GET /stats - Detailed statistics in JSON format
  • GET /recent?limit=100&ip=8.8.8.8 - Query recent ping results

output_service API endpoints (port 8081)

  • POST /results - Receive ping results from ping_service nodes
  • POST /rotate - Manually trigger database rotation
  • GET /dump - Download current SQLite database file

Project Status

  • Functional distributed ping + traceroute workers
  • Input service with persistent state and lazy CIDR expansion
  • Output service with SQLite storage, rotation, hop extraction, and feedback loop
  • Manager with TOTP auth, encryption, Let's Encrypt, dy.fi integration
  • Mapping and visualization still exploratory

Important Notes

  • Visualization strategy is an open problem (no finalized design)
  • System currently bootstrapped with ~19,000 cloud provider IPs
  • Traceroute supports both ICMP and TCP methods
  • Manager logs AUTH_FAILURE events with IP for fail2ban filtering
  • Input service interleaving: Maintains 10 active CIDR generators and rotates between them to avoid serving consecutive IPs from the same /24 or /29 subnet
  • Input service deduplication: Per-consumer (prevents re-serving the same IP to a consumer) and global (prevents re-adding known IPs from hop feedback); see the sketch after this list
  • Hop feedback loop: Output service extracts hops → POSTs to input service /hops → input service adds to all consumer pools → organic target growth
  • Input service maintains per-consumer progress state (can be exported/imported)
  • Output service rotates databases weekly OR at 100MB (whichever first), keeping 5 files
  • Each output_service instance maintains its own database; use /dump for central aggregation
  • For multi-instance input_service, use session affinity or call /hops on all instances
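
A minimal sketch of the two deduplication layers, assuming simple in-memory sets keyed by IP string; names are illustrative rather than taken from http_input_service.go.

package dedup

import "sync"

// state keeps the two deduplication layers described above: a global set of
// every IP ever added (so hop feedback cannot re-add a known target) and a
// per-consumer set of every IP already served to that consumer.
type state struct {
    mu          sync.Mutex
    global      map[string]bool
    perConsumer map[string]map[string]bool
}

func newState() *state {
    return &state{global: make(map[string]bool), perConsumer: make(map[string]map[string]bool)}
}

// AddTarget returns true if ip was not already in the global pool.
func (s *state) AddTarget(ip string) bool {
    s.mu.Lock()
    defer s.mu.Unlock()
    if s.global[ip] {
        return false // already known, e.g. re-discovered via /hops
    }
    s.global[ip] = true
    return true
}

// MarkServed returns true if ip had not yet been served to this consumer.
func (s *state) MarkServed(consumer, ip string) bool {
    s.mu.Lock()
    defer s.mu.Unlock()
    set := s.perConsumer[consumer]
    if set == nil {
        set = make(map[string]bool)
        s.perConsumer[consumer] = set
    }
    if set[ip] {
        return false
    }
    set[ip] = true
    return true
}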