# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Project Overview
This is a distributed internet network mapping system that performs pings and traceroutes across geographically diverse nodes to build a continuously evolving map of internet routes. The system is designed to be resilient to node failures, network instability, and imperfect infrastructure (Raspberry Pis, consumer NAT, 4G/LTE connections).
Core concept: Bootstrap with ~19,000 cloud provider IPs → ping targets → traceroute responders → extract intermediate hops → feed hops back as new targets → build organic graph of internet routes over time.
## Multi-Instance Production Deployment
CRITICAL: All services are designed to run with multiple instances in production. This architectural constraint must be considered in all design decisions:
### State Management
- Avoid local in-memory state for coordination or shared data
- Use external stores (files, databases, shared storage) for state that must persist across instances
- Current input_service uses per-consumer file-based state tracking - each instance maintains its own consumer mappings
- Current ping_service uses in-memory cooldown cache - acceptable because workers are distributed and some overlap is tolerable
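As an illustration of the file-based approach, a per-instance state store can be as small as a JSON file that is loaded on startup and rewritten on change. This is a minimal sketch; the type and field names are hypothetical, not the real input_service state format:

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// ConsumerState is a hypothetical shape for per-consumer progress tracking;
// the field names are illustrative, not the actual input_service format.
type ConsumerState struct {
	Cursor map[string]int      `json:"cursor"` // position in the target stream per consumer
	Served map[string][]string `json:"served"` // IPs already handed to each consumer
}

// saveState persists the state to disk so it survives restarts of this instance.
func saveState(path string, s *ConsumerState) error {
	data, err := json.MarshalIndent(s, "", "  ")
	if err != nil {
		return err
	}
	tmp := path + ".tmp"
	if err := os.WriteFile(tmp, data, 0o644); err != nil {
		return err
	}
	return os.Rename(tmp, path) // write-then-rename avoids torn files on crash
}

// loadState restores state on startup; a missing file just means a fresh instance.
func loadState(path string) (*ConsumerState, error) {
	s := &ConsumerState{Cursor: map[string]int{}, Served: map[string][]string{}}
	data, err := os.ReadFile(path)
	if os.IsNotExist(err) {
		return s, nil
	}
	if err != nil {
		return nil, err
	}
	return s, json.Unmarshal(data, s)
}

func main() {
	s, err := loadState("input_state.json")
	if err != nil {
		panic(err)
	}
	s.Cursor["consumer-a"]++
	if err := saveState("input_state.json", s); err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", *s)
}
```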
### Coordination Requirements
- ping_service: Multiple workers can ping the same targets (cooldown prevents excessive frequency)
- input_service: Multiple instances serve different consumers independently; per-consumer state prevents duplicate work for the same client
- output_service: Must handle concurrent writes from multiple ping_service instances safely
- manager: Session management currently in-memory - needs external session store for multi-instance deployment
### Design Implications
- Services must be stateless where possible, or use shared external state
- Database/storage layer must handle concurrent access correctly
- Load balancing between instances should be connection-based for input_service (maintains per-consumer state)
- Race conditions and distributed coordination must be considered for shared resources
### Current Implementation Status
- input_service: Partially multi-instance ready (per-consumer state is instance-local, which works if clients stick to one instance)
- ping_service: Fully multi-instance ready (distributed workers by design)
- output_service: Fully multi-instance ready (each instance maintains its own SQLite database)
- manager: Not multi-instance ready (in-memory sessions, user store reload assumes single instance)
## Architecture Components
### 1. ping_service (Root Directory)
The worker agent that runs on each distributed node.
- Language: Go
- Main file: `ping_service.go`
- Responsibilities: Execute ICMP/TCP pings, apply per-IP cooldowns, run traceroute on successes, output structured JSON results, expose health/metrics endpoints
- Configuration: `config.yaml` - supports file/HTTP/Unix socket for input/output
- Deployment: Designed to run unattended under systemd on Debian-based systems
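The per-IP cooldown can be pictured as an instance-local map guarded by a mutex. The following is a minimal sketch of that technique, not the actual ping_service implementation, and the 15-minute value is just an example:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// cooldownCache remembers when each IP was last pinged so a worker does not
// hit the same target more often than the configured cooldown allows.
// It is instance-local; overlap between distributed workers is tolerated.
type cooldownCache struct {
	mu       sync.Mutex
	lastSeen map[string]time.Time
	cooldown time.Duration
}

func newCooldownCache(cooldown time.Duration) *cooldownCache {
	return &cooldownCache{lastSeen: make(map[string]time.Time), cooldown: cooldown}
}

// Allow reports whether the IP may be pinged now and, if so, records the attempt.
func (c *cooldownCache) Allow(ip string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if t, ok := c.lastSeen[ip]; ok && time.Since(t) < c.cooldown {
		return false
	}
	c.lastSeen[ip] = time.Now()
	return true
}

func main() {
	cache := newCooldownCache(15 * time.Minute) // cooldown_minutes: 15 (illustrative)
	fmt.Println(cache.Allow("192.0.2.1"))       // true: first contact
	fmt.Println(cache.Allow("192.0.2.1"))       // false: still cooling down
}
```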
### 2. input_service/
HTTP service that feeds IP addresses to ping workers with subnet interleaving.
- Main file: `http_input_service.go`
- Responsibilities: Serve individual IPs with subnet interleaving (avoids consecutive IPs from the same subnet), maintain per-consumer state, accept discovered hops from output_service via the `/hops` endpoint
- Data source: Expects a `./cloud-provider-ip-addresses/` directory with `.txt` files containing CIDR ranges
- Features: 10-CIDR interleaving, per-consumer + global deduplication, hop discovery feedback loop, lazy CIDR expansion, persistent state (export/import), IPv4 filtering, graceful shutdown
- API Endpoints: `/` (GET - serve IP), `/hops` (POST - accept discovered hops), `/status`, `/export`, `/import`
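A hedged sketch of the interleaving idea: keep a pool of lazily expanded CIDR generators and rotate round-robin between them, so consecutive answers come from different subnets. Names, pool handling, and the example CIDRs are illustrative, not the actual http_input_service code:

```go
package main

import (
	"fmt"
	"net"
)

// cidrGen lazily yields addresses from one CIDR without expanding it up front.
type cidrGen struct {
	ipnet *net.IPNet
	cur   net.IP
}

func newCIDRGen(cidr string) (*cidrGen, error) {
	_, ipnet, err := net.ParseCIDR(cidr)
	if err != nil {
		return nil, err
	}
	// To4() is nil for IPv6, which makes the generator yield nothing (IPv4-only).
	return &cidrGen{ipnet: ipnet, cur: ipnet.IP.To4()}, nil
}

// Next returns the next IPv4 address in the range, or ok=false when exhausted.
func (g *cidrGen) Next() (string, bool) {
	if g.cur == nil || !g.ipnet.Contains(g.cur) {
		return "", false
	}
	out := g.cur.String()
	next := make(net.IP, len(g.cur))
	copy(next, g.cur)
	for i := len(next) - 1; i >= 0; i-- { // increment the address by one
		next[i]++
		if next[i] != 0 {
			break
		}
	}
	g.cur = next
	return out, true
}

// interleave walks round-robin across the active generators so consecutive
// answers come from different subnets; exhausted generators are dropped.
func interleave(gens []*cidrGen, n int) []string {
	var out []string
	for i := 0; len(out) < n && len(gens) > 0; i++ {
		idx := i % len(gens)
		ip, ok := gens[idx].Next()
		if !ok {
			gens = append(gens[:idx], gens[idx+1:]...) // a real service would refill the pool here
			i--
			continue
		}
		out = append(out, ip)
	}
	return out
}

func main() {
	var gens []*cidrGen
	for _, c := range []string{"198.51.100.0/30", "203.0.113.0/30", "192.0.2.0/30"} {
		g, err := newCIDRGen(c)
		if err != nil {
			panic(err)
		}
		gens = append(gens, g)
	}
	fmt.Println(interleave(gens, 9)) // alternates between the three test subnets
}
```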
### 3. output_service/
HTTP service that receives and stores ping/traceroute results.
- Main file: `main.go`
- Responsibilities: Store ping/traceroute results in SQLite, extract intermediate hops, forward discovered hops to input_service, provide reporting/metrics API
- Database: SQLite with automatic rotation (weekly OR 100MB, keep 5 files)
- Features: Hop deduplication, remote database dumps, Prometheus metrics, health checks
- Multi-instance: Each instance maintains its own database, can be aggregated later
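The hop feedback loop amounts to collecting the unique intermediate hop IPs from a traceroute result and POSTing them to the input service's `/hops` endpoint (default `http://localhost:8080/hops`). The JSON shapes below are assumptions for illustration only; the real output_service schema may differ:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// tracerouteResult is an assumed shape for a worker result; the real schema
// used by ping_service/output_service may differ.
type tracerouteResult struct {
	Target string   `json:"target"`
	Hops   []string `json:"hops"` // intermediate hop IPs in TTL order
}

// extractHops returns the unique, answered hop IPs, excluding the target itself.
func extractHops(res tracerouteResult) []string {
	seen := map[string]bool{res.Target: true}
	var hops []string
	for _, h := range res.Hops {
		if h == "" || h == "*" || seen[h] {
			continue
		}
		seen[h] = true
		hops = append(hops, h)
	}
	return hops
}

// forwardHops POSTs discovered hops to the input service so they become new
// targets. The request body shape here is an assumption.
func forwardHops(inputURL string, hops []string) error {
	body, err := json.Marshal(map[string][]string{"ips": hops})
	if err != nil {
		return err
	}
	resp, err := http.Post(inputURL, "application/json", bytes.NewReader(body))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 300 {
		return fmt.Errorf("input service returned %s", resp.Status)
	}
	return nil
}

func main() {
	res := tracerouteResult{
		Target: "93.184.216.34",
		Hops:   []string{"192.168.1.1", "*", "62.115.136.1", "93.184.216.34"},
	}
	hops := extractHops(res)
	fmt.Println("discovered hops:", hops)
	// Default --input-url from the output_service flags.
	_ = forwardHops("http://localhost:8080/hops", hops)
}
```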
### 4. manager/
Centralized web UI and control plane with TOTP authentication.
- Main file: `main.go`
- Responsibilities: Web UI for system observation, control/coordination, certificate/crypto handling (AES-GCM double encryption), Dynamic DNS (dy.fi) integration, fail2ban-ready security logging, worker registration and monitoring, optional gateway/proxy for external workers
- Security: TOTP two-factor auth, Let's Encrypt ACME support, encrypted user store, rate limiting, API key management (for gateway)
- Additional modules: `store.go`, `logger.go`, `template.go`, `crypto.go`, `cert.go`, `dyfi.go`, `gr.go`, `workers.go`, `handlers.go`, `security.go`, `proxy.go`, `apikeys.go`
- Features: Worker auto-discovery, health polling (60s), dashboard UI, gateway mode (optional), multi-instance dy.fi failover
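The details of `crypto.go`'s double-encryption scheme are not documented here; the snippet below is only a generic AES-GCM seal/open sketch using Go's standard library, to illustrate the primitive the manager builds on (deriving the key from `SERVER_KEY` is an assumption):

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"fmt"
)

// seal encrypts plaintext with AES-GCM, prepending the random nonce to the output.
func seal(key, plaintext []byte) ([]byte, error) {
	block, err := aes.NewCipher(key) // key must be 16, 24, or 32 bytes
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	nonce := make([]byte, gcm.NonceSize())
	if _, err := rand.Read(nonce); err != nil {
		return nil, err
	}
	return gcm.Seal(nonce, nonce, plaintext, nil), nil
}

// open reverses seal: splits off the nonce, then authenticates and decrypts the rest.
func open(key, sealed []byte) ([]byte, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	gcm, err := cipher.NewGCM(block)
	if err != nil {
		return nil, err
	}
	if len(sealed) < gcm.NonceSize() {
		return nil, fmt.Errorf("ciphertext too short")
	}
	nonce, ct := sealed[:gcm.NonceSize()], sealed[gcm.NonceSize():]
	return gcm.Open(nil, nonce, ct, nil)
}

func main() {
	key := make([]byte, 32) // e.g. decoded from the 32-byte SERVER_KEY (assumption)
	if _, err := rand.Read(key); err != nil {
		panic(err)
	}
	ct, _ := seal(key, []byte("user store entry"))
	pt, _ := open(key, ct)
	fmt.Println(string(pt))
}
```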
## Service Discovery
All services (input, ping, output) expose a `/service-info` endpoint that returns:
```json
{
  "service_type": "input|ping|output",
  "version": "1.0.0",
  "name": "service_name",
  "instance_id": "hostname",
  "capabilities": ["feature1", "feature2"]
}
```
Purpose: Enables automatic worker type detection in the manager. When registering a worker, you only need to provide the URL - the manager queries `/service-info` to determine:
- Service type (input/ping/output)
- Suggested name (generated from service name + instance ID)
Location of endpoint:
- input_service: `http://host:8080/service-info`
- ping_service: `http://host:PORT/service-info` (on health check port)
- output_service: `http://host:HEALTH_PORT/service-info` (on health check server)
Manager behavior:
- If worker registration omits `type`, the manager calls `/service-info` to auto-detect it
- If auto-detection fails, registration fails with a helpful error message
- Manual type override is always available
- Auto-generated names can be overridden during registration
Note: This only works for internal workers that the manager can reach (e.g., on WireGuard). External workers behind NAT use the gateway with API keys (see GATEWAY.md).
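For reference, a minimal `/service-info` handler returning the shape above might look like the sketch below; the capability strings and port are illustrative, not the exact values the services emit:

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
	"os"
)

// serviceInfo mirrors the /service-info response shape documented above.
type serviceInfo struct {
	ServiceType  string   `json:"service_type"`
	Version      string   `json:"version"`
	Name         string   `json:"name"`
	InstanceID   string   `json:"instance_id"`
	Capabilities []string `json:"capabilities"`
}

func main() {
	host, _ := os.Hostname()
	info := serviceInfo{
		ServiceType:  "output", // one of input|ping|output
		Version:      "1.0.0",
		Name:         "output_service",
		InstanceID:   host,
		Capabilities: []string{"sqlite", "hop-extraction"}, // illustrative values
	}
	http.HandleFunc("/service-info", func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(info)
	})
	// Served on the health-check port in the real services; 8091 is the output_service default.
	log.Fatal(http.ListenAndServe(":8091", nil))
}
```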
## Common Commands
### Building Components
```bash
# Build ping_service (root)
go build -o ping_service

# Build input_service
cd input_service
go build -ldflags="-s -w" -o http_input_service http_input_service.go

# Build output_service
cd output_service
go build -o output_service main.go

# Build manager
cd manager
go mod tidy
go build -o manager
```
### Running Services
```bash
# Run ping_service with verbose logging
./ping_service -config config.yaml -verbose

# Run input_service (serves on :8080)
cd input_service
./http_input_service

# Run output_service (serves on :8081 for results, :8091 for health)
cd output_service
./output_service --verbose

# Run manager in development (self-signed certs)
cd manager
go run . --port=8080

# Run manager in production (Let's Encrypt)
sudo go run . --port=443 --domain=example.dy.fi --email=admin@example.com
```
### Installing ping_service as a systemd Service
```bash
chmod +x install.sh
sudo ./install.sh
sudo systemctl start ping-service
sudo systemctl status ping-service
sudo journalctl -u ping-service -f
```
### Manager User Management
```bash
# Add a new user (generates TOTP QR code)
cd manager
go run . --add-user=username
```
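The TOTP flow behind this command relies on the `github.com/pquerna/otp` dependency. A rough sketch of enrollment and code verification with that library follows; the issuer/account values are placeholders, and this is not the manager's actual user-store code:

```go
package main

import (
	"fmt"

	"github.com/pquerna/otp/totp"
)

func main() {
	// Enrollment: generate a secret and an otpauth:// URL the user can load
	// into an authenticator app (the manager renders this as a QR code).
	key, err := totp.Generate(totp.GenerateOpts{
		Issuer:      "manager.example", // placeholder issuer
		AccountName: "username",        // placeholder account
	})
	if err != nil {
		panic(err)
	}
	fmt.Println("secret:", key.Secret())
	fmt.Println("provisioning URL:", key.String())

	// Login: validate a 6-digit code the user typed against the stored secret.
	code := "123456" // would come from the login form
	ok := totp.Validate(code, key.Secret())
	fmt.Println("code accepted:", ok)
}
```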
## Configuration
### ping_service (config.yaml)
- `input_file`: IP source - HTTP endpoint, file path, or Unix socket
- `output_file`: Results destination - HTTP endpoint, file path, or Unix socket
- `interval_seconds`: Poll interval between runs
- `cooldown_minutes`: Minimum time between pinging the same IP
- `enable_traceroute`: Enable traceroute on successful pings
- `traceroute_max_hops`: Maximum TTL for traceroute
- `health_check_port`: Port for `/health`, `/ready`, `/metrics` endpoints
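These keys map naturally onto a small struct parsed with `gopkg.in/yaml.v3` (the listed config dependency). The sketch below shows that mapping; the Go field names are assumptions, only the yaml tags come from the keys above:

```go
package main

import (
	"fmt"
	"os"

	"gopkg.in/yaml.v3"
)

// Config mirrors the config.yaml keys documented above.
type Config struct {
	InputFile         string `yaml:"input_file"`
	OutputFile        string `yaml:"output_file"`
	IntervalSeconds   int    `yaml:"interval_seconds"`
	CooldownMinutes   int    `yaml:"cooldown_minutes"`
	EnableTraceroute  bool   `yaml:"enable_traceroute"`
	TracerouteMaxHops int    `yaml:"traceroute_max_hops"`
	HealthCheckPort   int    `yaml:"health_check_port"`
}

func main() {
	data, err := os.ReadFile("config.yaml")
	if err != nil {
		panic(err)
	}
	var cfg Config
	if err := yaml.Unmarshal(data, &cfg); err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", cfg)
}
```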
### output_service (CLI Flags)
- `--port`: Port for receiving results (default 8081)
- `--health-port`: Port for health/metrics (default 8091)
- `--input-url`: Input service URL for hop submission (default `http://localhost:8080/hops`)
- `--db-dir`: Directory for database files (default `./output_data`)
- `--max-size-mb`: Max DB size in MB before rotation (default 100)
- `--rotation-days`: Rotate DB after N days (default 7)
- `--keep-files`: Number of DB files to keep (default 5)
- `-v, --verbose`: Enable verbose logging
### manager (Environment Variables)
- `SERVER_KEY`: 32-byte base64 key for encryption (auto-generated if missing)
- `DYFI_DOMAIN`, `DYFI_USER`, `DYFI_PASS`: Dynamic DNS configuration
- `ACME_EMAIL`: Email for Let's Encrypt notifications
- `LOG_FILE`: Path for fail2ban-ready authentication logs
- `MANAGER_PORT`: HTTP/HTTPS port (default from flag)
## Key Design Principles
- Fault Tolerance: Nodes can join/leave freely, partial failures expected
- Network Reality: Designed for imperfect infrastructure (NAT, 4G, low-end hardware)
- No Time Guarantees: Latency variations normal, no assumption of always-online workers
- Organic Growth: System learns by discovering hops and feeding them back as targets
- Security: Manager requires TOTP auth, double-encrypted storage, fail2ban integration
## Dependencies
### ping_service
- `github.com/go-ping/ping` - ICMP ping library
- `gopkg.in/yaml.v3` - YAML config parsing
- Go 1.25.0
### output_service
- `github.com/mattn/go-sqlite3` - SQLite driver (requires CGO)
- Go 1.25.0
### manager
- `github.com/pquerna/otp` - TOTP authentication
- `golang.org/x/crypto/acme/autocert` - Let's Encrypt integration
## Data Flow
1. `input_service` serves IPs from CIDR ranges (or accepts discovered hops)
2. `ping_service` nodes poll input_service, ping targets with cooldown enforcement
3. Successful pings trigger optional traceroute (ICMP/TCP)
4. Results (JSON) sent to `output_service` (HTTP/file/socket)
5. `output_service` extracts intermediate hops from traceroute data
6. New hops fed back into `input_service` target pool
7. `manager` provides visibility and control over the system
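Seen from a single worker, steps 1-4 form a poll-ping-report loop. The sketch below shows only that skeleton: the ping/traceroute step is stubbed out, the result payload shape is an assumption, and the endpoints are the defaults documented in this file:

```go
package main

import (
	"bytes"
	"encoding/json"
	"io"
	"log"
	"net/http"
	"time"
)

// pollTarget asks the input service for the next IP to probe.
func pollTarget(inputURL string) (string, error) {
	resp, err := http.Get(inputURL)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	ip, err := io.ReadAll(resp.Body)
	return string(bytes.TrimSpace(ip)), err
}

// report sends a JSON result to the output service.
func report(outputURL string, payload []byte) error {
	resp, err := http.Post(outputURL, "application/json", bytes.NewReader(payload))
	if err != nil {
		return err
	}
	return resp.Body.Close()
}

func main() {
	const (
		inputURL  = "http://localhost:8080/"        // input_service: GET serves one IP
		outputURL = "http://localhost:8081/results" // output_service: POST results
		interval  = 10 * time.Second                // interval_seconds (illustrative value)
	)
	for {
		ip, err := pollTarget(inputURL)
		if err != nil || ip == "" {
			time.Sleep(interval)
			continue
		}
		// A real worker applies the cooldown, runs the ICMP/TCP ping and the
		// optional traceroute here; the payload shape below is an assumption.
		payload, _ := json.Marshal(map[string]any{"target": ip, "reachable": true})
		if err := report(outputURL, payload); err != nil {
			log.Println("report failed:", err)
		}
		time.Sleep(interval)
	}
}
```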
## Health Endpoints
### ping_service (port 8090)
- `GET /health` - Status, uptime, ping statistics
- `GET /ready` - Readiness check
- `GET /metrics` - Prometheus-compatible metrics
### output_service (port 8091)
- `GET /health` - Status, uptime, processing statistics
- `GET /ready` - Readiness check (verifies database connectivity)
- `GET /metrics` - Prometheus-compatible metrics
- `GET /stats` - Detailed statistics in JSON format
- `GET /recent?limit=100&ip=8.8.8.8` - Query recent ping results
### output_service API endpoints (port 8081)
- `POST /results` - Receive ping results from ping_service nodes
- `POST /rotate` - Manually trigger database rotation
- `GET /dump` - Download current SQLite database file
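Because each instance keeps its own SQLite file, central aggregation can be as simple as pulling `/dump` from every instance. A minimal sketch, where the instance URLs are placeholders:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
)

// fetchDump downloads one instance's current SQLite database to a local file.
func fetchDump(baseURL, outPath string) error {
	resp, err := http.Get(baseURL + "/dump")
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("dump failed: %s", resp.Status)
	}
	f, err := os.Create(outPath)
	if err != nil {
		return err
	}
	defer f.Close()
	_, err = io.Copy(f, resp.Body)
	return err
}

func main() {
	// Placeholder instance URLs; 8081 is the default results/API port.
	instances := []string{"http://node-a:8081", "http://node-b:8081"}
	for i, base := range instances {
		out := fmt.Sprintf("output_%d.sqlite", i)
		if err := fetchDump(base, out); err != nil {
			fmt.Println("skip", base, ":", err)
			continue
		}
		fmt.Println("saved", out)
	}
}
```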
## Project Status
- Functional distributed ping + traceroute workers
- Input service with persistent state and lazy CIDR expansion
- Output service with SQLite storage, rotation, hop extraction, and feedback loop
- Manager with TOTP auth, encryption, Let's Encrypt, dy.fi integration
- Mapping and visualization still exploratory
## Important Notes
- Visualization strategy is an open problem (no finalized design)
- System currently bootstrapped with ~19,000 cloud provider IPs
- Traceroute supports both ICMP and TCP methods
- Manager logs `AUTH_FAILURE` events with IP for fail2ban filtering
- Input service interleaving: Maintains 10 active CIDR generators, rotates between them to avoid consecutive IPs from the same /24 or /29 subnet
- Input service deduplication: Per-consumer (prevents re-serving) and global (prevents re-adding from hops)
- Hop feedback loop: Output service extracts hops → POSTs to input service `/hops` → input service adds to all consumer pools → organic target growth
- Input service maintains per-consumer progress state (can be exported/imported)
- Output service rotates databases weekly OR at 100MB (whichever first), keeping 5 files
- Each output_service instance maintains its own database; use `/dump` for central aggregation
- For multi-instance input_service, use session affinity or call `/hops` on all instances