Fixed a few memory leaks. Implemented testing of the functionality.
# Multi-Instance Deployment Guide

This document provides guidance for deploying multiple instances of each service for high availability and scalability.

## Overview

All services in this distributed network mapping system are designed to support multi-instance deployments, but each has specific considerations and limitations.

---

## Input Service (input_service/)

### Multi-Instance Readiness: ⚠️ **Partially Ready**

#### How It Works
- Each instance maintains its own per-consumer state and CIDR generators
- State is stored locally in the `progress_state/` directory
- Global hop deduplication (the `globalSeen` map) is **instance-local** (see the sketch below)

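
To make this limitation concrete, here is a minimal sketch of instance-local deduplication of the kind described above; the variable and function names are illustrative, not the service's actual code:

```go
package main

import (
	"fmt"
	"sync"
)

// Instance-local hop deduplication: each input_service process keeps its own
// set of hops it has already handed out. Nothing here is shared, so two
// instances can both consider the same hop "new".
var (
	mu         sync.Mutex
	globalSeen = make(map[string]struct{}) // hypothetical shape of the instance-local map
)

// seenBefore records a hop and reports whether this instance had already seen it.
func seenBefore(hopIP string) bool {
	mu.Lock()
	defer mu.Unlock()
	if _, ok := globalSeen[hopIP]; ok {
		return true
	}
	globalSeen[hopIP] = struct{}{}
	return false
}

func main() {
	fmt.Println(seenBefore("192.0.2.1")) // false: first time this instance sees the hop
	fmt.Println(seenBefore("192.0.2.1")) // true: deduplicated, but only within this instance
}
```

Because the map lives in process memory, a hop recorded by instance 1 is still "unseen" on instance 2; the strategies below work around that.
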
#### Multi-Instance Deployment Strategies

**Option 1: Session Affinity (Recommended)**
```
Load Balancer (with sticky sessions based on source IP)
  ├── input_service instance 1
  ├── input_service instance 2
  └── input_service instance 3
```
- Configure the load balancer to route each ping worker to the same input_service instance
- Ensures per-consumer state consistency
- Simple to implement and maintain

**Option 2: Broadcast Hop Submissions**
```
output_service ---> POST /hops ---> ALL input_service instances
```
Modify output_service to POST discovered hops to all input_service instances instead of just one. This ensures hop deduplication works across instances.

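
A minimal sketch of what such a broadcast could look like on the output_service side, assuming the instance URLs come from configuration; the payload shape and names are illustrative:

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
)

// Hypothetical list of input_service instances; in practice this would come
// from output_service configuration rather than a hard-coded slice.
var inputInstances = []string{
	"http://input1:8080",
	"http://input2:8080",
	"http://input3:8080",
}

// broadcastHop POSTs one discovered hop to every instance so that each
// instance's local deduplication state sees it.
func broadcastHop(hopJSON []byte) {
	for _, base := range inputInstances {
		resp, err := http.Post(base+"/hops", "application/json", bytes.NewReader(hopJSON))
		if err != nil {
			fmt.Printf("hop broadcast to %s failed: %v\n", base, err)
			continue // an unreachable instance simply misses this hop
		}
		resp.Body.Close()
	}
}

func main() {
	broadcastHop([]byte(`{"ip": "192.0.2.1"}`))
}
```
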
**Option 3: Shared Deduplication Backend (Future Enhancement)**

Implement Redis or database-backed `globalSeen` storage so all instances share deduplication state.

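
If this option is pursued, one way to share the state is a Redis set-if-absent operation. A minimal sketch, assuming the `github.com/redis/go-redis/v9` client and an illustrative key scheme and TTL:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/redis/go-redis/v9"
)

// seenBefore uses Redis SETNX as a shared deduplication set: the first
// instance to see a hop wins the SETNX and treats the hop as new; every other
// instance sees it as a duplicate.
func seenBefore(ctx context.Context, rdb *redis.Client, hopIP string) (bool, error) {
	isNew, err := rdb.SetNX(ctx, "hops:seen:"+hopIP, 1, 24*time.Hour).Result()
	if err != nil {
		return false, err
	}
	return !isNew, nil
}

func main() {
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})
	dup, err := seenBefore(context.Background(), rdb, "192.0.2.1")
	fmt.Println(dup, err)
}
```

A database table with a unique constraint on the hop key would serve the same purpose.
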
#### Known Limitations
- **Hop deduplication is instance-local**: Different instances may serve duplicate hops if output_service sends hops to only one instance
- **Per-consumer state is instance-local**: If a consumer switches instances, it gets a new generator and starts from the beginning
- **CIDR files must be present on all instances**: The `cloud-provider-ip-addresses/` directory must exist on each instance

#### Deployment Example
```bash
# Instance 1
./http_input_service &

# Instance 2 (different port)
PORT=8081 ./http_input_service &
```

Load balancer configuration (nginx example):

```nginx
upstream input_service {
    ip_hash;  # Session affinity
    server 127.0.0.1:8080;
    server 127.0.0.1:8081;
}
```

---

## Output Service (output_service/)

### Multi-Instance Readiness: ✅ **Fully Ready**

#### How It Works
- Each instance maintains its own SQLite database
- Databases are independent and can be aggregated later
- `sentHops` deduplication is instance-local with 24-hour TTL

#### Multi-Instance Deployment
```
ping_service workers ---> Load Balancer ---> output_service instances
```
- No session affinity required
- Each instance stores results independently
- Use the `/dump` endpoint to collect databases from all instances for aggregation

#### Aggregation Strategy
```bash
# Collect databases from all instances
curl http://instance1:8091/dump > instance1.db
curl http://instance2:8091/dump > instance2.db
curl http://instance3:8091/dump > instance3.db

# Merge using sqlite3: create each table from the first dump, then append the rest
sqlite3 merged.db <<EOF
ATTACH 'instance1.db' AS db1;
ATTACH 'instance2.db' AS db2;
ATTACH 'instance3.db' AS db3;

CREATE TABLE ping_results AS SELECT * FROM db1.ping_results;
INSERT INTO ping_results SELECT * FROM db2.ping_results;
INSERT INTO ping_results SELECT * FROM db3.ping_results;

CREATE TABLE traceroute_hops AS SELECT * FROM db1.traceroute_hops;
INSERT INTO traceroute_hops SELECT * FROM db2.traceroute_hops;
INSERT INTO traceroute_hops SELECT * FROM db3.traceroute_hops;
EOF
```

#### Deployment Example
```bash
# Instance 1
./output_service --port=8081 --health-port=8091 --db-dir=/data/output1 &

# Instance 2
./output_service --port=8082 --health-port=8092 --db-dir=/data/output2 &

# Instance 3
./output_service --port=8083 --health-port=8093 --db-dir=/data/output3 &
```

---

## Ping Service (ping_service/)

### Multi-Instance Readiness: ✅ **Fully Ready**

#### How It Works
- Designed from the ground up for distributed operation
- Each worker independently polls input_service and submits results
- The cooldown cache is instance-local (intentional: distributed workers coordinate implicitly through the cooldown duration; see the sketch below)

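
A minimal sketch of such an instance-local cooldown cache, assuming a configurable cooldown window; the names are illustrative rather than the service's actual implementation:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// cooldownCache remembers, per worker instance, when each IP was last pinged.
// It is deliberately not shared: as long as the cooldown window is large
// relative to the number of workers, duplicate pings stay bounded.
type cooldownCache struct {
	mu       sync.Mutex
	lastPing map[string]time.Time
	cooldown time.Duration
}

func newCooldownCache(cooldown time.Duration) *cooldownCache {
	return &cooldownCache{lastPing: make(map[string]time.Time), cooldown: cooldown}
}

// shouldPing reports whether this worker may ping the IP now, and records the
// attempt if so.
func (c *cooldownCache) shouldPing(ip string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if t, ok := c.lastPing[ip]; ok && time.Since(t) < c.cooldown {
		return false
	}
	c.lastPing[ip] = time.Now()
	return true
}

func main() {
	cache := newCooldownCache(30 * time.Minute)
	fmt.Println(cache.shouldPing("192.0.2.1")) // true
	fmt.Println(cache.shouldPing("192.0.2.1")) // false within the cooldown window
}
```
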
#### Multi-Instance Deployment
```
input_service <--- ping_service workers (many instances)
                            |
                            v
                     output_service
```
- Deploy as many workers as needed across different networks/locations
- Workers can run on Raspberry Pis, VPSes, cloud instances, etc.
- No coordination required between workers

#### Deployment Example
```bash
# Worker 1 (local network)
./ping_service -config config.yaml &

# Worker 2 (VPS)
ssh vps1 "./ping_service -config config.yaml" &

# Worker 3 (different geographic location)
ssh vps2 "./ping_service -config config.yaml" &
```

---

## Manager (manager/)

### Multi-Instance Readiness: ⚠️ **Requires Configuration**

#### How It Works
- Session store is **in-memory** (not shared across instances)
- User store uses file-based storage with file locking (multi-instance safe as of the latest update)
- Worker registry is instance-local

#### Multi-Instance Deployment Strategies

**Option 1: Active-Passive with Failover**
```
Load Balancer (active-passive)
  ├── manager instance 1 (active)
  └── manager instance 2 (standby)
```
- Only one instance is active at a time
- Failover on primary failure
- Simplest approach, no session coordination needed

**Option 2: Shared Session Store (Recommended for Active-Active)**

Implement Redis or database-backed session storage to enable true active-active multi-instance deployment.

**Required Changes for Active-Active:**
```go
// Replace in-memory sessions (main.go:31-34) with Redis
var sessions = redis.NewSessionStore(redisClient)
```

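
The `redis.NewSessionStore` wrapper above does not exist yet. A minimal sketch of what it could look like, assuming the `github.com/redis/go-redis/v9` client and an illustrative key prefix and TTL:

```go
package main

import (
	"context"
	"time"

	"github.com/redis/go-redis/v9"
)

// SessionStore keeps session tokens in Redis so any manager instance can
// validate a session created by another instance.
type SessionStore struct {
	rdb *redis.Client
	ttl time.Duration
}

func NewSessionStore(rdb *redis.Client) *SessionStore {
	return &SessionStore{rdb: rdb, ttl: 24 * time.Hour}
}

// Create stores a session token for a user.
func (s *SessionStore) Create(ctx context.Context, token, username string) error {
	return s.rdb.Set(ctx, "session:"+token, username, s.ttl).Err()
}

// Lookup returns the username for a token, or "" if the session is unknown or expired.
func (s *SessionStore) Lookup(ctx context.Context, token string) (string, error) {
	username, err := s.rdb.Get(ctx, "session:"+token).Result()
	if err == redis.Nil {
		return "", nil
	}
	return username, err
}

// Delete removes a session on logout.
func (s *SessionStore) Delete(ctx context.Context, token string) error {
	return s.rdb.Del(ctx, "session:"+token).Err()
}

func main() {
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})
	store := NewSessionStore(rdb)
	_ = store // wire into the HTTP handlers in place of the in-memory map
}
```

Any manager instance pointed at the same Redis address can then validate sessions created by another instance.
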
#### Current Limitations
- **Sessions are not shared**: A user authenticated on instance A cannot access instance B
- **Worker registry is not shared**: Each instance maintains its own worker list
- **dy.fi updates may conflict**: Multiple instances may try to update the same domain simultaneously

#### User Store File Locking (✅ Fixed)

As of the latest update, the user store uses file locking to prevent race conditions (see the sketch below):
- **Shared locks** for reads (multiple readers allowed)
- **Exclusive locks** for writes (blocks all readers and writers)
- **Atomic write-then-rename** prevents corruption
- Safe for multi-instance deployment when instances share the same filesystem

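
For illustration, a minimal sketch of this pattern on Unix-like systems, combining `syscall.Flock` with a write-then-rename; the sidecar lock file and helper names are illustrative, not the manager's actual code:

```go
package main

import (
	"os"
	"syscall"
)

// lockFile opens (creating if needed) a sidecar lock file and applies the
// requested flock mode (syscall.LOCK_SH or syscall.LOCK_EX).
func lockFile(path string, how int) (*os.File, error) {
	f, err := os.OpenFile(path+".lock", os.O_CREATE|os.O_RDWR, 0o644)
	if err != nil {
		return nil, err
	}
	if err := syscall.Flock(int(f.Fd()), how); err != nil {
		f.Close()
		return nil, err
	}
	return f, nil
}

// readUsers takes a shared lock so concurrent readers are allowed while
// writers are excluded.
func readUsers(path string) ([]byte, error) {
	lock, err := lockFile(path, syscall.LOCK_SH)
	if err != nil {
		return nil, err
	}
	defer lock.Close() // closing the descriptor releases the flock
	return os.ReadFile(path)
}

// writeUsers takes an exclusive lock, writes a temporary file, and renames it
// over the original so readers never see a half-written store.
func writeUsers(path string, data []byte) error {
	lock, err := lockFile(path, syscall.LOCK_EX)
	if err != nil {
		return err
	}
	defer lock.Close()
	tmp := path + ".tmp"
	if err := os.WriteFile(tmp, data, 0o600); err != nil {
		return err
	}
	return os.Rename(tmp, path) // atomic on the same filesystem
}

func main() {
	_ = writeUsers("users.json", []byte(`{"alice":"..."}`))
	if b, err := readUsers("users.json"); err == nil {
		_ = b
	}
}
```
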
#### Deployment Example (Active-Passive)
```bash
# Primary instance
./manager --port=8080 --domain=manager.dy.fi &

# Secondary instance (standby)
MANAGER_PORT=8081 ./manager &

# Load balancer health-checks both instances and routes traffic to the active one only
```

---

## General Multi-Instance Recommendations

### Health Checks
All services expose `/health` and `/ready` endpoints (sketched below). Configure your load balancer to:
- Route traffic only to healthy instances
- Remove failed instances from rotation automatically
- Monitor the `/metrics` endpoint for Prometheus integration

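
For reference, a minimal sketch of the handlers a load balancer would probe; the readiness condition shown here (a reachable data directory) is purely illustrative, and a real check might verify the database or upstream services instead:

```go
package main

import (
	"net/http"
	"os"
)

func main() {
	// Liveness: the process is up and serving HTTP.
	http.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})

	// Readiness: additionally verify a dependency before accepting traffic.
	http.HandleFunc("/ready", func(w http.ResponseWriter, r *http.Request) {
		if _, err := os.Stat("/data"); err != nil {
			http.Error(w, "data directory unavailable", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	})

	http.ListenAndServe(":8090", nil)
}
```
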
### Monitoring
Add `instance_id` labels to metrics for per-instance monitoring:
```go
// Recommended enhancement for all services
// (os.Hostname also returns an error, so capture or discard it)
var instanceID, _ = os.Hostname()
```

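
For example, assuming the services use the standard `client_golang` library for their `/metrics` endpoint, the hostname can be attached as a constant label at registration time; the metric name below is illustrative:

```go
package main

import (
	"net/http"
	"os"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var instanceID, _ = os.Hostname()

// resultsIngested is an illustrative counter; every sample it exports carries
// an instance_id label identifying this process.
var resultsIngested = prometheus.NewCounter(prometheus.CounterOpts{
	Name:        "results_ingested_total",
	Help:        "Number of results ingested by this instance.",
	ConstLabels: prometheus.Labels{"instance_id": instanceID},
})

func main() {
	prometheus.MustRegister(resultsIngested)
	resultsIngested.Inc()
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":9100", nil)
}
```
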
### File Locking
Services that write to shared storage should use file locking (as the manager user store does) to prevent corruption:
```go
// fd is the integer file descriptor, e.g. int(f.Fd()) for an *os.File
syscall.Flock(fd, syscall.LOCK_EX) // Exclusive lock (writers)
syscall.Flock(fd, syscall.LOCK_SH) // Shared lock (readers)
```

### Network Considerations
- **Latency**: Place input_service close to ping workers to minimize polling latency
- **Bandwidth**: output_service should have sufficient bandwidth for result ingestion
- **NAT Traversal**: Use manager gateway mode for ping workers behind NAT

---

## Troubleshooting Multi-Instance Deployments

### Input Service: Duplicate Hops Served
- **Symptom**: The same hop appears multiple times in different workers
- **Cause**: Hop deduplication is instance-local
- **Solution**: Implement session affinity or broadcast hop submissions

### Manager: Sessions Lost After Reconnect
- **Symptom**: User is logged out when the load balancer switches instances
- **Cause**: Sessions are in-memory, not shared
- **Solution**: Use session affinity in the load balancer or implement a shared session store

### Output Service: Database Conflicts
- **Symptom**: Database file corruption or lock timeouts
- **Cause**: Multiple instances writing to the same database file
- **Solution**: Each instance MUST have its own `--db-dir`; aggregate the databases later

### Ping Service: Excessive Pinging
- **Symptom**: The same IP is pinged too frequently
- **Cause**: Too many workers with a short cooldown period
- **Solution**: Increase `cooldown_minutes` in config.yaml

---

## Production Deployment Checklist

- [ ] Input service: Configure session affinity or hop broadcast
- [ ] Output service: Each instance has unique `--db-dir`
- [ ] Ping service: Cooldown duration accounts for total worker count
- [ ] Manager: Decide active-passive or implement shared sessions
- [ ] All services: Health check endpoints configured in load balancer
- [ ] All services: Metrics exported to monitoring system
- [ ] All services: Logs aggregated to central logging system
- [ ] File-based state: Shared filesystem or backup/sync strategy
- [ ] Database rotation: Automated collection of output service dumps

---

## Future Enhancements

### High Priority
1. **Shared session store for manager** (Redis/database)
2. **Shared hop deduplication for input_service** (Redis)
3. **Distributed worker coordination** for ping_service cooldowns

### Medium Priority
4. **Instance ID labels in metrics** for better observability
5. **Graceful shutdown coordination** to prevent data loss
6. **Health check improvements** to verify actual functionality

### Low Priority
7. **Automated database aggregation** for output_service
8. **Service mesh integration** (Consul, etcd) for discovery
9. **Horizontal autoscaling** based on load metrics

---

## Summary Table

| Service | Multi-Instance Ready | Session Affinity Needed | Shared Storage Needed | Notes |
|---------|----------------------|-------------------------|-----------------------|-------|
| input_service | ⚠️ Partial | ✅ Yes (recommended) | ❌ No | Hop dedup is instance-local |
| output_service | ✅ Full | ❌ No | ❌ No | Each instance has own DB |
| ping_service | ✅ Full | ❌ No | ❌ No | Fully distributed by design |
| manager | ⚠️ Requires config | ✅ Yes (sessions) | ✅ Yes (user store) | Sessions in-memory; user store file-locked |

---

For questions or issues with multi-instance deployments, refer to the service-specific README files or open an issue in the project repository.