Fixed a few memory leaks. Implemented testing of the functionality.

Kalzu Rekku
2026-01-08 18:55:32 +02:00
parent c663ec0431
commit 1130b7fb8c
10 changed files with 1334 additions and 13 deletions

MULTI_INSTANCE.md Normal file

@@ -0,0 +1,305 @@
# Multi-Instance Deployment Guide
This document provides guidance for deploying multiple instances of each service for high availability and scalability.
## Overview
All services in this distributed network mapping system are designed to support multi-instance deployments, but each has specific considerations and limitations.
---
## Input Service (input_service/)
### Multi-Instance Readiness: ⚠️ **Partially Ready**
#### How It Works
- Each instance maintains its own per-consumer state and CIDR generators
- State is stored locally in `progress_state/` directory
- Global hop deduplication (`globalSeen` map) is **instance-local**
#### Multi-Instance Deployment Strategies
**Option 1: Session Affinity (Recommended)**
```
Load Balancer (with sticky sessions based on source IP)
├── input_service instance 1
├── input_service instance 2
└── input_service instance 3
```
- Configure load balancer to route each ping worker to the same input_service instance
- Ensures per-consumer state consistency
- Simple to implement and maintain
**Option 2: Broadcast Hop Submissions**
```
output_service ---> POST /hops ---> ALL input_service instances
```
Modify output_service to POST discovered hops to all input_service instances instead of just one. This ensures hop deduplication works across instances.
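For illustration, a minimal sketch of what such a broadcast could look like on the output_service side (the `Hop` shape, instance list, and function name are assumptions, not the current code):
```go
package outputservice

import (
	"bytes"
	"encoding/json"
	"log"
	"net/http"
)

// Hop is an assumed shape for a discovered traceroute hop; the real
// payload sent to /hops may differ.
type Hop struct {
	IP  string `json:"ip"`
	TTL int    `json:"ttl"`
}

// broadcastHops POSTs a batch of discovered hops to every input_service
// instance instead of just one, so each instance can deduplicate them.
func broadcastHops(instances []string, hops []Hop) {
	body, err := json.Marshal(hops)
	if err != nil {
		log.Printf("marshal hops: %v", err)
		return
	}
	for _, base := range instances {
		resp, err := http.Post(base+"/hops", "application/json", bytes.NewReader(body))
		if err != nil {
			log.Printf("POST %s/hops: %v", base, err)
			continue // best effort: remaining instances still receive the batch
		}
		resp.Body.Close()
	}
}
```
The trade-off is extra HTTP traffic proportional to the number of input_service instances.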
**Option 3: Shared Deduplication Backend (Future Enhancement)**
Implement Redis or database-backed `globalSeen` storage so all instances share deduplication state.
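A rough sketch of that approach with go-redis, where `SETNX` plus a TTL gives all instances a shared "first sighting" check (client setup, key prefix, and TTL are assumptions):
```go
package inputservice

import (
	"context"
	"time"

	"github.com/redis/go-redis/v9"
)

// sharedSeen reports whether any instance has already served this hop.
// SETNX only sets the key if it does not exist, so exactly one instance
// "wins" the first sighting; the TTL keeps the keyspace bounded.
func sharedSeen(ctx context.Context, rdb *redis.Client, hopIP string) (bool, error) {
	set, err := rdb.SetNX(ctx, "seen:hop:"+hopIP, 1, 24*time.Hour).Result()
	if err != nil {
		return false, err
	}
	return !set, nil // not set means the key already existed
}
```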
#### Known Limitations
- **Hop deduplication is instance-local**: Different instances may serve duplicate hops if output_service sends hops to only one instance
- **Per-consumer state is instance-local**: If a consumer switches instances, it gets a new generator and starts from the beginning
- **CIDR files must be present on all instances**: The `cloud-provider-ip-addresses/` directory must exist on each instance
#### Deployment Example
```bash
# Instance 1
./http_input_service &
# Instance 2 (different port)
PORT=8081 ./http_input_service &
# Load balancer (nginx example; this block goes in nginx.conf, not in the shell)
upstream input_service {
    ip_hash;  # Session affinity
    server 127.0.0.1:8080;
    server 127.0.0.1:8081;
}
```
---
## Output Service (output_service/)
### Multi-Instance Readiness: ✅ **Fully Ready**
#### How It Works
- Each instance maintains its own SQLite database
- Databases are independent and can be aggregated later
- `sentHops` deduplication is instance-local with 24-hour TTL
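As a rough illustration, the instance-local `sentHops` behaviour amounts to a TTL cache like the sketch below (names and structure are assumptions, not the actual code):
```go
package outputservice

import (
	"sync"
	"time"
)

// sentHops sketches an instance-local dedup cache with a 24-hour TTL:
// each instance forwards a given hop at most once per day, but two
// instances may still forward the same hop independently.
type sentHops struct {
	mu   sync.Mutex
	seen map[string]time.Time
	ttl  time.Duration
}

func newSentHops() *sentHops {
	return &sentHops{seen: map[string]time.Time{}, ttl: 24 * time.Hour}
}

// shouldSend reports whether the hop should be forwarded now, and records it.
func (s *sentHops) shouldSend(hopIP string, now time.Time) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if t, ok := s.seen[hopIP]; ok && now.Sub(t) < s.ttl {
		return false // already sent within the TTL by this instance
	}
	s.seen[hopIP] = now
	return true
}
```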
#### Multi-Instance Deployment
```
ping_service workers ---> Load Balancer ---> output_service instances
```
- No session affinity required
- Each instance stores results independently
- Use `/dump` endpoint to collect databases from all instances for aggregation
#### Aggregation Strategy
```bash
# Collect databases from all instances
curl http://instance1:8091/dump > instance1.db
curl http://instance2:8091/dump > instance2.db
curl http://instance3:8091/dump > instance3.db
# Merge using sqlite3 (assumes merged.db is new: create the tables from
# the first dump, then append the rest)
sqlite3 merged.db <<EOF
ATTACH 'instance1.db' AS db1;
ATTACH 'instance2.db' AS db2;
ATTACH 'instance3.db' AS db3;
CREATE TABLE ping_results AS SELECT * FROM db1.ping_results;
INSERT INTO ping_results SELECT * FROM db2.ping_results;
INSERT INTO ping_results SELECT * FROM db3.ping_results;
CREATE TABLE traceroute_hops AS SELECT * FROM db1.traceroute_hops;
INSERT INTO traceroute_hops SELECT * FROM db2.traceroute_hops;
INSERT INTO traceroute_hops SELECT * FROM db3.traceroute_hops;
EOF
```
#### Deployment Example
```bash
# Instance 1
./output_service --port=8081 --health-port=8091 --db-dir=/data/output1 &
# Instance 2
./output_service --port=8082 --health-port=8092 --db-dir=/data/output2 &
# Instance 3
./output_service --port=8083 --health-port=8093 --db-dir=/data/output3 &
```
---
## Ping Service (ping_service/)
### Multi-Instance Readiness: ✅ **Fully Ready**
#### How It Works
- Designed from the ground up for distributed operation
- Each worker independently polls input_service and submits results
- Cooldown cache is instance-local (intentional: workers coordinate implicitly through the cooldown duration rather than through shared state; see the sketch below)
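A simplified sketch of that worker loop (endpoint paths, payload shapes, and the `pingOnce` helper are assumptions for illustration; the real worker takes its settings from config.yaml):
```go
package pingservice

import (
	"bytes"
	"encoding/json"
	"log"
	"net/http"
	"time"
)

// workerLoop polls input_service for targets, skips anything this worker
// pinged recently, and submits results to output_service.
func workerLoop(inputURL, outputURL string, cooldown time.Duration) {
	lastPinged := map[string]time.Time{} // instance-local cooldown cache

	for {
		// 1. Poll input_service for a batch of targets.
		resp, err := http.Get(inputURL + "/targets")
		if err != nil {
			log.Printf("poll input_service: %v", err)
			time.Sleep(10 * time.Second)
			continue
		}
		var targets []string
		if err := json.NewDecoder(resp.Body).Decode(&targets); err != nil {
			log.Printf("decode targets: %v", err)
		}
		resp.Body.Close()

		// 2. Ping each target unless this worker hit it within the cooldown.
		for _, ip := range targets {
			if t, ok := lastPinged[ip]; ok && time.Since(t) < cooldown {
				continue
			}
			lastPinged[ip] = time.Now()

			result := map[string]any{"ip": ip, "rtt_ms": pingOnce(ip)}
			body, _ := json.Marshal(result)

			// 3. Submit the result to output_service.
			if resp, err := http.Post(outputURL+"/results", "application/json", bytes.NewReader(body)); err == nil {
				resp.Body.Close()
			}
		}
		time.Sleep(time.Minute)
	}
}

// pingOnce is a placeholder for the real ICMP measurement.
func pingOnce(ip string) float64 { return 0 }
```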
#### Multi-Instance Deployment
```
input_service <--- ping_service workers (many instances)
|
v
output_service
```
- Deploy as many workers as needed across different networks/locations
- Workers can run on Raspberry Pis, VPS, cloud instances, etc.
- No coordination required between workers
#### Deployment Example
```bash
# Worker 1 (local network)
./ping_service -config config.yaml &
# Worker 2 (VPS)
ssh vps1 "./ping_service -config config.yaml" &
# Worker 3 (different geographic location)
ssh vps2 "./ping_service -config config.yaml" &
```
---
## Manager (manager/)
### Multi-Instance Readiness: ⚠️ **Requires Configuration**
#### How It Works
- Session store is **in-memory** (not shared across instances)
- User store uses file-based storage with file locking (multi-instance safe as of latest update)
- Worker registry is instance-local
#### Multi-Instance Deployment Strategies
**Option 1: Active-Passive with Failover**
```
Load Balancer (active-passive)
├── manager instance 1 (active)
└── manager instance 2 (standby)
```
- Only one instance active at a time
- Failover on primary failure
- Simplest approach, no session coordination needed
**Option 2: Shared Session Store (Recommended for Active-Active)**
Implement Redis or database-backed session storage to enable true active-active multi-instance deployment.
**Required Changes for Active-Active:**
```go
// Replace the in-memory session map (main.go:31-34) with a shared,
// Redis-backed store (illustrative; such a store does not exist yet):
var sessions = redis.NewSessionStore(redisClient)
```
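One possible shape for such a store, sketched with go-redis (the interface, key prefix, and TTL are assumptions and would need adapting to the manager's actual session struct):
```go
package manager

import (
	"context"
	"time"

	"github.com/redis/go-redis/v9"
)

// RedisSessionStore is an illustrative shared session store; any manager
// instance can validate a token issued by another instance.
type RedisSessionStore struct {
	rdb *redis.Client
	ttl time.Duration
}

func NewRedisSessionStore(rdb *redis.Client, ttl time.Duration) *RedisSessionStore {
	return &RedisSessionStore{rdb: rdb, ttl: ttl}
}

// Set stores the username for a session token with an expiry.
func (s *RedisSessionStore) Set(ctx context.Context, token, username string) error {
	return s.rdb.Set(ctx, "session:"+token, username, s.ttl).Err()
}

// Get returns the username for a token; redis.Nil means no such session.
func (s *RedisSessionStore) Get(ctx context.Context, token string) (string, error) {
	return s.rdb.Get(ctx, "session:"+token).Result()
}

// Delete removes a session on logout.
func (s *RedisSessionStore) Delete(ctx context.Context, token string) error {
	return s.rdb.Del(ctx, "session:"+token).Err()
}
```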
#### Current Limitations
- **Sessions are not shared**: User authenticated on instance A cannot access instance B
- **Worker registry is not shared**: Each instance maintains its own worker list
- **dy.fi updates may conflict**: Multiple instances updating the same domain simultaneously
#### User Store File Locking (✅ Fixed)
As of the latest update, the user store uses file locking to prevent race conditions:
- **Shared locks** for reads (multiple readers allowed)
- **Exclusive locks** for writes (blocks all readers and writers)
- **Atomic write-then-rename** prevents corruption
- Safe for multi-instance deployment when instances share the same filesystem
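For other services that want to adopt the same pattern, a sketch of the exclusive-lock plus write-then-rename approach (paths and helper names are illustrative, not the manager's actual code):
```go
package userstore

import (
	"os"
	"path/filepath"
	"syscall"
)

// saveAtomically takes an exclusive flock, writes to a temp file, then
// renames it over the original so readers never see a half-written file.
func saveAtomically(path string, data []byte) error {
	lock, err := os.OpenFile(path+".lock", os.O_CREATE|os.O_RDWR, 0o600)
	if err != nil {
		return err
	}
	defer lock.Close()

	// Exclusive lock blocks other writers and shared-lock readers.
	if err := syscall.Flock(int(lock.Fd()), syscall.LOCK_EX); err != nil {
		return err
	}
	defer syscall.Flock(int(lock.Fd()), syscall.LOCK_UN)

	tmp := filepath.Join(filepath.Dir(path), ".tmp-"+filepath.Base(path))
	if err := os.WriteFile(tmp, data, 0o600); err != nil {
		return err
	}
	// Rename is atomic on POSIX filesystems: readers see either the old
	// file or the new one, never a partial write.
	return os.Rename(tmp, path)
}
```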
#### Deployment Example (Active-Passive)
```bash
# Primary instance
./manager --port=8080 --domain=manager.dy.fi &
# Secondary instance (standby)
MANAGER_PORT=8081 ./manager &
# Load balancer health check both, route to active only
```
---
## General Multi-Instance Recommendations
### Health Checks
All services expose `/health` and `/ready` endpoints (a minimal handler sketch follows the list below). Configure your load balancer to:
- Route traffic only to healthy instances
- Remove failed instances from rotation automatically
- Monitor `/metrics` endpoint for Prometheus integration
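For reference, the `/health` versus `/ready` distinction as a minimal Go sketch (illustrative only; the actual handlers in each service may differ):
```go
package main

import (
	"net/http"
	"sync/atomic"
)

// ready is flipped once the service has loaded its state; the load
// balancer polls /ready and only routes to instances answering 200.
var ready atomic.Bool

func main() {
	http.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK) // process is alive
	})
	http.HandleFunc("/ready", func(w http.ResponseWriter, r *http.Request) {
		if !ready.Load() {
			w.WriteHeader(http.StatusServiceUnavailable) // still initializing
			return
		}
		w.WriteHeader(http.StatusOK)
	})
	ready.Store(true) // set after initialization completes
	http.ListenAndServe(":8080", nil)
}
```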
### Monitoring
Add `instance_id` labels to metrics for per-instance monitoring:
```go
// Recommended enhancement for all services
var instanceID, _ = os.Hostname() // os.Hostname returns (string, error)
```
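One way to wire that up with the Prometheus Go client (assumes the services use `prometheus/client_golang`; the metric name is an illustrative placeholder):
```go
package metrics

import (
	"os"

	"github.com/prometheus/client_golang/prometheus"
)

// requestsTotal carries the instance hostname as a constant label so
// dashboards can break metrics down per instance.
var requestsTotal = prometheus.NewCounter(prometheus.CounterOpts{
	Name:        "service_requests_total",
	Help:        "Requests handled by this instance.",
	ConstLabels: prometheus.Labels{"instance_id": hostname()},
})

func hostname() string {
	h, err := os.Hostname()
	if err != nil {
		return "unknown"
	}
	return h
}

func init() {
	prometheus.MustRegister(requestsTotal)
}
```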
### File Locking
Services that write to shared storage should use file locking (like manager user store) to prevent corruption:
```go
syscall.Flock(fd, syscall.LOCK_EX) // Exclusive lock for writers
syscall.Flock(fd, syscall.LOCK_SH) // Shared lock for readers
// Release with syscall.Flock(fd, syscall.LOCK_UN) (or by closing the file);
// check the returned error in real code.
```
### Network Considerations
- **Latency**: Place input_service close to ping workers to minimize polling latency
- **Bandwidth**: output_service should have sufficient bandwidth for result ingestion
- **NAT Traversal**: Use manager gateway mode for ping workers behind NAT
---
## Troubleshooting Multi-Instance Deployments
### Input Service: Duplicate Hops Served
**Symptom**: Same hop appears multiple times in different workers
**Cause**: Hop deduplication is instance-local
**Solution**: Implement session affinity or broadcast hop submissions
### Manager: Sessions Lost After Reconnect
**Symptom**: User logged out when load balancer switches instances
**Cause**: Sessions are in-memory, not shared
**Solution**: Use session affinity in load balancer or implement shared session store
### Output Service: Database Conflicts
**Symptom**: Database file corruption or lock timeouts
**Cause**: Multiple instances writing to same database file
**Solution**: Each instance MUST have its own `--db-dir`, then aggregate later
### Ping Service: Excessive Pinging
**Symptom**: Same IP pinged too frequently
**Cause**: Too many workers with short cooldown period
**Solution**: Increase `cooldown_minutes` in config.yaml
---
## Production Deployment Checklist
- [ ] Input service: Configure session affinity or hop broadcast
- [ ] Output service: Each instance has unique `--db-dir`
- [ ] Ping service: Cooldown duration accounts for total worker count
- [ ] Manager: Decide active-passive or implement shared sessions
- [ ] All services: Health check endpoints configured in load balancer
- [ ] All services: Metrics exported to monitoring system
- [ ] All services: Logs aggregated to central logging system
- [ ] File-based state: Shared filesystem or backup/sync strategy
- [ ] Database rotation: Automated collection of output service dumps
---
## Future Enhancements
### High Priority
1. **Shared session store for manager** (Redis/database)
2. **Shared hop deduplication for input_service** (Redis)
3. **Distributed worker coordination** for ping_service cooldowns
### Medium Priority
4. **Instance ID labels in metrics** for better observability
5. **Graceful shutdown coordination** to prevent data loss
6. **Health check improvements** to verify actual functionality
### Low Priority
7. **Automated database aggregation** for output_service
8. **Service mesh integration** (Consul, etcd) for discovery
9. **Horizontal autoscaling** based on load metrics
---
## Summary Table
| Service | Multi-Instance Ready | Session Affinity Needed | Shared Storage Needed | Notes |
|---------|---------------------|------------------------|---------------------|-------|
| input_service | ⚠️ Partial | ✅ Yes (recommended) | ❌ No | Hop dedup is instance-local |
| output_service | ✅ Full | ❌ No | ❌ No | Each instance has own DB |
| ping_service | ✅ Full | ❌ No | ❌ No | Fully distributed by design |
| manager | ⚠️ Requires config | ✅ Yes (sessions) | ✅ Yes (user store) | Sessions in-memory; user store file-locked |
---
For questions or issues with multi-instance deployments, refer to the service-specific README files or open an issue in the project repository.