Fixed a few memory leaks. Implemented testing of the functionality.
# Multi-Instance Deployment Guide

This document provides guidance for deploying multiple instances of each service for high availability and scalability.

## Overview

All services in this distributed network mapping system are designed to support multi-instance deployments, but each has specific considerations and limitations.

---

## Input Service (input_service/)

### Multi-Instance Readiness: ⚠️ **Partially Ready**

#### How It Works
- Each instance maintains its own per-consumer state and CIDR generators
- State is stored locally in the `progress_state/` directory
- Global hop deduplication (the `globalSeen` map) is **instance-local** (see the sketch below)

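
To make this limitation concrete, here is a minimal sketch of instance-local deduplication of the kind described above; the variable and function names are illustrative, not the service's actual code:

```go
package main

import (
	"fmt"
	"sync"
)

// Instance-local hop deduplication: each input_service process keeps its own
// set of hops it has already handed out. Nothing here is shared, so two
// instances can both consider the same hop "new".
var (
	mu         sync.Mutex
	globalSeen = make(map[string]struct{}) // hypothetical shape of the instance-local map
)

// seenBefore records a hop and reports whether this instance had already seen it.
func seenBefore(hopIP string) bool {
	mu.Lock()
	defer mu.Unlock()
	if _, ok := globalSeen[hopIP]; ok {
		return true
	}
	globalSeen[hopIP] = struct{}{}
	return false
}

func main() {
	fmt.Println(seenBefore("192.0.2.1")) // false: first time this instance sees the hop
	fmt.Println(seenBefore("192.0.2.1")) // true: deduplicated, but only within this instance
}
```

Because the map lives in process memory, a hop recorded by instance 1 is still "unseen" on instance 2; the strategies below work around that.
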
#### Multi-Instance Deployment Strategies

**Option 1: Session Affinity (Recommended)**
```
Load Balancer (with sticky sessions based on source IP)
  ├── input_service instance 1
  ├── input_service instance 2
  └── input_service instance 3
```
- Configure the load balancer to route each ping worker to the same input_service instance
- Ensures per-consumer state consistency
- Simple to implement and maintain

**Option 2: Broadcast Hop Submissions**
```
output_service ---> POST /hops ---> ALL input_service instances
```
Modify output_service to POST discovered hops to all input_service instances instead of just one. This ensures hop deduplication works across instances.

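
A minimal sketch of what such a broadcast could look like on the output_service side, assuming the instance URLs come from configuration; the payload shape and names are illustrative:

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
)

// Hypothetical list of input_service instances; in practice this would come
// from output_service configuration rather than a hard-coded slice.
var inputInstances = []string{
	"http://input1:8080",
	"http://input2:8080",
	"http://input3:8080",
}

// broadcastHop POSTs one discovered hop to every instance so that each
// instance's local deduplication state sees it.
func broadcastHop(hopJSON []byte) {
	for _, base := range inputInstances {
		resp, err := http.Post(base+"/hops", "application/json", bytes.NewReader(hopJSON))
		if err != nil {
			fmt.Printf("hop broadcast to %s failed: %v\n", base, err)
			continue // an unreachable instance simply misses this hop
		}
		resp.Body.Close()
	}
}

func main() {
	broadcastHop([]byte(`{"ip": "192.0.2.1"}`))
}
```
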
**Option 3: Shared Deduplication Backend (Future Enhancement)**

Implement Redis or database-backed `globalSeen` storage so all instances share deduplication state.

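
If this option is pursued, one way to share the state is a Redis set-if-absent operation. A minimal sketch, assuming the `github.com/redis/go-redis/v9` client and an illustrative key scheme and TTL:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/redis/go-redis/v9"
)

// seenBefore uses Redis SETNX as a shared deduplication set: the first
// instance to see a hop wins the SETNX and treats the hop as new; every other
// instance sees it as a duplicate.
func seenBefore(ctx context.Context, rdb *redis.Client, hopIP string) (bool, error) {
	isNew, err := rdb.SetNX(ctx, "hops:seen:"+hopIP, 1, 24*time.Hour).Result()
	if err != nil {
		return false, err
	}
	return !isNew, nil
}

func main() {
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})
	dup, err := seenBefore(context.Background(), rdb, "192.0.2.1")
	fmt.Println(dup, err)
}
```

A database table with a unique constraint on the hop key would serve the same purpose.
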
#### Known Limitations
- **Hop deduplication is instance-local**: Different instances may serve duplicate hops if output_service sends hops to only one instance
- **Per-consumer state is instance-local**: If a consumer switches instances, it gets a new generator and starts from the beginning
- **CIDR files must be present on all instances**: The `cloud-provider-ip-addresses/` directory must exist on each instance

#### Deployment Example
```bash
# Instance 1
./http_input_service &

# Instance 2 (different port)
PORT=8081 ./http_input_service &
```

Load balancer configuration (nginx example):

```nginx
upstream input_service {
    ip_hash;  # Session affinity
    server 127.0.0.1:8080;
    server 127.0.0.1:8081;
}
```

---

## Output Service (output_service/)

### Multi-Instance Readiness: ✅ **Fully Ready**

#### How It Works
- Each instance maintains its own SQLite database
- Databases are independent and can be aggregated later
- `sentHops` deduplication is instance-local with 24-hour TTL

#### Multi-Instance Deployment
```
ping_service workers ---> Load Balancer ---> output_service instances
```
- No session affinity required
- Each instance stores results independently
- Use the `/dump` endpoint to collect databases from all instances for aggregation

#### Aggregation Strategy
```bash
# Collect databases from all instances
curl http://instance1:8091/dump > instance1.db
curl http://instance2:8091/dump > instance2.db
curl http://instance3:8091/dump > instance3.db

# Merge using sqlite3: create each table from the first dump, then append the rest
sqlite3 merged.db <<EOF
ATTACH 'instance1.db' AS db1;
ATTACH 'instance2.db' AS db2;
ATTACH 'instance3.db' AS db3;

CREATE TABLE ping_results AS SELECT * FROM db1.ping_results;
INSERT INTO ping_results SELECT * FROM db2.ping_results;
INSERT INTO ping_results SELECT * FROM db3.ping_results;

CREATE TABLE traceroute_hops AS SELECT * FROM db1.traceroute_hops;
INSERT INTO traceroute_hops SELECT * FROM db2.traceroute_hops;
INSERT INTO traceroute_hops SELECT * FROM db3.traceroute_hops;
EOF
```

#### Deployment Example
```bash
# Instance 1
./output_service --port=8081 --health-port=8091 --db-dir=/data/output1 &

# Instance 2
./output_service --port=8082 --health-port=8092 --db-dir=/data/output2 &

# Instance 3
./output_service --port=8083 --health-port=8093 --db-dir=/data/output3 &
```

---

## Ping Service (ping_service/)

### Multi-Instance Readiness: ✅ **Fully Ready**

#### How It Works
- Designed from the ground up for distributed operation
- Each worker independently polls input_service and submits results
- The cooldown cache is instance-local (intentional: distributed workers coordinate implicitly through the cooldown duration; see the sketch below)

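
A minimal sketch of such an instance-local cooldown cache, assuming a configurable cooldown window; the names are illustrative rather than the service's actual implementation:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// cooldownCache remembers, per worker instance, when each IP was last pinged.
// It is deliberately not shared: as long as the cooldown window is large
// relative to the number of workers, duplicate pings stay bounded.
type cooldownCache struct {
	mu       sync.Mutex
	lastPing map[string]time.Time
	cooldown time.Duration
}

func newCooldownCache(cooldown time.Duration) *cooldownCache {
	return &cooldownCache{lastPing: make(map[string]time.Time), cooldown: cooldown}
}

// shouldPing reports whether this worker may ping the IP now, and records the
// attempt if so.
func (c *cooldownCache) shouldPing(ip string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if t, ok := c.lastPing[ip]; ok && time.Since(t) < c.cooldown {
		return false
	}
	c.lastPing[ip] = time.Now()
	return true
}

func main() {
	cache := newCooldownCache(30 * time.Minute)
	fmt.Println(cache.shouldPing("192.0.2.1")) // true
	fmt.Println(cache.shouldPing("192.0.2.1")) // false within the cooldown window
}
```
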
#### Multi-Instance Deployment
```
input_service <--- ping_service workers (many instances)
                            |
                            v
                     output_service
```
- Deploy as many workers as needed across different networks/locations
- Workers can run on Raspberry Pis, VPSes, cloud instances, etc.
- No coordination required between workers

#### Deployment Example
```bash
# Worker 1 (local network)
./ping_service -config config.yaml &

# Worker 2 (VPS)
ssh vps1 "./ping_service -config config.yaml" &

# Worker 3 (different geographic location)
ssh vps2 "./ping_service -config config.yaml" &
```

---

## Manager (manager/)

### Multi-Instance Readiness: ⚠️ **Requires Configuration**

#### How It Works
- Session store is **in-memory** (not shared across instances)
- User store uses file-based storage with file locking (multi-instance safe as of the latest update)
- Worker registry is instance-local

#### Multi-Instance Deployment Strategies

**Option 1: Active-Passive with Failover**
```
Load Balancer (active-passive)
  ├── manager instance 1 (active)
  └── manager instance 2 (standby)
```
- Only one instance is active at a time
- Failover on primary failure
- Simplest approach, no session coordination needed

**Option 2: Shared Session Store (Recommended for Active-Active)**

Implement Redis or database-backed session storage to enable true active-active multi-instance deployment.

**Required Changes for Active-Active:**
```go
// Replace in-memory sessions (main.go:31-34) with Redis
var sessions = redis.NewSessionStore(redisClient)
```

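
The `redis.NewSessionStore` wrapper above does not exist yet. A minimal sketch of what it could look like, assuming the `github.com/redis/go-redis/v9` client and an illustrative key prefix and TTL:

```go
package main

import (
	"context"
	"time"

	"github.com/redis/go-redis/v9"
)

// SessionStore keeps session tokens in Redis so any manager instance can
// validate a session created by another instance.
type SessionStore struct {
	rdb *redis.Client
	ttl time.Duration
}

func NewSessionStore(rdb *redis.Client) *SessionStore {
	return &SessionStore{rdb: rdb, ttl: 24 * time.Hour}
}

// Create stores a session token for a user.
func (s *SessionStore) Create(ctx context.Context, token, username string) error {
	return s.rdb.Set(ctx, "session:"+token, username, s.ttl).Err()
}

// Lookup returns the username for a token, or "" if the session is unknown or expired.
func (s *SessionStore) Lookup(ctx context.Context, token string) (string, error) {
	username, err := s.rdb.Get(ctx, "session:"+token).Result()
	if err == redis.Nil {
		return "", nil
	}
	return username, err
}

// Delete removes a session on logout.
func (s *SessionStore) Delete(ctx context.Context, token string) error {
	return s.rdb.Del(ctx, "session:"+token).Err()
}

func main() {
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})
	store := NewSessionStore(rdb)
	_ = store // wire into the HTTP handlers in place of the in-memory map
}
```

Any manager instance pointed at the same Redis address can then validate sessions created by another instance.
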
#### Current Limitations
- **Sessions are not shared**: A user authenticated on instance A cannot access instance B
- **Worker registry is not shared**: Each instance maintains its own worker list
- **dy.fi updates may conflict**: Multiple instances may try to update the same domain simultaneously

#### User Store File Locking (✅ Fixed)

As of the latest update, the user store uses file locking to prevent race conditions (see the sketch below):
- **Shared locks** for reads (multiple readers allowed)
- **Exclusive locks** for writes (blocks all readers and writers)
- **Atomic write-then-rename** prevents corruption
- Safe for multi-instance deployment when instances share the same filesystem

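
For illustration, a minimal sketch of this pattern on Unix-like systems, combining `syscall.Flock` with a write-then-rename; the sidecar lock file and helper names are illustrative, not the manager's actual code:

```go
package main

import (
	"os"
	"syscall"
)

// lockFile opens (creating if needed) a sidecar lock file and applies the
// requested flock mode (syscall.LOCK_SH or syscall.LOCK_EX).
func lockFile(path string, how int) (*os.File, error) {
	f, err := os.OpenFile(path+".lock", os.O_CREATE|os.O_RDWR, 0o644)
	if err != nil {
		return nil, err
	}
	if err := syscall.Flock(int(f.Fd()), how); err != nil {
		f.Close()
		return nil, err
	}
	return f, nil
}

// readUsers takes a shared lock so concurrent readers are allowed while
// writers are excluded.
func readUsers(path string) ([]byte, error) {
	lock, err := lockFile(path, syscall.LOCK_SH)
	if err != nil {
		return nil, err
	}
	defer lock.Close() // closing the descriptor releases the flock
	return os.ReadFile(path)
}

// writeUsers takes an exclusive lock, writes a temporary file, and renames it
// over the original so readers never see a half-written store.
func writeUsers(path string, data []byte) error {
	lock, err := lockFile(path, syscall.LOCK_EX)
	if err != nil {
		return err
	}
	defer lock.Close()
	tmp := path + ".tmp"
	if err := os.WriteFile(tmp, data, 0o600); err != nil {
		return err
	}
	return os.Rename(tmp, path) // atomic on the same filesystem
}

func main() {
	_ = writeUsers("users.json", []byte(`{"alice":"..."}`))
	if b, err := readUsers("users.json"); err == nil {
		_ = b
	}
}
```
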
#### Deployment Example (Active-Passive)
```bash
# Primary instance
./manager --port=8080 --domain=manager.dy.fi &

# Secondary instance (standby)
MANAGER_PORT=8081 ./manager &

# Load balancer health-checks both instances and routes traffic to the active one only
```

---

## General Multi-Instance Recommendations

### Health Checks
All services expose `/health` and `/ready` endpoints (sketched below). Configure your load balancer to:
- Route traffic only to healthy instances
- Remove failed instances from rotation automatically
- Monitor the `/metrics` endpoint for Prometheus integration

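
For reference, a minimal sketch of the handlers a load balancer would probe; the readiness condition shown here (a reachable data directory) is purely illustrative, and a real check might verify the database or upstream services instead:

```go
package main

import (
	"net/http"
	"os"
)

func main() {
	// Liveness: the process is up and serving HTTP.
	http.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})

	// Readiness: additionally verify a dependency before accepting traffic.
	http.HandleFunc("/ready", func(w http.ResponseWriter, r *http.Request) {
		if _, err := os.Stat("/data"); err != nil {
			http.Error(w, "data directory unavailable", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	})

	http.ListenAndServe(":8090", nil)
}
```
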
### Monitoring
Add `instance_id` labels to metrics for per-instance monitoring:
```go
// Recommended enhancement for all services
// (os.Hostname also returns an error, so capture or discard it)
var instanceID, _ = os.Hostname()
```

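
For example, assuming the services use the standard `client_golang` library for their `/metrics` endpoint, the hostname can be attached as a constant label at registration time; the metric name below is illustrative:

```go
package main

import (
	"net/http"
	"os"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var instanceID, _ = os.Hostname()

// resultsIngested is an illustrative counter; every sample it exports carries
// an instance_id label identifying this process.
var resultsIngested = prometheus.NewCounter(prometheus.CounterOpts{
	Name:        "results_ingested_total",
	Help:        "Number of results ingested by this instance.",
	ConstLabels: prometheus.Labels{"instance_id": instanceID},
})

func main() {
	prometheus.MustRegister(resultsIngested)
	resultsIngested.Inc()
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":9100", nil)
}
```
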
### File Locking
Services that write to shared storage should use file locking (as the manager user store does) to prevent corruption:
```go
// fd is the integer file descriptor, e.g. int(f.Fd()) for an *os.File
syscall.Flock(fd, syscall.LOCK_EX) // Exclusive lock (writers)
syscall.Flock(fd, syscall.LOCK_SH) // Shared lock (readers)
```

### Network Considerations
- **Latency**: Place input_service close to ping workers to minimize polling latency
- **Bandwidth**: output_service should have sufficient bandwidth for result ingestion
- **NAT Traversal**: Use manager gateway mode for ping workers behind NAT

---

## Troubleshooting Multi-Instance Deployments

### Input Service: Duplicate Hops Served
- **Symptom**: The same hop appears multiple times in different workers
- **Cause**: Hop deduplication is instance-local
- **Solution**: Implement session affinity or broadcast hop submissions

### Manager: Sessions Lost After Reconnect
- **Symptom**: User is logged out when the load balancer switches instances
- **Cause**: Sessions are in-memory, not shared
- **Solution**: Use session affinity in the load balancer or implement a shared session store

### Output Service: Database Conflicts
- **Symptom**: Database file corruption or lock timeouts
- **Cause**: Multiple instances writing to the same database file
- **Solution**: Each instance MUST have its own `--db-dir`; aggregate the databases later

### Ping Service: Excessive Pinging
- **Symptom**: The same IP is pinged too frequently
- **Cause**: Too many workers with a short cooldown period
- **Solution**: Increase `cooldown_minutes` in config.yaml

---

## Production Deployment Checklist

- [ ] Input service: Configure session affinity or hop broadcast
- [ ] Output service: Each instance has unique `--db-dir`
- [ ] Ping service: Cooldown duration accounts for total worker count
- [ ] Manager: Decide active-passive or implement shared sessions
- [ ] All services: Health check endpoints configured in load balancer
- [ ] All services: Metrics exported to monitoring system
- [ ] All services: Logs aggregated to central logging system
- [ ] File-based state: Shared filesystem or backup/sync strategy
- [ ] Database rotation: Automated collection of output service dumps

---

## Future Enhancements

### High Priority
1. **Shared session store for manager** (Redis/database)
2. **Shared hop deduplication for input_service** (Redis)
3. **Distributed worker coordination** for ping_service cooldowns

### Medium Priority
4. **Instance ID labels in metrics** for better observability
5. **Graceful shutdown coordination** to prevent data loss
6. **Health check improvements** to verify actual functionality

### Low Priority
7. **Automated database aggregation** for output_service
8. **Service mesh integration** (Consul, etcd) for discovery
9. **Horizontal autoscaling** based on load metrics

---

## Summary Table

| Service | Multi-Instance Ready | Session Affinity Needed | Shared Storage Needed | Notes |
|---------|----------------------|-------------------------|-----------------------|-------|
| input_service | ⚠️ Partial | ✅ Yes (recommended) | ❌ No | Hop dedup is instance-local |
| output_service | ✅ Full | ❌ No | ❌ No | Each instance has own DB |
| ping_service | ✅ Full | ❌ No | ❌ No | Fully distributed by design |
| manager | ⚠️ Requires config | ✅ Yes (sessions) | ✅ Yes (user store) | Sessions in-memory; user store file-locked |

---

For questions or issues with multi-instance deployments, refer to the service-specific README files or open an issue in the project repository.