# Multi-Instance Deployment Guide

This document provides guidance for deploying multiple instances of each service for high availability and scalability.

## Overview

All services in this distributed network mapping system are designed to support multi-instance deployments, but each has specific considerations and limitations.

---
## Input Service (input_service/)

### Multi-Instance Readiness: ⚠️ **Partially Ready**

#### How It Works

- Each instance maintains its own per-consumer state and CIDR generators
- State is stored locally in the `progress_state/` directory
- Global hop deduplication (the `globalSeen` map) is **instance-local**
#### Multi-Instance Deployment Strategies

**Option 1: Session Affinity (Recommended)**

```
Load Balancer (with sticky sessions based on source IP)
├── input_service instance 1
├── input_service instance 2
└── input_service instance 3
```

- Configure the load balancer to route each ping worker to the same input_service instance
- Ensures per-consumer state consistency
- Simple to implement and maintain
**Option 2: Broadcast Hop Submissions**

```
output_service ---> POST /hops ---> ALL input_service instances
```

Modify output_service to POST discovered hops to all input_service instances instead of just one. This ensures hop deduplication works across instances.
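
A minimal sketch of what that fan-out could look like on the output_service side. The instance URLs, the `broadcastHop` helper, and the hop payload shape are illustrative assumptions, not the existing implementation:

```go
package main

import (
	"bytes"
	"encoding/json"
	"log"
	"net/http"
	"time"
)

// Hypothetical list of input_service base URLs; in practice this would come
// from configuration.
var inputServiceURLs = []string{
	"http://input1:8080",
	"http://input2:8080",
}

// broadcastHop POSTs a discovered hop to every configured input_service
// instance so each one can update its local deduplication state.
func broadcastHop(hop map[string]any) {
	body, err := json.Marshal(hop)
	if err != nil {
		log.Printf("marshal hop: %v", err)
		return
	}
	client := &http.Client{Timeout: 5 * time.Second}
	for _, base := range inputServiceURLs {
		// Best-effort delivery: an unreachable instance simply misses this hop.
		resp, err := client.Post(base+"/hops", "application/json", bytes.NewReader(body))
		if err != nil {
			log.Printf("POST %s/hops failed: %v", base, err)
			continue
		}
		resp.Body.Close()
	}
}

func main() {
	// Field names are made up for illustration.
	broadcastHop(map[string]any{"ip": "192.0.2.1", "ttl": 7})
}
```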
**Option 3: Shared Deduplication Backend (Future Enhancement)**

Implement Redis or database-backed `globalSeen` storage so all instances share deduplication state.
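
As a rough illustration, the instance-local `globalSeen` check could be replaced by an atomic Redis `SETNX`, so the first instance to see a hop wins. This is a hedged sketch using the go-redis client; the key prefix and TTL are assumptions, not the service's actual design:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/redis/go-redis/v9"
)

// hopSeen reports whether a hop IP has already been recorded by ANY instance.
// SetNX stores the key only if it does not exist, so exactly one instance
// observes the first sighting; the TTL keeps the set from growing forever.
func hopSeen(ctx context.Context, rdb *redis.Client, ip string) (bool, error) {
	newlySet, err := rdb.SetNX(ctx, "globalSeen:"+ip, 1, 24*time.Hour).Result()
	if err != nil {
		return false, err
	}
	return !newlySet, nil // newlySet == true means the hop had not been seen before
}

func main() {
	ctx := context.Background()
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})
	seen, err := hopSeen(ctx, rdb, "192.0.2.1")
	if err != nil {
		panic(err)
	}
	fmt.Println("already seen:", seen)
}
```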
#### Known Limitations

- **Hop deduplication is instance-local**: Different instances may serve duplicate hops if output_service sends hops to only one instance
- **Per-consumer state is instance-local**: If a consumer switches instances, it gets a new generator and starts from the beginning
- **CIDR files must be present on all instances**: The `cloud-provider-ip-addresses/` directory must exist on each instance
#### Deployment Example

```bash
# Instance 1
./http_input_service &

# Instance 2 (different port)
PORT=8081 ./http_input_service &
```

```nginx
# Load balancer (nginx example)
upstream input_service {
    ip_hash;  # Session affinity
    server 127.0.0.1:8080;
    server 127.0.0.1:8081;
}
```

---
## Output Service (output_service/)

### Multi-Instance Readiness: ✅ **Fully Ready**

#### How It Works

- Each instance maintains its own SQLite database
- Databases are independent and can be aggregated later
- `sentHops` deduplication is instance-local with 24-hour TTL

#### Multi-Instance Deployment

```
ping_service workers ---> Load Balancer ---> output_service instances
```

- No session affinity required
- Each instance stores results independently
- Use the `/dump` endpoint to collect databases from all instances for aggregation
#### Aggregation Strategy

```bash
# Collect databases from all instances
curl http://instance1:8091/dump > instance1.db
curl http://instance2:8091/dump > instance2.db
curl http://instance3:8091/dump > instance3.db

# Merge using sqlite3: start from a copy of the first database so the
# schema already exists, then append rows from the other instances.
cp instance1.db merged.db
sqlite3 merged.db <<EOF
ATTACH 'instance2.db' AS db2;
ATTACH 'instance3.db' AS db3;

INSERT INTO ping_results SELECT * FROM db2.ping_results;
INSERT INTO ping_results SELECT * FROM db3.ping_results;

INSERT INTO traceroute_hops SELECT * FROM db2.traceroute_hops;
INSERT INTO traceroute_hops SELECT * FROM db3.traceroute_hops;
EOF
```
#### Deployment Example

```bash
# Instance 1
./output_service --port=8081 --health-port=8091 --db-dir=/data/output1 &

# Instance 2
./output_service --port=8082 --health-port=8092 --db-dir=/data/output2 &

# Instance 3
./output_service --port=8083 --health-port=8093 --db-dir=/data/output3 &
```

---
## Ping Service (ping_service/)

### Multi-Instance Readiness: ✅ **Fully Ready**

#### How It Works

- Designed from the ground up for distributed operation
- Each worker independently polls input_service and submits results
- Cooldown cache is instance-local (intentional: distributed workers coordinate via the cooldown duration; see the sketch below)
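
Purely as an illustration of that coordination model (not the service's actual code), an instance-local cooldown cache can be as simple as a timestamp map: with N workers and a cooldown of D, a given IP is pinged at most roughly N times per window D.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// cooldownCache remembers when each IP was last pinged by THIS worker.
// Workers never talk to each other; keeping the cooldown long relative to
// the number of workers bounds the aggregate ping rate per target.
type cooldownCache struct {
	mu       sync.Mutex
	lastPing map[string]time.Time
	cooldown time.Duration
}

func newCooldownCache(cooldown time.Duration) *cooldownCache {
	return &cooldownCache{lastPing: make(map[string]time.Time), cooldown: cooldown}
}

// Allow reports whether this worker may ping ip now, and records the attempt.
func (c *cooldownCache) Allow(ip string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if t, ok := c.lastPing[ip]; ok && time.Since(t) < c.cooldown {
		return false
	}
	c.lastPing[ip] = time.Now()
	return true
}

func main() {
	cache := newCooldownCache(60 * time.Minute)
	fmt.Println(cache.Allow("192.0.2.1")) // true  (first attempt)
	fmt.Println(cache.Allow("192.0.2.1")) // false (still cooling down)
}
```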
#### Multi-Instance Deployment

```
input_service <--- ping_service workers (many instances)
                              |
                              v
                       output_service
```

- Deploy as many workers as needed across different networks/locations
- Workers can run on Raspberry Pis, VPSes, cloud instances, etc.
- No coordination required between workers

#### Deployment Example

```bash
# Worker 1 (local network)
./ping_service -config config.yaml &

# Worker 2 (VPS)
ssh vps1 "./ping_service -config config.yaml" &

# Worker 3 (different geographic location)
ssh vps2 "./ping_service -config config.yaml" &
```

---
## Manager (manager/)

### Multi-Instance Readiness: ⚠️ **Requires Configuration**

#### How It Works

- Session store is **in-memory** (not shared across instances)
- User store uses file-based storage with file locking (multi-instance safe as of the latest update)
- Worker registry is instance-local

#### Multi-Instance Deployment Strategies

**Option 1: Active-Passive with Failover**

```
Load Balancer (active-passive)
├── manager instance 1 (active)
└── manager instance 2 (standby)
```

- Only one instance active at a time
- Failover on primary failure
- Simplest approach, no session coordination needed
**Option 2: Shared Session Store (Recommended for Active-Active)**

Implement Redis or database-backed session storage to enable true active-active multi-instance deployment.

**Required Changes for Active-Active:**

```go
// Replace in-memory sessions (main.go:31-34) with Redis
var sessions = redis.NewSessionStore(redisClient)
```
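
For reference, one shape such a shared store could take. This is a hedged sketch built on the go-redis client; the `Store` type, key prefix, and TTL are illustrative assumptions, not the manager's existing API:

```go
package session

import (
	"context"
	"time"

	"github.com/redis/go-redis/v9"
)

// Store keeps session tokens in Redis so any manager instance can validate
// a session created by another instance.
type Store struct {
	rdb *redis.Client
	ttl time.Duration
}

func NewStore(rdb *redis.Client, ttl time.Duration) *Store {
	return &Store{rdb: rdb, ttl: ttl}
}

// Create associates a session token with a username for the configured TTL.
func (s *Store) Create(ctx context.Context, token, username string) error {
	return s.rdb.Set(ctx, "session:"+token, username, s.ttl).Err()
}

// Lookup returns the username for a token, or "" if the session is unknown
// or has expired.
func (s *Store) Lookup(ctx context.Context, token string) (string, error) {
	username, err := s.rdb.Get(ctx, "session:"+token).Result()
	if err == redis.Nil {
		return "", nil
	}
	return username, err
}

// Delete removes a session on logout.
func (s *Store) Delete(ctx context.Context, token string) error {
	return s.rdb.Del(ctx, "session:"+token).Err()
}
```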
#### Current Limitations

- **Sessions are not shared**: User authenticated on instance A cannot access instance B
- **Worker registry is not shared**: Each instance maintains its own worker list
- **dy.fi updates may conflict**: Multiple instances updating the same domain simultaneously

#### User Store File Locking (✅ Fixed)

As of the latest update, the user store uses file locking to prevent race conditions:

- **Shared locks** for reads (multiple readers allowed)
- **Exclusive locks** for writes (blocks all readers and writers)
- **Atomic write-then-rename** prevents corruption
- Safe for multi-instance deployment when instances share the same filesystem
#### Deployment Example (Active-Passive)

```bash
# Primary instance
./manager --port=8080 --domain=manager.dy.fi &

# Secondary instance (standby)
MANAGER_PORT=8081 ./manager &

# Load balancer health-checks both instances and routes traffic to the active one only
```

---
## General Multi-Instance Recommendations

### Health Checks

All services expose `/health` and `/ready` endpoints (a minimal handler sketch follows this list). Configure your load balancer to:

- Route traffic only to healthy instances
- Remove failed instances from rotation automatically
- Monitor the `/metrics` endpoint for Prometheus integration
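
The services already ship these endpoints; purely as an illustration of the liveness/readiness convention the load balancer relies on (not the actual implementation), a minimal handler pair might look like this:

```go
package main

import (
	"net/http"
	"sync/atomic"
)

// ready flips to true once the service has finished loading its state; a load
// balancer should only route traffic to instances whose /ready returns 200.
var ready atomic.Bool

func main() {
	// /health: liveness only; the process is up and serving HTTP.
	http.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})

	// /ready: readiness; fail until initialization (state load, DB open, ...) completes.
	http.HandleFunc("/ready", func(w http.ResponseWriter, r *http.Request) {
		if !ready.Load() {
			http.Error(w, "not ready", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	})

	ready.Store(true) // set after startup work finishes
	http.ListenAndServe(":8080", nil)
}
```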
### Monitoring

Add `instance_id` labels to metrics for per-instance monitoring:

```go
// Recommended enhancement for all services.
// os.Hostname returns (string, error), so capture both values.
var instanceID, _ = os.Hostname()
```
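
A hedged sketch of how that label could be attached with the Prometheus Go client; the metric name and help text are made up for illustration:

```go
package main

import (
	"net/http"
	"os"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var instanceID, _ = os.Hostname()

// requestsTotal carries a constant instance_id label so dashboards can
// break the series down per instance.
var requestsTotal = prometheus.NewCounter(prometheus.CounterOpts{
	Name:        "requests_total",
	Help:        "Total requests handled by this instance.",
	ConstLabels: prometheus.Labels{"instance_id": instanceID},
})

func main() {
	prometheus.MustRegister(requestsTotal)
	http.Handle("/metrics", promhttp.Handler())
	requestsTotal.Inc()
	http.ListenAndServe(":2112", nil)
}
```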
### File Locking

Services that write to shared storage should use file locking (like the manager user store) to prevent corruption:

```go
syscall.Flock(fd, syscall.LOCK_EX) // Exclusive lock
syscall.Flock(fd, syscall.LOCK_SH) // Shared lock
```
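
Tying this together with the write-then-rename pattern described for the manager user store, a hedged sketch of a safe write path on a Unix filesystem might look like the following; the file names and the `writeAtomically` helper are illustrative:

```go
package main

import (
	"os"
	"syscall"
)

// writeAtomically takes an exclusive lock on a lock file, writes the new
// contents to a temporary file, then renames it over the target. Readers
// see either the old file or the new one, never a partial write.
func writeAtomically(path string, data []byte) error {
	lock, err := os.OpenFile(path+".lock", os.O_CREATE|os.O_RDWR, 0o644)
	if err != nil {
		return err
	}
	defer lock.Close()

	// LOCK_EX blocks until no other reader (LOCK_SH) or writer holds the lock.
	if err := syscall.Flock(int(lock.Fd()), syscall.LOCK_EX); err != nil {
		return err
	}
	defer syscall.Flock(int(lock.Fd()), syscall.LOCK_UN)

	tmp := path + ".tmp"
	if err := os.WriteFile(tmp, data, 0o644); err != nil {
		return err
	}
	// Rename is atomic on POSIX filesystems.
	return os.Rename(tmp, path)
}

func main() {
	if err := writeAtomically("users.json", []byte(`{"users":[]}`)); err != nil {
		panic(err)
	}
}
```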
### Network Considerations

- **Latency**: Place input_service close to ping workers to minimize polling latency
- **Bandwidth**: output_service should have sufficient bandwidth for result ingestion
- **NAT Traversal**: Use manager gateway mode for ping workers behind NAT

---
## Troubleshooting Multi-Instance Deployments

### Input Service: Duplicate Hops Served

**Symptom**: Same hop appears multiple times in different workers

**Cause**: Hop deduplication is instance-local

**Solution**: Implement session affinity or broadcast hop submissions

### Manager: Sessions Lost After Reconnect

**Symptom**: User logged out when load balancer switches instances

**Cause**: Sessions are in-memory, not shared

**Solution**: Use session affinity in the load balancer or implement a shared session store

### Output Service: Database Conflicts

**Symptom**: Database file corruption or lock timeouts

**Cause**: Multiple instances writing to the same database file

**Solution**: Each instance MUST have its own `--db-dir`; aggregate the databases later

### Ping Service: Excessive Pinging

**Symptom**: Same IP pinged too frequently

**Cause**: Too many workers with a short cooldown period

**Solution**: Increase `cooldown_minutes` in config.yaml. As a rule of thumb, each worker may ping a given IP once per cooldown window, so size the cooldown to the total number of workers.

---
## Production Deployment Checklist

- [ ] Input service: Configure session affinity or hop broadcast
- [ ] Output service: Each instance has unique `--db-dir`
- [ ] Ping service: Cooldown duration accounts for total worker count
- [ ] Manager: Decide active-passive or implement shared sessions
- [ ] All services: Health check endpoints configured in load balancer
- [ ] All services: Metrics exported to monitoring system
- [ ] All services: Logs aggregated to central logging system
- [ ] File-based state: Shared filesystem or backup/sync strategy
- [ ] Database rotation: Automated collection of output service dumps

---
## Future Enhancements

### High Priority

1. **Shared session store for manager** (Redis/database)
2. **Shared hop deduplication for input_service** (Redis)
3. **Distributed worker coordination** for ping_service cooldowns

### Medium Priority

4. **Instance ID labels in metrics** for better observability
5. **Graceful shutdown coordination** to prevent data loss
6. **Health check improvements** to verify actual functionality

### Low Priority

7. **Automated database aggregation** for output_service
8. **Service mesh integration** (Consul, etcd) for discovery
9. **Horizontal autoscaling** based on load metrics

---
## Summary Table

| Service | Multi-Instance Ready | Session Affinity Needed | Shared Storage Needed | Notes |
|---------|---------------------|------------------------|---------------------|-------|
| input_service | ⚠️ Partial | ✅ Yes (recommended) | ❌ No | Hop dedup is instance-local |
| output_service | ✅ Full | ❌ No | ❌ No | Each instance has own DB |
| ping_service | ✅ Full | ❌ No | ❌ No | Fully distributed by design |
| manager | ⚠️ Requires config | ✅ Yes (sessions) | ✅ Yes (user store) | Sessions in-memory; user store file-locked |

---

For questions or issues with multi-instance deployments, refer to the service-specific README files or open an issue in the project repository.