# Multi-Instance Deployment Guide

This document provides guidance for deploying multiple instances of each service for high availability and scalability.

## Overview

All services in this distributed network mapping system are designed to support multi-instance deployments, but each has specific considerations and limitations.

---
## Input Service (input_service/)

### Multi-Instance Readiness: ⚠️ **Partially Ready**

#### How It Works

- Each instance maintains its own per-consumer state and CIDR generators
- State is stored locally in the `progress_state/` directory
- Global hop deduplication (the `globalSeen` map) is **instance-local**
#### Multi-Instance Deployment Strategies

**Option 1: Session Affinity (Recommended)**

```
Load Balancer (with sticky sessions based on source IP)
├── input_service instance 1
├── input_service instance 2
└── input_service instance 3
```

- Configure the load balancer to route each ping worker to the same input_service instance
- Ensures per-consumer state consistency
- Simple to implement and maintain
**Option 2: Broadcast Hop Submissions**

```
output_service ---> POST /hops ---> ALL input_service instances
```

Modify output_service to POST discovered hops to all input_service instances instead of just one. This ensures hop deduplication works across instances.
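
A minimal sketch of what that fan-out could look like on the output_service side. The instance URLs, the `broadcastHop` helper, and the hop payload shape are illustrative assumptions, not the existing implementation:

```go
package main

import (
	"bytes"
	"encoding/json"
	"log"
	"net/http"
	"time"
)

// Hypothetical list of input_service base URLs; in practice this would come
// from configuration.
var inputServiceURLs = []string{
	"http://input1:8080",
	"http://input2:8080",
}

// broadcastHop POSTs a discovered hop to every configured input_service
// instance so each one can update its local deduplication state.
func broadcastHop(hop map[string]any) {
	body, err := json.Marshal(hop)
	if err != nil {
		log.Printf("marshal hop: %v", err)
		return
	}
	client := &http.Client{Timeout: 5 * time.Second}
	for _, base := range inputServiceURLs {
		// Best-effort delivery: an unreachable instance simply misses this hop.
		resp, err := client.Post(base+"/hops", "application/json", bytes.NewReader(body))
		if err != nil {
			log.Printf("POST %s/hops failed: %v", base, err)
			continue
		}
		resp.Body.Close()
	}
}

func main() {
	// Field names are made up for illustration.
	broadcastHop(map[string]any{"ip": "192.0.2.1", "ttl": 7})
}
```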
**Option 3: Shared Deduplication Backend (Future Enhancement)**

Implement Redis or database-backed `globalSeen` storage so all instances share deduplication state.
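
As a rough illustration, the instance-local `globalSeen` check could be replaced by an atomic Redis `SETNX`, so the first instance to see a hop wins. This is a hedged sketch using the go-redis client; the key prefix and TTL are assumptions, not the service's actual design:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/redis/go-redis/v9"
)

// hopSeen reports whether a hop IP has already been recorded by ANY instance.
// SetNX stores the key only if it does not exist, so exactly one instance
// observes the first sighting; the TTL keeps the set from growing forever.
func hopSeen(ctx context.Context, rdb *redis.Client, ip string) (bool, error) {
	newlySet, err := rdb.SetNX(ctx, "globalSeen:"+ip, 1, 24*time.Hour).Result()
	if err != nil {
		return false, err
	}
	return !newlySet, nil // newlySet == true means the hop had not been seen before
}

func main() {
	ctx := context.Background()
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})
	seen, err := hopSeen(ctx, rdb, "192.0.2.1")
	if err != nil {
		panic(err)
	}
	fmt.Println("already seen:", seen)
}
```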
#### Known Limitations

- **Hop deduplication is instance-local**: Different instances may serve duplicate hops if output_service sends hops to only one instance
- **Per-consumer state is instance-local**: If a consumer switches instances, it gets a new generator and starts from the beginning
- **CIDR files must be present on all instances**: The `cloud-provider-ip-addresses/` directory must exist on each instance
#### Deployment Example

```bash
# Instance 1
./http_input_service &

# Instance 2 (different port)
PORT=8081 ./http_input_service &
```

```nginx
# Load balancer (nginx example)
upstream input_service {
    ip_hash;  # Session affinity
    server 127.0.0.1:8080;
    server 127.0.0.1:8081;
}
```

---
## Output Service (output_service/)

### Multi-Instance Readiness: ✅ **Fully Ready**

#### How It Works

- Each instance maintains its own SQLite database
- Databases are independent and can be aggregated later
- `sentHops` deduplication is instance-local with 24-hour TTL

#### Multi-Instance Deployment

```
ping_service workers ---> Load Balancer ---> output_service instances
```

- No session affinity required
- Each instance stores results independently
- Use the `/dump` endpoint to collect databases from all instances for aggregation
#### Aggregation Strategy

```bash
# Collect databases from all instances
curl http://instance1:8091/dump > instance1.db
curl http://instance2:8091/dump > instance2.db
curl http://instance3:8091/dump > instance3.db

# Merge using sqlite3: start from a copy of the first database so the
# schema already exists, then append rows from the other instances.
cp instance1.db merged.db
sqlite3 merged.db <<EOF
ATTACH 'instance2.db' AS db2;
ATTACH 'instance3.db' AS db3;

INSERT INTO ping_results SELECT * FROM db2.ping_results;
INSERT INTO ping_results SELECT * FROM db3.ping_results;

INSERT INTO traceroute_hops SELECT * FROM db2.traceroute_hops;
INSERT INTO traceroute_hops SELECT * FROM db3.traceroute_hops;
EOF
```
#### Deployment Example

```bash
# Instance 1
./output_service --port=8081 --health-port=8091 --db-dir=/data/output1 &

# Instance 2
./output_service --port=8082 --health-port=8092 --db-dir=/data/output2 &

# Instance 3
./output_service --port=8083 --health-port=8093 --db-dir=/data/output3 &
```

---
## Ping Service (ping_service/)

### Multi-Instance Readiness: ✅ **Fully Ready**

#### How It Works

- Designed from the ground up for distributed operation
- Each worker independently polls input_service and submits results
- Cooldown cache is instance-local (intentional: distributed workers coordinate via the cooldown duration; see the sketch below)
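
Purely as an illustration of that coordination model (not the service's actual code), an instance-local cooldown cache can be as simple as a timestamp map: with N workers and a cooldown of D, a given IP is pinged at most roughly N times per window D.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// cooldownCache remembers when each IP was last pinged by THIS worker.
// Workers never talk to each other; keeping the cooldown long relative to
// the number of workers bounds the aggregate ping rate per target.
type cooldownCache struct {
	mu       sync.Mutex
	lastPing map[string]time.Time
	cooldown time.Duration
}

func newCooldownCache(cooldown time.Duration) *cooldownCache {
	return &cooldownCache{lastPing: make(map[string]time.Time), cooldown: cooldown}
}

// Allow reports whether this worker may ping ip now, and records the attempt.
func (c *cooldownCache) Allow(ip string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if t, ok := c.lastPing[ip]; ok && time.Since(t) < c.cooldown {
		return false
	}
	c.lastPing[ip] = time.Now()
	return true
}

func main() {
	cache := newCooldownCache(60 * time.Minute)
	fmt.Println(cache.Allow("192.0.2.1")) // true  (first attempt)
	fmt.Println(cache.Allow("192.0.2.1")) // false (still cooling down)
}
```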
#### Multi-Instance Deployment

```
input_service <--- ping_service workers (many instances)
                              |
                              v
                       output_service
```

- Deploy as many workers as needed across different networks/locations
- Workers can run on Raspberry Pis, VPSes, cloud instances, etc.
- No coordination required between workers

#### Deployment Example

```bash
# Worker 1 (local network)
./ping_service -config config.yaml &

# Worker 2 (VPS)
ssh vps1 "./ping_service -config config.yaml" &

# Worker 3 (different geographic location)
ssh vps2 "./ping_service -config config.yaml" &
```

---
## Manager (manager/)

### Multi-Instance Readiness: ⚠️ **Requires Configuration**

#### How It Works

- Session store is **in-memory** (not shared across instances)
- User store uses file-based storage with file locking (multi-instance safe as of the latest update)
- Worker registry is instance-local

#### Multi-Instance Deployment Strategies

**Option 1: Active-Passive with Failover**

```
Load Balancer (active-passive)
├── manager instance 1 (active)
└── manager instance 2 (standby)
```

- Only one instance active at a time
- Failover on primary failure
- Simplest approach, no session coordination needed
**Option 2: Shared Session Store (Recommended for Active-Active)**

Implement Redis or database-backed session storage to enable true active-active multi-instance deployment.

**Required Changes for Active-Active:**

```go
// Replace in-memory sessions (main.go:31-34) with Redis
var sessions = redis.NewSessionStore(redisClient)
```
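
For reference, one shape such a shared store could take. This is a hedged sketch built on the go-redis client; the `Store` type, key prefix, and TTL are illustrative assumptions, not the manager's existing API:

```go
package session

import (
	"context"
	"time"

	"github.com/redis/go-redis/v9"
)

// Store keeps session tokens in Redis so any manager instance can validate
// a session created by another instance.
type Store struct {
	rdb *redis.Client
	ttl time.Duration
}

func NewStore(rdb *redis.Client, ttl time.Duration) *Store {
	return &Store{rdb: rdb, ttl: ttl}
}

// Create associates a session token with a username for the configured TTL.
func (s *Store) Create(ctx context.Context, token, username string) error {
	return s.rdb.Set(ctx, "session:"+token, username, s.ttl).Err()
}

// Lookup returns the username for a token, or "" if the session is unknown
// or has expired.
func (s *Store) Lookup(ctx context.Context, token string) (string, error) {
	username, err := s.rdb.Get(ctx, "session:"+token).Result()
	if err == redis.Nil {
		return "", nil
	}
	return username, err
}

// Delete removes a session on logout.
func (s *Store) Delete(ctx context.Context, token string) error {
	return s.rdb.Del(ctx, "session:"+token).Err()
}
```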
#### Current Limitations

- **Sessions are not shared**: User authenticated on instance A cannot access instance B
- **Worker registry is not shared**: Each instance maintains its own worker list
- **dy.fi updates may conflict**: Multiple instances updating the same domain simultaneously

#### User Store File Locking (✅ Fixed)

As of the latest update, the user store uses file locking to prevent race conditions:

- **Shared locks** for reads (multiple readers allowed)
- **Exclusive locks** for writes (blocks all readers and writers)
- **Atomic write-then-rename** prevents corruption
- Safe for multi-instance deployment when instances share the same filesystem
#### Deployment Example (Active-Passive)

```bash
# Primary instance
./manager --port=8080 --domain=manager.dy.fi &

# Secondary instance (standby)
MANAGER_PORT=8081 ./manager &

# Load balancer health-checks both instances and routes traffic to the active one only
```

---
## General Multi-Instance Recommendations

### Health Checks

All services expose `/health` and `/ready` endpoints (a minimal handler sketch follows this list). Configure your load balancer to:

- Route traffic only to healthy instances
- Remove failed instances from rotation automatically
- Monitor the `/metrics` endpoint for Prometheus integration
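
The services already ship these endpoints; purely as an illustration of the liveness/readiness convention the load balancer relies on (not the actual implementation), a minimal handler pair might look like this:

```go
package main

import (
	"net/http"
	"sync/atomic"
)

// ready flips to true once the service has finished loading its state; a load
// balancer should only route traffic to instances whose /ready returns 200.
var ready atomic.Bool

func main() {
	// /health: liveness only; the process is up and serving HTTP.
	http.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})

	// /ready: readiness; fail until initialization (state load, DB open, ...) completes.
	http.HandleFunc("/ready", func(w http.ResponseWriter, r *http.Request) {
		if !ready.Load() {
			http.Error(w, "not ready", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	})

	ready.Store(true) // set after startup work finishes
	http.ListenAndServe(":8080", nil)
}
```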
### Monitoring

Add `instance_id` labels to metrics for per-instance monitoring:

```go
// Recommended enhancement for all services.
// os.Hostname returns (string, error), so capture both values.
var instanceID, _ = os.Hostname()
```
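
A hedged sketch of how that label could be attached with the Prometheus Go client; the metric name and help text are made up for illustration:

```go
package main

import (
	"net/http"
	"os"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var instanceID, _ = os.Hostname()

// requestsTotal carries a constant instance_id label so dashboards can
// break the series down per instance.
var requestsTotal = prometheus.NewCounter(prometheus.CounterOpts{
	Name:        "requests_total",
	Help:        "Total requests handled by this instance.",
	ConstLabels: prometheus.Labels{"instance_id": instanceID},
})

func main() {
	prometheus.MustRegister(requestsTotal)
	http.Handle("/metrics", promhttp.Handler())
	requestsTotal.Inc()
	http.ListenAndServe(":2112", nil)
}
```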
### File Locking

Services that write to shared storage should use file locking (like the manager user store) to prevent corruption:

```go
syscall.Flock(fd, syscall.LOCK_EX) // Exclusive lock
syscall.Flock(fd, syscall.LOCK_SH) // Shared lock
```
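
Tying this together with the write-then-rename pattern described for the manager user store, a hedged sketch of a safe write path on a Unix filesystem might look like the following; the file names and the `writeAtomically` helper are illustrative:

```go
package main

import (
	"os"
	"syscall"
)

// writeAtomically takes an exclusive lock on a lock file, writes the new
// contents to a temporary file, then renames it over the target. Readers
// see either the old file or the new one, never a partial write.
func writeAtomically(path string, data []byte) error {
	lock, err := os.OpenFile(path+".lock", os.O_CREATE|os.O_RDWR, 0o644)
	if err != nil {
		return err
	}
	defer lock.Close()

	// LOCK_EX blocks until no other reader (LOCK_SH) or writer holds the lock.
	if err := syscall.Flock(int(lock.Fd()), syscall.LOCK_EX); err != nil {
		return err
	}
	defer syscall.Flock(int(lock.Fd()), syscall.LOCK_UN)

	tmp := path + ".tmp"
	if err := os.WriteFile(tmp, data, 0o644); err != nil {
		return err
	}
	// Rename is atomic on POSIX filesystems.
	return os.Rename(tmp, path)
}

func main() {
	if err := writeAtomically("users.json", []byte(`{"users":[]}`)); err != nil {
		panic(err)
	}
}
```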
### Network Considerations

- **Latency**: Place input_service close to ping workers to minimize polling latency
- **Bandwidth**: output_service should have sufficient bandwidth for result ingestion
- **NAT Traversal**: Use manager gateway mode for ping workers behind NAT

---
## Troubleshooting Multi-Instance Deployments

### Input Service: Duplicate Hops Served

**Symptom**: Same hop appears multiple times in different workers

**Cause**: Hop deduplication is instance-local

**Solution**: Implement session affinity or broadcast hop submissions

### Manager: Sessions Lost After Reconnect

**Symptom**: User logged out when load balancer switches instances

**Cause**: Sessions are in-memory, not shared

**Solution**: Use session affinity in the load balancer or implement a shared session store

### Output Service: Database Conflicts

**Symptom**: Database file corruption or lock timeouts

**Cause**: Multiple instances writing to the same database file

**Solution**: Each instance MUST have its own `--db-dir`; aggregate the databases later

### Ping Service: Excessive Pinging

**Symptom**: Same IP pinged too frequently

**Cause**: Too many workers with a short cooldown period

**Solution**: Increase `cooldown_minutes` in config.yaml. As a rule of thumb, each worker may ping a given IP once per cooldown window, so size the cooldown to the total number of workers.

---
## Production Deployment Checklist

- [ ] Input service: Configure session affinity or hop broadcast
- [ ] Output service: Each instance has unique `--db-dir`
- [ ] Ping service: Cooldown duration accounts for total worker count
- [ ] Manager: Decide active-passive or implement shared sessions
- [ ] All services: Health check endpoints configured in load balancer
- [ ] All services: Metrics exported to monitoring system
- [ ] All services: Logs aggregated to central logging system
- [ ] File-based state: Shared filesystem or backup/sync strategy
- [ ] Database rotation: Automated collection of output service dumps

---
## Future Enhancements

### High Priority

1. **Shared session store for manager** (Redis/database)
2. **Shared hop deduplication for input_service** (Redis)
3. **Distributed worker coordination** for ping_service cooldowns

### Medium Priority

4. **Instance ID labels in metrics** for better observability
5. **Graceful shutdown coordination** to prevent data loss
6. **Health check improvements** to verify actual functionality

### Low Priority

7. **Automated database aggregation** for output_service
8. **Service mesh integration** (Consul, etcd) for discovery
9. **Horizontal autoscaling** based on load metrics

---
## Summary Table

| Service | Multi-Instance Ready | Session Affinity Needed | Shared Storage Needed | Notes |
|---------|---------------------|------------------------|---------------------|-------|
| input_service | ⚠️ Partial | ✅ Yes (recommended) | ❌ No | Hop dedup is instance-local |
| output_service | ✅ Full | ❌ No | ❌ No | Each instance has own DB |
| ping_service | ✅ Full | ❌ No | ❌ No | Fully distributed by design |
| manager | ⚠️ Requires config | ✅ Yes (sessions) | ✅ Yes (user store) | Sessions in-memory; user store file-locked |

---

For questions or issues with multi-instance deployments, refer to the service-specific README files or open an issue in the project repository.