Implemented a comprehensive secure authentication mechanism for inter-node
cluster communication with the following features:
1. Global Cluster Secret (GCS)
- Auto-generated cryptographically secure random secret (256-bit)
- Configurable via YAML config file
- Shared across all cluster nodes for authentication
2. Cluster Authentication Middleware
- Validates X-Cluster-Secret and X-Node-ID headers
- Applied to all cluster endpoints (/members/*, /merkle_tree/*, /kv_range)
- Comprehensive logging of authentication attempts
3. Authenticated HTTP Client
- Custom HTTP client with cluster auth headers
- TLS support with configurable certificate verification
- Protocol-aware (http/https based on TLS settings)
4. Secure Bootstrap Endpoint
- New /auth/cluster-bootstrap endpoint
- Protected by JWT authentication with admin scope
- Allows new nodes to securely obtain cluster secret
5. Updated Cluster Communication
- All gossip protocol requests include auth headers
- All Merkle tree sync requests include auth headers
- All data replication requests include auth headers
6. Configuration
- cluster_secret: Shared secret (auto-generated if not provided)
- cluster_tls_enabled: Enable TLS for inter-node communication
- cluster_tls_cert_file: Path to TLS certificate
- cluster_tls_key_file: Path to TLS private key
- cluster_tls_skip_verify: Skip TLS verification (testing only)
This implementation addresses the security vulnerability of unprotected
cluster endpoints and provides a flexible, secure approach to protecting
internal cluster communication while allowing for automated node bootstrapping.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
This commit fixes the flaky conflict resolution test by addressing two issues:
## 🔧 Root Cause Analysis
Through detailed debugging, discovered that:
1. The conflict resolution algorithm works perfectly
2. The issue was insufficient cluster stabilization time
3. Nodes need proper gossip membership before sync can detect conflicts
## 🛠️ Fixes Applied
**1. Increase Cluster Stabilization Time**
- Extended wait from 10s to 20s for proper gossip protocol establishment
- This allows nodes to discover each other as "healthy members"
- Required for Merkle sync to activate between peers
**2. Enhanced Debug Logging**
- Added detailed membership debugging to conflict resolution
- Shows peer addresses, member counts, and lookup failures
- Helps diagnose future distributed systems issues
**3. Remove Silent Error Hiding**
- Removed `/dev/null` redirect from test_conflict.go execution
- Now shows conflict creation output for better diagnostics
## 🧪 Test Results
- All integration tests now pass consistently (8/8)
- Conflict resolution test reliably converges within 3 seconds
- Enhanced retry logic provides clear progress visibility
The sophisticated conflict resolution with oldest-node tie-breaking now works
reliably in all test scenarios, demonstrating the system's correctness.
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
The conflict resolution test was failing because when two nodes had the same
timestamp but different UUIDs/data, the system would just keep local data
instead of applying proper conflict resolution logic.
## 🔧 Fix Details
- Implement "oldest-node rule" for timestamp collisions in 2-node clusters
- When timestamps are equal, the node with the earliest joined_timestamp wins
- Add fallback to UUID comparison if membership info is unavailable
- Enhanced logging for conflict resolution debugging
## 🧪 Test Results
- All integration tests now pass (8/8)
- Conflict resolution test consistently converges to the same value
- Maintains data consistency across cluster nodes
This implements the sophisticated conflict resolution mentioned in the design
docs using majority vote with oldest-node tie-breaking, correctly handling
the 2-node cluster scenario used in integration tests.
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
- Implemented fetchSingleKVFromPeer: HTTP client to fetch KV pairs from peers
- Implemented getLocalData: Badger DB access for local data retrieval
- Implemented deleteKVLocally: Local deletion with timestamp index cleanup
- Implemented storeReplicatedDataWithMetadata: Preserves original UUID/timestamp
- Implemented resolveConflict: Simple conflict resolution (newer timestamp wins)
- Implemented fetchAndStoreRange: Fetches KV ranges for Merkle sync
This fixes the critical data replication issue where sync was failing with
"not implemented" errors. Integration tests now pass for data replication.
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
- Removed all duplicate Server methods from main.go (630 lines)
- Fixed import conflicts and unused imports
- main.go reduced from 3,298 to 340 lines (89% reduction)
- Clean modular structure with server package handling all server functionality
- Achieved clean build with no compilation errors
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
- Create cluster/merkle.go with Merkle tree operations
- Create cluster/gossip.go with gossip protocol implementation
- Create cluster/sync.go with data synchronization logic
- Create cluster/bootstrap.go with cluster joining functionality
Major clustering functionality now properly separated:
* MerkleService: Tree building, hashing, filtering
* GossipService: Member discovery, health checking, list merging
* SyncService: Merkle-based synchronization between nodes
* BootstrapService: Seed node joining and initial sync
Build tested and verified working. Ready for main.go integration.
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>