Gossip in Go, lazy-syncing K/V database
Software Design Document: Clustered Key-Value Store
1. Introduction
1.1 Goals
This document outlines the design for a minimalistic, clustered key-value database system written in Go. The primary goals are:
- Eventual Consistency: Prioritize availability and partition tolerance over strong consistency.
- Local-First Truth: Local operations should be fast, with replication happening in the background.
- Gossip-Style Membership: Decentralized mechanism for nodes to discover and track each other.
- Hierarchical Keys: Support for structured keys (e.g., `/home/room/closet/socks`).
- Minimalistic Footprint: Efficient resource usage on servers.
- Simple Configuration & Operation: Easy to deploy and manage.
- Read-Only Mode: Ability for nodes to restrict external writes.
- Gradual Bootstrapping: New nodes integrate smoothly without overwhelming the cluster.
- Sophisticated Conflict Resolution: Handle timestamp collisions using majority vote, with oldest node as tie-breaker.
1.2 Non-Goals
- Strong (linearizable/serializable) consistency.
- Complex querying or indexing beyond key-based lookups and timestamp-filtered UUID lists.
- Transaction support across multiple keys.
2. Architecture Overview
The system will consist of independent Go services (nodes) that communicate via HTTP/REST. Each node will embed a BadgerDB instance for local data storage and manage its own membership list through a gossip protocol. External clients interact with any available node, which then participates in the cluster's eventual consistency model.
Key Architectural Principles:
- Decentralized: No central coordinator or leader.
- Peer-to-Peer: Nodes communicate directly with each other for replication and membership.
- API-Driven: All interactions, both external (clients) and internal (replication), occur over a RESTful HTTP API.
```
+----------------+        +----------------+        +----------------+
|     Node A     |        |     Node B     |        |     Node C     |
|  (Go Service)  |        |  (Go Service)  |        |  (Go Service)  |
|                |        |                |        |                |
| +------------+ |  <---- | +------------+ |  <---- | +------------+ |
| | HTTP Server| |  ----> | | HTTP Server| |  ----> | | HTTP Server| |
| |   (API)    | |        | |   (API)    | |        | |   (API)    | |
| +------------+ |        | +------------+ |        | +------------+ |
|                |        |                |        |                |
| +------------+ |        | +------------+ |        | +------------+ |
| |   Gossip   | | <----> | |   Gossip   | | <----> | |   Gossip   | |
| |  Manager   | |        | |  Manager   | |        | |  Manager   | |
| +------------+ |        | +------------+ |        | +------------+ |
|                |        |                |        |                |
| +------------+ |        | +------------+ |        | +------------+ |
| | Replication| | <----> | | Replication| | <----> | | Replication| |
| |   Logic    | |        | |   Logic    | |        | |   Logic    | |
| +------------+ |        | +------------+ |        | +------------+ |
|                |        |                |        |                |
| +------------+ |        | +------------+ |        | +------------+ |
| |  BadgerDB  | |        | |  BadgerDB  | |        | |  BadgerDB  | |
| | (Local KV) | |        | | (Local KV) | |        | | (Local KV) | |
| +------------+ |        | +------------+ |        | +------------+ |
+----------------+        +----------------+        +----------------+
        ^
        |
        +----- External Clients (Interact with any Node's API)
```
3. Data Model
3.1 Logical Data Structure
Data is logically stored as a key-value pair, where the key is a hierarchical path and the value is a JSON object. Each pair also carries metadata for consistency and conflict resolution.
- Logical Key: `string` (e.g., `/home/room/closet/socks`)
- Logical Value: JSON object (e.g., `{"count":7,"colors":["blue","red","black"]}`)
3.2 Internal Storage Structure (BadgerDB)
BadgerDB is a flat key-value store. To accommodate hierarchical keys and metadata, the following mapping will be used:
- BadgerDB Key: The full logical key path, with the leading `/kv/` prefix removed. Path segments will be separated by `/`. No leading `/` will be stored in the BadgerDB key.
  - Example: For logical key `/kv/home/room/closet/socks`, the BadgerDB key will be `home/room/closet/socks`.
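This key mapping can be sketched as a small helper; the function name is illustrative, not part of the design:

```go
package main

import (
	"fmt"
	"strings"
)

// badgerKey derives the BadgerDB key from an HTTP request path as described
// above: strip the leading "/kv/" prefix and any remaining leading "/".
func badgerKey(requestPath string) string {
	p := strings.TrimPrefix(requestPath, "/kv/")
	return strings.TrimPrefix(p, "/")
}

func main() {
	fmt.Println(badgerKey("/kv/home/room/closet/socks")) // home/room/closet/socks
}
```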
- BadgerDB Value: A marshaled JSON object containing the `uuid`, `timestamp`, and the actual `data` JSON object. This allows for consistent versioning and conflict resolution.

  ```json
  // Example BadgerDB Value (marshaled JSON string)
  {
    "uuid": "a1b2c3d4-e5f6-7890-1234-567890abcdef",
    "timestamp": 1672531200000, // Unix timestamp in milliseconds
    "data": { "count": 7, "colors": ["blue", "red", "black"] }
  }
  ```

  - `uuid` (string): A UUIDv4, a unique identifier for this specific version of the data.
  - `timestamp` (int64): Unix timestamp representing the time of the last modification, in milliseconds since epoch, providing higher precision and reducing collision risk. This is the primary mechanism for conflict resolution ("newest data wins").
  - `data` (JSON object): The actual user-provided JSON payload.
4. API Endpoints
All endpoints will communicate over HTTP/1.1 and utilize JSON for request/response bodies.
4.1 `/kv/` Endpoints (Data Operations - External/Internal)
These endpoints are for direct key-value manipulation by external clients and are also used internally by nodes when fetching full data during replication.
- `GET /kv/{path}`
  - Description: Retrieves the JSON object associated with the given hierarchical key path.
  - Request: No body.
  - Responses:
    - `200 OK`: `Content-Type: application/json` with the stored JSON object.
    - `404 Not Found`: If the key does not exist.
    - `500 Internal Server Error`: For server-side issues (e.g., BadgerDB error).
  - Example: `GET /kv/home/room/closet/socks` -> `{"count":7,"colors":["blue","red","black"]}`
- `PUT /kv/{path}`
  - Description: Creates or updates a JSON object at the given path. This operation internally generates a new UUIDv4 and assigns the current Unix timestamp (milliseconds) to the stored value.
  - Request: `Content-Type: application/json`
    - Body: The JSON object to store.
  - Responses:
    - `200 OK` (Update) or `201 Created` (New): On success, returns `{"uuid": "new-uuid", "timestamp": new_timestamp_ms}`.
    - `400 Bad Request`: If the request body is not valid JSON.
    - `403 Forbidden`: If the node is in "read-only" mode and the request's origin is not a recognized cluster member (checked via IP/hostname).
    - `500 Internal Server Error`: For server-side issues.
  - Example: `PUT /kv/settings/theme` with body `{"color":"dark","font_size":14}` -> `{"uuid": "...", "timestamp": ...}`
- `DELETE /kv/{path}`
  - Description: Deletes the key-value pair at the given path.
  - Request: No body.
  - Responses:
    - `204 No Content`: On successful deletion.
    - `404 Not Found`: If the key does not exist.
    - `403 Forbidden`: If the node is in "read-only" mode and the request is not from a recognized cluster member.
    - `500 Internal Server Error`: For server-side issues.
4.2 `/members/` Endpoints (Membership & Internal Replication)
These endpoints are primarily for internal communication between cluster nodes, managing membership and facilitating data synchronization.
- `GET /members/`
  - Description: Returns a list of known active members in the cluster. This list is maintained locally by each node based on the gossip protocol.
  - Request: No body.
  - Responses:
    - `200 OK`: `Content-Type: application/json` with a JSON array of member details.

      ```json
      [
        {"id": "node-alpha", "address": "192.168.1.10:8080", "last_seen": 1672531200000, "joined_timestamp": 1672530000000},
        {"id": "node-beta", "address": "192.168.1.11:8080", "last_seen": 1672531205000, "joined_timestamp": 1672530100000}
      ]
      ```

      - `id` (string): Unique identifier for the node.
      - `address` (string): `host:port` of the node's API endpoint.
      - `last_seen` (int64): Unix timestamp (milliseconds) of when this node was last successfully contacted or heard from.
      - `joined_timestamp` (int64): Unix timestamp (milliseconds) of when this node first joined the cluster. This is crucial for tie-breaking conflicts.
    - `500 Internal Server Error`: For server-side issues.
- `POST /members/join`
  - Description: Used by a new node to announce its presence and attempt to join the cluster. Existing nodes use this to update their member list and respond with their current view of the cluster.
  - Request: `Content-Type: application/json`
    - Body: `{"id": "node-gamma", "address": "192.168.1.12:8080", "joined_timestamp": 1672532000000}`
    - `joined_timestamp` will be set by the joining node (its startup time).
  - Responses:
    - `200 OK`: Acknowledgment, returning the current list of known members to the joining node (same format as `GET /members/`).
    - `400 Bad Request`: If the request body is malformed.
    - `500 Internal Server Error`: For server-side issues.
- `DELETE /members/leave` (Optional, for graceful shutdown)
  - Description: A member can proactively announce its departure from the cluster. This allows other nodes to quickly mark it as inactive.
  - Request: `Content-Type: application/json`
    - Body: `{"id": "node-gamma"}`
  - Responses:
    - `204 No Content`: On successful processing.
    - `400 Bad Request`: If the request body is malformed.
    - `500 Internal Server Error`: For server-side issues.
- `POST /members/pairs_by_time` (Internal/Replication Endpoint)
  - Description: Used by other cluster members to request a list of key paths, their UUIDs, and their timestamps within a specified time range, optionally filtered by a key prefix. This is critical for both gradual bootstrapping and the regular 5-minute synchronization.
  - Request: `Content-Type: application/json`
    - Body:

      ```json
      {
        "start_timestamp": 1672531200000, // Unix milliseconds (inclusive)
        "end_timestamp": 1672617600000,   // Unix milliseconds (exclusive), or 0 for "up to now"
        "limit": 15,                      // Max number of pairs to return
        "prefix": "home/room/"            // Optional: filter by BadgerDB key prefix
      }
      ```

    - `start_timestamp`: Earliest timestamp for data to be included.
    - `end_timestamp`: Latest timestamp (exclusive). If `0` or omitted, it implies "up to the current time".
    - `limit`: Fixed at 15 for this design, to control batch size during sync.
    - `prefix`: Optional, to filter keys by a common BadgerDB key prefix.
  - Responses:
    - `200 OK`: `Content-Type: application/json` with a JSON array of objects:

      ```json
      [
        {"path": "home/room/closet/socks", "uuid": "...", "timestamp": 1672531200000},
        {"path": "users/john/profile", "uuid": "...", "timestamp": 1672531205000}
      ]
      ```

    - `204 No Content`: If no data matches the criteria.
    - `400 Bad Request`: If the request body is malformed or timestamps are invalid.
    - `500 Internal Server Error`: For server-side issues.
5. BadgerDB Integration
BadgerDB will be used as the embedded, local, single-node key-value store.
- Key Storage: As described in section 3.2, the HTTP path (without the `/kv/` prefix and with no leading `/`) will directly map to the BadgerDB key.
- Value Storage: Values will be marshaled JSON objects (`uuid`, `timestamp`, `data`).
- Timestamp Indexing (for `pairs_by_time`): To efficiently query by timestamp, a manual secondary index will be maintained. Each `PUT` operation will write two BadgerDB entries:
  - The primary data entry: `{badger_key}` -> `{uuid, timestamp, data}`.
  - A secondary timestamp index entry: `_ts:{timestamp_ms}:{badger_key}` -> `{uuid}`.
    - The `_ts` prefix ensures these index keys are grouped and don't conflict with data keys.
    - The timestamp (milliseconds) ensures lexicographical sorting by time.
    - The `badger_key` in the index key allows for uniqueness and points back to the main data.
    - The value can simply be the `uuid` or even an empty string if only the key is needed. Storing the `uuid` here is useful for direct lookups.
- `DELETE` Operations: A `DELETE /kv/{path}` will remove both the primary data entry and its corresponding secondary index entry from BadgerDB.
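The index key layout can be sketched like this. Zero-padding the timestamp to a fixed 13 digits is an added assumption: millisecond timestamps are 13 digits for current dates, and a fixed width keeps lexicographic order equal to numeric order even for shorter values:

```go
package main

import "fmt"

// tsIndexKey builds the secondary index key "_ts:{timestamp_ms}:{badger_key}".
// %013d zero-pads so that byte-wise (lexicographic) ordering matches time order.
func tsIndexKey(timestampMs int64, badgerKey string) string {
	return fmt.Sprintf("_ts:%013d:%s", timestampMs, badgerKey)
}

func main() {
	fmt.Println(tsIndexKey(1672531200000, "home/room/closet/socks"))
}
```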
6. Clustering and Consistency
6.1 Membership Management (Gossip Protocol)
- Each node maintains a local list of known cluster members (Node ID, Address, Last Seen Timestamp, Joined Timestamp).
- After each gossip round, every node waits a random interval between 1 and 2 minutes before initiating the next one.
- In a gossip round, the node randomly selects a subset of its healthy known members (e.g., 1-3 nodes) and performs a "gossip exchange":
  - It sends its current local member list to the selected peers.
  - Peers merge the received list with their own, updating `last_seen` timestamps for existing members and adding new ones.
- If a node fails to respond to multiple gossip attempts, it is eventually marked as "suspected down" and then "dead" after a configurable timeout.
6.2 Data Replication (Periodic Syncs)
- The system uses two types of data synchronization:
  - Regular 5-Minute Sync: Catching up on recent changes.
  - Catch-Up Sync (2-Minute Cycles): For nodes that detect they are significantly behind.
- Regular 5-Minute Sync:
  - Every 5 minutes, each node initiates a data synchronization cycle.
  - It selects a random healthy peer.
  - It sends `POST /members/pairs_by_time` to the peer, requesting the 15 latest UUIDs (by setting `limit: 15` and `end_timestamp: current_time_ms`, with `start_timestamp: 0` or a very old value to ensure enough items are considered).
  - The remote node responds with its 15 latest (path, uuid, timestamp) pairs.
  - The local node compares these with its own latest 15. If it finds any data it doesn't have, or holds an older version of data the peer has, it fetches the full data via `GET /kv/{path}` and updates its local store.
  - If the local node detects it is significantly behind (e.g., many of the remote node's latest 15 UUIDs are missing locally or newer than the local copies, indicating a large gap), it triggers the Catch-Up Sync.
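The "significantly behind" check can be sketched as follows. The 50% threshold is an assumption for illustration; the document does not fix a value:

```go
package main

import "fmt"

// remotePair is a minimal stand-in for one (path, timestamp) entry from a
// peer's latest-15 response.
type remotePair struct {
	Path        string
	TimestampMs int64
}

// needsCatchUp counts how many of the peer's latest pairs are missing locally
// or locally older, and triggers catch-up when more than half are.
func needsCatchUp(remote []remotePair, localTs map[string]int64) bool {
	behind := 0
	for _, p := range remote {
		if ts, ok := localTs[p.Path]; !ok || ts < p.TimestampMs {
			behind++
		}
	}
	return 2*behind > len(remote)
}

func main() {
	remote := []remotePair{{"a", 100}, {"b", 200}, {"c", 300}}
	local := map[string]int64{"a": 100} // "b" and "c" are missing locally
	fmt.Println(needsCatchUp(remote, local)) // true
}
```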
- Catch-Up Sync (2-Minute Cycles):
  - This mode is activated when a node determines it is behind its peers (e.g., during the 5-minute sync or bootstrapping).
  - It runs every 2 minutes (ensuring it doesn't align with the 5-minute sync).
  - The node identifies the oldest timestamp among the peer's latest 15 pairs from its last regular sync.
  - It then sends `POST /members/pairs_by_time` to a random healthy peer, requesting 15 UUIDs older than that timestamp (e.g., `end_timestamp: oldest_known_timestamp_ms`, `limit: 15`, `start_timestamp: 0` or further back).
  - It iterates backwards in time in 2-minute cycles, progressively asking for older sets of 15 UUIDs until it has caught up to a reasonable historical depth (the configured `BOOTSTRAP_MAX_AGE_MILLIS`).
  - History Depth: The system aims to keep track of at least 3 revisions per path for conflict resolution and, eventually, versioning. `BOOTSTRAP_MAX_AGE_MILLIS` (defaulting to 30 days) governs how far back in time a node will actively fetch during a full sync.
6.3 Conflict Resolution
When two nodes have different versions of the same key (same BadgerDB key), the conflict resolution logic is applied:
- Timestamp Wins: The data with the most recent `timestamp` (Unix milliseconds) is considered the correct version.
- Timestamp Collision (Tie-Breaker): If two conflicting versions have the exact same `timestamp`:
  - Majority Vote: The system will query a quorum of healthy peers (`GET /kv/{path}` or an internal check for UUID/timestamp) to see which UUID/timestamp pair the majority holds. The version held by the majority wins.
  - Oldest Node Priority (Tie-Breaker for Majority): If the vote is split (e.g., 2 nodes hold version A and 2 nodes hold version B), the version held by the node with the oldest `joined_timestamp` (i.e., the oldest active member in the cluster) takes precedence. This provides a deterministic tie-breaker.
  - Implementation Note: For the majority vote, a node might need to request the `{"uuid", "timestamp"}` pairs for a specific `path` from multiple peers. This implies an internal query mechanism or aggregating responses from `pairs_by_time` for the specific key.
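The three-stage resolution rule can be sketched as a pure function over collected peer responses; all type and field names here are illustrative:

```go
package main

import (
	"fmt"
	"sort"
)

// Vote pairs a data version with the joined_timestamp of the member holding it.
type Vote struct {
	UUID        string
	TimestampMs int64 // data timestamp (Unix ms)
	JoinedMs    int64 // holder's joined_timestamp (Unix ms)
}

// resolve sketches section 6.3: the newest data timestamp wins; on a
// timestamp collision, the UUID held by the majority wins; on a split vote,
// the UUID held by the oldest member (smallest joined_timestamp) wins.
func resolve(votes []Vote) string {
	var newest int64
	for _, v := range votes {
		if v.TimestampMs > newest {
			newest = v.TimestampMs
		}
	}
	counts := map[string]int{}
	oldestHolder := map[string]int64{}
	for _, v := range votes {
		if v.TimestampMs != newest {
			continue // only versions carrying the newest timestamp compete
		}
		counts[v.UUID]++
		if j, ok := oldestHolder[v.UUID]; !ok || v.JoinedMs < j {
			oldestHolder[v.UUID] = v.JoinedMs
		}
	}
	uuids := make([]string, 0, len(counts))
	for u := range counts {
		uuids = append(uuids, u)
	}
	sort.Slice(uuids, func(i, j int) bool {
		if counts[uuids[i]] != counts[uuids[j]] {
			return counts[uuids[i]] > counts[uuids[j]] // majority first
		}
		return oldestHolder[uuids[i]] < oldestHolder[uuids[j]] // oldest member breaks ties
	})
	return uuids[0]
}

func main() {
	// 2-2 split on the same timestamp: version "B" is held by the oldest member.
	votes := []Vote{
		{"A", 200, 30}, {"A", 200, 40},
		{"B", 200, 10}, {"B", 200, 50},
	}
	fmt.Println(resolve(votes)) // B
}
```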
7. Bootstrapping New Nodes (Gradual Full Sync)
This process is initiated when a new node starts up and has no existing data or member list.
- Seed Node Configuration: The new node must be configured with a list of initial `seed_nodes` (e.g., `["host1:port", "host2:port"]`).
- Join Request: The new node attempts to `POST /members/join` to one of its configured seed nodes, providing its own `id`, `address`, and its `joined_timestamp` (its startup time).
- Member List Discovery: Upon a successful join, the seed node responds with its current list of known cluster members. The new node populates its local member list.
- Gradual Data Synchronization Loop (Catch-Up Mode):
  - The new node sets its `current_end_timestamp = current_time_ms`.
  - It defines a `sync_batch_size` (e.g., 15 UUIDs per request, as per the `pairs_by_time` `limit`).
  - It also defines a `throttle_delay` (e.g., 100ms between `pairs_by_time` requests to different peers) and a `fetch_delay` (e.g., 50ms between individual `GET /kv/{path}` requests for full data).
  - Loop backwards in time:
    - The node determines the `oldest_timestamp_fetched` from its last batch of `sync_batch_size` items. Initially, this is `current_time_ms`.
    - Randomly pick a healthy peer from its member list.
    - Send `POST /members/pairs_by_time` to the peer with `end_timestamp: oldest_timestamp_fetched`, `limit: sync_batch_size`, and `start_timestamp: 0`. This asks for 15 items older than the oldest one just processed.
    - Process the received `{"path", "uuid", "timestamp"}` pairs:
      - For each remote pair, fetch the local version from BadgerDB.
      - Conflict Resolution: Apply the logic from section 6.3. If local data is missing or older, initiate a `GET /kv/{path}` to fetch the full data and store it.
    - Throttling:
      - Wait for `throttle_delay` after each `pairs_by_time` request.
      - Wait for `fetch_delay` after each individual `GET /kv/{path}` request for full data.
    - Termination: The loop continues until `oldest_timestamp_fetched` goes below the configured `BOOTSTRAP_MAX_AGE_MILLIS` (defaulting to 30 days ago). The node may also terminate if multiple consecutive `pairs_by_time` queries return no new (older) data.
- Full Participation: Once the gradual sync is complete, the node fully participates in the regular 5-minute replication cycles and accepts external client writes (if not in read-only mode). During the sync, the node operates in a `syncing` mode, rejecting external client writes with `503 Service Unavailable`.
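One iteration of the backward-in-time loop can be sketched as a pure helper; the function and parameter names are assumptions:

```go
package main

import "fmt"

// stepBackward finds the oldest timestamp in the batch just received and
// decides whether to keep paging: the batch must be non-empty and still newer
// than the BOOTSTRAP_MAX_AGE_MILLIS cutoff.
func stepBackward(batchTimestamps []int64, cutoffMs int64) (oldest int64, keepGoing bool) {
	if len(batchTimestamps) == 0 {
		return 0, false // peer has nothing older: terminate
	}
	oldest = batchTimestamps[0]
	for _, ts := range batchTimestamps[1:] {
		if ts < oldest {
			oldest = ts
		}
	}
	return oldest, oldest > cutoffMs
}

func main() {
	oldest, more := stepBackward([]int64{300, 100, 200}, 50)
	fmt.Println(oldest, more) // 100 true
}
```

The returned `oldest` becomes the next request's `end_timestamp`, so each cycle pages strictly further into the past.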
8. Operational Modes
- Normal Mode: Full read/write capabilities; participates in all replication and gossip activities.
- Read-Only Mode:
  - The node rejects `PUT` and `DELETE` requests from external clients with a `403 Forbidden` status.
  - It still accepts `PUT` and `DELETE` operations that originate from recognized cluster members during replication, allowing it to remain eventually consistent.
  - `GET` requests are always allowed.
  - This mode is primarily for reducing write load or protecting data on specific nodes.
- Syncing Mode (Internal, during Bootstrap):
  - While a new node is undergoing its initial gradual sync, it operates in this internal mode.
  - External `PUT`/`DELETE` requests are rejected with `503 Service Unavailable`.
  - Internal replication from other members is fully active.
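Mode gating for writes can be sketched as a single decision function. The mode strings follow the document; the cluster-membership check (IP/hostname based, per section 4.1) is assumed to be done by the caller:

```go
package main

import (
	"fmt"
	"net/http"
)

// writeStatus returns the HTTP status with which a PUT/DELETE should be
// rejected, or 0 if the write is allowed.
func writeStatus(mode string, isClusterMember bool) int {
	switch {
	case isClusterMember:
		return 0 // internal replication writes are always accepted
	case mode == "read-only":
		return http.StatusForbidden // 403 for external writes
	case mode == "syncing":
		return http.StatusServiceUnavailable // 503 during bootstrap
	default:
		return 0 // normal mode
	}
}

func main() {
	fmt.Println(writeStatus("read-only", false)) // 403
}
```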
9. Logging
A structured logging library (e.g., `zap` or `logrus`) will be used.
- Log Levels: Support for `DEBUG`, `INFO`, `WARN`, `ERROR`, and `FATAL`; configurable.
- Log Format: JSON for easy parsing by log aggregators.
- Key Events to Log:
- Startup/Shutdown: Server start/stop, configuration loaded.
- API Requests: Incoming HTTP request details (method, path, client IP, status code, duration).
- BadgerDB Operations: Errors during put/get/delete, database open/close, secondary index operations.
- Membership: Node joined/left, gossip rounds initiated/received, member status changes (up, suspected, down), tie-breaker decisions.
- Replication: Sync cycle start/end, type of sync (regular/catch-up), number of keys compared, number of keys fetched, conflict resolutions (including details of timestamp collision resolution).
- Errors: Data serialization/deserialization, network errors, unhandled exceptions.
- Operational Mode Changes: Entering/exiting read-only mode, syncing mode.
10. Future Work (Rough Order of Priority)
These items are considered out of scope for the initial design but are planned for future versions.
- Authentication/Authorization (Before First Release): Implement robust authentication for API endpoints (e.g., API keys, mTLS) and potentially basic authorization for access to `kv` paths.
- Client Libraries/Functions (Bash, Python, Go): Develop official client libraries or helper functions to simplify interaction with the API for common programming environments.
- Data Compression (gzip): Implement gzip compression for data values stored in BadgerDB to reduce storage footprint and potentially improve I/O performance.
- Data Revisions & Simple Backups:
  - Hold at least 3 revisions per path. This would involve a mechanism to store previous versions of data when a `PUT` occurs, potentially using a separate BadgerDB key namespace (e.g., `_rev:{badger_key}:{timestamp_of_revision}`).
  - The current `GET /kv/{path}` would continue to return only the latest revision. A new API might be introduced to fetch specific historical revisions.
  - Simple backup strategies could leverage these revisions or BadgerDB's native snapshot capabilities.
- Monitoring & Metrics (Grafana Support in v3): Integrate with a metrics system like Prometheus, exposing key performance indicators (e.g., request rates, error rates, replication lag, BadgerDB stats) for visualization in dashboards like Grafana.