Gossip in Go, lazy-syncing K/V database
Software Design Document: Clustered Key-Value Store
1. Introduction
1.1 Goals
This document outlines the design for a minimalistic, clustered key-value database system written in Go. The primary goals are:
- Eventual Consistency: Prioritize availability and partition tolerance over strong consistency.
- Local-First Truth: Local operations should be fast, with replication happening in the background.
- Gossip-Style Membership: Decentralized mechanism for nodes to discover and track each other.
- Hierarchical Keys: Support for structured keys (e.g., `/home/room/closet/socks`).
- Minimalistic Footprint: Efficient resource usage on servers.
- Simple Configuration & Operation: Easy to deploy and manage.
- Read-Only Mode: Ability for nodes to restrict external writes.
- Gradual Bootstrapping: New nodes integrate smoothly without overwhelming the cluster.
- Sophisticated Conflict Resolution: Handle timestamp collisions using majority vote, with oldest node as tie-breaker.
1.2 Non-Goals
- Strong (linearizable/serializable) consistency.
- Complex querying or indexing beyond key-based lookups and timestamp-filtered UUID lists.
- Transaction support across multiple keys.
2. Architecture Overview
The system will consist of independent Go services (nodes) that communicate via HTTP/REST. Each node will embed a BadgerDB instance for local data storage and manage its own membership list through a gossip protocol. External clients interact with any available node, which then participates in the cluster's eventual consistency model.
Key Architectural Principles:
- Decentralized: No central coordinator or leader.
- Peer-to-Peer: Nodes communicate directly with each other for replication and membership.
- API-Driven: All interactions, both external (clients) and internal (replication), occur over a RESTful HTTP API.
```
+----------------+        +----------------+        +----------------+
|     Node A     |        |     Node B     |        |     Node C     |
|  (Go Service)  |        |  (Go Service)  |        |  (Go Service)  |
|                |        |                |        |                |
| +------------+ |  <---- | +------------+ |  <---- | +------------+ |
| | HTTP Server| |  ----> | | HTTP Server| |  ----> | | HTTP Server| |
| |   (API)    | |        | |   (API)    | |        | |   (API)    | |
| +------------+ |        | +------------+ |        | +------------+ |
|                |        |                |        |                |
| +------------+ |        | +------------+ |        | +------------+ |
| |   Gossip   | | <----> | |   Gossip   | | <----> | |   Gossip   | |
| |  Manager   | |        | |  Manager   | |        | |  Manager   | |
| +------------+ |        | +------------+ |        | +------------+ |
|                |        |                |        |                |
| +------------+ |        | +------------+ |        | +------------+ |
| | Replication| | <----> | | Replication| | <----> | | Replication| |
| |   Logic    | |        | |   Logic    | |        | |   Logic    | |
| +------------+ |        | +------------+ |        | +------------+ |
|                |        |                |        |                |
| +------------+ |        | +------------+ |        | +------------+ |
| |  BadgerDB  | |        | |  BadgerDB  | |        | |  BadgerDB  | |
| | (Local KV) | |        | | (Local KV) | |        | | (Local KV) | |
| +------------+ |        | +------------+ |        | +------------+ |
+----------------+        +----------------+        +----------------+
        ^
        |
        +----- External Clients (Interact with any Node's API)
```
3. Data Model
3.1 Logical Data Structure
Data is logically stored as a key-value pair, where the key is a hierarchical path and the value is a JSON object. Each pair also carries metadata for consistency and conflict resolution.
- Logical Key: `string` (e.g., `/home/room/closet/socks`)
- Logical Value: JSON object (e.g., `{"count":7,"colors":["blue","red","black"]}`)
3.2 Internal Storage Structure (BadgerDB)
BadgerDB is a flat key-value store. To accommodate hierarchical keys and metadata, the following mapping will be used:
- BadgerDB Key: The full logical key path, with the leading `/kv/` prefix removed. Path segments will be separated by `/`. No leading `/` will be stored in the BadgerDB key.
  - Example: For logical key `/kv/home/room/closet/socks`, the BadgerDB key will be `home/room/closet/socks`.
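This key mapping can be sketched as a small helper; the function name is illustrative, not part of the design:

```go
package main

import (
	"fmt"
	"strings"
)

// badgerKey derives the BadgerDB key from an HTTP request path as described
// above: strip the leading "/kv/" prefix and any remaining leading "/".
func badgerKey(requestPath string) string {
	p := strings.TrimPrefix(requestPath, "/kv/")
	return strings.TrimPrefix(p, "/")
}

func main() {
	fmt.Println(badgerKey("/kv/home/room/closet/socks")) // home/room/closet/socks
}
```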
- BadgerDB Value: A marshaled JSON object containing the `uuid`, `timestamp`, and the actual `data` JSON object. This allows for consistent versioning and conflict resolution.

  ```json
  // Example BadgerDB Value (marshaled JSON string)
  {
    "uuid": "a1b2c3d4-e5f6-7890-1234-567890abcdef",
    "timestamp": 1672531200000, // Unix timestamp in milliseconds
    "data": { "count": 7, "colors": ["blue", "red", "black"] }
  }
  ```

  - `uuid` (string): A UUIDv4, a unique identifier for this specific version of the data.
  - `timestamp` (int64): Unix timestamp representing the time of the last modification, in milliseconds since epoch, providing higher precision and reducing collision risk. This is the primary mechanism for conflict resolution ("newest data wins").
  - `data` (JSON object): The actual user-provided JSON payload.
4. API Endpoints
All endpoints will communicate over HTTP/1.1 and utilize JSON for request/response bodies.
4.1 `/kv/` Endpoints (Data Operations - External/Internal)
These endpoints are for direct key-value manipulation by external clients and are also used internally by nodes when fetching full data during replication.
- `GET /kv/{path}`
  - Description: Retrieves the JSON object associated with the given hierarchical key path.
  - Request: No body.
  - Responses:
    - `200 OK`: `Content-Type: application/json` with the stored JSON object.
    - `404 Not Found`: If the key does not exist.
    - `500 Internal Server Error`: For server-side issues (e.g., BadgerDB error).
  - Example: `GET /kv/home/room/closet/socks` -> `{"count":7,"colors":["blue","red","black"]}`
- `PUT /kv/{path}`
  - Description: Creates or updates a JSON object at the given path. This operation internally generates a new UUIDv4 and assigns the current Unix timestamp (milliseconds) to the stored value.
  - Request: `Content-Type: application/json`
    - Body: The JSON object to store.
  - Responses:
    - `200 OK` (Update) or `201 Created` (New): On success, returns `{"uuid": "new-uuid", "timestamp": new_timestamp_ms}`.
    - `400 Bad Request`: If the request body is not valid JSON.
    - `403 Forbidden`: If the node is in "read-only" mode and the request's origin is not a recognized cluster member (checked via IP/hostname).
    - `500 Internal Server Error`: For server-side issues.
  - Example: `PUT /kv/settings/theme` with body `{"color":"dark","font_size":14}` -> `{"uuid": "...", "timestamp": ...}`
- `DELETE /kv/{path}`
  - Description: Deletes the key-value pair at the given path.
  - Request: No body.
  - Responses:
    - `204 No Content`: On successful deletion.
    - `404 Not Found`: If the key does not exist.
    - `403 Forbidden`: If the node is in "read-only" mode and the request is not from a recognized cluster member.
    - `500 Internal Server Error`: For server-side issues.
4.2 `/members/` Endpoints (Membership & Internal Replication)
These endpoints are primarily for internal communication between cluster nodes, managing membership and facilitating data synchronization.
- `GET /members/`
  - Description: Returns a list of known active members in the cluster. This list is maintained locally by each node based on the gossip protocol.
  - Request: No body.
  - Responses:
    - `200 OK`: `Content-Type: application/json` with a JSON array of member details.

      ```json
      [
        {"id": "node-alpha", "address": "192.168.1.10:8080", "last_seen": 1672531200000, "joined_timestamp": 1672530000000},
        {"id": "node-beta", "address": "192.168.1.11:8080", "last_seen": 1672531205000, "joined_timestamp": 1672530100000}
      ]
      ```

      - `id` (string): Unique identifier for the node.
      - `address` (string): `host:port` of the node's API endpoint.
      - `last_seen` (int64): Unix timestamp (milliseconds) of when this node was last successfully contacted or heard from.
      - `joined_timestamp` (int64): Unix timestamp (milliseconds) of when this node first joined the cluster. This is crucial for tie-breaking conflicts.
    - `500 Internal Server Error`: For server-side issues.
- `POST /members/join`
  - Description: Used by a new node to announce its presence and attempt to join the cluster. Existing nodes use this to update their member list and respond with their current view of the cluster.
  - Request: `Content-Type: application/json`
    - Body: `{"id": "node-gamma", "address": "192.168.1.12:8080", "joined_timestamp": 1672532000000}`
    - `joined_timestamp` will be set by the joining node (its startup time).
  - Responses:
    - `200 OK`: Acknowledgment, returning the current list of known members to the joining node (same format as `GET /members/`).
    - `400 Bad Request`: If the request body is malformed.
    - `500 Internal Server Error`: For server-side issues.
- `DELETE /members/leave` (Optional, for graceful shutdown)
  - Description: A member can proactively announce its departure from the cluster. This allows other nodes to quickly mark it as inactive.
  - Request: `Content-Type: application/json`
    - Body: `{"id": "node-gamma"}`
  - Responses:
    - `204 No Content`: On successful processing.
    - `400 Bad Request`: If the request body is malformed.
    - `500 Internal Server Error`: For server-side issues.
- `POST /members/pairs_by_time` (Internal/Replication Endpoint)
  - Description: Used by other cluster members to request a list of key paths, their UUIDs, and their timestamps within a specified time range, optionally filtered by a key prefix. This is critical for both gradual bootstrapping and the regular 5-minute synchronization.
  - Request: `Content-Type: application/json`
    - Body:

      ```json
      {
        "start_timestamp": 1672531200000, // Unix milliseconds (inclusive)
        "end_timestamp": 1672617600000,   // Unix milliseconds (exclusive), or 0 for "up to now"
        "limit": 15,                      // Max number of pairs to return
        "prefix": "home/room/"            // Optional: filter by BadgerDB key prefix
      }
      ```

    - `start_timestamp`: Earliest timestamp for data to be included.
    - `end_timestamp`: Latest timestamp (exclusive). If `0` or omitted, it implies "up to the current time".
    - `limit`: Fixed at 15 for this design, to control batch size during sync.
    - `prefix`: Optional, to filter keys by a common BadgerDB key prefix.
  - Responses:
    - `200 OK`: `Content-Type: application/json` with a JSON array of objects:

      ```json
      [
        {"path": "home/room/closet/socks", "uuid": "...", "timestamp": 1672531200000},
        {"path": "users/john/profile", "uuid": "...", "timestamp": 1672531205000}
      ]
      ```

    - `204 No Content`: If no data matches the criteria.
    - `400 Bad Request`: If the request body is malformed or timestamps are invalid.
    - `500 Internal Server Error`: For server-side issues.
5. BadgerDB Integration
BadgerDB will be used as the embedded, local, single-node key-value store.
- Key Storage: As described in section 3.2, the HTTP path (without the `/kv/` prefix and with no leading `/`) will directly map to the BadgerDB key.
- Value Storage: Values will be marshaled JSON objects (`uuid`, `timestamp`, `data`).
- Timestamp Indexing (for `pairs_by_time`): To efficiently query by timestamp, a manual secondary index will be maintained. Each `PUT` operation will write two BadgerDB entries:
  - The primary data entry: `{badger_key}` -> `{uuid, timestamp, data}`.
  - A secondary timestamp index entry: `_ts:{timestamp_ms}:{badger_key}` -> `{uuid}`.
    - The `_ts` prefix ensures these index keys are grouped and don't conflict with data keys.
    - The timestamp (milliseconds) ensures lexicographical sorting by time.
    - The `badger_key` in the index key allows for uniqueness and points back to the main data.
    - The value can simply be the `uuid` or even an empty string if only the key is needed. Storing the `uuid` here is useful for direct lookups.
- `DELETE` Operations: A `DELETE /kv/{path}` will remove both the primary data entry and its corresponding secondary index entry from BadgerDB.
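The index key layout can be sketched like this. Zero-padding the timestamp to a fixed 13 digits is an added assumption: millisecond timestamps are 13 digits for current dates, and a fixed width keeps lexicographic order equal to numeric order even for shorter values:

```go
package main

import "fmt"

// tsIndexKey builds the secondary index key "_ts:{timestamp_ms}:{badger_key}".
// %013d zero-pads so that byte-wise (lexicographic) ordering matches time order.
func tsIndexKey(timestampMs int64, badgerKey string) string {
	return fmt.Sprintf("_ts:%013d:%s", timestampMs, badgerKey)
}

func main() {
	fmt.Println(tsIndexKey(1672531200000, "home/room/closet/socks"))
}
```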
6. Clustering and Consistency
6.1 Membership Management (Gossip Protocol)
- Each node maintains a local list of known cluster members (Node ID, Address, Last Seen Timestamp, Joined Timestamp).
- After each gossip round, every node waits a random interval between 1 and 2 minutes before initiating the next one.
- In a gossip round, the node randomly selects a subset of its healthy known members (e.g., 1-3 nodes) and performs a "gossip exchange":
  - It sends its current local member list to the selected peers.
  - Peers merge the received list with their own, updating `last_seen` timestamps for existing members and adding new ones.
- If a node fails to respond to multiple gossip attempts, it is eventually marked as "suspected down" and then "dead" after a configurable timeout.
6.2 Data Replication (Periodic Syncs)
- The system uses two types of data synchronization:
  - Regular 5-Minute Sync: Catching up on recent changes.
  - Catch-Up Sync (2-Minute Cycles): For nodes that detect they are significantly behind.
- Regular 5-Minute Sync:
  - Every 5 minutes, each node initiates a data synchronization cycle.
  - It selects a random healthy peer.
  - It sends `POST /members/pairs_by_time` to the peer, requesting the 15 latest UUIDs (by setting `limit: 15` and `end_timestamp: current_time_ms`, with `start_timestamp: 0` or a very old value to ensure enough items are considered).
  - The remote node responds with its 15 latest (path, uuid, timestamp) pairs.
  - The local node compares these with its own latest 15. If it finds any data it doesn't have, or holds an older version of data the peer has, it fetches the full data via `GET /kv/{path}` and updates its local store.
  - If the local node detects it is significantly behind (e.g., many of the remote node's latest 15 UUIDs are missing locally or newer than the local copies, indicating a large gap), it triggers the Catch-Up Sync.
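The "significantly behind" check can be sketched as follows. The 50% threshold is an assumption for illustration; the document does not fix a value:

```go
package main

import "fmt"

// remotePair is a minimal stand-in for one (path, timestamp) entry from a
// peer's latest-15 response.
type remotePair struct {
	Path        string
	TimestampMs int64
}

// needsCatchUp counts how many of the peer's latest pairs are missing locally
// or locally older, and triggers catch-up when more than half are.
func needsCatchUp(remote []remotePair, localTs map[string]int64) bool {
	behind := 0
	for _, p := range remote {
		if ts, ok := localTs[p.Path]; !ok || ts < p.TimestampMs {
			behind++
		}
	}
	return 2*behind > len(remote)
}

func main() {
	remote := []remotePair{{"a", 100}, {"b", 200}, {"c", 300}}
	local := map[string]int64{"a": 100} // "b" and "c" are missing locally
	fmt.Println(needsCatchUp(remote, local)) // true
}
```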
- Catch-Up Sync (2-Minute Cycles):
  - This mode is activated when a node determines it is behind its peers (e.g., during the 5-minute sync or bootstrapping).
  - It runs every 2 minutes (ensuring it doesn't align with the 5-minute sync).
  - The node identifies the oldest timestamp among the peer's latest 15 pairs from its last regular sync.
  - It then sends `POST /members/pairs_by_time` to a random healthy peer, requesting 15 UUIDs older than that timestamp (e.g., `end_timestamp: oldest_known_timestamp_ms`, `limit: 15`, `start_timestamp: 0` or further back).
  - It iterates backwards in time in 2-minute cycles, progressively asking for older sets of 15 UUIDs until it has caught up to a reasonable historical depth (the configured `BOOTSTRAP_MAX_AGE_MILLIS`).
  - History Depth: The system aims to keep track of at least 3 revisions per path for conflict resolution and, eventually, versioning. `BOOTSTRAP_MAX_AGE_MILLIS` (defaulting to 30 days) governs how far back in time a node will actively fetch during a full sync.
6.3 Conflict Resolution
When two nodes have different versions of the same key (same BadgerDB key), the conflict resolution logic is applied:
- Timestamp Wins: The data with the most recent `timestamp` (Unix milliseconds) is considered the correct version.
- Timestamp Collision (Tie-Breaker): If two conflicting versions have the exact same `timestamp`:
  - Majority Vote: The system will query a quorum of healthy peers (`GET /kv/{path}` or an internal check for UUID/timestamp) to see which UUID/timestamp pair the majority holds. The version held by the majority wins.
  - Oldest Node Priority (Tie-Breaker for Majority): If the vote is split (e.g., 2 nodes hold version A and 2 nodes hold version B), the version held by the node with the oldest `joined_timestamp` (i.e., the oldest active member in the cluster) takes precedence. This provides a deterministic tie-breaker.
  - Implementation Note: For the majority vote, a node might need to request the `{"uuid", "timestamp"}` pairs for a specific `path` from multiple peers. This implies an internal query mechanism or aggregating responses from `pairs_by_time` for the specific key.
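The three-stage resolution rule can be sketched as a pure function over collected peer responses; all type and field names here are illustrative:

```go
package main

import (
	"fmt"
	"sort"
)

// Vote pairs a data version with the joined_timestamp of the member holding it.
type Vote struct {
	UUID        string
	TimestampMs int64 // data timestamp (Unix ms)
	JoinedMs    int64 // holder's joined_timestamp (Unix ms)
}

// resolve sketches section 6.3: the newest data timestamp wins; on a
// timestamp collision, the UUID held by the majority wins; on a split vote,
// the UUID held by the oldest member (smallest joined_timestamp) wins.
func resolve(votes []Vote) string {
	var newest int64
	for _, v := range votes {
		if v.TimestampMs > newest {
			newest = v.TimestampMs
		}
	}
	counts := map[string]int{}
	oldestHolder := map[string]int64{}
	for _, v := range votes {
		if v.TimestampMs != newest {
			continue // only versions carrying the newest timestamp compete
		}
		counts[v.UUID]++
		if j, ok := oldestHolder[v.UUID]; !ok || v.JoinedMs < j {
			oldestHolder[v.UUID] = v.JoinedMs
		}
	}
	uuids := make([]string, 0, len(counts))
	for u := range counts {
		uuids = append(uuids, u)
	}
	sort.Slice(uuids, func(i, j int) bool {
		if counts[uuids[i]] != counts[uuids[j]] {
			return counts[uuids[i]] > counts[uuids[j]] // majority first
		}
		return oldestHolder[uuids[i]] < oldestHolder[uuids[j]] // oldest member breaks ties
	})
	return uuids[0]
}

func main() {
	// 2-2 split on the same timestamp: version "B" is held by the oldest member.
	votes := []Vote{
		{"A", 200, 30}, {"A", 200, 40},
		{"B", 200, 10}, {"B", 200, 50},
	}
	fmt.Println(resolve(votes)) // B
}
```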
7. Bootstrapping New Nodes (Gradual Full Sync)
This process is initiated when a new node starts up and has no existing data or member list.
- Seed Node Configuration: The new node must be configured with a list of initial `seed_nodes` (e.g., `["host1:port", "host2:port"]`).
- Join Request: The new node attempts to `POST /members/join` to one of its configured seed nodes, providing its own `id`, `address`, and its `joined_timestamp` (its startup time).
- Member List Discovery: Upon a successful join, the seed node responds with its current list of known cluster members. The new node populates its local member list.
- Gradual Data Synchronization Loop (Catch-Up Mode):
  - The new node sets its `current_end_timestamp = current_time_ms`.
  - It defines a `sync_batch_size` (e.g., 15 UUIDs per request, as per the `pairs_by_time` `limit`).
  - It also defines a `throttle_delay` (e.g., 100ms between `pairs_by_time` requests to different peers) and a `fetch_delay` (e.g., 50ms between individual `GET /kv/{path}` requests for full data).
  - Loop backwards in time:
    - The node determines the `oldest_timestamp_fetched` from its last batch of `sync_batch_size` items. Initially, this is `current_time_ms`.
    - Randomly pick a healthy peer from its member list.
    - Send `POST /members/pairs_by_time` to the peer with `end_timestamp: oldest_timestamp_fetched`, `limit: sync_batch_size`, and `start_timestamp: 0`. This asks for 15 items older than the oldest one just processed.
    - Process the received `{"path", "uuid", "timestamp"}` pairs:
      - For each remote pair, fetch the local version from BadgerDB.
      - Conflict Resolution: Apply the logic from section 6.3. If local data is missing or older, initiate a `GET /kv/{path}` to fetch the full data and store it.
    - Throttling:
      - Wait for `throttle_delay` after each `pairs_by_time` request.
      - Wait for `fetch_delay` after each individual `GET /kv/{path}` request for full data.
    - Termination: The loop continues until `oldest_timestamp_fetched` goes below the configured `BOOTSTRAP_MAX_AGE_MILLIS` (defaulting to 30 days ago). The node may also terminate if multiple consecutive `pairs_by_time` queries return no new (older) data.
- Full Participation: Once the gradual sync is complete, the node fully participates in the regular 5-minute replication cycles and accepts external client writes (if not in read-only mode). During the sync, the node operates in a `syncing` mode, rejecting external client writes with `503 Service Unavailable`.
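One iteration of the backward-in-time loop can be sketched as a pure helper; the function and parameter names are assumptions:

```go
package main

import "fmt"

// stepBackward finds the oldest timestamp in the batch just received and
// decides whether to keep paging: the batch must be non-empty and still newer
// than the BOOTSTRAP_MAX_AGE_MILLIS cutoff.
func stepBackward(batchTimestamps []int64, cutoffMs int64) (oldest int64, keepGoing bool) {
	if len(batchTimestamps) == 0 {
		return 0, false // peer has nothing older: terminate
	}
	oldest = batchTimestamps[0]
	for _, ts := range batchTimestamps[1:] {
		if ts < oldest {
			oldest = ts
		}
	}
	return oldest, oldest > cutoffMs
}

func main() {
	oldest, more := stepBackward([]int64{300, 100, 200}, 50)
	fmt.Println(oldest, more) // 100 true
}
```

The returned `oldest` becomes the next request's `end_timestamp`, so each cycle pages strictly further into the past.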
8. Operational Modes
- Normal Mode: Full read/write capabilities; participates in all replication and gossip activities.
- Read-Only Mode:
  - The node rejects `PUT` and `DELETE` requests from external clients with a `403 Forbidden` status.
  - It still accepts `PUT` and `DELETE` operations that originate from recognized cluster members during replication, allowing it to remain eventually consistent.
  - `GET` requests are always allowed.
  - This mode is primarily for reducing write load or protecting data on specific nodes.
- Syncing Mode (Internal, during Bootstrap):
  - While a new node is undergoing its initial gradual sync, it operates in this internal mode.
  - External `PUT`/`DELETE` requests are rejected with `503 Service Unavailable`.
  - Internal replication from other members is fully active.
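Mode gating for writes can be sketched as a single decision function. The mode strings follow the document; the cluster-membership check (IP/hostname based, per section 4.1) is assumed to be done by the caller:

```go
package main

import (
	"fmt"
	"net/http"
)

// writeStatus returns the HTTP status with which a PUT/DELETE should be
// rejected, or 0 if the write is allowed.
func writeStatus(mode string, isClusterMember bool) int {
	switch {
	case isClusterMember:
		return 0 // internal replication writes are always accepted
	case mode == "read-only":
		return http.StatusForbidden // 403 for external writes
	case mode == "syncing":
		return http.StatusServiceUnavailable // 503 during bootstrap
	default:
		return 0 // normal mode
	}
}

func main() {
	fmt.Println(writeStatus("read-only", false)) // 403
}
```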
9. Logging
A structured logging library (e.g., `zap` or `logrus`) will be used.
- Log Levels: Support for `DEBUG`, `INFO`, `WARN`, `ERROR`, and `FATAL`; configurable.
- Log Format: JSON for easy parsing by log aggregators.
- Key Events to Log:
- Startup/Shutdown: Server start/stop, configuration loaded.
- API Requests: Incoming HTTP request details (method, path, client IP, status code, duration).
- BadgerDB Operations: Errors during put/get/delete, database open/close, secondary index operations.
- Membership: Node joined/left, gossip rounds initiated/received, member status changes (up, suspected, down), tie-breaker decisions.
- Replication: Sync cycle start/end, type of sync (regular/catch-up), number of keys compared, number of keys fetched, conflict resolutions (including details of timestamp collision resolution).
- Errors: Data serialization/deserialization, network errors, unhandled exceptions.
- Operational Mode Changes: Entering/exiting read-only mode, syncing mode.
10. Future Work (Rough Order of Priority)
These items are considered out of scope for the initial design but are planned for future versions.
- Authentication/Authorization (Before First Release): Implement robust authentication for API endpoints (e.g., API keys, mTLS) and potentially basic authorization for access to `kv` paths.
- Client Libraries/Functions (Bash, Python, Go): Develop official client libraries or helper functions to simplify interaction with the API for common programming environments.
- Data Compression (gzip): Implement gzip compression for data values stored in BadgerDB to reduce storage footprint and potentially improve I/O performance.
- Data Revisions & Simple Backups:
  - Hold at least 3 revisions per path. This would involve a mechanism to store previous versions of data when a `PUT` occurs, potentially using a separate BadgerDB key namespace (e.g., `_rev:{badger_key}:{timestamp_of_revision}`).
  - The current `GET /kv/{path}` would continue to return only the latest revision. A new API might be introduced to fetch specific historical revisions.
  - Simple backup strategies could leverage these revisions or BadgerDB's native snapshot capabilities.
- Monitoring & Metrics (Grafana Support in v3): Integrate with a metrics system like Prometheus, exposing key performance indicators (e.g., request rates, error rates, replication lag, BadgerDB stats) for visualization in dashboards like Grafana.