kalzu-value-store/design_v2.md
2025-09-09 22:23:20 +03:00

# Gossip in Go, lazy syncing K/V database
## Software Design Document: Clustered Key-Value Store
### 1. Introduction
#### 1.1 Goals
This document outlines the design for a minimalistic, clustered key-value database system written in Go. The primary goals are:
* **Eventual Consistency:** Prioritize availability and partition tolerance over strong consistency.
* **Local-First Truth:** Local operations should be fast, with replication happening in the background.
* **Gossip-Style Membership:** Decentralized mechanism for nodes to discover and track each other.
* **Hierarchical Keys:** Support for structured keys (e.g., `/home/room/closet/socks`).
* **Minimalistic Footprint:** Efficient resource usage on servers.
* **Simple Configuration & Operation:** Easy to deploy and manage.
* **Read-Only Mode:** Ability for nodes to restrict external writes.
* **Gradual Bootstrapping:** New nodes integrate smoothly without overwhelming the cluster.
* **Sophisticated Conflict Resolution:** Handle timestamp collisions using majority vote, with oldest node as tie-breaker.
#### 1.2 Non-Goals
* Strong (linearizable/serializable) consistency.
* Complex querying or indexing beyond key-based lookups and timestamp-filtered UUID lists.
* Transaction support across multiple keys.
### 2. Architecture Overview
The system will consist of independent Go services (nodes) that communicate via HTTP/REST. Each node will embed a BadgerDB instance for local data storage and manage its own membership list through a gossip protocol. External clients interact with any available node, which then participates in the cluster's eventual consistency model.
**Key Architectural Principles:**
* **Decentralized:** No central coordinator or leader.
* **Peer-to-Peer:** Nodes communicate directly with each other for replication and membership.
* **API-Driven:** All interactions, both external (clients) and internal (replication), occur over a RESTful HTTP API.
```
+----------------+ +----------------+ +----------------+
| Node A | | Node B | | Node C |
| (Go Service) | | (Go Service) | | (Go Service) |
| | | | | |
| +------------+ | | +------------+ | | +------------+ |
| | HTTP Server| | <---- | | HTTP Server| | <---- | | HTTP Server| |
| | (API) | | ---> | | (API) | | ---> | | (API) | |
| +------------+ | | +------------+ | | +------------+ |
| | | | | | | | |
| +------------+ | | +------------+ | | +------------+ |
| | Gossip | | <---> | | Gossip | | <---> | | Gossip | |
| | Manager | | | | Manager | | | | Manager | |
| +------------+ | | +------------+ | | +------------+ |
| | | | | | | | |
| +------------+ | | +------------+ | | +------------+ |
| | Replication| | <---> | | Replication| | <---> | | Replication| |
| | Logic | | | | Logic | | | | Logic | |
| +------------+ | | +------------+ | | +------------+ |
| | | | | | | | |
| +------------+ | | +------------+ | | +------------+ |
| | BadgerDB | | | | BadgerDB | | | | BadgerDB | |
| | (Local KV) | | | | (Local KV) | | | | (Local KV) | |
| +------------+ | | +------------+ | | +------------+ |
+----------------+ +----------------+ +----------------+
^
|
+----- External Clients (Interact with any Node's API)
```
### 3. Data Model
#### 3.1 Logical Data Structure
Data is logically stored as a key-value pair, where the key is a hierarchical path and the value is a JSON object. Each pair also carries metadata for consistency and conflict resolution.
* **Logical Key:** `string` (e.g., `/home/room/closet/socks`)
* **Logical Value:** `JSON object` (e.g., `{"count":7,"colors":["blue","red","black"]}`)
#### 3.2 Internal Storage Structure (BadgerDB)
BadgerDB is a flat key-value store. To accommodate hierarchical keys and metadata, the following mapping will be used:
* **BadgerDB Key:** The full logical key path, with the leading `/kv/` prefix removed. Path segments will be separated by `/`. **No leading `/` will be stored in the BadgerDB key.**
* Example: For logical key `/kv/home/room/closet/socks`, the BadgerDB key will be `home/room/closet/socks`.
* **BadgerDB Value:** A marshaled JSON object containing the `uuid`, `timestamp`, and the actual `data` JSON object. This allows for consistent versioning and conflict resolution.
```json
// Example BadgerDB Value (marshaled JSON string)
{
"uuid": "a1b2c3d4-e5f6-7890-1234-567890abcdef",
"timestamp": 1672531200000, // Unix timestamp in milliseconds
"data": {
"count": 7,
"colors": ["blue", "red", "black"]
}
}
```
* **`uuid` (string):** A UUIDv4, a unique identifier for this specific version of the data.
* **`timestamp` (int64):** Unix timestamp representing the time of the last modification. **This will be in milliseconds since epoch**, providing higher precision and reducing collision risk. This is the primary mechanism for conflict resolution ("newest data wins").
* **`data` (JSON object):** The actual user-provided JSON payload.
### 4. API Endpoints
All endpoints will communicate over HTTP/1.1 and utilize JSON for request/response bodies.
#### 4.1 `/kv/` Endpoints (Data Operations - External/Internal)
These endpoints are for direct key-value manipulation by external clients and are also used internally by nodes when fetching full data during replication.
* **`GET /kv/{path}`**
* **Description:** Retrieves the JSON object associated with the given hierarchical key path.
* **Request:** No body.
* **Responses:**
* `200 OK`: `Content-Type: application/json` with the stored JSON object.
* `404 Not Found`: If the key does not exist.
* `500 Internal Server Error`: For server-side issues (e.g., BadgerDB error).
* **Example:** `GET /kv/home/room/closet/socks` -> `{"count":7,"colors":["blue","red","black"]}`
* **`PUT /kv/{path}`**
* **Description:** Creates or updates a JSON object at the given path. This operation will internally generate a new UUIDv4 and assign the current Unix timestamp (milliseconds) to the stored value.
* **Request:**
* `Content-Type: application/json`
* Body: The JSON object to store.
* **Responses:**
* `200 OK` (Update) or `201 Created` (New): On success, returns `{"uuid": "new-uuid", "timestamp": new-timestamp_ms}`.
* `400 Bad Request`: If the request body is not valid JSON.
* `403 Forbidden`: If the node is in "read-only" mode and the request's origin is not a recognized cluster member (checked via IP/hostname).
* `500 Internal Server Error`: For server-side issues.
* **Example:** `PUT /kv/settings/theme` with body `{"color":"dark","font_size":14}` -> `{"uuid": "...", "timestamp": ...}`
* **`DELETE /kv/{path}`**
* **Description:** Deletes the key-value pair at the given path.
* **Request:** No body.
* **Responses:**
* `204 No Content`: On successful deletion.
* `404 Not Found`: If the key does not exist.
* `403 Forbidden`: If the node is in "read-only" mode and the request is not from a recognized cluster member.
* `500 Internal Server Error`: For server-side issues.
#### 4.2 `/members/` Endpoints (Membership & Internal Replication)
These endpoints are primarily for internal communication between cluster nodes, managing membership and facilitating data synchronization.
* **`GET /members/`**
* **Description:** Returns a list of known active members in the cluster. This list is maintained locally by each node based on the gossip protocol.
* **Request:** No body.
* **Responses:**
* `200 OK`: `Content-Type: application/json` with a JSON array of member details.
```json
[
{"id": "node-alpha", "address": "192.168.1.10:8080", "last_seen": 1672531200000, "joined_timestamp": 1672530000000},
{"id": "node-beta", "address": "192.168.1.11:8080", "last_seen": 1672531205000, "joined_timestamp": 1672530100000}
]
```
* `id` (string): Unique identifier for the node.
* `address` (string): `host:port` of the node's API endpoint.
* `last_seen` (int64): Unix timestamp (milliseconds) of when this node was last successfully contacted or heard from.
* `joined_timestamp` (int64): Unix timestamp (milliseconds) of when this node first joined the cluster. This is crucial for tie-breaking conflicts.
* `500 Internal Server Error`: For server-side issues.
* **`POST /members/join`**
* **Description:** Used by a new node to announce its presence and attempt to join the cluster. Existing nodes use this to update their member list and respond with their current view of the cluster.
* **Request:**
* `Content-Type: application/json`
* Body:
```json
{"id": "node-gamma", "address": "192.168.1.12:8080", "joined_timestamp": 1672532000000}
```
* `joined_timestamp` will be set by the joining node (its startup time).
* **Responses:**
* `200 OK`: Acknowledgment, returning the current list of known members to the joining node (same format as `GET /members/`).
* `400 Bad Request`: If the request body is malformed.
* `500 Internal Server Error`: For server-side issues.
* **`DELETE /members/leave` (Optional, for graceful shutdown)**
* **Description:** A member can proactively announce its departure from the cluster. This allows other nodes to quickly mark it as inactive.
* **Request:**
* `Content-Type: application/json`
* Body: `{"id": "node-gamma"}`
* **Responses:**
* `204 No Content`: On successful processing.
* `400 Bad Request`: If the request body is malformed.
* `500 Internal Server Error`: For server-side issues.
* **`POST /members/pairs_by_time` (Internal/Replication Endpoint)**
* **Description:** Used by other cluster members to request a list of key paths, their UUIDs, and their timestamps within a specified time range, optionally filtered by a key prefix. This is critical for both gradual bootstrapping and the regular 5-minute synchronization.
* **Request:**
* `Content-Type: application/json`
* Body:
```json
{
"start_timestamp": 1672531200000, // Unix milliseconds (inclusive)
"end_timestamp": 1672617600000, // Unix milliseconds (exclusive), or 0 for "up to now"
"limit": 15, // Max number of pairs to return
"prefix": "home/room/" // Optional: filter by BadgerDB key prefix
}
```
* `start_timestamp`: Earliest timestamp for data to be included.
* `end_timestamp`: Latest timestamp (exclusive). If `0` or omitted, it implies "up to the current time".
* `limit`: **Fixed at 15** for this design, to control batch size during sync.
* `prefix`: Optional, to filter keys by a common BadgerDB key prefix.
* **Responses:**
* `200 OK`: `Content-Type: application/json` with a JSON array of objects:
```json
[
{"path": "home/room/closet/socks", "uuid": "...", "timestamp": 1672531200000},
{"path": "users/john/profile", "uuid": "...", "timestamp": 1672531205000}
]
```
* `204 No Content`: If no data matches the criteria.
* `400 Bad Request`: If request body is malformed or timestamps are invalid.
* `500 Internal Server Error`: For server-side issues.
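The request semantics of `pairs_by_time` (inclusive start, exclusive end, `0` meaning "up to now", optional prefix, capped batch size) can be captured as a pure function over an in-memory slice. This is an illustrative sketch; `Pair` and `pairsByTime` are hypothetical names, and the real endpoint would walk the `_ts:` secondary index rather than scan a slice:

```go
package main

import (
	"sort"
	"strings"
)

// Pair is one entry returned by POST /members/pairs_by_time.
type Pair struct {
	Path      string `json:"path"`
	UUID      string `json:"uuid"`
	Timestamp int64  `json:"timestamp"` // Unix milliseconds
}

// pairsByTime applies the documented request semantics: start inclusive, end
// exclusive (0 meaning "up to the current time", passed in as nowMillis),
// optional path prefix filter, newest first, capped at limit.
func pairsByTime(all []Pair, start, end int64, limit int, prefix string, nowMillis int64) []Pair {
	if end == 0 {
		end = nowMillis + 1 // "0 or omitted" means up to and including now
	}
	var out []Pair
	for _, p := range all {
		if p.Timestamp >= start && p.Timestamp < end && strings.HasPrefix(p.Path, prefix) {
			out = append(out, p)
		}
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Timestamp > out[j].Timestamp })
	if len(out) > limit {
		out = out[:limit]
	}
	return out
}
```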
### 5. BadgerDB Integration
BadgerDB will be used as the embedded, local, single-node key-value store.
* **Key Storage:** As described in section 3.2, the HTTP path (without `/kv/` prefix and no leading `/`) will directly map to the BadgerDB key.
* **Value Storage:** Values will be marshaled JSON objects (`uuid`, `timestamp`, `data`).
* **Timestamp Indexing (for `pairs_by_time`):** To efficiently query by timestamp, a manual secondary index will be maintained. Each `PUT` operation will write two BadgerDB entries:
1. The primary data entry: `{badger_key}` -> `{uuid, timestamp, data}`.
2. A secondary timestamp index entry: `_ts:{timestamp_ms}:{badger_key}` -> `{uuid}`.
* The `_ts` prefix ensures these index keys are grouped and don't conflict with data keys.
* The timestamp (milliseconds), zero-padded to a fixed width, ensures that lexicographical key order matches chronological order.
* The `badger_key` in the index key allows for uniqueness and points back to the main data.
* The value can simply be the `uuid` or even an empty string if only the key is needed. Storing the `uuid` here is useful for direct lookups.
* **`DELETE` Operations:** A `DELETE /kv/{path}` will remove both the primary data entry and its corresponding secondary index entry from BadgerDB.
### 6. Clustering and Consistency
#### 6.1 Membership Management (Gossip Protocol)
* Each node maintains a local list of known cluster members (Node ID, Address, Last Seen Timestamp, Joined Timestamp).
* Each node waits a random interval **between 1 and 2 minutes** after its previous gossip round before initiating the next one.
* In a gossip round, the node randomly selects a subset of its healthy known members (e.g., 1-3 nodes) and performs a "gossip exchange":
1. It sends its current local member list to the selected peers.
2. Peers merge the received list with their own, updating `last_seen` timestamps for existing members and adding new ones.
3. If a node fails to respond to multiple gossip attempts, it is eventually marked as "suspected down" and then "dead" after a configurable timeout.
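The merge step of a gossip exchange can be sketched as a pure function. This is an illustrative sketch with hypothetical names (`Member`, `mergeMembers`); the real implementation would also handle the suspected-down/dead transitions:

```go
package main

// Member mirrors one entry from GET /members/.
type Member struct {
	ID              string `json:"id"`
	Address         string `json:"address"`
	LastSeen        int64  `json:"last_seen"`
	JoinedTimestamp int64  `json:"joined_timestamp"`
}

// mergeMembers folds a peer's member list into the local view during a gossip
// exchange: unknown nodes are added, and for known nodes the entry with the
// newer last_seen wins. joined_timestamp is fixed at node startup, so it is
// simply carried along with the winning entry.
func mergeMembers(local map[string]Member, remote []Member) {
	for _, m := range remote {
		cur, ok := local[m.ID]
		if !ok || m.LastSeen > cur.LastSeen {
			local[m.ID] = m
		}
	}
}
```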
#### 6.2 Data Replication (Periodic Syncs)
* The system uses two types of data synchronization:
1. **Regular 5-Minute Sync:** Catching up on recent changes.
2. **Catch-Up Sync (2-Minute Cycles):** For nodes that detect they are significantly behind.
* **Regular 5-Minute Sync:**
* Every **5 minutes**, each node initiates a data synchronization cycle.
* It selects a random healthy peer.
* It sends `POST /members/pairs_by_time` to the peer, requesting **the 15 latest UUIDs** (by setting `limit: 15` and `end_timestamp: current_time_ms`, with `start_timestamp: 0` or a very old value to ensure enough items are considered).
* The remote node responds with its 15 latest (path, uuid, timestamp) pairs.
* The local node compares these with its own latest 15. If the remote list contains any key the local node lacks, or a newer version of a key it holds, it fetches the full data via `GET /kv/{path}` and updates its local store.
* If the local node detects it is significantly behind (e.g., many of the remote node's latest 15 entries are missing locally, or are much newer than the local versions, indicating a large gap), it triggers the **Catch-Up Sync**.
* **Catch-Up Sync (2-Minute Cycles):**
* This mode is activated when a node determines it's behind its peers (e.g., during the 5-minute sync or bootstrapping).
* It runs every **2 minutes** (ensuring it doesn't align with the 5-minute sync).
* The node identifies the `oldest_known_timestamp_among_peers_latest_15` from its last regular sync.
* It then sends `POST /members/pairs_by_time` to a random healthy peer, requesting **15 UUIDs older than that timestamp** (e.g., `end_timestamp: oldest_known_timestamp_ms`, `limit: 15`, `start_timestamp: 0` or further back).
* It continuously iterates backwards in time in 2-minute cycles, progressively asking for older sets of 15 UUIDs until it has caught up to a reasonable historical depth (i.e., the configured `BOOTSTRAP_MAX_AGE_MILLIS`).
* **History Depth:** The system aims to keep track of **at least 3 revisions per path** for conflict resolution and eventually for versioning. The `BOOTSTRAP_MAX_AGE_MILLIS` (defaulting to 30 days) governs how far back in time a node will actively fetch during a full sync.
#### 6.3 Conflict Resolution
When two nodes have different versions of the same key (same BadgerDB key), the conflict resolution logic is applied:
1. **Timestamp Wins:** The data with the **most recent `timestamp` (Unix milliseconds)** is considered the correct version.
2. **Timestamp Collision (Tie-Breaker):** If two conflicting versions have the **exact same `timestamp`**:
* **Majority Vote:** The system will query a quorum of healthy peers (`GET /kv/{path}` or an internal check for UUID/timestamp) to see which UUID/timestamp pair the majority holds. The version held by the majority wins.
* **Oldest Node Priority (Tie-Breaker for Majority):** If there's an even number of nodes, and thus a tie in the majority vote (e.g., 2 nodes say version A, 2 nodes say version B), the version held by the node with the **oldest `joined_timestamp`** (i.e., the oldest active member in the cluster) takes precedence. This provides a deterministic tie-breaker.
* *Implementation Note:* For majority vote, a node might need to request the `{"uuid", "timestamp"}` pairs for a specific `path` from multiple peers. This implies an internal query mechanism or aggregating responses from `pairs_by_time` for the specific key.
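The decision logic above can be expressed as a pure function, assuming the peer responses have already been aggregated. This is an illustrative sketch (`Version`, `resolveConflict`, and the vote maps are hypothetical names); gathering `votes` and `oldestHolderJoined` from a quorum of peers is the network step noted in the implementation note:

```go
package main

// Version is one candidate value for a key during conflict resolution.
type Version struct {
	UUID      string
	Timestamp int64 // value timestamp, Unix milliseconds
}

// resolveConflict applies section 6.3: newest timestamp wins; on a collision,
// majority vote wins; on a tied vote, the version held by the cluster member
// with the oldest joined_timestamp wins. votes maps each candidate UUID to how
// many queried peers hold it; oldestHolderJoined maps each UUID to the
// smallest joined_timestamp among those peers.
func resolveConflict(a, b Version, votes map[string]int, oldestHolderJoined map[string]int64) Version {
	// 1. Newest timestamp wins.
	if a.Timestamp != b.Timestamp {
		if a.Timestamp > b.Timestamp {
			return a
		}
		return b
	}
	// 2. Timestamp collision: majority vote.
	if votes[a.UUID] != votes[b.UUID] {
		if votes[a.UUID] > votes[b.UUID] {
			return a
		}
		return b
	}
	// 3. Tied vote: oldest active member breaks the tie deterministically.
	if oldestHolderJoined[a.UUID] <= oldestHolderJoined[b.UUID] {
		return a
	}
	return b
}
```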
### 7. Bootstrapping New Nodes (Gradual Full Sync)
This process is initiated when a new node starts up and has no existing data or member list.
1. **Seed Node Configuration:** The new node must be configured with a list of initial `seed_nodes` (e.g., `["host1:port", "host2:port"]`).
2. **Join Request:** The new node attempts to `POST /members/join` to one of its configured seed nodes, providing its own `id`, `address`, and its `joined_timestamp` (its startup time).
3. **Member List Discovery:** Upon a successful join, the seed node responds with its current list of known cluster members. The new node populates its local member list.
4. **Gradual Data Synchronization Loop (Catch-Up Mode):**
* The new node sets its `current_end_timestamp = current_time_ms`.
* It defines a `sync_batch_size` (e.g., 15 UUIDs per request, as per `pairs_by_time` `limit`).
* It also defines a `throttle_delay` (e.g., 100ms between `pairs_by_time` requests to different peers) and a `fetch_delay` (e.g., 50ms between individual `GET /kv/{path}` requests for full data).
* **Loop backwards in time:**
* The node determines the `oldest_timestamp_fetched` from its *last* batch of `sync_batch_size` items. Initially, this would be `current_time_ms`.
* Randomly pick a healthy peer from its member list.
* Send `POST /members/pairs_by_time` to the peer with `end_timestamp: oldest_timestamp_fetched`, `limit: sync_batch_size`, and `start_timestamp: 0`. This asks for 15 items *older than* the oldest one just processed.
* Process the received `{"path", "uuid", "timestamp"}` pairs:
* For each remote pair, it fetches its local version from BadgerDB.
* **Conflict Resolution:** Apply the logic from section 6.3. If local data is missing or older, initiate a `GET /kv/{path}` to fetch the full data and store it.
* **Throttling:**
* Wait for `throttle_delay` after each `pairs_by_time` request.
* Wait for `fetch_delay` after each individual `GET /kv/{path}` request for full data.
* **Termination:** The loop continues until the `oldest_timestamp_fetched` goes below the configured `BOOTSTRAP_MAX_AGE_MILLIS` (defaulting to 30 days ago, configurable value). The node may also terminate if multiple consecutive `pairs_by_time` queries return no new (older) data.
5. **Full Participation:** During the gradual sync, the node operates in a `syncing` mode, rejecting external client writes with `503 Service Unavailable`. Once the sync is complete, the node fully participates in the regular 5-minute replication cycles and accepts external client writes (if not in read-only mode).
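The backwards-in-time stepping and termination rules of the loop above can be sketched as one pure function per iteration. This is an illustration with hypothetical names (`Pair`, `nextCatchUpWindow`); throttling and the actual HTTP requests are left out:

```go
package main

// Pair is one {path, uuid, timestamp} entry from pairs_by_time.
type Pair struct {
	Path      string
	UUID      string
	Timestamp int64 // Unix milliseconds
}

// nextCatchUpWindow computes the end_timestamp for the next backwards batch
// and whether the loop should stop: terminate on an empty batch (no older
// data left on the peer) or once the oldest fetched timestamp falls below
// the configured BOOTSTRAP_MAX_AGE_MILLIS floor.
func nextCatchUpWindow(batch []Pair, maxAgeFloorMillis int64) (nextEnd int64, done bool) {
	if len(batch) == 0 {
		return 0, true
	}
	oldest := batch[0].Timestamp
	for _, p := range batch[1:] {
		if p.Timestamp < oldest {
			oldest = p.Timestamp
		}
	}
	if oldest <= maxAgeFloorMillis {
		return 0, true
	}
	// Next request: end_timestamp = oldest (exclusive), asking for items
	// strictly older than everything already processed.
	return oldest, false
}
```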
### 8. Operational Modes
* **Normal Mode:** Full read/write capabilities, participates in all replication and gossip activities.
* **Read-Only Mode:**
* Node will reject `PUT` and `DELETE` requests from **external clients** with a `403 Forbidden` status.
* It will **still accept** `PUT` and `DELETE` operations that originate from **recognized cluster members** during replication, allowing it to remain eventually consistent.
* `GET` requests are always allowed.
* This mode is primarily for reducing write load or protecting data on specific nodes.
* **Syncing Mode (Internal during Bootstrap):**
* While a new node is undergoing its initial gradual sync, it operates in this internal mode.
* External `PUT`/`DELETE` requests will be **rejected with `503 Service Unavailable`**.
* Internal replication from other members is fully active.
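The mode-dependent write handling can be condensed into a single gate function. This is an illustrative sketch (`Mode` and `writeStatus` are hypothetical names); determining `fromClusterMember` would involve checking the request's source IP or hostname against the local member list, as described in section 4:

```go
package main

import "net/http"

// Mode is the node's current operational mode.
type Mode int

const (
	ModeNormal Mode = iota
	ModeReadOnly
	ModeSyncing
)

// writeStatus decides how a PUT or DELETE is handled in each operational
// mode. Replication traffic from recognized cluster members is always
// accepted so the node stays eventually consistent.
func writeStatus(mode Mode, fromClusterMember bool) int {
	if fromClusterMember {
		return http.StatusOK
	}
	switch mode {
	case ModeReadOnly:
		return http.StatusForbidden // 403 for external writes
	case ModeSyncing:
		return http.StatusServiceUnavailable // 503 while bootstrapping
	default:
		return http.StatusOK
	}
}
```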
### 9. Logging
A structured logging library (e.g., `zap` or `logrus`) will be used.
* **Log Levels:** Support for `DEBUG`, `INFO`, `WARN`, `ERROR`, `FATAL`. Configurable.
* **Log Format:** JSON for easy parsing by log aggregators.
* **Key Events to Log:**
* **Startup/Shutdown:** Server start/stop, configuration loaded.
* **API Requests:** Incoming HTTP request details (method, path, client IP, status code, duration).
* **BadgerDB Operations:** Errors during put/get/delete, database open/close, secondary index operations.
* **Membership:** Node joined/left, gossip rounds initiated/received, member status changes (up, suspected, down), tie-breaker decisions.
* **Replication:** Sync cycle start/end, type of sync (regular/catch-up), number of keys compared, number of keys fetched, conflict resolutions (including details of timestamp collision resolution).
* **Errors:** Data serialization/deserialization, network errors, unhandled exceptions.
* **Operational Mode Changes:** Entering/exiting read-only mode, syncing mode.
### 10. Future Work (Rough Order of Priority)
These items are considered out of scope for the initial design but are planned for future versions.
* **Authentication/Authorization (Before First Release):** Implement robust authentication for API endpoints (e.g., API keys, mTLS) and potentially basic authorization for access to `kv` paths.
* **Client Libraries/Functions (Bash, Python, Go):** Develop official client libraries or helper functions to simplify interaction with the API for common programming environments.
* **Data Compression (gzip):** Implement Gzip compression for data values stored in BadgerDB to reduce storage footprint and potentially improve I/O performance.
* **Data Revisions & Simple Backups:**
* Hold **at least 3 revisions per path**. This would involve a mechanism to store previous versions of data when a `PUT` occurs, potentially using a separate BadgerDB key namespace (e.g., `_rev:{badger_key}:{timestamp_of_revision}`).
* The current `GET /kv/{path}` would continue to return only the latest. A new API might be introduced to fetch specific historical revisions.
* Simple backup strategies could leverage these revisions or BadgerDB's native snapshot capabilities.
* **Monitoring & Metrics (Grafana Support in v3):** Integrate with a metrics system like Prometheus, exposing key performance indicators (e.g., request rates, error rates, replication lag, BadgerDB stats) for visualization in dashboards like Grafana.