Snapshot Management
This document details the operational mechanics, configuration tuning, and failure handling of HarmonyLite's snapshot system. For high-level concepts, see Architecture.
Snapshots are the primary mechanism for disaster recovery and node provisioning. Misconfiguration here can lead to slow recovery times or storage exhaustion.
Snapshot Lifecycle
The snapshot process is a multi-step operation designed to ensure data consistency without locking the database for writes.
1. Atomic Creation (VACUUM INTO)
HarmonyLite uses SQLite's VACUUM INTO 'filename' command to create a transactionally consistent backup.
- Non-Blocking: Writers can continue to operate while the snapshot is being generated.
- WAL Inclusion: The copy includes the current WAL state, so no locally committed data is missed.
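For illustration, the Go sketch below shows one way to produce such a copy with VACUUM INTO through the database/sql API. The mattn/go-sqlite3 driver, the function name, and the destination handling are assumptions, not HarmonyLite's actual code.

package snapshot

import (
    "database/sql"
    "fmt"
    "os"

    _ "github.com/mattn/go-sqlite3" // assumption: any database/sql SQLite driver works
)

// createSnapshot writes a transactionally consistent copy of the live database
// to destPath. Writers on other connections are not blocked while it runs.
func createSnapshot(db *sql.DB, destPath string) error {
    // VACUUM INTO fails if the destination already exists and is non-empty.
    if err := os.Remove(destPath); err != nil && !os.IsNotExist(err) {
        return err
    }
    // destPath must not contain single quotes (or must be escaped) before
    // interpolating it into the statement.
    _, err := db.Exec(fmt.Sprintf("VACUUM INTO '%s'", destPath))
    return err
}

The resulting file can then be sanitized and uploaded without ever taking a write lock on the live database.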
2. Sanitization
Before upload, the snapshot is "cleaned":
- __harmonylite_* table removal: Change logs are stripped to reduce size. The snapshot represents the state, not the history.
- Trigger drop: Replication triggers are removed so they don't fire during the restore process (which would cause infinite loops).
- Optimization: A final VACUUM is run on the snapshot file to reclaim space from the rows deleted during cleanup.
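As an illustrative sketch only (the exact statements HarmonyLite issues are not reproduced here, and it assumes the replication triggers share the __harmonylite_ prefix), a cleanup pass over the copied file could look like this:

package snapshot

import (
    "database/sql"
    "fmt"
    "strings"

    _ "github.com/mattn/go-sqlite3"
)

// sanitizeSnapshot strips internal change-log tables and replication triggers
// from a snapshot copy, then compacts the file. Illustrative sketch only.
func sanitizeSnapshot(path string) error {
    db, err := sql.Open("sqlite3", path)
    if err != nil {
        return err
    }
    defer db.Close()

    // Collect the internal objects first; DDL runs after the cursor is closed.
    rows, err := db.Query(
        `SELECT type, name FROM sqlite_master
         WHERE type IN ('table', 'trigger') AND name GLOB '__harmonylite_*'`)
    if err != nil {
        return err
    }
    var drops []string
    for rows.Next() {
        var typ, name string
        if err := rows.Scan(&typ, &name); err != nil {
            rows.Close()
            return err
        }
        drops = append(drops, fmt.Sprintf(`DROP %s IF EXISTS "%s"`, strings.ToUpper(typ), name))
    }
    rows.Close()

    for _, stmt := range drops {
        if _, err := db.Exec(stmt); err != nil {
            return err
        }
    }
    _, err = db.Exec("VACUUM") // reclaim the space left by the dropped objects
    return err
}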
3. Cleanup
Temporary files are stored in os.TempDir() with the pattern harmonylite-snapshot-*. The system attempts to remove these files immediately after upload.
- Retry Logic: If file deletion fails (e.g., file lock), the system retries 5 times with a 1-second backoff before logging an error.
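Roughly, the retry has this shape (a sketch using the attempt count and delay stated above; the real function and log wording may differ):

package snapshot

import (
    "log"
    "os"
    "time"
)

// removeWithRetry deletes a temporary snapshot path, retrying a few times
// with a short backoff before giving up with a log line.
func removeWithRetry(path string) {
    const attempts = 5
    var err error
    for i := 0; i < attempts; i++ {
        if err = os.RemoveAll(path); err == nil {
            return
        }
        if i < attempts-1 {
            time.Sleep(1 * time.Second) // back off before the next attempt
        }
    }
    log.Printf("Unable to cleanup temp path %s: %v", path, err)
}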
Snapshot Leader Election
When running a cluster of publishers (publish = true), it is inefficient and potentially dangerous for multiple nodes to upload snapshots simultaneously.
The Lease Mechanism
HarmonyLite uses a distributed lease backed by the NATS Key-Value store (snapshot-leader key).
- Lease Key: snapshot-leader
- Default TTL: 30 seconds
- Heartbeat: Every 10 seconds (TTL / 3)
Split-Brain Protection
Each node runs a background electionLoop.
- Acquire: Attempts to create/update the key with its Node ID.
- Renew: If it owns the key, it updates the timestamp.
- Loss: If AcquireLease fails (e.g., NATS outage), the node immediately steps down (isLeader.Store(false)).
If NATS is temporarily unreachable, no snapshots will be taken. This is a safety feature to prevent split-brain uploads.
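The sketch below shows the general shape of such a loop on top of the nats.go key-value API: acquire the snapshot-leader key if it is free, renew it on a heartbeat of roughly TTL / 3, and step down as soon as an acquire or renew fails. It is illustrative only; the function signature and error handling are simplified, and the bucket/key layout is an assumption based on the key named above.

package snapshot

import (
    "sync/atomic"
    "time"

    "github.com/nats-io/nats.go"
)

// electionLoop keeps trying to hold the snapshot-leader key. Only the node
// whose ID is stored under the key (and who keeps renewing it) uploads
// snapshots; everyone else stays a follower.
func electionLoop(js nats.JetStreamContext, nodeID string, isLeader *atomic.Bool) error {
    // Bucket with a 30s TTL: if the leader stops renewing, the key expires
    // and another node can take over.
    kv, err := js.CreateKeyValue(&nats.KeyValueConfig{
        Bucket: "snapshot-leader",
        TTL:    30 * time.Second,
    })
    if err != nil {
        return err
    }

    var rev uint64
    ticker := time.NewTicker(10 * time.Second) // heartbeat at TTL / 3
    defer ticker.Stop()

    for range ticker.C {
        if isLeader.Load() {
            // Renew: succeeds only if we still own the last revision.
            rev, err = kv.Update("snapshot-leader", []byte(nodeID), rev)
        } else {
            // Acquire: succeeds only if the key does not currently exist.
            rev, err = kv.Create("snapshot-leader", []byte(nodeID))
        }
        if err != nil {
            // Lease lost or NATS unreachable: step down immediately so we
            // never upload as a stale leader.
            isLeader.Store(false)
            continue
        }
        isLeader.Store(true)
    }
    return nil
}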
Negative Scenarios & Failure Modes
Understanding how the system behaves during failures is critical for operations.
1. Storage Backend Failures
If the configured storage (S3, SFTP, etc.) is unavailable during a snapshot upload:
- Behavior: The SaveSnapshot() function returns an error.
- Retry: There is no immediate retry for that specific snapshot; the system waits for the next configured interval.
- Impact: Recovery points become stale.
- Monitoring: Watch logs for "Unable to cleanup temp path" or "Upload failed".
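The resulting control flow is roughly the following (saveSnapshot and the loop itself are hypothetical stand-ins, not HarmonyLite's actual internals): a failed upload is logged and dropped, and nothing runs again until the next tick.

package snapshot

import (
    "log"
    "time"
)

// runSnapshotLoop captures and uploads a snapshot once per interval. A failed
// attempt is logged and dropped; there is no inner retry, so the next recovery
// point only appears one full interval later.
func runSnapshotLoop(interval time.Duration, saveSnapshot func() error) {
    ticker := time.NewTicker(interval)
    defer ticker.Stop()

    for range ticker.C {
        if err := saveSnapshot(); err != nil {
            log.Printf("Upload failed: %v", err) // e.g. storage backend unreachable
            continue // recovery points go stale until the next tick succeeds
        }
        log.Printf("Snapshot uploaded")
    }
}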
2. Disk Space Exhaustion
Snapshot creation requires temporary disk space roughly equal to the database size.
- Error: "no space left on device" during VACUUM INTO.
- Result: The snapshot is aborted.
- Recovery: The system cleans up partial files (best effort). If cleanup fails, you may see leftover harmonylite-snapshot-* directories in /tmp.
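If you want to check this precondition yourself, a small operator-side helper along these lines works on Linux/macOS. This is a hypothetical check, not something HarmonyLite performs on its own.

package snapshot

import (
    "os"

    "golang.org/x/sys/unix" // Linux/macOS; Windows needs a different API
)

// hasRoomForSnapshot reports whether tmpDir has at least as much free space
// as the current database file, the rough requirement for VACUUM INTO.
func hasRoomForSnapshot(dbPath, tmpDir string) (bool, error) {
    info, err := os.Stat(dbPath)
    if err != nil {
        return false, err
    }
    var st unix.Statfs_t
    if err := unix.Statfs(tmpDir, &st); err != nil {
        return false, err
    }
    free := st.Bavail * uint64(st.Bsize)
    return free >= uint64(info.Size()), nil
}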
3. Leader Flapping
If network latency between the leader and NATS exceeds leader_ttl:
- Symptom: Leadership constantly switches between nodes.
- Log: Frequent "Lost snapshot leadership" and "Became snapshot leader" messages.
- Fix: Increase snapshot.leader_ttl in the configuration (e.g., from 30000 to 60000).
4. "System Busy" (Overlap)
If a snapshot takes longer to generate/upload than the configured interval:
- Error: ErrPendingSnapshot ("system busy capturing snapshot").
- Behavior: The new snapshot attempt is skipped.
- Prevention: Ensure interval >> (generation time + upload time).
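A minimal sketch of this kind of guard (the error text comes from the message above; the type and field names are illustrative): an atomic flag marks a capture in progress, and overlapping attempts return immediately instead of queueing.

package snapshot

import (
    "errors"
    "sync/atomic"
)

// ErrPendingSnapshot is returned when a capture is already in progress.
var ErrPendingSnapshot = errors.New("system busy capturing snapshot")

type Snapshotter struct {
    busy    atomic.Bool
    capture func() error // performs the actual VACUUM INTO, sanitize, and upload
}

// Save refuses to overlap with an in-flight snapshot: the new attempt is
// skipped rather than queued.
func (s *Snapshotter) Save() error {
    if !s.busy.CompareAndSwap(false, true) {
        return ErrPendingSnapshot
    }
    defer s.busy.Store(false)
    return s.capture()
}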
5. Corrupted Snapshot Download
If a downloaded snapshot is corrupted (truncated file, bad checksum):
- Detection: SQLite RestoreFrom will fail, or the database will be unopenable.
- Recovery:
  - Automatic: The node logs the error and effectively restarts the initialization process.
  - Manual intervention: Stop the node, delete the local data.db and seq-map.cbor, and restart.
Configuration Tuning
Basic Schedule
[snapshot]
enabled = true
interval = 3600000 # 1 hour (Default). Set lower for lower RPO.
Leadership Tuning
For high-latency networks or heavy-load leaders:
[snapshot]
leader_ttl = 60000 # Increase to 60s to prevent flapping
Storage Backends
S3 (AWS/MinIO)
Supports standard AWS SDK authentication (Env vars, IAM roles).
[snapshot.s3]
endpoint = "s3.us-east-1.amazonaws.com" # or minio:9000
bucket = "backups"
path = "prod-cluster" # Prefix for keys
region = "us-east-1"
force_path_style = false # Set true for MinIO
SFTP
Useful for "air-gapped" backups to a secure host.
[snapshot.sftp]
# format: sftp://user:pass@host:port/path
url = "sftp://backup:secret@10.0.0.5:22/mnt/backups"
Restoration & Recovery
How to Force a Restore
To force a node to re-download the latest snapshot:
- Stop the service: systemctl stop harmonylite
- Clear data: rm /var/lib/harmonylite/data.db
- Clear sequence map: rm /var/lib/harmonylite/seq-map.cbor
  - Critical: Removing the sequence map forces the node to check for snapshots. If you delete the DB but keep the map, it will try to replay streams instead.
- Start the service: systemctl start harmonylite
Verifying Integrity
You can verify a snapshot's integrity manually:
sqlite3 snapshot.db "PRAGMA integrity_check;"