Troubleshooting Guide
This guide helps you diagnose and resolve common issues with HarmonyLite deployments. It covers installation problems, replication issues, performance bottlenecks, and recovery procedures.
Diagnostic Tools
Before diving into specific issues, familiarize yourself with these diagnostic tools:
Log Analysis
HarmonyLite logs provide valuable troubleshooting information. When using systemd, logs are sent to the journal. Enable verbose logging temporarily by modifying your config:
[logging]
verbose = true
format = "json" # or "console" for human-readable format
Access logs using journalctl:
# View all logs for the HarmonyLite service
journalctl -u harmonylite
# View recent logs
journalctl -u harmonylite -n 100
# Follow logs in real-time
journalctl -u harmonylite -f
Prometheus Metrics
Enable Prometheus metrics to monitor performance:
[prometheus]
enable = true
bind = "0.0.0.0:3010"
Access metrics at http://<node-ip>:3010/metrics
Health Check Endpoint
HarmonyLite provides a health check HTTP endpoint that can be used for monitoring the status of nodes. Enable it in your configuration:
[health_check]
enable = true
bind = "0.0.0.0:8090"
path = "/health"
detailed = true
Access the health status at http://<node-ip>:8090/health
This health check can be integrated with container orchestration systems like Docker and Kubernetes for automated monitoring and failover. See the Health Check documentation for more details.
Available Metrics
HarmonyLite exposes the following metrics that can be used for monitoring and troubleshooting:
Database Metrics
Metric | Type | Description |
---|---|---|
published | Counter | Number of database change rows that have been published to the NATS stream |
pending_publish | Gauge | Number of rows that are pending to be published, which can indicate a backlog |
count_changes | Histogram | Latency (in microseconds) for counting changes in the database |
scan_changes | Histogram | Latency (in microseconds) for scanning change rows in the database |
Performance Indicators
- High
pending_publish
values indicate that HarmonyLite is experiencing delays in propagating changes. - Increasing
count_changes
orscan_changes
latencies may indicate database performance issues. - Low
published
rate compared to write activity could indicate replication issues.
Understanding HarmonyLite Metrics
HarmonyLite uses a change data capture (CDC) mechanism to track and replicate database changes. The metrics help monitor this process:
-
Change Detection: When database changes occur, HarmonyLite detects them and marks them as pending in a change log table.
-
Change Publishing: The pending changes are published to NATS streams, and the
published
counter increases. -
Replication: Other nodes consume these published changes and apply them to their local databases.
Monitoring these metrics provides insights into the health of this process. For example:
- A consistently high
pending_publish
value could indicate network issues or that consumers are not keeping up with the change rate. - If
count_changes
andscan_changes
latencies increase, it might indicate that the SQLite database is under heavy load.
NATS Monitoring
Check NATS server status:
# If using embedded NATS
curl http://localhost:8222/varz
curl http://localhost:8222/jsz
# List streams
curl http://localhost:8222/jsz?streams=1
SQLite Analysis
Examine the SQLite database directly:
sqlite3 /path/to/your.db
Useful SQLite commands:
-- Check if triggers are installed
SELECT name FROM sqlite_master WHERE type='trigger' AND name LIKE '__harmonylite%';
-- Check change log tables
SELECT name FROM sqlite_master WHERE type='table' AND name LIKE '__harmonylite%';
-- Count pending changes
SELECT COUNT(*) FROM __harmonylite___change_log_global;
Performance Profiling with pprof
HarmonyLite includes Go's built-in performance profiling which can be enabled with the -pprof
flag to diagnose performance issues:
# Start HarmonyLite with profiling enabled on port 6060
./harmonylite -config /path/to/config.toml -pprof "127.0.0.1:6060"
Once enabled, you can access the following profiling endpoints:
- Overview:
http://127.0.0.1:6060/debug/pprof/
- CPU Profile:
http://127.0.0.1:6060/debug/pprof/profile
(runs for 30 seconds by default) - Heap Memory Profile:
http://127.0.0.1:6060/debug/pprof/heap
- Goroutine Stack Traces:
http://127.0.0.1:6060/debug/pprof/goroutine
- Thread Creation Profile:
http://127.0.0.1:6060/debug/pprof/threadcreate
- Blocking Profile:
http://127.0.0.1:6060/debug/pprof/block
- Execution Trace:
http://127.0.0.1:6060/debug/pprof/trace
For more advanced analysis, use the Go pprof tool:
# CPU profiling
go tool pprof http://127.0.0.1:6060/debug/pprof/profile
# Memory profiling
go tool pprof http://127.0.0.1:6060/debug/pprof/heap
# For a 5-second CPU profile:
go tool pprof http://127.0.0.1:6060/debug/pprof/profile?seconds=5
Inside the pprof interactive shell:
top
: Show top functions by usageweb
: Generate a graph visualization (requires Graphviz)list [function]
: Show source code with profiling data
Note: Use profiling carefully in production environments as it exposes internal details about your application and may impact performance.
Common Issues and Solutions
Installation and Setup
Problem: HarmonyLite Fails to Start
Symptoms:
- Service fails to start
- "command not found" errors
- Permission denied errors
Potential Causes and Solutions:
-
Binary not executable:
chmod +x /path/to/harmonylite
-
Missing dependencies:
ldd /path/to/harmonylite
# Install any missing dependencies -
Permission issues:
# Check file ownership
ls -la /path/to/harmonylite
# Check directory permissions
ls -la /var/lib/harmonylite
# Fix permissions
chown harmonylite:harmonylite /var/lib/harmonylite
chmod 750 /var/lib/harmonylite -
Config file problems:
# Validate config manually
cat /path/to/config.toml
Problem: Configuration Validation Errors
Symptoms:
- "Invalid configuration" errors
- Service starts but exits immediately
Solutions:
- Verify TOML syntax is valid
- Check that all required fields are present
- Ensure paths exist and are accessible
- Validate that
node_id
is unique within the cluster
Replication Issues
Problem: Changes Not Replicating
Symptoms:
- Changes made on one node are not appearing on other nodes
- Replication metrics show no activity
Potential Causes and Solutions:
-
NATS connectivity issues:
# Check NATS status
curl http://localhost:8222/varz
# Test connection from other nodes
telnet <nats-server-ip> 4222 -
Triggers not installed:
-- Check triggers
SELECT name FROM sqlite_master WHERE type='trigger' AND name LIKE '__harmonylite%';
-- Reinstall triggers
-- Exit SQLite and run:
harmonylite -config /path/to/config.toml -cleanup
-- Then restart HarmonyLite -
Change logs not being created:
-- Make a test change with trusted_schema enabled
PRAGMA trusted_schema = ON;
INSERT INTO test_table (name) VALUES ('test');
-- Check if it appears in change log
SELECT * FROM __harmonylite__test_table_change_log ORDER BY id DESC LIMIT 1; -
Publishing disabled:
# Check config.toml for:
publish = false # Should be true for nodes that need to send changes -
NATS stream not created:
# Check if streams exist
curl http://localhost:8222/jsz?streams=1
# Recreate streams
# First stop HarmonyLite, then restart with clean state
rm /path/to/seq-map.cbor
# Restart HarmonyLite
Problem: High Replication Latency
Symptoms:
- Changes take a long time to propagate
- High
pending_publish
metrics
Solutions:
-
Increase shards:
[replication_log]
shards = 4 # Increase from default -
Enable compression:
[replication_log]
compress = true -
Check network latency:
ping <other-node-ip>
-
Monitor disk I/O:
iostat -x 1
-
Adjust cleanup interval:
# Decrease to cleanup more frequently
cleanup_interval = 30000 # 30 seconds
Database Issues
Problem: Database Locks
Symptoms:
- "database is locked" errors
- Operations timing out
- Replication stalls
Solutions:
-
Check for long-running transactions:
PRAGMA busy_timeout = 30000; -- Set in your application
-
Use WAL journal mode:
PRAGMA journal_mode = WAL; -- Set in your application
-
Check for other processes accessing the database:
lsof | grep your.db
-
Verify SQLite version:
sqlite3 --version
# Should be 3.35.0 or newer -
Consider timeout settings:
# Add to application connection string
?_timeout=30000&_journal_mode=WAL
Problem: Database Corruption
Symptoms:
- "malformed database" errors
- Unexpected query results
- Application crashes
Solutions:
-
Check database integrity:
PRAGMA integrity_check;
-
Restore from snapshot:
# Stop HarmonyLite
systemctl stop harmonylite
# Remove corrupt database
rm /path/to/your.db
# Restart to trigger recovery
systemctl start harmonylite -
Recover from backup:
# Restore from backup
cp /path/to/backup.db /path/to/your.db
# Remove sequence map to force reinitialization
rm /path/to/seq-map.cbor
# Restart HarmonyLite
systemctl start harmonylite
Snapshot and Recovery
Problem: Snapshot Creation Fails
Symptoms:
- "Failed to create snapshot" errors
- No snapshots appearing in storage
snapshot_age
metric keeps increasing
Solutions:
-
Check storage connectivity:
# Test S3 access
aws s3 ls s3://your-bucket/
# Test WebDAV
curl -u username:password https://webdav.example.com/ -
Verify permissions:
# For local file storage
ls -la /path/to/snapshot/dir
# For S3
aws s3 ls s3://your-bucket/ --debug -
Ensure enough disk space:
df -h
-
Force snapshot creation:
harmonylite -config /path/to/config.toml -save-snapshot
-
Check storage configuration:
[snapshot]
enabled = true
store = "s3" # Verify this matches your credentials
[snapshot.s3]
# Verify all credentials are correct
Problem: Recovery from Snapshot Fails
Symptoms:
- "Failed to restore snapshot" errors
- Service fails to start after deleting database
- Inconsistent state after recovery
Solutions:
-
Check sequence map:
# Remove sequence map to force full recovery
rm /path/to/seq-map.cbor -
Verify snapshot access:
# For S3
aws s3 ls s3://your-bucket/harmonylite/snapshots/ -
Try manual restore:
# Download snapshot manually
aws s3 cp s3://your-bucket/harmonylite/snapshots/latest.db /tmp/
# Replace database
cp /tmp/latest.db /path/to/your.db
# Fix permissions
chown harmonylite:harmonylite /path/to/your.db
# Remove sequence map
rm /path/to/seq-map.cbor
# Restart
systemctl start harmonylite -
Check logs for specific errors:
journalctl -u harmonylite | grep "snapshot"
Performance Issues
Problem: High CPU Usage
Symptoms:
- CPU consistently above 70%
- Slow response times
- Process using excessive resources
Solutions:
-
Profile the process:
top -p $(pgrep harmonylite)
-
Check if compression is causing overhead:
# Try disabling compression temporarily
[replication_log]
compress = false -
Adjust shard count:
# If too high, reduce:
[replication_log]
shards = 2 # Start low and increase as needed -
Monitor change volume:
# Check Prometheus metrics
curl http://localhost:3010/metrics | grep harmonylite_published -
Consider hardware upgrade if consistently high
Problem: Memory Leaks
Symptoms:
- Steadily increasing memory usage
- Eventually crashes with out-of-memory errors
- Degraded performance over time
Solutions:
-
Monitor memory usage:
ps -o pid,user,%mem,rss,command -p $(pgrep harmonylite)
-
Set memory limits in systemd:
# In /etc/systemd/system/harmonylite.service
[Service]
MemoryLimit=512M -
Restart periodically if needed:
# In crontab
0 4 * * * systemctl restart harmonylite -
Update to latest version as memory leaks are often fixed in updates
NATS Issues
Problem: NATS Connection Failures
Symptoms:
- "Failed to connect to NATS" errors
- Intermittent disconnections
- Stream creation failures
Solutions:
-
Check NATS server status:
curl http://localhost:8222/varz
-
Verify NATS URLs:
[nats]
urls = ["nats://server1:4222", "nats://server2:4222"]
# Verify all servers are running -
Increase connection retry settings:
[nats]
connect_retries = 10
reconnect_wait_seconds = 5 -
Check authentication:
[nats]
# Verify credentials match server configuration
user_name = "harmonylite"
user_password = "your-password" -
Test NATS connectivity directly:
# Install NATS CLI
curl -sf https://install.nats.io/install.sh | sh
# Test connection
nats pub test.subject "hello" --server nats://server:4222
Problem: JetStream Errors
Symptoms:
- "Failed to create stream" errors
- "No responders available" errors
- Stream memory or storage errors
Solutions:
-
Check JetStream status:
curl http://localhost:8222/jsz
-
Verify JetStream is enabled on NATS server
-
Check storage limits:
# On NATS server
df -h /path/to/jetstream/storage -
Adjust stream settings:
[replication_log]
max_entries = 1024 # Reduce if storage is limited -
Recreate streams if corrupted:
# Using NATS CLI
nats stream ls --server nats://server:4222
nats stream rm harmonylite-changes-1 --server nats://server:4222
# Then restart HarmonyLite
Sleep Timeout and Serverless Operation
Problem: HarmonyLite Exits Unexpectedly
Symptoms:
- Process terminates after a period of inactivity
- Log shows "No more events to process, initiating shutdown"
Solutions:
-
Check sleep timeout setting:
# Disable automatic shutdown by setting to 0 (default)
sleep_timeout = 0 -
Adjust timeout duration if you want the serverless behavior:
# Set longer timeout in milliseconds, e.g., 30 minutes
sleep_timeout = 1800000 -
Ensure your orchestration system can handle the expected restarts if using serverless mode
Fixing Triggers and Schema Issues
Problem: Missing or Corrupted Triggers
Symptoms:
- Changes not being captured
- Missing change log tables
- Schema change errors
Solutions:
-
Check if triggers exist:
SELECT name FROM sqlite_master WHERE type='trigger' AND name LIKE '__harmonylite%';
-
Clean up and reinstall triggers:
harmonylite -config /path/to/config.toml -cleanup
-
Verify SQLite version compatibility:
sqlite3 --version
# Should be 3.35.0 or newer -
Enable trusted schema in applications:
PRAGMA trusted_schema = ON;
Problem: Schema Changes Break Replication
Symptoms:
- Errors after changing table structures
- "no such column" errors
- Replication stops after ALTER TABLE operations
Solutions:
-
Proper schema change procedure:
- Stop applications
- Apply changes on one node
- Run cleanup to reset triggers:
harmonylite -config /path/to/config.toml -cleanup
- Restart HarmonyLite
- Wait for replication
- Repeat on other nodes
-
Verify table structure is identical on all nodes:
.schema table_name
-
Check for foreign key issues:
PRAGMA foreign_key_check;
Recovery Procedures
Full Node Recovery
If a node is completely corrupted or needs to be rebuilt:
-
Stop HarmonyLite:
systemctl stop harmonylite
-
Clean up existing files:
rm /var/lib/harmonylite/data.db
rm /var/lib/harmonylite/seq-map.cbor -
Start HarmonyLite (it will recover automatically):
systemctl start harmonylite
-
Monitor logs for recovery progress:
journalctl -u harmonylite -f
Manual Database Repair
For advanced recovery when automatic procedures fail:
-
Create a backup first:
cp /var/lib/harmonylite/data.db /var/lib/harmonylite/data.db.bak
-
Try SQLite recovery:
sqlite3 /var/lib/harmonylite/data.db "PRAGMA integrity_check;"
-
Dump and restore if integrity check fails:
# Dump schema
echo .schema | sqlite3 /var/lib/harmonylite/data.db.bak > schema.sql
# Dump data (excluding HarmonyLite tables)
sqlite3 /var/lib/harmonylite/data.db.bak <<EOF
.mode insert
.output data.sql
SELECT * FROM sqlite_master WHERE type='table' AND name NOT LIKE '__harmonylite%';
.quit
EOF
# Create new database
sqlite3 /var/lib/harmonylite/data.db < schema.sql
sqlite3 /var/lib/harmonylite/data.db < data.sql
# Reset sequence map
rm /var/lib/harmonylite/seq-map.cbor
# Restart HarmonyLite
systemctl restart harmonylite
Diagnostic Commands Reference
Issue | Diagnostic Command | What to Look For |
---|---|---|
Node Status | systemctl status harmonylite | Active (running) status |
Logs | journalctl -u harmonylite -n 100 | Recent error messages |
Process Resources | ps -o pid,%cpu,%mem,vsz,rss -p $(pgrep harmonylite) | CPU/memory usage |
Health Check | curl http://localhost:8090/health | "status":"healthy" |
NATS Status | curl http://localhost:8222/varz | Server running, connections |
NATS Streams | curl http://localhost:8222/jsz?streams=1 | Stream existence, message counts |
Database Size | du -sh /var/lib/harmonylite/data.db | Growth trends |
Database Integrity | echo "PRAGMA integrity_check;" | sqlite3 /var/lib/harmonylite/data.db | "ok" result |
Triggers | echo "SELECT count(*) FROM sqlite_master WHERE type='trigger' AND name LIKE '__harmonylite%';" | sqlite3 /var/lib/harmonylite/data.db | Non-zero count |
Change Log Tables | echo "SELECT count(*) FROM sqlite_master WHERE type='table' AND name LIKE '__harmonylite%';" | sqlite3 /var/lib/harmonylite/data.db | Non-zero count |
Pending Changes | echo "SELECT count(*) FROM __harmonylite___global_change_log;" | sqlite3 /var/lib/harmonylite/data.db | Should be low or zero |
Network Connectivity | ss -tpln | grep harmonylite | Listening ports |
Getting More Help
If you're still having issues after following this guide:
-
Check GitHub Issues for similar problems and solutions
-
Gather diagnostic information:
# Create diagnostic bundle
mkdir -p /tmp/harmonylite-diag
cp /etc/harmonylite/config.toml /tmp/harmonylite-diag/
journalctl -u harmonylite -n 1000 > /tmp/harmonylite-diag/journal-logs.txt
curl http://localhost:3010/metrics > /tmp/harmonylite-diag/metrics.txt
curl http://localhost:8222/varz > /tmp/harmonylite-diag/nats-varz.json
curl http://localhost:8222/jsz > /tmp/harmonylite-diag/nats-jsz.json
harmonylite -version > /tmp/harmonylite-diag/version.txt
tar -czf harmonylite-diag.tar.gz -C /tmp harmonylite-diag -
Open a GitHub Issue with the diagnostic bundle and detailed description of your problem
-
Join Community Discussion for assistance from other users and developers