Troubleshooting Guide

This guide helps you diagnose and resolve common issues with HarmonyLite deployments. It covers installation problems, replication issues, performance bottlenecks, and recovery procedures.

Diagnostic Tools

Before diving into specific issues, familiarize yourself with these diagnostic tools:

Log Analysis

HarmonyLite logs provide valuable troubleshooting information. When using systemd, logs are sent to the journal. Enable verbose logging temporarily by modifying your config:

[logging]
verbose = true
format = "json"  # or "console" for human-readable format

Access logs using journalctl:

# View all logs for the HarmonyLite service
journalctl -u harmonylite

# View recent logs
journalctl -u harmonylite -n 100

# Follow logs in real-time
journalctl -u harmonylite -f

Prometheus Metrics

Enable Prometheus metrics to monitor performance:

[prometheus]
enable = true
bind = "0.0.0.0:3010"

Access metrics at http://<node-ip>:3010/metrics

Health Check Endpoint

HarmonyLite provides an HTTP health check endpoint for monitoring node status. Enable it in your configuration:

[health_check]
enable = true
bind = "0.0.0.0:8090"
path = "/health"
detailed = true

Access the health status at http://<node-ip>:8090/health

This health check can be integrated with container orchestration systems like Docker and Kubernetes for automated monitoring and failover. See the Health Check documentation for more details.
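
For scripted probes, the response can be checked with a small shell helper. This is a minimal sketch: the function name `is_healthy` is hypothetical, and it assumes the response body contains `"status":"healthy"` when the node is fine, as listed in the Diagnostic Commands Reference.

```shell
# Succeeds (exit 0) when the JSON read from stdin reports a healthy status.
# Assumption: a healthy node returns a body containing "status":"healthy".
is_healthy() {
  grep -Eq '"status"[[:space:]]*:[[:space:]]*"healthy"'
}

# Example (port and path taken from the configuration above):
# curl -fsS http://localhost:8090/health | is_healthy && echo "node healthy"
```

Reading from stdin keeps the check independent of how the response is fetched, so the same helper works with curl, wget, or a saved response file.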

Available Metrics

HarmonyLite exposes the following metrics that can be used for monitoring and troubleshooting:

Database Metrics

Metric	Type	Description
`published`	Counter	Number of database change rows that have been published to the NATS stream
`pending_publish`	Gauge	Number of rows that are pending to be published, which can indicate a backlog
`count_changes`	Histogram	Latency (in microseconds) for counting changes in the database
`scan_changes`	Histogram	Latency (in microseconds) for scanning change rows in the database

Performance Indicators

High pending_publish values indicate that HarmonyLite is experiencing delays in propagating changes.
Increasing count_changes or scan_changes latencies may indicate database performance issues.
Low published rate compared to write activity could indicate replication issues.

Understanding HarmonyLite Metrics

HarmonyLite uses a change data capture (CDC) mechanism to track and replicate database changes. The metrics help monitor this process:

Change Detection: When database changes occur, HarmonyLite detects them and marks them as pending in a change log table.
Change Publishing: The pending changes are published to NATS streams, and the published counter increases.
Replication: Other nodes consume these published changes and apply them to their local databases.

Monitoring these metrics provides insights into the health of this process. For example:

A consistently high pending_publish value could indicate network issues or that consumers are not keeping up with the change rate.
If count_changes and scan_changes latencies increase, it might indicate that the SQLite database is under heavy load.
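
To watch the backlog from a script, the gauge can be pulled out of the Prometheus text output. The sketch below is illustrative only: `pending_publish_total` is a hypothetical helper, and because the exact metric prefix and labels depend on your configuration, it matches any series whose name ends in `pending_publish` and sums the values.

```shell
# Sums every pending_publish sample found in Prometheus text format on stdin.
# Assumption: samples carry no trailing timestamp, so the last field is the value.
pending_publish_total() {
  awk '!/^#/ && $1 ~ /pending_publish([{]|$)/ { sum += $NF; found = 1 }
       END { if (found) print sum + 0; else print "n/a" }'
}

# Example (metrics port taken from the configuration above):
# curl -s http://localhost:3010/metrics | pending_publish_total
```

Run it at intervals: a total that keeps growing points to the backlog described above, while a total that returns to zero suggests publishing is keeping up.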

NATS Monitoring

Check NATS server status:

# If using embedded NATS
curl http://localhost:8222/varz
curl http://localhost:8222/jsz

# List streams
curl 'http://localhost:8222/jsz?streams=1'

SQLite Analysis

Examine the SQLite database directly:

sqlite3 /path/to/your.db

Useful SQLite commands:

-- Check if triggers are installed
SELECT name FROM sqlite_master WHERE type='trigger' AND name LIKE '__harmonylite%';

-- Check change log tables
SELECT name FROM sqlite_master WHERE type='table' AND name LIKE '__harmonylite%';

-- Count pending changes
SELECT COUNT(*) FROM __harmonylite___change_log_global;

Performance Profiling with pprof

HarmonyLite includes Go's built-in pprof profiler. Enable it with the -pprof flag to diagnose performance issues:

# Start HarmonyLite with profiling enabled on port 6060
./harmonylite -config /path/to/config.toml -pprof "127.0.0.1:6060"

Once enabled, you can access the following profiling endpoints:

Overview: http://127.0.0.1:6060/debug/pprof/
CPU Profile: http://127.0.0.1:6060/debug/pprof/profile (runs for 30 seconds by default)
Heap Memory Profile: http://127.0.0.1:6060/debug/pprof/heap
Goroutine Stack Traces: http://127.0.0.1:6060/debug/pprof/goroutine
Thread Creation Profile: http://127.0.0.1:6060/debug/pprof/threadcreate
Blocking Profile: http://127.0.0.1:6060/debug/pprof/block
Execution Trace: http://127.0.0.1:6060/debug/pprof/trace

For more advanced analysis, use the Go pprof tool:

# CPU profiling
go tool pprof http://127.0.0.1:6060/debug/pprof/profile

# Memory profiling
go tool pprof http://127.0.0.1:6060/debug/pprof/heap

# For a 5-second CPU profile:
go tool pprof 'http://127.0.0.1:6060/debug/pprof/profile?seconds=5'

Inside the pprof interactive shell:

top: Show top functions by usage
web: Generate a graph visualization (requires Graphviz)
list [function]: Show source code with profiling data

Note: Use profiling carefully in production environments as it exposes internal details about your application and may impact performance.

Common Issues and Solutions

Installation and Setup

Problem: HarmonyLite Fails to Start

Symptoms:

Service fails to start
"command not found" errors
Permission denied errors

Potential Causes and Solutions:

Binary not executable:
```
chmod +x /path/to/harmonylite
```

Missing dependencies:

ldd /path/to/harmonylite
# Install any missing dependencies

Permission issues:

# Check file ownership
ls -la /path/to/harmonylite

# Check directory permissions
ls -la /var/lib/harmonylite

# Fix permissions
chown harmonylite:harmonylite /var/lib/harmonylite
chmod 750 /var/lib/harmonylite

Config file problems:

# Inspect the config file for syntax errors
cat /path/to/config.toml

Problem: Configuration Validation Errors

Symptoms:

"Invalid configuration" errors
Service starts but exits immediately

Solutions:

Verify TOML syntax is valid
Check that all required fields are present
Ensure paths exist and are accessible
Validate that node_id is unique within the cluster
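
The path check in the list above can be scripted. This is a rough sketch with a hypothetical helper name (`check_path`); it only tests what the current user can see, so run it as the account that runs the service.

```shell
# Reports whether a path exists and is readable and writable by the current user.
# Returns 0 when the path is usable, 1 otherwise.
check_path() {
  if [ ! -e "$1" ]; then
    echo "missing: $1"
    return 1
  fi
  if [ ! -r "$1" ] || [ ! -w "$1" ]; then
    echo "not accessible: $1"
    return 1
  fi
  echo "ok: $1"
}

# Example, using paths that appear elsewhere in this guide:
# check_path /var/lib/harmonylite
# check_path /var/lib/harmonylite/data.db
```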

Replication Issues

Problem: Changes Not Replicating

Symptoms:

Changes made on one node are not appearing on other nodes
Replication metrics show no activity

Potential Causes and Solutions:

NATS connectivity issues:

# Check NATS status
curl http://localhost:8222/varz

# Test connection from other nodes
telnet <nats-server-ip> 4222

Triggers not installed:

-- Check triggers
SELECT name FROM sqlite_master WHERE type='trigger' AND name LIKE '__harmonylite%';

# To reinstall triggers, exit SQLite and run this from your shell:
harmonylite -config /path/to/config.toml -cleanup
# Then restart HarmonyLite

Change logs not being created:

-- Make a test change with trusted_schema enabled
PRAGMA trusted_schema = ON;
INSERT INTO test_table (name) VALUES ('test');

-- Check if it appears in change log
SELECT * FROM __harmonylite__test_table_change_log ORDER BY id DESC LIMIT 1;

Publishing disabled:

# Check config.toml for:
publish = false  # Should be true for nodes that need to send changes

NATS stream not created:

# Check if streams exist
curl 'http://localhost:8222/jsz?streams=1'

# Recreate streams
# First stop HarmonyLite, then restart with clean state
rm /path/to/seq-map.cbor
# Restart HarmonyLite

Problem: High Replication Latency

Symptoms:

Changes take a long time to propagate
High pending_publish metrics

Solutions:

Increase shards:

[replication_log]
shards = 4  # Increase from default

Enable compression:
```
[replication_log]
compress = true
```
Check network latency:
```
ping <other-node-ip>
```
Monitor disk I/O:
```
iostat -x 1
```

Adjust cleanup interval:

# Decrease to cleanup more frequently
cleanup_interval = 30000  # 30 seconds

Database Issues

Problem: Database Locks

Symptoms:

"database is locked" errors
Operations timing out
Replication stalls

Solutions:

Keep transactions short, and set a busy timeout so connections wait for a lock instead of failing immediately:

PRAGMA busy_timeout = 30000;  -- Set in your application

Use WAL journal mode:

PRAGMA journal_mode = WAL;  -- Set in your application

Check for other processes accessing the database:
```
lsof /path/to/your.db
```

Verify SQLite version:

sqlite3 --version
# Should be 3.35.0 or newer

Consider timeout settings:

# Add to application connection string
?_timeout=30000&_journal_mode=WAL
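
The SQLite version check above can be automated with a small comparison helper. This is a sketch; `version_ge` is a hypothetical name, and it compares dotted numeric versions component by component.

```shell
# Succeeds when dotted version $1 is greater than or equal to $2.
# Missing components are treated as 0, so "3.35" equals "3.35.0".
version_ge() {
  awk -v have="$1" -v want="$2" 'BEGIN {
    n = split(have, h, "."); m = split(want, w, ".")
    len = (n > m) ? n : m
    for (i = 1; i <= len; i++) {
      a = h[i] + 0; b = w[i] + 0
      if (a > b) exit 0
      if (a < b) exit 1
    }
    exit 0
  }'
}

# Example: the first field of `sqlite3 --version` is the version number.
# version_ge "$(sqlite3 --version | awk '{print $1}')" 3.35.0 || echo "SQLite is too old"
```

A numeric comparison matters here because a plain string comparison would rank 3.9 above 3.35.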

Problem: Database Corruption

Symptoms:

"malformed database" errors
Unexpected query results
Application crashes

Solutions:

Check database integrity:
```
PRAGMA integrity_check;
```

Restore from snapshot:

# Stop HarmonyLite
systemctl stop harmonylite

# Remove corrupt database
rm /path/to/your.db

# Restart to trigger recovery
systemctl start harmonylite

Recover from backup:

# Stop HarmonyLite before replacing the database
systemctl stop harmonylite

# Restore from backup
cp /path/to/backup.db /path/to/your.db

# Remove sequence map to force reinitialization
rm /path/to/seq-map.cbor

# Restart HarmonyLite
systemctl start harmonylite
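
The snapshot-restore steps above delete the corrupt database outright. If you would rather keep it for later inspection, one option is to move it aside together with SQLite's WAL sidecar files. The helper below is a sketch: the name `quarantine_db` and the `.corrupt-<timestamp>` suffix are illustrative choices, not something HarmonyLite provides.

```shell
# Moves a database file and its -wal/-shm sidecar files aside instead of deleting them.
# Run this only while HarmonyLite is stopped.
quarantine_db() {
  db=$1
  stamp=$(date +%Y%m%d-%H%M%S)
  for f in "$db" "$db-wal" "$db-shm"; do
    if [ -e "$f" ]; then
      mv "$f" "$f.corrupt-$stamp" || return 1
    fi
  done
}

# Example:
# systemctl stop harmonylite
# quarantine_db /path/to/your.db
# systemctl start harmonylite
```

Moving the sidecar files as well avoids a stale write-ahead log being picked up by the freshly restored database.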

Snapshot and Recovery

Problem: Snapshot Creation Fails

Symptoms:

"Failed to create snapshot" errors
No snapshots appearing in storage
snapshot_age metric keeps increasing

Solutions:

Check storage connectivity:

# Test S3 access
aws s3 ls s3://your-bucket/

# Test WebDAV
curl -u username:password https://webdav.example.com/

Verify permissions:

# For local file storage
ls -la /path/to/snapshot/dir

# For S3
aws s3 ls s3://your-bucket/ --debug

Ensure enough disk space:
```
df -h
```

Force snapshot creation:

harmonylite -config /path/to/config.toml -save-snapshot

Check storage configuration:

[snapshot]
enabled = true
store = "s3"  # Verify this matches your credentials

[snapshot.s3]
# Verify all credentials are correct

Problem: Recovery from Snapshot Fails

Symptoms:

"Failed to restore snapshot" errors
Service fails to start after deleting database
Inconsistent state after recovery

Solutions:

Check sequence map:

# Remove sequence map to force full recovery
rm /path/to/seq-map.cbor

Verify snapshot access:

# For S3
aws s3 ls s3://your-bucket/harmonylite/snapshots/

Try manual restore:

# Download snapshot manually
aws s3 cp s3://your-bucket/harmonylite/snapshots/latest.db /tmp/

# Replace database
cp /tmp/latest.db /path/to/your.db

# Fix permissions
chown harmonylite:harmonylite /path/to/your.db

# Remove sequence map
rm /path/to/seq-map.cbor

# Restart
systemctl start harmonylite

Check logs for specific errors:

journalctl -u harmonylite | grep "snapshot"

Performance Issues

Problem: High CPU Usage

Symptoms:

CPU consistently above 70%
Slow response times
Process using excessive resources

Solutions:

Profile the process:
```
top -p $(pgrep harmonylite)
```

Check if compression is causing overhead:

# Try disabling compression temporarily
[replication_log]
compress = false

Adjust shard count:

# If too high, reduce:
[replication_log]
shards = 2  # Start low and increase as needed

Monitor change volume:

# Check Prometheus metrics
curl http://localhost:3010/metrics | grep harmonylite_published

Consider hardware upgrade if consistently high

Problem: Memory Leaks

Symptoms:

Steadily increasing memory usage
Eventually crashes with out-of-memory errors
Degraded performance over time

Solutions:

Monitor memory usage:

ps -o pid,user,%mem,rss,command -p $(pgrep harmonylite)

Set memory limits in systemd:

# In /etc/systemd/system/harmonylite.service
[Service]
# MemoryMax= replaces the deprecated MemoryLimit= on cgroup v2 systems
MemoryMax=512M

Restart periodically if needed:

# In crontab
0 4 * * * systemctl restart harmonylite

Update to latest version as memory leaks are often fixed in updates
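
To tell a real leak from normal fluctuation, it helps to sample memory over time rather than look at a single reading. The sketch below uses a hypothetical helper, `rss_growth_kb`, which reads one resident-set-size sample (in KiB) per line and prints the difference between the last and first samples.

```shell
# Prints last-minus-first for a series of RSS samples (KiB, one per line) on stdin.
# Blank lines are ignored; nothing is printed when there are no samples.
rss_growth_kb() {
  awk 'NF { if (!seen) { first = $1; seen = 1 } last = $1 }
       END { if (seen) print last - first }'
}

# Example: take ten samples, one minute apart, from the oldest harmonylite process.
# for i in 1 2 3 4 5 6 7 8 9 10; do
#   ps -o rss= -p "$(pgrep -o harmonylite)"
#   sleep 60
# done | rss_growth_kb
```

A single positive number is not proof of a leak; look for growth that continues across several sampling runs under a steady workload.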

NATS Issues

Problem: NATS Connection Failures

Symptoms:

"Failed to connect to NATS" errors
Intermittent disconnections
Stream creation failures

Solutions:

Check NATS server status:
```
curl http://localhost:8222/varz
```

Verify NATS URLs:

[nats]
urls = ["nats://server1:4222", "nats://server2:4222"]
# Verify all servers are running

Increase connection retry settings:

[nats]
connect_retries = 10
reconnect_wait_seconds = 5

Check authentication:

[nats]
# Verify credentials match server configuration
user_name = "harmonylite"
user_password = "your-password"

Test NATS connectivity directly:

# Install NATS CLI
curl -sf https://install.nats.io/install.sh | sh

# Test connection
nats pub test.subject "hello" --server nats://server:4222

Problem: JetStream Errors

Symptoms:

"Failed to create stream" errors
"No responders available" errors
Stream memory or storage errors

Solutions:

Check JetStream status:
```
curl http://localhost:8222/jsz
```
Verify JetStream is enabled on NATS server

Check storage limits:

# On NATS server
df -h /path/to/jetstream/storage

Adjust stream settings:

[replication_log]
max_entries = 1024  # Reduce if storage is limited

Recreate streams if corrupted:

# Using NATS CLI
nats stream ls --server nats://server:4222
nats stream rm harmonylite-changes-1 --server nats://server:4222
# Then restart HarmonyLite

Sleep Timeout and Serverless Operation

Problem: HarmonyLite Exits Unexpectedly

Symptoms:

Process terminates after a period of inactivity
Log shows "No more events to process, initiating shutdown"

Solutions:

Check sleep timeout setting:

# Disable automatic shutdown by setting to 0 (default)
sleep_timeout = 0

Adjust timeout duration if you want the serverless behavior:

# Set longer timeout in milliseconds, e.g., 30 minutes
sleep_timeout = 1800000

Ensure your orchestration system can handle the expected restarts if using serverless mode
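
Because sleep_timeout is given in milliseconds, it is easy to be off by a factor of 1000. A tiny conversion helper (hypothetical name `minutes_to_ms`) makes the arithmetic explicit:

```shell
# Converts whole minutes to milliseconds (minutes * 60 seconds * 1000 ms).
minutes_to_ms() {
  echo $(( $1 * 60 * 1000 ))
}

# Example: 30 minutes, matching the value shown above.
# minutes_to_ms 30
```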

Fixing Triggers and Schema Issues

Problem: Missing or Corrupted Triggers

Symptoms:

Changes not being captured
Missing change log tables
Schema change errors

Solutions:

Check if triggers exist:

SELECT name FROM sqlite_master WHERE type='trigger' AND name LIKE '__harmonylite%';

Clean up and reinstall triggers:

harmonylite -config /path/to/config.toml -cleanup

Verify SQLite version compatibility:

sqlite3 --version
# Should be 3.35.0 or newer

Enable trusted schema in applications:
```
PRAGMA trusted_schema = ON;
```

Problem: Schema Changes Break Replication

Symptoms:

Errors after changing table structures
"no such column" errors
Replication stops after ALTER TABLE operations

Solutions:

Proper schema change procedure:
- Stop applications
- Apply changes on one node
- Run cleanup to reset triggers:
```
harmonylite -config /path/to/config.toml -cleanup
```
- Restart HarmonyLite
- Wait for replication
- Repeat on other nodes
Verify table structure is identical on all nodes:
```
.schema table_name
```
Check for foreign key issues:
```
PRAGMA foreign_key_check;
```
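
Comparing table structure across nodes by eye is error-prone, so one option is to reduce each node's schema to a short fingerprint and compare those. This is a sketch with a hypothetical helper name (`schema_fingerprint`); it normalizes whitespace only, so nodes that created the same objects in a different order would still produce different fingerprints.

```shell
# Prints a CRC checksum of the schema text on stdin, ignoring whitespace differences.
schema_fingerprint() {
  tr -s '[:space:]' ' ' | cksum | cut -d' ' -f1
}

# Example: run on every node and compare the printed values.
# sqlite3 /path/to/your.db ".schema table_name" | schema_fingerprint
```

Matching fingerprints suggest the definitions agree; a mismatch tells you which node to inspect with `.schema` in detail.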

Recovery Procedures

Full Node Recovery

If a node is completely corrupted or needs to be rebuilt:

Stop HarmonyLite:
```
systemctl stop harmonylite
```

Clean up existing files:

rm /var/lib/harmonylite/data.db
rm /var/lib/harmonylite/seq-map.cbor

Start HarmonyLite (it will recover automatically):
```
systemctl start harmonylite
```
Monitor logs for recovery progress:
```
journalctl -u harmonylite -f
```

Manual Database Repair

For advanced recovery when automatic procedures fail:

Create a backup first:

cp /var/lib/harmonylite/data.db /var/lib/harmonylite/data.db.bak

Run an integrity check:

sqlite3 /var/lib/harmonylite/data.db "PRAGMA integrity_check;"

Dump and restore if integrity check fails:

# Dump schema
echo .schema | sqlite3 /var/lib/harmonylite/data.db.bak > schema.sql

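# Dump data (excluding HarmonyLite and internal SQLite tables)
# Assumes table names contain no whitespace
for t in $(sqlite3 /var/lib/harmonylite/data.db.bak "SELECT name FROM sqlite_master WHERE type='table' AND name NOT LIKE '__harmonylite%' AND name NOT LIKE 'sqlite_%';"); do
  sqlite3 /var/lib/harmonylite/data.db.bak ".mode insert $t" "SELECT * FROM \"$t\";"
done > data.sql

# Create new database (remove the damaged file first; the .bak copy remains)
rm /var/lib/harmonylite/data.db
sqlite3 /var/lib/harmonylite/data.db < schema.sql
sqlite3 /var/lib/harmonylite/data.db < data.sql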

# Reset sequence map
rm /var/lib/harmonylite/seq-map.cbor

# Restart HarmonyLite
systemctl restart harmonylite

Diagnostic Commands Reference

Issue	Diagnostic Command	What to Look For
Node Status	`systemctl status harmonylite`	Active (running) status
Logs	`journalctl -u harmonylite -n 100`	Recent error messages
Process Resources	`ps -o pid,%cpu,%mem,vsz,rss -p $(pgrep harmonylite)`	CPU/memory usage
Health Check	`curl http://localhost:8090/health`	"status":"healthy"
NATS Status	`curl http://localhost:8222/varz`	Server running, connections
NATS Streams	`curl 'http://localhost:8222/jsz?streams=1'`	Stream existence, message counts
Database Size	`du -sh /var/lib/harmonylite/data.db`	Growth trends
Database Integrity	`echo "PRAGMA integrity_check;" | sqlite3 /var/lib/harmonylite/data.db`	"ok" result
Triggers	`echo "SELECT count(*) FROM sqlite_master WHERE type='trigger' AND name LIKE '__harmonylite%';" | sqlite3 /var/lib/harmonylite/data.db`	Non-zero count
Change Log Tables	`echo "SELECT count(*) FROM sqlite_master WHERE type='table' AND name LIKE '__harmonylite%';" | sqlite3 /var/lib/harmonylite/data.db`	Non-zero count
Pending Changes	`echo "SELECT count(*) FROM __harmonylite___change_log_global;" | sqlite3 /var/lib/harmonylite/data.db`	Should be low or zero
Network Connectivity	`ss -tpln | grep harmonylite`	Listening ports

Getting More Help

If you're still having issues after following this guide:

Check GitHub Issues for similar problems and solutions

Gather diagnostic information:

# Create diagnostic bundle
mkdir -p /tmp/harmonylite-diag
cp /etc/harmonylite/config.toml /tmp/harmonylite-diag/
journalctl -u harmonylite -n 1000 > /tmp/harmonylite-diag/journal-logs.txt
curl http://localhost:3010/metrics > /tmp/harmonylite-diag/metrics.txt
curl http://localhost:8222/varz > /tmp/harmonylite-diag/nats-varz.json
curl http://localhost:8222/jsz > /tmp/harmonylite-diag/nats-jsz.json
harmonylite -version > /tmp/harmonylite-diag/version.txt
tar -czf harmonylite-diag.tar.gz -C /tmp harmonylite-diag

Open a GitHub Issue with the diagnostic bundle and detailed description of your problem
Join Community Discussion for assistance from other users and developers
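
The diagnostic bundle above copies config.toml verbatim, and that file may contain credentials such as `user_password`. Before attaching it to a public issue, consider masking them. The filter below is a sketch: `redact_config` is a hypothetical helper, and the key pattern (any lowercase key containing `password`, `secret`, or `token`) is an assumption, so review the output by hand as well.

```shell
# Replaces the value of any key whose name contains password, secret, or token.
# Reads TOML from stdin and writes the redacted copy to stdout.
redact_config() {
  sed -E 's/^([[:space:]]*[A-Za-z0-9_]*(password|secret|token)[A-Za-z0-9_]*[[:space:]]*=[[:space:]]*).*$/\1"REDACTED"/'
}

# Example: overwrite the copy inside the bundle directory, not the live config.
# redact_config < /etc/harmonylite/config.toml > /tmp/harmonylite-diag/config.toml
```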
