Troubleshooting
This guide helps diagnose and resolve common issues with ICOS-FL.
Connection Issues
DataClay Connection Errors
Symptoms:
- DataClayException: Failed to connect to DataClay
- RuntimeError: Failed to initialize TimeSeriesData
Solutions:
Check if DataClay services are running:
docker compose ps
Verify Redis is operational:
docker compose logs redis
Ensure correct connection parameters:
# Should match your environment
client = Client(proxy_host="127.0.0.1", proxy_port=8676, dataset="admin")
Restart DataClay services:
docker compose restart redis metadata-service backend proxy
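Connection failures right after a restart are often transient, because the services can take a few seconds to come back up. A minimal sketch of a retry loop with exponential backoff follows. The `retry` helper and its parameters are hypothetical and not part of DataClay or ICOS-FL; it accepts any zero-argument callable.

```python
import time


def retry(connect, attempts=5, delay=1.0, backoff=2.0, sleep=time.sleep):
    """Call connect() until it succeeds or the attempts run out.

    connect is any zero-argument callable (hypothetical here). The last
    exception is re-raised if every attempt fails.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return connect()
        except Exception as exc:  # narrow this to the client's own exception type
            if attempt == attempts:
                raise
            print(f"attempt {attempt} failed ({exc}); retrying in {delay:.1f}s")
            sleep(delay)
            delay *= backoff  # wait longer before each new attempt
```

With a client object created as shown above, this could be used as retry(client.start), assuming that call raises an exception when the connection fails.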
Federation Connection Errors
Symptoms:
- UNAVAILABLE: failed to connect to all addresses
- Clients unable to connect to server
Solutions:
Check server address configuration:
# In pyproject.toml
[tool.flwr.federations.remote-deployment]
address = "127.0.0.1:9093"  # Change to actual server IP
Verify network connectivity:
ping <server-ip>
telnet <server-ip> 9093
Check firewall settings:
# On Linux
sudo ufw status
# On Windows
netsh advfirewall show currentprofile
Ensure SuperLink is running:
docker compose -f docker/simulation.yml ps superlink
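Where ping or telnet is not installed, the same reachability check can be approximated from Python. The sketch below uses only the standard library; `can_connect` is a hypothetical helper, not part of ICOS-FL or Flower. A successful TCP connection shows only that something is listening on the port, not that it is a healthy SuperLink.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

For example, can_connect("127.0.0.1", 9093) probes the address used in the pyproject.toml snippet above. A False result points at the network, the firewall, or a stopped server rather than at the client code.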
Data Collection Issues
No Metrics Being Collected
Symptoms:
- No data available yet: DataFrame is None
- Empty results in consumer.py
Solutions:
Check if Scaphandre is running:
docker compose logs scaphandre
Verify OpenTelemetry collector configuration:
# Check config
cat otel-config.yaml
# Check logs
docker compose logs otel-collector
Ensure bridge configuration has been applied:
# Run bridge configuration
python bridgeConfig.py
# Check logs
docker compose logs bridge-config
Restart the metrics pipeline:
docker compose restart scaphandre otel-collector bridge-config bridge
Missing Specific Metrics
Symptoms:
- Some columns missing from TimeSeriesData
- KeyError: 'expected_metric_name'
Solutions:
Check bridge configuration:
# In bridgeConfig.py, ensure metrics are added
rc_scaphandre.add_metric("scaph_host_power_microwatts")
Verify metrics are available in Scaphandre:
curl http://localhost:8080/metrics | grep power_microwatts
Update bridge configuration and restart:
python bridgeConfig.py
docker compose restart bridge
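When several metrics are expected, checking them one grep at a time gets tedious. The sketch below parses the Prometheus text format returned by the metrics endpoint and lists the expected names that are absent. `metric_names` and `missing_metrics` are hypothetical helpers, and the parser handles only plain sample lines and comment lines, which is an assumption about the exporter's output.

```python
def metric_names(exposition_text):
    """Collect metric names from Prometheus text-format output."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # A sample line looks like "name{labels} value" or "name value".
        name = line.split("{", 1)[0].split(None, 1)[0]
        names.add(name)
    return names


def missing_metrics(exposition_text, expected):
    """Return the expected metric names that the exporter did not report."""
    available = metric_names(exposition_text)
    return sorted(set(expected) - available)
```

The text can be fetched with urllib.request.urlopen("http://localhost:8080/metrics").read().decode() and passed in together with the names registered in bridgeConfig.py. Any name the function returns cannot reach TimeSeriesData, whatever the bridge configuration says.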
Training Issues
Out of Memory Errors
Symptoms:
- RuntimeError: CUDA out of memory
- MemoryError during training
Solutions:
Reduce batch size:
# In pyproject.toml
batch-size = 32  # Decrease from default
Use a smaller model:
hidden-layer-size = 10  # Decrease size
num-layers = 1  # Reduce layers
Enable CPU fallback:
import torch
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
Limit DataClay memory usage:
# In docker-compose.yml
environment:
  - DATACLAY_MEMORY_CHECK_INTERVAL=300  # More frequent cleanup
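Instead of finding a workable batch size by hand, the reduction can be automated. The sketch below retries a training step with half the batch size each time it runs out of memory. `train_with_fallback` and the `train_step` callable are hypothetical; ICOS-FL is not known to expose such a hook. Out-of-memory failures are recognised by exception type or by the "out of memory" text shown in the symptoms above.

```python
def is_out_of_memory(exc):
    """True for MemoryError or a RuntimeError that mentions running out of memory."""
    if isinstance(exc, MemoryError):
        return True
    return isinstance(exc, RuntimeError) and "out of memory" in str(exc).lower()


def train_with_fallback(train_step, batch_size, min_batch_size=1):
    """Run train_step(batch_size), halving the batch size after each OOM failure.

    Returns a (batch_size_used, result) tuple. Errors unrelated to memory,
    and OOM errors at the minimum batch size, are re-raised unchanged.
    """
    while True:
        try:
            return batch_size, train_step(batch_size)
        except (MemoryError, RuntimeError) as exc:
            if not is_out_of_memory(exc) or batch_size <= min_batch_size:
                raise
            batch_size = max(min_batch_size, batch_size // 2)
            print(f"out of memory; retrying with batch size {batch_size}")
```

On a GPU it usually helps to call torch.cuda.empty_cache() before each retry. Once a working value is found, write it back to batch-size in pyproject.toml so that every round starts from it.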
Model Convergence Issues
Symptoms:
- Steadily increasing or fluctuating loss values
- Poor prediction accuracy
Solutions:
Adjust learning rate:
learning-rate = 0.0005 # Try lower value
Increase local epochs:
local-epochs = 200 # More training per round
Normalize data properly:
# Check normalization in Processor class
# or use custom preprocessing
Try a different model architecture:
# See custom_models.rst for examples
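Power metrics in microwatts can span several orders of magnitude, so unscaled inputs are a common cause of unstable loss. As an illustration of custom preprocessing, the sketch below applies z-score normalization using statistics taken from the training split only. The function names are hypothetical, and this is not a description of what the Processor class does.

```python
from statistics import fmean, pstdev


def fit_scaler(train_values):
    """Compute the mean and standard deviation from the training data only."""
    mean = fmean(train_values)
    std = pstdev(train_values)
    if std == 0:
        std = 1.0  # constant series: avoid dividing by zero
    return mean, std


def transform(values, mean, std):
    """Scale values to zero mean and unit variance using fitted statistics."""
    return [(v - mean) / std for v in values]
```

Fit once on the training data and reuse the same mean and std for validation and inference. Refitting on each split leaks information and makes loss values incomparable between rounds.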
Docker Issues
Container Startup Failures
Symptoms:
- Containers exit with non-zero status
- Services show as “unhealthy” or keep restarting
Solutions:
Check container logs:
docker compose logs <service-name>
Verify host resource availability:
# Check disk space
df -h
# Check memory
free -m
Ensure proper dependency order:
# In docker-compose.yml
depends_on:
  - redis
  - metadata-service
Check for port conflicts:
# List used ports
netstat -tulpn | grep LISTEN
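The disk-space and port checks above can also be scripted as a preflight step before starting the stack. The sketch below uses only the standard library; `port_is_free` and `free_disk_gib` are hypothetical helpers. The bind test is an approximation: a port that is free now can still be taken before the container starts.

```python
import shutil
import socket


def port_is_free(port, host="0.0.0.0"):
    """Return True if a TCP socket can be bound to host:port right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
            return True
        except OSError:
            # Typically "address already in use" or a permission error.
            return False


def free_disk_gib(path="/"):
    """Return the free disk space at path in GiB."""
    return shutil.disk_usage(path).free / 2**30
```

Ports worth checking in this guide's setup are 8676 (DataClay proxy), 9093 (federation address), and 8080 (Scaphandre metrics). Run the check while the stack is stopped, since running containers hold these ports themselves.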
Debugging Techniques
Checking DataClay Objects
from dataclay import Client
client = Client(proxy_host="127.0.0.1", dataset="admin")
client.start()
# List all aliases
from dataclay.client.api import get_all_aliases
print(get_all_aliases())
# Get TimeSeriesData
from icos_fl.utils.fetcher import TimeSeriesData
tsd = TimeSeriesData.get_by_alias("timeseries")
df = tsd.get_dataframe()
print(df.shape if df is not None else "No data")
Enabling Verbose Logging
# Set environment variables for more detailed logs
export LOG_LEVEL=DEBUG
# Run with debug output
python -m icos_fl.client.client --log-level=DEBUG
# For Flower components
flwr run . remote-deployment --log_level=DEBUG
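Custom scripts that run alongside ICOS-FL can honour the same LOG_LEVEL variable. The sketch below configures Python's standard logging from it. `configure_logging` is a hypothetical helper, and it is an assumption that ICOS-FL itself reads the variable in a comparable way.

```python
import logging
import os


def configure_logging(default="INFO"):
    """Configure root logging from the LOG_LEVEL environment variable."""
    name = os.environ.get("LOG_LEVEL", default).upper()
    level = getattr(logging, name, None)
    if not isinstance(level, int):
        level = logging.INFO  # fall back on unrecognised values
    logging.basicConfig(
        level=level,
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
        force=True,  # replace handlers installed by earlier imports (Python 3.8+)
    )
    return level
```

Call configure_logging() once at the start of the script, before other modules emit log records, so that the chosen level applies to all of them.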