Troubleshooting

This guide helps diagnose and resolve common issues with ICOS-FL.

Connection Issues

DataClay Connection Errors

Symptoms:

- DataClayException: Failed to connect to DataClay
- RuntimeError: Failed to initialize TimeSeriesData

Solutions:

Check if DataClay services are running:
```
docker compose ps
```
Verify Redis is operational:
```
docker compose logs redis
```

Ensure correct connection parameters:

```
from dataclay import Client

# Should match your environment
client = Client(proxy_host="127.0.0.1", proxy_port=8676, dataset="admin")
```

Restart DataClay services:

```
docker compose restart redis metadata-service backend proxy
```
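
Before creating a client, it can help to confirm that the proxy port accepts TCP connections at all. The following is a minimal sketch using only the Python standard library; `is_port_open` is a hypothetical helper, not part of DataClay or ICOS-FL, and the host and port are the example values used above.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolved host names.
        return False


if __name__ == "__main__":
    # 127.0.0.1:8676 are the example values from this guide; adjust as needed.
    if is_port_open("127.0.0.1", 8676):
        print("Proxy port is reachable")
    else:
        print("Proxy port is not reachable; check `docker compose ps`")
```

A successful connection only shows that something is listening on the port. It does not prove that the DataClay services behind the proxy are healthy, so the log checks above still apply.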

Federation Connection Errors

Symptoms:

- UNAVAILABLE: failed to connect to all addresses
- Clients unable to connect to server

Solutions:

Check server address configuration:

```
# In pyproject.toml
[tool.flwr.federations.remote-deployment]
address = "127.0.0.1:9093"  # Change to actual server IP
```

Verify network connectivity:

```
ping <server-ip>
telnet <server-ip> 9093
```

Check firewall settings:

```
# On Linux
sudo ufw status
# On Windows
netsh advfirewall show currentprofile
```

Ensure SuperLink is running:

```
docker compose -f docker/simulation.yml ps superlink
```

Data Collection Issues

No Metrics Being Collected

Symptoms:

- No data available yet: DataFrame is None
- Empty results in consumer.py

Solutions:

Check if Scaphandre is running:
```
docker compose logs scaphandre
```

Verify OpenTelemetry collector configuration:

```
# Check config
cat otel-config.yaml

# Check logs
docker compose logs otel-collector
```

Ensure bridge configuration has been applied:

```
# Run bridge configuration
python bridgeConfig.py

# Check logs
docker compose logs bridge-config
```

Restart the metrics pipeline:

```
docker compose restart scaphandre otel-collector bridge-config bridge
```
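
After a restart, the pipeline may need some time before the first samples arrive, so a single read that returns `None` is not necessarily a failure. The following sketch polls a data source until it returns something other than `None` or a timeout expires. `wait_for_data` is a hypothetical helper, and the timeout and interval values are placeholders rather than recommended settings.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def wait_for_data(fetch: Callable[[], Optional[T]],
                  timeout: float = 60.0,
                  interval: float = 5.0) -> Optional[T]:
    """Call fetch() repeatedly until it returns a non-None value.

    Returns the fetched value, or None if nothing arrived before the timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = fetch()
        if result is not None:
            return result
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval)
```

With a `TimeSeriesData` object `tsd` obtained as shown under Checking DataClay Objects, this could be called as `df = wait_for_data(tsd.get_dataframe)`. Note that the sketch only treats `None` as "no data"; an empty DataFrame would be returned as-is.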

Missing Specific Metrics

Symptoms:

- Some columns missing from TimeSeriesData
- KeyError: 'expected_metric_name'

Solutions:

Check bridge configuration:

```
# In bridgeConfig.py, ensure metrics are added
rc_scaphandre.add_metric("scaph_host_power_microwatts")
```

Verify metrics are available in Scaphandre:

```
curl http://localhost:8080/metrics | grep power_microwatts
```

Update bridge configuration and restart:

```
python bridgeConfig.py
docker compose restart bridge
```
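
When several metrics are expected, comparing them against the exporter output by eye is error-prone. The sketch below parses text in the Prometheus exposition format, such as the output of the `curl` command above, and reports which expected metric names are absent. Both function names are hypothetical, and the parsing is deliberately simple: it reads only the metric name at the start of each sample line.

```python
from typing import Iterable


def exposed_metric_names(metrics_text: str) -> set[str]:
    """Collect metric names from Prometheus text-format output."""
    names = set()
    for line in metrics_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # A sample line is either `name value` or `name{labels} value`.
        name = line.split("{", 1)[0].split(None, 1)[0]
        names.add(name)
    return names


def missing_metrics(expected: Iterable[str], metrics_text: str) -> list[str]:
    """Return the expected metric names that do not appear in the output."""
    return sorted(set(expected) - exposed_metric_names(metrics_text))


if __name__ == "__main__":
    sample = "# TYPE scaph_host_power_microwatts gauge\nscaph_host_power_microwatts 1\n"
    print(missing_metrics(["scaph_host_power_microwatts"], sample))
```

If a metric is missing here, the problem is upstream of the bridge. If it is present here but missing from TimeSeriesData, the bridge configuration is the more likely cause.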

Training Issues

Out of Memory Errors

Symptoms:

- RuntimeError: CUDA out of memory
- MemoryError during training

Solutions:

Reduce batch size:

```
# In pyproject.toml
batch-size = 32  # Decrease from default
```

Use smaller model:

```
hidden-layer-size = 10  # Decrease size
num-layers = 1  # Reduce layers
```

Enable CPU fallback:

```
import torch
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
```

Limit DataClay memory usage:

```
# In docker-compose.yml
environment:
  - DATACLAY_MEMORY_CHECK_INTERVAL=300  # More frequent cleanup
```
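
Finding a batch size that fits in memory can also be automated while experimenting. The sketch below retries a training callable with a halved batch size each time it raises an out-of-memory error. It is an illustration, not how ICOS-FL handles memory: `run_with_batch_backoff` and `train_once` are hypothetical names, and the error is detected by matching the "out of memory" text shown in the symptoms above.

```python
from typing import Callable


def run_with_batch_backoff(train_once: Callable[[int], None],
                           batch_size: int,
                           min_batch_size: int = 1) -> int:
    """Run train_once(batch_size), halving the batch size on out-of-memory errors.

    Returns the batch size that completed. Errors that are not memory related,
    or that occur at the minimum batch size, are re-raised unchanged.
    """
    while True:
        try:
            train_once(batch_size)
            return batch_size
        except (RuntimeError, MemoryError) as exc:
            is_oom = isinstance(exc, MemoryError) or "out of memory" in str(exc).lower()
            if not is_oom or batch_size <= min_batch_size:
                raise
            batch_size = max(min_batch_size, batch_size // 2)
```

In a PyTorch training loop it is usually worth calling `torch.cuda.empty_cache()` before retrying, so that cached GPU memory from the failed attempt is released. Once a working value is found, set it as `batch-size` in `pyproject.toml`.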

Model Convergence Issues

Symptoms:

- Steadily increasing or fluctuating loss values
- Poor prediction accuracy

Solutions:

Adjust learning rate:

```
learning-rate = 0.0005  # Try lower value
```

Increase local epochs:

```
local-epochs = 200  # More training per round
```

Normalize data properly:

```
# Check normalization in Processor class
# or use custom preprocessing
```

Try different model architecture:
```
# See custom_models.rst for examples
```
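
For the normalization step, one common form of custom preprocessing is z-score standardization. Power readings in microwatts can span several orders of magnitude, and unscaled inputs of that size often destabilize training. The sketch below uses only the standard library and is not the method implemented by the Processor class; `fit_zscore` and `apply_zscore` are hypothetical helpers operating on a single column of values.

```python
import statistics
from typing import Sequence


def fit_zscore(values: Sequence[float]) -> tuple[float, float]:
    """Return (mean, std) of the values, substituting 1.0 for a zero std."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    # A constant column has std 0; use 1.0 to avoid division by zero.
    return mean, (std if std > 0 else 1.0)


def apply_zscore(values: Sequence[float], mean: float, std: float) -> list[float]:
    """Scale values to zero mean and unit variance using fitted statistics."""
    return [(v - mean) / std for v in values]


if __name__ == "__main__":
    train = [1_000_000.0, 2_000_000.0, 3_000_000.0]
    mean, std = fit_zscore(train)
    print(apply_zscore(train, mean, std))
```

Fit the statistics on training data only and reuse them for validation and inference data; otherwise the model sees differently scaled inputs at prediction time.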

Docker Issues

Container Startup Failures

Symptoms:

- Containers exit with non-zero status
- Services show as “unhealthy” or keep restarting

Solutions:

Check container logs:
```
docker compose logs <service-name>
```

Verify host resource availability:

```
# Check disk space
df -h

# Check memory
free -m
```

Ensure proper dependency order:

```
# In docker-compose.yml
depends_on:
  - redis
  - metadata-service
```

Check for port conflicts:

```
# List used ports
netstat -tulpn | grep LISTEN
```
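
The disk and port checks above can also be run from Python, which is convenient on hosts where `netstat` is not installed. This is a minimal sketch under stated assumptions: `free_disk_gib` and `port_is_free` are hypothetical helpers, and the ports listed are the ones that appear in this guide (8676 for the DataClay proxy, 9093 for the federation address, 8080 for the metrics endpoint).

```python
import shutil
import socket


def free_disk_gib(path: str = "/") -> float:
    """Return free disk space in GiB for the filesystem containing path."""
    return shutil.disk_usage(path).free / 2**30


def port_is_free(port: int, host: str = "0.0.0.0") -> bool:
    """Return True if a TCP socket can be bound to host:port right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
            return True
        except OSError:
            # Typically "address already in use"; also raised for
            # privileged ports when not running as root.
            return False


if __name__ == "__main__":
    print(f"Free disk space: {free_disk_gib():.1f} GiB")
    for port in (8676, 9093, 8080):
        print(port, "free" if port_is_free(port) else "in use or not bindable")
```

Run the port check while the stack is stopped. Once the containers are up, their published ports are expected to show as in use.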

Debugging Techniques

Checking DataClay Objects

```
from dataclay import Client
client = Client(proxy_host="127.0.0.1", dataset="admin")
client.start()

# List all aliases
from dataclay.client.api import get_all_aliases
print(get_all_aliases())

# Get TimeSeriesData
from icos_fl.utils.fetcher import TimeSeriesData
tsd = TimeSeriesData.get_by_alias("timeseries")
df = tsd.get_dataframe()
print(df.shape if df is not None else "No data")
```

Enabling Verbose Logging

```
# Set environment variables for more detailed logs
export LOG_LEVEL=DEBUG

# Run with debug output
python -m icos_fl.client.client --log-level=DEBUG

# For Flower components
flwr run . remote-deployment --log_level=DEBUG
```
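
Custom scripts, such as a standalone consumer or a debugging session, can follow the same `LOG_LEVEL` variable. The sketch below configures Python's standard `logging` module from that variable. How ICOS-FL itself reads `LOG_LEVEL` is not shown in this guide, so treat this as an assumption-based example for your own code; `configure_logging` is a hypothetical helper.

```python
import logging
import os


def configure_logging(default: str = "INFO") -> int:
    """Configure root logging from the LOG_LEVEL environment variable.

    Unknown level names fall back to INFO. Returns the numeric level applied.
    """
    level_name = os.environ.get("LOG_LEVEL", default).upper()
    level = getattr(logging, level_name, None)
    if not isinstance(level, int):
        level = logging.INFO
    logging.basicConfig(
        level=level,
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
        force=True,  # replace handlers installed by earlier configuration
    )
    return level


if __name__ == "__main__":
    configure_logging()
    logging.getLogger("troubleshooting").debug("Debug logging is enabled")
```

With `export LOG_LEVEL=DEBUG` set as shown above, the debug message is printed; without it, the script logs at INFO and the message is suppressed.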