Configuration¶
This page documents the configuration options available in ICOS-FL.
Configuration Files¶
ICOS-FL uses several configuration files:
pyproject.toml: Main project configuration
otel-config.yaml: OpenTelemetry collector configuration
docker-compose.yml: Container configuration
bridgeConfig.py: Bridge configuration script
pyproject.toml Configuration¶
The pyproject.toml file is the primary configuration file for ICOS-FL. It is divided into multiple sections:
Build System¶
[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"
Project Metadata¶
[project]
name = "icos-fl"
version = "0.1.0"
description = "ICOS-FL: Flower-powered FL framework for real-time resource monitoring (LSTM) & predictions."
license = "MIT"
readme = "README.md"
requires-python = ">=3.10"
authors = [
    { name = "Anastasios Kaltakis", email = "anastasioskaltakis@gmail.com" },
]
dependencies = [
    "flwr[simulation]>=1.17.0",
    "torch==2.5.1",
    "wandb==0.19.8",
    "pandas>=2.2.3",
    "scikit-learn>=1.6.1",
    "dataclay==4.0.0",
]
Flower Application Configuration¶
[tool.flwr.app]
publisher = "Anastasios Kaltakis"
[tool.flwr.app.components]
serverapp = "icos_fl.server.server:app"
clientapp = "icos_fl.client.client:app"
[tool.flwr.app.config]
# Server configuration
num-server-rounds = 10
fraction-fit = 1.0
fraction-evaluate = 1.0
min-fit-clients = 2
min-evaluate-clients = 2
min-available-clients = 2
server-device = "cpu"
# LSTM model configuration
hidden-layer-size = 10
time-step = 10
num-layers = 1
# Training configuration
# Resource metric to monitor and predict
metric = "cpu_usage"
batch-size = 64
train-test-split = 0.8
local-epochs = 100
learning-rate = 0.001
# Logging
use-wandb = false
Federation Configuration¶
[tool.flwr.federations]
default = "local-deployment"
[tool.flwr.federations.local-deployment]
address = "127.0.0.1:9093"
insecure = true
[tool.flwr.federations.remote-deployment]
address = "127.0.0.1:9093"
insecure = true
OpenTelemetry Configuration¶
The otel-config.yaml file configures the OpenTelemetry collector:
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: 'scaphandre'
          scrape_interval: 3s # Scrape metrics every 3 seconds
          static_configs:
            - targets: ['127.0.0.1:8080']
processors:
  batch:
    timeout: 180s # Batch metrics for 3 minutes
exporters:
  otlp:
    endpoint: 127.0.0.1:4317
    tls:
      insecure: true
service:
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: [batch]
      exporters: [otlp]
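In a collector configuration, a receiver, processor, or exporter only takes effect when a pipeline under `service.pipelines` references it, and every referenced name must be declared in the matching top-level section. The sketch below checks that wiring. To stay within the standard library it uses a dictionary that mirrors the YAML above rather than parsing the file, and `undefined_components` is a hypothetical helper.

```python
# The collector config above, mirrored as a dict (stdlib only, no YAML parser).
OTEL_CONFIG = {
    "receivers": {"prometheus": {}},
    "processors": {"batch": {"timeout": "180s"}},
    "exporters": {"otlp": {"endpoint": "127.0.0.1:4317"}},
    "service": {
        "pipelines": {
            "metrics": {
                "receivers": ["prometheus"],
                "processors": ["batch"],
                "exporters": ["otlp"],
            }
        }
    },
}


def undefined_components(config: dict) -> list[str]:
    """List pipeline entries that reference a component missing from its top-level section."""
    missing = []
    for pipeline_name, pipeline in config["service"]["pipelines"].items():
        for kind in ("receivers", "processors", "exporters"):
            for component in pipeline.get(kind, []):
                if component not in config.get(kind, {}):
                    missing.append(f"{pipeline_name}: {kind}/{component}")
    return missing


# An empty list means every pipeline reference is declared.
print(undefined_components(OTEL_CONFIG))
```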
Docker Compose Configuration¶
The docker-compose.yml file configures the Docker containers:
services:
  redis:
    image: redis:latest
    restart: unless-stopped
  scaphandre:
    image: docker.io/hubblo/scaphandre
    command: prometheus -p 8080 -a 0.0.0.0
    privileged: true
    # ...
  proxy:
    build: .
    ports:
      - 8676:8676
    depends_on:
      - metadata-service
      - backend
    environment:
      - DATACLAY_PROXY_MDS_HOST=metadata-service
      - DATACLAY_KV_HOST=redis
    command: python -m dataclay.proxy
  # ...
Bridge Configuration¶
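The bridge configuration is set through the bridgeConfig.py script. The excerpt below registers the Scaphandre metrics to track; ResourceConfiguration, scaphandre_rules, and bc come from parts of the script that are not shown here: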
# Create a ResourceConfiguration for Scaphandre
rc_scaphandre = ResourceConfiguration("scaphandre-metrics", scaphandre_rules)
# Add the specific metrics you want to track
rc_scaphandre.add_metric("scaph_host_power_microwatts")
rc_scaphandre.add_metric("scaph_host_load_avg_one")
rc_scaphandre.add_metric("scaph_host_memory_total_bytes")
rc_scaphandre.add_metric("scaph_host_memory_available_bytes")
# Add this configuration to the bridge
bc.set_res_config(rc_scaphandre)
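The metrics registered above are raw Scaphandre gauges (microwatts and bytes). How ICOS-FL turns them into the `cpu_usage`, `memory_usage`, and `power_consumption` targets is not described on this page, so the following is only an assumed derivation for illustration: memory usage as the used share of total memory, and power converted from microwatts to watts. Both function names and the sample values are hypothetical.

```python
def memory_usage_percent(total_bytes: float, available_bytes: float) -> float:
    """Used memory as a percentage of total (assumed derivation, not taken from ICOS-FL)."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * (total_bytes - available_bytes) / total_bytes


def microwatts_to_watts(microwatts: float) -> float:
    """Convert a scaph_host_power_microwatts sample to watts."""
    return microwatts / 1_000_000


# Hypothetical sample values, for illustration only.
print(memory_usage_percent(16_000_000_000, 4_000_000_000))
print(microwatts_to_watts(42_500_000))
```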
Configuration Parameters¶
Server Configuration¶
| Parameter | Default | Description |
|---|---|---|
| num-server-rounds | 10 | Number of federated learning rounds |
| fraction-fit | 1.0 | Fraction of clients to select for training |
| fraction-evaluate | 1.0 | Fraction of clients to select for evaluation |
| min-fit-clients | 2 | Minimum number of clients for training |
| min-evaluate-clients | 2 | Minimum number of clients for evaluation |
| min-available-clients | 2 | Minimum clients before starting a round |
| server-device | "cpu" | Device to use for server-side operations |
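The fraction and minimum settings interact: the fraction scales with the number of connected clients, while the minimum acts as a floor. The sketch below reproduces the sampling rule used by Flower's built-in FedAvg strategy. Whether ICOS-FL's server uses that strategy is an assumption here, so treat the numbers as illustrative.

```python
def num_fit_clients(num_available: int, fraction_fit: float, min_fit_clients: int) -> int:
    """Number of clients sampled for training in one round (FedAvg-style rule)."""
    return max(int(num_available * fraction_fit), min_fit_clients)


# Defaults from the table: fraction-fit = 1.0, min-fit-clients = 2.
for available in (2, 5, 10):
    print(available, num_fit_clients(available, fraction_fit=1.0, min_fit_clients=2))

# A smaller fraction is rounded down, but never drops below the minimum.
print(num_fit_clients(10, fraction_fit=0.5, min_fit_clients=2))
print(num_fit_clients(3, fraction_fit=0.5, min_fit_clients=2))
```

With the defaults, every connected client trains in every round. The same rule applies to evaluation with `fraction-evaluate` and `min-evaluate-clients`.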
LSTM Model Configuration¶
| Parameter | Default | Description |
|---|---|---|
| hidden-layer-size | 10 | Size of the LSTM hidden layer |
| time-step | 10 | Number of time steps in input sequence |
| num-layers | 1 | Number of LSTM layers |
| learning-rate | 0.001 | Learning rate for model optimization |
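The `time-step` value determines how a metric series is cut into training samples. A minimal sketch, assuming one-step-ahead prediction on a single metric (the page does not state the prediction horizon), is shown below; `make_windows` is a hypothetical helper.

```python
def make_windows(series: list[float], time_step: int) -> tuple[list[list[float]], list[float]]:
    """Split a series into (inputs, targets).

    Each input holds `time_step` consecutive values and its target is the
    value that immediately follows.
    """
    inputs, targets = [], []
    for start in range(len(series) - time_step):
        inputs.append(series[start:start + time_step])
        targets.append(series[start + time_step])
    return inputs, targets


# Hypothetical readings: 12 points with time-step = 10 give 2 windows.
readings = [float(v) for v in range(12)]
X, y = make_windows(readings, time_step=10)
print(len(X), X[0], y[0])
```

A series of length n yields n minus `time-step` samples under this scheme, so a larger `time-step` gives the model more context per sample but fewer samples overall.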
Training Configuration¶
| Parameter | Default | Description |
|---|---|---|
| batch-size | 64 | Batch size for training |
| train-test-split | 0.8 | Fraction of the data used for training; the remainder is held out for evaluation |
| local-epochs | 100 | Number of local training epochs per round |
| metric | "cpu_usage" | Metric to predict (cpu_usage, memory_usage, power_consumption) |
| use-wandb | false | Whether to use Weights & Biases for logging |
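To get a feel for how these settings translate into training work on a client, the sketch below computes the split sizes and the number of batches per epoch. It assumes a simple cut at the given ratio and that the final partial batch is kept; neither detail is confirmed on this page, and the sample count is hypothetical.

```python
import math


def split_and_batches(num_samples: int, train_ratio: float, batch_size: int) -> tuple[int, int, int]:
    """Return (train_size, test_size, batches_per_epoch) for a single cut at train_ratio."""
    train_size = int(num_samples * train_ratio)
    test_size = num_samples - train_size
    # The last batch may be smaller than batch_size; it still counts as a step.
    batches_per_epoch = math.ceil(train_size / batch_size)
    return train_size, test_size, batches_per_epoch


# Defaults from the table, applied to a hypothetical 1000 samples.
train, test, batches = split_and_batches(1000, train_ratio=0.8, batch_size=64)
print(train, test, batches)

# Optimizer steps per client per round = batches per epoch * local-epochs.
print(batches * 100)
```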
Time Series Data Configuration¶
| Parameter | Default | Description |
|---|---|---|
| max_rows | 300 | Maximum rows in the sliding window |
| scrape_interval | 3s | Interval between metrics scrapes |
| batch_timeout | 180s | Interval for batching metrics |
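These three values together determine how much history is available to the model. The arithmetic can be sketched as follows, assuming one row is stored per scrape and that the batch is flushed by its timeout rather than by reaching a size limit. `parse_seconds` is a hypothetical helper that handles only the seconds suffix used in this table.

```python
def parse_seconds(duration: str) -> int:
    """Parse durations written like '3s' or '180s' (only the seconds suffix is handled)."""
    if not duration.endswith("s"):
        raise ValueError(f"unsupported duration: {duration!r}")
    return int(duration[:-1])


scrape_interval = parse_seconds("3s")
batch_timeout = parse_seconds("180s")
max_rows = 300

scrapes_per_batch = batch_timeout // scrape_interval
window_seconds = max_rows * scrape_interval  # assumes one row per scrape

print(scrapes_per_batch)    # scrapes collected before a batch is flushed
print(window_seconds / 60)  # minutes of history held in the sliding window
```

Under these assumptions each batch carries 60 scrapes, and a full window of 300 rows covers 15 minutes of measurements.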
Environment Variables¶
ICOS-FL respects these environment variables:
| Variable | Description |
|---|---|
| DATACLAY_PROXY_HOST | Host address for the DataClay proxy |
| DATACLAY_PROXY_PORT | Port for the DataClay proxy |
| BRIDGE_CONFIGURATION_ALIAS | Alias for the bridge configuration |
| TIMESERIES_ALIAS | Alias for the TimeSeriesData object |
| LOG_LEVEL | Logging level (DEBUG, INFO, WARNING, ERROR) |
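A minimal sketch of reading these variables from Python follows. The page does not document default values, so every fallback below is a placeholder chosen for the example (the port fallback simply reuses the 8676 mapping from the Docker Compose excerpt), and `load_settings` is a hypothetical helper rather than ICOS-FL code.

```python
import logging
import os
from collections.abc import Mapping


def load_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Collect ICOS-FL settings from environment variables.

    The fallback values are placeholders for this sketch, not documented defaults.
    """
    level_name = env.get("LOG_LEVEL", "INFO").upper()
    if level_name not in {"DEBUG", "INFO", "WARNING", "ERROR"}:
        raise ValueError(f"unsupported LOG_LEVEL: {level_name}")
    return {
        "proxy_host": env.get("DATACLAY_PROXY_HOST", "127.0.0.1"),
        "proxy_port": int(env.get("DATACLAY_PROXY_PORT", "8676")),
        "bridge_alias": env.get("BRIDGE_CONFIGURATION_ALIAS", "bridge-config"),
        "timeseries_alias": env.get("TIMESERIES_ALIAS", "timeseries"),
        "log_level": getattr(logging, level_name),  # e.g. logging.DEBUG
    }


# Passing a dict instead of os.environ keeps the example reproducible.
settings = load_settings({"DATACLAY_PROXY_PORT": "9000", "LOG_LEVEL": "debug"})
print(settings["proxy_port"], settings["log_level"])
```

Converting the port to `int` and validating the log level up front makes a misconfigured environment fail at startup instead of midway through a run.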
Example Configuration¶
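Here’s an example [tool.flwr.app.config] section for predicting memory usage with a larger model. It raises the number of rounds, the minimum client counts, and the LSTM capacity relative to the defaults; keys not shown are left as in the section above: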
[tool.flwr.app.config]
# Server configuration
num-server-rounds = 20
min-fit-clients = 3
min-evaluate-clients = 3
min-available-clients = 3
# LSTM model configuration
hidden-layer-size = 20
time-step = 15
num-layers = 2
# Training configuration
metric = "memory_usage"
batch-size = 32
local-epochs = 150
learning-rate = 0.0005
# Logging
use-wandb = true
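To quantify what "larger model" means here, the sketch below counts the trainable parameters of a stacked LSTM using the same layout as `torch.nn.LSTM` (four gates and two bias vectors per layer, unidirectional, no projection). It assumes a single input feature per time step and leaves out any output layer, since the actual ICOS-FL model definition is not shown on this page.

```python
def lstm_parameter_count(input_size: int, hidden_size: int, num_layers: int) -> int:
    """Trainable parameters of a stacked LSTM laid out like torch.nn.LSTM."""
    total = 0
    layer_input = input_size
    for _ in range(num_layers):
        # Input weights, recurrent weights, and two bias vectors, for each of the 4 gates.
        total += 4 * (layer_input * hidden_size + hidden_size * hidden_size + 2 * hidden_size)
        layer_input = hidden_size  # deeper layers consume the previous layer's hidden state
    return total


# input_size = 1 assumes a single metric per time step.
print(lstm_parameter_count(1, hidden_size=10, num_layers=1))  # defaults
print(lstm_parameter_count(1, hidden_size=20, num_layers=2))  # example above
```

Under these assumptions the recurrent part of the example model has ten times as many weights as the default (5200 versus 520), which is worth weighing against the amount of data each client holds.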