Configuration

This page documents the configuration options available in ICOS-FL.

Configuration Files

ICOS-FL uses several configuration files:

pyproject.toml: Main project configuration
otel-config.yaml: OpenTelemetry collector configuration
docker-compose.yml: Container configuration
bridgeConfig.py: Bridge configuration script

pyproject.toml Configuration

The pyproject.toml file is the primary configuration file for ICOS-FL. It is divided into multiple sections:

Build System

[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

Project Metadata

[project]
name = "icos-fl"
version = "0.1.0"
description = "ICOS-FL: Flower-powered FL framework for real-time resource monitoring (LSTM) & predictions."
license = "MIT"
readme = "README.md"
requires-python = ">=3.10"
authors = [
  { name = "Anastasios Kaltakis", email = "anastasioskaltakis@gmail.com" },
]
dependencies = [
  "flwr[simulation]>=1.17.0",
  "torch==2.5.1",
  "wandb==0.19.8",
  "pandas>=2.2.3",
  "scikit-learn>=1.6.1",
  "dataclay==4.0.0",
]

Flower Application Configuration

[tool.flwr.app]
publisher = "Anastasios Kaltakis"

[tool.flwr.app.components]
serverapp = "icos_fl.server.server:app"
clientapp = "icos_fl.client.client:app"

[tool.flwr.app.config]
# Server configuration
num-server-rounds = 10
fraction-fit = 1.0
fraction-evaluate = 1.0
min-fit-clients = 2
min-evaluate-clients = 2
min-available-clients = 2
server-device = "cpu"

# LSTM model configuration
hidden-layer-size = 10
time-step = 10
num-layers = 1

# Resource metric to monitor and predict
metric = "cpu_usage"
batch-size = 64
train-test-split = 0.8
local-epochs = 100
learning-rate = 0.001

use-wandb = false
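The values in [tool.flwr.app.config] can be checked before a run is started. The following is a minimal sketch, not part of ICOS-FL: load_app_config and validate_app_config are hypothetical helpers, and the checks are assumptions about what a sensible configuration looks like. It uses the standard-library tomllib module (Python 3.11+); on Python 3.10 the tomli package offers the same interface.

```python
try:
    import tomllib  # standard library from Python 3.11
except ModuleNotFoundError:  # Python 3.10: fall back to the tomli backport
    import tomli as tomllib

# A fragment of the section shown above, inlined so the sketch is self-contained.
PYPROJECT_FRAGMENT = """
[tool.flwr.app.config]
num-server-rounds = 10
fraction-fit = 1.0
fraction-evaluate = 1.0
min-fit-clients = 2
min-evaluate-clients = 2
min-available-clients = 2
train-test-split = 0.8
"""


def load_app_config(toml_text: str) -> dict:
    """Return the [tool.flwr.app.config] table as a plain dict."""
    return tomllib.loads(toml_text)["tool"]["flwr"]["app"]["config"]


def validate_app_config(cfg: dict) -> list:
    """Hypothetical sanity checks; returns a list of problems (empty if none)."""
    problems = []
    for key in ("fraction-fit", "fraction-evaluate"):
        if not 0.0 <= cfg[key] <= 1.0:
            problems.append(f"{key} must be in [0, 1], got {cfg[key]}")
    if not 0.0 < cfg["train-test-split"] < 1.0:
        problems.append("train-test-split must be strictly between 0 and 1")
    for key in ("min-fit-clients", "min-evaluate-clients"):
        if cfg[key] > cfg["min-available-clients"]:
            problems.append(f"{key} exceeds min-available-clients")
    return problems


config = load_app_config(PYPROJECT_FRAGMENT)
problems = validate_app_config(config)
print(config["num-server-rounds"], problems)
```

In a real project the text would come from reading pyproject.toml rather than from an inline string.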

Federation Configuration

[tool.flwr.federations]
default = "local-deployment"

[tool.flwr.federations.local-deployment]
address = "127.0.0.1:9093"
insecure = true

[tool.flwr.federations.remote-deployment]
address = "127.0.0.1:9093"
insecure = true
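The default key names the federation that is used when none is selected explicitly. As a rough illustration of that lookup, the sketch below resolves a federation entry from the table above. resolve_federation is a hypothetical helper written for this page, not a Flower or ICOS-FL function.

```python
try:
    import tomllib  # standard library from Python 3.11
except ModuleNotFoundError:  # Python 3.10: fall back to the tomli backport
    import tomli as tomllib

FEDERATIONS_FRAGMENT = """
[tool.flwr.federations]
default = "local-deployment"

[tool.flwr.federations.local-deployment]
address = "127.0.0.1:9093"
insecure = true

[tool.flwr.federations.remote-deployment]
address = "127.0.0.1:9093"
insecure = true
"""


def resolve_federation(toml_text: str, name=None) -> dict:
    """Pick the named federation, or the one referenced by `default`."""
    federations = tomllib.loads(toml_text)["tool"]["flwr"]["federations"]
    chosen = name or federations["default"]
    entry = federations[chosen]
    # Split "host:port" on the last colon.
    host, _, port = entry["address"].rpartition(":")
    return {
        "name": chosen,
        "host": host,
        "port": int(port),
        "insecure": entry["insecure"],
    }


print(resolve_federation(FEDERATIONS_FRAGMENT))
print(resolve_federation(FEDERATIONS_FRAGMENT, "remote-deployment"))
```

Both entries in the listing point at the same local address, so a remote deployment would presumably need its address changed to the remote host.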

OpenTelemetry Configuration

The otel-config.yaml file configures the OpenTelemetry collector:

receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: 'scaphandre'
          scrape_interval: 3s  # Scrape metrics every 3 seconds
          static_configs:
            - targets: ['127.0.0.1:8080']

processors:
  batch:
    timeout: 180s  # Batch metrics for 3 minutes

exporters:
  otlp:
    endpoint: 127.0.0.1:4317
    tls:
      insecure: true

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: [batch]
      exporters: [otlp]

Docker Compose Configuration

The docker-compose.yml file defines the containerized services. The excerpt below shows part of it; omitted settings are marked with "# ...":

services:
  redis:
    image: redis:latest
    restart: unless-stopped

  scaphandre:
    image: docker.io/hubblo/scaphandre
    command: prometheus -p 8080 -a 0.0.0.0
    privileged: true
    # ...

  proxy:
    build: .
    ports:
      - 8676:8676
    depends_on:
      - metadata-service
      - backend
    environment:
      - DATACLAY_PROXY_MDS_HOST=metadata-service
      - DATACLAY_KV_HOST=redis
    command: python -m dataclay.proxy
    # ...

Bridge Configuration

The bridge configuration is set through the bridgeConfig.py script. The excerpt below registers the Scaphandre metrics to track:

# Create a ResourceConfiguration for Scaphandre
rc_scaphandre = ResourceConfiguration("scaphandre-metrics", scaphandre_rules)

# Add the specific metrics you want to track
rc_scaphandre.add_metric("scaph_host_power_microwatts")
rc_scaphandre.add_metric("scaph_host_load_avg_one")
rc_scaphandre.add_metric("scaph_host_memory_total_bytes")
rc_scaphandre.add_metric("scaph_host_memory_available_bytes")

# Add this configuration to the bridge
bc.set_res_config(rc_scaphandre)

Configuration Parameters

Server Configuration

Parameter	Default	Description
num-server-rounds	10	Number of federated learning rounds
fraction-fit	1.0	Fraction of clients to select for training
fraction-evaluate	1.0	Fraction of clients to select for evaluation
min-fit-clients	2	Minimum number of clients for training
min-evaluate-clients	2	Minimum number of clients for evaluation
min-available-clients	2	Minimum number of connected clients required before a round starts
server-device	"cpu"	Device to use for server-side operations
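The fraction and minimum settings interact: the fraction is applied to the number of connected clients, and the minimum acts as a floor. The sketch below mirrors the sampling rule of Flower's FedAvg strategy. Whether ICOS-FL uses FedAvg or a custom strategy is not stated on this page, so treat this as an assumption.

```python
def can_start_round(num_available: int, min_available_clients: int) -> bool:
    """A round waits until at least this many clients are connected."""
    return num_available >= min_available_clients


def num_fit_clients(num_available: int, fraction_fit: float,
                    min_fit_clients: int) -> int:
    """Clients sampled for training: the fraction, floored by the minimum."""
    return max(int(num_available * fraction_fit), min_fit_clients)


# With the defaults (fraction-fit = 1.0, min-fit-clients = 2):
print(num_fit_clients(10, 1.0, 2))  # every connected client trains
# With a smaller fraction the minimum takes over once the fraction drops below it:
print(num_fit_clients(10, 0.5, 2))
print(num_fit_clients(10, 0.1, 2))
print(can_start_round(1, 2))
```

The same rule would apply to fraction-evaluate and min-evaluate-clients for the evaluation step.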

LSTM Model Configuration

Parameter	Default	Description
hidden-layer-size	10	Size of the LSTM hidden layer
time-step	10	Number of time steps in input sequence
num-layers	1	Number of LSTM layers
learning-rate	0.001	Learning rate for model optimization
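To make the time-step parameter concrete, the sketch below builds input sequences with a sliding window: each sample holds time_step consecutive readings and the target is the reading that follows. This is a common way to prepare LSTM inputs and is an assumption about how ICOS-FL uses the parameter; make_sequences is a hypothetical helper.

```python
def make_sequences(values, time_step):
    """Return (inputs, targets) built with a sliding window of `time_step`."""
    inputs, targets = [], []
    for start in range(len(values) - time_step):
        inputs.append(values[start:start + time_step])
        targets.append(values[start + time_step])
    return inputs, targets


# 13 dummy readings with time-step = 10 give 3 training samples.
readings = [float(i) for i in range(13)]
inputs, targets = make_sequences(readings, time_step=10)
print(len(inputs), inputs[0], targets[0])
```

A series of N readings yields N - time_step samples, so a larger time-step needs more collected data before training becomes possible.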

Training Configuration

Parameter	Default	Description
batch-size	64	Batch size for training
train-test-split	0.8	Fraction of the data used for training; the remainder is held out
local-epochs	100	Number of local training epochs per round
metric	"cpu_usage"	Metric to predict (cpu_usage, memory_usage, power_consumption)
use-wandb	false	Whether to use Weights & Biases for logging
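The following sketch shows what train-test-split and batch-size imply for a dataset of a given size. It assumes a chronological split (no shuffling), which is the usual choice for time series but is not confirmed by this page. Both helpers are hypothetical.

```python
import math


def chronological_split(rows, train_fraction):
    """Keep the first `train_fraction` of the rows for training."""
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


def batches_per_epoch(n_samples, batch_size):
    """Number of batches needed to cover all samples once."""
    return math.ceil(n_samples / batch_size)


rows = list(range(300))  # e.g. a full sliding window of 300 rows
train, held_out = chronological_split(rows, 0.8)
print(len(train), len(held_out), batches_per_epoch(len(train), 64))
```

With local-epochs = 100, each client would repeat that number of batches 100 times per federated round.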

Time Series Data Configuration

Parameter	Default	Description
max_rows	300	Maximum rows in the sliding window
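scrape_interval	3s	Interval between metrics scrapes (scrape_interval in otel-config.yaml)
batch_timeout	180s	Time the collector batches metrics before exporting (processors.batch.timeout in otel-config.yaml)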
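These three values determine how much history the sliding window covers. The arithmetic below is a sketch that assumes one row is stored per scrape and that batches are flushed only when the timeout expires. parse_seconds is a hypothetical helper that handles seconds only.

```python
def parse_seconds(duration: str) -> int:
    """Parse a duration such as '3s' or '180s' (seconds only, for this sketch)."""
    if not duration.endswith("s"):
        raise ValueError(f"unsupported duration: {duration!r}")
    return int(duration[:-1])


scrape_interval = parse_seconds("3s")
batch_timeout = parse_seconds("180s")
max_rows = 300

# Scrapes collected before one batch is exported: 180 / 3 = 60.
scrapes_per_batch = batch_timeout // scrape_interval
# Time span covered by a full window: 300 * 3 s = 900 s (15 minutes).
window_seconds = max_rows * scrape_interval
# Batches needed to fill the window from empty: ceil(300 / 60) = 5.
batches_to_fill_window = -(-max_rows // scrapes_per_batch)

print(scrapes_per_batch, window_seconds, batches_to_fill_window)
```

Under these assumptions a freshly started node needs about 15 minutes of scraping before its window is full.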

Environment Variables

ICOS-FL respects these environment variables:

Variable	Description
DATACLAY_PROXY_HOST	Host address for the DataClay proxy
DATACLAY_PROXY_PORT	Port for the DataClay proxy
BRIDGE_CONFIGURATION_ALIAS	Alias for the bridge configuration
TIMESERIES_ALIAS	Alias for the TimeSeriesData object
LOG_LEVEL	Logging level (DEBUG, INFO, WARNING, ERROR)
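A minimal sketch of reading these variables follows. read_settings is a hypothetical helper, and the fallback values are assumptions chosen for illustration: 127.0.0.1 for the host, 8676 (the proxy port published in docker-compose.yml) for the port, and INFO for the log level. The actual defaults used by ICOS-FL are not documented here.

```python
import logging
import os

# The four level names listed in the table above.
LOG_LEVELS = {
    "DEBUG": logging.DEBUG,
    "INFO": logging.INFO,
    "WARNING": logging.WARNING,
    "ERROR": logging.ERROR,
}


def read_settings(env=os.environ) -> dict:
    """Collect ICOS-FL related settings from a mapping of environment variables."""
    level_name = env.get("LOG_LEVEL", "INFO").upper()
    return {
        "proxy_host": env.get("DATACLAY_PROXY_HOST", "127.0.0.1"),  # assumed default
        "proxy_port": int(env.get("DATACLAY_PROXY_PORT", "8676")),  # assumed default
        "bridge_alias": env.get("BRIDGE_CONFIGURATION_ALIAS"),  # None if unset
        "timeseries_alias": env.get("TIMESERIES_ALIAS"),  # None if unset
        "log_level": LOG_LEVELS.get(level_name, logging.INFO),
    }


settings = read_settings({"LOG_LEVEL": "debug", "DATACLAY_PROXY_PORT": "9000"})
logging.basicConfig(level=settings["log_level"])
print(settings)
```

Passing a plain dict instead of os.environ, as in the last lines, makes the helper easy to exercise without touching the real environment.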

Example Configuration

Here is an example configuration for predicting memory usage with a larger model. It lists only the settings that differ from the defaults above:

[tool.flwr.app.config]
# Server configuration
num-server-rounds = 20
min-fit-clients = 3
min-evaluate-clients = 3
min-available-clients = 3

# LSTM model configuration
hidden-layer-size = 20
time-step = 15
num-layers = 2

# Training configuration
metric = "memory_usage"
batch-size = 32
local-epochs = 150
learning-rate = 0.0005

# Logging
use-wandb = true
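If the example is applied as a set of overrides on top of the defaults (for instance through Flower's --run-config option), the effective configuration is the merge of the two. This is an assumption about how the example is used: if it replaces the [tool.flwr.app.config] section outright, the omitted keys would have to be added back. The sketch below shows the merge and which defaults survive.

```python
# Defaults from the [tool.flwr.app.config] listing earlier on this page.
DEFAULTS = {
    "num-server-rounds": 10, "fraction-fit": 1.0, "fraction-evaluate": 1.0,
    "min-fit-clients": 2, "min-evaluate-clients": 2, "min-available-clients": 2,
    "server-device": "cpu", "hidden-layer-size": 10, "time-step": 10,
    "num-layers": 1, "metric": "cpu_usage", "batch-size": 64,
    "train-test-split": 0.8, "local-epochs": 100, "learning-rate": 0.001,
    "use-wandb": False,
}

# Values from the memory-usage example above.
EXAMPLE = {
    "num-server-rounds": 20, "min-fit-clients": 3, "min-evaluate-clients": 3,
    "min-available-clients": 3, "hidden-layer-size": 20, "time-step": 15,
    "num-layers": 2, "metric": "memory_usage", "batch-size": 32,
    "local-epochs": 150, "learning-rate": 0.0005, "use-wandb": True,
}

# Later entries win, so the example values replace the defaults.
effective = {**DEFAULTS, **EXAMPLE}
# Keys the example does not mention keep their default values.
unchanged = sorted(key for key in DEFAULTS if key not in EXAMPLE)

print(effective["metric"], effective["fraction-fit"])
print(unchanged)
```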