Architecture Overview
This page provides a high-level overview of the ICOS-FL architecture.
Core Components
ICOS-FL consists of several interconnected components:
Figure: High-level architecture of ICOS-FL
The architecture is divided into three main layers:
Data Collection Layer: Captures system metrics from nodes
Storage Layer: Persists time series data in DataClay
Learning Layer: Implements federated learning with Flower
Component Interactions
+----------------------+     +----------------------+     +----------------------+
|   Data Collection    |     |       Storage        |     |    Learning Layer    |
|        Layer         |     |        Layer         |     |                      |
|                      |     |                      |     |                      |
| ┌──────────────────┐ |     | ┌──────────────────┐ |     | ┌──────────────────┐ |
| │    Scaphandre    │ |     | │     DataClay     │ |     | │    SuperLink     │ |
| │ Hardware Metrics │ |     | │   Distributed    │ |     | │  Central Server  │ |
| └────────┬─────────┘ |     | │   Object Store   │ |     | └────────┬─────────┘ |
|          │           |     | └────────┬─────────┘ |     |          │           |
| ┌────────▼─────────┐ |     |          │           |     | ┌────────▼─────────┐ |
| │  OpenTelemetry   │ |     | ┌────────▼─────────┐ |     | │    SuperNodes    │ |
| │    Collector     │ |     | │  TimeSeriesData  │ |     | │   Client Nodes   │ |
| └────────┬─────────┘ |     | │  Sliding Window  │ |     | └────────┬─────────┘ |
|          │           |     | └────────┬─────────┘ |     |          │           |
| ┌────────▼─────────┐ |     |          │           |     | ┌────────▼─────────┐ |
| │   OTLP-Bridge    ├─┼─────►          │           |     | │   LSTM Models    │ |
| │  DataClay Link   │ |     |          │           |     | │ Sequence Models  │ |
| └──────────────────┘ |     | ┌────────▼─────────┐ |     | └────────┬─────────┘ |
|                      |     | │    Processor     │◄┼─────┐          │           |
|                      |     | │  Data Pipeline   │ |     │ ┌────────▼─────────┐ |
|                      |     | └──────────────────┘ |     | │ FedAvg Strategy  │ |
+----------------------+     +----------------------+     | │ Model Aggregation│ |
                                                          | └──────────────────┘ |
                                                          +----------------------+
The ICOS-FL framework is organized around the following components, grouped by layer:
Data Collection Layer
Scaphandre: Collects hardware metrics from the system
  - Monitors CPU, memory, and power consumption
  - Exposes metrics through a Prometheus-compatible HTTP endpoint
OpenTelemetry Collector: Processes and batches metrics
  - Scrapes metrics from Scaphandre at configurable intervals
  - Applies transformations and batching
  - Forwards metrics to the OTLP-Bridge via gRPC
OTLP-Bridge: Connects OpenTelemetry to DataClay
  - Receives metrics via gRPC
  - Transforms metrics into a format suitable for storage
  - Stores metrics in DataClay using the TimeSeriesData object
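To make the data flow above more concrete, the following sketch parses Prometheus-style text of the kind Scaphandre's endpoint exposes into flat records that a bridge could store. In the real pipeline the OpenTelemetry Collector does the scraping and the OTLP-Bridge receives OTLP over gRPC, so this is only an illustration of the "transform into a format suitable for storage" step. The `parse_metrics` function, the record layout, and the sample values are assumptions made for this example, not part of ICOS-FL.

```python
import re
import time

# Simplified Prometheus text format: `name{label="v",...} value` or `name value`.
# Lines with a trailing timestamp are not handled by this sketch.
_LINE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
_LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse_metrics(text, timestamp=None):
    """Turn Prometheus-style text into flat records (hypothetical layout)."""
    ts = time.time() if timestamp is None else timestamp
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and HELP/TYPE comments
            continue
        match = _LINE.match(line)
        if match is None:
            continue  # ignore lines this simplified parser does not understand
        records.append({
            "name": match["name"],
            "labels": dict(_LABEL.findall(match["labels"] or "")),
            "value": float(match["value"]),
            "timestamp": ts,
        })
    return records


# Illustrative sample only; the numbers are made up.
SAMPLE = """\
# HELP scaph_host_power_microwatts Illustrative sample, not real output
scaph_host_power_microwatts 21500000
scaph_process_power_consumption_microwatts{pid="42",exe="python"} 1300000
"""

records = parse_metrics(SAMPLE, timestamp=0.0)
for record in records:
    print(record["name"], record["labels"], record["value"])
```

Each record carries its own timestamp, so records can be appended to a time-ordered store without further processing.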
Storage Layer
DataClay: Distributed object store
  - Provides persistent storage for metrics and model state
  - Consists of multiple services (Redis, Metadata Service, Backend, Proxy)
  - Enables efficient data access across nodes
TimeSeriesData: Manages metric storage
  - Implements a sliding window approach for time series data
  - Maintains a configurable number of recent data points
  - Provides methods for accessing and waiting for new data
Processor: Prepares data for model training
  - Normalizes raw metrics data
  - Creates sequences for LSTM input
  - Splits data into training and validation sets
  - Generates DataLoaders for model training
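The sliding window and the preprocessing steps described above can be sketched in plain Python. This is a minimal illustration under stated assumptions: `SlidingWindow` is a hypothetical in-memory stand-in (the real TimeSeriesData object lives in DataClay), and the helper names, min-max scaling, and 80/20 chronological split are choices made for the example rather than the Processor's documented behavior. DataLoader creation is left out to keep the sketch dependency-free.

```python
import threading
from collections import deque


class SlidingWindow:
    """Hypothetical in-memory stand-in for a sliding-window time series store."""

    def __init__(self, max_points=100):
        self._points = deque(maxlen=max_points)  # oldest points drop off automatically
        self._total_added = 0
        self._cond = threading.Condition()

    def add(self, timestamp, value):
        with self._cond:
            self._points.append((timestamp, value))
            self._total_added += 1
            self._cond.notify_all()  # wake any reader waiting for new data

    def snapshot(self):
        with self._cond:
            return list(self._points)

    def wait_for_new(self, seen, timeout=None):
        """Block until more than `seen` points have been added, or until timeout."""
        with self._cond:
            self._cond.wait_for(lambda: self._total_added > seen, timeout=timeout)
            return self._total_added


def min_max_normalize(values):
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero for a constant series
    return [(v - lo) / span for v in values]


def make_sequences(values, seq_len):
    """Pair each window of `seq_len` points with the point that follows it."""
    inputs, targets = [], []
    for start in range(len(values) - seq_len):
        inputs.append(values[start:start + seq_len])
        targets.append(values[start + seq_len])
    return inputs, targets


def chronological_split(inputs, targets, train_fraction=0.8):
    """Keep time order: earlier sequences train, later ones validate."""
    cut = int(len(inputs) * train_fraction)
    return (inputs[:cut], targets[:cut]), (inputs[cut:], targets[cut:])


window = SlidingWindow(max_points=8)
for t in range(12):
    window.add(t, 10.0 + 2.0 * t)  # only the 8 most recent points are kept

values = [v for _, v in window.snapshot()]
scaled = min_max_normalize(values)
x, y = make_sequences(scaled, seq_len=3)
(train_x, train_y), (val_x, val_y) = chronological_split(x, y)
print(len(train_x), "training sequences,", len(val_x), "validation sequence(s)")
```

A chronological split is used instead of a random one so that validation data always comes after the training data in time. In practice the scaling statistics would usually be computed on the training portion only, to avoid leaking information from the validation set.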
Learning Layer
SuperLink: Central federated learning server
  - Coordinates the federated learning process
  - Aggregates model updates from clients
  - Distributes global model to clients
  - Manages training rounds and evaluation
SuperNodes: Federated learning clients
  - Run on each node in the federation
  - Train local LSTM models on local data
  - Send model updates to the server
  - Apply global model updates
LSTM Models: Sequence prediction models
  - Predict future resource usage based on historical patterns
  - Train on local data without sharing raw metrics
  - Support custom architectures and hyperparameters
FedAvg Strategy: Federated averaging algorithm
  - Aggregates model updates from clients
  - Weights updates based on data quantity
  - Configurable parameters for client selection and participation
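The weighting by data quantity described above is the core of federated averaging: each client's parameters contribute in proportion to the number of examples it trained on. The sketch below shows that arithmetic on flat lists of floats. Flower ships its own FedAvg strategy, and this is not its code; the `fedavg` function and the input layout are assumptions made for illustration.

```python
def fedavg(client_updates):
    """Average parameter vectors, weighting each client by its example count.

    `client_updates` is a list of (weights, num_examples) pairs, where
    `weights` is a flat list of floats of the same length for every client.
    """
    total_examples = sum(n for _, n in client_updates)
    if total_examples == 0:
        raise ValueError("no training examples reported")
    num_params = len(client_updates[0][0])
    aggregated = [0.0] * num_params
    for weights, n in client_updates:
        for i, w in enumerate(weights):
            aggregated[i] += w * n / total_examples
    return aggregated


updates = [
    ([1.0, 2.0], 100),  # client A trained on 100 examples
    ([3.0, 6.0], 300),  # client B trained on 300 examples
]
global_weights = fedavg(updates)
print(global_weights)  # [2.5, 5.0]
```

Client B holds three quarters of the data, so the result sits three quarters of the way from A's parameters to B's. Only the parameter vectors and example counts cross the network; the raw metrics never leave the node.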
Cross-Cutting Concerns
Several aspects span multiple components:
Configuration Management
  - Central configuration via pyproject.toml
  - Component-specific configuration files
  - Runtime parameter overrides
Monitoring and Logging
  - Component-level logging
  - Metrics validation via consumer.py
  - Optional integration with Weights & Biases
Containerization
  - Docker-based deployment
  - Container orchestration via Docker Compose
  - Resource management and isolation
Security
  - Data privacy through federated learning
  - Network isolation options
  - Configurable authentication
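The three configuration sources listed under Configuration Management imply a precedence order. One plausible reading, assumed here, is that runtime overrides win over component-specific files, which in turn win over the central defaults. The sketch below expresses that layering with `collections.ChainMap`; the keys and values are hypothetical and this is not how ICOS-FL is documented to load its configuration.

```python
from collections import ChainMap

# Hypothetical values standing in for each configuration source.
central = {"num-server-rounds": 3, "learning-rate": 0.01, "sequence-length": 10}
component = {"learning-rate": 0.005}   # e.g. a component-specific file
overrides = {"num-server-rounds": 5}   # e.g. parameters passed at run time

# Lookups search the maps left to right, so earlier maps take precedence.
config = ChainMap(overrides, component, central)

print(config["num-server-rounds"])  # 5, from the runtime overrides
print(config["learning-rate"])      # 0.005, from the component file
print(config["sequence-length"])    # 10, from the central defaults
```

Because `ChainMap` keeps the layers separate instead of merging them, it stays easy to see which source supplied a given value when debugging a deployment.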
Deployment Models
ICOS-FL supports multiple deployment models:
Single-node: All components on one machine (for development)
Multi-node Federation: Components distributed across machines
Hybrid: Mix of centralized and distributed components
The architecture is designed to be modular, allowing components to be replaced or extended as needed.