==================
Metrics Collection
==================

This guide explains how metrics are collected and processed in ICOS-FL.

Metrics Collection Pipeline
---------------------------

ICOS-FL uses a four-stage pipeline to collect, process, and store system metrics:

1. **Scaphandre**: Collects hardware metrics from the host system
2. **OpenTelemetry Collector**: Scrapes and processes metrics
3. **OTLP-Bridge**: Receives batched metrics and converts them for storage
4. **DataClay**: Stores processed metrics as time series data

.. figure:: ../../_static/images/metrics_flow.png
   :alt: Metrics Collection Pipeline
   :align: center

   Metrics Collection Pipeline

Collected Metrics
-----------------

By default, ICOS-FL collects these system metrics:

.. list-table::
   :header-rows: 1
   :align: left

   * - Metric
     - Description
   * - scaph_host_power_microwatts
     - Power consumption in microwatts
   * - scaph_host_load_avg_one
     - 1-minute CPU load average
   * - scaph_host_memory_total_bytes
     - Total system memory in bytes
   * - scaph_host_memory_available_bytes
     - Available system memory in bytes

These metrics are transformed into more user-friendly values:

.. list-table::
   :header-rows: 1
   :align: left

   * - Source Metric
     - Transformed Metric
     - Transformation
   * - scaph_host_power_microwatts
     - power_consumption
     - Converted to watts (divided by 1,000,000)
   * - scaph_host_load_avg_one
     - cpu_usage
     - Used directly
   * - memory_total - memory_available
     - memory_usage
     - Converted to MB (divided by 1,024*1,024)

Configuring Metrics Collection
------------------------------

Customize which metrics are collected by modifying the Bridge Configuration:

.. code-block:: python

   # In bridgeConfig.py
   rc_scaphandre = ResourceConfiguration("scaphandre-metrics", scaphandre_rules)

   # Add or remove metrics
   rc_scaphandre.add_metric("scaph_host_power_microwatts")
   rc_scaphandre.add_metric("scaph_host_load_avg_one")
   rc_scaphandre.add_metric("scaph_host_memory_total_bytes")
   rc_scaphandre.add_metric("scaph_host_memory_available_bytes")

   # Add new metric
   rc_scaphandre.add_metric("scaph_host_memory_cached_bytes")

OpenTelemetry Configuration
---------------------------

Adjust the OpenTelemetry scraping interval and batch settings in ``otel-config.yaml``:

.. code-block:: yaml

   receivers:
     prometheus:
       config:
         scrape_configs:
           - job_name: 'scaphandre'
             scrape_interval: 3s  # Adjust collection frequency
             static_configs:
               - targets: ['127.0.0.1:8080']

   processors:
     batch:
       timeout: 180s  # Adjust batching interval

Time Series Data Storage
------------------------

Metrics are stored in a sliding window in DataClay:

.. code-block:: python

   # In icos_fl/utils/fetcher.py
   class TimeSeriesData(DataClayObject):
       """Class for managing time series data with a sliding window approach."""

       def __init__(self, max_rows: int = 300) -> None:
           self.dataframe = None
           self.max_rows = max_rows
           self.waiters = list()

The default configuration keeps the 300 most recent data points, which covers
approximately 15 minutes at the default 3-second scrape interval
(300 × 3 s = 900 s).

Accessing Collected Metrics
---------------------------

To access the collected metrics programmatically:

.. code-block:: python

   from dataclay import Client
   from icos_fl.utils.fetcher import TimeSeriesData

   # Connect to DataClay
   client = Client(proxy_host="127.0.0.1", dataset="admin")
   client.start()

   # Get TimeSeriesData
   tsd = TimeSeriesData.get_by_alias("timeseries")

   # Get current dataframe
   df = tsd.get_dataframe()

   # Display metrics
   print(df.head())

You can also use the included ``consumer.py`` script to monitor metrics:

.. code-block:: bash

   python consumer.py

Adding Custom Metrics
---------------------

To collect additional metrics:

1. Ensure the metrics are exposed by Scaphandre or another OpenTelemetry source
2. Update the OpenTelemetry configuration to scrape these metrics
3. Modify the Bridge Configuration to collect the new metrics:

   .. code-block:: python

      rc_custom = ResourceConfiguration("custom-metrics", custom_rules)
      rc_custom.add_metric("my_custom_metric_name")
      bc.set_res_config(rc_custom)

4. Update your data processing code to handle the new metrics

Data Preprocessing
------------------

When metrics are used for model training, they go through these preprocessing steps:

1. **Normalization**: Standardized to zero mean and unit variance
2. **Sequencing**: Converted to sequences of length ``time_step``
3. **Train/Test Split**: Divided based on the ``train_ratio`` configuration
4. **Batching**: Grouped into batches of size ``batch_size``

This preprocessing is handled by the ``Processor`` class:

.. code-block:: python

   import torch

   from icos_fl.utils.processor import Processor

   processor = Processor(
       time_step=10,
       metric="cpu_usage",
       batch_size=64,
       train_ratio=0.8,
       device=torch.device("cpu")
   )

   # df is the dataframe obtained from TimeSeriesData.get_dataframe()
   train_dataloader, val_dataloader, _, _ = processor.create_data_loaders(df)
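
To make the four preprocessing steps concrete, the following is a minimal
standard-library sketch of what such a pipeline could look like. It is not the
``Processor`` implementation: the helper names (``normalize``,
``make_sequences``, ``split``, ``batches``) are hypothetical, and the sketch
assumes a next-value prediction target, a chronological (unshuffled) split, and
the population standard deviation for normalization. None of these details is
confirmed by the ICOS-FL source.

.. code-block:: python

   from statistics import mean, pstdev


   def normalize(values):
       """Standardize values to zero mean and unit variance."""
       mu = mean(values)
       sigma = pstdev(values)
       if sigma == 0:
           # A constant series has no variance; map it to all zeros.
           return [0.0 for _ in values]
       return [(v - mu) / sigma for v in values]


   def make_sequences(values, time_step):
       """Pair each window of `time_step` values with the value after it."""
       return [
           (values[i:i + time_step], values[i + time_step])
           for i in range(len(values) - time_step)
       ]


   def split(samples, train_ratio):
       """Chronological split: the first `train_ratio` share is for training."""
       cut = int(len(samples) * train_ratio)
       return samples[:cut], samples[cut:]


   def batches(samples, batch_size):
       """Group samples into consecutive batches of at most `batch_size`."""
       return [
           samples[i:i + batch_size]
           for i in range(0, len(samples), batch_size)
       ]


   # Illustrative values only, standing in for the cpu_usage column.
   cpu_usage = [0.2, 0.4, 0.3, 0.5, 0.6, 0.4, 0.7, 0.8, 0.6, 0.9, 1.0, 0.8]

   normalized = normalize(cpu_usage)
   samples = make_sequences(normalized, time_step=3)
   train, val = split(samples, train_ratio=0.8)
   train_batches = batches(train, batch_size=4)

With 12 input values and ``time_step=3``, this yields 9 samples, split into 7
training and 2 validation samples, and the training samples form two batches of
sizes 4 and 3.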