Big Data System vs STADLE

Only 10% of AI projects reached production; the rest were held back by data privacy, bias, and the high computational cost of training.

In a traditional big data system, data is gathered to create large data stores.
These large data stores are used to solve a specific problem using machine learning.
The resulting model generalizes well because of the volume of data it was trained on, and is eventually deployed.

Continuous data collection uses large amounts of communication bandwidth.
In privacy-focused applications, the transmission of data may be banned entirely - making model creation impossible.
Training large ML models on big data stores is computationally expensive.
- The efficiency of traditional centralized training is limited by single-machine performance.
- Distributed learning approaches often incur overhead to maintain training performance.
Slow training processes cause long delays between incremental model updates, which makes it hard to accommodate new data trends.

With STADLE, ML training is performed directly at the location of the data.
The resulting trained models are collected at the central server.
Aggregation algorithms are used to produce an aggregated model from the collected models.
The aggregated model is sent back to the data locations for further training.
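
The four steps above can be sketched as a small training loop. This is a minimal illustration, not STADLE's implementation: the source does not say which aggregation algorithm is used, so the sketch assumes plain sample-weighted averaging (in the style of federated averaging). The names `local_train`, `aggregate`, and `run_rounds`, the toy linear model, and the three-node dataset are all hypothetical.

```python
from typing import List, Sequence, Tuple

Weights = List[float]          # [slope, intercept] of a toy model y = w*x + b
Sample = Tuple[float, float]   # (x, y)


def local_train(weights: Weights, data: Sequence[Sample],
                lr: float = 0.05, epochs: int = 5) -> Weights:
    """Step 1 (hypothetical): gradient descent on one node's private data."""
    w, b = weights
    n = len(data)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return [w, b]


def aggregate(models: List[Weights], sizes: List[int]) -> Weights:
    """Step 3 (assumed): average the weights, weighted by each node's sample count."""
    total = sum(sizes)
    return [
        sum(m[i] * s for m, s in zip(models, sizes)) / total
        for i in range(len(models[0]))
    ]


def run_rounds(node_data: List[Sequence[Sample]], rounds: int = 100) -> Weights:
    global_weights: Weights = [0.0, 0.0]
    for _ in range(rounds):
        # Steps 1-2: each node trains locally; only the weights reach the server.
        local_models = [local_train(list(global_weights), d) for d in node_data]
        # Steps 3-4: aggregate, then use the result as the next starting point.
        global_weights = aggregate(local_models, [len(d) for d in node_data])
    return global_weights


# Three hypothetical nodes, each holding private samples of y = 2x + 1.
node_data = [
    [(0.0, 1.0), (1.0, 3.0)],
    [(2.0, 5.0), (3.0, 7.0), (4.0, 9.0)],
    [(-1.0, -1.0), (-2.0, -3.0)],
]
final = run_rounds(node_data)
print(final)
```

No node ever sees another node's samples, yet the aggregated weights should land close to the slope 2 and intercept 1 that generated all of the data.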

Advanced aggregation algorithms can maintain training performance in restricted scenarios (Federated Learning) and increase efficiency in standard ML scenarios (Distributed Learning).
Only model weights are transmitted between the server and the nodes - communication efficiency.
Training can be performed asynchronously across a variable number of nodes - efficient and easily scalable distributed learning.
Training is performed at the data location, so data is never transmitted - data privacy is maintained.
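
To make the communication point concrete, here is a back-of-the-envelope comparison. Every number in it is hypothetical (dataset size, sample size, parameter count, number of rounds), and it assumes dense, uncompressed float32 weights. It only shows how the two costs are computed, not what any real deployment measures.

```python
def bytes_for_weights(num_parameters: int, bytes_per_parameter: int = 4) -> int:
    """Size of one model update, assuming dense float32 weights, no compression."""
    return num_parameters * bytes_per_parameter


def bytes_for_raw_data(num_samples: int, bytes_per_sample: int) -> int:
    """Size of shipping a node's raw samples to a central data store."""
    return num_samples * bytes_per_sample


# Hypothetical node: 100,000 samples of 150 KB each, and a 5M-parameter model.
raw = bytes_for_raw_data(num_samples=100_000, bytes_per_sample=150_000)
update = bytes_for_weights(num_parameters=5_000_000)

# Per round, a node uploads its weights and downloads the aggregated model.
rounds = 50
federated_total = 2 * update * rounds

print(f"raw data upload: {raw / 1e9:.1f} GB")
print(f"weights traffic: {federated_total / 1e9:.1f} GB over {rounds} rounds")
```

The balance depends on the inputs: a very large model trained on a small local dataset could transmit more bytes in weights than the raw data would have required, so the saving should be checked per use case.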