# Helmsman: Autonomous Synthesis of Federated Learning Systems via Multi-Agent Collaboration

Federated Learning (FL) offers a powerful paradigm for training models on decentralized data, but its promise is often undermined by the immense complexity of designing and deploying robust systems. The need to select, combine, and tune strategies for multifaceted challenges like data heterogeneity and system constraints has become a critical bottleneck, resulting in brittle, bespoke solutions. To address this, we introduce Helmsman, a novel multi-agent system that automates the end-to-end synthesis of federated learning systems from high-level user specifications. It emulates a principled research and development workflow through three collaborative phases: (1) interactive human-in-the-loop planning to formulate a sound research plan, (2) modular code generation by supervised agent teams, and (3) a closed loop of autonomous evaluation and refinement in a sandboxed simulation environment. To facilitate rigorous evaluation, we also introduce AgentFL-Bench, a new benchmark comprising 16 diverse tasks designed to assess the system-level generation capabilities of agentic systems in FL. Extensive experiments demonstrate that our approach generates solutions competitive with, and often superior to, established hand-crafted baselines. Our work represents a significant step towards the automated engineering of complex decentralized AI systems.
arXiv preprint
Federated Learning (FL) holds immense promise for privacy-centric collaborative AI, yet its practical deployment remains complex. Designing an effective FL system requires navigating a range of challenges including statistical heterogeneity, system constraints, and shifting task objectives. To date, this design process has been a manual, labor-intensive effort led by domain experts, resulting in static, bespoke solutions that are brittle in the face of real-world dynamics.
*The intractable design space of FL, created by the combinatorial task of matching diverse challenges with specialized strategies.*
The difficulty of FL design is rooted in its vast and combinatorial nature. Real-world deployments rarely present a single, isolated challenge. Instead, they involve a confluence of issues: clients may have non-IID data, heterogeneous computational capabilities, and unreliable network connections, all while pursuing diverse task objectives. Current FL research often operates in silos, developing point solutions for individual problems. While effective in isolation, these solutions are difficult to compose, and their interactions are unpredictable. This leaves practitioners facing an intractable design space. To address this bottleneck, we introduce Helmsman, a multi-agent system that automates the end-to-end synthesis of federated learning systems from high-level user specifications. By emulating a principled research and development workflow, Helmsman dramatically lowers the barrier to entry for creating robust and sophisticated FL solutions.
Figure 1: The automated FL development workflow of Helmsman. (a) Planning: A user query is refined into an actionable research plan via human-in-the-loop dialogue. (b) Coding: Specialized agent teams, managed by a Supervisor, collaboratively build a modular codebase. (c) Evaluation: The final code is autonomously tested and refined in a closed simulation loop until correct.
Helmsman transforms high-level user objectives into fully functional FL systems through a principled three-stage workflow that emulates human research and development practices.
1. **Planning:** Transform user queries into verified research plans through automated critique and human validation.
2. **Coding:** Supervised agent teams implement modular FL components following a dependency-aware workflow.
3. **Evaluation:** Iterative diagnosis and debugging in a sandboxed simulation until the system is certified correct (sketched below).
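The control flow of these three phases can be summarized in a short sketch. The code below is illustrative only: the `Plan` dataclass, the phase functions, and the `simulate`/`patch` stubs are hypothetical stand-ins for Helmsman's LLM-backed agents, not its actual API.

```python
from dataclasses import dataclass

# Illustrative sketch of the three-phase workflow. Every class and
# helper here is a hypothetical stand-in, not the Helmsman API.

@dataclass
class Plan:
    modules: tuple[str, ...]  # implemented in dependency order
    approved: bool = False

def planning_phase(query: str) -> Plan:
    """Phase 1: draft a plan, then refine it via human-in-the-loop critique."""
    plan = Plan(modules=("data", "client", "server", "strategy"))
    while not plan.approved:
        # Stand-in for interactive dialogue; a real run would revise the
        # plan from human feedback rather than auto-approve it.
        plan.approved = True
    return plan

def coding_phase(plan: Plan) -> dict[str, str]:
    """Phase 2: supervised teams emit one module at a time, in order,
    so each module can build on the ones it depends on."""
    codebase: dict[str, str] = {}
    for module in plan.modules:
        codebase[module] = f"# generated code for {module!r}"
    return codebase

def evaluation_phase(codebase: dict[str, str], budget: int = 5) -> dict[str, str]:
    """Phase 3: run sandboxed simulations, patch from diagnostics,
    and stop once a run passes (or the refinement budget is spent)."""
    for _ in range(budget):
        if simulate(codebase):      # hypothetical sandboxed run
            return codebase
        codebase = patch(codebase)  # hypothetical debugger agent
    raise RuntimeError("refinement budget exhausted")

def simulate(codebase: dict[str, str]) -> bool:  # stub: always passes
    return True

def patch(codebase: dict[str, str]) -> dict[str, str]:  # stub: no-op
    return codebase

plan = planning_phase("deploy personalized FL across 15 mobile devices")
print(evaluation_phase(coding_phase(plan)))
```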
To rigorously evaluate Helmsman's capabilities, we introduce AgentFL-Bench, a benchmark comprising 16 diverse tasks across 5 pivotal FL research domains: data heterogeneity, communication efficiency, personalization, active learning, and continual learning. Each task is defined by a standardized natural language query that provides an unambiguous problem specification, minimizing prompt-engineering variability and enabling fair comparisons between different agentic systems.
The tasks are designed for realism, reflecting the multifaceted nature of real-world FL challenges where issues like non-IID data and system constraints often co-occur. This requires agents to reason about and integrate appropriate strategies rather than solving isolated toy problems.
Table 1: The structured natural language template for specifying tasks to Helmsman. This template guides the user to provide a comprehensive and unambiguous problem definition, ensuring the Planning Agent receives the necessary context regarding the application domain, data characteristics, and desired FL objectives. The provided query example shows a complete instantiation of this template.
| Component | Description | Pattern Template |
|---|---|---|
| Problem Statement | Defines the high-level deployment context, including the application domain, system scale, and target infrastructure. | "I need to deploy {application type} on/across {number} {device types}..." |
| Task Description | Specifies the data characteristics, heterogeneity patterns, and any domain-specific challenges. | "Each client holds {dataset characteristics} with {specific challenge}..." |
| Framework Requirement | Outlines the core FL objectives, the model architecture to be trained, and the metrics for evaluation. | "Help me build a federated learning framework that {objectives}, training a {model architecture}, evaluating performance by {metrics}." |
Query Example:
"I need to deploy a personalized handwriting recognition app across 15 mobile devices. Each client holds FEMNIST data from individual users with unique writing styles. Help me build a personalized federated learning framework that balances global knowledge with local user adaptation for a CNN model, evaluating performance by average client test accuracy."
Extensive experiments on AgentFL-Bench demonstrate that Helmsman generates solutions competitive with, and often superior to, established hand-crafted baselines. Our evaluation spans diverse challenges across heterogeneous FL, communication efficiency, personalization, active learning, and continual learning domains.
Table 2: Performance on heterogeneous FL benchmarks. Best results are in **bold**; second-best are in *italics*. Symbols in the Specialized column identify the method (see footnote).

| ID | Task | Dataset | Problem | FedAvg | FedProx | Specialized | Ours |
|---|---|---|---|---|---|---|---|
| Data Heterogeneity | | | | | | | |
| Q1 | Object Recognition | CIFAR-10-LT | Quantity Skew | 70.68 | 69.13 | **78.39**\* | *76.26* |
| Q2 | Object Recognition | CIFAR-100-C | Feature Skew | 35.29 | 36.93 | **42.13**† | *39.43* |
| Q3 | Object Recognition | CIFAR-10N | Label Noise | 74.32 | 79.42 | *80.67*† | **81.14** |
| Distribution Shift | | | | | | | |
| Q4 | Object Recognition | Office-Home | Domain Shift | 52.94 | 50.81 | **57.68**\* | *53.47* |
| Q5 | Human Activity | HAR | User Heterogeneity | 94.38 | 95.66 | *95.77*\* | **96.74** |
| Q6 | Speech Recognition | Speech Commands | Speaker Variation | *84.93* | 84.11 | 83.08\* | **86.19** |
| Q7 | Medical Diagnosis | Fed-ISIC2019 | Site Heterogeneity | 55.43 | 61.79 | *63.62*\* | **63.80** |
| Q8 | Object Recognition | Caltech101 | Class Imbalance | 48.21 | 47.75 | **63.62**\* | *50.92* |
| System Heterogeneity | | | | | | | |
| Q9 | Object Recognition | CIFAR-100 | Resource Constraint | 60.83 | 59.86 | *62.19*‡ | **63.49** |

\* FedNova, † FedNS, ‡ HeteroFL
For the complex continual-learning task Q16, Helmsman discovered a novel hybrid strategy that combines client-side experience replay with global model distillation, achieving a 79% accuracy improvement over FedWeIT while reducing catastrophic forgetting by 71%.
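As a rough illustration of this pattern (not Helmsman's generated code), a client-side update combining the two ingredients might look like the sketch below. The `replay_buffer` interface, the loss weighting, and the distillation temperature are all assumptions for the sake of the example.

```python
import torch
import torch.nn.functional as F

def local_update(model, global_model, loader, replay_buffer,
                 optimizer, kd_weight=0.5, temperature=2.0):
    """One client round mixing (a) experience replay over stored
    past-task samples with (b) distillation from the frozen global
    model. Generic sketch of the pattern, not Helmsman's output;
    `replay_buffer` is assumed to support len() and .sample(n)."""
    global_model.eval()
    for x, y in loader:
        # Mix the current-task batch with replayed samples from
        # earlier tasks to preserve old knowledge locally.
        if len(replay_buffer) > 0:
            xr, yr = replay_buffer.sample(x.size(0))
            x, y = torch.cat([x, xr]), torch.cat([y, yr])

        logits = model(x)
        ce = F.cross_entropy(logits, y)

        # Distill from the global model to curb catastrophic forgetting.
        with torch.no_grad():
            teacher = global_model(x)
        kd = F.kl_div(
            F.log_softmax(logits / temperature, dim=1),
            F.softmax(teacher / temperature, dim=1),
            reduction="batchmean",
        ) * temperature ** 2

        loss = ce + kd_weight * kd
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```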
Helmsman completed 62.5% of AgentFL-Bench tasks fully autonomously; the remaining 37.5% required human intervention to resolve planning ambiguities or handle domain-specific coding. System autonomy is further bounded by the computational budget available for iterative refinement.
Helmsman demonstrates that coordinated multi-agent collaboration can automate end-to-end FL system design, from initial planning through implementation and evaluation. By handling the complex design-implementation-testing cycle autonomously, Helmsman lowers barriers to creating sophisticated FL solutions and achieves competitive performance across diverse research domains. Future work will explore self-evolutionary capabilities, enabling the system to refine its strategies from experimental feedback toward fully autonomous FL engineering.
If you find Helmsman useful for your research, please consider citing our work:
    @article{li2025helmsman,
      title={Helmsman: Autonomous Synthesis of Federated Learning Systems via Multi-Agent Collaboration},
      author={Li, Haoyuan and Funk, Mathias and Saeed, Aaqib},
      journal={arXiv preprint arXiv:2510.14512},
      year={2025}
    }