Helmsman: Autonomous Synthesis of Federated Learning Systems via Multi-Agent Collaboration

Haoyuan Li, Mathias Funk, Aaqib Saeed

Eindhoven University of Technology (TU/e)

Federated Learning (FL) offers a powerful paradigm for training models on decentralized data, but its promise is often undermined by the immense complexity of designing and deploying robust systems. The need to select, combine, and tune strategies for multifaceted challenges like data heterogeneity and system constraints has become a critical bottleneck, resulting in brittle, bespoke solutions. To address this, we introduce Helmsman, a novel multi-agent system that automates the end-to-end synthesis of federated learning systems from high-level user specifications. It emulates a principled research and development workflow through three collaborative phases: (1) interactive human-in-the-loop planning to formulate a sound research plan, (2) modular code generation by supervised agent teams, and (3) a closed loop of autonomous evaluation and refinement in a sandboxed simulation environment. To facilitate rigorous evaluation, we also introduce AgentFL-Bench, a new benchmark comprising 16 diverse tasks designed to assess the system-level generation capabilities of agentic systems in FL. Extensive experiments demonstrate that our approach generates solutions competitive with, and often superior to, established hand-crafted baselines. Our work represents a significant step towards the automated engineering of complex decentralized AI systems.

arXiv preprint

Introduction

Federated Learning (FL) holds immense promise for privacy-centric collaborative AI, yet its practical deployment remains complex. Designing an effective FL system requires navigating a range of challenges including statistical heterogeneity, system constraints, and shifting task objectives. To date, this design process has been a manual, labor-intensive effort led by domain experts, resulting in static, bespoke solutions that are brittle in the face of real-world dynamics.

Figure: The intractable design space of FL, created by the combinatorial task of matching diverse challenges with specialized strategies.

The difficulty of FL design is rooted in its vast and combinatorial nature. Real-world deployments rarely present a single, isolated challenge. Instead, they involve a confluence of issues: clients may have non-IID data, heterogeneous computational capabilities, and unreliable network connections, all while pursuing diverse task objectives. Current FL research often operates in silos, developing point solutions for individual problems. While effective in isolation, these solutions are difficult to compose, and their interactions are unpredictable. This leaves practitioners facing an intractable design space. To address this bottleneck, we introduce Helmsman, a multi-agent system that automates the end-to-end synthesis of federated learning systems from high-level user specifications. By emulating a principled research and development workflow, Helmsman dramatically lowers the barrier to entry for creating robust and sophisticated FL solutions.

Figure 1: The automated FL development workflow of Helmsman. (a) Planning: A user query is refined into an actionable research plan via human-in-the-loop dialogue. (b) Coding: Specialized agent teams, managed by a Supervisor, collaboratively build a modular codebase. (c) Evaluation: The final code is autonomously tested and refined in a closed simulation loop until correct.

Helmsman: Automated FL System Synthesis

Helmsman transforms high-level user objectives into fully functional FL systems through a principled three-stage workflow that emulates human research and development practices.

📋 Planning: Transform user queries into verified research plans through automated critique and human validation.

⚙️ Coding: Supervised agent teams implement modular FL components following a dependency-aware workflow.

🔄 Evaluation: Iterative diagnosis and debugging in a sandboxed simulation until the system is certified correct (see the end-to-end sketch below).
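To make this division of labor concrete, the following is a minimal sketch of the three-phase loop in plain Python. It is an illustration only, not Helmsman's actual API: every function name and the Report type are hypothetical stand-ins, with the LLM-backed agent calls stubbed out.

from dataclasses import dataclass

@dataclass
class Report:
    passed: bool
    diagnosis: str = ""

# Hypothetical stand-ins for the LLM-backed agent phases.
def draft_and_validate_plan(query: str) -> str:
    return f"research plan for: {query}"              # Phase 1 stub

def generate_codebase(plan: str) -> str:
    return f"# modular FL code implementing: {plan}"  # Phase 2 stub

def simulate_in_sandbox(code: str) -> Report:
    return Report(passed=True)                        # Phase 3 stub

def refine(code: str, diagnosis: str) -> str:
    return code + f"\n# patched after: {diagnosis}"

def run_helmsman(user_query: str, max_refinements: int = 5) -> str:
    plan = draft_and_validate_plan(user_query)        # (a) Planning
    code = generate_codebase(plan)                    # (b) Coding
    for _ in range(max_refinements):                  # (c) Evaluation
        report = simulate_in_sandbox(code)
        if report.passed:
            return code
        code = refine(code, report.diagnosis)
    raise RuntimeError("refinement budget exhausted without a passing run")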

Interactive and Verifiable Planning

The Planning Agent refines a user's high-level query into an actionable research plan through human-in-the-loop dialogue: drafts are subjected to automated critique, revised, and adopted only once the user validates them, so implementation starts from a sound plan.
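A minimal sketch of this verification loop follows, assuming an LLM-backed drafting and critique step (stubbed out here) and a console prompt standing in for the human reviewer; all names are illustrative, not part of Helmsman.

def draft_plan(query: str) -> str:
    return f"Research plan for: {query}"          # placeholder LLM call

def critique_plan(plan: str) -> str:
    return ""                                     # placeholder: no objections

def revise_plan(plan: str, feedback: str) -> str:
    return plan + f"\n- revised per: {feedback}"  # placeholder LLM call

def plan_interactively(query: str, max_rounds: int = 3) -> str:
    plan = draft_plan(query)
    for _ in range(max_rounds):
        critique = critique_plan(plan)            # automated self-critique
        if critique:
            plan = revise_plan(plan, critique)
            continue
        # Human validation gate: the plan is only adopted once approved.
        if input(f"{plan}\nApprove plan? [y/N] ").strip().lower() == "y":
            return plan
        plan = revise_plan(plan, input("Feedback: "))
    raise RuntimeError("plan not approved within the round budget")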

Modular Code Generation via Supervised Agent Teams

A Supervisor agent coordinates specialized agent teams that collaboratively build a modular codebase, dispatching implementation tasks in dependency order so that shared modules exist before the components that build on them.
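The dependency-aware scheduling can be illustrated with Python's standard-library graphlib. The module graph below is a hypothetical example of a generated FL codebase, not the exact decomposition Helmsman produces.

from graphlib import TopologicalSorter

# Hypothetical dependency graph for a generated FL codebase:
# each module maps to the modules it depends on.
deps = {
    "strategy.py": {"model.py"},
    "client.py":   {"model.py", "data.py"},
    "server.py":   {"model.py", "strategy.py"},
    "main.py":     {"client.py", "server.py"},
}

# The Supervisor dispatches coding tasks so that shared modules are
# implemented before the components that build on them.
for module in TopologicalSorter(deps).static_order():
    print(f"dispatch coding task: {module}")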

Autonomous Evaluation and Refinement

The generated system is executed in a sandboxed simulation environment; failures are diagnosed and fed back for debugging, and the loop repeats until the code runs correctly.
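A minimal sketch of this closed loop, assuming the generated system exposes a single simulation entrypoint: the subprocess stands in for the sandbox, and repair is a hypothetical hook back into the coding agents.

import subprocess
import sys

def evaluate_and_refine(entrypoint: str, repair, budget: int = 5) -> bool:
    """Run the generated FL simulation; loop until it exits cleanly."""
    for _ in range(budget):
        try:
            result = subprocess.run(
                [sys.executable, entrypoint],
                capture_output=True, text=True, timeout=600,
            )
        except subprocess.TimeoutExpired:
            repair(entrypoint, "simulation timed out")
            continue
        if result.returncode == 0:
            return True                      # certified: simulation passed
        repair(entrypoint, result.stderr)    # diagnose and patch the code
    return False                             # refinement budget exhausted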

AgentFL-Bench

To rigorously evaluate Helmsman's capabilities, we introduce AgentFL-Bench, a benchmark comprising 16 diverse tasks across 5 pivotal FL research domains: data heterogeneity, communication efficiency, personalization, active learning, and continual learning. Each task is defined by a standardized natural language query that provides an unambiguous problem specification, minimizing prompt-engineering variability and enabling fair comparisons between different agentic systems.

The tasks are designed for realism, reflecting the multifaceted nature of real-world FL challenges where issues like non-IID data and system constraints often co-occur. This requires agents to reason about and integrate appropriate strategies rather than solving isolated toy problems.

Table 1: The structured natural language template for specifying tasks to Helmsman. This template guides the user to provide a comprehensive and unambiguous problem definition, ensuring the Planning Agent receives the necessary context regarding the application domain, data characteristics, and desired FL objectives. The provided query example shows a complete instantiation of this template.

Component | Description | Pattern Template
Problem Statement | Defines the high-level deployment context, including the application domain, system scale, and target infrastructure. | "I need to deploy [application type] on/across [number] [device types]..."
Task Description | Specifies the data characteristics, heterogeneity patterns, and any domain-specific challenges. | "Each client holds [dataset characteristics] with [specific challenge]..."
Framework Requirement | Outlines the core FL objectives, the model architecture to be trained, and the metrics for evaluation. | "Help me build a federated learning framework that [objectives], training a [model architecture], evaluating performance by [metrics]."

Query Example:

"I need to deploy a personalized handwriting recognition app across 15 mobile devices. Each client holds FEMNIST data from individual users with unique writing styles. Help me build a personalized federated learning framework that balances global knowledge with local user adaptation for a CNN model, evaluating performance by average client test accuracy."

Experimental Results

Extensive experiments on AgentFL-Bench demonstrate that Helmsman generates solutions competitive with, and often superior to, established hand-crafted baselines. Our evaluation spans diverse challenges across heterogeneous FL, communication efficiency, personalization, active learning, and continual learning domains.

Table 2: Test accuracy (%) on heterogeneous FL benchmarks. Footnote symbols in the Specialized column identify the task-specific baseline used for each task.

ID | Task | Dataset | Problem | FedAvg | FedProx | Specialized | Ours

Data Heterogeneity
Q1 | Object Recognition | CIFAR-10-LT | Quantity Skew | 70.68 | 69.13 | 78.39* | 76.26
Q2 | Object Recognition | CIFAR-100-C | Feature Skew | 35.29 | 36.93 | 42.13† | 39.43
Q3 | Object Recognition | CIFAR-10N | Label Noise | 74.32 | 79.42* | 80.67† | 81.14

Distribution Shift
Q4 | Object Recognition | Office-Home | Domain Shift | 52.94 | 50.81 | 57.68* | 53.47
Q5 | Human Activity | HAR | User Heterogeneity | 94.38 | 95.66 | 95.77* | 96.74
Q6 | Speech Recognition | Speech Commands | Speaker Variation | 84.93 | 84.11 | 83.08* | 86.19
Q7 | Medical Diagnosis | Fed-ISIC2019 | Site Heterogeneity | 55.43 | 61.79 | 63.62* | 63.80
Q8 | Object Recognition | Caltech101 | Class Imbalance | 48.21 | 47.75 | 63.62* | 50.92

System Heterogeneity
Q9 | Object Recognition | CIFAR-100 | Resource Constraint | 60.83 | 59.86 | 62.19‡ | 63.49

Specialized baselines: * FedNova, † FedNS, ‡ HeteroFL.

Discussion

Novel Algorithmic Discovery

For the complex continual learning task Q16, Helmsman discovered a novel hybrid strategy that combines client-side experience replay with global model distillation, improving accuracy by 79% over FedWeIT while reducing catastrophic forgetting by 71%.
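The exact formulation is not given here, so the following is a hedged sketch of such a hybrid client objective in PyTorch: cross-entropy on the current task and on a replayed batch, plus a distillation term against the frozen global model. The loss weight alpha and the temperature are illustrative, not the discovered values.

import torch
import torch.nn.functional as F

def client_loss(local_model, global_model, batch, replay_batch=None,
                alpha=0.5, temperature=2.0):
    """Hybrid objective: task loss + experience replay + distillation."""
    x, y = batch
    loss = F.cross_entropy(local_model(x), y)          # current task

    if replay_batch is not None:                       # client-side replay
        rx, ry = replay_batch
        loss = loss + F.cross_entropy(local_model(rx), ry)

    with torch.no_grad():                              # frozen global teacher
        teacher = F.softmax(global_model(x) / temperature, dim=1)
    student = F.log_softmax(local_model(x) / temperature, dim=1)
    kd = F.kl_div(student, teacher, reduction="batchmean") * temperature**2
    return loss + alpha * kd                           # distillation term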

Limits of Autonomy

Helmsman completed 62.5% of AgentFL-Bench tasks fully autonomously; the remaining 37.5% required human intervention to resolve planning ambiguities or domain-specific coding issues. Autonomy is further bounded by the computational budget available for iterative refinement.

Conclusion

Helmsman demonstrates that coordinated multi-agent collaboration can automate end-to-end FL system design, from initial planning through implementation and evaluation. By handling the complex design-implementation-testing cycle autonomously, Helmsman lowers barriers to creating sophisticated FL solutions and achieves competitive performance across diverse research domains. Future work will explore self-evolutionary capabilities, enabling the system to refine its strategies from experimental feedback toward fully autonomous FL engineering.

Citation

If you find Helmsman useful for your research, please consider citing our work:

@article{li2025helmsman,
  title={Helmsman: Autonomous Synthesis of Federated Learning Systems via Multi-Agent Collaboration},
  author={Li, Haoyuan and Funk, Mathias and Saeed, Aaqib},
  journal={arXiv preprint arXiv:2510.14512},
  year={2025}
}