Helmsman: Autonomous Synthesis of Federated Learning Systems via Multi-Agent Collaboration

Haoyuan Li, Mathias Funk, Aaqib Saeed

Eindhoven University of Technology (TU/e)

Federated Learning (FL) offers a powerful paradigm for training models on decentralized data, but its promise is often undermined by the immense complexity of designing and deploying robust systems. The need to select, combine, and tune strategies for multifaceted challenges like data heterogeneity and system constraints has become a critical bottleneck, resulting in brittle, bespoke solutions. To address this, we introduce Helmsman, a novel multi-agent system that automates the end-to-end synthesis of federated learning systems from high-level user specifications. It emulates a principled research and development workflow through three collaborative phases: (1) interactive human-in-the-loop planning to formulate a sound research plan, (2) modular code generation by supervised agent teams, and (3) a closed loop of autonomous evaluation and refinement in a sandboxed simulation environment. To facilitate rigorous evaluation, we also introduce AgentFL-Bench, a new benchmark comprising 16 diverse tasks designed to assess the system-level generation capabilities of agentic systems in FL. Extensive experiments demonstrate that our approach generates solutions competitive with, and often superior to, established hand-crafted baselines. Our work represents a significant step towards the automated engineering of complex decentralized AI systems.

arXiv preprint

Introduction

Federated Learning (FL) holds immense promise for privacy-centric collaborative AI, yet its practical deployment remains complex. Designing an effective FL system requires navigating a range of challenges including statistical heterogeneity, system constraints, and shifting task objectives. To date, this design process has been a manual, labor-intensive effort led by domain experts, resulting in static, bespoke solutions that are brittle in the face of real-world dynamics.

Figure: The intractable design space of FL, created by the combinatorial task of matching diverse challenges with specialized strategies.

The difficulty of FL design is rooted in its vast and combinatorial nature. Real-world deployments rarely present a single, isolated challenge. Instead, they involve a confluence of issues: clients may have non-IID data, heterogeneous computational capabilities, and unreliable network connections, all while pursuing diverse task objectives. Current FL research often operates in silos, developing point solutions for individual problems. While effective in isolation, these solutions are difficult to compose, and their interactions are unpredictable. This leaves practitioners facing an intractable design space. To address this bottleneck, we introduce Helmsman, a multi-agent system that automates the end-to-end synthesis of federated learning systems from high-level user specifications. By emulating a principled research and development workflow, Helmsman dramatically lowers the barrier to entry for creating robust and sophisticated FL solutions.

Figure 1: The automated FL development workflow of Helmsman. (a) Planning: A user query is refined into an actionable research plan via human-in-the-loop dialogue. (b) Coding: Specialized agent teams, managed by a Supervisor, collaboratively build a modular codebase. (c) Evaluation: The final code is autonomously tested and refined in a closed simulation loop until correct.

Helmsman: Automated FL System Synthesis

Helmsman transforms high-level user objectives into fully functional FL systems through a principled three-stage workflow that emulates human research and development practices.

📋 Planning: Transform user queries into verified research plans through automated critique and human validation.

⚙️ Coding: Supervised agent teams implement modular FL components following a dependency-aware workflow.

🔄 Evaluation: Iterative diagnosis and debugging in a sandboxed simulation until the system is certified correct (see the end-to-end sketch below).
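To make this division of labor concrete, the following is a minimal sketch of the three-phase loop in plain Python. It is an illustration only, not Helmsman's actual API: every function name and the Report type are hypothetical stand-ins, with the LLM-backed agent calls stubbed out.

from dataclasses import dataclass

@dataclass
class Report:
    passed: bool
    diagnosis: str = ""

# Hypothetical stand-ins for the LLM-backed agent phases.
def draft_and_validate_plan(query: str) -> str:
    return f"research plan for: {query}"              # Phase 1 stub

def generate_codebase(plan: str) -> str:
    return f"# modular FL code implementing: {plan}"  # Phase 2 stub

def simulate_in_sandbox(code: str) -> Report:
    return Report(passed=True)                        # Phase 3 stub

def refine(code: str, diagnosis: str) -> str:
    return code + f"\n# patched after: {diagnosis}"

def run_helmsman(user_query: str, max_refinements: int = 5) -> str:
    plan = draft_and_validate_plan(user_query)        # (a) Planning
    code = generate_codebase(plan)                    # (b) Coding
    for _ in range(max_refinements):                  # (c) Evaluation
        report = simulate_in_sandbox(code)
        if report.passed:
            return code
        code = refine(code, report.diagnosis)
    raise RuntimeError("refinement budget exhausted without a passing run")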

Interactive and Verifiable Planning

The Planning Agent refines a user's high-level query into an actionable research plan through human-in-the-loop dialogue: drafts are subjected to automated critique, revised, and adopted only once the user validates them, so implementation starts from a sound plan.
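A minimal sketch of this verification loop follows, assuming an LLM-backed drafting and critique step (stubbed out here) and a console prompt standing in for the human reviewer; all names are illustrative, not part of Helmsman.

def draft_plan(query: str) -> str:
    return f"Research plan for: {query}"          # placeholder LLM call

def critique_plan(plan: str) -> str:
    return ""                                     # placeholder: no objections

def revise_plan(plan: str, feedback: str) -> str:
    return plan + f"\n- revised per: {feedback}"  # placeholder LLM call

def plan_interactively(query: str, max_rounds: int = 3) -> str:
    plan = draft_plan(query)
    for _ in range(max_rounds):
        critique = critique_plan(plan)            # automated self-critique
        if critique:
            plan = revise_plan(plan, critique)
            continue
        # Human validation gate: the plan is only adopted once approved.
        if input(f"{plan}\nApprove plan? [y/N] ").strip().lower() == "y":
            return plan
        plan = revise_plan(plan, input("Feedback: "))
    raise RuntimeError("plan not approved within the round budget")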

Modular Code Generation via Supervised Agent Teams

A Supervisor agent coordinates specialized agent teams that collaboratively build a modular codebase, dispatching implementation tasks in dependency order so that shared modules exist before the components that build on them.
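The dependency-aware scheduling can be illustrated with Python's standard-library graphlib. The module graph below is a hypothetical example of a generated FL codebase, not the exact decomposition Helmsman produces.

from graphlib import TopologicalSorter

# Hypothetical dependency graph for a generated FL codebase:
# each module maps to the modules it depends on.
deps = {
    "strategy.py": {"model.py"},
    "client.py":   {"model.py", "data.py"},
    "server.py":   {"model.py", "strategy.py"},
    "main.py":     {"client.py", "server.py"},
}

# The Supervisor dispatches coding tasks so that shared modules are
# implemented before the components that build on them.
for module in TopologicalSorter(deps).static_order():
    print(f"dispatch coding task: {module}")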

Autonomous Evaluation and Refinement

The generated system is executed in a sandboxed simulation environment; failures are diagnosed and fed back for debugging, and the loop repeats until the code runs correctly.
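A minimal sketch of this closed loop, assuming the generated system exposes a single simulation entrypoint: the subprocess stands in for the sandbox, and repair is a hypothetical hook back into the coding agents.

import subprocess
import sys

def evaluate_and_refine(entrypoint: str, repair, budget: int = 5) -> bool:
    """Run the generated FL simulation; loop until it exits cleanly."""
    for _ in range(budget):
        try:
            result = subprocess.run(
                [sys.executable, entrypoint],
                capture_output=True, text=True, timeout=600,
            )
        except subprocess.TimeoutExpired:
            repair(entrypoint, "simulation timed out")
            continue
        if result.returncode == 0:
            return True                      # certified: simulation passed
        repair(entrypoint, result.stderr)    # diagnose and patch the code
    return False                             # refinement budget exhausted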

AgentFL-Bench

To rigorously evaluate Helmsman's capabilities, we introduce AgentFL-Bench, a benchmark comprising 16 diverse tasks across 5 pivotal FL research domains: data heterogeneity, communication efficiency, personalization, active learning, and continual learning. Each task is defined by a standardized natural language query that provides an unambiguous problem specification, minimizing prompt-engineering variability and enabling fair comparisons between different agentic systems.

The tasks are designed for realism, reflecting the multifaceted nature of real-world FL challenges where issues like non-IID data and system constraints often co-occur. This requires agents to reason about and integrate appropriate strategies rather than solving isolated toy problems.

Table 1: The structured natural language template for specifying tasks to Helmsman. This template guides the user to provide a comprehensive and unambiguous problem definition, ensuring the Planning Agent receives the necessary context regarding the application domain, data characteristics, and desired FL objectives. The provided query example shows a complete instantiation of this template.

Component | Description | Pattern Template
Problem Statement | Defines the high-level deployment context, including the application domain, system scale, and target infrastructure. | "I need to deploy [application type] on/across [number] [device types]..."
Task Description | Specifies the data characteristics, heterogeneity patterns, and any domain-specific challenges. | "Each client holds [dataset characteristics] with [specific challenge]..."
Framework Requirement | Outlines the core FL objectives, the model architecture to be trained, and the metrics for evaluation. | "Help me build a federated learning framework that [objectives], training a [model architecture], evaluating performance by [metrics]."

Query Example:

"I need to deploy a personalized handwriting recognition app across 15 mobile devices. Each client holds FEMNIST data from individual users with unique writing styles. Help me build a personalized federated learning framework that balances global knowledge with local user adaptation for a CNN model, evaluating performance by average client test accuracy."

Experimental Results

Extensive experiments on AgentFL-Bench demonstrate that Helmsman generates solutions competitive with, and often superior to, established hand-crafted baselines. Our evaluation spans diverse challenges across heterogeneous FL, communication efficiency, personalization, active learning, and continual learning domains.

Table 2: Test accuracy (%) on heterogeneous FL benchmarks. Footnote symbols in the Specialized column identify the task-specific baseline used for each task.

ID | Task | Dataset | Problem | FedAvg | FedProx | Specialized | Ours

Data Heterogeneity
Q1 | Object Recognition | CIFAR-10-LT | Quantity Skew | 70.68 | 69.13 | 78.39* | 76.26
Q2 | Object Recognition | CIFAR-100-C | Feature Skew | 35.29 | 36.93 | 42.13† | 39.43
Q3 | Object Recognition | CIFAR-10N | Label Noise | 74.32 | 79.42* | 80.67† | 81.14

Distribution Shift
Q4 | Object Recognition | Office-Home | Domain Shift | 52.94 | 50.81 | 57.68* | 53.47
Q5 | Human Activity | HAR | User Heterogeneity | 94.38 | 95.66 | 95.77* | 96.74
Q6 | Speech Recognition | Speech Commands | Speaker Variation | 84.93 | 84.11 | 83.08* | 86.19
Q7 | Medical Diagnosis | Fed-ISIC2019 | Site Heterogeneity | 55.43 | 61.79 | 63.62* | 63.80
Q8 | Object Recognition | Caltech101 | Class Imbalance | 48.21 | 47.75 | 63.62* | 50.92

System Heterogeneity
Q9 | Object Recognition | CIFAR-100 | Resource Constraint | 60.83 | 59.86 | 62.19‡ | 63.49

Specialized baselines: * FedNova, † FedNS, ‡ HeteroFL.

Discussion

Novel Algorithmic Discovery

For the complex continual learning task Q16, Helmsman discovered a novel hybrid strategy that combines client-side experience replay with global model distillation, improving accuracy by 79% over FedWeIT while reducing catastrophic forgetting by 71%.
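The exact formulation is not given here, so the following is a hedged sketch of such a hybrid client objective in PyTorch: cross-entropy on the current task and on a replayed batch, plus a distillation term against the frozen global model. The loss weight alpha and the temperature are illustrative, not the discovered values.

import torch
import torch.nn.functional as F

def client_loss(local_model, global_model, batch, replay_batch=None,
                alpha=0.5, temperature=2.0):
    """Hybrid objective: task loss + experience replay + distillation."""
    x, y = batch
    loss = F.cross_entropy(local_model(x), y)          # current task

    if replay_batch is not None:                       # client-side replay
        rx, ry = replay_batch
        loss = loss + F.cross_entropy(local_model(rx), ry)

    with torch.no_grad():                              # frozen global teacher
        teacher = F.softmax(global_model(x) / temperature, dim=1)
    student = F.log_softmax(local_model(x) / temperature, dim=1)
    kd = F.kl_div(student, teacher, reduction="batchmean") * temperature**2
    return loss + alpha * kd                           # distillation term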

Limits of Autonomy

Helmsman completed 62.5% of AgentFL-Bench tasks fully autonomously; the remaining 37.5% required human intervention to resolve planning ambiguities or domain-specific coding issues. Autonomy is further bounded by the computational budget available for iterative refinement.

Conclusion

Helmsman demonstrates that coordinated multi-agent collaboration can automate end-to-end FL system design, from initial planning through implementation and evaluation. By handling the complex design-implementation-testing cycle autonomously, Helmsman lowers barriers to creating sophisticated FL solutions and achieves competitive performance across diverse research domains. Future work will explore self-evolutionary capabilities, enabling the system to refine its strategies from experimental feedback toward fully autonomous FL engineering.

Citation

If you find Helmsman useful for your research, please consider citing our work:

@article{li2025helmsman,
  title={Helmsman: Autonomous Synthesis of Federated Learning Systems via Multi-Agent Collaboration},
  author={Li, Haoyuan and Funk, Mathias and Saeed, Aaqib},
  journal={arXiv preprint arXiv:2510.14512},
  year={2025}
}