UB-SMoE: Universally Balanced Sparse Mixture-of-Experts for Resource-adaptive Federated Fine-tuning of Foundation Models

ICML 2026

Van-Tuan Tran, Hong-Hanh Nguyen-Le, Marco Ruffini, Merim Dzaferagic

Trinity College Dublin · University College Dublin

Figure 1. The UB-SMoE self-reinforcing cycle. Universal Pseudo-Gradient (PG) reconstructs learning signals for non-activated experts, maintaining their viability. This enables Dynamic Modulated Routing (DMR) to effectively balance expert utilization, which in turn generates real gradients that further refine all experts — sustaining every expert across heterogeneous clients.

8.7× better low-resource performance than heterogeneous LoRA-rank methods

45.0% computation reduction on low-resource clients

0.4267 best average accuracy on Commonsense-15K (OLMoE-1B-7B)

Abstract

Heterogeneous LoRA-rank methods address system heterogeneity in federated fine-tuning of foundation models by assigning client-specific ranks based on computational capabilities. However, these methods achieve only marginal computational savings, as dense feed-forward computations dominate. Sparse Mixture-of-Experts (SMoE) provides a promising alternative through conditional computation, yet we identify that its naive application to heterogeneous federated settings introduces two critical discordances: (i) expert utilization imbalance and (ii) non-differentiability of Top-K routing.

Our convergence analysis demonstrates that these discordances lead to degraded convergence, particularly for resource-constrained clients. To address these challenges, we propose Universally Balanced Sparse Mixture-of-Experts (UB-SMoE), which introduces Dynamic Modulated Routing (DMR) to rebalance expert utilization, and a Universal Pseudo-Gradient (PG) to reconstruct learning signals for non-activated experts. These mechanisms form a self-reinforcing cycle that maintains expert viability across heterogeneous clients. Experiments on benchmarks show that UB-SMoE achieves up to 45.0% computational reduction on low-resource clients while improving their performance by 8.7× compared to existing heterogeneous LoRA-rank methods.

Motivation

Real-world federated networks comprise devices with vastly different computational budgets. A single model configuration for all devices is ineffective: we cannot exploit high-end clients if the global model is shrunk for edge devices, and a large model cannot run on low-resource clients. We need resource-adaptive fine-tuning, where models scale their capacity to match client capabilities.

Limitation of Heterogeneous LoRA-rank

Dense FFN Computation Dominates

LoRA injects low-rank updates whose cost scales as $\mathcal{O}(r_c(d{+}l))$ per layer, but this is negligible against the FFN cost of $\mathcal{O}(d{\cdot}l)$, which is unchanged regardless of rank. Heterogeneous LoRA-rank methods therefore yield only ~5% computation reduction for low-resource clients, and the merged weights $W_0 + \Delta W$ remain dense at inference — deployment latency is unchanged.
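The gap between the two cost terms can be checked with a few lines of arithmetic. The sketch below is illustrative only: the dimensions `d` and `l`, the rank values, and the helper names `lora_cost` and `ffn_cost` are assumptions made for this example, not settings from the paper.

```python
def lora_cost(rank: int, d: int, l: int) -> int:
    """Per-token multiply-accumulates of one LoRA update: r * (d + l)."""
    return rank * (d + l)


def ffn_cost(d: int, l: int) -> int:
    """Per-token multiply-accumulates of one dense d x l projection: d * l."""
    return d * l


# Hypothetical layer dimensions, chosen only for illustration.
d, l = 2048, 8192
dense = ffn_cost(d, l)

for rank in (4, 16, 64):
    extra = lora_cost(rank, d, l)
    share = extra / (dense + extra)  # fraction of layer cost that depends on rank
    print(f"rank={rank:>2}  LoRA share of layer cost = {share:.2%}")
```

Under these assumed dimensions the rank-dependent term stays a small fraction of the layer cost, so lowering the rank for a weak client removes very little work.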

Limitation of Naive Federated SMoE

Two Optimization Discordances

SMoE offers natural resource adaptability (high-resource clients activate more experts, low-resource clients fewer), but its naive use in heterogeneous FL triggers (i) expert utilization imbalance and (ii) non-differentiability of Top-K routing. Together these degrade convergence for the clients that can least afford it.

The Two Critical Discordances

Discordance 1

Expert Utilization Imbalance

Experts activated by high-resource clients receive frequent updates and become over-specialized, while those relevant to low-resource clients remain severely under-utilized. This causes a “rich-get-richer” dynamic where a few experts dominate and others collapse.

Discordance 2

Non-Differentiability of Top-K Routing

The gating function $\gamma_i(x)$ is non-zero only for selected experts, so non-activated experts receive zero gradients. Low-resource clients with high sparsity therefore leave most experts without any learning signal round after round.
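A small NumPy sketch makes the starvation effect concrete. It assumes the gate is a softmax restricted to the Top-K scores, which is one common formulation and may differ from the exact gate used in the paper; `topk_gate` and the score values are hypothetical.

```python
import numpy as np


def topk_gate(scores: np.ndarray, k: int) -> np.ndarray:
    """Softmax over the k largest scores; every other expert gets gate 0."""
    idx = np.argsort(scores)[-k:]              # indices of the k selected experts
    gate = np.zeros_like(scores, dtype=float)
    z = np.exp(scores[idx] - scores[idx].max())
    gate[idx] = z / z.sum()
    return gate


# Hypothetical router scores for 8 experts on one input.
scores = np.array([2.0, 0.5, 1.5, -1.0, 0.1, 0.9, -0.3, 1.1])

for k in (4, 1):
    gate = topk_gate(scores, k)
    # The layer output is sum_i gate[i] * expert_i(x), so the gradient with
    # respect to expert i's parameters carries the factor gate[i]:
    # a zero gate means a zero gradient for that expert.
    starved = np.flatnonzero(gate == 0.0)
    print(f"k={k}: experts with zero gradient = {starved.tolist()}")
```

Shrinking `k`, as a low-resource client would, enlarges the set of experts that receive no update from that input.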

Key Insight: Our convergence analysis (Theorem 4.1) proves these discordances introduce a bias in the stochastic-gradient estimator that creates an irreducible error floor in the global objective — one that scales inversely with client computational budgets, explaining why resource-constrained clients systematically underperform in federated SMoE systems.

Theoretical Foundation

We formally analyze the convergence of heterogeneous federated SMoE and show that the two discordances are not merely empirical nuisances — they impose a fundamental limit on accuracy.

Gradient Bias

Biased Sparse-MoE Gradient Estimator

Because Top-K routing zeroes out non-activated experts, the local stochastic gradient is a biased estimator of the true gradient. The bias is proportional to the mass placed on experts a client never activates — which grows with sparsity (i.e., with how resource-constrained the client is).
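The dependence on sparsity can be illustrated numerically. The sketch below uses the softmax probability that falls outside the Top-K set for a single input as a rough stand-in for the mass on non-activated experts; it is not the bias expression from the paper's analysis, and the random scores and the helper `dropped_mass` are assumptions for illustration.

```python
import numpy as np


def dropped_mass(scores: np.ndarray, k: int) -> float:
    """Softmax probability assigned to experts outside the Top-k set."""
    p = np.exp(scores - scores.max())
    p /= p.sum()
    return float(1.0 - np.sort(p)[-k:].sum())


rng = np.random.default_rng(0)
scores = rng.normal(size=64)  # hypothetical router affinities for 64 experts

for k in (16, 8, 4, 2):
    mass = dropped_mass(scores, k)
    print(f"k={k:>2}  mass on non-activated experts = {mass:.3f}")
```

Because every softmax probability is positive, the dropped mass strictly increases as `k` decreases, which matches the qualitative claim that tighter budgets carry more bias.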

Theorem 4.1

Irreducible Error Floor

This bias translates into an irreducible error floor that scales inversely with the client's computation budget. Tighter budgets (fewer activated experts) ⇒ larger floor ⇒ worse convergence. This is precisely why simply plugging SMoE into heterogeneous FL fails the clients it was meant to help.

UB-SMoE Framework

UB-SMoE resolves both discordances with two complementary mechanisms that form a self-reinforcing cycle: PG maintains expert viability through approximate gradients, enabling DMR to balance utilization, which generates real gradients that further refine all experts.

Mechanism 1

Dynamic Modulated Routing (DMR)

DMR regulates expert selection using global utilization statistics through a learnable modulation vector $\boldsymbol{\phi}^{(l)}$. Rather than applying Top-K on raw affinities $\mathbf{s}^{(l)}$, it identifies a small candidate set $\mathcal{T}^{(l)}$ (top $N_p{=}2$ experts) and modulates only those logits: $m_i^{(l)} = s_i^{(l)} + \phi_i^{(l)}$ for $i \in \mathcal{T}^{(l)}$. An $L_2$-regularized $\phi$ with bounded range prevents the modulation from overriding semantic relevance, while a server-side utilization-aware update with momentum keeps utilization balanced.
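The routing step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the final Top-K is assumed to run on the modulated logits, the bounded range is modelled as a simple clip with a hypothetical `phi_max`, and the function name `dmr_route` is invented for this example.

```python
import numpy as np


def dmr_route(s: np.ndarray, phi: np.ndarray, k: int,
              n_p: int = 2, phi_max: float = 1.0) -> np.ndarray:
    """Return the indices of the k experts chosen after modulating candidates."""
    phi = np.clip(phi, -phi_max, phi_max)    # bounded range (phi_max is assumed)
    candidates = np.argsort(s)[-n_p:]        # T: top-N_p experts by raw affinity
    m = s.astype(float).copy()
    m[candidates] += phi[candidates]         # m_i = s_i + phi_i for i in T only
    return np.sort(np.argsort(m)[-k:])       # assumed: Top-k on modulated logits


s = np.array([3.0, 2.8, 2.7, 0.5])           # hypothetical raw affinities
phi = np.array([-0.5, 0.0, 0.2, 0.4])        # negative phi damps an over-used expert

print(dmr_route(s, np.zeros(4), k=1))        # routing without modulation
print(dmr_route(s, phi, k=1))                # routing with expert 0 damped
```

In this toy case the damped expert 0 loses the top slot to expert 1, while experts outside the candidate set keep their raw affinities, so a large `phi` on an irrelevant expert cannot pull it into the selection.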

Mechanism 2

Universal Pseudo-Gradient (PG)

PG approximately reconstructs gradients for non-activated experts, ensuring every expert receives a meaningful update in every round regardless of client sparsity. By feeding learning signals to dormant experts, PG keeps them viable — so that DMR has real experts to route to and balance. The global utilization rate $\tilde{u}_i^{(l)} = \sum_c p_c \, a_{c,i}^{(l)} / n_c^{(l)}$ closes the loop between client activations and server-side balancing.
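The utilization rate above is a weighted average of per-client activation rates, which the following sketch computes for one layer. The interpretation of $a_{c,i}^{(l)}$ as an activation count, of $n_c^{(l)}$ as the total number of activations on client $c$, and of $p_c$ as a client weight is an assumption here, as are the toy numbers and the helper name `global_utilization`.

```python
import numpy as np


def global_utilization(activations: np.ndarray, totals: np.ndarray,
                       weights: np.ndarray) -> np.ndarray:
    """u_i = sum_c p_c * a[c, i] / n[c] for one layer."""
    local_rates = activations / totals[:, None]  # per-client rate a_{c,i} / n_c
    return weights @ local_rates                 # weighted sum over clients


# Hypothetical counts: 2 clients, 4 experts.
a = np.array([[8, 0, 0, 0],    # low-resource client, activates a single expert
              [4, 4, 4, 4]])   # high-resource client, spreads over all four
n = a.sum(axis=1)              # n_c taken as total activations on client c
p = np.array([0.5, 0.5])       # client weights p_c

u = global_utilization(a, n, p)
print(u)
```

With these toy counts expert 0 ends up with a much higher global rate than the others, which is the kind of skew a server-side balancing step would act on.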

Why the cycle works: PG keeps experts alive → DMR can safely promote under-utilized experts → balanced utilization produces real (not pseudo) gradients → experts are refined → viability is sustained. Each mechanism helps on its own, and the full combination performs best (see the ablation in Table 5).

Main Results

We evaluate UB-SMoE on Commonsense-15K (8 commonsense-reasoning datasets) and the TeleQuAD telecommunications question-answering benchmark, using OLMoE-1B-7B and OLMo-1B base models. Baselines span heterogeneous LoRA-rank methods (HetLoRA, FlexLoRA, FLoRA, FLoRIST) and heterogeneous sparsity methods (A$^3$SMoE, SMoE-LLB).

Commonsense-15K — SMoE Backbone (OLMoE-1B-7B)

Table 1 · Accuracy on Commonsense-15K with OLMoE-1B-7B, averaged over all computation budgets

Dataset	HetLoRA	FlexLoRA	FLoRA	FLoRIST	SMoE-LLB	A³SMoE	UB-SMoE
ARC-Challenge	0.1284	0.3012	0.0868	0.0960	0.3080	0.3347	0.3611
ARC-Easy	0.1582	0.4136	0.1004	0.1097	0.4213	0.4724	0.5017
BoolQ	0.3709	0.4573	0.3472	0.3407	0.5122	0.4301	0.4952
HellaSwag	0.0714	0.1607	0.0674	0.0605	0.2096	0.2448	0.2258
OpenBookQA	0.1200	0.3225	0.1160	0.1380	0.3360	0.3925	0.4030
PIQA	0.2278	0.3327	0.2285	0.2319	0.4244	0.4219	0.5118
Social IQa	0.1452	0.3099	0.0885	0.0814	0.3801	0.4193	0.4486
WinoGrande	0.2770	0.3445	0.1786	0.1257	0.4410	0.3733	0.4665
Average	0.1874	0.3303	0.1517	0.1480	0.3791	0.3861	0.4267

UB-SMoE achieves the best average (0.4267), beating the strongest sparsity baseline A³SMoE by 10.5% and the strongest LoRA-rank baseline FlexLoRA by 29.1%.

Commonsense-15K — Dense Backbone (OLMo-1B)

Table 2 · Accuracy on Commonsense-15K with dense OLMo-1B; UB-SMoE evaluated at budget β₄ with matched trainable activated parameters

Dataset	FlexLoRA	HetLoRA	FLoRA	FLoRIST	UB-SMoE
ARC-Challenge	0.2654	0.2159	0.1212	0.0734	0.5333
ARC-Easy	0.2740	0.2180	0.1077	0.0614	0.7184
BoolQ	0.5101	0.5523	0.2651	0.1917	0.4697
HellaSwag	0.1358	0.1104	0.1911	0.1294	0.3536
OpenBookQA	0.2660	0.2340	0.1000	0.0780	0.6320
PIQA	0.4777	0.5005	0.4918	0.4913	0.6931
Social IQa	0.3557	0.3362	0.1029	0.0583	0.6039
WinoGrande	0.4775	0.4854	0.2928	0.2092	0.5043
Average	0.3453	0.3316	0.2091	0.1616	0.5636

Even against dense LoRA-rank adaptation under matched trainable parameters, UB-SMoE (0.5636) surpasses FlexLoRA by 63.2%, HetLoRA by 70.0%, FLoRA by 169.5%, and FLoRIST by 248.6%.

Performance Across Resource Budgets (the 8.7× story)

Across four computation budgets β₁–β₄, heterogeneous LoRA-rank methods collapse under tight budgets while sparsity-based methods stay effective.

Table 3 · Average accuracy on Commonsense-15K across computation budgets (heterogeneous setting)

Budget	HetLoRA	FlexLoRA	FLoRA	FLoRIST	SMoE-LLB	A³SMoE	UB-SMoE
β₁ (low)	0.0079	0.0456	0.0094	0.0112	0.3531	0.3629	0.3936
β₂	0.0699	0.2818	0.0480	0.0538	0.3847	0.4310	0.3359
β₃	0.2137	0.5375	0.2497	0.2545	0.3961	0.4096	0.4716
β₄ (high)	0.4580	0.4563	0.2996	0.2724	0.3824	0.3410	0.5240
Average	0.1874	0.3303	0.1517	0.1480	0.3791	0.3861	0.4313

At the tightest budget β₁, UB-SMoE (0.3936) outperforms FlexLoRA (0.0456), the strongest LoRA-rank baseline, by 8.7×, while the other LoRA-rank methods (HetLoRA 0.0079, FLoRA 0.0094, FLoRIST 0.0112) are essentially non-functional.

TeleQuAD — Telecommunications QA (non-IID)

BERTScore F1 on the domain-specific TeleQuAD benchmark under non-IID client data, where UB-SMoE is best at every budget.

Table 4 · TeleQuAD BERTScore F1 (×100) under non-IID data distribution

Budget	HetLoRA	FlexLoRA	SMoE-LLB	A³SMoE	UB-SMoE
β₁ (low)	35.84	37.73	44.56	37.24	47.66
β₂	40.76	44.13	59.68	41.71	60.04
β₃	44.63	42.87	65.98	41.33	66.31
β₄ (high)	46.26	42.10	68.23	42.52	68.29
Average	41.87	41.71	59.61	40.70	60.58

Under non-IID data, UB-SMoE improves from 47.66 at β₁ to 68.29 at β₄ and leads at every budget — demonstrating robustness to both resource constraints and data heterogeneity.

Ablation — Contribution of Each Mechanism

Table 5 · Ablation on Commonsense-15K (avg. over 8 reasoning tasks). PG = Universal Pseudo-Gradient; DMR components are φ-regularization and the utilization-aware update.

Pseudo-Gradient	φ Regularization	Utilization-aware Update	Avg. Accuracy
–	–	–	0.1701
✓	–	–	0.2659
✓	✓	–	0.2839
–	✓	–	0.3591
–	✓	✓	0.4009
✓	✓	✓	0.4267

PG alone lifts accuracy from 0.1701 → 0.2659 by providing learning signals to all experts; adding the DMR components resolves the utilization imbalance, reaching 0.4267 — confirming the two mechanisms are complementary.

Scalability to 32 Clients

Table 6 · Scalability on Commonsense-15K with 32 clients under different heterogeneity patterns (avg. over budgets)

Heterogeneity Pattern	HetLoRA	FlexLoRA	A³SMoE	SMoE-LLB	UB-SMoE
Uniform (50% participation)	0.1505	0.2070	0.3998	0.3928	0.4036
Long-tail (75% low-resource)	0.1832	0.2585	0.2988	0.2961	0.3047
Reverse long-tail (75% high-resource)	0.2242	0.2601	0.4148	0.3928	0.4272

UB-SMoE stays ahead across client scale and resource distribution, including the challenging long-tail setting dominated by low-resource clients.

Key Findings

Finding 1

Dramatic Gains for Low-Resource Clients

Where heterogeneous LoRA-rank methods effectively fail (β₁ accuracies between 0.0079 and 0.0456), UB-SMoE stays strong (0.3936): an 8.7× improvement over the best LoRA-rank baseline, with up to 45% computation reduction on low-resource clients.

Finding 2

Balanced Expert Utilization

UB-SMoE achieves the highest global utilization entropy ($H \approx 6.5$) with lower variance, progressively improving balance across communication rounds — eliminating the “rich-get-richer” collapse of naive federated SMoE.

Finding 3

Robust to Scale & Heterogeneity

UB-SMoE generalizes across backbones (OLMoE-1B-7B, dense OLMo-1B), domains (commonsense reasoning, telecom), and data distributions (IID and non-IID), and it scales to 32 clients under uniform, long-tail, and reverse long-tail patterns.

Citation

If you find this work useful in your research, please consider citing:

@inproceedings{tran2026ubsmoe,
  title     = {UB-SMoE: Universally Balanced Sparse
               Mixture-of-Experts for Resource-adaptive
               Federated Fine-tuning of Foundation Models},
  author    = {Tran, Van-Tuan and Nguyen-Le, Hong-Hanh
               and Ruffini, Marco and Dzaferagic, Merim},
  booktitle = {Proceedings of the International Conference
               on Machine Learning (ICML)},
  year      = {2026}
}