UB-SMoE: Universally Balanced Sparse Mixture-of-Experts for Resource-adaptive Federated Fine-tuning of Foundation Models
Abstract
Heterogeneous LoRA-rank methods address system heterogeneity in federated fine-tuning of foundation models by assigning client-specific ranks based on computational capabilities. However, these methods achieve only marginal computational savings, as dense feed-forward computations dominate. Sparse Mixture-of-Experts (SMoE) provides a promising alternative through conditional computation, yet we identify that its naive application to heterogeneous federated settings introduces two critical discordances: (i) expert utilization imbalance and (ii) non-differentiability of Top-K routing.
Our convergence analysis demonstrates that these discordances lead to degraded convergence, particularly for resource-constrained clients. To address these challenges, we propose Universally Balanced Sparse Mixture-of-Experts (UB-SMoE), which introduces Dynamic Modulated Routing (DMR) to rebalance expert utilization, and a Universal Pseudo-Gradient (PG) to reconstruct learning signals for non-activated experts. These mechanisms form a self-reinforcing cycle that maintains expert viability across heterogeneous clients. Experiments on the Commonsense-15K and TeleQuAD benchmarks show that UB-SMoE achieves up to 45.0% computational reduction on low-resource clients while improving their performance by 8.7× compared to existing heterogeneous LoRA-rank methods.
Motivation
Real-world federated networks comprise devices with vastly different computational budgets. A single model configuration for all devices is ineffective: we cannot exploit high-end clients if the global model is shrunk for edge devices, and a large model cannot run on low-resource clients. We need resource-adaptive fine-tuning, where models scale their capacity to match client capabilities.
Dense FFN Computation Dominates
LoRA injects low-rank updates whose cost scales as $\mathcal{O}(r_c(d{+}l))$ per layer, but this is negligible against the FFN cost of $\mathcal{O}(d{\cdot}l)$, which is unchanged regardless of rank. Heterogeneous LoRA-rank methods therefore yield only ~5% computation reduction for low-resource clients, and the merged weights $W_0 + \Delta W$ remain dense at inference — deployment latency is unchanged.
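To make the scaling argument concrete, the following is a minimal sketch that compares the two cost terms for a single layer. The dimensions `d` and `l` and the ranks are hypothetical values chosen only for illustration, and constants hidden by the big-O notation are ignored, so the printed shares are indicative rather than measured.

```python
def lora_cost(rank: int, d: int, l: int) -> int:
    """Multiply-accumulate count of the low-rank path: r_c * (d + l)."""
    return rank * (d + l)


def ffn_cost(d: int, l: int) -> int:
    """Multiply-accumulate count of one dense d x l projection: d * l."""
    return d * l


# Hypothetical layer sizes (not taken from the paper).
d, l = 2048, 8192

for rank in (4, 16, 64):
    lora = lora_cost(rank, d, l)
    dense = ffn_cost(d, l)
    share = lora / (lora + dense)
    # The dense term is identical for every rank, so lowering the rank
    # can only remove the small LoRA share of the total.
    print(f"rank={rank:>2}  LoRA share of layer cost = {share:.2%}")
```

Under these assumed sizes the rank-dependent term stays a small fraction of the layer cost, which is why shrinking the rank alone cannot deliver large savings.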
Two Optimization Discordances
SMoE offers natural resource adaptability (high-resource clients activate more experts, low-resource clients fewer), but its naive use in heterogeneous FL triggers (i) expert utilization imbalance and (ii) non-differentiability of Top-K routing. Together these degrade convergence for the clients that can least afford it.
The Two Critical Discordances
Expert Utilization Imbalance
Experts activated by high-resource clients receive frequent updates and become over-specialized, while those relevant to low-resource clients remain severely under-utilized. This causes a “rich-get-richer” dynamic where a few experts dominate and others collapse.
Non-Differentiability of Top-K Routing
The gating function $\gamma_i(x)$ is non-zero only for selected experts, so non-activated experts receive zero gradients. Low-resource clients with high sparsity therefore leave most experts without any learning signal round after round.
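The effect can be seen in a small sketch of Top-K gating. It assumes one common convention (softmax renormalized over the selected experts); the paper's exact gate may differ, and the logits are made-up values. Because the layer output is a gate-weighted sum of expert outputs, any expert whose gate is exactly zero contributes nothing to the output and therefore receives a zero gradient.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_gates(logits, k):
    """Gate values: softmax over the k largest logits, zero for all others."""
    selected = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    probs = softmax([logits[i] for i in selected])
    gates = [0.0] * len(logits)
    for i, p in zip(selected, probs):
        gates[i] = p
    return gates


# Hypothetical router logits for 8 experts.
logits = [2.0, 0.5, 1.5, -1.0, 0.0, 1.0, -0.5, 0.2]

high = top_k_gates(logits, k=4)  # high-resource client
low = top_k_gates(logits, k=1)   # low-resource client

# Experts with a zero gate get no gradient in this step.
print("experts without signal (k=4):", [i for i, g in enumerate(high) if g == 0.0])
print("experts without signal (k=1):", [i for i, g in enumerate(low) if g == 0.0])
```

With `k=1`, seven of the eight experts in this toy example receive no learning signal from the client.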
Key Insight: Our convergence analysis (Theorem 4.1) proves these discordances introduce a bias in the stochastic-gradient estimator that creates an irreducible error floor in the global objective — one that scales inversely with client computational budgets, explaining why resource-constrained clients systematically underperform in federated SMoE systems.
Theoretical Foundation
We formally analyze the convergence of heterogeneous federated SMoE and show that the two discordances are not merely empirical nuisances — they impose a fundamental limit on accuracy.
Biased Sparse-MoE Gradient Estimator
Because Top-K routing zeroes out non-activated experts, the local stochastic gradient is a biased estimator of the true gradient. The bias is proportional to the gating mass placed on experts a client never activates, and that mass grows with sparsity (i.e., with how resource-constrained the client is).
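As a rough illustration of why the bias grows with sparsity, the sketch below measures how much softmax probability falls on experts outside the Top-K set as K shrinks. This is only a proxy under stated assumptions: the exact bias term of Theorem 4.1 is not reproduced here, and the logits are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def inactive_mass(logits, k):
    """Softmax probability mass on experts that Top-K leaves inactive."""
    probs = softmax(logits)
    kept = sorted(probs, reverse=True)[:k]
    return 1.0 - sum(kept)


# Hypothetical router logits for 8 experts.
logits = [2.0, 0.5, 1.5, -1.0, 0.0, 1.0, -0.5, 0.2]

# Smaller k (tighter budget) leaves more mass on experts that are never updated.
masses = [inactive_mass(logits, k) for k in (1, 2, 4, 8)]
for k, mass in zip((1, 2, 4, 8), masses):
    print(f"k={k}: inactive mass = {mass:.4f}")
```

The inactive mass shrinks monotonically as K grows and vanishes when all experts are active, mirroring the claim that tighter budgets induce a larger bias.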
Irreducible Error Floor
This bias translates into an irreducible error floor that scales inversely with the client's computation budget. Tighter budgets (fewer activated experts) ⇒ larger floor ⇒ worse convergence. This is precisely why simply plugging SMoE into heterogeneous FL fails the clients it was meant to help.
UB-SMoE Framework
UB-SMoE resolves both discordances with two complementary mechanisms that form a self-reinforcing cycle: PG maintains expert viability through approximate gradients, enabling DMR to balance utilization, which generates real gradients that further refine all experts.
Dynamic Modulated Routing (DMR)
DMR regulates expert selection using global utilization statistics through a learnable modulation vector $\boldsymbol{\phi}^{(l)}$. Rather than applying Top-K directly to the raw affinities $\mathbf{s}^{(l)}$, it identifies a small candidate set $\mathcal{T}^{(l)}$ (the top $N_p{=}2$ experts) and modulates only those logits: $m_i^{(l)} = s_i^{(l)} + \phi_i^{(l)}$ for $i \in \mathcal{T}^{(l)}$. $L_2$ regularization and a bounded range on $\boldsymbol{\phi}^{(l)}$ prevent the modulation from overriding semantic relevance, while a server-side utilization-aware update with momentum keeps utilization balanced.
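One possible reading of the routing step is sketched below. Several details are assumptions rather than facts from the paper: the candidate set is taken to be the `n_p` highest raw affinities, the bounded range is modeled as clipping to a hypothetical `phi_bound`, and the final choice is a plain Top-K over the modulated logits. The function name `dmr_select` and all numeric values are illustrative.

```python
def dmr_select(affinities, phi, k, n_p=2, phi_bound=1.0):
    """Select k experts after modulating only the top-n_p candidate logits.

    Assumed behaviour (not confirmed by the source): candidates are the
    n_p largest raw affinities and phi is clipped to [-phi_bound, phi_bound].
    """
    order = sorted(range(len(affinities)), key=lambda i: affinities[i], reverse=True)
    candidates = order[:n_p]

    modulated = list(affinities)
    for i in candidates:
        clipped = max(-phi_bound, min(phi_bound, phi[i]))
        modulated[i] = affinities[i] + clipped  # m_i = s_i + phi_i for i in T

    ranked = sorted(range(len(modulated)), key=lambda i: modulated[i], reverse=True)
    return sorted(ranked[:k]), modulated


# Hypothetical affinities and modulation values for 4 experts.
s = [2.0, 1.8, 1.5, 0.3]
phi = [-0.8, 0.0, 0.4, 0.9]  # expert 0 is treated as over-utilized

selected, m = dmr_select(s, phi, k=2)
print("raw Top-2:       [0, 1]")
print("modulated Top-2:", selected)
```

In this toy case the over-utilized expert 0 is demoted and expert 2 takes its place, while experts outside the candidate set keep their raw affinities, so the modulation never boosts an expert that was semantically irrelevant to begin with.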
Universal Pseudo-Gradient (PG)
PG approximately reconstructs gradients for non-activated experts, ensuring every expert receives a meaningful update in every round regardless of client sparsity. By feeding learning signals to dormant experts, PG keeps them viable — so that DMR has real experts to route to and balance. The global utilization rate $\tilde{u}_i^{(l)} = \sum_c p_c \, a_{c,i}^{(l)} / n_c^{(l)}$ closes the loop between client activations and server-side balancing.
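The utilization formula can be computed directly once its symbols are fixed. The sketch below assumes that $p_c$ is the aggregation weight of client $c$, $a_{c,i}^{(l)}$ is the number of times client $c$ activated expert $i$ in layer $l$, and $n_c^{(l)}$ is that client's total activation count in the layer. These interpretations are not spelled out in the text above, and the numbers are invented.

```python
def global_utilization(weights, activations, totals):
    """Compute u_i = sum_c p_c * a_{c,i} / n_c for a single layer."""
    num_experts = len(activations[0])
    return [
        sum(p * a[i] / n for p, a, n in zip(weights, activations, totals))
        for i in range(num_experts)
    ]


# Hypothetical statistics for 3 clients and 4 experts in one layer.
p = [0.5, 0.25, 0.25]      # assumed aggregation weights, summing to 1
a = [
    [60, 30, 10, 0],       # high-resource client spreads over more experts
    [8, 2, 0, 0],
    [5, 5, 0, 0],
]
n = [100, 10, 10]          # per-client activation totals

u = global_utilization(p, a, n)
print([round(x, 3) for x in u])
```

Here expert 3 ends up with zero global utilization, which is the kind of imbalance the server-side update is meant to detect.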
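Why the cycle works: PG keeps experts alive → DMR can safely promote under-utilized experts → balanced utilization produces real (not pseudo) gradients → experts are refined → viability is sustained. Each mechanism helps on its own, but the full combination is strongest: in the ablation below, PG alone reaches 0.2659 average accuracy, φ regularization with the utilization-aware update (without PG) reaches 0.4009, and all three components together reach 0.4267.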
Main Results
We evaluate UB-SMoE on Commonsense-15K (8 commonsense-reasoning datasets) and the TeleQuAD telecommunications question-answering benchmark, using OLMoE-1B-7B and OLMo-1B base models. Baselines span heterogeneous LoRA-rank methods (HetLoRA, FlexLoRA, FLoRA, FLoRIST) and heterogeneous sparsity methods (A$^3$SMoE, SMoE-LLB).
Commonsense-15K — SMoE Backbone (OLMoE-1B-7B)
| Dataset | HetLoRA | FlexLoRA | FLoRA | FLoRIST | SMoE-LLB | A³SMoE | UB-SMoE |
|---|---|---|---|---|---|---|---|
| ARC-Challenge | 0.1284 | 0.3012 | 0.0868 | 0.0960 | 0.3080 | 0.3347 | 0.3611 |
| ARC-Easy | 0.1582 | 0.4136 | 0.1004 | 0.1097 | 0.4213 | 0.4724 | 0.5017 |
| BoolQ | 0.3709 | 0.4573 | 0.3472 | 0.3407 | 0.5122 | 0.4301 | 0.4952 |
| HellaSwag | 0.0714 | 0.1607 | 0.0674 | 0.0605 | 0.2096 | 0.2448 | 0.2258 |
| OpenBookQA | 0.1200 | 0.3225 | 0.1160 | 0.1380 | 0.3360 | 0.3925 | 0.4030 |
| PIQA | 0.2278 | 0.3327 | 0.2285 | 0.2319 | 0.4244 | 0.4219 | 0.5118 |
| Social IQa | 0.1452 | 0.3099 | 0.0885 | 0.0814 | 0.3801 | 0.4193 | 0.4486 |
| WinoGrande | 0.2770 | 0.3445 | 0.1786 | 0.1257 | 0.4410 | 0.3733 | 0.4665 |
| Average | 0.1874 | 0.3303 | 0.1517 | 0.1480 | 0.3791 | 0.3861 | 0.4267 |
Commonsense-15K — Dense Backbone (OLMo-1B)
| Dataset | FlexLoRA | HetLoRA | FLoRA | FLoRIST | UB-SMoE |
|---|---|---|---|---|---|
| ARC-Challenge | 0.2654 | 0.2159 | 0.1212 | 0.0734 | 0.5333 |
| ARC-Easy | 0.2740 | 0.2180 | 0.1077 | 0.0614 | 0.7184 |
| BoolQ | 0.5101 | 0.5523 | 0.2651 | 0.1917 | 0.4697 |
| HellaSwag | 0.1358 | 0.1104 | 0.1911 | 0.1294 | 0.3536 |
| OpenBookQA | 0.2660 | 0.2340 | 0.1000 | 0.0780 | 0.6320 |
| PIQA | 0.4777 | 0.5005 | 0.4918 | 0.4913 | 0.6931 |
| Social IQa | 0.3557 | 0.3362 | 0.1029 | 0.0583 | 0.6039 |
| WinoGrande | 0.4775 | 0.4854 | 0.2928 | 0.2092 | 0.5043 |
| Average | 0.3453 | 0.3316 | 0.2091 | 0.1616 | 0.5636 |
Performance Across Resource Budgets (the 8.7× story)
Across four computation budgets β₁–β₄, heterogeneous LoRA-rank methods collapse under tight budgets while sparsity-based methods stay effective.
| Budget | HetLoRA | FlexLoRA | FLoRA | FLoRIST | SMoE-LLB | A³SMoE | UB-SMoE |
|---|---|---|---|---|---|---|---|
| β₁ (low) | 0.0079 | 0.0456 | 0.0094 | 0.0112 | 0.3531 | 0.3629 | 0.3936 |
| β₂ | 0.0699 | 0.2818 | 0.0480 | 0.0538 | 0.3847 | 0.4310 | 0.3359 |
| β₃ | 0.2137 | 0.5375 | 0.2497 | 0.2545 | 0.3961 | 0.4096 | 0.4716 |
| β₄ (high) | 0.4580 | 0.4563 | 0.2996 | 0.2724 | 0.3824 | 0.3410 | 0.5240 |
| Average | 0.1874 | 0.3303 | 0.1517 | 0.1480 | 0.3791 | 0.3861 | 0.4313 |
TeleQuAD — Telecommunications QA (non-IID)
BERTScore F1 on the domain-specific TeleQuAD benchmark under non-IID client data, where UB-SMoE is best at every budget.
| Budget | HetLoRA | FlexLoRA | SMoE-LLB | A³SMoE | UB-SMoE |
|---|---|---|---|---|---|
| β₁ (low) | 35.84 | 37.73 | 44.56 | 37.24 | 47.66 |
| β₂ | 40.76 | 44.13 | 59.68 | 41.71 | 60.04 |
| β₃ | 44.63 | 42.87 | 65.98 | 41.33 | 66.31 |
| β₄ (high) | 46.26 | 42.10 | 68.23 | 42.52 | 68.29 |
| Average | 41.87 | 41.71 | 59.61 | 40.70 | 60.58 |
Ablation — Contribution of Each Mechanism
| Pseudo-Gradient | φ Regularization | Utilization-aware Update | Avg. Accuracy |
|---|---|---|---|
| – | – | – | 0.1701 |
| ✓ | – | – | 0.2659 |
| ✓ | ✓ | – | 0.2839 |
| – | ✓ | – | 0.3591 |
| – | ✓ | ✓ | 0.4009 |
| ✓ | ✓ | ✓ | 0.4267 |
Scalability to 32 Clients
| Heterogeneity Pattern | HetLoRA | FlexLoRA | A³SMoE | SMoE-LLB | UB-SMoE |
|---|---|---|---|---|---|
| Uniform (50% participation) | 0.1505 | 0.2070 | 0.3998 | 0.3928 | 0.4036 |
| Long-tail (75% low-resource) | 0.1832 | 0.2585 | 0.2988 | 0.2961 | 0.3047 |
| Reverse long-tail (75% high-resource) | 0.2242 | 0.2601 | 0.4148 | 0.3928 | 0.4272 |
Key Findings
Dramatic Gains for Low-Resource Clients
Where heterogeneous LoRA-rank methods effectively fail (β₁ accuracies between 0.0079 and 0.0456), UB-SMoE stays strong (0.3936), an 8.7× improvement over the best LoRA-rank baseline, with up to 45% computation reduction.
Balanced Expert Utilization
UB-SMoE achieves the highest global utilization entropy ($H \approx 6.5$) with lower variance, progressively improving balance across communication rounds — eliminating the “rich-get-richer” collapse of naive federated SMoE.
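For readers unfamiliar with the metric, a minimal sketch of Shannon entropy over a utilization distribution follows. The logarithm base (bits here) and the single-layer, four-expert setting are assumptions; the text does not say how $H$ is aggregated, so the values below are not comparable with $H \approx 6.5$.

```python
import math


def utilization_entropy(utilization):
    """Shannon entropy (in bits) of a normalized utilization vector."""
    total = sum(utilization)
    probs = [u / total for u in utilization if u > 0]
    return -sum(p * math.log2(p) for p in probs)


# Hypothetical utilization profiles over 4 experts.
balanced = [0.25, 0.25, 0.25, 0.25]
skewed = [0.85, 0.05, 0.05, 0.05]

h_balanced = utilization_entropy(balanced)
h_skewed = utilization_entropy(skewed)
print(f"balanced: {h_balanced:.3f} bits, skewed: {h_skewed:.3f} bits")
```

Higher entropy means activations are spread more evenly across experts; a collapsed, "rich-get-richer" profile scores lower.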
Robust to Scale & Heterogeneity
UB-SMoE generalizes across backbones (OLMoE-1B-7B, dense OLMo-1B), domains (commonsense reasoning, telecom), and data distributions (IID and non-IID), and it scales to 32 clients under uniform, long-tail, and reverse long-tail patterns.
Citation
If you find this work useful in your research, please consider citing:
@inproceedings{tran2026ubsmoe,
  title     = {UB-SMoE: Universally Balanced Sparse Mixture-of-Experts for Resource-adaptive Federated Fine-tuning of Foundation Models},
  author    = {Tran, Van-Tuan and Nguyen-Le, Hong-Hanh and Ruffini, Marco and Dzaferagic, Merim},
  booktitle = {Proceedings of the International Conference on Machine Learning (ICML)},
  year      = {2026}
}