Poster Session: Thu, Jul 9, 2026 • 2:30 PM – 4:15 PM KST • Coex: HALL A, Seoul, South Korea

Mixing Expertise with Confidence: A Mixture of Experts Framework for Robust Multi-Modal Continual Learning

Md Abdullah Al Forhad¹, Yuansheng Zhu², Abhinab Acharya², Xumin Liu², Qi Yu², Weishi Shi¹
¹University of North Texas
²Rochester Institute of Technology

The Mixture of Experts (MoE) framework is widely used in continual learning to mitigate catastrophic forgetting. MoEs typically combine a small inter-task shared parameter space with largely independent expert parameters. However, as the number of tasks increases, the shared space becomes a bottleneck and reintroduces forgetting. Fully independent experts avoid this bottleneck but require explicit task ID predictors (e.g., routers), which adds complexity. In this work, we eliminate both the inter-task shared parameter space and the need for a task ID predictor by letting experts communicate and share knowledge dynamically, akin to human collaboration. We enable this inter-expert knowledge sharing by leveraging the open-set learning capabilities of a multimodal foundation model (e.g., CLIP), which provides “expert priors” that bolster each expert’s task-specific representations. Guided by these priors, experts learn calibrated inter-task posteriors. Additionally, multivariate Gaussians over the learned posteriors promote complementary specialization among experts. We propose new evaluation benchmarks that simulate realistic continual learning scenarios. Our prior-conditioned strategy consistently outperforms existing methods across diverse settings without relying on reference datasets or replay memory.

Method Overview

Method overview: MoE confidence framework

Figure 1: (a) CLIP confidently predicts the correct class “cat” when the true label is present, but makes a high-confidence mistake (open-set error) when the “cat” label is absent. (b) Training: open-set errors from previous experts filter semantically similar samples (Equation 3), and selective scaling (shown as darker green) calibrates the current expert (Equation 4). The expert also learns a multivariate Gaussian from the vision embeddings (Equation 7). (c) Expert architecture: experts are adapters that support residual updates by caching frozen activations, so adapters can be updated without full backbone recomputation, which keeps inference efficient. (d) Inference: MD offers sample-specific guidance (darker purple denotes a larger weight) (Equation 8).
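
The Gaussian-based guidance in panels (b) and (d) can be illustrated with a minimal sketch. It assumes that "MD" denotes the Mahalanobis distance under each expert's multivariate Gaussian, and that per-expert weights come from a softmax over negative squared distances. Both are assumptions made for illustration; Equations 7 and 8 are not reproduced on this page, and the function names, the ridge term, and the temperature parameter below are hypothetical.

```python
import numpy as np


def fit_gaussian(embeddings, ridge=1e-3):
    """Fit a mean and a regularized precision matrix to one expert's embeddings."""
    mean = embeddings.mean(axis=0)
    centered = embeddings - mean
    cov = centered.T @ centered / max(len(embeddings) - 1, 1)
    cov += ridge * np.eye(cov.shape[0])  # hypothetical ridge term keeps cov invertible
    return mean, np.linalg.inv(cov)


def mahalanobis_sq(x, mean, precision):
    """Squared Mahalanobis distance of x from a Gaussian (mean, precision)."""
    diff = x - mean
    return float(diff @ precision @ diff)


def expert_weights(x, gaussians, temperature=1.0):
    """Assumed weighting rule: softmax over negative squared distances."""
    dists = np.array([mahalanobis_sq(x, m, p) for m, p in gaussians])
    logits = -dists / temperature
    logits -= logits.max()  # subtract the max for numerical stability
    w = np.exp(logits)
    return w / w.sum()


def mix_predictions(x, gaussians, expert_probs):
    """Combine per-expert class probabilities using sample-specific weights."""
    return expert_weights(x, gaussians) @ expert_probs


# Synthetic stand-ins for vision embeddings seen by two experts.
rng = np.random.default_rng(0)
emb_a = rng.normal(loc=0.0, scale=1.0, size=(200, 4))
emb_b = rng.normal(loc=5.0, scale=1.0, size=(200, 4))
gaussians = [fit_gaussian(emb_a), fit_gaussian(emb_b)]

query = np.full(4, 5.0)  # a sample lying near expert B's embeddings
weights = expert_weights(query, gaussians)

# Hypothetical per-expert class probabilities for the query (rows: experts).
expert_probs = np.array([[0.7, 0.3], [0.1, 0.9]])
mixed = mix_predictions(query, gaussians, expert_probs)
```

In this toy setup the query sits close to the second expert's embedding distribution, so that expert receives nearly all of the weight and the mixed prediction follows its output. No router or task ID predictor is involved; the weighting depends only on the stored Gaussian statistics.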

Citation

@inproceedings{forhad2026mixing,
  title={Mixing Expertise with Confidence: A Mixture of Experts Framework for Robust Multi-Modal Continual Learning},
  author={Al Forhad, Md Abdullah and Zhu, Yuansheng and Acharya, Abhinab and Liu, Xumin and Yu, Qi and Shi, Weishi},
  booktitle={Forty-Third International Conference on Machine Learning},
  year={2026},
  organization={PMLR}
}