StyleExpert: Mixture of Style Experts for Diverse Image Stylization

Shihao Zhu1 Ziheng Ouyang1 Yijia Kang1 Qilong Wang3
Mi Zhou4 Bo Li4 Ming-Ming Cheng1,2 Qibin Hou1,2

1VCIP, CS, Nankai University    2NKIARI, Shenzhen Futian
3Tianjin University    4vivo BlueImage Lab

Paper Code HF Model HF StyleExpert-40K

Teaser Image

Abstract

Diffusion-based stylization has advanced significantly, yet existing methods remain limited to color-driven transformations, neglecting complex semantics and material details. We introduce StyleExpert, a semantic-aware framework built on a Mixture-of-Experts (MoE) architecture. Our framework employs a unified style encoder, trained on our large-scale dataset of content-style-stylized triplets, to embed diverse styles into a consistent latent space. This embedding conditions a similarity-aware gating mechanism that dynamically routes styles to specialized experts within the MoE architecture. Leveraging this design, our method handles diverse styles spanning multiple semantic levels, from shallow textures to deep semantics. Extensive experiments show that StyleExpert outperforms existing approaches in preserving semantics and material details, while generalizing to unseen styles.

Visual Comparison on Unseen Styles

Qualitative comparison

Diverse Image Stylization Results

Visual Result 1
Visual Result 2
Visual Result 3
Visual Result 4

Method

Method Overview

Overview of the proposed StyleExpert. Our method comprises two training stages. First, a style encoder is trained with an InfoNCE loss to learn discriminative style representations, which also speeds up convergence in the second stage. Second, the pre-trained encoder provides style priors that guide the router network while training the MoE LoRA adapters, enabling each layer to dynamically select the experts best suited to a given style.
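The similarity-aware routing described above can be sketched in a few lines. This is a minimal, hypothetical illustration (not the released implementation): it assumes each expert carries a learned style prototype and a rank-`RANK` LoRA update, gates experts by cosine similarity between the style embedding and the prototypes, and blends the top-k expert updates into the hidden state. All names, dimensions, and the top-k value are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
D_STYLE, D_HID, N_EXPERTS, RANK, TOP_K = 64, 128, 4, 8, 2  # illustrative sizes

# Hypothetical per-expert style prototypes (learned anchors in style space).
prototypes = rng.standard_normal((N_EXPERTS, D_STYLE))

# Hypothetical LoRA factors: each expert's update is B @ A (rank-RANK).
lora_A = rng.standard_normal((N_EXPERTS, RANK, D_HID)) * 0.01
lora_B = rng.standard_normal((N_EXPERTS, D_HID, RANK)) * 0.01

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def route(style_emb, hidden):
    """Similarity-aware gating sketch: cosine similarity between the
    style embedding and each prototype picks the top-k experts, whose
    LoRA updates are blended by renormalized gate weights."""
    sims = prototypes @ style_emb / (
        np.linalg.norm(prototypes, axis=1) * np.linalg.norm(style_emb) + 1e-8)
    top = np.argsort(sims)[-TOP_K:]          # indices of the top-k experts
    gates = softmax(sims[top])               # renormalize gates over top-k
    delta = sum(g * (lora_B[i] @ (lora_A[i] @ hidden))
                for g, i in zip(gates, top))
    return hidden + delta, top, gates

style_emb = rng.standard_normal(D_STYLE)
hidden = rng.standard_normal(D_HID)
out, experts, gates = route(style_emb, hidden)
print(out.shape, len(experts), float(gates.sum()))
```

In the actual framework this gating would run per layer inside the diffusion backbone, with the style prior supplied by the pre-trained encoder rather than a random vector.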

Dataset Curation (StyleExpert-500K)

Dataset Overview

To overcome the color-semantic imbalance in existing datasets, we construct StyleExpert-500K, containing approximately 500,000 content-style-stylized triplets.

We leverage 209 high-quality, style-centric LoRA models from the Hugging Face community. Using a large vision-language model (Qwen), we carefully rewrite image captions to remove confounding stylistic information, ensuring each prompt describes only the objective content. A rigorous VLM-based filtering process then prunes results with poor stylization or layout degradation, yielding a highly coherent triplet dataset that prioritizes semantic information (such as texture and material) over superficial color features.
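The filtering step can be sketched as a simple threshold pass over VLM ratings. This is a hedged illustration only: the `Triplet` fields, the two score axes (stylization fidelity and layout preservation), and the threshold values are assumptions, not details taken from the paper.

```python
from dataclasses import dataclass

@dataclass
class Triplet:
    content_caption: str   # VLM-rewritten, style-free content description
    style_score: float     # hypothetical VLM rating of stylization fidelity, 0..1
    layout_score: float    # hypothetical VLM rating of layout preservation, 0..1

def filter_triplets(triplets, min_style=0.7, min_layout=0.6):
    """Keep triplets whose VLM ratings pass both thresholds
    (threshold values are illustrative, not from the paper)."""
    return [t for t in triplets
            if t.style_score >= min_style and t.layout_score >= min_layout]

pool = [
    Triplet("a cat on a wooden chair", 0.92, 0.88),  # kept
    Triplet("a city street at dusk",   0.55, 0.90),  # pruned: weak stylization
    Triplet("a bowl of fruit",         0.81, 0.40),  # pruned: layout degraded
]
kept = filter_triplets(pool)
print(len(kept))  # 1
```

In practice the two scores would come from prompting the VLM on each (content, style, stylized) triplet; the two-axis check mirrors the paper's stated goals of strong stylization without layout degradation.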

Dataset Pipeline