StyleExpert: Mixture of Style Experts for Diverse Image Stylization

Shihao Zhu1 Ziheng Ouyang1 Yijia Kang1 Qilong Wang3
Mi Zhou4 Bo Li4 Ming-Ming Cheng1,2 Qibin Hou1,2

1VCIP, CS, Nankai University    2NKIARI, Shenzhen Futian
3Tianjin University    4vivo BlueImage Lab

Paper Code HF Model HF StyleExpert-40K

Teaser Image

Abstract

Diffusion-based stylization has advanced significantly, yet existing methods remain limited to color-driven transformations, neglecting complex semantics and material details. We introduce StyleExpert, a semantic-aware framework built on a Mixture-of-Experts (MoE) architecture. Our framework employs a unified style encoder, trained on our large-scale dataset of content-style-stylized triplets, to embed diverse styles into a consistent latent space. This embedding conditions a similarity-aware gating mechanism that dynamically routes styles to specialized experts within the MoE architecture. Leveraging this design, our method handles diverse styles spanning multiple semantic levels, from shallow textures to deep semantics. Extensive experiments show that StyleExpert outperforms existing approaches in preserving semantics and material details, while generalizing to unseen styles.

Visual Comparison on Unseen Styles

Qualitative comparison

Diverse Image Stylization Results

Visual Result 1
Visual Result 2
Visual Result 3
Visual Result 4

Method

Method Overview

Overview of the proposed StyleExpert. Our method comprises two training stages. First, a style encoder is trained with an InfoNCE loss to learn discriminative style representations, which also speeds up convergence in the second stage. Second, the pre-trained encoder provides style priors that guide the router network while training the MoE LoRA adapters, enabling each layer to dynamically select the experts best suited to a given style.
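The similarity-aware routing described above can be sketched in a few lines. This is a minimal, hypothetical illustration (not the released implementation): it assumes each expert carries a learned style prototype and a rank-`RANK` LoRA update, gates experts by cosine similarity between the style embedding and the prototypes, and blends the top-k expert updates into the hidden state. All names, dimensions, and the top-k value are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
D_STYLE, D_HID, N_EXPERTS, RANK, TOP_K = 64, 128, 4, 8, 2  # illustrative sizes

# Hypothetical per-expert style prototypes (learned anchors in style space).
prototypes = rng.standard_normal((N_EXPERTS, D_STYLE))

# Hypothetical LoRA factors: each expert's update is B @ A (rank-RANK).
lora_A = rng.standard_normal((N_EXPERTS, RANK, D_HID)) * 0.01
lora_B = rng.standard_normal((N_EXPERTS, D_HID, RANK)) * 0.01

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def route(style_emb, hidden):
    """Similarity-aware gating sketch: cosine similarity between the
    style embedding and each prototype picks the top-k experts, whose
    LoRA updates are blended by renormalized gate weights."""
    sims = prototypes @ style_emb / (
        np.linalg.norm(prototypes, axis=1) * np.linalg.norm(style_emb) + 1e-8)
    top = np.argsort(sims)[-TOP_K:]          # indices of the top-k experts
    gates = softmax(sims[top])               # renormalize gates over top-k
    delta = sum(g * (lora_B[i] @ (lora_A[i] @ hidden))
                for g, i in zip(gates, top))
    return hidden + delta, top, gates

style_emb = rng.standard_normal(D_STYLE)
hidden = rng.standard_normal(D_HID)
out, experts, gates = route(style_emb, hidden)
print(out.shape, len(experts), float(gates.sum()))
```

In the actual framework this gating would run per layer inside the diffusion backbone, with the style prior supplied by the pre-trained encoder rather than a random vector.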

Dataset Curation (StyleExpert-500K)

Dataset Overview

To overcome the color-semantic imbalance in existing datasets, we construct StyleExpert-500K, containing approximately 500,000 content-style-stylized triplets.

We leverage 209 high-quality, style-centric LoRA models from the Hugging Face community. Using a large vision-language model (Qwen), we carefully rewrite image captions to remove confounding stylistic information, ensuring each prompt describes only the objective content. A rigorous VLM-based filtering process then prunes results with poor stylization or layout degradation, yielding a highly coherent triplet dataset that prioritizes semantic information (such as texture and material) over superficial color features.
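The filtering step can be sketched as a simple threshold pass over VLM ratings. This is a hedged illustration only: the `Triplet` fields, the two score axes (stylization fidelity and layout preservation), and the threshold values are assumptions, not details taken from the paper.

```python
from dataclasses import dataclass

@dataclass
class Triplet:
    content_caption: str   # VLM-rewritten, style-free content description
    style_score: float     # hypothetical VLM rating of stylization fidelity, 0..1
    layout_score: float    # hypothetical VLM rating of layout preservation, 0..1

def filter_triplets(triplets, min_style=0.7, min_layout=0.6):
    """Keep triplets whose VLM ratings pass both thresholds
    (threshold values are illustrative, not from the paper)."""
    return [t for t in triplets
            if t.style_score >= min_style and t.layout_score >= min_layout]

pool = [
    Triplet("a cat on a wooden chair", 0.92, 0.88),  # kept
    Triplet("a city street at dusk",   0.55, 0.90),  # pruned: weak stylization
    Triplet("a bowl of fruit",         0.81, 0.40),  # pruned: layout degraded
]
kept = filter_triplets(pool)
print(len(kept))  # 1
```

In practice the two scores would come from prompting the VLM on each (content, style, stylized) triplet; the two-axis check mirrors the paper's stated goals of strong stylization without layout degradation.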

Dataset Pipeline