HypDAE: Hyperbolic Diffusion Autoencoders for Hierarchical Few-shot Image Generation


Few-shot image generation aims to generate diverse and high-quality images for an unseen class given only a few examples in that class. A key challenge in this task is balancing category consistency and image diversity, which often compete with each other. Moreover, existing methods offer limited control over the attributes of newly generated images. In this work, we propose Hyperbolic Diffusion Autoencoders (HypDAE), a novel approach that operates in hyperbolic space to capture hierarchical relationships among images from seen categories. By leveraging pre-trained foundation models, HypDAE generates diverse new images for unseen categories with exceptional quality by varying stochastic subcodes or semantic codes. Most importantly, the hyperbolic representation introduces an additional degree of control over semantic diversity through the adjustment of radii within the hyperbolic disk. Extensive experiments and visualizations demonstrate that HypDAE significantly outperforms prior methods by achieving a better balance between preserving category-relevant features and promoting image diversity with limited data. Furthermore, HypDAE offers a highly controllable and interpretable generation process.
Generative models can now synthesize realistic, high-fidelity images, thanks in part to large amounts of training data. However, such models often struggle to generate diverse images for a novel category from only a few examples. This task, known as few-shot image generation, aims to synthesize new images that preserve the category-level identity of the limited input samples.
Existing few-shot image generation methods are primarily GAN-based and fall into three categories: transfer-based approaches, which use meta-learning or domain adaptation for cross-category generalization but often face limited transferability; fusion-based approaches, which fuse features from multiple exemplars but may produce images with blended and unnatural appearances; and optimization-based approaches, which directly optimize on few examples but tend to overfit and lack diversity.
The core challenge in few-shot image generation lies in balancing category consistency and image diversity with limited data. Most existing methods struggle to achieve this balance: they either produce images that are too similar to the reference samples (limited diversity) or generate images that deviate significantly from the target category (poor consistency). These methods also offer limited control over the generation process, making it difficult to specify desired attributes in the generated images.
Illustration of the property of hyperbolic space on the Poincaré disk. Given two latent codes of similar images near the boundary of the Poincaré disk, the geodesic between them is the red curve, which bends toward the center, rather than the straight line it would be in Euclidean space. Their average latent code therefore lies closer to the center and can be viewed as the "parent" of the two leaf nodes. Moving the latent code from one child to another of the same parent in hyperbolic space generates diverse images without changing the category.
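The "parent" construction in this caption can be sketched with the standard Möbius operations on the Poincaré disk. This is an illustrative sketch rather than the HypDAE implementation: it assumes curvature -1 and plain Python lists, and the function names and the two example latent codes are hypothetical.

```python
import math

def dot(x, y):
    """Euclidean inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(x, y))

def mobius_add(x, y):
    """Mobius addition on the Poincare ball (curvature -1 assumed)."""
    xy, xx, yy = dot(x, y), dot(x, x), dot(y, y)
    denom = 1 + 2 * xy + xx * yy
    return [((1 + 2 * xy + yy) * a + (1 - xx) * b) / denom
            for a, b in zip(x, y)]

def mobius_scale(t, v):
    """Mobius scalar multiplication: scales the hyperbolic length of v by t."""
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        return list(v)
    factor = math.tanh(t * math.atanh(norm)) / norm
    return [factor * a for a in v]

def geodesic_point(x, y, t):
    """Point at fraction t along the geodesic from x (t=0) to y (t=1)."""
    neg_x = [-a for a in x]
    return mobius_add(x, mobius_scale(t, mobius_add(neg_x, y)))

# Hypothetical latent codes of two similar images near the boundary.
child_a = [0.9, 0.0]
child_b = [0.0, 0.9]

parent = geodesic_point(child_a, child_b, 0.5)                 # hyperbolic midpoint
euclid_mid = [(a + b) / 2 for a, b in zip(child_a, child_b)]   # straight-line midpoint

print("hyperbolic midpoint norm:", math.sqrt(dot(parent, parent)))
print("Euclidean midpoint norm: ", math.sqrt(dot(euclid_mid, euclid_mid)))
```

The hyperbolic midpoint has a smaller norm than the Euclidean one, which is the sense in which the average of two leaf codes sits closer to the center. Sweeping `t` from 0 to 1 in `geodesic_point` traces the path from one child to the other.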
To generate diverse new images from a few reference images while preserving their identity, it is essential for our model to understand the hierarchical semantic relationships between different categories and within each category. HypDAE achieves this by operating in hyperbolic space, which is naturally suited for representing hierarchical structures.
Overview of the HypDAE training pipeline. Our method consists of a hyperbolic autoencoder that captures hierarchical semantic relationships and a diffusion decoder that generates high-quality images. The hyperbolic space enables controllable generation through radius adjustment and semantic interpolation.
Unlike Euclidean space (zero curvature) or spherical space (positive curvature), hyperbolic space has negative curvature, which makes it well suited to modeling hierarchical data. It can be viewed as a continuous analog of a tree: the space available at a given radius grows exponentially with that radius, just as the number of nodes in a tree grows exponentially with depth, so different semantic levels can be laid out at different radii.
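To make the tree analogy concrete, the short sketch below compares the circumference of a circle in the Euclidean plane with that of a circle of the same radius in the hyperbolic plane. It assumes curvature -1, for which the circumference is 2π·sinh(r); the helper names and the binary branching factor are illustrative choices.

```python
import math

def euclidean_circumference(r):
    """Circumference of a circle of radius r in the flat plane."""
    return 2 * math.pi * r

def hyperbolic_circumference(r):
    """Circumference of a circle of hyperbolic radius r (curvature -1 assumed)."""
    return 2 * math.pi * math.sinh(r)

def tree_nodes_at_depth(depth, branching=2):
    """Number of nodes at a given depth of a full tree."""
    return branching ** depth

for r in (1, 2, 4, 8):
    print(
        f"r={r}: euclidean={euclidean_circumference(r):.1f}, "
        f"hyperbolic={hyperbolic_circumference(r):.1f}, "
        f"binary-tree nodes={tree_nodes_at_depth(r)}"
    )
```

The Euclidean circumference grows linearly, while the hyperbolic one grows by roughly a factor of e per unit of radius, so there is enough room to place exponentially many leaves at a roughly constant spacing.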
We use the Poincaré disk model, defined by D^n = {x ∈ R^n : ||x|| < 1}, which is preferred in gradient-based learning. The key insight is that the geodesic distance in hyperbolic space naturally captures semantic similarity, with closer points representing more similar concepts and points nearer to the center representing higher-level semantic categories.
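As a small illustration of the geodesic distance mentioned above, the following sketch implements the standard closed-form distance on the Poincaré ball. It assumes curvature -1 (the page does not state the curvature used), and the function names and sample points are hypothetical.

```python
import math

def sq_norm(v):
    """Squared Euclidean norm."""
    return sum(a * a for a in v)

def poincare_distance(x, y):
    """Geodesic distance on the Poincare ball (curvature -1 assumed)."""
    diff = [a - b for a, b in zip(x, y)]
    arg = 1 + 2 * sq_norm(diff) / ((1 - sq_norm(x)) * (1 - sq_norm(y)))
    return math.acosh(arg)

def hyperbolic_radius(x):
    """Distance from the origin in closed form: 2 * artanh(||x||)."""
    return 2 * math.atanh(math.sqrt(sq_norm(x)))

# The same Euclidean gap of 0.05, measured near the center and near the boundary.
near_center = poincare_distance([0.00, 0.0], [0.05, 0.0])
near_boundary = poincare_distance([0.90, 0.0], [0.95, 0.0])

print("near center:  ", near_center)
print("near boundary:", near_boundary)
```

The same Euclidean step covers a much larger hyperbolic distance near the boundary than near the center, which is why fine-grained distinctions can be spread out near the boundary while coarser, higher-level concepts sit near the origin.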
Our method consists of two main components: (1) a hyperbolic autoencoder that maps images to hierarchical representations in hyperbolic space, and (2) a diffusion decoder that generates high-quality images conditioned on these hyperbolic embeddings. The hyperbolic encoder learns to place semantically similar images close together in hyperbolic space while preserving a hierarchy in which category centroids sit closer to the origin.
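This page does not specify how the encoder output is projected into hyperbolic space. One common choice, shown here purely as an assumption, is the exponential map at the origin of the Poincaré ball with curvature -1. The names `exp_map_origin`, `log_map_origin`, and `MAX_NORM`, as well as the example feature vector, are hypothetical.

```python
import math

MAX_NORM = 1.0 - 1e-5  # keep points strictly inside the open unit ball

def exp_map_origin(v):
    """Map a Euclidean (tangent) vector at the origin onto the Poincare ball."""
    norm = math.sqrt(sum(a * a for a in v))
    if norm == 0.0:
        return list(v)
    # tanh saturates to 1.0 in floating point for large inputs, so clip the norm.
    target = min(math.tanh(norm), MAX_NORM)
    return [target * a / norm for a in v]

def log_map_origin(y):
    """Map a point inside the ball back to the tangent space at the origin.

    Inverts exp_map_origin whenever the clipping above was not triggered.
    """
    norm = math.sqrt(sum(a * a for a in y))
    if norm == 0.0:
        return list(y)
    scale = math.atanh(norm) / norm
    return [scale * a for a in y]

feature = [0.3, -0.4]              # hypothetical Euclidean encoder output
z = exp_map_origin(feature)        # hyperbolic embedding inside the unit disk
recovered = log_map_origin(z)      # back to the Euclidean feature

print("embedding:", z)
print("recovered:", recovered)
```

Under this assumption, a decoder can be conditioned either on `z` directly or on its tangent-space image `log_map_origin(z)`; which of these HypDAE uses is not stated here.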
We demonstrate HypDAE's effectiveness on few-shot image generation across multiple datasets. Our method generates diverse and high-quality images while preserving category-specific features from just a few reference examples.
Few-shot image generation results. HypDAE generates diverse images that maintain category consistency while showing clear visual variations. The hyperbolic representation allows for controlled diversity through radius adjustment.
A key advantage of HypDAE is its ability to perform hierarchical image generation by leveraging the structure of hyperbolic space. By adjusting the radius in hyperbolic space, we can control the semantic level of generated variations.
Hierarchical generation results showing how HypDAE can generate images at different semantic levels by varying the hyperbolic radius. Smaller radii produce more diverse variations, while larger radii maintain stronger category consistency.
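Radius adjustment can be sketched as moving a latent code along the ray through the origin until it reaches a chosen hyperbolic radius. The sketch assumes a Poincaré disk with curvature -1, where a point of Euclidean norm r lies at hyperbolic radius 2·artanh(r). The function names and the example code `z` are hypothetical, and the decoding step is omitted.

```python
import math

def hyperbolic_radius(z):
    """Hyperbolic distance from the origin (curvature -1 assumed)."""
    return 2 * math.atanh(math.sqrt(sum(a * a for a in z)))

def set_hyperbolic_radius(z, target_radius):
    """Rescale z along its own direction to the requested hyperbolic radius."""
    norm = math.sqrt(sum(a * a for a in z))
    if norm == 0.0:
        raise ValueError("the origin has no direction to rescale along")
    # Invert radius = 2 * artanh(norm); very large targets saturate to norm 1.0.
    new_norm = math.tanh(target_radius / 2)
    return [new_norm * a / norm for a in z]

z = [0.6, 0.7]  # hypothetical latent code inside the unit disk

for target in (1.0, 2.0, 4.0):
    moved = set_hyperbolic_radius(z, target)
    print(f"target={target}: code={moved}, radius={hyperbolic_radius(moved):.4f}")
```

Each rescaled code keeps its direction and changes only its depth in the hierarchy; feeding such codes to the decoder is what the caption above refers to as varying the hyperbolic radius.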
We compare HypDAE against existing few-shot image generation methods including WaveGAN, HAE, and SAGE. Our method achieves superior performance in terms of both image quality and diversity metrics.
Qualitative comparison with state-of-the-art few-shot image generation methods. HypDAE produces images with better quality and diversity while maintaining category consistency.
We conduct comprehensive ablation studies to analyze the contribution of each component in HypDAE. The studies validate the effectiveness of hyperbolic space representation and the importance of hierarchical modeling for few-shot image generation.
Ablation study results showing the impact of different components on generation quality and diversity. The full HypDAE model achieves the best balance between category consistency and image diversity.
We conducted an extensive user study with 30 participants to evaluate HypDAE against existing methods across multiple criteria including fidelity, quality, and diversity.
User study results demonstrating HypDAE's superior performance across all evaluation metrics. Participants consistently rated HypDAE higher in terms of image quality, category fidelity, and diversity.
HypDAE demonstrates excellent performance across diverse domains and shows strong generalization to out-of-distribution data.
Additional few-shot generation results across different categories, demonstrating the versatility and robustness of HypDAE.
@INPROCEEDINGS{Li25HypDAE,
title = {HypDAE: Hyperbolic Diffusion Autoencoders for Hierarchical Few-shot Image Generation},
author = {Lingxiao Li and Kaixuan Fan and Boqing Gong and Xiangyu Yue},
booktitle = {International Conference on Computer Vision (ICCV)},
year = {2025}
}
Acknowledgements: We thank DreamBooth for the page templates.