DistillFSS: Synthesizing Few-Shot Knowledge into a Lightweight Segmentation Model

1University of Bari Aldo Moro, 2Jheronimus Academy of Data Science (JADS), 3Eindhoven University of Technology (TU/e)
DistillFSS Framework

DistillFSS consists of a Teacher (fine-tuned on support images) and a Student (distilled to internalize support knowledge).
The student network replaces heavy attention blocks with lightweight layers, enabling support-free inference.

Abstract

Cross-Domain Few-Shot Semantic Segmentation (CD-FSS) seeks to segment unknown classes in unseen domains using only a few annotated examples. This setting is inherently challenging: source and target domains exhibit substantial distribution shifts, label spaces are disjoint, and support images are scarce—making standard episodic methods unreliable and computationally demanding at test time.

To address these constraints, we propose DistillFSS, a framework that embeds support-set knowledge directly into a model's parameters through a teacher-student distillation process. By internalizing few-shot reasoning into a dedicated layer within the student network, DistillFSS eliminates the need for support images at test time, enabling fast, lightweight inference, while allowing efficient extension to novel classes in unseen domains through rapid teacher-driven specialization.

Combined with fine-tuning, the approach scales efficiently to large support sets and significantly reduces computational overhead. To evaluate the framework under realistic conditions, we introduce a new CD-FSS benchmark spanning medical imaging, industrial inspection, and remote sensing, with disjoint label spaces and variable support sizes. Experiments show that DistillFSS matches or surpasses state-of-the-art baselines, particularly in multi-class and multi-shot scenarios, while offering substantial efficiency gains.

Qualitative Results


Visualizing segmentation results on the proposed CD-FSS benchmark (Medical, Industrial, Remote Sensing).


Panels (each showing, left to right: Query, GT, DMTNet, DistillFSS):
Lung Nodule (Medical), ISIC (Skin Lesion), KVASIR (Polyp), Nucleus (Microscopy), WeedMap (Remote Sensing), Industrial (Defect), Pothole (Road).

The DistillFSS Framework

DistillFSS shifts the focus from explicit support–query matching to embedding support-set knowledge directly into the model's parameters via knowledge distillation. The framework consists of two main components:

1. Teacher: Dense Cross-Attention

We employ a modified DCAMA backbone as the Teacher. It uses dense cross-attention to correlate query and support features. To handle domain shifts, we perform TransferFSS, fine-tuning the teacher's attention aggregation layers on the target domain's support set.
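A minimal sketch of the selective fine-tuning idea behind TransferFSS, assuming hypothetical parameter names (the `attn_agg.*` prefixes are illustrative, not the actual layer identifiers): the shared backbone stays frozen, and only the attention-aggregation layers are marked trainable for tuning on the target-domain support set.

```python
# Sketch: freeze everything except attention-aggregation layers.
# Parameter names below are illustrative placeholders, not the real ones.

def select_trainable(param_names, tunable_prefixes=("attn_agg.",)):
    """Return {name: trainable_flag}, keeping only aggregation layers trainable."""
    return {
        name: any(name.startswith(p) for p in tunable_prefixes)
        for name in param_names
    }

params = [
    "backbone.layer1.weight",
    "backbone.layer4.weight",
    "attn_agg.mixer.weight",
    "attn_agg.mixer.bias",
    "decoder.head.weight",
]
flags = select_trainable(params)
trainable = [n for n, on in flags.items() if on]
# trainable == ["attn_agg.mixer.weight", "attn_agg.mixer.bias"]
```

In a real training loop, the same predicate would set each parameter's gradient flag before optimizing on the support set.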

2. Student: Internalized Knowledge

The Student network shares the backbone but replaces the heavy cross-attention blocks with lightweight ConvDist modules. The student learns to mimic the teacher's attention maps solely from the query image, effectively internalizing the support information to predict attention without needing the support set at inference time.
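The distillation objective can be pictured with a toy sketch: the student, seeing only the query, regresses onto the attention map the teacher computed from query-support cross-attention. The flat-list shapes and the plain mean-squared-error loss here are assumptions for illustration; the actual DistillFSS objective may differ.

```python
# Toy distillation loss: the student's query-only attention prediction is
# pulled toward the teacher's support-conditioned attention map via MSE.

def distill_loss(teacher_map, student_map):
    """Mean-squared error between two equally sized attention maps."""
    assert len(teacher_map) == len(student_map)
    n = len(teacher_map)
    return sum((t - s) ** 2 for t, s in zip(teacher_map, student_map)) / n

teacher = [0.9, 0.1, 0.8, 0.2]   # teacher: computed with the support set
student = [0.7, 0.2, 0.8, 0.1]   # student: predicted from the query alone
loss = distill_loss(teacher, student)  # 0.015
```

Once trained, the student needs no support images: the knowledge that produced the teacher's maps now lives in the ConvDist weights.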

Support Set → FSS Teacher → DistillFSS → Specialized Student

This two-stage process transforms explicit support–query interactions into an internal representation learned by the student, combining the adaptivity of few-shot learning with the efficiency of standard segmentation models.

Efficiency & Scalability

A major advantage of DistillFSS is its independence from the support set size at inference time. While traditional methods like DCAMA experience linear growth in computational cost as the number of shots ($K$) increases, DistillFSS maintains constant low latency and memory footprint.
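As a back-of-the-envelope check of this scaling claim, one can fit a simple linear cost model to the reported DCAMA latencies. The per-shot and fixed constants below are illustrative fits to the published table, not measured components:

```python
# Linear cost model: matching-based methods pay a per-shot cross-attention
# cost; the distilled student pays a fixed query-only cost regardless of K.
# Constants are fit to the reported latencies, for illustration only.

def matching_latency_ms(k_shots, per_shot_ms=60.0, fixed_ms=135.0):
    """Latency grows linearly with the number of support shots K."""
    return fixed_ms + per_shot_ms * k_shots

def student_latency_ms(k_shots):
    """Support-free inference: latency is independent of K."""
    return 66.0

speedup_50 = matching_latency_ms(50) / student_latency_ms(50)  # ≈ 47.5
```

With these constants the model reproduces DCAMA's reported 435 ms at 5 shots and roughly 3,130 ms at 50 shots, and the 50-shot ratio lands near the ~47x speedup quoted for DistillFSS.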

Inference Speed

5-way Setting (Lower is better)

Method        5-shot       50-shot
DCAMA         435 ms       3,130 ms
DMTNet        32,024 ms    > 60,000 ms
DistillFSS    66 ms        66 ms

Peak Memory Usage

1-way Setting (Lower is better)

Method        1-shot      10-shot
DCAMA         ~1.3 GiB    13.0 GiB
HDMNet        ~1.5 GiB    ~12.5 GiB
DistillFSS    1.2 GiB     1.2 GiB

DistillFSS creates a deployable student model that runs ~47x faster than DCAMA in 50-shot scenarios and fits on edge devices where heavy attention-based baselines would run out of memory.

Quantitative Benchmark

We evaluated DistillFSS on a new CD-FSS benchmark spanning 7 datasets across medical, industrial, and remote sensing domains. The tables below report mean Intersection over Union (mIoU) scores, demonstrating that DistillFSS consistently achieves state-of-the-art performance, especially in multi-shot settings.

Method              5-shot   10-shot   25-shot   50-shot
BAM                 0.17     0.17      0.20      0.19
HDMNet              0.04     0.12      0.14      0.16
LabelAnything       0.05     0.04      0.04      0.04
PATNet              0.08     0.22      0.27      0.30
DMTNet              0.16     0.16      0.16      0.16
DCAMA (R50)         0.11     0.00      0.23      0.24
DCAMA (Swin)        0.18     0.27      0.21      0.18
DistillFSS (R50)    8.91     11.70     16.59     20.02
DistillFSS (Swin)   3.31     3.54      3.42      4.87

BibTeX

@misc{marinisDistillFSSSynthesizingFewShot2025,
    title = {{DistillFSS}: {Synthesizing} {Few}-{Shot} {Knowledge} into a {Lightweight} {Segmentation} {Model}},
    shorttitle = {{DistillFSS}},
    url = {http://arxiv.org/abs/2512.05613},
    doi = {10.48550/arXiv.2512.05613},
    publisher = {arXiv},
    author = {Marinis, Pasquale De and Blok, Pieter M. and Kaymak, Uzay and Brussee, Rogier and Vessio, Gennaro and Castellano, Giovanna},
    month = dec,
    year = {2025},
    note = {arXiv:2512.05613 [cs]},
}