DistillFSS: Synthesizing Few-Shot Knowledge into a Lightweight Segmentation Model

1University of Bari Aldo Moro, 2Jheronimus Academy of Data Science (JADS), 3Eindhoven University of Technology (TU/e)
DistillFSS Framework

DistillFSS consists of a Teacher (fine-tuned on support images) and a Student (distilled to internalize support knowledge).
The student network replaces heavy attention blocks with lightweight layers, enabling support-free inference.

Abstract

Cross-Domain Few-Shot Semantic Segmentation (CD-FSS) seeks to segment unknown classes in unseen domains using only a few annotated examples. This setting is inherently challenging: source and target domains exhibit substantial distribution shifts, label spaces are disjoint, and support images are scarce—making standard episodic methods unreliable and computationally demanding at test time.

To address these constraints, we propose DistillFSS, a framework that embeds support-set knowledge directly into a model's parameters through a teacher-student distillation process. By internalizing few-shot reasoning into a dedicated layer within the student network, DistillFSS eliminates the need for support images at test time, enabling fast, lightweight inference, while allowing efficient extension to novel classes in unseen domains through rapid teacher-driven specialization.

Combined with fine-tuning, the approach scales efficiently to large support sets and significantly reduces computational overhead. To evaluate the framework under realistic conditions, we introduce a new CD-FSS benchmark spanning medical imaging, industrial inspection, and remote sensing, with disjoint label spaces and variable support sizes. Experiments show that DistillFSS matches or surpasses state-of-the-art baselines, particularly in multi-class and multi-shot scenarios, while offering substantial efficiency gains.

Qualitative Results


Visualizing segmentation results on the proposed CD-FSS benchmark (Medical, Industrial, Remote Sensing).


Panels (each showing, left to right: Query, GT, DMTNet, DistillFSS):
Lung Nodule (Medical), ISIC (Skin Lesion), KVASIR (Polyp), Nucleus (Microscopy), WeedMap (Remote Sensing), Industrial (Defect), Pothole (Road).

The DistillFSS Framework

DistillFSS shifts the focus from explicit support–query matching to embedding support-set knowledge directly into the model's parameters via knowledge distillation. The framework consists of two main components:

1. Teacher: Dense Cross-Attention

We employ a modified DCAMA backbone as the Teacher. It uses dense cross-attention to correlate query and support features. To handle domain shifts, we perform TransferFSS, fine-tuning the teacher's attention aggregation layers on the target domain's support set.
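A minimal sketch of the selective fine-tuning idea behind TransferFSS, assuming hypothetical parameter names (the `attn_agg.*` prefixes are illustrative, not the actual layer identifiers): the shared backbone stays frozen, and only the attention-aggregation layers are marked trainable for tuning on the target-domain support set.

```python
# Sketch: freeze everything except attention-aggregation layers.
# Parameter names below are illustrative placeholders, not the real ones.

def select_trainable(param_names, tunable_prefixes=("attn_agg.",)):
    """Return {name: trainable_flag}, keeping only aggregation layers trainable."""
    return {
        name: any(name.startswith(p) for p in tunable_prefixes)
        for name in param_names
    }

params = [
    "backbone.layer1.weight",
    "backbone.layer4.weight",
    "attn_agg.mixer.weight",
    "attn_agg.mixer.bias",
    "decoder.head.weight",
]
flags = select_trainable(params)
trainable = [n for n, on in flags.items() if on]
# trainable == ["attn_agg.mixer.weight", "attn_agg.mixer.bias"]
```

In a real training loop, the same predicate would set each parameter's gradient flag before optimizing on the support set.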

2. Student: Internalized Knowledge

The Student network shares the backbone but replaces the heavy cross-attention blocks with lightweight ConvDist modules. The student learns to mimic the teacher's attention maps solely from the query image, effectively internalizing the support information to predict attention without needing the support set at inference time.
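The distillation objective can be pictured with a toy sketch: the student, seeing only the query, regresses onto the attention map the teacher computed from query-support cross-attention. The flat-list shapes and the plain mean-squared-error loss here are assumptions for illustration; the actual DistillFSS objective may differ.

```python
# Toy distillation loss: the student's query-only attention prediction is
# pulled toward the teacher's support-conditioned attention map via MSE.

def distill_loss(teacher_map, student_map):
    """Mean-squared error between two equally sized attention maps."""
    assert len(teacher_map) == len(student_map)
    n = len(teacher_map)
    return sum((t - s) ** 2 for t, s in zip(teacher_map, student_map)) / n

teacher = [0.9, 0.1, 0.8, 0.2]   # teacher: computed with the support set
student = [0.7, 0.2, 0.8, 0.1]   # student: predicted from the query alone
loss = distill_loss(teacher, student)  # 0.015
```

Once trained, the student needs no support images: the knowledge that produced the teacher's maps now lives in the ConvDist weights.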

Support Set → FSS Teacher → DistillFSS → Specialized Student

This two-stage process transforms explicit support–query interactions into an internal representation learned by the student, combining the adaptivity of few-shot learning with the efficiency of standard segmentation models.

Efficiency & Scalability

A major advantage of DistillFSS is its independence from the support set size at inference time. While traditional methods like DCAMA experience linear growth in computational cost as the number of shots ($K$) increases, DistillFSS maintains constant low latency and memory footprint.
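As a back-of-the-envelope check of this scaling claim, one can fit a simple linear cost model to the reported DCAMA latencies. The per-shot and fixed constants below are illustrative fits to the published table, not measured components:

```python
# Linear cost model: matching-based methods pay a per-shot cross-attention
# cost; the distilled student pays a fixed query-only cost regardless of K.
# Constants are fit to the reported latencies, for illustration only.

def matching_latency_ms(k_shots, per_shot_ms=60.0, fixed_ms=135.0):
    """Latency grows linearly with the number of support shots K."""
    return fixed_ms + per_shot_ms * k_shots

def student_latency_ms(k_shots):
    """Support-free inference: latency is independent of K."""
    return 66.0

speedup_50 = matching_latency_ms(50) / student_latency_ms(50)  # ≈ 47.5
```

With these constants the model reproduces DCAMA's reported 435 ms at 5 shots and roughly 3,130 ms at 50 shots, and the 50-shot ratio lands near the ~47x speedup quoted for DistillFSS.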

Inference Speed

5-way Setting (Lower is better)

Method        5-shot       50-shot
DCAMA         435 ms       3,130 ms
DMTNet        32,024 ms    > 60,000 ms
DistillFSS    66 ms        66 ms

Peak Memory Usage

1-way Setting (Lower is better)

Method        1-shot      10-shot
DCAMA         ~1.3 GiB    13.0 GiB
HDMNet        ~1.5 GiB    ~12.5 GiB
DistillFSS    1.2 GiB     1.2 GiB

DistillFSS creates a deployable student model that runs ~47x faster than DCAMA in 50-shot scenarios and fits on edge devices where heavy attention-based baselines would run out of memory.

Quantitative Benchmark

We evaluated DistillFSS on a new CD-FSS benchmark spanning 7 datasets across medical, industrial, and remote sensing domains. The tables below report mean Intersection over Union (mIoU) scores, demonstrating that DistillFSS consistently achieves state-of-the-art performance, especially in multi-shot settings.

Method              5-shot   10-shot   25-shot   50-shot
BAM                 0.17     0.17      0.20      0.19
HDMNet              0.04     0.12      0.14      0.16
LabelAnything       0.05     0.04      0.04      0.04
PATNet              0.08     0.22      0.27      0.30
DMTNet              0.16     0.16      0.16      0.16
DCAMA (R50)         0.11     0.00      0.23      0.24
DCAMA (Swin)        0.18     0.27      0.21      0.18
DistillFSS (R50)    8.91     11.70     16.59     20.02
DistillFSS (Swin)   3.31     3.54      3.42      4.87

BibTeX

@misc{marinisDistillFSSSynthesizingFewShot2025,
    title = {{DistillFSS}: {Synthesizing} {Few}-{Shot} {Knowledge} into a {Lightweight} {Segmentation} {Model}},
    shorttitle = {{DistillFSS}},
    url = {http://arxiv.org/abs/2512.05613},
    doi = {10.48550/arXiv.2512.05613},
    publisher = {arXiv},
    author = {Marinis, Pasquale De and Blok, Pieter M. and Kaymak, Uzay and Brussee, Rogier and Vessio, Gennaro and Castellano, Giovanna},
    month = dec,
    year = {2025},
    note = {arXiv:2512.05613 [cs]},
}