Few-shot semantic segmentation aims to segment objects from previously unseen classes using only a limited number of labeled examples. In this paper, we introduce Label Anything, a novel transformer-based architecture designed for multi-prompt, multi-way few-shot semantic segmentation. Our approach leverages diverse visual prompts (points, bounding boxes, and masks) to create a highly flexible and generalizable framework that significantly reduces the annotation burden while maintaining high accuracy. Label Anything makes three key contributions: (i) we introduce a new task formulation that relaxes conventional few-shot segmentation constraints by supporting multiple prompt types, multi-class prediction, and multiple prompts within a single image; (ii) we propose a novel architecture based on transformers and attention mechanisms, eliminating the dependency on convolutional networks; and (iii) we design a versatile training procedure that allows our model to operate seamlessly across different N-way K-shot and prompt-type configurations with a single trained model. Our extensive experimental evaluation on the widely used COCO-20i benchmark demonstrates that Label Anything achieves state-of-the-art performance among existing multi-way few-shot segmentation methods, while significantly outperforming leading single-class models when evaluated in multi-class settings. Code and trained models are available in our GitHub repository.
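To make the relaxed task formulation concrete, the sketch below shows one possible way an N-way K-shot episode with mixed prompt types could be represented. All class names, file paths, and field names are hypothetical illustrations and are not taken from the released code.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ClassPrompts:
    """Visual prompts annotating one class in one support image."""
    points: list[tuple[float, float]] = field(default_factory=list)               # (x, y) clicks
    boxes: list[tuple[float, float, float, float]] = field(default_factory=list)  # (x1, y1, x2, y2)
    mask_path: str | None = None                                                  # optional dense mask


@dataclass
class SupportImage:
    """A support image annotated with prompts for one or more classes."""
    image_path: str
    prompts: dict[str, ClassPrompts]  # class name -> prompts; several classes per image allowed


@dataclass
class Episode:
    """An N-way K-shot episode: K annotated support images and one query image."""
    support: list[SupportImage]
    query_image_path: str


# A hypothetical 2-way 1-shot episode whose single support image carries
# prompts of different types for both classes.
episode = Episode(
    support=[
        SupportImage(
            image_path="support_0.jpg",
            prompts={
                "dog": ClassPrompts(points=[(120.0, 85.0)]),
                "cat": ClassPrompts(boxes=[(30.0, 40.0, 200.0, 180.0)]),
            },
        )
    ],
    query_image_path="query_0.jpg",
)
```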
Label Anything learns to segment objects by building a prototype for each class specified by a visual prompt. It first fuses support image features with dense and sparse prompt embeddings through a two-way attention mechanism, producing enriched class-specific representations. These are then pooled into class-example embeddings and refined by a self-attention mixer, which aggregates them into a single prototype per class that captures the shared semantics across the support examples and guides segmentation of the query image.
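The following is a minimal PyTorch sketch of this prototype-extraction stage. Module names, tensor shapes, the number of fusion blocks, and the pooling strategy (mean over the mixed class-example embeddings) are illustrative assumptions, not the exact released implementation.

```python
import torch
import torch.nn as nn


class TwoWayBlock(nn.Module):
    """Cross-attention in both directions between prompt tokens and image features."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.tokens_to_feats = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.feats_to_tokens = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_tokens = nn.LayerNorm(dim)
        self.norm_feats = nn.LayerNorm(dim)

    def forward(self, tokens, feats):
        # Prompt tokens attend to image features, then features attend back to tokens.
        attn_t, _ = self.tokens_to_feats(tokens, feats, feats)
        tokens = self.norm_tokens(tokens + attn_t)
        attn_f, _ = self.feats_to_tokens(feats, tokens, tokens)
        feats = self.norm_feats(feats + attn_f)
        return tokens, feats


class PrototypeEncoder(nn.Module):
    """Fuses support features with prompt embeddings and pools them into class prototypes."""

    def __init__(self, dim: int = 256, depth: int = 2):
        super().__init__()
        self.fusion = nn.ModuleList(TwoWayBlock(dim) for _ in range(depth))
        self.mixer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)

    def forward(self, support_feats, sparse_prompts, dense_prompts, num_shots):
        # support_feats:  (B*K, HW, D) backbone features of the K support images
        # sparse_prompts: (B*K, N, D)  one token per class from points/boxes
        # dense_prompts:  (B*K, HW, D) mask prompts added densely to the features
        feats = support_feats + dense_prompts
        tokens = sparse_prompts
        for block in self.fusion:
            tokens, feats = block(tokens, feats)

        # Pool into class-example embeddings, then mix the K embeddings of each
        # class with self-attention and average them into one prototype per class.
        bk, num_classes, dim = tokens.shape
        batch = bk // num_shots
        per_class = (
            tokens.view(batch, num_shots, num_classes, dim)
            .permute(0, 2, 1, 3)
            .reshape(batch * num_classes, num_shots, dim)
        )
        mixed = self.mixer(per_class)
        prototypes = mixed.mean(dim=1).view(batch, num_classes, dim)
        return prototypes


# Example: a batch of 2 episodes in a 3-way 5-shot setting with 32x32 feature maps.
encoder = PrototypeEncoder(dim=256)
prototypes = encoder(
    torch.randn(2 * 5, 32 * 32, 256),
    torch.randn(2 * 5, 3, 256),
    torch.randn(2 * 5, 32 * 32, 256),
    num_shots=5,
)
print(prototypes.shape)  # torch.Size([2, 3, 256]): one prototype per class per episode
```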
Label Anything decodes segmentation masks by matching learned class prototypes to the query image features. A two-way attention mechanism enables mutual interaction between the prototypes and query features, allowing class-specific patterns to be transferred to the query representation. The query features are then upsampled, spatially refined, and projected to match the prototype dimensions, after which segmentation masks are generated via a dot product between the transformed query features and the class prototypes.
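Below is a companion sketch of the mask-decoding stage under the same assumptions: the upsampling path, projection sizes, and module names are hypothetical placeholders chosen to illustrate the prototype-query matching, not the authors' exact architecture.

```python
import torch
import torch.nn as nn


class MaskDecoder(nn.Module):
    """Matches class prototypes against query features to produce per-class mask logits."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.protos_to_query = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.query_to_protos = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_protos = nn.LayerNorm(dim)
        self.norm_query = nn.LayerNorm(dim)
        # Upsample the query feature map 4x while reducing its channel dimension.
        self.upsample = nn.Sequential(
            nn.ConvTranspose2d(dim, dim // 2, kernel_size=2, stride=2),
            nn.GELU(),
            nn.ConvTranspose2d(dim // 2, dim // 4, kernel_size=2, stride=2),
            nn.GELU(),
        )
        # Project prototypes to the same reduced dimension for the final dot product.
        self.proto_proj = nn.Linear(dim, dim // 4)

    def forward(self, prototypes, query_feats, feat_size):
        # prototypes: (B, N, D); query_feats: (B, H*W, D); feat_size = (H, W)
        p, _ = self.protos_to_query(prototypes, query_feats, query_feats)
        prototypes = self.norm_protos(prototypes + p)
        q, _ = self.query_to_protos(query_feats, prototypes, prototypes)
        query_feats = self.norm_query(query_feats + q)

        batch, _, dim = query_feats.shape
        h, w = feat_size
        fmap = query_feats.transpose(1, 2).reshape(batch, dim, h, w)
        fmap = self.upsample(fmap)            # (B, D/4, 4H, 4W)
        protos = self.proto_proj(prototypes)  # (B, N, D/4)
        # Dot product between every prototype and every spatial location.
        masks = torch.einsum("bnc,bchw->bnhw", protos, fmap)
        return masks                          # per-class logits


# Example: decode 3-way masks for 2 query images with 32x32 feature maps.
decoder = MaskDecoder(dim=256)
logits = decoder(torch.randn(2, 3, 256), torch.randn(2, 32 * 32, 256), feat_size=(32, 32))
print(logits.shape)  # torch.Size([2, 3, 128, 128]); argmax over dim 1 gives the predicted mask
```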
@incollection{LabelAnything,
  title={Label Anything: Multi-Class Few-Shot Semantic Segmentation with Visual Prompts},
  author={De Marinis, Pasquale and Fanelli, Nicola and Scaringi, Raffaele and Colonna, Emanuele and Fiameni, Giuseppe and Vessio, Gennaro and Castellano, Giovanna},
  booktitle={ECAI 2025},
  year={2025},
  note={in press}
}