Few-shot semantic segmentation aims to segment objects from previously unseen classes using only a limited number of labeled examples. In this paper, we introduce Label Anything, a novel transformer-based architecture designed for multi-prompt, multi-way few-shot semantic segmentation. Our approach leverages diverse visual prompts (points, bounding boxes, and masks) to create a highly flexible and generalizable framework that significantly reduces the annotation burden while maintaining high accuracy. Label Anything makes three key contributions: (i) we introduce a new task formulation that relaxes conventional few-shot segmentation constraints by supporting multiple prompt types, multi-class prediction, and multiple prompts within a single image; (ii) we propose a novel architecture based on transformers and attention mechanisms, eliminating the dependency on convolutional networks; and (iii) we design a versatile training procedure that allows our model to operate seamlessly across different N-way K-shot and prompt-type configurations with a single trained model. Our extensive experimental evaluation on the widely used COCO-20i benchmark demonstrates that Label Anything achieves state-of-the-art performance among existing multi-way few-shot segmentation methods, while significantly outperforming leading single-class models when evaluated in multi-class settings. Code and trained models are available in our GitHub repository.
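To make the relaxed task formulation concrete, the sketch below shows one possible way an N-way K-shot episode with mixed prompt types could be represented. All class names, file paths, and field names are hypothetical illustrations and are not taken from the released code.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ClassPrompts:
    """Visual prompts annotating one class in one support image."""
    points: list[tuple[float, float]] = field(default_factory=list)               # (x, y) clicks
    boxes: list[tuple[float, float, float, float]] = field(default_factory=list)  # (x1, y1, x2, y2)
    mask_path: str | None = None                                                  # optional dense mask


@dataclass
class SupportImage:
    """A support image annotated with prompts for one or more classes."""
    image_path: str
    prompts: dict[str, ClassPrompts]  # class name -> prompts; several classes per image allowed


@dataclass
class Episode:
    """An N-way K-shot episode: K annotated support images and one query image."""
    support: list[SupportImage]
    query_image_path: str


# A hypothetical 2-way 1-shot episode whose single support image carries
# prompts of different types for both classes.
episode = Episode(
    support=[
        SupportImage(
            image_path="support_0.jpg",
            prompts={
                "dog": ClassPrompts(points=[(120.0, 85.0)]),
                "cat": ClassPrompts(boxes=[(30.0, 40.0, 200.0, 180.0)]),
            },
        )
    ],
    query_image_path="query_0.jpg",
)
```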
Label Anything learns to segment objects by building a prototype for each class specified by a visual prompt. It first fuses support image features with dense and sparse prompt embeddings through a two-way attention mechanism, producing enriched class-specific representations. These are then pooled into class-example embeddings and refined by a self-attention mixer, which aggregates them into a single prototype per class that captures the shared semantics across the support examples and guides segmentation of the query image.
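The following is a minimal PyTorch sketch of this prototype-extraction stage. Module names, tensor shapes, the number of fusion blocks, and the pooling strategy (mean over the mixed class-example embeddings) are illustrative assumptions, not the exact released implementation.

```python
import torch
import torch.nn as nn


class TwoWayBlock(nn.Module):
    """Cross-attention in both directions between prompt tokens and image features."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.tokens_to_feats = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.feats_to_tokens = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_tokens = nn.LayerNorm(dim)
        self.norm_feats = nn.LayerNorm(dim)

    def forward(self, tokens, feats):
        # Prompt tokens attend to image features, then features attend back to tokens.
        attn_t, _ = self.tokens_to_feats(tokens, feats, feats)
        tokens = self.norm_tokens(tokens + attn_t)
        attn_f, _ = self.feats_to_tokens(feats, tokens, tokens)
        feats = self.norm_feats(feats + attn_f)
        return tokens, feats


class PrototypeEncoder(nn.Module):
    """Fuses support features with prompt embeddings and pools them into class prototypes."""

    def __init__(self, dim: int = 256, depth: int = 2):
        super().__init__()
        self.fusion = nn.ModuleList(TwoWayBlock(dim) for _ in range(depth))
        self.mixer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)

    def forward(self, support_feats, sparse_prompts, dense_prompts, num_shots):
        # support_feats:  (B*K, HW, D) backbone features of the K support images
        # sparse_prompts: (B*K, N, D)  one token per class from points/boxes
        # dense_prompts:  (B*K, HW, D) mask prompts added densely to the features
        feats = support_feats + dense_prompts
        tokens = sparse_prompts
        for block in self.fusion:
            tokens, feats = block(tokens, feats)

        # Pool into class-example embeddings, then mix the K embeddings of each
        # class with self-attention and average them into one prototype per class.
        bk, num_classes, dim = tokens.shape
        batch = bk // num_shots
        per_class = (
            tokens.view(batch, num_shots, num_classes, dim)
            .permute(0, 2, 1, 3)
            .reshape(batch * num_classes, num_shots, dim)
        )
        mixed = self.mixer(per_class)
        prototypes = mixed.mean(dim=1).view(batch, num_classes, dim)
        return prototypes


# Example: a batch of 2 episodes in a 3-way 5-shot setting with 32x32 feature maps.
encoder = PrototypeEncoder(dim=256)
prototypes = encoder(
    torch.randn(2 * 5, 32 * 32, 256),
    torch.randn(2 * 5, 3, 256),
    torch.randn(2 * 5, 32 * 32, 256),
    num_shots=5,
)
print(prototypes.shape)  # torch.Size([2, 3, 256]): one prototype per class per episode
```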
Label Anything decodes segmentation masks by matching learned class prototypes to the query image features. A two-way attention mechanism enables mutual interaction between the prototypes and query features, allowing class-specific patterns to be transferred to the query representation. The query features are then upsampled, spatially refined, and projected to match the prototype dimensions, after which segmentation masks are generated via a dot product between the transformed query features and the class prototypes.
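Below is a companion sketch of the mask-decoding stage under the same assumptions: the upsampling path, projection sizes, and module names are hypothetical placeholders chosen to illustrate the prototype-query matching, not the authors' exact architecture.

```python
import torch
import torch.nn as nn


class MaskDecoder(nn.Module):
    """Matches class prototypes against query features to produce per-class mask logits."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.protos_to_query = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.query_to_protos = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_protos = nn.LayerNorm(dim)
        self.norm_query = nn.LayerNorm(dim)
        # Upsample the query feature map 4x while reducing its channel dimension.
        self.upsample = nn.Sequential(
            nn.ConvTranspose2d(dim, dim // 2, kernel_size=2, stride=2),
            nn.GELU(),
            nn.ConvTranspose2d(dim // 2, dim // 4, kernel_size=2, stride=2),
            nn.GELU(),
        )
        # Project prototypes to the same reduced dimension for the final dot product.
        self.proto_proj = nn.Linear(dim, dim // 4)

    def forward(self, prototypes, query_feats, feat_size):
        # prototypes: (B, N, D); query_feats: (B, H*W, D); feat_size = (H, W)
        p, _ = self.protos_to_query(prototypes, query_feats, query_feats)
        prototypes = self.norm_protos(prototypes + p)
        q, _ = self.query_to_protos(query_feats, prototypes, prototypes)
        query_feats = self.norm_query(query_feats + q)

        batch, _, dim = query_feats.shape
        h, w = feat_size
        fmap = query_feats.transpose(1, 2).reshape(batch, dim, h, w)
        fmap = self.upsample(fmap)            # (B, D/4, 4H, 4W)
        protos = self.proto_proj(prototypes)  # (B, N, D/4)
        # Dot product between every prototype and every spatial location.
        masks = torch.einsum("bnc,bchw->bnhw", protos, fmap)
        return masks                          # per-class logits


# Example: decode 3-way masks for 2 query images with 32x32 feature maps.
decoder = MaskDecoder(dim=256)
logits = decoder(torch.randn(2, 3, 256), torch.randn(2, 32 * 32, 256), feat_size=(32, 32))
print(logits.shape)  # torch.Size([2, 3, 128, 128]); argmax over dim 1 gives the predicted mask
```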
@incollection{LabelAnything,
  title={Label Anything: Multi-Class Few-Shot Semantic Segmentation with Visual Prompts},
  author={De Marinis, Pasquale and Fanelli, Nicola and Scaringi, Raffaele and Colonna, Emanuele and Fiameni, Giuseppe and Vessio, Gennaro and Castellano, Giovanna},
  booktitle={ECAI 2025},
  year={2025},
  note={in press}
}