Uncertainty-aware Fine-tuning of Segmentation Foundation Models

Segmentation with Uncertainty Model (SUM) improves SAM without forgetting to "segment anything."
Left: Both HQ-SAM and SUM show qualitative improvements over SAM, particularly in salient-object segmentation of complex structures (top row). HQ-SAM, however, struggles with background entities (middle row) and part segmentation (bottom row), often erroneously segmenting foreground objects instead of background entities, or entire objects instead of parts.
Right: SUM consistently outperforms SAM and HQ-SAM in quantitative comparisons, achieving the highest mean boundary IoU across diverse evaluation sets and interactive segmentation rounds.

Abstract

The Segment Anything Model (SAM) is a large-scale foundation model that has revolutionized segmentation methodology. Despite its impressive generalization ability, the segmentation accuracy of SAM on images with intricate structures is often unsatisfactory. Recent works have proposed lightweight fine-tuning using high-quality annotated data to improve accuracy on such images. However, here we provide extensive empirical evidence that this strategy leads to forgetting how to "segment anything": these models lose the original generalization abilities of SAM, in the sense that they perform worse for segmentation tasks not represented in the annotated fine-tuning set.

To improve performance without forgetting, we introduce a novel framework that combines high-quality annotated data with a large unlabeled dataset. The framework relies on two methodological innovations. First, we quantify the uncertainty in the SAM pseudo labels associated with the unlabeled data and leverage it to perform uncertainty-aware fine-tuning. Second, we encode the type of segmentation task associated with each training example using a task prompt to reduce ambiguity.

We evaluated the proposed Segmentation with Uncertainty Model (SUM) on a diverse test set consisting of 14 public benchmarks, where it achieves state-of-the-art results. Notably, our method consistently surpasses SAM by 3-6 points in mean IoU and 4-7 points in mean boundary IoU across point-prompt interactive segmentation rounds.
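For reference, boundary IoU restricts the IoU computation to a thin band along each mask contour, which makes it more sensitive to the fine-structure errors that SUM targets. Below is a minimal NumPy/OpenCV sketch following the commonly used definition; the `dilation_ratio` default and the erosion-based band extraction are illustrative choices, not our exact evaluation code.

```python
import numpy as np
import cv2

def mask_to_boundary(mask, dilation_ratio=0.02):
    """Extract a thin band along the mask contour (band width scales with the image diagonal)."""
    h, w = mask.shape
    dilation = max(1, int(round(dilation_ratio * np.sqrt(h ** 2 + w ** 2))))
    padded = np.pad(mask.astype(np.uint8), 1, mode="constant")  # treat the image border as background
    eroded = cv2.erode(padded, np.ones((3, 3), np.uint8), iterations=dilation)[1:-1, 1:-1]
    return mask.astype(np.uint8) - eroded  # 1 on the band, 0 elsewhere

def boundary_iou(gt, pred, dilation_ratio=0.02):
    """IoU computed only over the boundary bands of the two binary masks."""
    gt_b = mask_to_boundary(gt, dilation_ratio)
    pred_b = mask_to_boundary(pred, dilation_ratio)
    inter = np.logical_and(gt_b, pred_b).sum()
    union = np.logical_or(gt_b, pred_b).sum()
    return inter / union if union > 0 else 1.0
```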

Framework

Framework of SUM: Top: When processing human-annotated examples, interactive prompts are sampled based on the binary-mask labels and fed iteratively into the model along with the image. Since this binary mask depends on the type of segmentation task desired by the user, SUM incorporates a task prompt that specifies the task relevant to each annotation (1 for salient-object segmentation and 2 for entity segmentation).
Bottom: For unlabeled images, the iterative prompts are sampled based on model-generated binary pseudo-labels, which may be inaccurate. SUM includes an uncertainty-quantification module that processes the pseudo-labels, generating an uncertainty map. This map is leveraged within an uncertainty-aware loss function used for training, and also informs how the interactive prompts are sampled. For all unlabeled data, the task prompt is set to 0.
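To make the uncertainty-aware loss concrete, the sketch below shows one plausible way to down-weight the pixel-wise loss on unlabeled data according to the uncertainty map. The function name, the binary cross-entropy choice, and the linear 1 - uncertainty weighting are assumptions for illustration, not the exact loss used by SUM.

```python
import torch
import torch.nn.functional as F

def uncertainty_weighted_bce(logits, pseudo_labels, uncertainty, eps=1e-6):
    """Pixel-wise BCE against SAM pseudo-labels, down-weighted where the
    uncertainty map flags likely pseudo-label errors (illustrative sketch).

    logits:        (B, 1, H, W) raw mask predictions of the fine-tuned model
    pseudo_labels: (B, 1, H, W) binary pseudo-labels generated by frozen SAM
    uncertainty:   (B, 1, H, W) uncertainty map with values in [0, 1]
    """
    per_pixel = F.binary_cross_entropy_with_logits(
        logits, pseudo_labels.float(), reduction="none"
    )
    weights = 1.0 - uncertainty            # trust low-uncertainty pixels more
    return (weights * per_pixel).sum() / (weights.sum() + eps)
```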

Generation of Uncertainty Map

Generation of uncertainty maps: (1) The mask-refinement module receives as input the segmentation prediction produced by SAM. (2) The module produces a refined segmentation mask. (3) The uncertainty map equals the absolute difference between the SAM and refined predictions.
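As described above, the uncertainty map is simply the absolute difference between the two predictions. A minimal sketch, assuming the SAM output and the refined output are available as logits at the same resolution:

```python
import torch

@torch.no_grad()
def uncertainty_map(sam_logits, refined_logits):
    """Pixel-wise uncertainty: absolute difference between the SAM prediction
    and the refined prediction, taken in probability space."""
    p_sam = torch.sigmoid(sam_logits)
    p_refined = torch.sigmoid(refined_logits)
    return (p_sam - p_refined).abs()       # in [0, 1]; high where the two disagree
```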

Better Quality

Comparative visualization of segmentation outcomes using single-box prompts.


Comparative visualization of segmentation outcomes using point prompts, where blue points signify positive prompts and red points indicate negative prompts. We adhere to the same point-prompt sampling evaluation strategy as SAM.
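For completeness, the sketch below illustrates the general idea behind this interactive evaluation: each new click is drawn from the error region between the current prediction and the ground truth, positive if the model missed foreground and negative if it predicted spurious background. Uniform sampling over the whole error region is an assumption here; the exact protocol follows SAM's evaluation code.

```python
import numpy as np

def sample_next_point(pred_mask, gt_mask, rng=None):
    """Draw the next interactive click from the prediction/ground-truth error region.

    Returns ((y, x), label) with label 1 for a positive click (missed foreground)
    and 0 for a negative click (false-positive background), or None if the
    prediction already matches the ground truth exactly.
    """
    rng = rng or np.random.default_rng()
    error = pred_mask.astype(bool) ^ gt_mask.astype(bool)
    ys, xs = np.nonzero(error)
    if len(ys) == 0:
        return None
    i = rng.integers(len(ys))              # uniform sample over the error region
    y, x = int(ys[i]), int(xs[i])
    return (y, x), 1 if gt_mask[y, x] else 0
```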

Dataset

Fine-tuning under different human-annotation budgets: FT-Small, FT-Medium, FT-Large.

Experiments

Comparison of HQ-SAM with Vanilla and SUM, both fine-tuned using the same lightweight scheme as HQ-SAM. SUM matches HQ-SAM and outperforms Vanilla in salient-object segmentation, and is superior in entity and part segmentation.


Comparison with other lightweight fine-tuning methods: single point-prompt segmentation mIoU for SUM versus models fine-tuned using various strategies on the HQSeg-44K dataset. All competing models improve on the salient-object segmentation task associated with this dataset but deteriorate on other segmentation tasks.


Comparison with semi-supervised methods: 3-point-prompt segmentation evaluation of models fine-tuned on the FT-Small dataset with various strategies. SUM clearly outperforms all other strategies.


Comparison of SAM with SUM fine-tuned under different human-annotation budgets: 5-point-prompt segmentation evaluation. SUM consistently outperforms SAM, showing even greater improvement as the budget of human-annotated data increases.


Additional evaluation: to test the generalization ability of SUM on a broader range of segmentation tasks, we evaluate on 8 additional datasets. The mIoU comparisons, reported in the following tables, confirm that SUM consistently outperforms SAM. For reproducibility, SUM is fine-tuned on the public FT-Medium dataset.


Ablation study. This table reports interactive segmentation mean IoU of different ablated versions of SUM fine-tuned on FT-Medium, showing individual gains provided by uncertainty-aware fine-tuning and task prompts.

Acknowledgements

The authors acknowledge Markus Woodson for valuable discussions and feedback.

The website template was adapted from SegGen.