InstructSAM: A Training-Free Framework for Instruction-Oriented Remote Sensing Object Recognition

Yijie Zheng1,2  Weijie Wu1,2  Qingyun Li3  Xuehui Wang4  Xu Zhou5
Aiai Ren5  Jun Shen5  Long Zhao2  Guoqing Li✉️2  Xue Yang4
1University of Chinese Academy of Sciences  2Aerospace Information Research Institute 
3Harbin Institute of Technology  4Shanghai Jiao Tong University  5University of Wollongong 

About InstructSAM

Instruction-based object recognition has emerged as a powerful paradigm in computer vision. However, the lack of semantically diverse training data has limited the zero-shot performance of vision-language models in remote sensing. InstructSAM is a training-free framework for Instruction-Oriented Object Counting, Detection, and Segmentation (InstructCDS) across open-vocabulary, open-ended, and open-subclass settings. By reformulating object detection as a counting-constrained mask-label matching problem, it enables confidence-free object recognition and achieves near-constant inference time regardless of the number of objects.

InstructCDS Tasks & EarthInstruct Benchmark

The EarthInstruct benchmark introduces three challenging instruction settings: open-vocabulary, open-ended, and open-subclass.

Task Settings

Beyond these three basic settings, we employ dataset-specific prompts that guide LVLMs to recognize objects according to each dataset's annotation rules, accommodating diverse user requirements and real-world dataset biases (examples shown below).

Dataset bias
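To make this concrete, a dataset-specific prompt can state the target dataset's annotation conventions directly in the instruction. The sketch below is purely illustrative: the rules and wording are hypothetical placeholders, not the prompts used in EarthInstruct.

# Hypothetical dataset-specific prompt (illustrative only; not the EarthInstruct wording).
# The annotation rules below are placeholders showing how dataset biases can be encoded.
DATASET_PROMPT = """You are analyzing a remote sensing image.
Count the objects of each requested category, following these annotation rules:
- "vehicle": count only small vehicles such as cars and trucks; ignore ships and trains.
- "bridge": count road and rail bridges; do not count their shadows.
Answer with a JSON object mapping each category name to its count.
"""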

InstructSAM Framework

InstructSAM Framework

To tackle the challenges of limited training data and complex user instructions, InstructSAM decomposes instruction-oriented object detection into three tractable steps:

Step 1: Instruction-Oriented Object Counting
A large vision-language model (LVLM) interprets user instructions and predicts object categories and counts.
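A minimal sketch of this step, assuming GPT-4o is queried through the OpenAI Python SDK and asked to return categories and counts as JSON (the prompt wording and the count_objects helper are illustrative, not the exact implementation):

import base64
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def count_objects(image_path: str, instruction: str) -> dict:
    """Ask an LVLM for object categories and counts, e.g. {"airplane": 5, "vehicle": 12}."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": instruction + "\nAnswer with a JSON object mapping each "
                                       "category name to its count."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)

# Open-vocabulary example: count_objects("scene.png", "Count all airplanes and vehicles.")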

Step 2: Class-Agnostic Mask Generation
SAM2 automatically generates high-quality mask proposals in parallel with instruction processing.
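A minimal sketch of this step, assuming the official sam2 package and its automatic mask generator; the checkpoint and config paths are placeholders to be replaced with the files shipped with SAM2:

import numpy as np
from PIL import Image
from sam2.build_sam import build_sam2
from sam2.automatic_mask_generator import SAM2AutomaticMaskGenerator

# Build SAM2 and its class-agnostic automatic mask generator (paths are placeholders).
sam2_model = build_sam2("configs/sam2.1/sam2.1_hiera_l.yaml", "checkpoints/sam2.1_hiera_large.pt")
mask_generator = SAM2AutomaticMaskGenerator(sam2_model)

image = np.array(Image.open("scene.png").convert("RGB"))
proposals = mask_generator.generate(image)  # list of dicts with "segmentation", "bbox", ...
print(f"{len(proposals)} class-agnostic mask proposals")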

Step 3: Counting-Constrained Matching
A remote sensing CLIP model computes semantic similarity between the predicted categories and the mask proposals. InstructSAM then formulates object detection and segmentation as a mask-label matching problem that combines this semantic similarity with the global counting constraints from Step 1, and solves it with a binary integer programming solver to obtain the final recognition results.
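The matching step can be sketched as a small binary integer program. The formulation below is illustrative rather than the paper's exact objective or relaxations: S is the (N, K) CLIP similarity matrix between N mask proposals and K predicted categories (for instance, cosine similarity between text embeddings of the category names and image embeddings of the mask crops), counts holds the LVLM-predicted count per category, each mask receives at most one label, each category is assigned exactly its predicted number of masks, and scipy.optimize.milp solves the program.

import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

def counting_constrained_matching(S: np.ndarray, counts: np.ndarray) -> np.ndarray:
    """Match N mask proposals to K categories under counting constraints.

    S:      (N, K) semantic similarity between mask proposals and categories.
    counts: (K,) object counts predicted by the LVLM.
    Returns an (N, K) binary assignment matrix (illustrative formulation).
    """
    N, K = S.shape
    c = -S.ravel()  # milp minimizes, so negate to maximize total similarity

    # Each mask proposal is assigned to at most one category.
    per_mask = LinearConstraint(np.kron(np.eye(N), np.ones((1, K))), lb=0, ub=1)
    # Each category receives exactly its predicted count of masks.
    per_cat = LinearConstraint(np.kron(np.ones((1, N)), np.eye(K)), lb=counts, ub=counts)

    res = milp(c, integrality=np.ones(N * K), bounds=Bounds(0, 1),
               constraints=[per_mask, per_cat])
    if not res.success:  # e.g., the predicted counts exceed the number of proposals
        raise ValueError("matching infeasible; consider relaxing the count constraints")
    return res.x.reshape(N, K).round().astype(int)

# Example: assignment = counting_constrained_matching(similarity, np.array([3, 1]))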

InstructSAM Inference Process
Visualization of the InstructSAM inference process.

Results Visualization
Qualitative results across different settings.

Key Results & Performance Highlights

Open-Vocabulary Results

Open-Vocabulary Results

Open-Ended Results

Open-Ended Results

Open-Subclass Results

Open-Subclass Results

Inference Time Analysis

Inference Time Comparison
InstructSAM exhibits nearly constant inference speed under the open-ended setting, in contrast to approaches whose runtime grows linearly with the number of objects. By not representing bounding boxes as natural-language tokens, InstructSAM reduces output tokens by 89% and total inference time by 32% compared to Qwen2.5-VL. This advantage becomes more pronounced as model size scales up, highlighting the efficiency of our framework.

Generalization to Natural Images

Natural Image Results
When equipped with a generic CLIP model, InstructSAM effectively recognizes objects in natural images as well.

Analysis & Discussion

The Power of Foundation Models and Prompt Engineering

Counting Performance Table
* Faster R-CNN is trained on the DIOR training set.
Providing GPT-4o with detailed annotation rules enables it to count objects as accurately as a closed-set trained Faster R-CNN! This demonstrates the importance of proper prompt design in leveraging foundation model capabilities.

Confidence-Free vs. Confidence-Based Approaches

Threshold Sensitivity Analysis

Traditional detectors rely on confidence scores and thresholds, which can be sensitive and difficult to tune, especially in zero-shot scenarios. InstructSAM's counting-constrained matching approach provides a robust alternative by dynamically adjusting assignments based on predicted counts from the LVLM.
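As a toy single-category illustration (with hypothetical similarity scores, and ignoring the full matching formulation above), threshold-based selection swings with the chosen cutoff, whereas count-constrained selection simply keeps the top-c proposals for an LVLM-predicted count c:

import numpy as np

scores = np.array([0.31, 0.29, 0.27, 0.12, 0.08])  # hypothetical CLIP similarities

# Confidence-based: the number of detections is sensitive to the threshold.
print(int(np.sum(scores > 0.30)))  # 1 detection
print(int(np.sum(scores > 0.25)))  # 3 detections

# Confidence-free: keep exactly the top-c proposals for a predicted count c = 3.
c = 3
keep = np.argsort(scores)[::-1][:c]
print(keep)  # indices of the three best-matching mask proposals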

Limitations & Future Directions

InstructSAM's performance depends on the capabilities of the underlying foundation models (LVLM, SAM2, CLIP). Future advancements in these models, particularly those trained on more semantically diverse remote sensing data, will likely enhance InstructSAM's capabilities further.

Getting Started

Ready to try InstructSAM? Check out our README for detailed installation and usage instructions.

Citation

@article{zheng2025instructsam,
    title={InstructSAM: A Training-Free Framework for Instruction-Oriented Remote Sensing Object Recognition}, 
    author={Yijie Zheng and Weijie Wu and Qingyun Li and Xuehui Wang and Xu Zhou and Aiai Ren and Jun Shen and Long Zhao and Guoqing Li and Xue Yang},
    year={2025},
    journal={arXiv preprint arXiv:2505.15818},
}