InstructSAM: A Training-Free Framework for Instruction-Oriented Remote Sensing Object Recognition

Yijie Zheng1,2  Weijie Wu1,2  Qingyun Li3  Xuehui Wang4  Xu Zhou5
Aiai Ren5  Jun Shen5  Long Zhao2  Guoqing Li✉️2  Xue Yang4
1University of Chinese Academy of Sciences  2Aerospace Information Research Institute 
3Harbin Institute of Technology  4Shanghai Jiao Tong University  5University of Wollongong 

About InstructSAM

Instruction-based object recognition has emerged as a powerful paradigm in computer vision. However, the lack of semantically diverse training data has limited the zero-shot performance of vision-language models in remote sensing. InstructSAM is a training-free framework for Instruction-Oriented Object Counting, Detection, and Segmentation (InstructCDS) across open-vocabulary, open-ended, and open-subclass settings. By reformulating object detection as a counting-constrained mask-label matching problem, it enables confidence-free object recognition and achieves near-constant inference time regardless of the number of objects.

InstructCDS Tasks & EarthInstruct Benchmark

The EarthInstruct benchmark introduces three challenging instruction settings: open-vocabulary, open-ended, and open-subclass.

Task Settings

Beyond these three basic settings, we employ dataset-specific prompts that guide LVLMs to recognize objects according to each dataset's annotation rules, accommodating diverse user requirements and real-world dataset biases (examples shown below).

Dataset bias
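To make this concrete, a dataset-specific prompt can state the target dataset's annotation conventions directly in the instruction. The sketch below is purely illustrative: the rules and wording are hypothetical placeholders, not the prompts used in EarthInstruct.

# Hypothetical dataset-specific prompt (illustrative only; not the EarthInstruct wording).
# The annotation rules below are placeholders showing how dataset biases can be encoded.
DATASET_PROMPT = """You are analyzing a remote sensing image.
Count the objects of each requested category, following these annotation rules:
- "vehicle": count only small vehicles such as cars and trucks; ignore ships and trains.
- "bridge": count road and rail bridges; do not count their shadows.
Answer with a JSON object mapping each category name to its count.
"""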

InstructSAM Framework

InstructSAM Framework

To tackle the challenges of limited training data and complex user instructions, InstructSAM decomposes instruction-oriented object detection into three tractable steps:

Step 1: Instruction-Oriented Object Counting
A large vision-language model (LVLM) interprets user instructions and predicts object categories and counts.
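A minimal sketch of this step, assuming GPT-4o is queried through the OpenAI Python SDK and asked to return categories and counts as JSON (the prompt wording and the count_objects helper are illustrative, not the exact implementation):

import base64
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def count_objects(image_path: str, instruction: str) -> dict:
    """Ask an LVLM for object categories and counts, e.g. {"airplane": 5, "vehicle": 12}."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": instruction + "\nAnswer with a JSON object mapping each "
                                       "category name to its count."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)

# Open-vocabulary example: count_objects("scene.png", "Count all airplanes and vehicles.")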

Step 2: Class-Agnostic Mask Generation
SAM2 automatically generates high-quality mask proposals in parallel with instruction processing.
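A minimal sketch of this step, assuming the official sam2 package and its automatic mask generator; the checkpoint and config paths are placeholders to be replaced with the files shipped with SAM2:

import numpy as np
from PIL import Image
from sam2.build_sam import build_sam2
from sam2.automatic_mask_generator import SAM2AutomaticMaskGenerator

# Build SAM2 and its class-agnostic automatic mask generator (paths are placeholders).
sam2_model = build_sam2("configs/sam2.1/sam2.1_hiera_l.yaml", "checkpoints/sam2.1_hiera_large.pt")
mask_generator = SAM2AutomaticMaskGenerator(sam2_model)

image = np.array(Image.open("scene.png").convert("RGB"))
proposals = mask_generator.generate(image)  # list of dicts with "segmentation", "bbox", ...
print(f"{len(proposals)} class-agnostic mask proposals")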

Step 3: Counting-Constrained Matching
A remote sensing CLIP model computes semantic similarity between the predicted categories and the mask proposals. InstructSAM then formulates object detection and segmentation as a mask-label matching problem that combines this semantic similarity with the global counting constraints from Step 1, and solves it with a binary integer programming solver to obtain the final recognition results.
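The matching step can be sketched as a small binary integer program. The formulation below is illustrative rather than the paper's exact objective or relaxations: S is the (N, K) CLIP similarity matrix between N mask proposals and K predicted categories (for instance, cosine similarity between text embeddings of the category names and image embeddings of the mask crops), counts holds the LVLM-predicted count per category, each mask receives at most one label, each category is assigned exactly its predicted number of masks, and scipy.optimize.milp solves the program.

import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

def counting_constrained_matching(S: np.ndarray, counts: np.ndarray) -> np.ndarray:
    """Match N mask proposals to K categories under counting constraints.

    S:      (N, K) semantic similarity between mask proposals and categories.
    counts: (K,) object counts predicted by the LVLM.
    Returns an (N, K) binary assignment matrix (illustrative formulation).
    """
    N, K = S.shape
    c = -S.ravel()  # milp minimizes, so negate to maximize total similarity

    # Each mask proposal is assigned to at most one category.
    per_mask = LinearConstraint(np.kron(np.eye(N), np.ones((1, K))), lb=0, ub=1)
    # Each category receives exactly its predicted count of masks.
    per_cat = LinearConstraint(np.kron(np.ones((1, N)), np.eye(K)), lb=counts, ub=counts)

    res = milp(c, integrality=np.ones(N * K), bounds=Bounds(0, 1),
               constraints=[per_mask, per_cat])
    if not res.success:  # e.g., the predicted counts exceed the number of proposals
        raise ValueError("matching infeasible; consider relaxing the count constraints")
    return res.x.reshape(N, K).round().astype(int)

# Example: assignment = counting_constrained_matching(similarity, np.array([3, 1]))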

InstructSAM Inference Process
Visualization of the InstructSAM inference process.

Results Visualization
Qualitative results across different settings.

Key Results & Performance Highlights

Open-Vocabulary Results

Open-Vocabulary Results

Open-Ended Results

Open-Ended Results

Open-Subclass Results

Open-Subclass Results

Inference Time Analysis

Inference Time Comparison
InstructSAM exhibits nearly constant inference speed under the open-ended setting, in contrast to approaches whose runtime grows linearly with the number of objects. By not representing bounding boxes as natural-language tokens, InstructSAM reduces output tokens by 89% and total inference time by 32% compared to Qwen2.5-VL. This advantage becomes more pronounced as model size scales up, highlighting the efficiency of our framework.

Generalization to Natural Images

Natural Image Results
When equipped with a generic CLIP model, InstructSAM effectively recognizes objects in natural images as well.

Analysis & Discussion

The Power of Foundation Models and Prompt Engineering

Counting Performance Table
* Faster R-CNN is trained on the DIOR training set.
Providing GPT-4o with detailed annotation rules enables it to count objects as accurately as a closed-set trained Faster R-CNN! This demonstrates the importance of proper prompt design in leveraging foundation model capabilities.

Confidence-Free vs. Confidence-Based Approaches

Threshold Sensitivity Analysis

Traditional detectors rely on confidence scores and thresholds, which can be sensitive and difficult to tune, especially in zero-shot scenarios. InstructSAM's counting-constrained matching approach provides a robust alternative by dynamically adjusting assignments based on predicted counts from the LVLM.
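As a toy single-category illustration (with hypothetical similarity scores, and ignoring the full matching formulation above), threshold-based selection swings with the chosen cutoff, whereas count-constrained selection simply keeps the top-c proposals for an LVLM-predicted count c:

import numpy as np

scores = np.array([0.31, 0.29, 0.27, 0.12, 0.08])  # hypothetical CLIP similarities

# Confidence-based: the number of detections is sensitive to the threshold.
print(int(np.sum(scores > 0.30)))  # 1 detection
print(int(np.sum(scores > 0.25)))  # 3 detections

# Confidence-free: keep exactly the top-c proposals for a predicted count c = 3.
c = 3
keep = np.argsort(scores)[::-1][:c]
print(keep)  # indices of the three best-matching mask proposals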

Limitations & Future Directions

InstructSAM's performance depends on the capabilities of the underlying foundation models (LVLM, SAM2, CLIP). Future advancements in these models, particularly those trained on more semantically diverse remote sensing data, will likely enhance InstructSAM's capabilities further.

Getting Started

Ready to try InstructSAM? Check out our README for detailed installation and usage instructions.

Citation

@article{zheng2025instructsam,
    title={InstructSAM: A Training-Free Framework for Instruction-Oriented Remote Sensing Object Recognition}, 
    author={Yijie Zheng and Weijie Wu and Qingyun Li and Xuehui Wang and Xu Zhou and Aiai Ren and Jun Shen and Long Zhao and Guoqing Li and Xue Yang},
    year={2025},
    journal={arXiv preprint arXiv:2505.15818},
}