VHM: Versatile and Honest Vision Language Model for Remote Sensing Image Analysis

Chao Pang1*, Xingxing Weng1* , Jiang Wu2*, Jiayu Li1 , Yi Liu1 , Jiaxing Sun1 , Weijia Li3 , Shuai Wang4 , Litong Feng4 , Gui-Song Xia1 , Conghui He2,4
1. Wuhan University 2. Shanghai AI Lab 3. Sun Yat-Sen University 4. Sensetime Research
arXiv 2024

*Indicates equal contribution
Indicates corresponding authors

Abstract

This paper develops a Versatile and Honest vision language Model (VHM) for remote sensing image analysis. VHM is built on a large-scale remote sensing image-text dataset with rich-content captions (VersaD), and an honest instruction dataset comprising both factual and deceptive questions (HnstD). Unlike prevailing remote sensing image-text datasets, in which image captions focus on a few prominent objects and their relationships, VersaD captions provide detailed information about image properties, object attributes, and the overall scene. This comprehensive captioning enables VHM to thoroughly understand remote sensing images and perform diverse remote sensing tasks. Moreover, different from existing remote sensing instruction datasets that only include factual questions, HnstD contains additional deceptive questions stemming from the non-existence of objects. This feature prevents VHM from producing affirmative answers to nonsense queries, thereby ensuring its honesty. In our experiments, VHM significantly outperforms various vision language models on common tasks of scene classification, visual question answering, and visual grounding. Additionally, VHM achieves competent performance on several unexplored tasks, such as building vectorizing, multi-label classification and honest question answering.

VersaD for Pretrain

We utilized the gemini-1.0-pro-vision API to generate descriptions for images from multiple public RS datasets, thereby obtaining a dataset of image-text pairs to serve as the pre-training data for remote sensing vision language models (RSVLMs).
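The snippet below is a minimal captioning sketch using the google-generativeai Python SDK. The prompt wording, file path, and API key placeholder are illustrative assumptions, not the exact prompt used to build VersaD.

```python
# Minimal captioning sketch (assumes the google-generativeai SDK; the prompt
# wording and file path below are illustrative, not the paper's exact ones).
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")  # placeholder
model = genai.GenerativeModel("gemini-1.0-pro-vision")

prompt = (
    "Describe this remote sensing image in detail, covering image properties, "
    "object attributes, and the overall scene."
)

image = Image.open("rs_images/sample_0001.png")  # hypothetical path
response = model.generate_content([prompt, image])
print(response.text)  # rich-content caption paired with the image for pre-training
```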



Examples of large-scale RS vision-language datasets. RS5M consists of two subsets: RS5M-RS3 and RS5M-PUB11. (a) VersaD (ours) captions provide detailed descriptions, including moving objects, rapidly changing information (e.g., the seasonal appearance of vegetation), and rich object attributes. (b) RS5M-RS3's captions are short and lack detail. (c) RS5M-PUB11's images are not typical RS images; the captions are short, repetitive texts, often containing redundancy (such as "two houses" in the figure). (d) SkyScript captions come from OpenStreetMap and therefore lack moving objects and season-sensitive information; the descriptions are relatively short. (a) and (b) share the same image. We selected images of similar scenes for (c) and (d).


Examples of VersaD. Black sentences represent descriptions that are entirely accurate, green sentences represent descriptions that are partially accurate, and red sentences represent descriptions that are entirely incorrect.

Datasets for Supervised Fine Tuning

The training instructions used during the Supervised Fine-Tuning (SFT) stage comprise four parts, each with a specified quantity: the VersaD-Instruct dataset (30k), the HnstD dataset (44k), the RS-Specialized-Instruct dataset (29.8k), and the RS-ClsQaGrd-Instruct dataset (78k), totaling approximately 180k instructions. A sketch of assembling this mix is shown below.
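The following is a hedged sketch of merging the four instruction sets into one SFT mix. The file names and the LLaVA-style JSON record format are assumptions for illustration; only the per-part counts come from the description above.

```python
# Sketch of assembling the SFT instruction mix (file names are hypothetical;
# counts follow the composition described above).
import json
import random

sft_parts = {
    "versad_instruct.json": 30_000,       # VersaD-Instruct
    "hnstd.json": 44_000,                 # HnstD (factual + deceptive questions)
    "rs_specialized_instruct.json": 29_800,  # RS-Specialized-Instruct
    "rs_clsqagrd_instruct.json": 78_000,     # RS-ClsQaGrd-Instruct
}

mixed = []
for path, expected in sft_parts.items():
    with open(path) as f:
        samples = json.load(f)            # assumed LLaVA-style conversation records
    assert len(samples) == expected, f"{path}: got {len(samples)}"
    mixed.extend(samples)

random.shuffle(mixed)
with open("vhm_sft_mix.json", "w") as f:
    json.dump(mixed, f)
print(f"total SFT samples: {len(mixed)}")  # ~180k
```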

VHM Implementation

We adopted the LLaVA architecture and continued training it to obtain VHM. It includes three main components: (1) a pretrained vision encoder based on the CLIP-Large model, with a 336 × 336 input resolution and a patch size of 14, which converts an input image into 576 visual tokens; (2) an LLM based on the open-source Vicuna-v1.5, which originates from LLaMA-2 (we use the 7B version throughout this paper); and (3) a projector, a two-layer multilayer perceptron that connects the vision encoder to the LLM.
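Below is a minimal sketch of how these three components could be wired together, following common LLaVA-1.5 conventions. It is not the official VHM code; the Hugging Face checkpoint names and the GELU activation in the projector are assumptions consistent with the description above.

```python
# Sketch of the three-component wiring (not the official VHM code; checkpoint
# names and projector activation follow common LLaVA-1.5 practice).
import torch
import torch.nn as nn
from transformers import CLIPVisionModel, AutoModelForCausalLM

# (1) Vision encoder: CLIP-Large, 336x336 input, patch size 14 -> 24x24 = 576 patch tokens.
vision = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14-336")

# (2) LLM: Vicuna-v1.5 7B (LLaMA-2 based).
llm = AutoModelForCausalLM.from_pretrained("lmsys/vicuna-7b-v1.5")

# (3) Projector: two-layer MLP mapping vision features (1024-d) to LLM embeddings (4096-d).
projector = nn.Sequential(
    nn.Linear(vision.config.hidden_size, llm.config.hidden_size),
    nn.GELU(),
    nn.Linear(llm.config.hidden_size, llm.config.hidden_size),
)

pixel_values = torch.randn(1, 3, 336, 336)              # dummy preprocessed image
feats = vision(pixel_values).last_hidden_state[:, 1:]   # drop CLS token -> (1, 576, 1024)
image_tokens = projector(feats)                         # (1, 576, 4096), prepended to the LLM input
```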



Our training is divided into two stages: the first stage is pretraining, and the second stage is supervised fine-tuning.
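A hedged outline of the two-stage schedule is given below. Which modules are trained at each stage is an assumption following common LLaVA practice, not the paper's exact recipe; only the stage names and data sources come from the descriptions above.

```python
# Assumed two-stage schedule; frozen/trainable module choices follow common
# LLaVA practice and are not confirmed by the paper.
training_stages = [
    {
        "name": "pretraining",
        "data": "VersaD image-text pairs",
        "trainable": ["projector"],          # vision encoder and LLM frozen (assumed)
    },
    {
        "name": "supervised fine-tuning",
        "data": "VersaD-Instruct + HnstD + RS-Specialized-Instruct + RS-ClsQaGrd-Instruct",
        "trainable": ["projector", "llm"],   # vision encoder frozen (assumed)
    },
]
```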

RS-Specialized Functions of VHM

BibTeX

@misc{pang2024vhmversatilehonestvision,
  title={VHM: Versatile and Honest Vision Language Model for Remote Sensing Image Analysis},
  author={Chao Pang and Xingxing Weng and Jiang Wu and Jiayu Li and Yi Liu and Jiaxing Sun and Weijia Li and Shuai Wang and Litong Feng and Gui-Song Xia and Conghui He},
  year={2024},
  eprint={2403.20213},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2403.20213},
}