SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference

Yuan Zhang1*  Chun-Kai Fan1*  Junpeng Ma2*  Wenzhao Zheng✉,3  Tao Huang4  Kuan Cheng1 Denis Gudovskiy5  Tomoyuki Okuno5  Yohei Nakata5  Kurt Keutzer3  Shanghang Zhang✉,1 

1School of Computer Science, Peking University  2Fudan University 

3UC Berkeley  4The University of Sydney  5Panasonic Holdings Corporation 

*Equal contribution, ✉Corresponding author



Abstract

In vision-language models (VLMs), visual tokens account for a large share of the computational overhead despite carrying sparser information than text tokens. To address this, most existing methods learn a network to prune redundant visual tokens, which requires additional training data. In contrast, we propose SparseVLM, an efficient training-free token optimization mechanism that adds no extra parameters or fine-tuning cost. Concretely, since visual tokens complement text tokens in VLMs during linguistic reasoning, we select visually relevant text tokens to rate the significance of vision tokens using the self-attention matrix extracted from the VLM, and then progressively prune irrelevant tokens.



To maximize sparsity while retaining essential information, we introduce a rank-based strategy that adaptively determines the sparsification ratio for each layer, alongside a token recycling method that compresses pruned tokens into more compact representations. Experimental results show that SparseVLM improves the efficiency of various VLMs across a range of image and video understanding tasks. In particular, LLaVA equipped with SparseVLM reduces FLOPs by 61%-67% at a 78% compression ratio while maintaining 93% of the accuracy.

Sample prompts from four representative multimodal benchmarks

We show four representative cases in which we compute the correlation between the prompt and the image. The darker a word, the stronger its relationship to the image and the more valuable it is as a reference. Some words are irrelevant to the visual domain (e.g., prepositions and pronouns) and should not guide visual sparsification. For example, case 3 highlights Tylenol, Advil, and ibuprofen, and case 4 highlights top, sticker, and fridge, while a large proportion of the question tokens (shown in light red) carry little visual relevance.



Our Pipeline

  • Relevant Text Token Selection. Before the LLM forward pass, we pre-select relevant text tokens to serve as text raters. As the example prompts from four different benchmarks show, it is not appropriate to use all text tokens as references for visual sparsification. We therefore compute the similarity between the prompt and the image and select the tokens whose similarity exceeds the mean similarity value as text raters (minimal sketches of the individual steps follow this list).
  • Estimation of Visual Token Significance. To decide whether a visual token should be removed, we estimate how relevant it is to the textual tokens. We reuse the self-attention logits of the VLM's transformer layers as this reference, since they already contain language-to-vision query results.
  • Sparsification Level Adaptation. We further propose a rank-based strategy to adaptively determine the level of vision sparsification at each decoder layer: the gap between the dimension and the rank of the self-attention logits reflects the layer's redundancy.
  • Token Aggregation. From the deleted pool, we first recycle the pruned visual tokens with the top-k highest values in the self-attention logits. We then group the recycled tokens with a k-nearest-neighbor density peak aggregation algorithm for adaptive token aggregation.
  • Token Reconstruction. After aggregation, recycled tokens with similar semantics fall into the same group, and each group is compressed into a single compact token representation.
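Below is a minimal PyTorch-style sketch of the text-rater selection and visual-token scoring steps. The tensor names (text_hidden, vision_hidden, attn_logits), the shapes, and the mean-threshold rule over averaged similarities are illustrative assumptions, not the released implementation.

```python
import torch

def select_text_raters(text_hidden: torch.Tensor, vision_hidden: torch.Tensor) -> torch.Tensor:
    """Return a boolean mask over text tokens whose average similarity to the
    visual tokens exceeds the mean (illustrative thresholding rule)."""
    # text_hidden: (num_text, dim), vision_hidden: (num_vision, dim)
    sim = text_hidden @ vision_hidden.T          # (num_text, num_vision)
    relevance = sim.mean(dim=1)                  # per-text-token relevance
    return relevance > relevance.mean()          # keep above-average tokens as raters

def visual_token_significance(attn_logits: torch.Tensor,
                              rater_idx: torch.Tensor,
                              vision_idx: torch.Tensor) -> torch.Tensor:
    """Score each visual token by the attention it receives from the selected
    text raters, reusing one decoder layer's self-attention logits."""
    # attn_logits: (num_heads, seq_len, seq_len)
    cross = attn_logits[:, rater_idx][:, :, vision_idx]   # text-to-vision block
    scores = cross.softmax(dim=-1).mean(dim=(0, 1))       # (num_vision,)
    return scores                                          # low score => pruning candidate
```

Visual tokens with the lowest scores at a given layer become that layer's pruning candidates.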
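The next sketch covers the rank-based sparsification level adaptation; the redundancy signal (dimension minus numerical rank of the text-to-vision logits) follows the description above, while scaling it into a per-layer deletion budget is an assumed design choice.

```python
import torch

def adaptive_prune_budget(cross_logits: torch.Tensor, num_vision: int) -> int:
    """Estimate how many visual tokens a layer can afford to drop.

    cross_logits: (num_raters, num_vision) text-to-vision attention logits.
    The gap between the matrix dimension and its numerical rank serves as a
    redundancy signal: the larger the gap, the more tokens are pruned here.
    """
    rank = torch.linalg.matrix_rank(cross_logits.float()).item()
    dim = min(cross_logits.shape)
    redundancy = (dim - rank) / dim               # in [0, 1)
    return int(redundancy * num_vision)           # tokens to delete at this layer
```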
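Finally, a sketch of token recycling, density-peak aggregation, and reconstruction; the recycling ratio, the neighborhood size k, the number of groups, and the mean-pooled reconstruction are illustrative choices rather than the paper's exact settings.

```python
import torch

def recycle_and_aggregate(pruned_tokens: torch.Tensor,
                          pruned_scores: torch.Tensor,
                          keep_ratio: float = 0.25,
                          k: int = 5,
                          num_groups: int = 4) -> torch.Tensor:
    """Recycle the most informative pruned tokens, group them with a
    k-nearest-neighbor density peak rule, and compress each group into one token."""
    # 1) Recycle: keep the top-scoring tokens from the deleted pool.
    num_keep = max(1, int(keep_ratio * pruned_tokens.size(0)))
    recycled = pruned_tokens[pruned_scores.topk(num_keep).indices]  # (num_keep, dim)
    if num_keep < 2:
        return recycled                           # too few tokens to cluster

    # 2) Density: a token is dense if its k nearest neighbors are close.
    dist = torch.cdist(recycled, recycled)        # pairwise distances
    knn = dist.topk(min(k, num_keep - 1) + 1, largest=False).values[:, 1:]  # drop self
    density = (-knn.mean(dim=1)).exp()

    # 3) Peaks: the densest tokens act as group centers; assign each token
    #    to its nearest center.
    centers = density.topk(min(num_groups, num_keep)).indices
    assign = dist[:, centers].argmin(dim=1)

    # 4) Reconstruction: compress each group into a single compact token.
    compact = []
    for g in range(centers.numel()):
        members = recycled[assign == g]
        # fall back to the center itself if a group ended up empty (tie case)
        compact.append(members.mean(dim=0) if members.numel() else recycled[centers[g]])
    return torch.stack(compact)                   # (num_groups_used, dim)
```

The resulting compact tokens can then be appended back to the sequence in place of the full deleted pool, which is how pruned information is retained in more compact form.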

Experiment Results

Image Understanding Tasks

| Method | GQA | MMB | MME | POPE | SQA | VQAV2 | VQAText | ConB | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| Upper Bound, 576 Tokens (100%) | | | | | | | | | |
| Vanilla | 61.9 (100%) | 64.7 (100%) | 1862 (100%) | 85.9 (100%) | 69.5 (100%) | 78.5 (100%) | 58.2 (100%) | 19.8 (100%) | 100% |
| Retain 192 Tokens (↓ 66.7%) | | | | | | | | | |
| ToMe (ICLR23) | 54.3 (87.7%) | 60.5 (93.5%) | 1563 (83.9%) | 72.4 (84.3%) | 65.2 (93.8%) | 68.0 (86.6%) | 52.1 (89.5%) | 17.4 (87.9%) | 88.4% |
| FastV (ECCV24) | 52.7 (85.1%) | 61.2 (94.6%) | 1612 (86.6%) | 64.8 (75.4%) | 67.3 (96.8%) | 67.1 (85.5%) | 52.5 (90.2%) | 18.0 (90.9%) | 88.1% |
| SparseVLM | 57.6 (93.1%) | 62.5 (96.6%) | 1721 (92.4%) | 83.6 (97.3%) | 69.1 (99.4%) | 75.6 (96.3%) | 56.1 (96.4%) | 18.8 (94.9%) | 95.8% (↑ 7.4%) |
| Retain 128 Tokens (↓ 77.8%) | | | | | | | | | |
| ToMe (ICLR23) | 52.4 (84.7%) | 53.3 (82.4%) | 1343 (72.1%) | 62.8 (73.1%) | 59.6 (85.8%) | 63.0 (80.2%) | 49.1 (84.4%) | 16.0 (80.8%) | 80.4% |
| FastV (ECCV24) | 49.6 (80.1%) | 56.1 (86.7%) | 1490 (80.0%) | 59.6 (69.4%) | 60.2 (86.6%) | 61.8 (78.7%) | 50.6 (86.9%) | 17.1 (86.4%) | 81.9% |
| SparseVLM | 56.0 (90.5%) | 60.0 (92.7%) | 1696 (91.1%) | 80.5 (93.7%) | 67.1 (96.5%) | 73.8 (94.0%) | 54.9 (94.3%) | 18.5 (93.4%) | 93.3% (↑ 11.4%) |
| Retain 64 Tokens (↓ 88.9%) | | | | | | | | | |
| ToMe (ICLR23) | 48.6 (78.5%) | 43.7 (67.5%) | 1138 (61.1%) | 52.5 (61.1%) | 50.0 (71.9%) | 57.1 (72.7%) | 45.3 (77.8%) | 14.0 (70.7%) | 70.2% |
| FastV (ECCV24) | 46.1 (74.5%) | 48.0 (74.2%) | 1256 (67.5%) | 48.0 (55.9%) | 55.1 (73.5%) | 55.0 (70.1%) | 47.8 (82.1%) | 15.6 (78.8%) | 72.1% |
| SparseVLM | 52.7 (85.1%) | 56.2 (86.9%) | 1505 (80.8%) | 75.1 (87.4%) | 62.2 (89.4%) | 68.2 (86.9%) | 51.8 (89.0%) | 17.7 (89.4%) | 86.9% (↑ 14.8%) |

Table 1: Performance of SparseLLaVA under different vision token configurations. The vanilla number of vision tokens is 576. Each cell reports the raw benchmark score followed, in parentheses, by its proportion of the upper bound; the last column gives the average proportion.
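As a quick check, the reduction ratios in the row headers follow directly from the retained token counts and the 576-token baseline:

$$
1 - \tfrac{192}{576} \approx 66.7\%, \qquad
1 - \tfrac{128}{576} \approx 77.8\%, \qquad
1 - \tfrac{64}{576} \approx 88.9\%.
$$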


Figure 1: Performance of MGM equipped with SparseVLM on three multimodal benchmarks. The horizontal axis shows the number of retained vision tokens, and the vertical axis shows accuracy normalized as a percentage of the upper bound. FastV is included for comparison.

Video Understanding Tasks

| Method | TGIF Acc | TGIF Score | MSVD Acc | MSVD Score | MSRVTT Acc | MSRVTT Score | ActivityNet Acc | ActivityNet Score | Avg Acc | Avg Score |
|---|---|---|---|---|---|---|---|---|---|---|
| Video-LLaVA | 47.1 | 3.35 | 69.8 | 3.92 | 56.7 | 3.48 | 43.1 | 3.35 | 100.0% | +0.00 |
| FastV (ECCV24) | 23.1 (49.0%) | 2.47 (-0.88) | 38.0 (54.4%) | 2.71 (-1.21) | 19.3 (34.0%) | 2.02 (-1.46) | 30.6 (71.0%) | 2.82 (-0.53) | 52.1% | -1.02 |
| SparseVLM | 44.7 (94.9%) | 3.29 (-0.06) | 68.2 (97.7%) | 3.90 (-0.02) | 31.0 (54.7%) | 2.68 (-0.80) | 42.6 (98.8%) | 3.32 (-0.03) | 86.5% (↑ 34.4%) | -0.17 (↑ 0.85) |

Table 2: Results of Video-LLaVA with SparseVLM on video question-answering tasks. The original number of video tokens is 2048, which our experiment prunes down to 135. FastV is included for comparison. GPT-3.5 Turbo is adopted for assisted evaluation. Parenthesized values give each accuracy as a proportion of, and each score as a difference from, the Video-LLaVA upper bound.
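Likewise, pruning the 2048 video tokens down to 135 corresponds to a compression ratio of

$$
1 - \tfrac{135}{2048} \approx 93.4\%.
$$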

Visualization of SparseVLM on different VQA prompts

Contact

If you have any questions, please feel free to contact us:

BibTeX

        
@article{zhang2024sparsevlm,
  title={SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference},
  author={Zhang, Yuan and Fan, Chun-Kai and Ma, Junpeng and Zheng, Wenzhao and Huang, Tao and Cheng, Kuan and Gudovskiy, Denis and Okuno, Tomoyuki and Nakata, Yohei and Keutzer, Kurt and others},
  journal={arXiv preprint arXiv:2410.04417},
  year={2024}
}