SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference

Yuan Zhang1*  Chun-Kai Fan1*  Junpeng Ma2*  Wenzhao Zheng✉,3  Tao Huang4  Kuan Cheng1 Denis Gudovskiy5  Tomoyuki Okuno5  Yohei Nakata5  Kurt Keutzer3  Shanghang Zhang✉,1 

1School of Computer Science, Peking University  2Fudan University 

3UC Berkeley  4The University of Sydney  5Panasonic Holdings Corporation 

*Equal contribution, ✉Corresponding author



Abstract

In vision-language models (VLMs), visual tokens account for a large share of the computational overhead despite carrying sparser information than text tokens. To address this, most existing methods learn a network to prune redundant visual tokens, which requires additional training data. In contrast, we propose SparseVLM, an efficient training-free token optimization mechanism that adds no extra parameters or fine-tuning cost. Concretely, since visual tokens complement text tokens in VLMs during linguistic reasoning, we select visually relevant text tokens to rate the significance of vision tokens using the self-attention matrix extracted from the VLM, and then progressively prune irrelevant tokens.



To maximize sparsity while retaining essential information, we introduce a rank-based strategy that adaptively determines the sparsification ratio for each layer, alongside a token recycling method that compresses pruned tokens into more compact representations. Experimental results show that SparseVLM improves the efficiency of various VLMs across a range of image and video understanding tasks. In particular, LLaVA equipped with SparseVLM reduces FLOPs by 61%-67% at a 78% compression ratio while maintaining 93% of the accuracy.

Sample prompts from four representative multimodal benchmarks

We show four representative cases in which we compute the correlation between the prompt and the image. The darker a word, the stronger its relationship to the image and the more valuable it is as a reference. Some words are irrelevant to the visual domain (e.g., prepositions and pronouns) and should not guide visual sparsification. For example, case 3 highlights Tylenol, Advil, and ibuprofen, and case 4 highlights top, sticker, and fridge, while a large proportion of the question tokens (shown in light red) carry little visual relevance.



Our Pipeline

  • Relevant Text Token Selection. Before the LLM forward pass, we pre-select relevant text tokens to serve as text raters. As the example prompts from four different benchmarks show, it is not appropriate to use all text tokens as references for visual sparsification. We therefore compute the similarity between the prompt and the image and select the tokens whose similarity exceeds the mean similarity value as text raters (minimal sketches of the individual steps follow this list).
  • Estimation of Visual Token Significance. To decide whether a visual token should be removed, we estimate how relevant it is to the textual tokens. We reuse the self-attention logits of the VLM's transformer layers as this reference, since they already contain language-to-vision query results.
  • Sparsification Level Adaptation. We further propose a rank-based strategy to adaptively determine the level of vision sparsification at each decoder layer: the gap between the dimension and the rank of the self-attention logits reflects the layer's redundancy.
  • Token Aggregation. From the deleted pool, we first recycle the pruned visual tokens with the top-k highest values in the self-attention logits. We then group the recycled tokens with a k-nearest-neighbor density peak aggregation algorithm for adaptive token aggregation.
  • Token Reconstruction. After aggregation, recycled tokens with similar semantics fall into the same group, and each group is compressed into a single compact token representation.
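Below is a minimal PyTorch-style sketch of the text-rater selection and visual-token scoring steps. The tensor names (text_hidden, vision_hidden, attn_logits), the shapes, and the mean-threshold rule over averaged similarities are illustrative assumptions, not the released implementation.

```python
import torch

def select_text_raters(text_hidden: torch.Tensor, vision_hidden: torch.Tensor) -> torch.Tensor:
    """Return a boolean mask over text tokens whose average similarity to the
    visual tokens exceeds the mean (illustrative thresholding rule)."""
    # text_hidden: (num_text, dim), vision_hidden: (num_vision, dim)
    sim = text_hidden @ vision_hidden.T          # (num_text, num_vision)
    relevance = sim.mean(dim=1)                  # per-text-token relevance
    return relevance > relevance.mean()          # keep above-average tokens as raters

def visual_token_significance(attn_logits: torch.Tensor,
                              rater_idx: torch.Tensor,
                              vision_idx: torch.Tensor) -> torch.Tensor:
    """Score each visual token by the attention it receives from the selected
    text raters, reusing one decoder layer's self-attention logits."""
    # attn_logits: (num_heads, seq_len, seq_len)
    cross = attn_logits[:, rater_idx][:, :, vision_idx]   # text-to-vision block
    scores = cross.softmax(dim=-1).mean(dim=(0, 1))       # (num_vision,)
    return scores                                          # low score => pruning candidate
```

Visual tokens with the lowest scores at a given layer become that layer's pruning candidates.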
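The next sketch covers the rank-based sparsification level adaptation; the redundancy signal (dimension minus numerical rank of the text-to-vision logits) follows the description above, while scaling it into a per-layer deletion budget is an assumed design choice.

```python
import torch

def adaptive_prune_budget(cross_logits: torch.Tensor, num_vision: int) -> int:
    """Estimate how many visual tokens a layer can afford to drop.

    cross_logits: (num_raters, num_vision) text-to-vision attention logits.
    The gap between the matrix dimension and its numerical rank serves as a
    redundancy signal: the larger the gap, the more tokens are pruned here.
    """
    rank = torch.linalg.matrix_rank(cross_logits.float()).item()
    dim = min(cross_logits.shape)
    redundancy = (dim - rank) / dim               # in [0, 1)
    return int(redundancy * num_vision)           # tokens to delete at this layer
```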
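Finally, a sketch of token recycling, density-peak aggregation, and reconstruction; the recycling ratio, the neighborhood size k, the number of groups, and the mean-pooled reconstruction are illustrative choices rather than the paper's exact settings.

```python
import torch

def recycle_and_aggregate(pruned_tokens: torch.Tensor,
                          pruned_scores: torch.Tensor,
                          keep_ratio: float = 0.25,
                          k: int = 5,
                          num_groups: int = 4) -> torch.Tensor:
    """Recycle the most informative pruned tokens, group them with a
    k-nearest-neighbor density peak rule, and compress each group into one token."""
    # 1) Recycle: keep the top-scoring tokens from the deleted pool.
    num_keep = max(1, int(keep_ratio * pruned_tokens.size(0)))
    recycled = pruned_tokens[pruned_scores.topk(num_keep).indices]  # (num_keep, dim)
    if num_keep < 2:
        return recycled                           # too few tokens to cluster

    # 2) Density: a token is dense if its k nearest neighbors are close.
    dist = torch.cdist(recycled, recycled)        # pairwise distances
    knn = dist.topk(min(k, num_keep - 1) + 1, largest=False).values[:, 1:]  # drop self
    density = (-knn.mean(dim=1)).exp()

    # 3) Peaks: the densest tokens act as group centers; assign each token
    #    to its nearest center.
    centers = density.topk(min(num_groups, num_keep)).indices
    assign = dist[:, centers].argmin(dim=1)

    # 4) Reconstruction: compress each group into a single compact token.
    compact = []
    for g in range(centers.numel()):
        members = recycled[assign == g]
        # fall back to the center itself if a group ended up empty (tie case)
        compact.append(members.mean(dim=0) if members.numel() else recycled[centers[g]])
    return torch.stack(compact)                   # (num_groups_used, dim)
```

The resulting compact tokens can then be appended back to the sequence in place of the full deleted pool, which is how pruned information is retained in more compact form.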

Experiment Results

Image Understanding Tasks

| Method | GQA | MMB | MME | POPE | SQA | VQAV2 | VQAText | ConB | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| Upper Bound, 576 Tokens (100%) | | | | | | | | | |
| Vanilla | 61.9 (100%) | 64.7 (100%) | 1862 (100%) | 85.9 (100%) | 69.5 (100%) | 78.5 (100%) | 58.2 (100%) | 19.8 (100%) | 100% |
| Retain 192 Tokens (↓ 66.7%) | | | | | | | | | |
| ToMe (ICLR23) | 54.3 (87.7%) | 60.5 (93.5%) | 1563 (83.9%) | 72.4 (84.3%) | 65.2 (93.8%) | 68.0 (86.6%) | 52.1 (89.5%) | 17.4 (87.9%) | 88.4% |
| FastV (ECCV24) | 52.7 (85.1%) | 61.2 (94.6%) | 1612 (86.6%) | 64.8 (75.4%) | 67.3 (96.8%) | 67.1 (85.5%) | 52.5 (90.2%) | 18.0 (90.9%) | 88.1% |
| SparseVLM | 57.6 (93.1%) | 62.5 (96.6%) | 1721 (92.4%) | 83.6 (97.3%) | 69.1 (99.4%) | 75.6 (96.3%) | 56.1 (96.4%) | 18.8 (94.9%) | 95.8% (↑ 7.4%) |
| Retain 128 Tokens (↓ 77.8%) | | | | | | | | | |
| ToMe (ICLR23) | 52.4 (84.7%) | 53.3 (82.4%) | 1343 (72.1%) | 62.8 (73.1%) | 59.6 (85.8%) | 63.0 (80.2%) | 49.1 (84.4%) | 16.0 (80.8%) | 80.4% |
| FastV (ECCV24) | 49.6 (80.1%) | 56.1 (86.7%) | 1490 (80.0%) | 59.6 (69.4%) | 60.2 (86.6%) | 61.8 (78.7%) | 50.6 (86.9%) | 17.1 (86.4%) | 81.9% |
| SparseVLM | 56.0 (90.5%) | 60.0 (92.7%) | 1696 (91.1%) | 80.5 (93.7%) | 67.1 (96.5%) | 73.8 (94.0%) | 54.9 (94.3%) | 18.5 (93.4%) | 93.3% (↑ 11.4%) |
| Retain 64 Tokens (↓ 88.9%) | | | | | | | | | |
| ToMe (ICLR23) | 48.6 (78.5%) | 43.7 (67.5%) | 1138 (61.1%) | 52.5 (61.1%) | 50.0 (71.9%) | 57.1 (72.7%) | 45.3 (77.8%) | 14.0 (70.7%) | 70.2% |
| FastV (ECCV24) | 46.1 (74.5%) | 48.0 (74.2%) | 1256 (67.5%) | 48.0 (55.9%) | 55.1 (73.5%) | 55.0 (70.1%) | 47.8 (82.1%) | 15.6 (78.8%) | 72.1% |
| SparseVLM | 52.7 (85.1%) | 56.2 (86.9%) | 1505 (80.8%) | 75.1 (87.4%) | 62.2 (89.4%) | 68.2 (86.9%) | 51.8 (89.0%) | 17.7 (89.4%) | 86.9% (↑ 14.8%) |

Table 1: Performance of SparseLLaVA under different vision token configurations. The vanilla number of vision tokens is 576. Each cell reports the raw benchmark score followed, in parentheses, by its proportion of the upper bound; the last column gives the average proportion.
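As a quick check, the reduction ratios in the row headers follow directly from the retained token counts and the 576-token baseline:

$$
1 - \tfrac{192}{576} \approx 66.7\%, \qquad
1 - \tfrac{128}{576} \approx 77.8\%, \qquad
1 - \tfrac{64}{576} \approx 88.9\%.
$$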


Figure 1: Performance of MGM equipped with SparseVLM on three multimodal benchmarks. The horizontal axis shows the number of retained vision tokens, and the vertical axis shows accuracy normalized as a percentage of the upper bound. FastV is included for comparison.

Video Understanding Tasks

| Method | TGIF Acc | TGIF Score | MSVD Acc | MSVD Score | MSRVTT Acc | MSRVTT Score | ActivityNet Acc | ActivityNet Score | Avg Acc | Avg Score |
|---|---|---|---|---|---|---|---|---|---|---|
| Video-LLaVA | 47.1 | 3.35 | 69.8 | 3.92 | 56.7 | 3.48 | 43.1 | 3.35 | 100.0% | +0.00 |
| FastV (ECCV24) | 23.1 (49.0%) | 2.47 (-0.88) | 38.0 (54.4%) | 2.71 (-1.21) | 19.3 (34.0%) | 2.02 (-1.46) | 30.6 (71.0%) | 2.82 (-0.53) | 52.1% | -1.02 |
| SparseVLM | 44.7 (94.9%) | 3.29 (-0.06) | 68.2 (97.7%) | 3.90 (-0.02) | 31.0 (54.7%) | 2.68 (-0.80) | 42.6 (98.8%) | 3.32 (-0.03) | 86.5% (↑ 34.4%) | -0.17 (↑ 0.85) |

Table 2: Results of Video-LLaVA with SparseVLM on video question-answering tasks. The original number of video tokens is 2048, which our experiment prunes down to 135. FastV is included for comparison. GPT-3.5 Turbo is adopted for assisted evaluation. Parenthesized values give each accuracy as a proportion of, and each score as a difference from, the Video-LLaVA upper bound.
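Likewise, pruning the 2048 video tokens down to 135 corresponds to a compression ratio of

$$
1 - \tfrac{135}{2048} \approx 93.4\%.
$$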

Visualization of SparseVLM on different VQA prompts

Contact

If you have any questions, please feel free to contact us:

BibTeX

        
@article{zhang2024sparsevlm,
  title={SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference},
  author={Zhang, Yuan and Fan, Chun-Kai and Ma, Junpeng and Zheng, Wenzhao and Huang, Tao and Cheng, Kuan and Gudovskiy, Denis and Okuno, Tomoyuki and Nakata, Yohei and Keutzer, Kurt and others},
  journal={arXiv preprint arXiv:2410.04417},
  year={2024}
}