¹School of Computer Science, Peking University   ²Fudan University
³UC Berkeley   ⁴The University of Sydney   ⁵Panasonic Holdings Corporation
*Equal contribution, ✉Corresponding author |
In vision-language models (VLMs), visual tokens account for a large share of the computational overhead despite carrying sparser information than text tokens. To address this, most existing methods learn a network to prune redundant visual tokens, which requires additional training data. In contrast, we propose SparseVLM, an efficient training-free token optimization mechanism that introduces no extra parameters or fine-tuning costs. Concretely, since visual tokens complement text tokens in VLMs for linguistic reasoning, we select visual-relevant text tokens to rate the significance of vision tokens within the self-attention matrix extracted from the VLM, and then progressively prune the irrelevant ones.
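As a rough PyTorch sketch of this scoring step (not the exact implementation), the snippet below rates each visual token by the attention it receives from the selected text raters and keeps only the top-scoring tokens; the function names, tensor shapes, and the simple top-k rule are our own illustrative assumptions.

```python
import torch

def rate_visual_tokens(attn: torch.Tensor,
                       vis_idx: torch.Tensor,
                       txt_idx: torch.Tensor) -> torch.Tensor:
    """Score each visual token by the attention it receives from text raters.

    attn    : (num_heads, seq_len, seq_len) self-attention of one decoder layer
    vis_idx : indices of the visual tokens in the sequence
    txt_idx : indices of the selected visual-relevant text tokens (the raters)
    returns : (len(vis_idx),) significance score per visual token
    """
    # text queries attending to visual keys, averaged over heads and raters
    text_to_vis = attn[:, txt_idx][:, :, vis_idx]   # (H, T, V)
    return text_to_vis.mean(dim=(0, 1))             # (V,)

def split_by_significance(vis_idx: torch.Tensor,
                          scores: torch.Tensor,
                          keep: int):
    """Split visual tokens into a kept top-`keep` set and a pruned set."""
    order = scores.argsort(descending=True)
    return vis_idx[order[:keep]], vis_idx[order[keep:]]

# toy example: 8 heads, 576 visual tokens followed by 24 text tokens
attn = torch.rand(8, 600, 600).softmax(dim=-1)
vis_idx, txt_idx = torch.arange(576), torch.arange(576, 600)
scores = rate_visual_tokens(attn, vis_idx, txt_idx)
kept, pruned = split_by_significance(vis_idx, scores, keep=192)
```

Because the scores come from an attention matrix the model already computes, this kind of scoring adds no learned parameters, which is the point of the training-free design.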
To maximize sparsity while retaining essential information, we introduce a rank-based strategy that adaptively determines the sparsification ratio for each layer, together with a token recycling method that compresses pruned tokens into more compact representations. Experimental results show that SparseVLM improves the efficiency of various VLMs across a range of image and video understanding tasks. In particular, LLaVA equipped with SparseVLM reduces FLOPs by 61%~67% at a 78% compression ratio while maintaining 93% of the accuracy.
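As a minimal illustration of these two components, the sketch below scales a layer's keep budget by the relative rank of its text-to-visual attention block and recycles pruned tokens by score-weighted averaging within groups; the concrete scaling rule and grouping scheme are assumptions for illustration rather than the paper's exact formulation.

```python
import torch

def adaptive_keep_count(text_to_vis: torch.Tensor, num_visual: int) -> int:
    """Scale how many visual tokens a layer keeps by the relative rank of
    its text-to-visual attention block: higher rank -> more tokens retained.
    """
    m = text_to_vis.float().mean(dim=0)               # (T, V), averaged over heads
    rank = torch.linalg.matrix_rank(m).item()
    ratio = rank / min(m.shape)                       # in (0, 1]
    return max(1, int(round(ratio * num_visual)))

def recycle_pruned_tokens(pruned_feats: torch.Tensor,
                          pruned_scores: torch.Tensor,
                          num_recycled: int) -> torch.Tensor:
    """Compress pruned tokens into `num_recycled` aggregated tokens by
    grouping them in score order and taking a score-weighted average.
    """
    order = pruned_scores.argsort(descending=True)
    merged = []
    for group in order.chunk(num_recycled):           # roughly equal-sized groups
        w = pruned_scores[group].softmax(dim=0).unsqueeze(-1)   # (|g|, 1)
        merged.append((pruned_feats[group] * w).sum(dim=0))     # (dim,)
    return torch.stack(merged)                        # (num_recycled, dim)
```

The score-weighted average is one simple way to keep some of the information carried by discarded tokens instead of dropping it outright.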
We show four representative cases in which we compute the correlation between the prompt and the image. The darker a word, the stronger its relationship to the image and the more valuable it is as a reference. Some words are irrelevant to the visual domain (e.g., prepositions and pronouns) and should not be considered for visual sparsification. For example, case 3 highlights Tylenol, Advil, and ibuprofen, and case 4 highlights top, sticker, and fridge, while a large proportion of question tokens (shown in light red) carry little visual relevance.
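This observation suggests a simple way to pick the rater tokens before visual sparsification: keep only the text tokens whose attention toward the image is above average. The sketch below follows that idea; the thresholding rule is a simplification we assume for illustration and may differ from the paper's actual selection strategy.

```python
import torch

def select_rater_tokens(attn: torch.Tensor,
                        vis_idx: torch.Tensor,
                        txt_idx: torch.Tensor) -> torch.Tensor:
    """Drop visually irrelevant words (e.g. prepositions, pronouns) and return
    the indices of text tokens whose attention to the image is above average.
    """
    # per-text-token relevance: attention to visual keys, averaged over heads and keys
    relevance = attn[:, txt_idx][:, :, vis_idx].mean(dim=(0, 2))   # (T,)
    return txt_idx[relevance > relevance.mean()]
```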
| Method | GQA | MMB | MME | POPE | SQA | VQA-V2 | VQA-Text | ConB | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| **Upper Bound, 576 Tokens (100%)** | | | | | | | | | |
| Vanilla | 61.9 (100%) | 64.7 (100%) | 1862 (100%) | 85.9 (100%) | 69.5 (100%) | 78.5 (100%) | 58.2 (100%) | 19.8 (100%) | 100% |
| **Retain 192 Tokens (↓66.7%)** | | | | | | | | | |
| ToMe (ICLR23) | 54.3 (87.7%) | 60.5 (93.5%) | 1563 (83.9%) | 72.4 (84.3%) | 65.2 (93.8%) | 68.0 (86.6%) | 52.1 (89.5%) | 17.4 (87.9%) | 88.4% |
| FastV (ECCV24) | 52.7 (85.1%) | 61.2 (94.6%) | 1612 (86.6%) | 64.8 (75.4%) | 67.3 (96.8%) | 67.1 (85.5%) | 52.5 (90.2%) | 18.0 (90.9%) | 88.1% |
| SparseVLM | 57.6 (93.1%) | 62.5 (96.6%) | 1721 (92.4%) | 83.6 (97.3%) | 69.1 (99.4%) | 75.6 (96.3%) | 56.1 (96.4%) | 18.8 (94.9%) | 95.8% (↑7.4%) |
| **Retain 128 Tokens (↓77.8%)** | | | | | | | | | |
| ToMe (ICLR23) | 52.4 (84.7%) | 53.3 (82.4%) | 1343 (72.1%) | 62.8 (73.1%) | 59.6 (85.8%) | 63.0 (80.2%) | 49.1 (84.4%) | 16.0 (80.8%) | 80.4% |
| FastV (ECCV24) | 49.6 (80.1%) | 56.1 (86.7%) | 1490 (80.0%) | 59.6 (69.4%) | 60.2 (86.6%) | 61.8 (78.7%) | 50.6 (86.9%) | 17.1 (86.4%) | 81.9% |
| SparseVLM | 56.0 (90.5%) | 60.0 (92.7%) | 1696 (91.1%) | 80.5 (93.7%) | 67.1 (96.5%) | 73.8 (94.0%) | 54.9 (94.3%) | 18.5 (93.4%) | 93.3% (↑11.4%) |
| **Retain 64 Tokens (↓88.9%)** | | | | | | | | | |
| ToMe (ICLR23) | 48.6 (78.5%) | 43.7 (67.5%) | 1138 (61.1%) | 52.5 (61.1%) | 50.0 (71.9%) | 57.1 (72.7%) | 45.3 (77.8%) | 14.0 (70.7%) | 70.2% |
| FastV (ECCV24) | 46.1 (74.5%) | 48.0 (74.2%) | 1256 (67.5%) | 48.0 (55.9%) | 55.1 (73.5%) | 55.0 (70.1%) | 47.8 (82.1%) | 15.6 (78.8%) | 72.1% |
| SparseVLM | 52.7 (85.1%) | 56.2 (86.9%) | 1505 (80.8%) | 75.1 (87.4%) | 62.2 (89.4%) | 68.2 (86.9%) | 51.8 (89.0%) | 17.7 (89.4%) | 86.9% (↑14.8%) |
Table 1: Performance of SparseLLaVA under different vision token configurations. The vanilla number of vision tokens is 576. Each cell reports the raw benchmark score followed, in parentheses, by its proportion relative to the upper bound; the last column is the average proportion.
Figure 1: Performance of MGM equipped with SparseVLM on three multimodal benchmarks. The horizontal axis shows the number of remaining vision tokens, while the vertical axis shows accuracy normalized as a percentage of the upper bound. FastV is included for comparison.
| Method | TGIF Acc | TGIF Score | MSVD Acc | MSVD Score | MSRVTT Acc | MSRVTT Score | ActivityNet Acc | ActivityNet Score | Avg Acc | Avg Score |
|---|---|---|---|---|---|---|---|---|---|---|
| Video-LLaVA | 47.1 | 3.35 | 69.8 | 3.92 | 56.7 | 3.48 | 43.1 | 3.35 | 100.0% | +0.00 |
| FastV (ECCV24) | 23.1 (49.0%) | 2.47 (-0.88) | 38.0 (54.4%) | 2.71 (-1.21) | 19.3 (34.0%) | 2.02 (-1.46) | 30.6 (71.0%) | 2.82 (-0.53) | 52.1% | -1.02 |
| SparseVLM | 44.7 (94.9%) | 3.29 (-0.06) | 68.2 (97.7%) | 3.90 (-0.02) | 31.0 (54.7%) | 2.68 (-0.80) | 42.6 (98.8%) | 3.32 (-0.03) | 86.5% (↑34.4%) | -0.17 (↑0.85) |
Table 2: Results of Video-LLaVA with SparseVLM on the video question answering task. The original number of video tokens is 2048, which our experiment collectively prunes down to 135 tokens. FastV is included for comparison. GPT-3.5-Turbo is adopted for assistive evaluation.
If you have any questions, please feel free to contact us. If you find our work helpful, please cite:
@article{zhang2024sparsevlm,
title={SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference},
author={Zhang, Yuan and Fan, Chun-Kai and Ma, Junpeng and Zheng, Wenzhao and Huang, Tao and Cheng, Kuan and Gudovskiy, Denis and Okuno, Tomoyuki and Nakata, Yohei and Keutzer, Kurt and others},
journal={arXiv preprint arXiv:2410.04417},
year={2024}
}