CapRecover: A Cross-Modality Feature Inversion Attack Framework on Vision Language Models
Kedong Xiu & Sai Qian Zhang
New York University
In modern "split deployment" AI applications, the vision encoder runs on-device and only intermediate features are sent to the cloud. While efficient, this paradigm introduces severe privacy risks. Prior attacks have focused on reconstructing the input image from these features, with limited success. Our work poses a critical question: can we bypass image reconstruction entirely and directly recover high-level semantic information, such as captions or class labels, from these intermediate features?
- We propose CAPRECOVER, the first general cross-modality feature inversion framework that directly recovers semantics without image reconstruction.
- We validate the attack's effectiveness across multiple datasets and models, revealing a strong correlation between semantic leakage and network depth.
- We propose a simple, efficient, and training-free noise-based defense mechanism that can be deployed on edge devices to protect user privacy.
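To make the threat model concrete, the sketch below shows what a split deployment can look like in practice: an on-device backbone computes intermediate features, and only that tensor is serialized and shipped to the cloud. The choice of ResNet50 and of layer2 as the split point is purely illustrative and not prescribed by the paper.

```python
import io
import torch
from torchvision.models import resnet50, ResNet50_Weights

# On-device vision encoder (illustrative choice of backbone).
encoder = resnet50(weights=ResNet50_Weights.DEFAULT).eval()

# Keep only the stem and the first residual stages: the assumed "split point".
# Everything after this point is assumed to run in the cloud.
on_device = torch.nn.Sequential(
    encoder.conv1, encoder.bn1, encoder.relu, encoder.maxpool,
    encoder.layer1, encoder.layer2,
)

image = torch.randn(1, 3, 224, 224)      # stand-in for the user's photo
with torch.no_grad():
    features = on_device(image)          # (1, 512, 28, 28) intermediate features

# Only the intermediate features leave the device; this serialized payload is
# what a network-level adversary (or an honest-but-curious server) observes,
# and it is exactly the input that CapRecover attacks.
buffer = io.BytesIO()
torch.save(features, buffer)
print("payload bytes:", buffer.getbuffer().nbytes)
```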
 
Figure 1: The CAPRECOVER attack scenario.
To achieve direct semantic recovery, we designed the CAPRECOVER framework, which consists of three core modules:
 
Figure 2: The CAPRECOVER framework comprises three core modules.
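Figure 2 gives the authoritative module breakdown. As a rough sketch of what a cross-modality inversion pipeline of this kind can look like, the snippet below assumes a trainable projector that maps intercepted vision features into soft prompt tokens for a frozen GPT-2 decoder, trained with a standard captioning loss. The projector design, the decoder choice, and the token count are illustrative assumptions, not CapRecover's actual modules.

```python
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel, GPT2Tokenizer

class FeatureProjector(nn.Module):
    """Hypothetical adapter: pools spatial vision features and maps them into
    a fixed number of soft prompt tokens in the decoder's embedding space."""

    def __init__(self, feat_dim: int, embed_dim: int, num_tokens: int = 8):
        super().__init__()
        self.num_tokens = num_tokens
        self.proj = nn.Linear(feat_dim, num_tokens * embed_dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, C, H, W) intermediate features intercepted in transit.
        pooled = feats.mean(dim=(2, 3))                          # (B, C)
        tokens = self.proj(pooled)                               # (B, T * D)
        return tokens.view(feats.size(0), self.num_tokens, -1)   # (B, T, D)


decoder = GPT2LMHeadModel.from_pretrained("gpt2").eval()   # frozen text decoder
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
projector = FeatureProjector(feat_dim=2048, embed_dim=decoder.config.n_embd)

# Intercepted ResNet50 layer4 features (dummy tensor for illustration).
feats = torch.randn(1, 2048, 7, 7)
prefix = projector(feats)                                  # soft prompt tokens

# During training, the projector would be optimized so that captions generated
# conditioned on `prefix` match the ground-truth caption (a standard
# language-model loss over the caption positions).
caption_ids = tokenizer("a person walking across a snowy field",
                        return_tensors="pt").input_ids
caption_embeds = decoder.transformer.wte(caption_ids)
inputs_embeds = torch.cat([prefix, caption_embeds], dim=1)
outputs = decoder(inputs_embeds=inputs_embeds)
print(outputs.logits.shape)   # (1, prefix_len + caption_len, vocab_size)
```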
Our first set of experiments on caption reconstruction revealed that semantic leakage is strongly correlated with network depth. As Figure 3 shows, heatmaps from shallow layers (layer1) focus on low-level edges, while deep layers (layer4) focus on high-level semantic regions such as the "person" and the "snowy field".
The table below quantifies this trend: as layer depth increases, caption quality improves markedly, indicating that deeper features leak more recoverable semantic information.
 
Figure 3: Heatmaps from ResNet50 layers and their generated captions.
| Middle Layer | BLEU-1 | CIDEr | Cosine Similarity | 
|---|---|---|---|
| layer1 | 0.24 | 0.19 | 0.00% | 
| layer2 | 0.51 | 0.31 | 43.76% | 
| layer3 | 0.58 | 0.55 | 31.42% | 
| layer4 | 0.62 | 0.68 | 85.64% | 
| base (final layer) | 0.70 | 0.90 | 90.52% | 
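For reference, the caption metrics in the table can be computed with standard tooling. The snippet below evaluates BLEU-1 and an embedding cosine similarity on a toy caption pair; the NLTK smoothing setting and the sentence-embedding model are illustrative choices, and the paper's exact evaluation setup may differ.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from sentence_transformers import SentenceTransformer, util

reference = "a person walking across a snowy field"
recovered = "a man walks through a snow covered field"

# BLEU-1: unigram overlap between the recovered and ground-truth captions.
bleu1 = sentence_bleu(
    [reference.split()], recovered.split(),
    weights=(1.0, 0, 0, 0),
    smoothing_function=SmoothingFunction().method1,
)

# Cosine similarity in a sentence-embedding space (the embedding model here is
# an illustrative choice; the paper may use a different text encoder).
embedder = SentenceTransformer("all-MiniLM-L6-v2")
emb_ref, emb_rec = embedder.encode([reference, recovered], convert_to_tensor=True)
cosine = util.cos_sim(emb_ref, emb_rec).item()

print(f"BLEU-1: {bleu1:.2f}, cosine similarity: {cosine * 100:.2f}%")
```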
In our second set of experiments, we applied CAPRECOVER to label recovery tasks. The results were even more striking. When attacking a CLIP ViT model, we achieved a Top-1 accuracy of 92.71% on CIFAR-10.
As the confusion matrix shows, predictions align almost perfectly with the true labels, demonstrating that even classification-oriented features contain enough information to compromise privacy with near-perfect accuracy.
 
Figure 4: Confusion matrix of label recovery results on the CIFAR-10 test set.
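A minimal sketch of such a label-recovery attack follows, assuming a simple linear probe trained on frozen CLIP ViT hidden states (via Hugging Face's `openai/clip-vit-base-patch32`). The probe architecture and training details are our assumptions; CapRecover's actual recovery module may be more elaborate.

```python
import torch
import torch.nn as nn
from transformers import CLIPVisionModel, CLIPImageProcessor
from torchvision.datasets import CIFAR10
from torch.utils.data import DataLoader

device = "cuda" if torch.cuda.is_available() else "cpu"

# Frozen CLIP vision encoder; its intermediate hidden states play the role of
# the features intercepted in the split-deployment scenario.
encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32").to(device).eval()
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Lightweight attack head: a single linear probe over the CLS-token features.
# (Hypothetical design; the paper's recovery module may differ.)
head = nn.Linear(encoder.config.hidden_size, 10).to(device)
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def collate(batch):
    images, labels = zip(*batch)
    pixels = processor(images=list(images), return_tensors="pt")["pixel_values"]
    return pixels, torch.tensor(labels)

train_loader = DataLoader(CIFAR10("data", train=True, download=True),
                          batch_size=64, shuffle=True, collate_fn=collate)

# One training pass over CIFAR-10; Top-1 accuracy would then be measured on
# the held-out test split, as in the confusion matrix above.
for pixels, labels in train_loader:
    pixels, labels = pixels.to(device), labels.to(device)
    with torch.no_grad():
        # hidden_states[-1][:, 0] is the final-layer CLS token; swapping the
        # index probes shallower layers, mirroring the depth study above.
        hidden = encoder(pixel_values=pixels, output_hidden_states=True)
        feats = hidden.hidden_states[-1][:, 0]
    logits = head(feats)
    loss = criterion(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```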
We propose a simple and highly effective defense: inject random noise into the intermediate features on the client side and subtract the same noise before the next computation, so legitimate inference is unaffected while any intercepted features are corrupted.
This method incurs zero communication overhead, requires no model retraining, and effectively thwarts the attack.
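A minimal sketch of the idea, assuming elementwise Gaussian noise that the client keeps locally, adds to the features it exposes, and subtracts again before the next computation; the noise distribution and scale are illustrative choices.

```python
import torch

class NoisyFeatureProtector:
    """Illustrative client-side wrapper: adds random noise to a feature tensor
    before it leaves the device and removes the same noise before the next
    computation, so legitimate inference is unchanged."""

    def __init__(self, noise_scale: float = 1.0):
        self.noise_scale = noise_scale
        self._noise = None  # kept only on the client, never transmitted

    def protect(self, features: torch.Tensor) -> torch.Tensor:
        # Sample fresh noise per forward pass and cache it locally.
        self._noise = self.noise_scale * torch.randn_like(features)
        return features + self._noise

    def recover(self, features: torch.Tensor) -> torch.Tensor:
        # Subtract the cached noise right before the next computation.
        return features - self._noise


# Usage sketch: an eavesdropper sees only the noisy features.
protector = NoisyFeatureProtector(noise_scale=1.0)
clean = torch.randn(1, 2048, 7, 7)           # e.g. ResNet50 layer4 features
transmitted = protector.protect(clean)       # what an attacker could intercept
restored = protector.recover(transmitted)    # equals `clean` up to float error
assert torch.allclose(restored, clean, atol=1e-5)
```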
The table below presents the results of our attack on ResNet50 with and without the noise-based defense. The BLEU-1 scores, which measure caption quality, drop dramatically for deeper layers (layer2 to layer4) when noise is applied, demonstrating the defense's effectiveness.
| Defense (BLEU-1) | layer1 | layer2 | layer3 | layer4 |
|---|---|---|---|---|
| Without Noise | 0.24 | 0.51 | 0.58 | 0.62 | 
| With Noise | 0.49 | 0.03 | 0.02 | 0.05 | 
@article{image2caption_attack2025acmmm,
    title   = {CapRecover: A Cross-Modality Feature Inversion Attack Framework on Vision Language Models},
    author  = {Kedong Xiu and Sai Qian Zhang},
    journal = {arXiv preprint arXiv:2507.22828},
    year    = {2025}
}

@inproceedings{xiu2025caprecover,
    title     = {CapRecover: A Cross-Modality Feature Inversion Attack Framework on Vision Language Models},
    author    = {Kedong Xiu and Sai Qian Zhang},
    booktitle = {Proceedings of the 33rd ACM International Conference on Multimedia (MM)},
    year      = {2025}
}