ACM Multimedia 2025

CapRecover: A Cross-Modality Feature Inversion Attack Framework on Vision Language Models

Kedong Xiu & Sai Qian Zhang
New York University

Overview

In modern "split deployment" AI applications, the vision encoder runs on-device and sends only intermediate features to the cloud. While efficient, this paradigm introduces severe privacy risks. Prior attacks tried to reconstruct the original image from these features, with limited success. Our work poses a critical question: can we bypass image reconstruction entirely and directly recover high-level semantic information, such as captions or class labels, from these intermediate features?
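To make the threat model concrete, here is a minimal sketch of the split-deployment pattern, assuming a torchvision ResNet50 truncated at layer2 as the on-device encoder; the split point and the `device_forward` helper are illustrative, not the paper's exact setup.

```python
import torch
import torchvision.models as models

# Device side: run only the early stages of the vision encoder (here a
# torchvision ResNet50) and transmit the intermediate feature map to the
# cloud. In the CapRecover threat model, the attacker eavesdrops on
# exactly this tensor.
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT).eval()
device_stages = torch.nn.Sequential(
    backbone.conv1, backbone.bn1, backbone.relu, backbone.maxpool,
    backbone.layer1, backbone.layer2,  # split point: features leave the device here
)

@torch.no_grad()
def device_forward(image: torch.Tensor) -> torch.Tensor:
    """Compute intermediate features on-device; only these are sent to the cloud."""
    return device_stages(image)

features = device_forward(torch.randn(1, 3, 224, 224))
print(tuple(features.shape))  # (1, 512, 28, 28) for a layer2 split
```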

Core Contributions

A Novel Attack Framework

We propose CAPRECOVER, the first general cross-modality feature inversion framework that directly recovers semantics without image reconstruction.

Comprehensive Evaluation

We validate the attack's effectiveness across multiple datasets and models, revealing a strong correlation between semantic leakage and network depth.

An Efficient Defense

We propose a simple, efficient, and training-free noise-based defense mechanism that can be deployed on edge devices to protect user privacy.

Figure 1: The CAPRECOVER attack scenario.

Our Method

The CAPRECOVER Framework

To achieve direct semantic recovery, we designed the CAPRECOVER framework, which consists of three core modules (sketched in code after the list):

  • Feature Projection Module: Maps intermediate features of various forms into a unified feature space.
  • Feature-Text Alignment Module: Uses a Q-Former to establish a semantic correspondence between visual features and text descriptions.
  • Description Generation Module: Employs a frozen Large Language Model (LLM) to generate the final text output, greatly improving training efficiency.
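Below is a minimal PyTorch sketch of this three-module pipeline. `FeatureProjection`, `QFormerStandIn`, and the soft-prompt hand-off are toy stand-ins chosen for illustration; the actual system uses a Q-Former and a frozen pretrained LLM.

```python
import torch
import torch.nn as nn

class FeatureProjection(nn.Module):
    """Map intermediate features of varying channel width into a shared space."""
    def __init__(self, in_dim: int, hidden_dim: int = 768):
        super().__init__()
        self.proj = nn.Linear(in_dim, hidden_dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # (B, C, H, W) -> (B, H*W, hidden_dim): treat spatial positions as tokens
        return self.proj(feats.flatten(2).transpose(1, 2))

class QFormerStandIn(nn.Module):
    """Learned queries cross-attend to the projected visual tokens (Q-Former-style)."""
    def __init__(self, hidden_dim: int = 768, num_queries: int = 32):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, hidden_dim))
        layer = nn.TransformerDecoderLayer(hidden_dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        q = self.queries.expand(visual_tokens.size(0), -1, -1)
        return self.decoder(q, visual_tokens)  # (B, num_queries, hidden_dim)

# Only the projection and Q-Former are trained; the aligned query embeddings
# are handed to a *frozen* LLM as soft prompts, which keeps training cheap.
projection = FeatureProjection(in_dim=512)
qformer = QFormerStandIn()
intercepted = torch.randn(1, 512, 28, 28)       # eavesdropped layer2 features
soft_prompt = qformer(projection(intercepted))  # would be fed to the frozen LLM
print(tuple(soft_prompt.shape))                 # (1, 32, 768)
```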

Figure 2: The CAPRECOVER framework comprises three core modules.

Key Results

Caption Reconstruction: Leakage vs. Depth

Our first set of experiments on caption reconstruction revealed that semantic leakage is strongly correlated with network depth. As Figure 3 shows, heatmaps from shallow layers (layer1) concentrate on low-level edges, while heatmaps from deep layers (layer4) attend to high-level semantic regions such as the "person" and the "snowy field".
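For intuition on how such per-layer features can be probed, here is a minimal sketch that captures each ResNet50 stage with forward hooks; the layer names follow torchvision's ResNet, but the snippet is illustrative rather than the paper's evaluation pipeline.

```python
import torch
import torchvision.models as models

backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT).eval()
captured = {}

def grab(name):
    def hook(module, inputs, output):
        captured[name] = output.detach()
    return hook

# Register hooks on each residual stage so a single forward pass yields the
# intermediate features that Table 1 evaluates layer by layer.
for name in ["layer1", "layer2", "layer3", "layer4"]:
    getattr(backbone, name).register_forward_hook(grab(name))

with torch.no_grad():
    backbone(torch.randn(1, 3, 224, 224))

for name, feats in captured.items():
    print(name, tuple(feats.shape))
# layer1 (1, 256, 56, 56) ... layer4 (1, 2048, 7, 7)
```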

Table 1 quantifies this trend: as layer depth increases, BLEU-1 climbs from 0.24 to 0.70 and CIDEr from 0.19 to 0.90, indicating that deeper features leak substantially more recoverable semantic information.

Figure 3: Heatmaps from ResNet50 layers and their generated captions.

Table 1: Semantic recovery performance from different ResNet50 layers (COCO2017)

Middle Layer         BLEU-1   CIDEr   Cosine Similarity
layer1                 0.24    0.19        0.00%
layer2                 0.51    0.31       43.76%
layer3                 0.58    0.55       31.42%
layer4                 0.62    0.68       85.64%
base (final layer)     0.70    0.90       90.52%

Label Reconstruction: High-Fidelity Recovery

In our second set of experiments, we applied CAPRECOVER to label recovery tasks. The results were even more striking. When attacking a CLIP ViT model, we achieved a Top-1 accuracy of 92.71% on CIFAR-10.

As the confusion matrix shows, predictions align almost perfectly with the true labels, demonstrating that even classification-oriented features carry enough information to compromise label privacy.
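For reference, here is a minimal sketch of how Top-1 accuracy and a confusion matrix like Figure 4's can be computed from recovered labels; the `top1_and_confusion` helper and the placeholder arrays are ours, not the paper's data.

```python
import numpy as np

def top1_and_confusion(y_true: np.ndarray, y_pred: np.ndarray, num_classes: int):
    """Return Top-1 accuracy and a num_classes x num_classes confusion matrix."""
    acc = float((y_true == y_pred).mean())
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(cm, (y_true, y_pred), 1)  # rows: true label, cols: predicted label
    return acc, cm

# Placeholder CIFAR-10 labels; in the attack, y_pred comes from CapRecover.
y_true = np.random.randint(0, 10, size=10_000)
y_pred = y_true.copy()
y_pred[:500] = np.random.randint(0, 10, size=500)  # corrupt a few to mimic errors
acc, cm = top1_and_confusion(y_true, y_pred, num_classes=10)
print(f"Top-1 accuracy: {acc:.2%}")
```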

Figure 4: Confusion matrix of label recovery results on the CIFAR-10 test set.

How to Defend?

We propose a simple and highly effective defense: inject random noise into the intermediate features on the client side, then subtract the same noise before the next layer's computation. An eavesdropper observes only the obfuscated features, while the model's output is unchanged:

$F^{(i+1)} = g((F^{(i)} + \epsilon^{(i)}) - \epsilon^{(i)}) = g(F^{(i)})$

This method has zero communication overhead, requires no model retraining, and effectively thwarts the attack.
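A minimal sketch of this add-then-subtract scheme, assuming the noise is regenerated from a seed held on the trusted side; the function names and seed handling are illustrative.

```python
import torch

def defended_handoff(features: torch.Tensor, seed: int) -> torch.Tensor:
    """Add seeded random noise before the features leave the trusted boundary."""
    gen = torch.Generator().manual_seed(seed)
    noise = torch.randn(features.shape, generator=gen)
    return features + noise  # this obfuscated tensor is what an eavesdropper sees

def recover_before_next_layer(noisy: torch.Tensor, seed: int) -> torch.Tensor:
    """Regenerate the identical noise from the seed and subtract it exactly."""
    gen = torch.Generator().manual_seed(seed)
    noise = torch.randn(noisy.shape, generator=gen)
    return noisy - noise

f = torch.randn(1, 512, 28, 28)
noisy = defended_handoff(f, seed=1234)
restored = recover_before_next_layer(noisy, seed=1234)
assert torch.allclose(f, restored)  # g((F + eps) - eps) == g(F): output unchanged
```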

Defense Effectiveness

The table below presents the results of our attack on ResNet50 with and without the noise-based defense. The BLEU-1 scores, which measure caption quality, drop dramatically for deeper layers (layer2 to layer4) when noise is applied, demonstrating the defense's effectiveness.

Table 2: Attack performance (BLEU-1) on ResNet50 w/ and w/o noise (COCO2017)

Defense Status   layer1   layer2   layer3   layer4
Without Noise      0.24     0.51     0.58     0.62
With Noise         0.49     0.03     0.02     0.05

Citation

@article{image2caption_attack2025acmmm,
    title   = {CapRecover: A Cross-Modality Feature Inversion Attack Framework on Vision Language Models},
    author  = {Kedong Xiu and Sai Qian Zhang},
    journal = {arXiv preprint arXiv:2507.22828},
    year    = {2025}
}

@inproceedings{xiu2025caprecover,
    title     = {CapRecover: A Cross-Modality Feature Inversion Attack Framework on Vision Language Models},
    author    = {Kedong Xiu and Sai Qian Zhang},
    booktitle = {Proceedings of the 33rd ACM International Conference on Multimedia (MM)},
    year      = {2025},
}