Enhancing AI Safety: Improving Adversarial Robustness in Vision Language Models

The Research Question

How can we improve the adversarial robustness of Vision Language Models (VLMs) to ensure their safe deployment in critical applications? This question drives our exploration into focused adversarial training techniques that improve the security of these models without excessive computational costs.

Adversarial Robustness and AI Safety

Adversarial attacks involve subtle manipulations of input data designed to deceive machine learning models into making incorrect predictions. In the context of VLMs, these attacks can have severe implications, especially when these models are deployed in sensitive areas such as autonomous driving, healthcare, and content moderation.

Enhancing the adversarial robustness of VLMs is crucial for AI safety. Robust models can withstand adversarial inputs, ensuring reliable performance and preventing malicious exploitation. Our research focuses on a novel approach to achieve this robustness by selectively re-training components of the multimodal architecture.

Our Approach

Traditional methods to improve model robustness often involve adversarial training, which integrates adversarial examples into the training process. However, this can be computationally intensive, particularly for complex models like VLMs that process images and text.

Our study introduces a more efficient strategy: adversarially re-training only the language model component of the VLM. This targeted approach leverages the Fast Gradient Sign Method (FGSM) to generate adversarial examples and incorporates them into the training of the text decoder. We maintain computational efficiency by keeping the image encoder fixed while significantly enhancing the model’s overall robustness.

Key Findings

  1. Adversarial Training Efficiency: Adversarially re-training only the language model yields robustness comparable to full adversarial training, with reduced computational demands.
  2. Selective Training Impact: Freezing the image encoder and focusing on the text decoder maintains high performance and robustness. In contrast, training only the image encoder results in a significant performance drop.
  3. Benchmark Results: Experiments on the Flickr8k and COCO datasets demonstrate that our selective adversarial training approach effectively mitigates the impact of adversarial attacks, as evidenced by improved BLEU scores and model performance under adversarial conditions.

Implications for Ethical AI

Our findings support the development of more robust and secure AI systems, which is crucial for ethical AI deployment. By focusing on adversarial robustness, we contribute to the broader goal of AI safety, ensuring that multimodal models can be trusted in real-world applications.

For a detailed exploration of our methodology and findings, read the full paper pre-print: https://arxiv.org/abs/2407.21174

References

  • Rashid, M.B., & Rivas, P. (2024). AI Safety in Practice: Enhancing Adversarial Robustness in Multimodal Image Captioning. 3rd Workshop on Ethical Artificial Intelligence: Methods and Applications, ACM SIGKDD’24. https://arxiv.org/abs/2407.21174

About the Author

Maisha Binte Rashid is a Ph.D. student at Baylor University, specializing in AI safety and multimodal machine learning.

Evaluating Robustness of Reconstruction Models with Adversarial Networks

K. Sooksatra, G. Bejarano, and P. Rivas, “Evaluating Robustness of Reconstruction Models with Adversarial Networks,” Procedia Computer Science, vol. 222, pp. 353-366, 2023. https://doi.org/10.1016/j.procs.2023.08.174.

In the ever-evolving landscape of artificial intelligence, our lab has made a significant breakthrough with our latest publication featured in Procedia Computer Science. This research, spearheaded by Korn Sooksatra, delves into the critical domain of adversarial robustness, mainly focusing on reconstruction models, which, until now, have been a less explored facet of adversarial research. This paper was accepted initially into IJCNN and chosen to be added to the INNS workshop and published as a journal article.

Key Takeaways:

  1. Innovative Frameworks: The team introduced two novel frameworks for assessing adversarial robustness: the standard framework, which perturbs input images to deceive reconstruction models, and the universal-attack framework, which generates adversarial perturbations from a dataset’s distribution.
  2. Outperforming Benchmarks: Through rigorous testing on MNIST and Cropped Yale Face datasets, these frameworks demonstrated superior performance in altering image reconstruction and classification, surpassing existing state-of-the-art adversarial attacks.
  3. Enhancing Model Resilience: A pivotal aspect of the study was using these frameworks to retrain reconstruction models, significantly improving their defense against adversarial perturbations and showcasing an ethical application of adversarial networks.
  4. Latent Space Analysis: The research also included a thorough examination of the latent space, ensuring that adversarial attacks do not compromise the underlying features that are crucial for reconstruction integrity.

Broader Impact:

The implications of this research are profound for the AI community. It not only presents a method to evaluate and enhance the robustness of reconstruction models but also opens avenues for applying these frameworks to other image-to-image applications. The lab’s work is a call to the AI research community to prioritize the development of robust AI systems that can withstand adversarial threats, ensuring the security and reliability of AI applications across various domains.

Future Directions:

While the frameworks developed are groundbreaking, the team acknowledges the need for reduced preprocessing time to enhance practicality. Future work aims to refine these frameworks and extend their application to other domains, such as video keypoint interpretation, anomaly detection, and graph prediction.

The result of our standard framework without the discriminator on the left is from the VAE, and on the right is from the VAEGAN. The images 1) in the first column are clean; 2) in the second column are the reconstructed images for the images in the first column; 3) in the third column are adversarial examples concerning the images in the first column; 4) in the last column are the reconstructed images for the adversarial examples.

On Adversarial Examples for Text Classification by Perturbing Latent Representations

Recently, with the advancement of deep learning, several applications in text classification have advanced significantly. However, this improvement comes with a cost because deep learning is vulnerable to adversarial examples. This weakness indicates that deep learning is not very robust.

Fortunately, a couple of students in our lab, Korn Sooksatra and Bikram Khanal, noticed that the input of a text classifier is discrete. Hence, it can prevent the classifier from state-of-the-art attacks. Nonetheless, previous works have generated black-box attacks that successfully manipulate the discrete values of the input to find adversarial examples. Therefore, instead of changing the discrete values, they transform the input into its embedding vector containing real values to perform the state-of-the-art white-box attacks. Then, they convert the perturbed embedding vector back into a text and name it an adversarial example. In summary, PhD candidates, Sooksatra and Khanal, create a framework that measures the robustness of a text classifier by using the gradients of the classifier.

This paper was accepted for presentation and publication at the LXAI workshop at NeurIPS 2022 in New Orleans, LA. Download the paper here: [ bib |  .pdf ]

SICEM: A Sensitivity-Inspired Constrained Evaluation Method for Adversarial Attacks on Classifiers with Occluded Input Data

In the rapidly evolving field of artificial intelligence, understanding the sensitivity of models to adversarial attacks is crucial. In our recent paper, Korn Sooksatra introduces the Sensitivity-inspired constrained evaluation method (SICEM) to address this concern.

Sooksatra, K., Rivas, P. Evaluation of adversarial attacks sensitivity of classifiers with occluded input data. Neural Comput & Applic 34, 17615–17632 (2022). https://doi.org/10.1007/s00521-022-07387-y

Understanding SICEM

Our proposed method, SICEM, evaluates the vulnerability of an incomplete input against an adversarial attack in comparison to a complete one. This is achieved by leveraging the Jacobian matrix concept. The sensitivity of the target classifier’s output to each attribute of the input is calculated, providing a comprehensive understanding of how changes in the input can affect the output.

    \[ s(x,y)_i =  \left|\min \left(0, \frac{\partial Z(x)_y}{\partial x_i} \cdot \left(\sum_{y^{'} \neq y} \frac{\partial Z(x)_{y^{'}}}{\partial x_i}\right) \cdot C(y, 1, 0)_i\right)\right| \]

This sensitivity score gives us an insight into how much each attribute of the input contributes to the output’s sensitivity. The score is then used to estimate the overall sensitivity of the given input and its mask.

    \[ S(x, M)_y = \sum_{i=0}^{n-1} (s(x, y)_i \cdot M_i) \]

For a complete input, the sensitivity ratio provides a comparative measure of how sensitive the classifier’s output is for an incomplete input versus a complete one.

Results and Implications

Our focus was on an automobile image from the CIFAR-10 dataset. Interestingly, adversarial examples generated by FGSM and IGSM required the same value of \epsilon, which was significantly lower than for other images. This can be attributed to the layer-wise linearity of the classifier. Larger inputs, like the automobile image, require a smaller \epsilon to create an adversarial example. However, JSMA required a higher \epsilon due to the metric of L_0 norm.

Understanding the sensitivity of AI models is paramount in ensuring their robustness against adversarial attacks. The SICEM method provides a comprehensive tool to ensure safer and more reliable AI systems. Read the full paper here [ bib |  .pdf ].

Evaluating Accuracy and Adversarial Robustness of Quanvolutional Neural Networks

A combination of a quantum circuit and a convolutional neural network (CNN) can have better results over a classic CNN in some cases. In our recent article, we show an example of such a case, using accuracy and adversarial examples as measures of performance and robustness. Check it out: [ bib | pdf ]

Enhancing Adversarial Examples on Deep QNetworks with Previous Information

This work finds strong adversarial examples for Deep Q Networks which are famous deep reinforcement learning models. We combine two subproblems of finding adversarial examples in deep reinforcement learning: finding states to perturb and determining how much to perturb. Therefore, the attack can jointly optimize this problem. Further, we trained Deep Q Networks to play Atari games: Breakout and Space Invader. Then, we used our attack to find adversarial examples on those games. As a result, we can achieve state-of-the-art results and showed that our attack is natural and stealthy. Paper: [ bib | pdf ]

An Adversarial Neural Cryptography Approach to Integrity Checking: Learning to Secure Data Communications

Securing communications is an increasingly challenging problem. While communication channels can be secured using strong ciphers, attackers who gain access to the channel can still perform certain types of attacks. One way to mitigate such attacks is to verify the integrity of exchanging messages between two parties or more. While there are robust integrity check mechanisms currently, these lack variety, and very few are based on machine learning. This paper presents a methodology for performing an integrity check inspired by recent advances in neural cryptography. We provide formal, mathematical functions and an optimization problem for training an adversarial neural cryptography architecture. The proposed neural architectures can adequately solve the problem. In our experiments, a receiver can verify if incoming messages are authentic or altered with an accuracy greater than 99%. This work expands the repertoire of integrity checking methodologies, provides a unique perspective based on neural networks and facilitates data security and privacy. Paper: [ bib , pdf ]

Training model for integrity check