Samyak Jain

I am a research fellow at Microsoft Research India, where I am advised by Navin Goyal. Before joining MSR, I did research internships with David Krueger at the University of Cambridge and with Puneet Dokania at Five AI, Oxford, where my work focused on developing a better understanding of (safety) fine-tuning using mechanistic interpretability. Before that, I completed my Bachelor's and Master's in CSE at the Indian Institute of Technology (BHU) Varanasi, during which I worked with Venkatesh Babu and Sravanti Addepalli at the Indian Institute of Science, Bangalore, on developing adversarially robust defences.

I am very interested in developing a better understanding of the learning dynamics of neural networks, which can help explain intriguing observations such as phase transitions, neural collapse, and simplicity bias. For this, I aim to combine tools from learning theory and mechanistic interpretability. I believe in using toy setups to develop hypotheses that explain real-world observations and then testing these hypotheses at scale. I am motivated to deliver impactful work that enhances our scientific understanding while being useful in practice.

I am also interested in domains related to AI Safety, such as cooperative alignment, adversarial robustness, reward hacking, and safety fine-tuning.

Email  /  CV  /  Google Scholar  /  GitHub  /  Twitter

News
[09/2024] Paper on identifying how jailbreaks bypass safety mechanisms accepted at NeurIPS, 2024.
[06/2024] Paper on identifying how jailbreaks bypass safety mechanisms accepted as a spotlight at the ICML Mechanistic Interpretability workshop, 2024.
[02/2024] Paper identifying the cause of gradient masking in vision transformers accepted to CVPR, 2024.
[01/2024] Paper on mechanistically analyzing the effects of fine-tuning accepted to ICLR, 2024.
[06/2023] Glad to be recognized as an Outstanding Reviewer (top 250 reviewers) at CVPR 2023.
[03/2023] Paper on improving domain generalization via exploring the loss landscape accepted to CVPR, 2023.
[09/2022] Paper on effectively utilizing augmentations for adversarial training accepted to NeurIPS, 2022.
[08/2022] Paper on scaling adversarial robustness beyond standard threat models accepted to ECCV, 2022.
[07/2022] Glad to be recognized as an Outstanding Reviewer (top 200 reviewers) at CVPR 2022.
[04/2022] Glad to be recognized as a Highlighted Reviewer at ICLR 2022.
Publications (* denotes equal contribution)
What Makes and Breaks Safety Fine-tuning? A Mechanistic Study
Samyak Jain, Ekdeep Singh Lubana, Kemal Oksuz, Tom Joy, Philip H.S. Torr, Amartya Sanyal, and Puneet K. Dokania
Advances in Neural Information Processing Systems (NeurIPS), 2024
ICML Workshop on Mechanistic Interpretability, 2024 (Spotlight)
bibtex / arXiv

We use formal languages as a model system to identify the mechanistic changes induced by safety fine-tuning and how jailbreaks bypass these mechanisms, verifying our claims on Llama models.

Mechanistically Analyzing the Effects of Fine-tuning on Procedurally Defined Tasks
Samyak Jain*, Robert Kirk*, Ekdeep Singh Lubana*, Robert P. Dick, Hidenori Tanaka, Edward Grefenstette, Tim Rocktäschel, and David Krueger
International Conference on Learning Representations (ICLR), 2024
bibtex / arXiv

We show that fine-tuning learns a minimal transformation, or "wrapper", on top of a pretrained model's capabilities, using procedural tasks defined with Tracr, PCFGs, and TinyStories.

DART: Diversify-Aggregate-Repeat Training Improves Generalization of Neural Networks
Samyak Jain*, Sravanti Addepalli*, Pawan Kumar Sahu, Priyam Dey, and R. Venkatesh Babu
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023
bibtex / arXiv

We demonstrate state-of-the-art domain generalization performance on DomainBed by effectively traversing the loss landscape: training multiple copies of a model and later merging them to converge onto a flatter minimum.

Efficient and Effective Augmentation Strategy for Adversarial Training
Sravanti Addepalli*, Samyak Jain*, and R. Venkatesh Babu
Advances in Neural Information Processing Systems (NeurIPS), 2022
bibtex / arXiv

We counter the distribution shift caused by using augmentations in adversarial training via separate batch normalization layers specialized for weak and strong augmentations.

Scaling Adversarial Training to Large Perturbation Bounds
Sravanti Addepalli*, Samyak Jain*, Gaurang Sriramanan, and R. Venkatesh Babu
European Conference on Computer Vision (ECCV), 2022
bibtex / arXiv

We demonstrate that standard adversarial training cannot generalize to larger perturbation bounds because the oracle labels of some images flip. We propose oracle-aligned adversarial training to overcome this issue.

Boosting Adversarial Robustness Using Feature Level Stochastic Smoothing
Sravanti Addepalli*, Samyak Jain*, Gaurang Sriramanan* and R. Venkatesh Babu
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPR Workshops), 2022
bibtex / arXiv

We propose a stochastic classifier that learns smoother class boundaries by sampling noise multiple times in its latent space during inference, resulting in improved robustness along with better calibration.

Towards Understanding and Improving Adversarial Robustness of Vision Transformers
Samyak Jain, and Tanima Dutta
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024
bibtex / arXiv

We demonstrate that the softmax in the self-attention of vision transformers causes floating-point errors that lead to gradient masking, highlighting the need for adaptive attacks to evaluate true robustness.