Samyak Jain

I am a research fellow at Microsoft Research India, where I am advised by Navin Goyal. Before joining MSR, I did research internships with David Krueger at the University of Cambridge and with Puneet Dokania at Five AI, Oxford, where my work focused on developing a better understanding of (safety) fine-tuning using mechanistic interpretability. Before that, I completed my Bachelor's and Master's in CSE at the Indian Institute of Technology (BHU) Varanasi, during which I worked with Venkatesh Babu and Sravanti Addepalli at the Indian Institute of Science, Bangalore, on developing adversarially robust defences.

I am very interested in developing a better understanding of the learning dynamics of neural networks, which can help explain intriguing observations such as phase transitions, neural collapse, and simplicity bias. For this, I aim to combine tools from learning theory and mechanistic interpretability. I believe in using toy setups to develop hypotheses that explain real-world observations and then testing these hypotheses at scale. I am motivated to deliver impactful work that enhances our scientific understanding while being useful in practice.

I am also interested in domains related to AI Safety, such as cooperative alignment, adversarial robustness, reward hacking, and safety fine-tuning.

Email  /  CV  /  Google Scholar  /  GitHub  /  Twitter

News
[09/2024] Paper on identifying how jailbreaks bypass safety mechanisms accepted at NeurIPS, 2024.
[06/2024] Paper on identifying how jailbreaks bypass safety mechanisms accepted as a spotlight at the ICML Mechanistic Interpretability workshop, 2024.
[02/2024] Paper identifying the cause of gradient masking in vision transformers accepted to CVPR, 2024.
[01/2024] Paper on mechanistically analyzing the effects of fine-tuning accepted to ICLR, 2024.
[06/2023] Glad to be recognized as an Outstanding Reviewer (top 250 reviewers) at CVPR 2023.
[03/2023] Paper on improving domain generalization via exploring the loss landscape accepted to CVPR, 2023.
[09/2022] Paper on effectively utilizing augmentations for adversarial training accepted to NeurIPS, 2022.
[08/2022] Paper on scaling adversarial robustness beyond standard threat models accepted to ECCV, 2022.
[07/2022] Glad to be recognized as an Outstanding Reviewer (top 200 reviewers) at CVPR 2022.
[04/2022] Glad to be recognized as a Highlighted Reviewer at ICLR 2022.
Publications (* denotes equal contribution)
What Makes and Breaks Safety Fine-tuning? A Mechanistic Study
Samyak Jain, Ekdeep Singh Lubana, Kemal Oksuz, Tom Joy, Philip H.S. Torr, Amartya Sanyal, and Puneet K. Dokania
Advances in Neural Information Processing Systems (NeurIPS), 2024
ICML Workshop on Mechanistic Interpretability, 2024 (Spotlight)
bibtex / arXiv

We use formal languages as a model system to identify the mechanistic changes induced by safety fine-tuning and how jailbreaks bypass these mechanisms, verifying our claims on Llama models.

Mechanistically Analyzing the Effects of Fine-tuning on Procedurally Defined Tasks
Samyak Jain*, Robert Kirk*, Ekdeep Singh Lubana*, Robert P. Dick, Hidenori Tanaka, Edward Grefenstette, Tim Rocktäschel, and David Krueger
International Conference on Learning Representations (ICLR), 2024
bibtex / arXiv

We show that fine-tuning learns a minimal transformation, or "wrapper", on top of a pretrained model's capabilities, using procedural tasks defined with Tracr, PCFGs, and TinyStories.

DART: Diversify-Aggregate-Repeat Training Improves Generalization of Neural Networks
Samyak Jain*, Sravanti Addepalli*, Pawan Kumar Sahu, Priyam Dey, and R. Venkatesh Babu
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023
bibtex / arXiv

We demonstrate state-of-the-art domain generalization performance on DomainBed by effectively traversing the loss landscape: training multiple copies of a model and later merging them to converge onto a flatter minimum.

Efficient and Effective Augmentation Strategy for Adversarial Training
Sravanti Addepalli*, Samyak Jain*, and R. Venkatesh Babu
Advances in Neural Information Processing Systems (NeurIPS), 2022
bibtex / arXiv

We counter the distribution shift caused by using augmentations in adversarial training via separate batch normalization layers specialized for weak and strong augmentations.

Scaling Adversarial Training to Large Perturbation Bounds
Sravanti Addepalli*, Samyak Jain*, Gaurang Sriramanan, and R. Venkatesh Babu
European Conference on Computer Vision (ECCV), 2022
bibtex / arXiv

We demonstrate that standard adversarial training cannot generalize to larger perturbation bounds because the oracle labels of some images flip. We propose oracle-aligned adversarial training to overcome this issue.

Boosting Adversarial Robustness Using Feature Level Stochastic Smoothing
Sravanti Addepalli*, Samyak Jain*, Gaurang Sriramanan* and R. Venkatesh Babu
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPR Workshops), 2022
bibtex / arXiv

We propose a stochastic classifier that learns smoother class boundaries by sampling noise multiple times in its latent space during inference, resulting in improved robustness along with better calibration.

Towards Understanding and Improving Adversarial Robustness of Vision Transformers
Samyak Jain, and Tanima Dutta
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024
bibtex / arXiv

We demonstrate that the softmax in the self-attention of vision transformers causes floating-point errors that lead to gradient masking, highlighting the need for adaptive attacks to evaluate true robustness.