CSCI 699: Privacy-Preserving Machine Learning

Course Description and Objectives

This course focuses on the foundations of privacy-preserving machine learning. Extremely personal data is being collected at an unprecedented scale by ML companies. While training ML models on such confidential data can be highly beneficial, it also comes with huge privacy risks. This course addresses the dual challenge of maximizing the utility of machine learning models while protecting individual privacy. We will cover the following topics: differential privacy; private training of ML models; privacy attacks and audits; federated and decentralized machine learning.

This course will prepare you to rigorously identify, reason about, and manage privacy risks in machine learning. You will learn to design algorithms that protect sensitive information, and to analyze the privacy leakage of any ML system. Additionally, the course will introduce you to cutting-edge research and practical applications. By the end of the course, you will be well-equipped to undertake research and address real-world privacy challenges in machine learning.

To provide anonymous feedback at any point in the course, please use this anonymous form.

Prerequisites

While there are no official prerequisites, knowledge of advanced probability (at the level of MATH 505a), linear algebra and multi-variable calculus (at the level of MATH 225), analysis of algorithms (at the level of CSCI 570), introductory statistics and hypothesis testing (at the level of MATH 308), and machine learning (at the level of CSCI 567) is recommended.

Syllabus

Week	Topics/Daily Activities	Additional Readings	Deliverables
Week 1	Theory: Introduction to anonymity and data privacy; Data anonymization techniques; De-anonymization attacks; Linkage and Reconstruction attacks. Practical: Implement some linkage attacks (bring laptop).	Zhang et al. 2020. The Secret Revealer: Generative Model-Inversion Attacks Against Deep Neural Networks Haim et al. 2022. Reconstructing Training Data from Trained Neural Networks Carlini et al. 2021. Is Private Learning Possible with Instance Encoding? Orekondy et al. 2019. Knockoff Nets: Stealing Functionality of Black-Box Models	Lab-1a (solution) Lab-1b (solution) Week 1 slides
Week 2	Theory: Differential Privacy; Randomized response; Laplace mechanism; Hypothesis testing interpretation.	Dong et al. 2019. Gaussian Differential Privacy	Annotated week 2 slides Homework 1 (due Sep 20 on Brightspace)
Week 3	Theory: ML training; gradient descent; SGD.	Reddi et al. 2019. On the Convergence of Adam and Beyond Zhang et al. 2020. Why are Adaptive Methods Good for Attention Models? Li et al. 2022. Private Adaptive Optimization with Side Information	Week 3 slides
Week 4	Theory: Private ML training; DP-SGD; Gaussian DP; Sub-sampling; Composition. Practical: Opacus Library for private deep learning (bring laptop).	Abadi et al. 2016. Deep Learning with Differential Privacy Kairouz et al. 2021. Practical and Private (Deep) Learning Without Sampling or Shuffling Denisov et al. 2022. Improved Differential Privacy for SGD via Optimal Private Linear Operators on Adaptive Streams Cohen et al. 2024. Data Reconstruction: When You See It and When You Don't	HW 1 due before class. Week 4 slides Annotated slides (due Sep 27) Lab 3 HW 2 (due Sep 27) HW practical
Week 5	Theory: Practical Privacy auditing; Designing powerful membership inference attacks; Measuring the influence of training data. Presentations	Tramer et al. 2022. Debugging Differential Privacy: A Case Study for Privacy Auditing. Steinke et al. 2024. Privacy Auditing with One (1) Training Run. Lesci et al. 2024. Causal Estimation of Memorisation Profiles Aerni et al. 2024. Evaluations of Machine Learning Privacy Defenses are Misleading	HW 2 due before class. Week 5 slides Annotated slides
Week 6	Theory: Privacy in LLMs; RLHF/prompt engineering for privacy; Data stealing attacks; private in-context learning.	Shi et al. 2024. Detecting Pretraining Data from LLMs Nasr et al. 2023. Scalable Extraction of Training Data from (Production) Language Models Yu et al. 2024. Privacy-Preserving Instructions for Aligning Large Language Models Wu et al. 2023. Privacy-Preserving In-Context Learning for Large Language Models Debenedetti et al. 2024. Dataset and Lessons Learned from the 2024 SaTML LLM Capture-the-Flag Competition	Week 6 slides Annotated slides
Fall break
Week 7	Theory: Unlearning algorithms; guarantees; Model editing and correcting. Practical: Implement unlearning (bring laptop).	Izzo et al. 2021. Approximate Data Deletion from Machine Learning Models. Sekhari et al. 2021. Remember What You Want to Forget: Algorithms for Machine Unlearning Zhang et al. 2024. Negative Preference Optimization: From Catastrophic Collapse to Effective Unlearning Meng et al. 2022. Locating and Editing Factual Associations in GPT Pawelczyk et al. 2024. In-Context Unlearning: Language Models as Few Shot Unlearners	Decide project topic. HW postponed. Week 7 slides Annotated slides
Week 8	Theory: Decentralized privacy; Local DP. Confidential Computing: Guest lecture by Mengyuan Li. Practical: Comparing local vs. central DP (bring laptop).	Erlingsson et al. 2014. RAPPOR: Randomized Aggregatable Privacy-Preserving Ordinal Response Kasiviswanathan et al. 2011. What Can We Learn Privately? Dwork et al. 2006. Our Data, Ourselves: Privacy via Distributed Noise Generation Eichner et al. 2024. Confidential Federated Computations
Week 9	Theory: Federated learning; challenges due to data heterogeneity, communication compression; Privacy attacks in FL.	Wang et al. 2021. Field Guide to Federated Optimization. Geiping et al. 2020. Inverting Gradients - How easy is it to break privacy in federated learning? Fowl et al. 2022. Robbing the Fed: Directly Obtaining Private Data in Federated Learning with Modified Models	Week 9 slides
Week 10	Theory: Privacy in FL; Secure aggregation; Quantized DP.	Bonawitz et al. 2022. Federated Learning and Privacy Bonawitz et al. 2016. Practical Secure Aggregation for Federated Learning on User-Held Data Chen et al. 2022. The Poisson binomial mechanism for secure and private federated learning (Amplified) Banded Matrix Factorization: A unified approach to private training	Week 10 slides
Week 11	Theory: Privacy in Practice; Incentives; Relation to Copyright law.	Brown et al. 2022. What Does it Mean for a Language Model to Preserve Privacy? NY Times 2024. Consent in Crisis: The Data That Powers A.I. Is Disappearing Fast Wei et al. 2024. Proving membership in LLM pretraining data via data watermarks Duarte et al. 2024. DE-COP: Detecting Copyrighted Content in Language Models Training Data Elkin-Koren et al. 2024. Can Copyright be Reduced to Privacy?	Week 11 slides
Week 12	Student presentations		Week 12 slides
Final	Final project report	Report due on the university-scheduled date of the final exam.

Grading

Three assignments are worth 30% of the grade. Collaboration is allowed but must be stated. Grades are based on correctness. The theory part should be written in LaTeX and the coding part in Jupyter Python notebooks.
Course Presentation and Project (55% of the grade):
- Presentations (25%): Students will be assigned a paper based on their interest and will present it in class for 30 minutes.
- Project (30%): Students will write a 4-page report on 1-2 papers. The report may cover the paper they presented, supplemented by related readings, or one or two different papers of their choice. Pursuing a personal research topic is strongly encouraged.
Discussions and participation will count for 15%. This will involve reviewing, commenting, and discussing each other's presentations and projects using the role-playing reading group format.

Resources

There are no required textbooks. The following writeups are excellent supplemental readings and may be used as references.

C. Dwork and A. Roth. The Algorithmic Foundations of Differential Privacy. Foundations and Trends in Theoretical Computer Science, 2014. pdf. Reference for DP.
Nissim et al. Differential Privacy: A Primer for a Non-technical Audience. Vanderbilt Journal of Entertainment & Technology Law, 2018. pdf. Great read with many examples tying legal definitions and privacy in practice.
Kairouz et al. Advances and Open Problems in Federated Learning. Community survey on federated learning. pdf.

This course builds on several related courses which can serve as valuable additional references:

Privacy-Preserving Machine Learning by Aurelien Bellet at Inria (link)
Trustworthy Machine Learning by Reza Shokri at NUS (link)
Federated and Collaborative Learning by Virginia Smith at CMU (link)
Large Scale Optimization for Machine Learning (ISE 633) by Meisam Razaviyayn at USC (link)
Digital Privacy by Vitaly Shmatikov at Cornell (link)