CSCI 599: Optimization for Machine Learning

Instructor: Sai Praneeth Karimireddy (karimire@usc.edu)    Location: DMC 101    Time: Mon, Wed 4:00 pm to 5:50 pm

Course Description and Objectives

Fast optimization algorithms that scale to massive datasets have been powering the rapid progress in machine learning. We will learn the nuts and bolts of these algorithms and establish their theoretical foundations, with a particular focus on the challenges arising in modern large-scale ML. Some of the questions we will answer:

Note: this course is heavy on theory, though plenty of practical parts are mixed in.

Objectives: The course will prepare you to view machine learning through the formalism of optimization. By the end of the course, you will be able to analyze optimization algorithms and implement them. You will learn the plethora of algorithms used in modern ML and understand the tradeoffs each of them is designed to navigate.

Prerequisites

While there are no official prerequisites, knowledge of Probability (at the level of MATH 505a), Linear Algebra, Multi-Variable Calculus (at the level of MATH 225), Analysis of Algorithms (at the level of CSCI 570), and Machine Learning (at the level of CSCI 567) is recommended.

Grading

  1. Final Exam (50%): Closed-book exam consisting of theoretical questions similar to the exercises. You are allowed to bring one cheat sheet (A4 paper; both sides may be used).
  2. Team Course Project (40%): The course includes a major project where you will present and reproduce a paper on optimization for machine learning in teams of 2-3.
    • Presentation (15%): You will closely read a paper related to the course and give a 20-minute presentation. See this note for advice from Yiling Chen and the evaluation criteria.
    • Report (25%): You will then reproduce the main experiment/theorem of the paper. For a theoretical paper, you will rederive and present the proofs of the main claims as you understand them. For an experimental paper, you will follow the ML Reproducibility Challenge format and attempt to reproduce the paper's main experiment.
  3. Discussion and participation in exercise sessions (10%): There will be a weekly exercise session consisting of a mix of theoretical and practical Python exercises for each corresponding topic. Coming prepared and actively participating in the exercise session discussions counts for 10% of the grade.

Resources

We will mainly use the course lecture notes, adapted from those of Martin Jaggi and Nicolas Flammarion.

Additional resources which you will likely find useful:

There are several related courses and materials which could be useful supplementary references:

Course Schedule

Each week below lists the topics and daily activities, the recommended reading, and the deliverables.
Week 1
  • Lecture:
    • Course overview and objectives
    • Convex sets and convex functions
    • Fenchel conjugate and optimality conditions
  • Exercise Session:
    • Getting started with numpy programming (see the illustrative sketch below)
Reading: Chapter 1 of lecture notes. Deliverable: Exercise 0 released.
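Illustrative sketch (not part of the official course material): in the spirit of the numpy warm-up, the snippet below numerically spot-checks the convexity inequality f(tx + (1-t)y) <= t f(x) + (1-t) f(y) on random segments. The choice of log-sum-exp as the test function, the dimensions, and the tolerance are my own.

```python
# Numerically spot-check convexity of f(x) = log(sum(exp(x))) along random segments.
import numpy as np

def f(x):
    return np.log(np.sum(np.exp(x)))

rng = np.random.default_rng(0)
for _ in range(1000):
    x, y = rng.normal(size=5), rng.normal(size=5)
    t = rng.uniform()
    # Convexity: the function value on the segment lies below the chord.
    assert f(t * x + (1 - t) * y) <= t * f(x) + (1 - t) * f(y) + 1e-9
print("No convexity violations found on 1000 random segments.")
```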
Week 2
  • Lecture:
    • Smoothness and Strong Convexity
    • Gradient descent and its convergence properties (see the sketch below)
  • Exercise Session:
    • Theoretical problems on properties of convex sets and functions
Reading: Chapter 2 of lecture notes. Deliverable: Exercise 1 released.
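Illustrative sketch (not from the lecture notes): for an L-smooth, mu-strongly convex function, gradient descent with step size 1/L contracts the suboptimality by a factor of at most (1 - mu/L) per step. The random positive-definite quadratic, the step size, and the iteration count below are my own choices made only to exhibit this bound.

```python
# Verify f(x_{t+1}) - f* <= (1 - mu/L) (f(x_t) - f*) for GD with step size 1/L
# on a quadratic f(x) = 0.5 * x^T A x (so x* = 0 and f* = 0).
import numpy as np

rng = np.random.default_rng(1)
M = rng.normal(size=(10, 10))
A = M @ M.T + np.eye(10)           # symmetric positive definite
eigs = np.linalg.eigvalsh(A)
mu, L = eigs.min(), eigs.max()     # strong convexity / smoothness constants

f = lambda x: 0.5 * x @ A @ x
x = rng.normal(size=10)
for _ in range(50):
    f_old = f(x)
    x = x - (1.0 / L) * (A @ x)    # gradient step with step size 1/L
    assert f(x) <= (1 - mu / L) * f_old + 1e-9
print("Observed per-step decrease is within the (1 - mu/L) contraction bound.")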
Week 3
  • Lecture:
    • Subgradient descent
    • Proximal Gradient Descent (see the sketch below)
  • Exercise Session:
    • Implementing gradient descent
    • Theoretical problems from chapter 2
Deliverable: Exercise 2 released.
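Illustrative sketch (not the official exercise): a minimal version of proximal gradient descent (ISTA) on the Lasso objective 0.5 * ||Ax - b||^2 + lam * ||x||_1, also previewing the Week 4 Lasso exercise. The data dimensions, noise level, and regularization strength lam are arbitrary choices of mine.

```python
# ISTA: gradient step on the smooth part, then the soft-thresholding prox of the l1 norm.
import numpy as np

def soft_threshold(v, tau):
    # prox of tau * ||.||_1
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

rng = np.random.default_rng(2)
A = rng.normal(size=(100, 20))
x_true = np.zeros(20); x_true[:3] = [2.0, -1.0, 0.5]   # sparse ground truth
b = A @ x_true + 0.01 * rng.normal(size=100)
lam = 0.5
eta = 1.0 / np.linalg.norm(A, 2) ** 2                  # 1/L with L = ||A||_2^2

x = np.zeros(20)
for _ in range(500):
    grad = A.T @ (A @ x - b)                           # gradient of the smooth part
    x = soft_threshold(x - eta * grad, eta * lam)      # proximal gradient step
print("indices of nonzero coordinates:", np.flatnonzero(np.abs(x) > 1e-3))
```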
Week 4
  • Lecture:
    • Stochastic Gradient Descent (SGD); see the sketch below
    • Non-convexity and local optimality
    • Convergence to critical points
  • Exercise Session:
    • Optimizing Lasso
    • Theoretical problems from chapters 3 & 4
Deliverable: Exercise 3 released.
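Illustrative sketch (not from the exercise sheet): mini-batch SGD on a least-squares problem. The problem sizes, batch size, and step size below are arbitrary illustrative choices.

```python
# Mini-batch SGD for min_x (1/n) * sum_i 0.5 * (a_i^T x - b_i)^2.
import numpy as np

rng = np.random.default_rng(3)
n, d = 1000, 10
A = rng.normal(size=(n, d))
x_true = rng.normal(size=d)
b = A @ x_true + 0.1 * rng.normal(size=n)

x, batch, eta = np.zeros(d), 32, 0.01
for _ in range(2000):
    idx = rng.choice(n, size=batch, replace=False)
    grad = A[idx].T @ (A[idx] @ x - b[idx]) / batch    # stochastic gradient estimate
    x -= eta * grad
print("distance to x_true:", np.linalg.norm(x - x_true))
```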
Week 5
  • Lecture:
    • Nesterov Acceleration (see the sketch below)
    • Momentum as smoothing non-convexity
  • Exercise Session:
    • Comparing mini-batch and full batch methods
Deliverable: Exercise 4 released.
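Illustrative sketch (not part of the official materials): plain gradient descent versus the constant-momentum variant of Nesterov's method on an ill-conditioned quadratic. The diagonal test problem, condition number, and iteration budget are my own choices; the step size 1/L and momentum (sqrt(kappa) - 1) / (sqrt(kappa) + 1) are the standard textbook parameters for the strongly convex setting.

```python
# Compare GD and Nesterov's accelerated method on f(x) = 0.5 * x^T A x with kappa = 1000.
import numpy as np

d = 50
eigs = np.linspace(1.0, 1000.0, d)
A = np.diag(eigs)
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x
mu, L = eigs.min(), eigs.max()
kappa = L / mu
beta = (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)   # momentum coefficient

x_gd = np.ones(d)
x, x_prev = np.ones(d), np.ones(d)
for _ in range(300):
    x_gd = x_gd - (1 / L) * grad(x_gd)               # plain gradient descent
    y = x + beta * (x - x_prev)                      # momentum (look-ahead) point
    x_prev, x = x, y - (1 / L) * grad(y)             # gradient step at the look-ahead point
print(f"after 300 steps: GD f = {f(x_gd):.3e}, Nesterov f = {f(x):.3e}")
```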
Week 6
  • Lecture:
    • Newton’s method
    • Adaptive preconditioning: AdaGrad, Adam (see the sketch below)
    • (L0, L1)-smoothness and clipped SGD
  • Exercise Session:
    • Understanding momentum methods in practice
Deliverable: Exercise 5 released.
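Illustrative sketch: the Adam update written out by hand, to make the adaptive-preconditioning idea concrete. The hyperparameters are the commonly used defaults, and the toy quadratic objective is an arbitrary choice of mine, not taken from the notes.

```python
# Hand-written Adam update: momentum on the gradient plus a diagonal second-moment preconditioner.
import numpy as np

def adam_step(x, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g              # first-moment (momentum) estimate
    v = b2 * v + (1 - b2) * g ** 2         # second-moment (preconditioner) estimate
    m_hat = m / (1 - b1 ** t)              # bias corrections
    v_hat = v / (1 - b2 ** t)
    x = x - lr * m_hat / (np.sqrt(v_hat) + eps)
    return x, m, v

# Toy usage: minimize f(x) = 0.5 * ||x||^2, whose gradient is x itself.
x, m, v = np.ones(5), np.zeros(5), np.zeros(5)
for t in range(1, 5001):
    x, m, v = adam_step(x, x, m, v, t)
print("final ||x||:", np.linalg.norm(x))
```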
Week 7
  • Lecture:
    • Variational inequalities
    • Gradient-descent-ascent
    • Extragradient method (see the sketch below)
    • Optimistic gradient method
  • Exercise Session:
    • Comparative analysis of SGD and adaptive methods
  • Deliverables: Project paper selection due; Exercise 6 released.
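Illustrative sketch (not from the lecture notes): a toy bilinear saddle point min_x max_y xy, whose unique equilibrium is (0, 0). Simultaneous gradient-descent-ascent spirals away from the equilibrium, while the extragradient look-ahead step restores convergence. The step size and iteration count are arbitrary.

```python
# Compare simultaneous gradient-descent-ascent (GDA) with the extragradient method on f(x, y) = x * y.
import numpy as np

eta = 0.1
x, y = 1.0, 1.0            # GDA iterate
u, v = 1.0, 1.0            # extragradient iterate
for _ in range(500):
    # simultaneous GDA: descend in x, ascend in y; known to spiral away from (0, 0)
    x, y = x - eta * y, y + eta * x
    # extragradient: take a look-ahead step, then update using the look-ahead gradients
    u_half, v_half = u - eta * v, v + eta * u
    u, v = u - eta * v_half, v + eta * u_half
print(f"GDA distance to (0, 0): {np.hypot(x, y):.2e}, extragradient: {np.hypot(u, v):.2e}")
```

On this problem the GDA iterates grow by a factor of sqrt(1 + eta^2) per step while the extragradient iterates shrink, which is the kind of behavior the variational-inequality viewpoint in lecture formalizes.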
Week 8
  • Lecture:
    • Linear neural networks
    • Gradient flows
  • Exercise Session:
    • Adversarial robust training
    • Discuss projects
Deliverable: Exercise 7 released.
Material below will not be part of the exam unless explicitly stated otherwise.
Week 9
  • Lecture:
    • Neural Tangent Kernel
    • Feature learning in neural networks
  • Exercise Session:
    • Visualizing feature learning in neural networks
Reading: Bach and Chizat. "Gradient descent on infinitely wide neural networks: Global convergence and generalization." 2021
Week 10
  • Lecture:
    • Why does pretraining help?
    • When it harms: Feature distortion in fine-tuning
  • Exercise Session:
    • Discuss projects
Reading: Kumar et al. "Fine-tuning can distort pretrained features and underperform out-of-distribution." ICLR 2022
Week 11
  • Lecture:
    • Overview of LLM training
    • Zeroth-order (memory-efficient) optimization methods; see the sketch below
    • RLHF and Direct Preference Optimization
  • Exercise Session:
    • Discuss projects
Reading: Rafailov et al. "Direct preference optimization: Your language model is secretly a reward model." NeurIPS 2023
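Illustrative sketch: the basic two-point zeroth-order gradient estimator that underlies memory-efficient, forward-pass-only optimization. This is a generic construction rather than any specific published method; the toy objective, probe scale eps, and step size are my own choices.

```python
# Estimate the gradient from two function evaluations along a random probe direction,
# then run plain SGD with the estimated gradients.
import numpy as np

def zo_gradient(f, x, u, eps=1e-3):
    # central finite-difference estimate of the directional derivative, mapped back onto u
    return (f(x + eps * u) - f(x - eps * u)) / (2 * eps) * u

f = lambda x: np.sum((x - 3.0) ** 2)       # toy objective with minimum at x = 3
rng = np.random.default_rng(4)
x = np.zeros(10)
for _ in range(5000):
    u = rng.normal(size=x.shape)           # random Gaussian probe direction
    x -= 0.01 * zo_gradient(f, x, u)       # SGD step with the estimated gradient
print("distance to the optimum:", np.linalg.norm(x - 3.0))
```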
Week 12
  • Lecture:
    • Heterogeneity in federated optimization
    • Communication compression
    • Byzantine robustness
  • Exercise Session:
    • Simulating federated learning environments (see the sketch below)
    • Discuss projects
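Illustrative sketch (not the official exercise): a minimal FedAvg-style simulation with heterogeneous quadratic clients, showing the local-update-then-average structure. The client objectives, step size, number of local steps, and number of rounds are arbitrary choices.

```python
# Each client holds f_i(x) = 0.5 * ||x - c_i||^2, runs a few local gradient steps,
# and the server averages the resulting models.
import numpy as np

rng = np.random.default_rng(5)
num_clients, d = 10, 5
centers = rng.normal(size=(num_clients, d))   # heterogeneous local optima c_i
global_opt = centers.mean(axis=0)             # minimizer of the average objective

x = np.zeros(d)                               # server model
eta, local_steps = 0.1, 5
for _ in range(100):                          # communication rounds
    client_models = []
    for c in centers:
        w = x.copy()
        for _ in range(local_steps):          # local gradient steps on f_i
            w -= eta * (w - c)                # gradient of 0.5 * ||w - c||^2
        client_models.append(w)
    x = np.mean(client_models, axis=0)        # server averages the local models
print("distance to the global optimum:", np.linalg.norm(x - global_opt))
```

Because every client here has the same curvature, plain averaging converges to the exact global minimizer; the heterogeneity effects discussed in lecture appear once the local objectives differ more substantially.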
Weeks 13-15
  • Student presentations:
    • In-class presentations
    • Option to schedule earlier in the semester
Final
  • Final exam on the university-scheduled exam date
Deliverable: Project report due.