About Me
I received my Ph.D. in Computer Science from UC Berkeley, where I was advised by Ion Stoica. Before that, I received my B.S. in Computer Science from Peking University, advised by Liwei Wang and Di He.
My interests lie at the intersection of machine learning and distributed systems. I draw on insights from both domains to improve the performance (accuracy, efficiency, and interpretability) of current machine learning models.
I co-created and co-lead the development of vLLM, an open-source, high-throughput LLM serving engine that has been widely adopted across the industry.
Projects
vLLM: A high-throughput and memory-efficient serving engine for large language models, accelerated with PagedAttention.
Vicuna: An open-source chatbot impressing GPT-4 with 90% ChatGPT quality.
AlpaServe: Use model parallelism to accelerate deep learning serving, even when a model fits on a single GPU.
Alpa: Automate model-parallel training with just a few lines of code.
Education
University of California, Berkeley
Ph.D. in Computer Science
2019 - 2024
Peking University
B.S. in Computer Science (Summa Cum Laude)
2015 - 2019
Experience
Google Brain / Google DeepMind
Research Intern
Hosts: Yanping Huang, Yuanzhong Xu, and Zhifeng Chen
May 2021 - April 2024
Anyscale
Software Engineer Intern
May 2020 - August 2020
Microsoft Research Asia
Research Intern
Hosts: Di He and Tao Qin
June 2017 - March 2019
Publications
- Fairness in Serving Large Language Models
Ying Sheng, Shiyi Cao, Dacheng Li, Banghua Zhu, Zhuohan Li, Danyang Zhuo, Joseph E. Gonzalez, Ion Stoica
OSDI 2024
- LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset
Lianmin Zheng*, Wei-Lin Chiang*, Ying Sheng, Tianle Li, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zhuohan Li, Zi Lin, Eric Xing, Joseph E. Gonzalez, Ion Stoica, Hao Zhang
ICLR 2024
- Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
Lianmin Zheng*, Wei-Lin Chiang*, Ying Sheng*, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, Ion Stoica
NeurIPS 2023 Datasets and Benchmarks Track
- Efficient Memory Management for Large Language Model Serving with PagedAttention
Woosuk Kwon*, Zhuohan Li*, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, Ion Stoica
SOSP 2023
- FlexGen: High-throughput Generative Inference of Large Language Models with a Single GPU
Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Daniel Y. Fu, Zhiqiang Xie, Beidi Chen, Clark Barrett, Joseph E. Gonzalez, Percy Liang, Christopher Ré, Ion Stoica, Ce Zhang
ICML 2023
- Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality
Wei-Lin Chiang*, Zhuohan Li*, Zi Lin*, Ying Sheng*, Zhanghao Wu*, Hao Zhang*, Lianmin Zheng*, Siyuan Zhuang*, Yonghao Zhuang*, Joseph E. Gonzalez, Ion Stoica, Eric P. Xing
- AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving
Zhuohan Li*, Lianmin Zheng*, Yinmin Zhong*, Vincent Liu, Ying Sheng, Xin Jin, Yanping Huang, Zhifeng Chen, Hao Zhang, Joseph E. Gonzalez, Ion Stoica
OSDI 2023
- On Optimizing the Communication of Model Parallelism
Yonghao Zhuang*, Hexu Zhao*, Lianmin Zheng, Zhuohan Li, Eric P. Xing, Qirong Ho, Joseph E. Gonzalez, Ion Stoica, Hao Zhang
MLSys 2022
- Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning
Lianmin Zheng*, Zhuohan Li*, Hao Zhang*, Yonghao Zhuang, Zhifeng Chen, Yanping Huang, Yida Wang, Yuanzhong Xu, Danyang Zhuo, Eric P. Xing, Joseph E. Gonzalez, Ion Stoica
OSDI 2022
- Rearchitecting In-Memory Object Stores for Low Latency
Danyang Zhuo, Kaiyuan Zhang, Zhuohan Li, Siyuan Zhuang, Stephanie Wang, Ang Chen, Ion Stoica
VLDB 2022
- TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models
Zhuohan Li, Siyuan Zhuang, Shiyuan Guo, Danyang Zhuo, Hao Zhang, Dawn Song, Ion Stoica
ICML 2021
- Hoplite: Efficient and Fault-Tolerant Collective Communication for Task-Based Distributed Systems
Siyuan Zhuang*, Zhuohan Li*, Danyang Zhuo, Stephanie Wang, Eric Liang, Robert Nishihara, Philipp Moritz, Ion Stoica
SIGCOMM 2020
- Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers
Zhuohan Li*, Eric Wallace*, Sheng Shen*, Kevin Lin*, Kurt Keutzer, Dan Klein, Joseph E. Gonzalez
ICML 2020
- Fast Structured Decoding for Sequence Models
Zhiqing Sun*, Zhuohan Li*, Haoqing Wang, Zi Lin, Di He, Zhi-Hong Deng
NeurIPS 2019
- Understanding and Improving Transformer From a Multi-Particle Dynamic System Point of View
Yiping Lu*, Zhuohan Li*, Di He, Zhiqing Sun, Bin Dong, Tao Qin, Liwei Wang, Tie-Yan Liu
NeurIPS 2019 Workshop on Machine Learning and the Physical Sciences
ICLR 2020 Workshop on Integration of Deep Neural Models and Differential Equations
- Hint-Based Training for Non-Autoregressive Machine Translation
Zhuohan Li, Zi Lin, Di He, Fei Tian, Tao Qin, Liwei Wang, Tie-Yan Liu
EMNLP 2019
- Efficient Training of BERT by Progressively Stacking
Linyuan Gong, Di He, Zhuohan Li, Tao Qin, Liwei Wang, Tie-Yan Liu
ICML 2019
- Towards Binary-Valued Gates for Robust LSTM Training
Zhuohan Li, Di He, Fei Tian, Wei Chen, Tao Qin, Liwei Wang, Tie-Yan Liu
ICML 2018
* denotes equal contribution.
Tutorials
- Welcome to the “Big Model” Era: Techniques and Systems to Train and Serve Bigger Models
with Hao Zhang, Lianmin Zheng, and Ion Stoica
ICML 2022 Tutorial
- Simple and Automatic Distributed Machine Learning on Ray
with Hao Zhang, Lianmin Zheng, and Ion Stoica
KDD 2021 Tutorial