Zhuohan Li

zhuohan [at] cs.berkeley.edu

About Me

I received my Ph.D. in Computer Science from UC Berkeley, where I was advised by Ion Stoica. Before that, I received my B.S. in Computer Science from Peking University, advised by Liwei Wang and Di He.

My interests lie at the intersection of machine learning and distributed systems. I use insights from both domains to improve the performance (accuracy, efficiency, and interpretability) of modern machine learning models.

I co-created and co-led the development of vLLM, an open-source, high-throughput LLM serving engine that has been widely adopted across the industry.

Projects

vLLM: A high-throughput and memory-efficient serving engine for large language models, accelerated with PagedAttention.

Vicuna: An open-source chatbot impressing GPT-4 with 90% ChatGPT quality.

AlpaServe: Uses model parallelism to accelerate deep learning serving, even when a model fits on a single GPU.

Alpa: Automates model-parallel training with just a few lines of code.

Education

University of California, Berkeley
Ph.D. in Computer Science
2019 - 2024

Peking University
B.S. in Computer Science (Summa Cum Laude)
2015 - 2019

Experience

Google Brain / Google DeepMind
Research Intern
Hosts: Yanping Huang, Yuanzhong Xu, and Zhifeng Chen
May 2021 - April 2024

Software Engineer Intern
May 2020 - August 2020

Microsoft Research Asia
Research Intern
Hosts: Di He and Tao Qin
June 2017 - March 2019

Publications

  1. Fairness in Serving Large Language Models
    Ying Sheng, Shiyi Cao, Dacheng Li, Banghua Zhu, Zhuohan Li, Danyang Zhuo, Joseph E. Gonzalez, Ion Stoica
    OSDI 2024

  2. LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset
    Lianmin Zheng*, Wei-Lin Chiang*, Ying Sheng, Tianle Li, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zhuohan Li, Zi Lin, Eric Xing, Joseph E. Gonzalez, Ion Stoica, Hao Zhang
    ICLR 2024

  3. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
    Lianmin Zheng*, Wei-Lin Chiang*, Ying Sheng*, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, Ion Stoica
    NeurIPS 2023 Datasets and Benchmarks Track

  4. Efficient Memory Management for Large Language Model Serving with PagedAttention
    Woosuk Kwon*, Zhuohan Li*, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, Ion Stoica
    SOSP 2023

  5. FlexGen: High-throughput Generative Inference of Large Language Models with a Single GPU
    Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Daniel Y. Fu, Zhiqiang Xie, Beidi Chen, Clark Barrett, Joseph E. Gonzalez, Percy Liang, Christopher Ré, Ion Stoica, Ce Zhang
    ICML 2023

  6. Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality
    Wei-Lin Chiang*, Zhuohan Li*, Zi Lin*, Ying Sheng*, Zhanghao Wu*, Hao Zhang*, Lianmin Zheng*, Siyuan Zhuang*, Yonghao Zhuang*, Joseph E. Gonzalez, Ion Stoica, Eric P. Xing

  7. AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving
    Zhuohan Li*, Lianmin Zheng*, Yinmin Zhong*, Vincent Liu, Ying Sheng, Xin Jin, Yanping Huang, Zhifeng Chen, Hao Zhang, Joseph E. Gonzalez, Ion Stoica
    OSDI 2023

  8. On Optimizing the Communication of Model Parallelism
    Yonghao Zhuang*, Hexu Zhao*, Lianmin Zheng, Zhuohan Li, Eric P. Xing, Qirong Ho, Joseph E. Gonzalez, Ion Stoica, Hao Zhang
    MLSys 2022

  9. Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning
    Lianmin Zheng*, Zhuohan Li*, Hao Zhang*, Yonghao Zhuang, Zhifeng Chen, Yanping Huang, Yida Wang, Yuanzhong Xu, Danyang Zhuo, Eric P. Xing, Joseph E. Gonzalez, Ion Stoica
    OSDI 2022

  10. Rearchitecting In-Memory Object Stores for Low Latency
    Danyang Zhuo, Kaiyuan Zhang, Zhuohan Li, Siyuan Zhuang, Stephanie Wang, Ang Chen, Ion Stoica
    VLDB 2022

  11. TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models
    Zhuohan Li, Siyuan Zhuang, Shiyuan Guo, Danyang Zhuo, Hao Zhang, Dawn Song, Ion Stoica
    ICML 2021

  12. Hoplite: Efficient and Fault-Tolerant Collective Communication for Task-Based Distributed Systems
    Siyuan Zhuang*, Zhuohan Li*, Danyang Zhuo, Stephanie Wang, Eric Liang, Robert Nishihara, Philipp Moritz, Ion Stoica
    SIGCOMM 2020

  13. Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers
    Zhuohan Li*, Eric Wallace*, Sheng Shen*, Kevin Lin*, Kurt Keutzer, Dan Klein, Joseph E. Gonzalez
    ICML 2020

  14. Fast Structured Decoding for Sequence Models
    Zhiqing Sun*, Zhuohan Li*, Haoqing Wang, Zi Lin, Di He, Zhi-Hong Deng
    NeurIPS 2019

  15. Understanding and Improving Transformer From a Multi-Particle Dynamic System Point of View
    Yiping Lu*, Zhuohan Li*, Di He, Zhiqing Sun, Bin Dong, Tao Qin, Liwei Wang, Tie-Yan Liu
    NeurIPS 2019 Workshop on Machine Learning and the Physical Sciences
    ICLR 2020 Workshop on Integration of Deep Neural Models and Differential Equations

  16. Hint-Based Training for Non-Autoregressive Machine Translation
    Zhuohan Li, Zi Lin, Di He, Fei Tian, Tao Qin, Liwei Wang, Tie-Yan Liu
    EMNLP 2019

  17. Efficient Training of BERT by Progressively Stacking
    Linyuan Gong, Di He, Zhuohan Li, Tao Qin, Liwei Wang, Tie-Yan Liu
    ICML 2019

  18. Towards Binary-Valued Gates for Robust LSTM Training
    Zhuohan Li, Di He, Fei Tian, Wei Chen, Tao Qin, Liwei Wang, Tie-Yan Liu
    ICML 2018

  * denotes equal contribution.

Tutorials

  1. Welcome to the “Big Model” Era: Techniques and Systems to Train and Serve Bigger Models
    with Hao Zhang, Lianmin Zheng, and Ion Stoica
    ICML 2022 Tutorial

  2. Simple and Automatic Distributed Machine Learning on Ray
    with Hao Zhang, Lianmin Zheng, and Ion Stoica
    KDD 2021 Tutorial