Zhuohan Li

zhuohan [at] cs.berkeley.edu

About Me

I received my Ph.D. in Computer Science from UC Berkeley, where I was advised by Ion Stoica. Before that, I received my B.S. in Computer Science from Peking University, advised by Liwei Wang and Di He.

My interests lie at the intersection of machine learning and distributed systems. I use insights from both domains to improve the performance (accuracy, efficiency, and interpretability) of modern machine learning models.

I co-created and co-led the development of vLLM, an open-source, high-throughput LLM serving engine that has been widely adopted across the industry.

Projects

vLLM: A high-throughput and memory-efficient serving engine for large language models, accelerated with PagedAttention.

Vicuna: An open-source chatbot impressing GPT-4 with 90% ChatGPT quality.

AlpaServe: Uses model parallelism to accelerate deep learning serving, even when a model fits on a single GPU.

Alpa: Automates model-parallel training with just a few lines of code.

Education

University of California, Berkeley
Ph.D. in Computer Science
2019 - 2024

Peking University
B.S. in Computer Science (Summa Cum Laude)
2015 - 2019

Experience

Google Brain / Google DeepMind
Research Intern
Hosts: Yanping Huang, Yuanzhong Xu, and Zhifeng Chen
May 2021 - April 2024

Software Engineer Intern
May 2020 - August 2020

Microsoft Research Asia
Research Intern
Hosts: Di He and Tao Qin
June 2017 - March 2019

Publications

  1. Fairness in Serving Large Language Models
    Ying Sheng, Shiyi Cao, Dacheng Li, Banghua Zhu, Zhuohan Li, Danyang Zhuo, Joseph E. Gonzalez, Ion Stoica
    OSDI 2024

  2. LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset
    Lianmin Zheng*, Wei-Lin Chiang*, Ying Sheng, Tianle Li, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zhuohan Li, Zi Lin, Eric Xing, Joseph E. Gonzalez, Ion Stoica, Hao Zhang
    ICLR 2024

  3. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
    Lianmin Zheng*, Wei-Lin Chiang*, Ying Sheng*, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, Ion Stoica
    NeurIPS 2023 Datasets and Benchmarks Track

  4. Efficient Memory Management for Large Language Model Serving with PagedAttention
    Woosuk Kwon*, Zhuohan Li*, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, Ion Stoica
    SOSP 2023

  5. FlexGen: High-throughput Generative Inference of Large Language Models with a Single GPU
    Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Daniel Y. Fu, Zhiqiang Xie, Beidi Chen, Clark Barrett, Joseph E. Gonzalez, Percy Liang, Christopher Ré, Ion Stoica, Ce Zhang
    ICML 2023

  6. Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality
    Wei-Lin Chiang*, Zhuohan Li*, Zi Lin*, Ying Sheng*, Zhanghao Wu*, Hao Zhang*, Lianmin Zheng*, Siyuan Zhuang*, Yonghao Zhuang*, Joseph E. Gonzalez, Ion Stoica, Eric P. Xing

  7. AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving
    Zhuohan Li*, Lianmin Zheng*, Yinmin Zhong*, Vincent Liu, Ying Sheng, Xin Jin, Yanping Huang, Zhifeng Chen, Hao Zhang, Joseph E. Gonzalez, Ion Stoica
    OSDI 2023

  8. On Optimizing the Communication of Model Parallelism
    Yonghao Zhuang*, Hexu Zhao*, Lianmin Zheng, Zhuohan Li, Eric P. Xing, Qirong Ho, Joseph E. Gonzalez, Ion Stoica, Hao Zhang
    MLSys 2022

  9. Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning
    Lianmin Zheng*, Zhuohan Li*, Hao Zhang*, Yonghao Zhuang, Zhifeng Chen, Yanping Huang, Yida Wang, Yuanzhong Xu, Danyang Zhuo, Eric P. Xing, Joseph E. Gonzalez, Ion Stoica
    OSDI 2022

  10. Rearchitecting In-Memory Object Stores for Low Latency
    Danyang Zhuo, Kaiyuan Zhang, Zhuohan Li, Siyuan Zhuang, Stephanie Wang, Ang Chen, Ion Stoica
    VLDB 2022

  11. TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models
    Zhuohan Li, Siyuan Zhuang, Shiyuan Guo, Danyang Zhuo, Hao Zhang, Dawn Song, Ion Stoica
    ICML 2021

  12. Hoplite: Efficient and Fault-Tolerant Collective Communication for Task-Based Distributed Systems
    Siyuan Zhuang*, Zhuohan Li*, Danyang Zhuo, Stephanie Wang, Eric Liang, Robert Nishihara, Philipp Moritz, Ion Stoica
    SIGCOMM 2020

  13. Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers
    Zhuohan Li*, Eric Wallace*, Sheng Shen*, Kevin Lin*, Kurt Keutzer, Dan Klein, Joseph E. Gonzalez
    ICML 2020

  14. Fast Structured Decoding for Sequence Models
    Zhiqing Sun*, Zhuohan Li*, Haoqing Wang, Zi Lin, Di He, Zhi-Hong Deng
    NeurIPS 2019

  15. Understanding and Improving Transformer From a Multi-Particle Dynamic System Point of View
    Yiping Lu*, Zhuohan Li*, Di He, Zhiqing Sun, Bin Dong, Tao Qin, Liwei Wang, Tie-Yan Liu
    NeurIPS 2019 Workshop on Machine Learning and the Physical Sciences
    ICLR 2020 Workshop on Integration of Deep Neural Models and Differential Equations

  16. Hint-Based Training for Non-Autoregressive Machine Translation
    Zhuohan Li, Zi Lin, Di He, Fei Tian, Tao Qin, Liwei Wang, Tie-Yan Liu
    EMNLP 2019

  17. Efficient Training of BERT by Progressively Stacking
    Linyuan Gong, Di He, Zhuohan Li, Tao Qin, Liwei Wang, Tie-Yan Liu
    ICML 2019

  18. Towards Binary-Valued Gates for Robust LSTM Training
    Zhuohan Li, Di He, Fei Tian, Wei Chen, Tao Qin, Liwei Wang, Tie-Yan Liu
    ICML 2018

  * denotes equal contribution.

Tutorials

  1. Welcome to the “Big Model” Era: Techniques and Systems to Train and Serve Bigger Models
    with Hao Zhang, Lianmin Zheng, and Ion Stoica
    ICML 2022 Tutorial

  2. Simple and Automatic Distributed Machine Learning on Ray
    with Hao Zhang, Lianmin Zheng, and Ion Stoica
    KDD 2021 Tutorial