About Me
I am an AI Research Scientist at Meta, focusing on inference. Previously, I was a Member of Technical Staff at OpenAI, where I led infrastructure and inference work for gpt-oss and built training and inference systems for reasoning models.
I received my Ph.D. in Computer Science from UC Berkeley, advised by Ion Stoica. Before that, I received my B.S. in Computer Science from Peking University, advised by Liwei Wang and Di He.
I co-created vLLM, the most popular open-source LLM serving engine, and co-lead its development. My work on vLLM has been recognized with the a16z Open Source AI Grant and the Sequoia Open Source Fellowship. Join our vLLM Slack to get connected and start building together!
Projects
vLLM: A high-throughput and memory-efficient serving engine for large language models, accelerated with PagedAttention. (A minimal usage sketch follows this list.)
gpt-oss: OpenAI's flagship open-weight reasoning models.
OpenAI o1: The world's first reasoning model.
Chatbot Arena: The leading crowdsourced LLM benchmark.
Vicuna: An open-source chatbot impressing GPT-4 with 90% ChatGPT quality.
Alpa: Automates model-parallel training with just a few lines of code.
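For readers who want to try vLLM, here is a minimal offline-inference sketch using its Python API. The model name and sampling settings are only illustrative examples, not recommendations.

```python
# Minimal vLLM offline-inference sketch (illustrative; substitute any
# supported Hugging Face model for the example model below).
from vllm import LLM, SamplingParams

prompts = ["The capital of France is"]
sampling_params = SamplingParams(temperature=0.8, max_tokens=32)

# Constructing the engine loads the model and sets up the
# PagedAttention-based KV-cache management internally.
llm = LLM(model="facebook/opt-125m")

# Generate completions for all prompts in one batched call.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```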
Education
University of California, Berkeley
Ph.D. in Computer Science
2019 - 2024
Peking University
B.S. in Computer Science (Summa Cum Laude)
2015 - 2019
Experience
Meta
AI Research Scientist
August 2025 - Present
OpenAI
Member of Technical Staff
June 2024 - August 2025
Google Brain / Google DeepMind
Research Intern
Hosts: Yanping Huang, Yuanzhong Xu, and Zhifeng Chen
May 2021 - April 2024
Anyscale
Software Engineer Intern
May 2020 - August 2020
Microsoft Research Asia
Research Intern
Hosts: Di He and Tao Qin
June 2017 - March 2019
Publications
- gpt-oss-120b & gpt-oss-20b model card
OpenAI
Technical Report, 2025
- Jenga: Effective Memory Management for Serving LLM with Heterogeneity
Chen Zhang, Kuntai Du, Shu Liu, Woosuk Kwon, Xiangxi Mo, Yufeng Wang, Xiaoxuan Liu, Kaichao You, Zhuohan Li, Mingsheng Long, Jidong Zhai, Joseph Gonzalez, Ion Stoica
SOSP 2025
- OpenAI o1 System Card
OpenAI
Technical Report, 2024
- Optimizing Speculative Decoding for Serving Large Language Models Using Goodput
Xiaoxuan Liu, Cade Daniel, Langxiang Hu, Woosuk Kwon, Zhuohan Li, Xiangxi Mo, Alvin Cheung, Zhijie Deng, Ion Stoica, Hao Zhang
Preprint, 2024
- Fairness in Serving Large Language Models
Ying Sheng, Shiyi Cao, Dacheng Li, Banghua Zhu, Zhuohan Li, Danyang Zhuo, Joseph E. Gonzalez, Ion Stoica
OSDI 2024
- LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset
Lianmin Zheng*, Wei-Lin Chiang*, Ying Sheng, Tianle Li, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zhuohan Li, Zi Lin, Eric Xing, Joseph E. Gonzalez, Ion Stoica, Hao Zhang
ICLR 2024
- Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
Lianmin Zheng*, Wei-Lin Chiang*, Ying Sheng*, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, Ion Stoica
NeurIPS 2023 Datasets and Benchmarks Track
- Efficient Memory Management for Large Language Model Serving with PagedAttention
Woosuk Kwon*, Zhuohan Li*, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, Ion Stoica
SOSP 2023
- FlexGen: High-throughput Generative Inference of Large Language Models with a Single GPU
Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Daniel Y. Fu, Zhiqiang Xie, Beidi Chen, Clark Barrett, Joseph E. Gonzalez, Percy Liang, Christopher Ré, Ion Stoica, Ce Zhang
ICML 2023
- Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality
Wei-Lin Chiang*, Zhuohan Li*, Zi Lin*, Ying Sheng*, Zhanghao Wu*, Hao Zhang*, Lianmin Zheng*, Siyuan Zhuang*, Yonghao Zhuang*, Joseph E. Gonzalez, Ion Stoica, Eric P. Xing
- AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving
Zhuohan Li*, Lianmin Zheng*, Yinmin Zhong*, Vincent Liu, Ying Sheng, Xin Jin, Yanping Huang, Zhifeng Chen, Hao Zhang, Joseph E. Gonzalez, Ion Stoica
OSDI 2023
- On Optimizing the Communication of Model Parallelism
Yonghao Zhuang*, Hexu Zhao*, Lianmin Zheng, Zhuohan Li, Eric P. Xing, Qirong Ho, Joseph E. Gonzalez, Ion Stoica, Hao Zhang
MLSys 2022
- Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning
Lianmin Zheng*, Zhuohan Li*, Hao Zhang*, Yonghao Zhuang, Zhifeng Chen, Yanping Huang, Yida Wang, Yuanzhong Xu, Danyang Zhuo, Eric P. Xing, Joseph E. Gonzalez, Ion Stoica
OSDI 2022
- Rearchitecting In-Memory Object Stores for Low Latency
Danyang Zhuo, Kaiyuan Zhang, Zhuohan Li, Siyuan Zhuang, Stephanie Wang, Ang Chen, Ion Stoica
VLDB 2022
- TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models
Zhuohan Li, Siyuan Zhuang, Shiyuan Guo, Danyang Zhuo, Hao Zhang, Dawn Song, Ion Stoica
ICML 2021
- Hoplite: Efficient and Fault-Tolerant Collective Communication for Task-Based Distributed Systems
Siyuan Zhuang*, Zhuohan Li*, Danyang Zhuo, Stephanie Wang, Eric Liang, Robert Nishihara, Philipp Moritz, Ion Stoica
SIGCOMM 2020
- Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers
Zhuohan Li*, Eric Wallace*, Sheng Shen*, Kevin Lin*, Kurt Keutzer, Dan Klein, Joseph E. Gonzalez
ICML 2020
- Fast Structured Decoding for Sequence Models
Zhiqing Sun*, Zhuohan Li*, Haoqing Wang, Zi Lin, Di He, Zhi-Hong Deng
NeurIPS 2019
- Understanding and Improving Transformer From a Multi-Particle Dynamic System Point of View
Yiping Lu*, Zhuohan Li*, Di He, Zhiqing Sun, Bin Dong, Tao Qin, Liwei Wang, Tie-Yan Liu
NeurIPS 2019 Workshop on Machine Learning and the Physical Sciences
ICLR 2020 Workshop on Integration of Deep Neural Models and Differential Equations
- Hint-Based Training for Non-Autoregressive Machine Translation
Zhuohan Li, Zi Lin, Di He, Fei Tian, Tao Qin, Liwei Wang, Tie-Yan Liu
EMNLP 2019
- Efficient Training of BERT by Progressively Stacking
Linyuan Gong, Di He, Zhuohan Li, Tao Qin, Liwei Wang, Tie-Yan Liu
ICML 2019
- Towards Binary-Valued Gates for Robust LSTM Training
Zhuohan Li, Di He, Fei Tian, Wei Chen, Tao Qin, Liwei Wang, Tie-Yan Liu
ICML 2018
* denotes equal contribution.
Tutorials
- Welcome to the “Big Model” Era: Techniques and Systems to Train and Serve Bigger Models
with Hao Zhang, Lianmin Zheng, and Ion Stoica
ICML 2022 Tutorial
- Simple and Automatic Distributed Machine Learning on Ray
with Hao Zhang, Lianmin Zheng, and Ion Stoica
KDD 2021 Tutorial