Qwen2.5-VL
In the five months since Qwen2-VL’s release, numerous developers have built new models on top of it, providing us with valuable feedback. During this period, we focused on building more useful vision-language models. Today, we are excited to introduce the latest addition to the Qwen family: Qwen2.5-VL. Key enhancements include powerful document parsing capabilities: text recognition is upgraded to omni-document parsing, excelling at multi-scene, multilingual documents and embedded elements such as handwriting, tables, charts, chemical formulas, and music sheets.
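To make the document-parsing claim concrete, here is a minimal inference sketch following the usage pattern published on Qwen’s model cards. It assumes a transformers version with Qwen2.5-VL support and the qwen-vl-utils helper package; the image path and prompt are placeholders.

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info  # pip install qwen-vl-utils

# Load the 7B instruct checkpoint (illustrative choice of model size).
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "file:///path/to/scanned_page.png"},  # placeholder
        {"type": "text", "text": "Parse this document, including any tables and formulas."},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
images, videos = process_vision_info(messages)
inputs = processor(text=[text], images=images, videos=videos,
                   padding=True, return_tensors="pt").to(model.device)

out = model.generate(**inputs, max_new_tokens=512)
# Strip the prompt tokens before decoding, as in the model card examples.
trimmed = [o[len(i):] for i, o in zip(inputs.input_ids, out)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```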
vLLM V1
We are thrilled to announce the alpha release of vLLM V1, a major upgrade to vLLM’s core architecture. Based on lessons we learned over the past 1.5 years of vLLM development, we revisited key design decisions, consolidated various features, and simplified the codebase to enhance flexibility and scalability. V1 already achieves state-of-the-art performance and is set to gain even more optimizations.
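During the alpha, the V1 engine is opt-in. A minimal sketch, assuming the VLLM_USE_V1 environment variable described in the release; the model name is illustrative:

```python
import os

# Opt into the alpha V1 engine before vLLM is imported (per the release notes).
os.environ["VLLM_USE_V1"] = "1"

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # illustrative model choice
outputs = llm.generate(["The core idea behind vLLM V1 is"],
                       SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```

Existing code paths are unchanged; removing the environment variable falls back to the V0 engine.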
DeepSeek-R1
We introduce DeepSeek’s first-generation reasoning models: DeepSeek-R1-Zero and DeepSeek-R1. DeepSeek-R1-Zero, trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step, demonstrates remarkable reasoning performance. DeepSeek-R1 incorporates cold-start data before RL and achieves performance comparable to OpenAI-o1 across math, code, and reasoning tasks.
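For readers who want to inspect the model’s reasoning traces, a minimal sketch of calling DeepSeek-R1 through DeepSeek’s OpenAI-compatible endpoint; the deepseek-reasoner model name and the reasoning_content field follow their API documentation, and the API key is a placeholder:

```python
from openai import OpenAI

# DeepSeek exposes an OpenAI-compatible API; the key below is a placeholder.
client = OpenAI(api_key="YOUR_API_KEY", base_url="https://api.deepseek.com")

resp = client.chat.completions.create(
    model="deepseek-reasoner",  # DeepSeek-R1, per DeepSeek's API docs
    messages=[{"role": "user", "content": "What is 9.11 minus 9.8?"}],
)
msg = resp.choices[0].message
print("reasoning trace:", msg.reasoning_content)  # chain of thought before the answer
print("final answer:", msg.content)
```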
DeepSeek-V3
We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total parameters, of which 37B are activated for each token. To achieve efficient inference and cost-effective training, DeepSeek-V3 adopts the Multi-head Latent Attention (MLA) and DeepSeekMoE architectures, which were thoroughly validated in DeepSeek-V2. Furthermore, DeepSeek-V3 pioneers an auxiliary-loss-free strategy for load balancing and sets a multi-token prediction training objective for stronger performance.
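The auxiliary-loss-free balancing idea is easy to sketch: each expert carries a bias that is added to its routing score only when selecting the top-k experts, and the bias is nudged after each step depending on whether the expert was over- or under-loaded. The following is a schematic illustration of that idea, not DeepSeek’s implementation; the function names and the update rate are illustrative:

```python
import torch

def route(scores: torch.Tensor, bias: torch.Tensor, k: int):
    """scores: (tokens, experts) affinities; bias: (experts,) balancing bias."""
    # The bias influences *which* experts are selected...
    idx = torch.topk(scores + bias, k, dim=-1).indices
    # ...but gating weights come from the raw scores, so the bias steers
    # load without distorting the mixture itself.
    gate = torch.gather(scores, -1, idx)
    return idx, gate / gate.sum(-1, keepdim=True)

def update_bias(bias: torch.Tensor, load: torch.Tensor, gamma: float = 1e-3):
    # After each step: raise the bias of under-loaded experts and lower it
    # for over-loaded ones, so no auxiliary loss term is needed.
    err = load.float().mean() - load.float()
    return bias + gamma * torch.sign(err)
```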

SGLang v0.4
We’re excited to announce the release of SGLang v0.4, featuring significant performance improvements and new features:
- Zero-overhead batch scheduler: 1.1x increase in throughput.
- Cache-aware load balancer: up to 1.9x increase in throughput with a 3.8x higher cache hit rate.
- Data parallelism attention for DeepSeek models: up to 1.9x decoding throughput improvement.
- Fast structured outputs with xgrammar: up to 10x faster.
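As a starting point, a minimal sketch using SGLang’s offline engine API, with the data parallelism attention feature from the list above enabled via a keyword argument; the model, parallelism sizes, and flag names are assumptions based on SGLang’s documentation at the time:

```python
import sglang as sgl

# Offline engine; kwargs mirror the launch_server CLI flags.
# enable_dp_attention is the DeepSeek-specific data-parallel attention
# mentioned above and requires multiple GPUs (sizes here are illustrative).
llm = sgl.Engine(
    model_path="deepseek-ai/DeepSeek-V2-Lite",  # illustrative model
    enable_dp_attention=True,
    tp_size=2,
    dp_size=2,
)

outputs = llm.generate(
    ["The capital of France is"],
    {"temperature": 0, "max_new_tokens": 16},
)
print(outputs[0]["text"])
```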
LLM Course
The LLM course is divided into three parts: 🧩 LLM Fundamentals is optional and covers fundamental knowledge about mathematics, Python, and neural networks. 🧑‍🔬 The LLM Scientist focuses on building the best possible LLMs using the latest techniques. 👷 The LLM Engineer focuses on creating LLM-based applications and deploying them.
Coder EvalPlus
EvalPlus is a rigorous evaluation framework for LLM4Code, with: ✨ HumanEval+: 80x more tests than the original HumanEval! ✨ MBPP+: 35x more tests than the original MBPP! ✨ EvalPerf: evaluating the efficiency of LLM-generated code! ✨ Framework: our packages/images/tools can easily and safely evaluate LLMs on the above benchmarks.
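Generating samples for evaluation follows the pattern in the EvalPlus README: produce one solution per HumanEval+ task and write them to a JSONL file for the evalplus.evaluate tool. The gen_solution function below is a hypothetical stand-in for your model:

```python
from evalplus.data import get_human_eval_plus, write_jsonl

def gen_solution(prompt: str) -> str:
    # Hypothetical stand-in: call your LLM here and return the completed code.
    raise NotImplementedError

samples = [
    dict(task_id=task_id, solution=gen_solution(problem["prompt"]))
    for task_id, problem in get_human_eval_plus().items()
]
write_jsonl("samples.jsonl", samples)

# Then score with the bundled CLI, e.g.:
#   evalplus.evaluate --dataset humaneval --samples samples.jsonl
```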
Speculative Decoding in vLLM
Speculative decoding in vLLM is a powerful technique that accelerates token generation by leveraging both small and large models in tandem. In this blog, we’ll break down how speculative decoding works in vLLM and the performance improvements it brings. This content is based on a session from our bi-weekly vLLM Office Hours, where we discuss techniques and updates for optimizing vLLM performance.
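As a concrete starting point, a minimal sketch of draft-model speculative decoding with the offline LLM interface; the model pair mirrors the example in vLLM’s documentation around this release, and the interface may differ in later versions:

```python
from vllm import LLM, SamplingParams

# A small draft model proposes a few tokens per step; the large target model
# verifies them in a single forward pass, accepting the longest correct prefix.
llm = LLM(
    model="facebook/opt-6.7b",               # large target model
    speculative_model="facebook/opt-125m",   # small draft model
    num_speculative_tokens=5,                # draft tokens proposed per step
)

outputs = llm.generate(
    ["The future of AI is"], SamplingParams(temperature=0.0, max_tokens=64)
)
print(outputs[0].outputs[0].text)
```

Because verification accepts or rejects the draft tokens against the target model’s own distribution, the output matches what the large model would have generated on its own; the speedup depends on how often the draft model’s proposals are accepted.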