Large-Scale Expert Parallelism

DeepSeek is a popular open-source large language model (LLM) praised for its strong performance. However, its large size and unique architecture, which uses Multi-head Latent Attention (MLA) and Mixture of Experts (MoE), require an advanced system for efficient serving at scale. In this post, we explain how we match the performance of DeepSeek’s inference system using prefill-decode disaggregation and large-scale expert parallelism (EP) with SGLang.

Qwen3

We are excited to announce the release of Qwen3, the latest addition to the Qwen family of large language models. These models represent our most advanced and intelligent systems to date, building on our experience with QwQ and Qwen2.5. We are making the weights of Qwen3 available to the public, including both dense and Mixture-of-Experts (MoE) models.

Qwen2.5-Omni

We release Qwen2.5-Omni, the new flagship end-to-end multimodal model in the Qwen series. Designed for comprehensive multimodal perception, it seamlessly processes diverse inputs including text, images, audio, and video, while delivering real-time streaming responses through both text generation and natural speech synthesis.

Qwen2.5-VL

In the five months since Qwen2-VL’s release, numerous developers have built new models on the Qwen2-VL vision-language models, providing us with valuable feedback. We are excited to introduce the latest addition to the Qwen family: Qwen2.5-VL. Key enhancements include powerful document parsing capabilities: text recognition is upgraded to omnidocument parsing, which excels at processing multi-scene and multilingual documents as well as documents with various built-in elements (handwriting, tables, charts, chemical formulas, and music sheets).

vLLM V1

We are thrilled to announce the alpha release of vLLM V1, a major upgrade to vLLM’s core architecture. Based on lessons we learned over the past 1.5 years of vLLM development, we revisited key design decisions, consolidated various features, and simplified the codebase to enhance flexibility and scalability. V1 already achieves state-of-the-art performance and is set to gain even more optimizations.

DeepSeek-R1

We introduce DeepSeek’s first-generation reasoning models: DeepSeek-R1-Zero and DeepSeek-R1. DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step, demonstrated remarkable performance on reasoning. DeepSeek-R1 incorporates cold-start data before RL, and achieves performance comparable to OpenAI-o1 across math, code, and reasoning tasks.

DeepSeek-V3

We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total parameters, of which 37B are activated for each token. To achieve efficient inference and cost-effective training, DeepSeek-V3 adopts Multi-head Latent Attention (MLA) and DeepSeekMoE architectures, which were thoroughly validated in DeepSeek-V2. Furthermore, DeepSeek-V3 pioneers an auxiliary-loss-free strategy for load balancing and sets a multi-token prediction training objective for stronger performance.

SGLang v0.4

We’re excited to announce the release of SGLang v0.4, which brings significant performance improvements and new features: a zero-overhead batch scheduler (1.1x increase in throughput); a cache-aware load balancer (up to 1.9x increase in throughput with 3.8x higher cache hit rate); data parallelism attention for DeepSeek models (up to 1.9x decoding throughput improvement); and fast structured outputs with xgrammar (up to 10x faster).