Blog

DeepSeek-V3

We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total parameters, of which 37B are activated for each token. To achieve efficient inference and cost-effective training, DeepSeek-V3 adopts the Multi-head Latent Attention (MLA) and DeepSeekMoE architectures, which were thoroughly validated in DeepSeek-V2. Furthermore, DeepSeek-V3 pioneers an auxiliary-loss-free strategy for load balancing and sets a multi-token prediction training objective for stronger performance.
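To make the "total versus activated parameters" idea concrete, here is a minimal sketch of top-k expert routing, the general mechanism that lets an MoE model run only a few experts per token. It is a generic softmax-over-top-k router written for illustration; the function name `top_k_route` and the example logits are hypothetical, and this is not DeepSeek-V3's actual gating or load-balancing code.

```python
import math
from typing import List, Tuple

# Figures from the excerpt above: 37B of 671B parameters are active per token.
ACTIVE_FRACTION = 37 / 671  # roughly 5.5% of the model runs for each token


def top_k_route(router_logits: List[float], k: int) -> List[Tuple[int, float]]:
    """Pick the k highest-scoring experts and return (expert_index, weight) pairs.

    Weights are a softmax over the selected logits only, so they sum to 1.
    This is a generic illustration, not DeepSeek-V3's gating function.
    """
    # Indices of the k largest logits, highest first.
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    # Numerically stable softmax restricted to the selected experts.
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


if __name__ == "__main__":
    # Four hypothetical experts; only two are evaluated for this token.
    print(top_k_route([0.1, 2.0, -1.0, 1.5], k=2))
    print(f"active fraction: {ACTIVE_FRACTION:.3f}")
```

Only the selected experts' feed-forward blocks would run for the token, which is why the per-token compute tracks the activated parameter count and not the total.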

SGLang v0.4

We’re excited to announce the release of SGLang v0.4, featuring significant performance improvements and new features: a zero-overhead batch scheduler (1.1x increase in throughput); a cache-aware load balancer (up to 1.9x increase in throughput with 3.8x higher cache hit rate); data parallelism attention for DeepSeek models (up to 1.9x decoding throughput improvement); and fast structured outputs with xgrammar (up to 10x faster).

LLM Course

The LLM course is divided into three parts: 🧩 LLM Fundamentals is optional and covers fundamental knowledge about mathematics, Python, and neural networks. 🧑‍🔬 The LLM Scientist focuses on building the best possible LLMs using the latest techniques. 👷 The LLM Engineer focuses on creating LLM-based applications and deploying them.

Coder EvalPlus

EvalPlus is a rigorous evaluation framework for LLM4Code, with: ✨ HumanEval+: 80x more tests than the original HumanEval! ✨ MBPP+: 35x more tests than the original MBPP! ✨ EvalPerf: evaluating the efficiency of LLM-generated code! ✨ Framework: our packages/images/tools can easily and safely evaluate LLMs on the above benchmarks.

Speculative Decoding in vLLM

Speculative decoding in vLLM is a powerful technique that accelerates token generation by leveraging both small and large models in tandem. In this blog, we’ll break down speculative decoding in vLLM, how it works, and the performance improvements it brings. This content is based on a session from our bi-weekly vLLM Office Hours, where we discuss techniques and updates to optimize vLLM performance.
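As a rough illustration of how a small and a large model can work in tandem, here is a minimal sketch of greedy speculative decoding. It assumes both models are plain functions that map a token list to the next token; the names `speculative_decode`, `target`, `draft`, and `k` are hypothetical and this is not vLLM's implementation or API. A real engine verifies all drafted positions in one batched forward pass of the large model and typically uses a probabilistic acceptance rule instead of exact matching.

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]  # greedy "model": context -> next token


def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[Token], max_new_tokens: int,
                       k: int = 4) -> List[Token]:
    """Greedy speculative decoding sketch.

    The draft model proposes up to k tokens; the target model checks them.
    Matching tokens are accepted, the first mismatch is replaced by the
    target's own token, so the result equals plain greedy decoding with
    the target model.
    """
    tokens = list(prompt)
    goal = len(prompt) + max_new_tokens
    while len(tokens) < goal:
        # 1. The cheap draft model proposes a short continuation.
        proposal: List[Token] = []
        for _ in range(min(k, goal - len(tokens))):
            proposal.append(draft(tokens + proposal))

        # 2. The target model verifies each proposed position.
        #    (Sequential here for clarity; batched in a real system.)
        accepted: List[Token] = []
        for guess in proposal:
            expected = target(tokens + accepted)
            if guess == expected:
                accepted.append(guess)
            else:
                accepted.append(expected)  # correct the first mismatch, then stop
                break
        else:
            # Every draft token matched: take one extra target token if room remains.
            if len(tokens) + len(accepted) < goal:
                accepted.append(target(tokens + accepted))

        tokens.extend(accepted)
    return tokens
```

The speedup in practice comes from step 2: when the draft model guesses well, several tokens are confirmed for the cost of a single large-model pass.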

vLLM v0.6

vLLM achieves 2.7x higher throughput and 5x faster TPOT (time per output token) on the Llama 8B model, and 1.8x higher throughput and 2x lower TPOT on the Llama 70B model. A month ago, we released our performance roadmap, committing to performance as our top priority. We will start by diagnosing the performance bottlenecks in earlier versions of vLLM. Then we will describe the solutions we implemented and landed in the past month. Finally, we will showcase benchmarks of the latest vLLM release, v0.6.0, against other inference engines.

SGLang v0.3

We’re excited to announce the release of SGLang v0.3, which brings significant performance enhancements and expanded support for novel model architectures. Here are the key updates: Up to 7x higher throughput for DeepSeek Multi-head Latent Attention (MLA). Up to 1.5x lower latency with torch.compile on small batch sizes. Support for interleaved text and multi-image/video in LLaVA-OneVision. Support for interleaved window attention and 2x longer context length in Gemma-2.

Qwen2-VL

Qwen2-VL is the latest version of the vision language models in the Qwen model family. SoTA understanding of images of various resolutions and aspect ratios: Qwen2-VL achieves state-of-the-art performance on visual understanding benchmarks, including MathVista, DocVQA, RealWorldQA, MTVQA, etc. Understanding videos of 20min+: with its online streaming capabilities, Qwen2-VL can understand videos over 20 minutes long for high-quality video-based question answering, dialog, content creation, etc.
