Chatbot Arena

This leaderboard is based on the following three benchmarks. Chatbot Arena - a crowdsourced, randomized battle platform for large language models (LLMs). We use 2.2M+ user votes to compute Elo ratings. MT-Bench - a set of challenging multi-turn questions. We use GPT-4 to grade model responses. MMLU (5-shot) - a test to measure a model’s multitask accuracy on 57 tasks.

Qwen2.5-Coder

Today, we are excited to open source the “Powerful”, “Diverse”, and “Practical” Qwen2.5-Coder series, dedicated to continuously promoting the development of Open CodeLLMs. Powerful: Qwen2.5-Coder-32B-Instruct has become the current SOTA open-source code model, matching the coding capabilities of GPT-4o. While demonstrating strong and comprehensive coding abilities, it also possesses good general and mathematical skills.

Coder EvalPlus

EvalPlus is a rigorous evaluation framework for LLM4Code, with: ✨ HumanEval+: 80x more tests than the original HumanEval! ✨ MBPP+: 35x more tests than the original MBPP! ✨ Evaluation framework: our packages/images/tools can easily and safely evaluate LLMs on above benchmarks. File a request to add your models on our leaderboard!

Speculative Decoding in vLLM

Speculative decoding in vLLM is a powerful technique that accelerates token generation by leveraging both small and large models in tandem. In this blog, we’ll break down speculative decoding in vLLM, how it works, and the performance improvements it brings. This content is based on a session from our bi-weekly vLLM Office Hours, where we discuss techniques and updates to optimize vLLM performance.

Llama 3.2

We’ve been excited by the impact the Llama 3.1 herd of models have made in the two months since we announced them, including the 405B—the first open frontier-level AI model. Today, we’re releasing Llama 3.2, which includes small and medium-sized vision LLMs (11B and 90B), and lightweight, text-only models (1B and 3B) that fit onto select edge and mobile devices.

Qwen2.5

In the past three months since Qwen2’s release, numerous developers have built new models on the Qwen2 language models, providing us with valuable feedback. During this period, we have focused on creating smarter and more knowledgeable language models. Today, we are excited to introduce the latest addition to the Qwen family.

vLLM v0.6

vLLM achieves 2.7x higher throughput and 5x faster TPOT (time per output token) on Llama 8B model, and 1.8x higher throughput and 2x less TPOT on Llama 70B model. A month ago, we released our performance roadmap committing to performance as our top priority. We will start by diagnosing the performance bottleneck in vLLM previously.

SGLang v0.3

We’re excited to announce the release of SGLang v0.3, which brings significant performance enhancements and expanded support for novel model architectures. Here are the key updates: Up to 7x higher throughput for DeepSeek Multi-head Latent Attention (MLA). Up to 1.5x lower latency with torch.compile on small batch sizes. Support for interleaved text and multi-image/video in LLaVA-OneVision.