Chatbot Arena

This leaderboard is based on the following benchmarks:
- Chatbot Arena: a crowdsourced, randomized battle platform for large language models (LLMs). We use 3.8M+ user votes to compute Elo ratings.
- AAII: the Artificial Analysis Intelligence Index, aggregating 8 challenging evaluations.
- ARC-AGI: an Artificial General Intelligence benchmark (v2) that measures fluid intelligence.
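Elo ratings turn pairwise battle votes into a single ranking: after each vote, the winner gains and the loser loses rating in proportion to how surprising the outcome was. Below is a minimal sketch of that update rule; the function names and K-factor are illustrative assumptions, not Chatbot Arena's exact method (the leaderboard fits a related statistical model over all votes at once).

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that a player rated r_a beats one rated r_b under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update_elo(ratings: dict, winner: str, loser: str, k: float = 32.0) -> None:
    """Apply one battle outcome in place with K-factor k."""
    ra, rb = ratings[winner], ratings[loser]
    ea = expected_score(ra, rb)          # how expected the win was
    ratings[winner] = ra + k * (1 - ea)  # surprising wins move ratings more
    ratings[loser] = rb - k * (1 - ea)

# One battle between two models that start at the same rating.
ratings = {"model-a": 1000.0, "model-b": 1000.0}
update_elo(ratings, winner="model-a", loser="model-b")
```

Because both models started equal, the expected score is 0.5 and the winner gains exactly half the K-factor; total rating is conserved across every update.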

GPT OSS

Welcome to the gpt-oss series, OpenAI’s open-weight models designed for powerful reasoning, agentic tasks, and versatile developer use cases. We’re releasing two flavors of these open models:
- gpt-oss-120b: for production, general-purpose, high-reasoning use cases that fit on a single 80 GB GPU (such as an NVIDIA H100 or AMD MI300X); 117B total parameters with 5.1B active parameters.
- gpt-oss-20b: for lower-latency, local, or specialized use cases; 21B total parameters with 3.6B active parameters.
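The gap between total and active parameters comes from Mixture-of-Experts routing: a gating network picks a few experts per token, so only those experts' weights participate in each forward pass. The sketch below illustrates the idea with toy dimensions and expert counts; it is not gpt-oss's actual architecture or configuration.

```python
import math
import random

random.seed(0)
d_model, n_experts, top_k = 4, 8, 2  # toy sizes, chosen for illustration

def rand_matrix(rows, cols):
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

def matvec(m, v):
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in m]

experts = [rand_matrix(d_model, d_model) for _ in range(n_experts)]  # expert FFNs
router = rand_matrix(n_experts, d_model)                             # gating weights

def moe_forward(x):
    """Route a token vector to its top-k experts and mix their outputs."""
    logits = matvec(router, x)
    top = sorted(range(n_experts), key=lambda i: logits[i])[-top_k:]
    weights = [math.exp(logits[i]) for i in top]
    total = sum(weights)
    weights = [w / total for w in weights]        # softmax over chosen experts only
    out = [0.0] * d_model
    for w, i in zip(weights, top):                # only top_k experts actually run
        y = matvec(experts[i], x)
        out = [o + w * y_j for o, y_j in zip(out, y)]
    return out

y = moe_forward([1.0, -0.5, 0.3, 0.2])
active_fraction = top_k / n_experts  # 2 of 8 expert FFNs run per token
```

Per token, only `top_k / n_experts` of the expert weights do work, which is why a 117B-parameter model can have just 5.1B active parameters.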

SWE-bench

SWE-bench is a benchmark for evaluating large language models on real-world software issues collected from GitHub. Given a codebase and an issue, a language model is tasked with generating a patch that resolves the described problem. SWE-bench Verified is a human-validated subset that more reliably evaluates AI models’ ability to solve issues. The International Olympiad in Informatics (IOI) competition benchmark features standardized and automated grading.
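The core evaluation loop for a benchmark like this is: apply the model's patch to a checkout of the repository, then run the issue's tests and check they pass. The sketch below shows that flow under simplifying assumptions (a plain `git apply` and a caller-supplied test command); it is not the official SWE-bench harness, which also manages per-task environments and test selection.

```python
import subprocess

def evaluate_patch(repo_dir: str, patch_text: str, test_cmd: list) -> bool:
    """Return True iff the patch applies cleanly and the tests then pass."""
    # Feed the model-generated diff to `git apply` via stdin ("-").
    applied = subprocess.run(
        ["git", "apply", "-"], cwd=repo_dir, input=patch_text, text=True
    )
    if applied.returncode != 0:
        return False  # patch did not apply: the task is unresolved
    # A task counts as resolved only if the designated tests pass afterwards.
    return subprocess.run(test_cmd, cwd=repo_dir).returncode == 0
```

A malformed or non-applying patch fails immediately, mirroring how harnesses score unresolvable generations as failures rather than errors.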

GLM-4.5

The GLM-4.5 series models are foundation models designed for intelligent agents. GLM-4.5 has 355 billion total parameters with 32 billion active parameters, while GLM-4.5-Air adopts a more compact design with 106 billion total parameters and 12 billion active parameters. Both models unify reasoning, coding, and agentic capabilities to meet the complex demands of intelligent agent applications.

Qwen3-Coder

We’re announcing Qwen3-Coder, our most agentic code model to date. Qwen3-Coder is available in multiple sizes, but we’re excited to introduce its most powerful variant first: Qwen3-Coder-480B-A35B-Instruct — a 480B-parameter Mixture-of-Experts model with 35B active parameters, offering exceptional performance in both coding and agentic tasks. It sets new state-of-the-art results among open models on Agentic Coding, Agentic Browser-Use, and Agentic Tool-Use.

MTP in SGLang

SGLang is the first and only open-source serving framework to support Multiple Token Prediction (MTP) in combination with Large-Scale Expert Parallelism (EP) and Prefill-Decode disaggregation. This integration delivers up to 60% higher output throughput through a new decoding paradigm, better parallelism, and more efficient resource utilization without sacrificing generation quality.
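Multi-token prediction speeds up decoding via a draft-and-verify pattern: extra tokens are proposed cheaply, then confirmed against the full model so output quality is preserved. The toy sketch below illustrates that accept/reject loop with stand-in callables; it is not SGLang's implementation, which verifies drafts in a single batched target pass.

```python
def speculative_step(draft_model, target_model, prefix, k=4):
    """Extend `prefix` with up to k draft tokens that the target model confirms."""
    # Draft pass: propose k tokens cheaply, one at a time.
    proposed, ctx = [], list(prefix)
    for _ in range(k):
        t = draft_model(ctx)
        proposed.append(t)
        ctx.append(t)
    # Verify pass: keep drafts while the target model agrees; on the first
    # disagreement, emit the target's own token and stop.
    accepted, ctx = [], list(prefix)
    for t in proposed:
        choice = target_model(ctx)
        accepted.append(choice)
        if choice != t:
            break
        ctx.append(choice)
    return list(prefix) + accepted

# Toy models: the next token is a simple function of context length.
draft = target = lambda ctx: len(ctx) % 3
```

When draft and target agree, all k tokens clear verification in one step instead of k sequential decode steps; that amortization is the source of the throughput gain.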

slime

slime is an LLM post-training framework for RL scaling, providing two core capabilities:
- High-performance training: supports efficient training in various modes by connecting Megatron with SGLang.
- Flexible data generation: enables arbitrary training-data generation workflows through custom data-generation interfaces and server-based engines.
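A custom data-generation interface of the kind described above typically boils down to a rollout function: query a serving engine for responses and attach rewards for RL training. The sketch below is hypothetical; the names (`Sample`, `generate_rollouts`) are invented for illustration and are not slime's actual API.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Sample:
    """One RL training example: a prompt, a generated response, and its reward."""
    prompt: str
    response: str
    reward: float

def generate_rollouts(prompts: List[str],
                      engine: Callable[[str], str],
                      reward_fn: Callable[[str, str], float]) -> List[Sample]:
    """Generate one response per prompt and score it for RL training."""
    samples = []
    for p in prompts:
        r = engine(p)  # in practice, a call to a serving engine such as SGLang
        samples.append(Sample(p, r, reward_fn(p, r)))
    return samples
```

Keeping the engine and reward function as plain callables is what makes the workflow "arbitrary": swapping in a different server or scorer changes the data pipeline without touching the training loop.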

Open R1

A fully open reproduction of DeepSeek-R1. The goal of this repo is to build the missing pieces of the R1 pipeline such that everybody can reproduce and build on top of it.