Chatbot Arena +
❖ This leaderboard is based on the following benchmarks. Chatbot Arena: a crowdsourced, randomized battle platform for large language models (LLMs), where 6M+ user votes are used to compute Elo ratings. AAII: the Artificial Analysis Intelligence Index v3, which aggregates 10 challenging evaluations. ARC-AGI: the Artificial General Intelligence benchmark v2, which measures fluid intelligence.
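To make "user votes to Elo ratings" concrete, here is a minimal sketch of the textbook online Elo update applied to pairwise votes. The leaderboard's actual rating procedure is not described here, so the starting rating of 1000, the K-factor of 32, and the model names are all assumptions for illustration.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(ratings: dict, winner: str, loser: str, k: float = 32.0) -> dict:
    """Apply one vote: the winner gains exactly what the loser loses."""
    r_win = ratings.get(winner, 1000.0)   # assumed starting rating
    r_lose = ratings.get(loser, 1000.0)
    gain = k * (1.0 - expected_score(r_win, r_lose))
    ratings[winner] = r_win + gain
    ratings[loser] = r_lose - gain
    return ratings


# Hypothetical votes: each tuple is (winner, loser) of one battle.
votes = [("model_a", "model_b"), ("model_a", "model_c"), ("model_b", "model_c")]

ratings = {}
for winner, loser in votes:
    update_elo(ratings, winner, loser)

# Print the resulting leaderboard, best first.
for name, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {rating:.1f}")
```

An upset (a low-rated model beating a high-rated one) moves both ratings more than an expected win does, which is why the order of votes matters in this online variant.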
DeepSeek-V4
❖ We present a preview version of the DeepSeek-V4 series, including two strong Mixture-of-Experts (MoE) language models: DeepSeek-V4-Pro with 1.6T parameters (49B activated) and DeepSeek-V4-Flash with 284B parameters (13B activated). Both support a context length of one million tokens.
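As a quick arithmetic check on what "activated" means for an MoE model, the sketch below computes the share of parameters used per token from the figures quoted above. The dictionary layout and function name are illustrative only.

```python
# (total parameters, activated parameters per token), as quoted in the blurb.
MODELS = {
    "DeepSeek-V4-Pro": (1.6e12, 49e9),
    "DeepSeek-V4-Flash": (284e9, 13e9),
}


def active_fraction(total_params: float, active_params: float) -> float:
    """Share of the model's parameters that participate in one forward pass."""
    return active_params / total_params


for name, (total, active) in MODELS.items():
    print(f"{name}: {active_fraction(total, active):.1%} of parameters active per token")
```

Both models activate only a few percent of their weights per token, which is the usual reason MoE models are cheaper to run than dense models of the same total size.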
Qwen3.6
❖ Over recent months, we have intensified our focus on developing foundation models that deliver exceptional utility and performance. Qwen3.6 represents a significant leap forward, integrating breakthroughs in multimodal learning, architectural efficiency, reinforcement learning scale, and global accessibility to empower developers and enterprises with unprecedented capability and efficiency.
Kimi K2.6
❖ Kimi K2.6 is an open-source, native multimodal agentic model built through continual pretraining on approximately 15 trillion mixed visual and text tokens atop Kimi-K2-Base. It seamlessly integrates vision and language understanding with advanced agentic capabilities, instant and thinking modes, as well as conversational and agentic paradigms.
GLM-5.1
❖ GLM-5.1 is our next-generation flagship model for agentic engineering, with significantly stronger coding capabilities than its predecessor. It achieves state-of-the-art performance on SWE-Bench Pro and leads GLM-5 by a wide margin on NL2Repo (repo generation) and Terminal-Bench 2.0 (real-world terminal tasks).
SWE-bench +
❖ SWE-bench is a benchmark for evaluating large language models on real-world software issues collected from GitHub. Given a codebase and an issue, a language model is tasked with generating a patch that resolves the described problem. SWE-bench Verified is a human-validated subset that more reliably evaluates AI models’ ability to solve issues. The International Olympiad in Informatics (IOI) competition features standardized, automated grading.
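The scoring rule behind "resolves the described problem" can be sketched as follows: a patch counts as resolving an instance when the tests that failed before the fix now pass and the tests that already passed still pass. The record layout, field names, and instance IDs below are hypothetical and are not the official harness format.

```python
def is_resolved(result: dict) -> bool:
    """Resolved only if every target test now passes and nothing regressed."""
    return all(result["fail_to_pass"].values()) and all(result["pass_to_pass"].values())


def resolved_rate(results: list) -> float:
    """Fraction of instances whose generated patch resolved the issue."""
    if not results:
        return 0.0
    return sum(is_resolved(r) for r in results) / len(results)


# Hypothetical per-instance test outcomes after applying a model's patch.
results = [
    {"instance_id": "demo-1", "fail_to_pass": {"test_fix": True}, "pass_to_pass": {"test_old": True}},
    {"instance_id": "demo-2", "fail_to_pass": {"test_fix": True}, "pass_to_pass": {"test_old": False}},
    {"instance_id": "demo-3", "fail_to_pass": {"test_fix": False}, "pass_to_pass": {"test_old": True}},
    {"instance_id": "demo-4", "fail_to_pass": {"test_fix": True}, "pass_to_pass": {"test_old": True}},
]

rate = resolved_rate(results)
print(f"resolved: {rate:.0%}")
```

Note that demo-2 fixes the issue but breaks an existing test, so it does not count as resolved.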
Attention Residuals
❖ This work introduces Attention Residuals (AttnRes), a drop-in replacement for standard residual connections in Transformers that enables each layer to selectively aggregate earlier representations via learned, input-dependent attention over depth.
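The idea of "attention over depth" can be illustrated with a toy sketch: instead of summing all earlier layer outputs with fixed unit weights, the current layer scores each earlier output against a learned query and mixes them with softmax weights. The parameterization below (one query vector per layer, scaled dot-product scores, one vector per earlier layer) is an assumption made for illustration and is not taken from the AttnRes paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def standard_residual(earlier):
    """Baseline: the residual stream sums earlier outputs with fixed unit weights."""
    dim = len(earlier[0])
    return [sum(vec[i] for vec in earlier) for i in range(dim)]


def attn_residual(earlier, query):
    """Mix earlier outputs with weights that depend on their content.

    earlier: one vector per earlier layer (embedding included), each of length d.
    query:   hypothetical learned vector of length d for the current layer.
    """
    dim = len(query)
    scores = [sum(q * h for q, h in zip(query, vec)) / math.sqrt(dim) for vec in earlier]
    weights = softmax(scores)  # attention distribution over depth
    mixed = [sum(w * vec[i] for w, vec in zip(weights, earlier)) for i in range(dim)]
    return mixed, weights


earlier = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mixed, weights = attn_residual(earlier, query=[2.0, 0.0])
print("depth weights:", [round(w, 3) for w in weights])
print("mixed input:", [round(x, 3) for x in mixed])
```

Because the scores are computed from the earlier representations themselves, the weights change with the input; a zero query reduces the mix to a plain average over depth.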
GLM-4.7 with SGLang
❖ The GLM-4.x series models are foundation models designed for intelligent agents. GLM-4.7 has 355 billion total parameters with 32 billion active parameters, while GLM-4.5-Air adopts a more compact design with 106 billion total parameters and 12 billion active parameters. GLM-4.x models unify reasoning, coding, and agentic capabilities to meet the complex demands of intelligent agent applications.