Chatbot Arena +
❖ This leaderboard is based on the following benchmarks. Arena - a crowdsourced, randomized battle platform for large language models (LLMs). We use 6M+ user votes to compute Elo ratings. AAII - Artificial Analysis Intelligence Index v3 aggregating 10 challenging evaluations. ARC-AGI - Artificial General Intelligence benchmark v2 to measure fluid intelligence.
GLM-5.2
❖ GLM-5.2 is Z.ai’s latest flagship model for coding and long-horizon tasks. It marks a substantial leap in long-horizon task capability over its predecessor GLM-5.1 and delivers that capability on a solid 1M-token context. It is pure open with an MIT open-source license — no regional limits, technical access without borders.
Kimi K2.7
❖ Kimi K2.7 Code is a coding-focused agentic model built upon Kimi K2.6. With substantial improvements on real-world long-horizon coding tasks, it strengthens end-to-end task completion across complex software engineering workflows while improving token efficiency, reducing thinking-token usage by approximately 30% compared with Kimi K2.6.
DeepSeek-V4
❖ We present a preview version of DeepSeek-V4 series, including two strong Mixture-of-Experts (MoE) language models — DeepSeek-V4-Pro with 1.6T parameters (49B activated) and DeepSeek-V4-Flash with 284B parameters (13B activated) — both supporting a context length of one million tokens.
Qwen3.6
❖ Over recent months, we have intensified our focus on developing foundation models that deliver exceptional utility and performance. Qwen3.6 represents a significant leap forward, integrating breakthroughs in multimodal learning, architectural efficiency, reinforcement learning scale, and global accessibility to empower developers and enterprises with unprecedented capability and efficiency.
SWE-bench +
❖ SWE-bench is a benchmark for evaluating large language models on real world software issues collected from GitHub. Given a codebase and an issue, a language model is tasked with generating a patch that resolves the described problem. SWE-bench Verified is a human-validated subset that more reliably evaluates AI models’ ability to solve issues. International Olympiad in Informatics (IOI) competition features standardized and automated grading.
Attention Residuals
❖ This is the introduction of Attention Residuals (AttnRes), a drop-in replacement for standard residual connections in Transformers that enables each layer to selectively aggregate earlier representations via learned, input-dependent attention over depth.
GLM-4.7 with SGLang
❖ The GLM-4.x series models are foundation models designed for intelligent agents. GLM-4.7 has 355 billion total parameters with 32 billion active parameters, while GLM-4.5-Air adopts a more compact design with 106 billion total parameters and 12 billion active parameters. GLM-4.x models unify reasoning, coding, and intelligent agent capabilities to meet the complex demands of intelligent agent applications.