We present a preview of the DeepSeek-V4 series, comprising two strong Mixture-of-Experts (MoE) language models — DeepSeek-V4-Pro with 1.6T parameters (49B activated) and DeepSeek-V4-Flash with 284B parameters (13B activated) — both supporting a context length of one million tokens.


Technical Report

Introduction

The DeepSeek-V4 series incorporates several key upgrades in architecture and optimization:

  1. Hybrid Attention Architecture: We design a hybrid attention mechanism combining Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA) to dramatically improve long-context efficiency. In the 1M-token context setting, DeepSeek-V4-Pro requires only 27% of single-token inference FLOPs and 10% of KV cache compared with DeepSeek-V3.2.
  2. Manifold-Constrained Hyper-Connections (mHC): We incorporate mHC to strengthen conventional residual connections, enhancing stability of signal propagation across layers while preserving model expressivity.
  3. Muon Optimizer: We employ the Muon optimizer for faster convergence and greater training stability.
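The 10% KV-cache figure is easiest to appreciate with a back-of-envelope calculation. The sketch below uses a generic per-layer K/V cache formula with illustrative placeholder dimensions (the actual CSA/HCA cache layout is not spelled out here), just to show the scale of savings at 1M tokens:

```python
# Back-of-envelope KV-cache sizing. The exact CSA/HCA cache layout is not
# described here, so this is a generic estimate for a conventional cache:
# 2 tensors (K and V) per layer, per token. All dimensions below are
# illustrative placeholders, not DeepSeek-V4's real configuration.

def kv_cache_gib(num_layers: int, kv_dim: int, seq_len: int, bytes_per_elem: int) -> float:
    """Size of a standard K/V cache in GiB."""
    return 2 * num_layers * kv_dim * seq_len * bytes_per_elem / 2**30

baseline = kv_cache_gib(num_layers=61, kv_dim=512, seq_len=1_000_000, bytes_per_elem=1)
# A cache compressed to 10% of the baseline, per the figure quoted above:
compressed = 0.10 * baseline
print(f"baseline: {baseline:.1f} GiB, compressed: {compressed:.1f} GiB")
```

At 1M tokens even a single-byte-per-element cache runs to tens of GiB per request, which is why the 10x cache reduction directly determines how many concurrent long-context requests fit on one device.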

We pre-train both models on more than 32T diverse and high-quality tokens, followed by a comprehensive post-training pipeline. Post-training follows a two-stage paradigm: domain-specific experts are first cultivated independently (via SFT and RL with GRPO), then consolidated into a single unified model through on-policy distillation, integrating their distinct proficiencies across diverse domains.
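As a concrete illustration of the GRPO step mentioned above: the core of the algorithm is a group-relative advantage, where several responses are sampled per prompt and each response's reward is normalized against the group's statistics. A minimal sketch (the reward values are invented for illustration):

```python
# Minimal sketch of GRPO's group-relative advantage: sample a group of
# responses per prompt, score each one, and normalize rewards within the
# group. Reward values here are made up for illustration only.

def grpo_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize a group's rewards to roughly zero mean and unit variance."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Correct answers scored 1.0, wrong answers 0.0 -> advantages near +1 / -1:
print(grpo_advantages([1.0, 0.0, 1.0, 0.0]))
```

Normalizing within the group removes the need for a separate value model: a response is reinforced only to the extent that it beats its sibling samples.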

DeepSeek-V4-Pro-Max, the maximum reasoning effort mode of DeepSeek-V4-Pro, significantly advances the knowledge capabilities of open-source models, firmly establishing itself as the best open-source model available today. It achieves top-tier performance in coding benchmarks and significantly bridges the gap with leading closed-source models on reasoning and agentic tasks. Meanwhile, DeepSeek-V4-Flash-Max achieves comparable reasoning performance to the Pro version when given a larger thinking budget, though its smaller parameter scale naturally places it slightly behind on pure knowledge tasks and the most complex agentic workflows.

Model Downloads

| Model | # Total Params | # Activated Params | Context Length | Precision | Download |
| :--- | :--- | :--- | :--- | :--- | :--- |
| DeepSeek-V4-Flash-Base | 284B | 13B | 1M | FP8 Mixed | HuggingFace \| ModelScope |
| DeepSeek-V4-Flash | 284B | 13B | 1M | FP4 + FP8 Mixed* | HuggingFace \| ModelScope |
| DeepSeek-V4-Pro-Base | 1.6T | 49B | 1M | FP8 Mixed | HuggingFace \| ModelScope |
| DeepSeek-V4-Pro | 1.6T | 49B | 1M | FP4 + FP8 Mixed* | HuggingFace \| ModelScope |

*FP4 + FP8 Mixed: MoE expert parameters use FP4 precision; most other parameters use FP8.
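To see why putting expert parameters in FP4 matters, here is a rough weight-memory estimator under this mixed scheme. The expert fraction used in the example is a placeholder, not a published figure for DeepSeek-V4:

```python
# Rough weight-memory estimate for the FP4 + FP8 mixed scheme: MoE expert
# weights at 4 bits/param, everything else at 8 bits/param. The expert
# fraction below is a placeholder assumption, not the model's actual split.

def weight_memory_gib(total_params: float, expert_fraction: float,
                      expert_bits: int = 4, other_bits: int = 8) -> float:
    """Approximate weight storage in GiB for a mixed-precision MoE model."""
    expert_bytes = total_params * expert_fraction * expert_bits / 8
    other_bytes = total_params * (1 - expert_fraction) * other_bits / 8
    return (expert_bytes + other_bytes) / 2**30

# e.g. 1.6T total params, assuming (hypothetically) ~95% sit in MoE experts:
print(f"{weight_memory_gib(1.6e12, 0.95):.0f} GiB")
```

Since the vast majority of a large MoE model's parameters live in the experts, halving their precision from FP8 to FP4 cuts total weight memory by nearly half.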

Evaluation Results

Base Model

| Benchmark (Metric) | # Shots | DeepSeek-V3.2-Base | DeepSeek-V4-Flash-Base | DeepSeek-V4-Pro-Base |
| :--- | :--- | :--- | :--- | :--- |
| Architecture | - | MoE | MoE | MoE |
| # Activated Params | - | 37B | 13B | 49B |
| # Total Params | - | 671B | 284B | 1.6T |
| **World Knowledge** | | | | |
| AGIEval (EM) | 0-shot | 80.1 | 82.6 | 83.1 |
| MMLU (EM) | 5-shot | 87.8 | 88.7 | 90.1 |
| MMLU-Redux (EM) | 5-shot | 87.5 | 89.4 | 90.8 |
| MMLU-Pro (EM) | 5-shot | 65.5 | 68.3 | 73.5 |
| MMMLU (EM) | 5-shot | 87.9 | 88.8 | 90.3 |
| C-Eval (EM) | 5-shot | 90.4 | 92.1 | 93.1 |
| CMMLU (EM) | 5-shot | 88.9 | 90.4 | 90.8 |
| MultiLoKo (EM) | 5-shot | 38.7 | 42.2 | 51.1 |
| SimpleQA-Verified (EM) | 25-shot | 28.3 | 30.1 | 55.2 |
| SuperGPQA (EM) | 5-shot | 45.0 | 46.5 | 53.9 |
| FACTS Parametric (EM) | 25-shot | 27.1 | 33.9 | 62.6 |
| TriviaQA (EM) | 5-shot | 83.3 | 82.8 | 85.6 |
| **Language & Reasoning** | | | | |
| BBH (EM) | 3-shot | 87.6 | 86.9 | 87.5 |
| DROP (F1) | 1-shot | 88.2 | 88.6 | 88.7 |
| HellaSwag (EM) | 0-shot | 86.4 | 85.7 | 88.0 |
| WinoGrande (EM) | 0-shot | 78.9 | 79.5 | 81.5 |
| CLUEWSC (EM) | 5-shot | 83.5 | 82.2 | 85.2 |
| **Code & Math** | | | | |
| BigCodeBench (Pass@1) | 3-shot | 63.9 | 56.8 | 59.2 |
| HumanEval (Pass@1) | 0-shot | 62.8 | 69.5 | 76.8 |
| GSM8K (EM) | 8-shot | 91.1 | 90.8 | 92.6 |
| MATH (EM) | 4-shot | 60.5 | 57.4 | 64.5 |
| MGSM (EM) | 8-shot | 81.3 | 85.7 | 84.4 |
| CMath (EM) | 3-shot | 92.6 | 93.6 | 90.9 |
| **Long Context** | | | | |
| LongBench-V2 (EM) | 1-shot | 40.2 | 44.7 | 51.5 |

Instruct Model

DeepSeek-V4-Pro and DeepSeek-V4-Flash both support three reasoning effort modes:

| Reasoning Mode | Characteristics | Typical Use Cases | Response Format |
| :--- | :--- | :--- | :--- |
| Non-think | Fast, intuitive responses | Routine daily tasks, low-risk decisions | `</think> summary` |
| Think High | Conscious logical analysis, slower but more accurate | Complex problem-solving, planning | `<think> thinking </think> summary` |
| Think Max | Push reasoning to its fullest extent | Exploring the boundary of model reasoning capability | Special system prompt + `<think> thinking </think> summary` |

DeepSeek-V4-Pro-Max vs Frontier Models

| Benchmark (Metric) | Opus-4.6 Max | GPT-5.4 xHigh | Gemini-3.1-Pro High | K2.6 Thinking | GLM-5.1 Thinking | DS-V4-Pro Max |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| **Knowledge & Reasoning** | | | | | | |
| MMLU-Pro (EM) | 89.1 | 87.5 | 91.0 | 87.1 | 86.0 | 87.5 |
| SimpleQA-Verified (Pass@1) | 46.2 | 45.3 | 75.6 | 36.9 | 38.1 | 57.9 |
| Chinese-SimpleQA (Pass@1) | 76.4 | 76.8 | 85.9 | 75.9 | 75.0 | 84.4 |
| GPQA Diamond (Pass@1) | 91.3 | 93.0 | 94.3 | 90.5 | 86.2 | 90.1 |
| HLE (Pass@1) | 40.0 | 39.8 | 44.4 | 36.4 | 34.7 | 37.7 |
| LiveCodeBench (Pass@1) | 88.8 | - | 91.7 | 89.6 | - | 93.5 |
| Codeforces (Rating) | - | 3168 | 3052 | - | - | 3206 |
| HMMT 2026 Feb (Pass@1) | 96.2 | 97.7 | 94.7 | 92.7 | 89.4 | 95.2 |
| IMOAnswerBench (Pass@1) | 75.3 | 91.4 | 81.0 | 86.0 | 83.8 | 89.8 |
| Apex (Pass@1) | 34.5 | 54.1 | 60.9 | 24.0 | 11.5 | 38.3 |
| Apex Shortlist (Pass@1) | 85.9 | 78.1 | 89.1 | 75.5 | 72.4 | 90.2 |
| **Long Context** | | | | | | |
| MRCR 1M (MMR) | 92.9 | - | 76.3 | - | - | 83.5 |
| CorpusQA 1M (ACC) | 71.7 | - | 53.8 | - | - | 62.0 |
| **Agentic** | | | | | | |
| Terminal Bench 2.0 (Acc) | 65.4 | 75.1 | 68.5 | 66.7 | 63.5 | 67.9 |
| SWE Verified (Resolved) | 80.8 | - | 80.6 | 80.2 | - | 80.6 |
| SWE Pro (Resolved) | 57.3 | 57.7 | 54.2 | 58.6 | 58.4 | 55.4 |
| SWE Multilingual (Resolved) | 77.5 | - | - | 76.7 | 73.3 | 76.2 |
| BrowseComp (Pass@1) | 83.7 | 82.7 | 85.9 | 83.2 | 79.3 | 83.4 |
| HLE w/ tools (Pass@1) | 53.1 | 52.0 | 51.6 | 54.0 | 50.4 | 48.2 |
| GDPval-AA (Elo) | 1619 | 1674 | 1314 | 1482 | 1535 | 1554 |
| MCPAtlas Public (Pass@1) | 73.8 | 67.2 | 69.2 | 66.6 | 71.8 | 73.6 |
| Toolathlon (Pass@1) | 47.2 | 54.6 | 48.8 | 50.0 | 40.7 | 51.8 |

Comparison across Modes

| Benchmark (Metric) | V4-Flash Non-Think | V4-Flash High | V4-Flash Max | V4-Pro Non-Think | V4-Pro High | V4-Pro Max |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| **Knowledge & Reasoning** | | | | | | |
| MMLU-Pro (EM) | 83.0 | 86.4 | 86.2 | 82.9 | 87.1 | 87.5 |
| SimpleQA-Verified (Pass@1) | 23.1 | 28.9 | 34.1 | 45.0 | 46.2 | 57.9 |
| Chinese-SimpleQA (Pass@1) | 71.5 | 73.2 | 78.9 | 75.8 | 77.7 | 84.4 |
| GPQA Diamond (Pass@1) | 71.2 | 87.4 | 88.1 | 72.9 | 89.1 | 90.1 |
| HLE (Pass@1) | 8.1 | 29.4 | 34.8 | 7.7 | 34.5 | 37.7 |
| LiveCodeBench (Pass@1) | 55.2 | 88.4 | 91.6 | 56.8 | 89.8 | 93.5 |
| Codeforces (Rating) | - | 2816 | 3052 | - | 2919 | 3206 |
| HMMT 2026 Feb (Pass@1) | 40.8 | 91.9 | 94.8 | 31.7 | 94.0 | 95.2 |
| IMOAnswerBench (Pass@1) | 41.9 | 85.1 | 88.4 | 35.3 | 88.0 | 89.8 |
| Apex (Pass@1) | 1.0 | 19.1 | 33.0 | 0.4 | 27.4 | 38.3 |
| Apex Shortlist (Pass@1) | 9.3 | 72.1 | 85.7 | 9.2 | 85.5 | 90.2 |
| **Long Context** | | | | | | |
| MRCR 1M (MMR) | 37.5 | 76.9 | 78.7 | 44.7 | 83.3 | 83.5 |
| CorpusQA 1M (ACC) | 15.5 | 59.3 | 60.5 | 35.6 | 56.5 | 62.0 |
| **Agentic** | | | | | | |
| Terminal Bench 2.0 (Acc) | 49.1 | 56.6 | 56.9 | 59.1 | 63.3 | 67.9 |
| SWE Verified (Resolved) | 73.7 | 78.6 | 79.0 | 73.6 | 79.4 | 80.6 |
| SWE Pro (Resolved) | 49.1 | 52.3 | 52.6 | 52.1 | 54.4 | 55.4 |
| SWE Multilingual (Resolved) | 69.7 | 70.2 | 73.3 | 69.8 | 74.1 | 76.2 |
| BrowseComp (Pass@1) | - | 53.5 | 73.2 | - | 80.4 | 83.4 |
| HLE w/ tools (Pass@1) | - | 40.3 | 45.1 | - | 44.7 | 48.2 |
| MCPAtlas (Pass@1) | 64.0 | 67.4 | 69.0 | 69.4 | 74.2 | 73.6 |
| GDPval-AA (Elo) | - | - | 1395 | - | - | 1554 |
| Toolathlon (Pass@1) | 40.7 | 43.5 | 47.8 | 46.3 | 49.0 | 51.8 |

Chat Template

This release does not include a Jinja-format chat template. Instead, we provide a dedicated encoding folder with Python scripts and test cases demonstrating how to encode messages in OpenAI-compatible format into input strings for the model, and how to parse the model’s text output.

A brief example:

```python
import transformers

from encoding_dsv4 import encode_messages, parse_message_from_completion_text

messages = [
    {"role": "user", "content": "hello"},
    {"role": "assistant", "content": "Hello! I am DeepSeek.", "reasoning_content": "thinking..."},
    {"role": "user", "content": "1+1=?"},
]

# messages -> string
prompt = encode_messages(messages, thinking_mode="thinking")

# string -> tokens
tokenizer = transformers.AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-V4-Pro")
tokens = tokenizer.encode(prompt)
```
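For parsing the model's raw output, the encoding folder provides `parse_message_from_completion_text`, whose exact behavior is defined by those scripts. Purely as an illustration of the delimiter structure from the response-format table above, a minimal stand-in might look like:

```python
# Minimal stand-in for output parsing, based only on the response formats
# listed above ("</think> summary" and "<think> thinking </think> summary").
# Prefer the real parse_message_from_completion_text from the encoding
# folder; this sketch just illustrates the delimiter structure.

def split_thinking(completion: str) -> tuple[str, str]:
    """Return (reasoning_content, content) from a raw completion string."""
    head, sep, tail = completion.partition("</think>")
    if not sep:  # no closing tag at all: treat everything as the summary
        return "", completion.strip()
    reasoning = head.removeprefix("<think>").strip()
    return reasoning, tail.strip()

print(split_thinking("<think> thinking... </think> The answer is 2."))
```

Note that in Non-think mode the completion begins directly with `</think>`, so the reasoning part comes back empty.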

How to Run Locally

For local deployment, we recommend setting the sampling parameters to temperature = 1.0 and top_p = 1.0. For the Think Max reasoning mode, we recommend setting the context window to at least 384K tokens.
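These recommendations can be captured in a small helper. The key names below are generic placeholders to be mapped onto whatever inference framework you use, not a specific API:

```python
# The recommended settings above, packaged as a plain dict. The key names
# are generic placeholders; map them onto your inference framework's
# actual parameter names.

def recommended_settings(reasoning_mode: str) -> dict:
    """Sampling settings for a given reasoning mode ("non_think",
    "think_high", or "think_max" -- mode names here are illustrative)."""
    settings = {"temperature": 1.0, "top_p": 1.0}
    if reasoning_mode == "think_max":
        # Think Max needs room for long reasoning traces:
        settings["context_window"] = 384_000  # at least 384K tokens
    return settings

print(recommended_settings("think_max"))
```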

License

This repository and the model weights are licensed under the MIT License.

Citation

```bibtex
@misc{deepseekai2026deepseekv4,
      title={DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence},
      author={DeepSeek-AI},
      year={2026},
}
```

Contact

If you have any questions, please raise an issue or contact us at service@deepseek.com.