DeepSeek models
Choose the right model for the job. The current lineup is production-ready today, while V4 is reserved as the next-generation multimodal release.
The lineup is designed around clear job-to-model matches: V3.1 for general chat and long-context workflows, R1 for deliberate reasoning, Math-7B for cost-efficient numeric accuracy, Janus-Pro-7B for multimodal generation, and VL2 for OCR and document understanding. All models share the same unified API surface, so you can switch with a single parameter change.
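The one-parameter switch described above can be sketched as an OpenAI-style chat request where only the `model` field changes. The identifiers `deepseek-chat` and `deepseek-reasoner` below are illustrative assumptions; confirm them against the official model list before calling the API.

```python
# Sketch: swapping models by changing a single request field.
# Model identifiers below are illustrative; confirm them against the
# official model list before use.

def build_chat_request(model: str, prompt: str, **options) -> dict:
    """Assemble an OpenAI-style chat-completions payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        **options,
    }

general = build_chat_request("deepseek-chat", "Summarize this contract.")
reasoner = build_chat_request("deepseek-reasoner", "Summarize this contract.")

# The two payloads differ only in the "model" field.
changed = {key for key in general if general[key] != reasoner[key]}
```

Because the rest of the payload is untouched, an evaluation harness can sweep models by iterating over identifiers alone.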
Series overview at a glance:
- V1 dense stack: 7B and 67B, 4K context, ~2T tokens.
- V2 MoE + latent attention: 236B total, ~21B active, 128K context, ~8T tokens.
- V2-Lite trims to ~16B total and ~2.4B active for smaller clusters.
- V3 MoE: 671B total, ~37B active, 256 experts, ~14T tokens.
- R1 keeps the V3 backbone and adds reinforcement learning for reasoning.
- Tracks cover base/chat, coder, math/prover, and vision (VL2/Janus).
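A quick sanity check on the MoE figures above: the fraction of parameters active per token falls sharply from V2 to V3, which is why active count matters more than total. The numbers are the reported series figures, not guarantees.

```python
# Reported series figures from the overview above (billions of parameters).
SERIES = {
    "V2":      {"total": 236.0, "active": 21.0},
    "V2-Lite": {"total": 16.0,  "active": 2.4},
    "V3":      {"total": 671.0, "active": 37.0},
}

def active_fraction(name: str) -> float:
    """Fraction of total parameters activated per token."""
    entry = SERIES[name]
    return entry["active"] / entry["total"]

for name in SERIES:
    print(f"{name}: ~{active_fraction(name):.1%} of parameters active per token")
```

V3 activates roughly 5.5% of its parameters per token versus V2's roughly 8.9%, so total parameter count alone overstates the per-token compute gap.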
The lists below summarize the series-level data from the research report. These figures are reported values and can change as official documentation updates, so treat them as directional guidance rather than fixed guarantees.
Model tracks:
- Coder V1/V2 for code generation and debugging tasks.
- R1 for reasoning honed with reinforcement learning, including math logic.
- Math and Prover models for formal reasoning workflows.
- VL2 and Janus for vision, OCR, and multimodal generation.
- Lite and 16B MoE for constrained deployments.
Practical guidance:
- Use the smallest model that clears your quality bar.
- MoE active parameters matter more than total count.
- Long context requires memory and latency planning.
- Keep prompts and eval sets consistent across models.
- Swap models via one API parameter when needed.
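On the long-context planning point above, a rough KV-cache estimate helps size memory before committing to 128K windows. This sketch assumes plain multi-head or grouped-query attention with hypothetical shape numbers; MLA-style latent attention (used from V2 onward) compresses the cache well below this bound, so treat the result as an upper limit.

```python
def kv_cache_gib(layers: int, kv_heads: int, head_dim: int,
                 context_len: int, bytes_per_elem: int = 2,
                 batch: int = 1) -> float:
    """Upper-bound KV-cache size in GiB for standard attention:
    one K and one V vector per layer, per KV head, per position."""
    elems = 2 * layers * kv_heads * head_dim * context_len * batch
    return elems * bytes_per_elem / 1024 ** 3

# Hypothetical 60-layer model, 8 KV heads of dim 128, 128K context, fp16:
cache = kv_cache_gib(layers=60, kv_heads=8, head_dim=128,
                     context_len=128 * 1024)
# → 30.0 GiB per sequence before any KV compression
```

Even a modest batch at full context multiplies this figure, which is why long-context serving needs explicit memory and latency budgets.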
Training and scaling notes:
- Data scale rises from ~2T tokens (V1) to ~8T (V2) and ~14T (V3).
- MoE routing and load balancing improve expert utilization.
- FP8-friendly training is noted for later-stage efficiency.
- Long-context stability becomes a first-class optimization target.
Strong workloads:
- General reasoning tasks and long-context retrieval.
- Code generation and repo-scale comprehension.
- Math accuracy and step-by-step verification.
- Vision OCR, charts, and document understanding.
Quick picks by job:
- General chat and long documents: DeepSeek V3.1.
- Multi-step reasoning and verification: DeepSeek R1.
- Math tutoring and numeric workflows: Math-7B.
- Text-to-image and multimodal generation: Janus-Pro-7B.
- OCR, charts, and document QA: DeepSeek VL2.
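The job-to-model matches above reduce to a small lookup table. A minimal sketch follows; the workload keys are illustrative labels of this example, not API identifiers.

```python
# Workload-to-model map mirroring the quick picks above; keys are
# illustrative labels, not official API identifiers.
QUICK_PICKS = {
    "chat": "DeepSeek V3.1",
    "long_documents": "DeepSeek V3.1",
    "reasoning": "DeepSeek R1",
    "math": "Math-7B",
    "image_generation": "Janus-Pro-7B",
    "ocr": "DeepSeek VL2",
    "document_qa": "DeepSeek VL2",
}

def pick_model(workload: str) -> str:
    """Return the suggested model, failing loudly on unknown workloads."""
    if workload not in QUICK_PICKS:
        raise ValueError(
            f"unknown workload {workload!r}; known: {sorted(QUICK_PICKS)}")
    return QUICK_PICKS[workload]
```

Failing loudly on unknown workloads keeps routing decisions explicit instead of silently defaulting to the largest model.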
Public analysis points to trillion-scale capacity with sparse activation, long-context ambitions, and expanded multimodal capabilities. Exact specifications and pricing remain unconfirmed, so V4 stays in waitlist mode until official launch details are available.
The family's progression is defined by architecture shifts. V1 models used dense Transformer stacks at 7B and 67B with 4K context windows and roughly 2T training tokens. V2 introduced Mixture-of-Experts plus multi-head latent attention, enabling 236B total parameters while activating ~21B per token and stretching context to 128K.
V3 expanded to 671B total parameters with ~37B active per token and a 256-expert MoE layout (eight routed experts activated per token), trained on roughly 14T tokens. R1 built on the V3 backbone but added reinforcement learning for deeper reasoning. The coder line reused the MoE stack in both 16B-lite and 236B variants, while math and prover models specialized for formal reasoning.
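The top-k-of-256 routing described above can be sketched in a few lines. This is a toy gate (softmax over logits, keep the top eight, renormalize) that omits the shared experts, capacity limits, and load-balancing losses of a production MoE implementation.

```python
import math
import random

def route_token(gate_logits: list, k: int = 8) -> list:
    """Toy MoE router: keep the k highest-scoring experts for one token
    and renormalize their softmax weights to sum to 1."""
    top = sorted(range(len(gate_logits)),
                 key=gate_logits.__getitem__, reverse=True)[:k]
    weights = [math.exp(gate_logits[i]) for i in top]
    total = sum(weights)
    return [(i, w / total) for i, w in zip(top, weights)]

random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(256)]  # one token, 256 experts
assignment = route_token(logits)
# Exactly 8 of the 256 experts fire; their weights sum to 1.
```

Only the selected experts run their feed-forward blocks, which is why the ~37B active figure, not the 671B total, drives per-token compute.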
- General base models: V1, V2, V3 for broad language tasks.
- Chat alignment: DeepSeek-Chat variants tuned for dialogue.
- Coder line: code generation, debugging, and long repo tasks.
- Reasoning line: R1 and distills for multi-step logic.
- Math & prover: formal reasoning and theorem workflows.
- Lite MoE: smaller footprints for constrained hardware.
Use this map to decide where to pilot: pick the smallest model that matches your workload, then scale up only when evaluations demand it.
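That pilot loop can be made mechanical. A sketch, assuming a hypothetical size-ordered ladder and your own eval scores on a 0-to-1 scale:

```python
# Hypothetical escalation ladder, smallest to largest; substitute the
# models you actually piloted and your own eval metric.
LADDER = ["Math-7B", "DeepSeek V3.1", "DeepSeek R1"]

def smallest_passing(scores: dict, quality_bar: float) -> str:
    """Return the first (smallest) model whose eval score clears the bar;
    fall back to the largest model if none do."""
    for model in LADDER:
        if scores.get(model, 0.0) >= quality_bar:
            return model
    return LADDER[-1]

# With scores {"Math-7B": 0.72, "DeepSeek V3.1": 0.88} and a 0.85 bar,
# the picker escalates past Math-7B and settles on V3.1.
```

Keeping the bar and eval set fixed while only the model varies makes the escalation decision reproducible.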
Available now
Production-ready models for text, reasoning, math, and vision.
Coming soon
Next-generation multimodal releases.