deepseek-ai/DeepSeek-V2

📊 Model Parameters

Total Parameters 235,741,434,880
Context Length 163,840
Hidden Size 5120
Layers 60
Attention Heads 128
KV Heads 128
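
One way to reproduce the total above without downloading the weights is to instantiate the model skeleton on the meta device and sum tensor sizes. A minimal sketch, assuming the Hub id deepseek-ai/DeepSeek-V2 and that transformers and accelerate are installed (the repo's custom modeling code must be trusted):

```python
from transformers import AutoConfig, AutoModelForCausalLM
from accelerate import init_empty_weights

# Load only the config (a few KB), not the 236B weights.
config = AutoConfig.from_pretrained("deepseek-ai/DeepSeek-V2", trust_remote_code=True)

# Build the module tree on the meta device so no real memory is allocated.
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config, trust_remote_code=True)

total = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {total:,}")  # expected to match the figure above
```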

💾 Memory Requirements

FP32 (Full) 878.21 GB
FP16 (Half) 439.10 GB
INT8 (Quantized) 219.55 GB
INT4 (Quantized) 109.78 GB
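
These figures are simply the parameter count times bytes per parameter, reported in binary gibibytes (the "GB" labels) and excluding activations, KV cache, and framework overhead. A sketch of the arithmetic:

```python
TOTAL_PARAMS = 235_741_434_880

BYTES_PER_PARAM = {"FP32": 4.0, "FP16": 2.0, "INT8": 1.0, "INT4": 0.5}

for dtype, nbytes in BYTES_PER_PARAM.items():
    gib = TOTAL_PARAMS * nbytes / 1024**3
    print(f"{dtype}: {gib:.2f} GB")
# FP32: 878.21 GB, FP16: 439.10 GB, INT8: 219.55 GB, INT4: 109.78 GB
```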

🔑 KV Cache (Inference)

Per Token (FP16) 1.23 MB
Max Context FP32 375.00 GB
Max Context FP16 187.50 GB
Max Context INT8 93.75 GB
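
The per-token figure appears to use the standard dense-attention estimate of 2 (K and V) × layers × hidden size × bytes per value; DeepSeek-V2's MLA actually caches a much smaller compressed latent (see the Multi-Head Latent Attention section below). A sketch of the arithmetic behind this table, under that assumption:

```python
NUM_LAYERS = 60
HIDDEN_SIZE = 5_120
CONTEXT_LENGTH = 163_840

def kv_cache_per_token(bytes_per_value: int) -> int:
    # Naive estimate: one K and one V vector of hidden_size per layer.
    return 2 * NUM_LAYERS * HIDDEN_SIZE * bytes_per_value

per_token_fp16 = kv_cache_per_token(2)           # 1,228,800 bytes ≈ 1.23 MB
max_ctx_fp16 = per_token_fp16 * CONTEXT_LENGTH   # ≈ 187.50 GiB at full context
print(per_token_fp16 / 1e6, max_ctx_fp16 / 1024**3)
```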

⚙️ Model Configuration

Core Architecture

Vocabulary Size 102,400
Hidden Size 5,120
FFN Intermediate Size 12,288
Number of Layers 60
Attention Heads 128
KV Heads 128
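
Each value in this section corresponds to a field of the model's Hugging Face configuration. A small sketch of reading them back, assuming transformers is installed and the repo's custom code is trusted:

```python
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("deepseek-ai/DeepSeek-V2", trust_remote_code=True)

print(cfg.vocab_size)           # 102400
print(cfg.hidden_size)          # 5120
print(cfg.intermediate_size)    # 12288
print(cfg.num_hidden_layers)    # 60
print(cfg.num_attention_heads)  # 128
print(cfg.num_key_value_heads)  # 128
```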

Context & Position

Max Context Length 163,840
RoPE Base Frequency 10,000
RoPE Scaling {...} (7 fields)
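
The RoPE base frequency sets the per-dimension rotation rates for the 64 rotary dimensions of each query/key head; the scaling dict (elided above) then stretches them for long context. A minimal sketch of the unscaled rates, assuming the standard RoPE formulation:

```python
ROPE_BASE = 10_000.0
QK_ROPE_HEAD_DIM = 64

# Standard RoPE: dimension pair i rotates at rate theta_i = base^(-2i / d),
# so position p contributes an angle of p * theta_i to that pair.
inv_freq = [ROPE_BASE ** (-2 * i / QK_ROPE_HEAD_DIM) for i in range(QK_ROPE_HEAD_DIM // 2)]

print(inv_freq[0], inv_freq[-1])  # fastest (1.0) and slowest rotation rates
```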

Attention Configuration

Attention Bias No
Attention Dropout 0%
Tied Embeddings No

Multi-Head Latent Attention

KV LoRA Rank 512
Query LoRA Rank 1,536
QK RoPE Head Dimension 64
Value Head Dimension 128
QK Non-RoPE Head Dimension 128
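
These ranks determine what MLA has to cache at inference time: per layer, one compressed KV latent of size kv_lora_rank plus one shared RoPE key of size qk_rope_head_dim, instead of full per-head K and V vectors. A rough sketch of that saving, assuming the standard MLA caching scheme:

```python
NUM_LAYERS = 60
NUM_HEADS = 128
KV_LORA_RANK = 512
QK_ROPE_HEAD_DIM = 64
QK_NOPE_HEAD_DIM = 128
V_HEAD_DIM = 128
BYTES_FP16 = 2

# MHA-style cache: full K (non-RoPE + RoPE parts) and V vectors for every head.
mha_bytes = NUM_LAYERS * NUM_HEADS * (QK_NOPE_HEAD_DIM + QK_ROPE_HEAD_DIM + V_HEAD_DIM) * BYTES_FP16

# MLA cache: one compressed latent plus one shared RoPE key per layer.
mla_bytes = NUM_LAYERS * (KV_LORA_RANK + QK_ROPE_HEAD_DIM) * BYTES_FP16

print(mha_bytes, mla_bytes, mha_bytes / mla_bytes)  # bytes per token; MLA is far smaller
```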

Mixture of Experts

Expert FFN Size 1,536
Shared Experts 2
Number of Experts 160
Routing Scale Factor 16.0
TopK Method group_limited_greedy
Expert Groups 8
Groups per Token 3
Experts per Token 6
MoE Layer Frequency 1
Dense Initial Layers 1
Normalize TopK Probabilities No
Router Scoring Function softmax
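
Taken together, these fields describe the router: 160 routed experts in 8 groups of 20, softmax scoring, a group-limited greedy top-k that first keeps the 3 best groups and then picks 6 experts inside them, gate weights left unnormalized and multiplied by the 16.0 scaling factor, plus 2 always-active shared experts. A hedged PyTorch-style sketch of that routing logic (names are illustrative, not the repo's actual modules):

```python
import torch

N_EXPERTS, N_GROUPS = 160, 8          # 20 experts per group
TOPK_GROUPS, TOPK_EXPERTS = 3, 6
ROUTED_SCALING_FACTOR = 16.0

def route(router_logits: torch.Tensor):
    """router_logits: (tokens, N_EXPERTS) -> indices and weights of selected experts."""
    scores = router_logits.softmax(dim=-1)                        # scoring function: softmax

    # Group-limited greedy: rank groups by their best expert score, keep the top 3 groups.
    group_scores = scores.view(-1, N_GROUPS, N_EXPERTS // N_GROUPS).amax(dim=-1)
    top_groups = group_scores.topk(TOPK_GROUPS, dim=-1).indices
    group_mask = torch.zeros_like(group_scores).scatter_(1, top_groups, 1.0)
    expert_mask = group_mask.repeat_interleave(N_EXPERTS // N_GROUPS, dim=-1)

    # Pick the 6 best experts inside the surviving groups.
    weights, indices = (scores * expert_mask).topk(TOPK_EXPERTS, dim=-1)

    # Normalize TopK Probabilities = No: keep raw softmax scores, apply the routing scale.
    return indices, weights * ROUTED_SCALING_FACTOR
```

The 2 shared experts run on every token in addition to the 6 routed ones, so each token activates 8 expert FFNs of intermediate size 1,536.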

Activation & Normalization

Activation Function silu
RMSNorm Epsilon 1e-06
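
The normalization is RMSNorm with ε = 1e-06, and the feed-forward blocks use a SiLU-gated (SwiGLU-style) projection. A minimal sketch of both, under the standard definitions:

```python
import torch

def rms_norm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-06) -> torch.Tensor:
    # RMSNorm: rescale by the reciprocal root-mean-square; no mean subtraction, no bias.
    variance = x.pow(2).mean(dim=-1, keepdim=True)
    return x * torch.rsqrt(variance + eps) * weight

def silu_gated_ffn(x, w_gate, w_up, w_down):
    # SiLU-gated FFN: silu(x W_gate) elementwise-times (x W_up), projected back down.
    return (torch.nn.functional.silu(x @ w_gate) * (x @ w_up)) @ w_down
```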

Special Tokens

BOS Token ID 100,000
Pad Token ID Not set
EOS Token ID 100,001
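
The IDs can be checked against the tokenizer directly. A small sketch, assuming the deepseek-ai/DeepSeek-V2 tokenizer on the Hugging Face Hub:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-V2", trust_remote_code=True)
print(tok.bos_token_id)  # 100000
print(tok.eos_token_id)  # 100001
print(tok.pad_token_id)  # None (pad token not set)
```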

Data Type

Model Dtype bfloat16
Layer Types Attention, MLP/FFN, Normalization, Embedding