deepseek-ai/DeepSeek-R1

📊 Model Parameters

Total Parameters 671,026,404,352
Context Length 163,840
Hidden Size 7168
Layers 61
Attention Heads 128
KV Heads 128
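
These headline figures can be read straight from the model's Hugging Face config. A minimal sketch, assuming the `transformers` library and access to the `deepseek-ai/DeepSeek-R1` repo (which ships custom model code, hence `trust_remote_code`):

```python
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("deepseek-ai/DeepSeek-R1", trust_remote_code=True)

print(cfg.hidden_size)              # 7168
print(cfg.num_hidden_layers)        # 61
print(cfg.num_attention_heads)      # 128
print(cfg.num_key_value_heads)      # 128
print(cfg.max_position_embeddings)  # 163840
print(cfg.vocab_size)               # 129280
```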

💾 Memory Requirements

FP32 (Full) 2499.77 GiB
FP16 (Half) 1249.88 GiB
INT8 (Quantized) 624.94 GiB
INT4 (Quantized) 312.47 GiB
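
These rows are simply total parameters × bytes per parameter, reported in binary gigabytes (GiB = 1024³ bytes), and cover weights only: no activations, optimizer state, or KV cache. A quick sketch that reproduces them:

```python
TOTAL_PARAMS = 671_026_404_352
GIB = 1024 ** 3

for name, bytes_per_param in [("FP32", 4), ("FP16", 2), ("INT8", 1), ("INT4", 0.5)]:
    print(f"{name}: {TOTAL_PARAMS * bytes_per_param / GIB:.2f} GiB")
# FP32: 2499.77  FP16: 1249.88  INT8: 624.94  INT4: 312.47
```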

🔑 KV Cache (Inference)

Per Token (FP16) 2.00 MB
Max Context FP32 610.00 GiB
Max Context FP16 305.00 GiB
Max Context INT8 152.50 GiB
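
These figures assume a conventional full K/V cache: 61 layers × 128 heads × a 64-dim head, two tensors (K and V) per layer, 2 bytes per value in FP16; the per-token row is in decimal megabytes, the totals in binary GiB. Note that DeepSeek-R1's Multi-Head Latent Attention actually caches a compressed latent instead (see the MLA sketch further down), so real deployments need far less. Reproducing the table's arithmetic:

```python
LAYERS, HEADS, HEAD_DIM, MAX_CTX = 61, 128, 64, 163_840
GIB = 1024 ** 3

# K and V at 2 bytes each (FP16), for every head in every layer.
per_token = LAYERS * HEADS * HEAD_DIM * 2 * 2  # 1,998,848 bytes
print(f"per token (FP16): {per_token / 1e6:.2f} MB")                  # 2.00 MB
print(f"max context FP16: {per_token * MAX_CTX / GIB:.2f} GiB")       # 305.00
print(f"max context FP32: {2 * per_token * MAX_CTX / GIB:.2f} GiB")   # 610.00
print(f"max context INT8: {per_token // 2 * MAX_CTX / GIB:.2f} GiB")  # 152.50
```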

⚙️ Model Configuration

Core Architecture

Vocabulary Size 129,280
Hidden Size 7,168
FFN Intermediate Size 18,432
Number of Layers 61
Attention Heads 128
Head Dimension 64
KV Heads 128

Context & Position

Max Context Length 163,840

Attention Configuration

Attention Bias No
Attention Dropout 0%
Tied Embeddings No

Multi-Head Latent Attention

KV LoRA Rank 512
Query LoRA Rank 1,536
QK RoPE Head Dimension 64
Value Head Dimension 128
QK Non-RoPE Head Dimension 128
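
These ranks are what make MLA cheap at inference time: rather than full per-head K/V, each layer caches one 512-dim compressed KV latent plus a 64-dim decoupled RoPE key. A back-of-the-envelope sketch of the per-token cache under that scheme, assuming FP16 entries:

```python
KV_LORA_RANK, QK_ROPE_DIM, LAYERS = 512, 64, 61
BYTES_PER_VALUE = 2  # FP16

# MLA stores the compressed KV latent plus the shared RoPE key per layer.
mla_per_token = LAYERS * (KV_LORA_RANK + QK_ROPE_DIM) * BYTES_PER_VALUE  # 70,272 B
naive_per_token = LAYERS * 128 * 64 * 2 * BYTES_PER_VALUE                # 1,998,848 B
print(f"MLA cache: {mla_per_token / 1024:.1f} KiB/token "
      f"({naive_per_token / mla_per_token:.1f}x smaller than the naive figure above)")
```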

Mixture of Experts

Expert FFN Size 2,048
Shared Experts 1
Number of Experts 256
Routing Scale Factor 2.5
Expert Groups 8
Groups per Token 4
Experts per Token 8
Dense Initial Layers 3
Normalize TopK Probabilities Yes
MoE Layer Frequency 1
Router Scoring Function sigmoid
TopK Method noaux_tc
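
Taken together, these settings describe sigmoid-scored, group-limited top-k routing: each token scores all 256 routed experts, keeps the best 4 of 8 groups, picks the top 8 experts within those groups, renormalizes their weights, and scales the result by 2.5; the single shared expert always runs. A hedged sketch of the selection logic in PyTorch, omitting the learned per-expert bias that the actual `noaux_tc` method adds for auxiliary-loss-free load balancing:

```python
import torch

N_EXPERTS, N_GROUPS, TOPK_GROUPS, TOPK = 256, 8, 4, 8
ROUTED_SCALE = 2.5

def route(logits: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """Map router logits (tokens, 256) to (weights, expert indices), both (tokens, 8)."""
    scores = torch.sigmoid(logits)                                # Router Scoring Function: sigmoid
    # Score each group of 32 experts by the sum of its top-2 expert scores.
    group_scores = scores.view(-1, N_GROUPS, N_EXPERTS // N_GROUPS)
    group_scores = group_scores.topk(2, dim=-1).values.sum(dim=-1)
    top_groups = group_scores.topk(TOPK_GROUPS, dim=-1).indices   # best 4 of 8 groups
    # Zero out every expert that lives in an unselected group.
    group_mask = torch.zeros_like(group_scores).scatter_(1, top_groups, 1.0)
    expert_mask = group_mask.repeat_interleave(N_EXPERTS // N_GROUPS, dim=1)
    weights, experts = (scores * expert_mask).topk(TOPK, dim=-1)  # Experts per Token: 8
    weights = weights / weights.sum(dim=-1, keepdim=True)         # Normalize TopK Probabilities
    return weights * ROUTED_SCALE, experts                        # Routing Scale Factor: 2.5
```

Applied to a batch of router logits, this yields the 8 expert indices and their scaled mixing weights per token; the 3 dense initial layers and the shared expert bypass this routing entirely.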

Speculative Decoding

Next-N Prediction Layers 1

Activation & Normalization

Activation Function silu
RMSNorm Epsilon 1e-06
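
For reference, RMSNorm normalizes by the root mean square alone (no mean subtraction, no bias term), and SiLU is x · sigmoid(x). A minimal sketch with this model's epsilon and hidden size:

```python
import torch

class RMSNorm(torch.nn.Module):
    def __init__(self, dim: int = 7168, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = torch.nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Scale by the reciprocal root mean square of the last dimension.
        inv_rms = torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
        return x * inv_rms * self.weight
```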

Special Tokens

Pad Token ID Not set
BOS Token ID 0
EOS Token ID 1

Data Type

Model Dtype bfloat16