Training LLMs from Scratch: Data, Compute, and Methodology
Introduction
Training Large Language Models (LLMs) from scratch is one of the most ambitious and complex AI projects today. This article walks through the entire process, from data preparation to final model evaluation, and offers practical guidance and insights drawn from the experience of training state-of-the-art models.
Overview: The Scale of Modern LLMs
Model Size Evolution
- GPT-1 (2018): 117M parameters
- GPT-2 (2019): 1.5B parameters
- GPT-3 (2020): 175B parameters
- PaLM (2022): 540B parameters
- GPT-4 (2023): estimated ~1.7T parameters (multimodal; unofficial estimate)
Resource Requirements Reality Check
Training a GPT-3-scale model requires:
- Compute: roughly 355 V100-years by one widely cited estimate (about $4.6M in compute cost)
- Data: ~300B tokens of internet-scale text
- Time: roughly 34 days on 1,024 A100 GPUs
- Energy: an estimated 1,287 MWh (comparable to the annual electricity use of about 120 homes)
Part 1: Data – The Foundation of Everything
Data Collection Strategy
1. Web Crawling at Scale
Common Crawl is the primary source of internet-scale text data:
# Typical data sources hierarchy:
Common Crawl (petabytes) →
Filtered Text (hundreds of TB) →
Deduplicated (tens of TB) →
High-quality subset (TBs)
Web crawling considerations:
- Robots.txt compliance: respect each site's crawling policies
- Rate limiting: avoid overloading servers
- Content filtering: strip spam, boilerplate, and low-quality content
- Language detection: separate languages for focused training
2. Curated Text Sources
High-quality datasets:
- Books: Project Gutenberg, OpenLibrary, publisher partnerships
- Academic papers: ArXiv, PubMed, academic publishers
- Reference materials: Wikipedia, encyclopedias, dictionaries
- News articles: with proper licensing agreements
- Code repositories: GitHub, GitLab, Bitbucket
3. Data Composition Strategy
Typical mixing ratios (based on recent research; a small sampling sketch follows this list):
- Web text: 60-70% (diverse, conversational)
- Books: 15-20% (coherent long-form text)
- Academic: 5-10% (technical knowledge)
- News: 5-8% (current events, factual)
- Reference: 3-5% (structured knowledge)
- Code: 5-10% (logical reasoning)
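As a rough illustration of how such ratios become a sampling schedule, the sketch below draws training documents from per-source iterators according to fixed weights. The weights and the `loaders` mapping are illustrative assumptions, not recommended values.

import random

# Illustrative mixing weights (assumed; must sum to 1.0)
SOURCE_WEIGHTS = {
    'web': 0.65,
    'books': 0.17,
    'academic': 0.07,
    'news': 0.05,
    'reference': 0.03,
    'code': 0.03,
}

def sample_source(rng=random):
    # Draw a source name with probability proportional to its mixing weight
    sources, weights = zip(*SOURCE_WEIGHTS.items())
    return rng.choices(sources, weights=weights, k=1)[0]

def mixed_stream(loaders, num_docs):
    # `loaders` maps source name -> iterator over documents from that source (hypothetical)
    for _ in range(num_docs):
        yield next(loaders[sample_source()])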
Data Preprocessing Pipeline
1. Quality Filtering
Content-based filters:
def quality_filter(text):
    # Length filters
    if len(text) < 100 or len(text) > 100000:
        return False
    # Language detection
    if detect_language(text) != target_language:
        return False
    # Repetition detection
    if repetition_ratio(text) > 0.3:
        return False
    # Symbol/word ratio
    if symbol_ratio(text) > 0.1:
        return False
    return True
Advanced quality metrics:
- Perplexity scoring: use a pre-trained LM to score text naturalness
- Topic diversity: ensure balanced topic coverage
- Readability scores: Flesch-Kincaid, SMOG, etc.
- Toxicity detection: using the Perspective API or similar tools
2. Deduplication Strategies
Exact deduplication:
# Hash-based exact matching
import hashlib
def get_content_hash(text):
    return hashlib.sha256(text.encode()).hexdigest()

# Near-duplicate detection using MinHash
from datasketch import MinHashLSH, MinHash

def near_duplicate_detection(documents, threshold=0.8):
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    minhashes = {}
    for doc_id, doc in enumerate(documents):
        m = MinHash(num_perm=128)
        for word in doc.split():
            m.update(word.encode('utf8'))
        lsh.insert(doc_id, m)
        minhashes[doc_id] = m
    return lsh
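Once the index is built, near-duplicates are found by querying it with each document's MinHash; any returned key other than the document itself is a candidate duplicate. A brief usage sketch against the function above (the `documents` list is assumed):

lsh = near_duplicate_detection(documents, threshold=0.8)

duplicate_ids = set()
for doc_id, doc in enumerate(documents):
    m = MinHash(num_perm=128)
    for word in doc.split():
        m.update(word.encode('utf8'))
    # Candidates whose estimated Jaccard similarity exceeds the threshold
    for candidate in lsh.query(m):
        if candidate != doc_id:
            duplicate_ids.add(max(doc_id, candidate))  # keep the earlier copy, flag the later one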
Advanced deduplication:
- Semantic deduplication: Using embedding similarity
- Cross-lingual deduplication: Detecting translations
- Temporal deduplication: Handling news article updates
3. Privacy and Safety Filtering
PII (Personally Identifiable Information) removal:
import re
def remove_pii(text):
    # Email addresses
    text = re.sub(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', '[EMAIL]', text)
    # Phone numbers
    text = re.sub(r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b', '[PHONE]', text)
    # Social Security Numbers
    text = re.sub(r'\b\d{3}-\d{2}-\d{4}\b', '[SSN]', text)
    return text
Content safety:
- Toxicity filtering: Removing hate speech, harassment
- Adult content: NSFW content detection and removal
- Violence/harmful content: Detecting potentially harmful instructions
- Bias mitigation: Addressing demographic and cultural biases
Tokenization Strategy
Subword Tokenization
BPE (Byte-Pair Encoding) advantages:
- Handles OOV words gracefully
- Language-agnostic approach
- Efficient vocabulary usage
SentencePiece implementation:
import sentencepiece as spm
# Training tokenizer
spm.SentencePieceTrainer.train(
    input='training_data.txt',
    model_prefix='tokenizer',
    vocab_size=50000,
    character_coverage=0.995,
    model_type='bpe'
)
# Using trained tokenizer
sp = spm.SentencePieceProcessor()
sp.load('tokenizer.model')
tokens = sp.encode_as_pieces("Hello world!")
Vocabulary Considerations
Optimal vocabulary size:
- Small models (< 1B params): 30K-50K tokens
- Large models (> 10B params): 50K-100K tokens
- Multilingual models: 100K-250K tokens
Special tokens design:
SPECIAL_TOKENS = {
    '<pad>': 0,   # Padding token
    '<unk>': 1,   # Unknown token
    '<bos>': 2,   # Beginning of sequence
    '<eos>': 3,   # End of sequence
    '<mask>': 4,  # Masking token
    '<sep>': 5,   # Separator token
}
Part 2: Compute Infrastructure and Scaling
Hardware Requirements
GPU Selection and Configuration
Training hardware tiers:
Tier 1: Research/Small Models
- Hardware: 8x RTX 4090 (24GB each)
- Model size: Up to 7B parameters
- Training time: Weeks to months
- Cost: ~$20K setup
Tier 2: Mid-scale Training
- Hardware: 64x A100 (80GB each)
- Model size: 20B-70B parameters
- Training time: Weeks
- Cost: ~$2M cluster
Tier 3: Large-scale Training
- Hardware: 1000+ H100 GPUs
- Model size: 100B+ parameters
- Training time: Months
- Cost: $10M+ infrastructure
Memory and Storage Requirements
Memory hierarchy optimization:
# Memory usage breakdown for 70B parameter model:
Model parameters: 70B × 2 bytes (fp16) = 140GB
Gradients: 70B × 2 bytes = 140GB
Optimizer states: 70B × 8 bytes (Adam) = 560GB
Activations: Varies by batch size and sequence length
Total: ~840GB of static state, and well over 1TB once activations are included
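The same arithmetic is easy to wrap in a small helper for arbitrary model sizes; the sketch below is a rough estimate that ignores activations, fragmentation, and framework overhead.

def estimate_static_memory_gb(n_params, dtype_bytes=2, optimizer_bytes=8):
    # dtype_bytes: bytes per parameter/gradient (2 for fp16/bf16)
    # optimizer_bytes: bytes per parameter of optimizer state (8 for fp32 Adam moments)
    params_gb = n_params * dtype_bytes / 1e9
    grads_gb = n_params * dtype_bytes / 1e9
    optim_gb = n_params * optimizer_bytes / 1e9
    return {
        'parameters_gb': params_gb,
        'gradients_gb': grads_gb,
        'optimizer_gb': optim_gb,
        'total_gb': params_gb + grads_gb + optim_gb,
    }

# 70B parameters -> roughly 140 + 140 + 560 = 840 GB of static state, before activations
print(estimate_static_memory_gb(70e9))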
Storage infrastructure:
- High-speed NVMe: For active training data
- Distributed storage: HDFS, GlusterFS for data pipeline
- Checkpointing: Regular model state saves
- Data loading: Optimized data pipeline to prevent GPU starvation
Distributed Training Strategies
1. Data Parallelism
Concept: Same model replicated across GPUs, different data batches
# PyTorch DistributedDataParallel example
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_distributed():
    dist.init_process_group("nccl")
    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
    torch.cuda.set_device(local_rank)
    model = LLMModel().cuda()
    model = DDP(model, device_ids=[local_rank])
    return model
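To give each replica its own shard of the data, the DataLoader is normally paired with a DistributedSampler. The sketch below shows the per-step pattern; `train_dataset`, `num_epochs`, `optimizer`, and `compute_loss` are placeholders, not part of any specific API.

from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

sampler = DistributedSampler(train_dataset, shuffle=True)
loader = DataLoader(train_dataset, batch_size=8, sampler=sampler, pin_memory=True)

for epoch in range(num_epochs):
    sampler.set_epoch(epoch)  # re-shuffle the per-rank shards each epoch
    for batch in loader:
        loss = compute_loss(model, batch)  # placeholder: next-token loss on this rank's shard
        loss.backward()                    # DDP all-reduces gradients during backward
        optimizer.step()
        optimizer.zero_grad()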
Gradient synchronization:
- AllReduce: Synchronize gradients across all GPUs
- Gradient compression: Reduce communication overhead
- Asynchronous updates: For very large clusters
2. Model Parallelism
Pipeline parallelism:
# Different layers on different GPUs
class PipelinedLLM(nn.Module):
    def __init__(self):
        super().__init__()
        # Illustrative: embedding, transformer stacks, and output head on separate GPUs
        # (vocab_size, d_model, and TransformerLayers are assumed to be defined elsewhere)
        self.embedding = nn.Embedding(vocab_size, d_model).to('cuda:0')
        self.layers_1_6 = TransformerLayers(n_layers=6).to('cuda:1')
        self.layers_7_12 = TransformerLayers(n_layers=6).to('cuda:2')
        self.output = nn.Linear(d_model, vocab_size).to('cuda:3')

    def forward(self, x):
        x = self.embedding(x)      # GPU 0
        x = x.to('cuda:1')
        x = self.layers_1_6(x)     # GPU 1
        x = x.to('cuda:2')
        x = self.layers_7_12(x)    # GPU 2
        x = x.to('cuda:3')
        return self.output(x)      # GPU 3
Tensor parallelism:
- Megatron-style: split attention heads and FFN weight matrices across GPUs
- Trade-off: lower memory per GPU, at the cost of extra intra-layer communication (all-reduce)
- Complexity: requires careful implementation (a minimal sketch follows this list)
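For intuition, the Megatron-style split can be illustrated with a column-parallel linear layer: each GPU owns a slice of the output dimension, computes its part of the projection, and the slices are gathered afterwards. The following is a minimal single-process sketch of the idea, not a production implementation (real systems use process groups and fused communication):

import torch
import torch.nn as nn

class ColumnParallelLinear(nn.Module):
    # Illustrative: output features split into `world_size` shards
    def __init__(self, in_features, out_features, world_size):
        super().__init__()
        assert out_features % world_size == 0
        shard = out_features // world_size
        # In a real setup each rank owns exactly one shard; here all shards live in one process
        self.shards = nn.ModuleList(
            nn.Linear(in_features, shard, bias=False) for _ in range(world_size)
        )

    def forward(self, x):
        # Each "rank" computes its slice; the all-gather is simulated with torch.cat
        return torch.cat([linear(x) for linear in self.shards], dim=-1)

# Four shards of 1024 output features each -> y has shape (2, 4096)
layer = ColumnParallelLinear(in_features=1024, out_features=4096, world_size=4)
y = layer(torch.randn(2, 1024))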
3. Hybrid Approaches
3D Parallelism (Data + Pipeline + Tensor):
# Example configuration for 1024 GPUs:
Data parallel: 8 groups
Pipeline parallel: 16 stages
Tensor parallel: 8 GPUs per layer
Total: 8 × 16 × 8 = 1024 GPUs
Training Infrastructure Management
Fault Tolerance
Checkpointing strategy:
def save_checkpoint(model, optimizer, step, loss):
    checkpoint = {
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'step': step,
        'loss': loss,
        'rng_state': torch.get_rng_state()
    }
    torch.save(checkpoint, f'checkpoint_step_{step}.pt')
Automatic restart mechanisms:
- Node failure detection: Monitor GPU health
- Dynamic node replacement: Swap failed nodes
- Gradient state recovery: Resume from last checkpoint
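Complementing save_checkpoint above, resuming is essentially the inverse operation. A minimal sketch that loads the most recent checkpoint and restores model, optimizer, and RNG state (the glob pattern mirrors the file naming used above):

import glob
import torch

def load_latest_checkpoint(model, optimizer):
    checkpoints = sorted(
        glob.glob('checkpoint_step_*.pt'),
        key=lambda p: int(p.split('_')[-1].split('.')[0])
    )
    if not checkpoints:
        return 0  # nothing to resume from; start at step 0
    state = torch.load(checkpoints[-1], map_location='cpu')
    model.load_state_dict(state['model_state_dict'])
    optimizer.load_state_dict(state['optimizer_state_dict'])
    torch.set_rng_state(state['rng_state'])
    return state['step'] + 1  # resume at the next step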
Monitoring and Observability
Key metrics to track:
# Training metrics
loss_curves = {
    'train_loss': [],
    'validation_loss': [],
    'perplexity': [],
    'gradient_norm': [],
    'learning_rate': []
}

# System metrics
system_metrics = {
    'gpu_utilization': [],
    'memory_usage': [],
    'disk_io': [],
    'network_bandwidth': [],
    'temperature': []
}
Part 3: Training Methodology
Architecture Design Decisions
Model Architecture Components
Core Transformer modifications:
class LLMTransformerBlock(nn.Module):
    def __init__(self, d_model, n_heads, d_ff, dropout):
        super().__init__()
        # Pre-norm vs Post-norm
        self.ln1 = nn.LayerNorm(d_model)  # Pre-norm for stability
        self.attention = MultiHeadAttention(d_model, n_heads)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = FeedForward(d_model, d_ff)
        # Residual dropout
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Pre-norm architecture
        residual = x
        x = self.ln1(x)
        x = self.attention(x)
        x = self.dropout(x) + residual

        residual = x
        x = self.ln2(x)
        x = self.ffn(x)
        x = self.dropout(x) + residual
        return x
Key architectural choices:
- Layer normalization: Pre-norm vs post-norm (pre-norm more stable)
- Activation functions: GELU, SwiGLU, ReLU variants
- Positional encoding: Learned vs sinusoidal vs RoPE
- Attention patterns: Full attention vs sparse variants
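Of these, rotary position embeddings (RoPE) have become a common default in recent models. Below is a compact, self-contained sketch of the "rotate-half" formulation applied to a query or key tensor; it is simplified for readability and not taken from any particular codebase.

import torch

def apply_rope(x, base=10000.0):
    # x: (batch, seq, heads, head_dim); rotates each (x1, x2) pair by a position-dependent angle
    b, s, h, d = x.shape
    half = d // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)        # per-dimension frequencies
    angles = torch.arange(s, dtype=torch.float32)[:, None] * freqs[None, :]  # (seq, half)
    cos = angles.cos()[None, :, None, :]  # broadcast over batch and heads
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)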
Scaling Laws and Model Sizing
Chinchilla scaling laws:
- Optimal ratio: ~20 tokens per parameter
- 70B model: Needs ~1.4T tokens for optimal training
- Compute allocation: Balance between model size and data
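These rules of thumb translate directly into a back-of-the-envelope calculator, using the common approximation that training costs about 6 FLOPs per parameter per token; a small sketch:

def chinchilla_budget(n_params):
    # ~20 tokens per parameter, ~6 * N * D training FLOPs
    tokens = 20 * n_params
    train_flops = 6 * n_params * tokens
    return {'optimal_tokens': tokens, 'train_flops': train_flops}

# 70B parameters -> ~1.4e12 tokens and ~5.9e23 training FLOPs
print(chinchilla_budget(70e9))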
Parameter allocation:
def calculate_parameters(vocab_size, d_model, n_layers, n_heads):
    # Embedding parameters
    embedding_params = vocab_size * d_model
    # Transformer block parameters
    attention_params = n_layers * (4 * d_model * d_model)  # Q, K, V, O projections
    ffn_params = n_layers * (8 * d_model * d_model)        # Assuming 4x expansion
    # Layer norm parameters
    ln_params = n_layers * 2 * d_model                     # Two layer norms per block
    total = embedding_params + attention_params + ffn_params + ln_params
    return total
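For example, plugging in a rough GPT-2-small-like configuration (the exact values are illustrative):

n_params = calculate_parameters(
    vocab_size=50_000,
    d_model=768,
    n_layers=12,
    n_heads=12,
)
print(f"{n_params / 1e6:.0f}M parameters")  # roughly 123M with these settings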
Training Hyperparameters
Learning Rate Scheduling
Warmup + Cosine Decay:
import math

def get_lr_schedule(step, warmup_steps, max_steps, max_lr, min_lr):
    if step < warmup_steps:
        # Linear warmup
        return max_lr * step / warmup_steps
    else:
        # Cosine decay
        progress = (step - warmup_steps) / (max_steps - warmup_steps)
        return min_lr + (max_lr - min_lr) * 0.5 * (1 + math.cos(math.pi * progress))
Typical hyperparameters:
- Learning rate: 1e-4 to 3e-4 (large models need smaller LR)
- Warmup steps: 2000-5000 steps
- Beta1, Beta2: 0.9, 0.95 (Adam optimizer)
- Weight decay: 0.1
- Gradient clipping: 1.0
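Put together, a typical optimizer setup looks like the sketch below, which mirrors the values in the list above; `model` and `total_steps` are assumed to be defined elsewhere.

import torch

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,             # peak learning rate
    betas=(0.9, 0.95),
    weight_decay=0.1,
)

# Wrap the warmup + cosine schedule defined above as a multiplier on the peak LR
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lr_lambda=lambda step: get_lr_schedule(step, 2000, total_steps, 3e-4, 3e-5) / 3e-4,
)

# Each step, after loss.backward() and before optimizer.step():
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)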
Batch Size and Context Length
Global batch size scaling:
# Rule of thumb: larger models need larger batch sizes
model_size_to_batch_size = {
    '125M': 256,
    '350M': 512,
    '1.3B': 1024,
    '6.7B': 2048,
    '13B': 4096,
    '30B': 8192,
}
Context length considerations:
- Computational cost: O(n²) attention complexity
- Memory usage: Activations scale with sequence length
- Quality trade-off: Longer context = better coherence
Training Phases and Curriculum
Phase 1: Foundation Pre-training
Duration: 80-90% of total training
Objective: Next-token prediction on diverse text
def next_token_loss(logits, targets):
    # Shift targets for causal prediction
    shift_logits = logits[..., :-1, :].contiguous()
    shift_labels = targets[..., 1:].contiguous()
    loss = F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=-100
    )
    return loss
Training stability techniques:
- Gradient accumulation: Simulate larger batch sizes
- Mixed precision: FP16/BF16 for memory efficiency
- Gradient checkpointing: Trade compute for memory
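A minimal sketch of how gradient accumulation and bf16 mixed precision combine in one training step (activation checkpointing is sketched later); `loader`, `model`, and `optimizer` are placeholders, and an accumulation factor of 8 is an arbitrary example.

import torch

ACC_STEPS = 8  # gradient accumulation factor (illustrative)

optimizer.zero_grad()
for i, batch in enumerate(loader):
    # Mixed precision: run forward/backward in bf16
    with torch.autocast(device_type='cuda', dtype=torch.bfloat16):
        logits = model(batch['input_ids'].cuda())
        loss = next_token_loss(logits, batch['input_ids'].cuda()) / ACC_STEPS
    loss.backward()  # gradients accumulate across micro-batches
    if (i + 1) % ACC_STEPS == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        optimizer.zero_grad()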
Phase 2: Instruction Tuning (Optional)
Supervised Fine-Tuning (SFT):
# Instruction-response format
instruction_template = """Below is an instruction. Write a response.
### Instruction:
{instruction}
### Response:
{response}"""
Data requirements:
- High-quality examples: 10K-100K instruction-response pairs
- Diverse tasks: QA, summarization, creative writing, reasoning
- Safety filtering: Remove harmful or biased examples
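During SFT the loss is usually computed only on response tokens; prompt tokens are masked out with the same -100 ignore index used by cross_entropy above. A minimal sketch built around the instruction template (the tokenizer's `encode` and `eos_id` attributes are assumptions about its interface):

def build_sft_example(instruction, response, tokenizer, max_len=2048):
    prompt = instruction_template.format(instruction=instruction, response="")
    prompt_ids = tokenizer.encode(prompt)                            # assumed tokenizer interface
    response_ids = tokenizer.encode(response) + [tokenizer.eos_id]

    input_ids = (prompt_ids + response_ids)[:max_len]
    labels = ([-100] * len(prompt_ids) + response_ids)[:max_len]     # loss only on the response
    return {'input_ids': input_ids, 'labels': labels}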
Phase 3: Alignment (Advanced)
RLHF (Reinforcement Learning from Human Feedback):
- Reward model training: Learn human preferences
- PPO training: Optimize policy using reward model
- Safety filtering: Constitutional AI, red-teaming
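The reward model is commonly trained with a pairwise ranking loss: for each prompt, the human-preferred response should score above the rejected one. A minimal sketch of that objective, where `reward_model` returning one scalar per sequence is an assumption:

import torch.nn.functional as F

def reward_ranking_loss(reward_model, chosen_ids, rejected_ids):
    # Bradley-Terry-style objective: push preferred scores above rejected scores
    r_chosen = reward_model(chosen_ids)      # (batch,) scalar rewards
    r_rejected = reward_model(rejected_ids)  # (batch,)
    return -F.logsigmoid(r_chosen - r_rejected).mean()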
Optimization Techniques
Memory Optimization
ZeRO (Zero Redundancy Optimizer):
# DeepSpeed ZeRO Stage 3 example
from deepspeed import initialize

model_engine, optimizer, _, _ = initialize(
    model=model,
    config_params={
        "zero_optimization": {
            "stage": 3,  # Partition optimizer states, gradients, and parameters
            "cpu_offload": True,
            "contiguous_gradients": True,
        },
        "fp16": {"enabled": True},
        "gradient_clipping": 1.0,
    }
)
Activation checkpointing:
# Trade compute for memory
def checkpoint_forward(layer, *args):
    return torch.utils.checkpoint.checkpoint(layer, *args)
Communication Optimization
Gradient compression:
def compress_gradients(gradients, compression_ratio=0.1):
    # Top-k sparsification
    flat_grad = torch.cat([g.flatten() for g in gradients])
    k = int(len(flat_grad) * compression_ratio)
    _, indices = torch.topk(torch.abs(flat_grad), k)
    compressed = torch.zeros_like(flat_grad)
    compressed[indices] = flat_grad[indices]
    return compressed
Evaluation and Validation
Intrinsic Evaluation Metrics
Perplexity Tracking
def calculate_perplexity(model, dataloader):
    model.eval()
    total_loss = 0
    total_tokens = 0
    with torch.no_grad():
        for batch in dataloader:
            logits = model(batch['input_ids'])
            loss = F.cross_entropy(
                logits.view(-1, logits.size(-1)),
                batch['labels'].view(-1),
                ignore_index=-100
            )
            total_loss += loss.item() * batch['labels'].numel()
            total_tokens += batch['labels'].numel()
    avg_loss = total_loss / total_tokens
    perplexity = math.exp(avg_loss)
    return perplexity
Learning Curves Analysis
Key patterns to watch:
- Training vs validation gap: Indicates overfitting
- Loss plateaus: May need learning rate adjustment
- Instability spikes: Check for data quality issues or numerical instability
Downstream Task Evaluation
Standard Benchmarks
Language Understanding:
- GLUE/SuperGLUE: General language understanding
- MMLU: Massive multitask language understanding
- HellaSwag: Commonsense reasoning
Generation Quality:
- BLEU/ROUGE: For summarization and translation
- Human evaluation: Gold standard for generation quality
- Factual accuracy: Knowledge probing tasks
Custom Evaluation Suites
def evaluate_model_capabilities(model, tokenizer):
    evaluations = {}
    # Reasoning capability
    evaluations['reasoning'] = evaluate_reasoning_tasks(model, tokenizer)
    # Knowledge retention
    evaluations['knowledge'] = evaluate_knowledge_tasks(model, tokenizer)
    # Safety and bias
    evaluations['safety'] = evaluate_safety_tasks(model, tokenizer)
    # Instruction following
    evaluations['instruction'] = evaluate_instruction_tasks(model, tokenizer)
    return evaluations
Cost Analysis and Resource Planning
Training Cost Breakdown
Compute Costs
Cloud training costs (AWS/GCP/Azure):
def estimate_training_cost(
    model_params,
    tokens,
    gpu_type='A100',
    hourly_rate=3.0,              # per GPU hour
    flops_per_token_per_param=6   # Forward + backward pass
):
    # FLOPs calculation
    total_flops = model_params * tokens * flops_per_token_per_param
    # GPU specs (A100 example)
    gpu_flops = 312e12      # 312 TFLOPS for A100
    gpu_utilization = 0.5   # Realistic utilization
    # Time calculation
    effective_flops = gpu_flops * gpu_utilization
    gpu_hours = total_flops / effective_flops / 3600
    # Cost calculation
    total_cost = gpu_hours * hourly_rate
    return {
        'total_flops': total_flops,
        'gpu_hours': gpu_hours,
        'estimated_cost': total_cost
    }

# Example: 7B parameter model
cost_estimate = estimate_training_cost(
    model_params=7e9,
    tokens=1e12,
    gpu_type='A100',
    hourly_rate=3.0
)
print(f"Estimated cost: ${cost_estimate['estimated_cost']:,.2f}")
Infrastructure Costs
Additional cost factors:
- Storage: $0.02-0.1 per GB-month for training data
- Networking: Data transfer between regions
- Personnel: ML engineers, researchers, DevOps
- Power consumption: Electricity costs for on-premise
ROI and Business Considerations
When to Train vs Fine-tune
Train from scratch when:
- Need specialized domain knowledge
- Have unique data distribution
- Require custom tokenization
- Need complete control over training process
Fine-tune existing models when:
- Limited computational resources
- Standard use cases
- Need faster time-to-market
- Want to leverage existing capabilities
Common Pitfalls and Troubleshooting
Training Instabilities
Loss Spikes and Divergence
Symptoms:
- Sudden loss increases during training
- NaN values in gradients or activations
- Model performance degradation
Solutions:
# Gradient clipping
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

# Learning-rate backoff on loss spikes (illustrative heuristic; not a built-in scheduler method)
if loss > previous_loss * 1.5:
    for group in optimizer.param_groups:
        group['lr'] *= 0.5

# Mixed precision scaling
scaler = torch.cuda.amp.GradScaler()
if scaler.is_enabled():
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    scaler.step(optimizer)
    scaler.update()
Memory Issues
Common problems:
- Out of memory errors
- Memory fragmentation
- Inefficient memory usage
Solutions:
- Gradient accumulation for smaller batch sizes
- Activation checkpointing
- Model parallelism
- Mixed precision training
Data Quality Issues
Detecting Poor Quality Data
def detect_data_issues(batch):
    issues = []
    # Check for repetitive text
    if detect_repetition(batch['text']) > 0.5:
        issues.append('high_repetition')
    # Check for encoding issues
    if has_encoding_errors(batch['text']):
        issues.append('encoding_error')
    # Check for length anomalies
    if len(batch['text'].split()) < 10:
        issues.append('too_short')
    return issues
Future Trends and Considerations
Emerging Training Paradigms
1. Mixture of Experts (MoE)
Concept: Activate only a subset of parameters per token
Benefits:
- Scale model capacity without proportional compute increase
- Specialized experts for different domains
- Better parameter efficiency
class MoELayer(nn.Module):
    def __init__(self, d_model, num_experts, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList([
            ExpertMLP(d_model) for _ in range(num_experts)
        ])

    def forward(self, x):
        gate_scores = F.softmax(self.gate(x), dim=-1)
        # Route each token to its top-k experts and renormalize the gate weights
        top_k_gates, top_k_indices = torch.topk(gate_scores, k=self.top_k, dim=-1)
        top_k_gates = top_k_gates / top_k_gates.sum(dim=-1, keepdim=True)
        # Dense routing for clarity (capacity-based load balancing omitted):
        # every expert sees every token, but only selected experts contribute
        output = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            # Per-token gate weight for expert e (zero if e was not in the top-k)
            weight = (top_k_gates * (top_k_indices == e)).sum(dim=-1, keepdim=True)
            output = output + weight * expert(x)
        return output
2. Retrieval-Augmented Training
Concept: Combine parametric knowledge with external retrieval
Applications:
- Real-time knowledge updates
- Factual accuracy improvements
- Reduced hallucination
3. Multimodal Foundation Models
Integration challenges:
- Cross-modal attention mechanisms
- Unified tokenization strategies
- Balanced training objectives
Efficiency Improvements
Model Compression During Training
Techniques:
- Structured pruning: Remove entire attention heads or layers
- Knowledge distillation: Train smaller models using larger teacher
- Quantization-aware training: Train with reduced precision from start
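As an example of the second technique, knowledge distillation typically blends the usual next-token loss with a KL term between student and teacher token distributions; a minimal sketch with illustrative temperature and weighting:

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Hard loss: standard next-token cross-entropy against the labels
    hard_loss = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,
    )
    # Soft loss: KL divergence between temperature-softened distributions
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction='batchmean',
    ) * (T * T)
    return alpha * hard_loss + (1 - alpha) * soft_loss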
Green AI Initiatives
Sustainability considerations:
- Carbon footprint tracking and reduction
- Renewable energy for compute centers
- Efficient model architectures
- Training methodology improvements
Conclusion
Training LLMs from scratch is a complex undertaking that demands careful planning across three critical aspects: data, compute, and methodology. Key takeaways from this article:
Data Strategy
- Quality over quantity: Invest heavily in data curation and filtering
- Diversity matters: Balance different data sources and domains
- Privacy and safety: Implement robust filtering for PII and harmful content
- Tokenization: Choose appropriate vocabulary size and tokenization strategy
Compute Infrastructure
- Scale planning: Start with clear understanding of resource requirements
- Distributed training: Master data, model, and pipeline parallelism
- Fault tolerance: Build robust checkpointing and recovery systems
- Cost optimization: Balance performance with economic constraints
Training Methodology
- Architecture decisions: Understand trade-offs in model design
- Hyperparameter tuning: Start with proven baselines and iterate
- Training stability: Implement monitoring and automatic recovery
- Evaluation framework: Establish comprehensive evaluation from day one
Practical Recommendations
For researchers and small teams:
- Start with smaller models (1B-7B parameters) to validate methodology
- Use existing frameworks (Megatron, DeepSpeed) rather than building from scratch
- Focus on data quality and domain specialization
- Leverage cloud computing for cost-effective experimentation
For organizations considering large-scale training:
- Invest in dedicated ML infrastructure team
- Plan for 6-12 month training cycles
- Budget 2-3x initial estimates for unexpected costs
- Build comprehensive monitoring and observability from start
For the future:
- Keep up with efficiency improvements (Flash Attention, MoE, etc.)
- Consider environmental impact and sustainable practices
- Prepare for multimodal and retrieval-augmented architectures
- Invest in evaluation and safety research alongside model development
Training LLMs from scratch remains a frontier challenge in AI, but with a solid grasp of data, compute, and methodology, such projects can succeed and deliver breakthroughs in AI capability.
Disclaimer: The cost and resource estimates in this article are based on publicly available information and may vary depending on the specific implementation, hardware choices, and market conditions.