Training LLMs from Scratch: Data, Compute, and Methodology

Introduction

Training Large Language Models (LLMs) from scratch is one of the most ambitious and complex AI projects today. This article walks through the entire process, from data preparation to final model evaluation, offering practical guidance and insights drawn from training state-of-the-art models.

Overview: The Scale of Modern LLMs

Model Size Evolution

  • GPT-1 (2018): 117M parameters
  • GPT-2 (2019): 1.5B parameters
  • GPT-3 (2020): 175B parameters
  • PaLM (2022): 540B parameters
  • GPT-4: Estimated 1.7T parameters (multi-modal)

Resource Requirements Reality Check

Training a GPT-3-scale model requires roughly (see the back-of-envelope sketch after this list):

  • Compute: ~355 V100 GPU-years (around $4.6M in compute cost, per public estimates)
  • Data: ~300B tokens of internet-scale text
  • Time: ~34 days on 1,024 A100 GPUs
  • Energy: an estimated 1,287 MWh (roughly the annual electricity use of 120 homes)
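
These headline numbers can be sanity-checked with the standard rule of thumb that training costs about 6 FLOPs per parameter per token. The short sketch below applies that rule to GPT-3-scale figures; the GPU count, peak throughput, and utilization are illustrative assumptions, not measured values.

# Back-of-envelope training compute, assuming ~6 FLOPs per parameter per token
params = 175e9          # GPT-3 parameter count
tokens = 300e9          # training tokens
total_flops = 6 * params * tokens            # ≈ 3.15e23 FLOPs

peak_flops = 312e12     # A100 dense FP16/BF16 peak (illustrative)
utilization = 0.4       # assumed sustained hardware utilization
gpus = 1024

seconds = total_flops / (peak_flops * utilization * gpus)
print(f"Total compute: {total_flops:.2e} FLOPs")
print(f"Estimated wall-clock: {seconds / 86400:.0f} days on {gpus} GPUs")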

Part 1: Data – The Foundation of Everything

Data Collection Strategy

1. Web Crawling at Scale

Common Crawl is the primary source of internet-scale text data:

# Typical data sources hierarchy:
Common Crawl (petabytes) → 
Filtered Text (hundreds of TB) → 
Deduplicated (tens of TB) → 
High-quality subset (TBs)

Web crawling considerations:

  • Robots.txt compliance: Respect each site's crawling policies (see the sketch after this list)
  • Rate limiting: Avoid overloading servers
  • Content filtering: Strip spam, boilerplate, and low-quality content
  • Language detection: Separate languages for focused training
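
As a minimal sketch of the first two points, the snippet below checks robots.txt with Python's standard urllib.robotparser and applies a naive fixed delay between requests; the user agent string and delay value are illustrative assumptions.

import time
import urllib.parse
import urllib.request
import urllib.robotparser

USER_AGENT = "my-crawler"        # illustrative user agent
CRAWL_DELAY_SECONDS = 1.0        # naive fixed rate limit (assumption)

def polite_fetch(url):
    # Check robots.txt before fetching (robots.txt compliance)
    parts = urllib.parse.urlparse(url)
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()
    if not rp.can_fetch(USER_AGENT, url):
        return None

    # Crude rate limiting: sleep between requests to the same host
    time.sleep(CRAWL_DELAY_SECONDS)
    req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8", errors="replace")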

2. Curated Text Sources

High-quality datasets:

  • Books: Project Gutenberg, OpenLibrary, publisher partnerships
  • Academic papers: ArXiv, PubMed, academic publishers
  • Reference materials: Wikipedia, encyclopedias, dictionaries
  • News articles: With proper licensing agreements
  • Code repositories: GitHub, GitLab, Bitbucket

3. Data Composition Strategy

Optimal mixing ratios (based on recent research; see the sampling sketch after this list):

  • Web text: 60-70% (diverse, conversational)
  • Books: 15-20% (coherent long-form text)
  • Academic: 5-10% (technical knowledge)
  • News: 5-8% (current events, factual)
  • Reference: 3-5% (structured knowledge)
  • Code: 5-10% (logical reasoning)
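
One simple way to realize such a mixture is to pick the source of each training document according to these weights. The sketch below does this with Python's random.choices; the source names, weight values, and iterator interface are illustrative assumptions rather than any specific framework's API.

import random

# Relative mixture weights (roughly the midpoints of the ranges above; illustrative,
# and they need not sum to exactly 1 because random.choices uses relative weights)
SOURCE_WEIGHTS = {
    "web": 0.65,
    "books": 0.17,
    "academic": 0.07,
    "news": 0.06,
    "reference": 0.04,
    "code": 0.07,
}

def sample_documents(source_iterators, n_docs, seed=0):
    """Yield n_docs documents, choosing the source of each one by weight.

    source_iterators: dict mapping source name -> iterator of documents
    (assumed effectively infinite, e.g. cycling over shards).
    """
    rng = random.Random(seed)
    names = list(SOURCE_WEIGHTS)
    weights = [SOURCE_WEIGHTS[n] for n in names]
    for _ in range(n_docs):
        source = rng.choices(names, weights=weights, k=1)[0]
        yield next(source_iterators[source])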

Data Preprocessing Pipeline

1. Quality Filtering

Content-based filters:

def quality_filter(text, target_language='en'):
    # Assumes detect_language, repetition_ratio, and symbol_ratio helpers are defined elsewhere
    # Length filters
    if len(text) < 100 or len(text) > 100000:
        return False
    
    # Language detection
    if detect_language(text) != target_language:
        return False
    
    # Repetition detection
    if repetition_ratio(text) > 0.3:
        return False
    
    # Symbol/word ratio
    if symbol_ratio(text) > 0.1:
        return False
    
    return True

Advanced quality metrics:

  • Perplexity scoring: Use a pre-trained LM to score naturalness (see the sketch after this list)
  • Topic diversity: Ensure balanced topic coverage
  • Readability scores: Flesch-Kincaid, SMOG, etc.
  • Toxicity detection: Use the Perspective API or similar tools
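
A minimal sketch of perplexity-based filtering, using a small pre-trained GPT-2 via Hugging Face transformers, is shown below; the threshold is an illustrative assumption that would normally be tuned per domain.

import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

@torch.no_grad()
def perplexity(text, max_length=512):
    # Score text with a small pre-trained LM; lower perplexity ~ more "natural"
    ids = tokenizer(text, return_tensors="pt", truncation=True, max_length=max_length).input_ids
    loss = model(ids, labels=ids).loss    # mean next-token cross-entropy
    return math.exp(loss.item())

def passes_perplexity_filter(text, threshold=1000.0):   # threshold is an assumption
    return perplexity(text) < threshold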

2. Deduplication Strategies

Exact deduplication:

# Hash-based exact matching
import hashlib

def get_content_hash(text):
    return hashlib.sha256(text.encode()).hexdigest()

# Near-duplicate detection using MinHash
from datasketch import MinHashLSH, MinHash

def near_duplicate_detection(documents, threshold=0.8):
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    minhashes = {}
    
    for doc_id, doc in enumerate(documents):
        m = MinHash(num_perm=128)
        for word in doc.split():
            m.update(word.encode('utf8'))
        lsh.insert(doc_id, m)
        minhashes[doc_id] = m
    
    # Query with lsh.query(minhash) to retrieve candidate near-duplicates for a document
    return lsh

Advanced deduplication:

  • Semantic deduplication: Using embedding similarity (see the sketch after this list)
  • Cross-lingual deduplication: Detecting translations
  • Temporal deduplication: Handling news article updates
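
As a rough sketch of the semantic deduplication idea, the snippet below embeds documents with sentence-transformers and flags pairs above a cosine-similarity threshold; the model name and threshold are illustrative assumptions, and a production pipeline would use approximate nearest-neighbor search instead of the O(n²) comparison shown here.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")   # illustrative embedding model

def semantic_duplicates(documents, threshold=0.9):
    # Embed all documents and compare pairwise cosine similarity
    embeddings = model.encode(documents, convert_to_tensor=True, normalize_embeddings=True)
    similarity = util.cos_sim(embeddings, embeddings)

    duplicates = []
    for i in range(len(documents)):
        for j in range(i + 1, len(documents)):
            if similarity[i, j] >= threshold:
                duplicates.append((i, j, float(similarity[i, j])))
    return duplicates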

3. Privacy and Safety Filtering

PII (Personally Identifiable Information) removal:

import re

def remove_pii(text):
    # Email addresses
    text = re.sub(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', '[EMAIL]', text)
    
    # Phone numbers
    text = re.sub(r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b', '[PHONE]', text)
    
    # Social Security Numbers
    text = re.sub(r'\b\d{3}-\d{2}-\d{4}\b', '[SSN]', text)
    
    return text

Content safety:

  • Toxicity filtering: Removing hate speech, harassment
  • Adult content: NSFW content detection and removal
  • Violence/harmful content: Detecting potentially harmful instructions
  • Bias mitigation: Addressing demographic and cultural biases

Tokenization Strategy

Subword Tokenization

BPE (Byte-Pair Encoding) advantages:

  • Handles OOV words gracefully
  • Language-agnostic approach
  • Efficient vocabulary usage

SentencePiece implementation:

import sentencepiece as spm

# Training tokenizer
spm.SentencePieceTrainer.train(
    input='training_data.txt',
    model_prefix='tokenizer',
    vocab_size=50000,
    character_coverage=0.995,
    model_type='bpe'
)

# Using trained tokenizer
sp = spm.SentencePieceProcessor()
sp.load('tokenizer.model')
tokens = sp.encode_as_pieces("Hello world!")

Vocabulary Considerations

Optimal vocabulary size:

  • Small models (< 1B params): 30K-50K tokens
  • Large models (> 10B params): 50K-100K tokens
  • Multilingual models: 100K-250K tokens

Special tokens design:

SPECIAL_TOKENS = {
    '<pad>': 0,      # Padding token
    '<unk>': 1,      # Unknown token  
    '<bos>': 2,      # Beginning of sequence
    '<eos>': 3,      # End of sequence
    '<mask>': 4,     # Masking token
    '<sep>': 5,      # Separator token
}

Part 2: Compute Infrastructure and Scaling

Hardware Requirements

GPU Selection and Configuration

Training hardware tiers:

Tier 1: Research/Small Models

  • Hardware: 8x RTX 4090 (24GB each)
  • Model size: Up to 7B parameters
  • Training time: Weeks to months
  • Cost: ~$20K setup

Tier 2: Mid-scale Training

  • Hardware: 64x A100 (80GB each)
  • Model size: 20B-70B parameters
  • Training time: Weeks
  • Cost: ~$2M cluster

Tier 3: Large-scale Training

  • Hardware: 1000+ H100 GPUs
  • Model size: 100B+ parameters
  • Training time: Months
  • Cost: $10M+ infrastructure

Memory and Storage Requirements

Memory hierarchy optimization:

# Memory usage breakdown for 70B parameter model:
Model parameters: 70B × 2 bytes (fp16) = 140GB
Gradients: 70B × 2 bytes = 140GB  
Optimizer states: 70B × 8 bytes (Adam) = 560GB
Activations: Varies by batch size and sequence length
Total: ~1TB+ of resident training state, which must be sharded across many GPUs

Storage infrastructure:

  • High-speed NVMe: For active training data
  • Distributed storage: HDFS, GlusterFS for data pipeline
  • Checkpointing: Regular model state saves
  • Data loading: Optimized input pipeline to keep GPUs from starving (see the loader sketch below)
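
A minimal PyTorch sketch of the data-loading point: multiple worker processes, pinned memory, and prefetching so GPUs are not left waiting on I/O. The dataset class here is a toy stand-in for a real sharded corpus reader.

import torch
from torch.utils.data import DataLoader, Dataset

class TokenizedShardDataset(Dataset):
    """Toy stand-in for a dataset that reads pre-tokenized shards from disk."""
    def __init__(self, num_samples=100000, seq_len=2048, vocab_size=50000):
        self.num_samples, self.seq_len, self.vocab_size = num_samples, seq_len, vocab_size

    def __len__(self):
        return self.num_samples

    def __getitem__(self, idx):
        # In practice this would read a token span from a memory-mapped shard
        return torch.randint(0, self.vocab_size, (self.seq_len,))

loader = DataLoader(
    TokenizedShardDataset(),
    batch_size=8,
    num_workers=8,           # CPU workers load/decode while the GPU computes
    pin_memory=True,         # page-locked host memory speeds up host-to-GPU copies
    prefetch_factor=4,       # batches each worker prepares ahead of time
    persistent_workers=True,
)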

Distributed Training Strategies

1. Data Parallelism

Concept: Same model replicated across GPUs, different data batches

# PyTorch DistributedDataParallel example (launched via torchrun)
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_distributed():
    dist.init_process_group("nccl")
    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
    torch.cuda.set_device(local_rank)
    return local_rank

local_rank = setup_distributed()
model = LLMModel().to(local_rank)  # LLMModel defined elsewhere
model = DDP(model, device_ids=[local_rank])

Gradient synchronization:

  • AllReduce: Synchronize gradients across all GPUs (sketched below)
  • Gradient compression: Reduce communication overhead
  • Asynchronous updates: For very large clusters
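
For intuition, a rough sketch of what the AllReduce step amounts to is below: sum each parameter's gradient across ranks and divide by the world size. DDP performs this automatically (and overlaps it with the backward pass), so this is illustrative only.

import torch.distributed as dist

def average_gradients(model):
    # Roughly what DDP's AllReduce achieves after backward()
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size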

2. Model Parallelism

Pipeline parallelism:

# Different layers on different GPUs (naive four-GPU pipeline split)
import torch.nn as nn

class PipelinedLLM(nn.Module):
    def __init__(self, vocab_size, d_model):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model).to('cuda:0')
        self.layers_1_6 = TransformerLayers().to('cuda:1')    # defined elsewhere
        self.layers_7_12 = TransformerLayers().to('cuda:2')
        self.output = nn.Linear(d_model, vocab_size).to('cuda:3')

    def forward(self, x):
        x = self.embedding(x)        # GPU 0
        x = x.to('cuda:1')
        x = self.layers_1_6(x)       # GPU 1
        x = x.to('cuda:2')
        x = self.layers_7_12(x)      # GPU 2
        x = x.to('cuda:3')
        return self.output(x)        # GPU 3

Tensor parallelism:

  • Megatron-style: Split attention heads and FFN weight matrices across GPUs (see the sketch after this list)
  • Memory: Reduces per-GPU memory at the cost of extra all-reduce communication inside each layer
  • Complexity: Requires careful implementation
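
The core trick can be illustrated without a GPU cluster: split a weight matrix column-wise across "devices", compute the partial outputs independently, and concatenate the results. The single-process simulation below only shows that the split computation matches the unsharded one; real tensor parallelism (e.g. Megatron-LM) does this across GPUs with NCCL collectives.

import torch

torch.manual_seed(0)
d_model, d_ff, n_shards = 512, 2048, 4

x = torch.randn(8, d_model)                 # a batch of activations
W = torch.randn(d_model, d_ff)              # full FFN weight (computes x @ W)

# Column-parallel split: each "device" holds d_ff / n_shards output columns
shards = torch.chunk(W, n_shards, dim=1)
partial_outputs = [x @ w_shard for w_shard in shards]   # independent, no communication
y_parallel = torch.cat(partial_outputs, dim=1)          # an all-gather in a real setup

y_reference = x @ W
print(torch.allclose(y_parallel, y_reference, atol=1e-5))   # True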

3. Hybrid Approaches

3D Parallelism (Data + Pipeline + Tensor):

# Example configuration for 1024 GPUs:
Data parallel: 8 groups
Pipeline parallel: 16 stages  
Tensor parallel: 8 GPUs per layer
Total: 8 × 16 × 8 = 1024 GPUs

Training Infrastructure Management

Fault Tolerance

Checkpointing strategy:

def save_checkpoint(model, optimizer, step, loss):
    checkpoint = {
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'step': step,
        'loss': loss,
        'rng_state': torch.get_rng_state()
    }
    torch.save(checkpoint, f'checkpoint_step_{step}.pt')

Automatic restart mechanisms:

  • Node failure detection: Monitor GPU health
  • Dynamic node replacement: Swap failed nodes
  • Gradient state recovery: Resume from the last checkpoint (see the loading sketch below)
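
Mirroring the save_checkpoint function above, a resume routine might look like the sketch below; restoring the optimizer and RNG state alongside the weights is what makes the restart approximately seamless.

import torch

def load_checkpoint(model, optimizer, path):
    # map_location avoids loading GPU tensors onto the wrong device
    checkpoint = torch.load(path, map_location='cpu')
    model.load_state_dict(checkpoint['model_state_dict'])
    optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
    torch.set_rng_state(checkpoint['rng_state'])
    # Resume the training loop from the saved step
    return checkpoint['step'], checkpoint['loss']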

Monitoring and Observability

Key metrics to track:

# Training metrics
loss_curves = {
    'train_loss': [],
    'validation_loss': [],
    'perplexity': [],
    'gradient_norm': [],
    'learning_rate': []
}

# System metrics  
system_metrics = {
    'gpu_utilization': [],
    'memory_usage': [],
    'disk_io': [],
    'network_bandwidth': [],
    'temperature': []
}

Part 3: Training Methodology

Architecture Design Decisions

Model Architecture Components

Core Transformer modifications:

class LLMTransformerBlock(nn.Module):
    def __init__(self, d_model, n_heads, d_ff, dropout):
        super().__init__()
        
        # Pre-norm vs Post-norm
        self.ln1 = nn.LayerNorm(d_model)  # Pre-norm for stability
        self.attention = MultiHeadAttention(d_model, n_heads)
        
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = FeedForward(d_model, d_ff)
        
        # Residual dropout
        self.dropout = nn.Dropout(dropout)
    
    def forward(self, x):
        # Pre-norm architecture
        residual = x
        x = self.ln1(x)
        x = self.attention(x)
        x = self.dropout(x) + residual
        
        residual = x  
        x = self.ln2(x)
        x = self.ffn(x)
        x = self.dropout(x) + residual
        
        return x

Key architectural choices:

  • Layer normalization: Pre-norm vs post-norm (pre-norm more stable)
  • Activation functions: GELU, SwiGLU, ReLU variants
  • Positional encoding: Learned vs sinusoidal vs RoPE (a minimal RoPE sketch follows this list)
  • Attention patterns: Full attention vs sparse variants
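
Of these choices, RoPE (rotary position embeddings) is the one most often adopted in recent LLMs. Below is a minimal sketch of one common formulation, applied to query/key tensors before attention; the tensor layout (batch, seq, heads, head_dim) and the base value are assumptions matching typical implementations.

import torch

def apply_rope(x, base=10000.0):
    """Rotary position embedding on x of shape (batch, seq, heads, head_dim)."""
    batch, seq_len, n_heads, head_dim = x.shape
    half = head_dim // 2

    # Per-dimension rotation frequencies and per-position angles
    freqs = base ** (-torch.arange(0, half, dtype=torch.float32) / half)      # (half,)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs      # (seq, half)
    cos = angles.cos()[None, :, None, :]    # broadcast over batch and heads
    sin = angles.sin()[None, :, None, :]

    # Rotate each (x1, x2) pair by its position-dependent angle
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

# Applied to queries and keys before the attention scores, e.g. q, k = apply_rope(q), apply_rope(k)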

Scaling Laws and Model Sizing

Chinchilla scaling laws:

  • Optimal ratio: ~20 tokens per parameter
  • 70B model: Needs ~1.4T tokens for optimal training
  • Compute allocation: Balance between model size and data

Parameter allocation:

def calculate_parameters(vocab_size, d_model, n_layers, n_heads):
    # n_heads does not affect the total: attention heads partition d_model
    # Embedding parameters
    embedding_params = vocab_size * d_model
    
    # Transformer block parameters  
    attention_params = n_layers * (4 * d_model * d_model)  # Q,K,V,O projections
    ffn_params = n_layers * (8 * d_model * d_model)  # Assuming 4x expansion
    
    # Layer norm parameters
    ln_params = n_layers * 2 * d_model  # Two layer norms per block
    
    total = embedding_params + attention_params + ffn_params + ln_params
    return total

Training Hyperparameters

Learning Rate Scheduling

Warmup + Cosine Decay:

import math

def get_lr_schedule(step, warmup_steps, max_steps, max_lr, min_lr):
    if step < warmup_steps:
        # Linear warmup
        return max_lr * step / warmup_steps
    else:
        # Cosine decay
        progress = (step - warmup_steps) / (max_steps - warmup_steps)
        return min_lr + (max_lr - min_lr) * 0.5 * (1 + math.cos(math.pi * progress))

Typical hyperparameters (see the optimizer setup sketch after this list):

  • Learning rate: 1e-4 to 3e-4 (large models need smaller LR)
  • Warmup steps: 2000-5000 steps
  • Beta1, Beta2: 0.9, 0.95 (Adam optimizer)
  • Weight decay: 0.1
  • Gradient clipping: 1.0
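
A minimal sketch of how these values map onto a PyTorch AdamW optimizer with gradient clipping is shown below. It assumes a model object exists and reuses the warmup-plus-cosine helper defined earlier; the model's loss interface is an illustrative assumption.

import torch

optimizer = torch.optim.AdamW(
    model.parameters(),        # `model` assumed to be defined elsewhere
    lr=3e-4,                   # peak learning rate; decayed by the schedule
    betas=(0.9, 0.95),
    weight_decay=0.1,
)

def training_step(batch, step, max_steps=100000):
    # Set the learning rate from the warmup + cosine schedule defined earlier
    lr = get_lr_schedule(step, warmup_steps=2000, max_steps=max_steps, max_lr=3e-4, min_lr=3e-5)
    for group in optimizer.param_groups:
        group['lr'] = lr

    loss = model(**batch).loss                  # assumes an HF-style model returning .loss
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()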

Batch Size and Context Length

Global batch size scaling:

# Rule of thumb: larger models use larger global batch sizes
# (illustrative values, typically counted in sequences per step; recipes vary)
model_size_to_batch_size = {
    '125M': 256,
    '350M': 512, 
    '1.3B': 1024,
    '6.7B': 2048,
    '13B': 4096,
    '30B': 8192,
}

Context length considerations:

  • Computational cost: O(n²) attention complexity
  • Memory usage: Activations scale with sequence length
  • Quality trade-off: Longer context = better coherence

Training Phases and Curriculum

Phase 1: Foundation Pre-training

Duration: 80-90% of total training
Objective: Next-token prediction on diverse text

import torch.nn.functional as F

def next_token_loss(logits, targets):
    # Shift targets for causal prediction
    shift_logits = logits[..., :-1, :].contiguous()
    shift_labels = targets[..., 1:].contiguous()
    
    loss = F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=-100
    )
    return loss

Training stability techniques:

  • Gradient accumulation: Simulate larger batch sizes (combined with mixed precision in the sketch after this list)
  • Mixed precision: FP16/BF16 for memory efficiency
  • Gradient checkpointing: Trade compute for memory
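
A minimal sketch combining the first two techniques, gradient accumulation and BF16 mixed precision via torch.autocast, is shown below. The model, optimizer, and data loader are assumed to exist; BF16 is used because it does not require loss scaling (FP16 would add a GradScaler).

import torch

ACCUM_STEPS = 16   # effective batch = micro-batch size * ACCUM_STEPS * world size

optimizer.zero_grad()
for i, batch in enumerate(train_loader):
    # BF16 autocast: activations in reduced precision, master weights stay FP32
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = model(**batch).loss / ACCUM_STEPS   # scale so accumulated gradients average correctly

    loss.backward()                                 # gradients accumulate across micro-batches

    if (i + 1) % ACCUM_STEPS == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        optimizer.zero_grad()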

Phase 2: Instruction Tuning (Optional)

Supervised Fine-Tuning (SFT):

# Instruction-response format
instruction_template = """Below is an instruction. Write a response.

### Instruction:
{instruction}

### Response:
{response}"""

Data requirements:

  • High-quality examples: 10K-100K instruction-response pairs
  • Diverse tasks: QA, summarization, creative writing, reasoning
  • Safety filtering: Remove harmful or biased examples

Phase 3: Alignment (Advanced)

RLHF (Reinforcement Learning from Human Feedback):

  1. Reward model training: Learn human preferences
  2. PPO training: Optimize policy using reward model
  3. Safety filtering: Constitutional AI, red-teaming

Optimization Techniques

Memory Optimization

ZeRO (Zero Redundancy Optimizer):

# DeepSpeed ZeRO Stage 3 example (exact config keys vary across DeepSpeed versions)
import deepspeed

model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config={
        "train_micro_batch_size_per_gpu": 1,
        "optimizer": {"type": "AdamW", "params": {"lr": 3e-4, "weight_decay": 0.1}},
        "zero_optimization": {
            "stage": 3,  # Partition optimizer states, gradients, and parameters
            "offload_optimizer": {"device": "cpu"},  # CPU offload
            "contiguous_gradients": True,
        },
        "fp16": {"enabled": True},
        "gradient_clipping": 1.0,
    },
)

Activation checkpointing:

# Trade compute for memory
def checkpoint_forward(layer, *args):
    return torch.utils.checkpoint.checkpoint(layer, *args)

Communication Optimization

Gradient compression:

def compress_gradients(gradients, compression_ratio=0.1):
    # Top-k sparsification
    flat_grad = torch.cat([g.flatten() for g in gradients])
    k = int(len(flat_grad) * compression_ratio)
    _, indices = torch.topk(torch.abs(flat_grad), k)
    
    compressed = torch.zeros_like(flat_grad)
    compressed[indices] = flat_grad[indices]
    return compressed

Evaluation and Validation

Intrinsic Evaluation Metrics

Perplexity Tracking

import math
import torch
import torch.nn.functional as F

def calculate_perplexity(model, dataloader):
    # Assumes batch['labels'] are already shifted for next-token prediction
    model.eval()
    total_loss = 0
    total_tokens = 0
    
    with torch.no_grad():
        for batch in dataloader:
            logits = model(batch['input_ids'])
            loss = F.cross_entropy(
                logits.view(-1, logits.size(-1)),
                batch['labels'].view(-1),
                ignore_index=-100
            )
            total_loss += loss.item() * batch['labels'].numel()
            total_tokens += batch['labels'].numel()
    
    avg_loss = total_loss / total_tokens
    perplexity = math.exp(avg_loss)
    return perplexity

Learning Curves Analysis

Key patterns to watch:

  • Training vs validation gap: Indicates overfitting
  • Loss plateaus: May need learning rate adjustment
  • Instability spikes: Check for data quality issues or numerical instability

Downstream Task Evaluation

Standard Benchmarks

Language Understanding:

  • GLUE/SuperGLUE: General language understanding
  • MMLU: Massive multitask language understanding
  • HellaSwag: Commonsense reasoning

Generation Quality:

  • BLEU/ROUGE: For summarization and translation
  • Human evaluation: Gold standard for generation quality
  • Factual accuracy: Knowledge probing tasks

Custom Evaluation Suites

def evaluate_model_capabilities(model, tokenizer):
    evaluations = {}
    
    # Reasoning capability
    evaluations['reasoning'] = evaluate_reasoning_tasks(model, tokenizer)
    
    # Knowledge retention
    evaluations['knowledge'] = evaluate_knowledge_tasks(model, tokenizer)
    
    # Safety and bias
    evaluations['safety'] = evaluate_safety_tasks(model, tokenizer)
    
    # Instruction following
    evaluations['instruction'] = evaluate_instruction_tasks(model, tokenizer)
    
    return evaluations

Cost Analysis and Resource Planning

Training Cost Breakdown

Compute Costs

Cloud training costs (AWS/GCP/Azure):

def estimate_training_cost(
    model_params,
    tokens,
    gpu_type='A100',
    hourly_rate=3.0,  # per GPU hour
    flops_per_token_per_param=6  # Forward + backward pass
):
    # FLOPs calculation
    total_flops = model_params * tokens * flops_per_token_per_param
    
    # GPU specs (A100 example)
    gpu_flops = 312e12  # 312 TFLOPS for A100
    gpu_utilization = 0.5  # Realistic utilization
    
    # Time calculation
    effective_flops = gpu_flops * gpu_utilization
    gpu_hours = total_flops / effective_flops / 3600
    
    # Cost calculation
    total_cost = gpu_hours * hourly_rate
    
    return {
        'total_flops': total_flops,
        'gpu_hours': gpu_hours,
        'estimated_cost': total_cost
    }

# Example: 7B parameter model
cost_estimate = estimate_training_cost(
    model_params=7e9,
    tokens=1e12,
    gpu_type='A100',
    hourly_rate=3.0
)
print(f"Estimated cost: ${cost_estimate['estimated_cost']:,.2f}")

Infrastructure Costs

Additional cost factors:

  • Storage: $0.02-0.1 per GB-month for training data
  • Networking: Data transfer between regions
  • Personnel: ML engineers, researchers, DevOps
  • Power consumption: Electricity costs for on-premise

ROI and Business Considerations

When to Train vs Fine-tune

Train from scratch when:

  • Need specialized domain knowledge
  • Have unique data distribution
  • Require custom tokenization
  • Need complete control over training process

Fine-tune existing models when:

  • Limited computational resources
  • Standard use cases
  • Need faster time-to-market
  • Want to leverage existing capabilities

Common Pitfalls and Troubleshooting

Training Instabilities

Loss Spikes and Divergence

Symptoms:

  • Sudden loss increases during training
  • NaN values in gradients or activations
  • Model performance degradation

Solutions:

# Gradient clipping
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

# Learning rate backoff (PyTorch has no built-in "step_back"; reduce LR manually)
if loss > previous_loss * 1.5:
    for group in optimizer.param_groups:
        group['lr'] *= 0.5
    
# Mixed precision scaling
scaler = torch.cuda.amp.GradScaler()
if scaler.is_enabled():
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    scaler.step(optimizer)
    scaler.update()

Memory Issues

Common problems:

  • Out of memory errors
  • Memory fragmentation
  • Inefficient memory usage

Solutions:

  • Gradient accumulation for smaller batch sizes
  • Activation checkpointing
  • Model parallelism
  • Mixed precision training

Data Quality Issues

Detecting Poor Quality Data

def detect_data_issues(batch):
    issues = []
    
    # Check for repetitive text
    if detect_repetition(batch['text']) > 0.5:
        issues.append('high_repetition')
    
    # Check for encoding issues  
    if has_encoding_errors(batch['text']):
        issues.append('encoding_error')
        
    # Check for length anomalies
    if len(batch['text'].split()) < 10:
        issues.append('too_short')
        
    return issues

Future Trends and Considerations

Emerging Training Paradigms

1. Mixture of Experts (MoE)

Concept: Activate only a subset of parameters per token.

Benefits:

  • Scale model capacity without proportional compute increase
  • Specialized experts for different domains
  • Better parameter efficiency

A minimal sketch (simplified routing; assumes torch, torch.nn as nn, and F are imported and an ExpertMLP module is defined elsewhere):

class MoELayer(nn.Module):
    def __init__(self, d_model, num_experts, expert_capacity):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList([
            ExpertMLP(d_model) for _ in range(num_experts)
        ])
        
    def forward(self, x):
        gate_scores = F.softmax(self.gate(x), dim=-1)
        # Route to top-k experts and renormalize their gate weights
        top_k_gates, top_k_indices = torch.topk(gate_scores, k=2, dim=-1)
        top_k_gates = top_k_gates / top_k_gates.sum(dim=-1, keepdim=True)
        # Simplified dense routing: run all experts, then mix only the top-k outputs
        # (real MoE layers dispatch tokens sparsely and enforce expert_capacity)
        expert_out = torch.stack([e(x) for e in self.experts], dim=-2)
        top_out = torch.gather(expert_out, -2, top_k_indices.unsqueeze(-1).expand(*top_k_indices.shape, x.size(-1)))
        return (top_k_gates.unsqueeze(-1) * top_out).sum(dim=-2)

2. Retrieval-Augmented Training

Concept: Combine parametric knowledge with external retrieval.

Applications:

  • Real-time knowledge updates
  • Factual accuracy improvements
  • Reduced hallucination

3. Multimodal Foundation Models

Integration challenges:

  • Cross-modal attention mechanisms
  • Unified tokenization strategies
  • Balanced training objectives

Efficiency Improvements

Model Compression During Training

Techniques:

  • Structured pruning: Remove entire attention heads or layers
  • Knowledge distillation: Train smaller models using a larger teacher (loss sketch after this list)
  • Quantization-aware training: Train with reduced precision from start
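
As a sketch of the knowledge-distillation objective, the function below blends the usual next-token cross-entropy with a KL term that pulls the student's token distribution toward the teacher's; the temperature and mixing weight are illustrative assumptions.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, temperature=2.0, alpha=0.5):
    # Soft targets: KL divergence on temperature-smoothed distributions
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    kd_loss = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * temperature ** 2

    # Hard targets: standard next-token cross-entropy
    ce_loss = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        targets.view(-1),
        ignore_index=-100,
    )
    return alpha * kd_loss + (1 - alpha) * ce_loss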

Green AI Initiatives

Sustainability considerations:

  • Carbon footprint tracking and reduction
  • Renewable energy for compute centers
  • Efficient model architectures
  • Training methodology improvements

Conclusion

Training LLMs from scratch is a complex undertaking that requires careful planning across three critical aspects: data, compute, and methodology. Key takeaways from this article:

Data Strategy

  • Quality over quantity: Invest heavily in data curation and filtering
  • Diversity matters: Balance different data sources and domains
  • Privacy and safety: Implement robust filtering for PII and harmful content
  • Tokenization: Choose appropriate vocabulary size and tokenization strategy

Compute Infrastructure

  • Scale planning: Start with clear understanding of resource requirements
  • Distributed training: Master data, model, and pipeline parallelism
  • Fault tolerance: Build robust checkpointing and recovery systems
  • Cost optimization: Balance performance with economic constraints

Training Methodology

  • Architecture decisions: Understand trade-offs in model design
  • Hyperparameter tuning: Start with proven baselines and iterate
  • Training stability: Implement monitoring and automatic recovery
  • Evaluation framework: Establish comprehensive evaluation from day one

Practical Recommendations

For researchers and small teams:

  • Start with smaller models (1B-7B parameters) to validate methodology
  • Use existing frameworks (Megatron, DeepSpeed) rather than building from scratch
  • Focus on data quality and domain specialization
  • Leverage cloud computing for cost-effective experimentation

For organizations considering large-scale training:

  • Invest in dedicated ML infrastructure team
  • Plan for 6-12 month training cycles
  • Budget 2-3x initial estimates for unexpected costs
  • Build comprehensive monitoring and observability from start

For the future:

  • Keep up with efficiency improvements (Flash Attention, MoE, etc.)
  • Consider environmental impact and sustainable practices
  • Prepare for multimodal and retrieval-augmented architectures
  • Invest in evaluation and safety research alongside model development

Training LLMs from scratch remains a frontier challenge in AI, but with a solid understanding of data, compute, and methodology, such projects can succeed and deliver breakthroughs in AI capability.


Disclaimer: The cost and resource estimates in this article are based on public information and can vary depending on the specific implementation, hardware choices, and market conditions.

