Fine-Tuning Techniques: LoRA, QLoRA, and Parameter-Efficient Methods

Introduction

Parameter-Efficient Fine-Tuning (PEFT) has revolutionized how we adapt Large Language Models (LLMs) to task-specific applications. As models keep growing (GPT-4, Claude, LLaMA), full fine-tuning becomes impractical and expensive. This article covers modern techniques that enable fine-tuning with minimal resources while retaining strong results.

Why Parameter-Efficient Fine-Tuning?

Problems with Full Fine-Tuning

  • Memory Requirements: A 7B-parameter model needs roughly 28GB of GPU memory just for fp16 weights and gradients; Adam optimizer states push full fine-tuning far beyond that
  • Storage Overhead: Every task requires a full copy of the model
  • Computational Cost: High training time and energy consumption
  • Catastrophic Forgetting: The model loses general capabilities after fine-tuning

Advantages of PEFT

  • Memory Efficiency: Only 0.1-1% of the total parameters are updated
  • Storage Efficiency: Adapter weights are a few MB versus GBs for a full model
  • Faster Training: Fewer trainable parameters make training steps cheaper and convergence often faster
  • Multi-Task: One base model plus multiple adapters serves many tasks

LoRA (Low-Rank Adaptation)

Core Idea

LoRA is based on the hypothesis that the weight updates during fine-tuning have low intrinsic rank. Instead of updating the weight matrix W directly, LoRA decomposes the update into two low-rank matrices.

Formula Matematis

W_new = W_original + ΔW
ΔW = A × B

Where:

  • A: an m×r matrix (m = input dimension, r = rank)
  • B: an r×n matrix (n = output dimension)
  • r ≪ min(m, n): the rank is far smaller than the original dimensions

In practice the update is also scaled by a factor α/r, as in the implementation below.
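To make the savings concrete: for a single 4096×4096 projection (a typical size in a 7B model), a full-rank update has about 16.8M parameters, while a rank-16 LoRA update has only about 131K, under 1% of the original. A quick check:

# Parameter count for one 4096x4096 projection with rank-16 LoRA
m, n, r = 4096, 4096, 16

full_update = m * n               # 16,777,216 parameters in ΔW
lora_update = (m * r) + (r * n)   # 131,072 parameters in A and B

print(f"Full update: {full_update:,}")
print(f"LoRA update: {lora_update:,} ({100 * lora_update / full_update:.2f}%)")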

LoRA Implementation

import torch
import torch.nn as nn
from transformers import AutoModel

class LoRALayer(nn.Module):
    def __init__(self, in_features, out_features, rank=16, alpha=32):
        super().__init__()
        self.rank = rank
        self.alpha = alpha
        
        # Original weights (frozen)
        self.original_layer = nn.Linear(in_features, out_features, bias=False)
        self.original_layer.weight.requires_grad = False
        
        # LoRA matrices: A starts with small noise, B with zeros,
        # so the adapter is a no-op (ΔW = 0) at initialization
        self.lora_A = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))
        
        self.scaling = alpha / rank
    
    def forward(self, x):
        # Frozen forward pass
        original_output = self.original_layer(x)
        
        # Low-rank adaptation path, scaled by alpha / rank
        lora_output = (x @ self.lora_A.T @ self.lora_B.T) * self.scaling
        
        return original_output + lora_output

# Usage example
model = AutoModel.from_pretrained("meta-llama/Llama-2-7b-hf")

# Replace attention linear layers with LoRA-wrapped versions.
# named_modules() yields dotted paths, so the new module must be set
# on its parent module, not on the root model.
for name, module in list(model.named_modules()):
    if isinstance(module, nn.Linear) and ('attn' in name.lower() or 'attention' in name.lower()):
        lora_layer = LoRALayer(
            module.in_features, 
            module.out_features, 
            rank=16
        )
        lora_layer.original_layer.weight.data = module.weight.data.clone()
        parent_name, _, child_name = name.rpartition('.')
        parent = model.get_submodule(parent_name) if parent_name else model
        setattr(parent, child_name, lora_layer)

LoRA Hyperparameters

Rank (r):

  • r=1-4: Very lightweight, suited to simple tasks
  • r=8-16: Standard choice, balancing efficiency and performance
  • r=32-64: Higher capacity, for complex tasks

Alpha (α):

  • α = r: Conservative scaling
  • α = 2r: Standard recommendation
  • α = 4r: Aggressive adaptation

Since the update is scaled by α/r, these choices correspond to effective scaling factors of 1, 2, and 4 regardless of the chosen rank.

Target Modules:

# Common LoRA targets
lora_config = {
    "target_modules": [
        "q_proj",  # Query projection
        "v_proj",  # Value projection  
        "k_proj",  # Key projection
        "o_proj",  # Output projection
        # Optional: "gate_proj", "up_proj", "down_proj" for FFN
    ],
    "rank": 16,
    "alpha": 32,
    "dropout": 0.1
}

QLoRA (Quantized LoRA)

The QLoRA Innovation

QLoRA combines LoRA with 4-bit quantization, making it possible to fine-tune very large models on consumer-grade GPUs.

Key Components

1. 4-bit NormalFloat (NF4)

# NF4 quantization concept (simplified: per-tensor rather than per-block)
def quantize_nf4(tensor):
    # Normalize to [-1, 1] using the absolute maximum
    scale = tensor.abs().max()
    normalized = tensor / scale
    
    # The 16 NF4 levels: quantiles of a standard normal distribution,
    # asymmetric around zero (values rounded, from the QLoRA paper)
    nf4_values = [-1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848,
                  -0.0911, 0.0, 0.0796, 0.1609, 0.2461, 0.3379,
                  0.4407, 0.5626, 0.7230, 1.0]
    
    # Map each element to the closest NF4 level
    quantized = torch.tensor([min(nf4_values, key=lambda v: abs(v - val))
                              for val in normalized.flatten().tolist()])
    
    return quantized.reshape(tensor.shape), scale
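A quick sanity check for this sketch is to round-trip a random weight matrix and measure the reconstruction error (quantize_nf4 returns the quantized values plus the scale, so q * scale dequantizes):

# Round-trip a random matrix through the NF4 sketch above
w = torch.randn(128, 128)
q, scale = quantize_nf4(w)
mean_err = (w - q * scale).abs().mean()
print(f"Mean absolute reconstruction error: {mean_err:.4f}")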

2. Double Quantization

Quantizing the quantization constants themselves yields additional memory savings:

class DoubleQuantization:
    def __init__(self, blocksize=64):
        self.blocksize = blocksize
    
    def quantize(self, tensor):
        # First level: quantize the tensor block-wise to signed 4-bit
        # (assumes tensor.numel() is divisible by blocksize)
        blocks = tensor.view(-1, self.blocksize)
        first_scales = []
        quantized_blocks = []
        
        for block in blocks:
            scale = block.abs().max()
            # Map [-1, 1] onto the signed 4-bit range [-8, 7]
            quantized = (block / scale * 7).round().clamp(-8, 7)
            first_scales.append(scale)
            quantized_blocks.append(quantized)
        
        # Second level: quantize the per-block scales themselves to 8 bits,
        # leaving a single fp32 constant for the whole tensor
        scales_tensor = torch.tensor(first_scales)
        scale_scale = scales_tensor.abs().max()
        quantized_scales = (scales_tensor / scale_scale * 255).round()
        
        return quantized_blocks, quantized_scales, scale_scale
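The payoff is easy to quantify. With one fp32 scale per 64-element block, the constants add 32/64 = 0.5 bits per parameter; quantizing those scales to 8 bits (with one fp32 meta-scale per group of 256 scales, as in the QLoRA paper) cuts the overhead to roughly 0.127 bits per parameter, which matches the ~0.37 bits/parameter saving reported by the authors:

# Overhead of quantization constants, in bits per parameter
blocksize = 64          # first-level block size
second_blocksize = 256  # second-level block size (QLoRA paper setting)

single_quant = 32 / blocksize                                       # 0.500
double_quant = 8 / blocksize + 32 / (blocksize * second_blocksize)  # ~0.127

print(f"Single quantization: {single_quant:.3f} bits/param overhead")
print(f"Double quantization: {double_quant:.3f} bits/param overhead")
print(f"Savings: {single_quant - double_quant:.3f} bits/param")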

3. Paged Optimizers

# Memory-efficient optimizer for QLoRA: optimizer states are paged
# between GPU and CPU memory to absorb memory spikes during training
import bitsandbytes as bnb

optimizer = bnb.optim.PagedAdamW32bit(
    model.parameters(),
    lr=2e-4,
    betas=(0.9, 0.999),
    weight_decay=0.01
)

QLoRA Implementation

import torch
from transformers import (
    AutoModelForCausalLM, 
    AutoTokenizer,
    BitsAndBytesConfig
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

# Load quantized model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto"
)

# Prepare the quantized model for training (casts norm layers,
# enables input gradients for the adapter path)
model = prepare_model_for_kbit_training(model)

# LoRA configuration
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM"
)

# Apply LoRA to the quantized model
model = get_peft_model(model, lora_config)

model.print_trainable_parameters()

Memory Comparison

# Approximate memory usage for a 7B-parameter model
# (varies with sequence length and batch size)
memory_usage = {
    "Full Fine-tuning": "~28GB",
    "LoRA (r=16)": "~12GB", 
    "QLoRA (4-bit)": "~6GB",
    "QLoRA + Gradient Checkpointing": "~4GB"
}
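These figures are approximate and shift with sequence length and batch size. A minimal way to verify peak usage on your own hardware, assuming a model, optimizer, and batch are already set up as in the snippets above, is to query PyTorch's CUDA allocator around one training step:

# Measure peak GPU memory across a single training step
torch.cuda.reset_peak_memory_stats()

outputs = model(**batch)
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()

peak_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"Peak GPU memory: {peak_gb:.2f} GB")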

Other Parameter-Efficient Methods

1. Adapter Layers

Concept: insert small bottleneck layers into the pre-trained model

class AdapterLayer(nn.Module):
    def __init__(self, hidden_size, adapter_size=64):
        super().__init__()
        self.adapter_down = nn.Linear(hidden_size, adapter_size)
        self.adapter_up = nn.Linear(adapter_size, hidden_size)
        self.activation = nn.ReLU()
        self.dropout = nn.Dropout(0.1)
    
    def forward(self, x):
        residual = x
        x = self.adapter_down(x)
        x = self.activation(x)
        x = self.dropout(x)
        x = self.adapter_up(x)
        return x + residual  # Residual connection

2. Prefix Tuning

Concept: optimize continuous task-specific vectors that are prepended to the input. (The simplified sketch below prepends trainable embeddings to the input sequence; full prefix tuning also prepends key/value prefixes at every attention layer.)

class PrefixTuning(nn.Module):
    def __init__(self, prefix_length=10, hidden_size=768):
        super().__init__()
        self.prefix_length = prefix_length
        self.prefix_embeddings = nn.Parameter(
            torch.randn(prefix_length, hidden_size) * 0.01
        )
    
    def forward(self, input_embeds):
        batch_size = input_embeds.size(0)
        prefix = self.prefix_embeddings.unsqueeze(0).expand(batch_size, -1, -1)
        return torch.cat([prefix, input_embeds], dim=1)

3. P-Tuning v2

Concept: add trainable prompts at every layer, not just at the input

class PTuningV2(nn.Module):
    def __init__(self, num_layers, hidden_size, prompt_length=100):
        super().__init__()
        self.prompt_length = prompt_length
        
        # Prompt embeddings for each layer
        self.prompt_embeddings = nn.ParameterList([
            nn.Parameter(torch.randn(prompt_length, hidden_size) * 0.01)
            for _ in range(num_layers)
        ])
    
    def get_prompt(self, layer_idx, batch_size):
        return self.prompt_embeddings[layer_idx].unsqueeze(0).expand(batch_size, -1, -1)

4. (IA)³ – Infused Adapter by Inhibiting and Amplifying Inner Activations

Concept: scale activations with learned vectors (applied to keys, values, and FFN activations in the original paper)

class IA3Layer(nn.Module):
    def __init__(self, hidden_size):
        super().__init__()
        # Learned scaling factors
        self.scale_factor = nn.Parameter(torch.ones(hidden_size))
    
    def forward(self, x):
        return x * self.scale_factor
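Since all four sketches above are plain nn.Module classes, comparing how many trainable parameters each adds is straightforward. The counts below use BERT-base-like dimensions (hidden size 768, 12 layers) and are per module instance; adapters, for example, are usually inserted once or twice per transformer layer, so their total scales with depth:

# Trainable parameter counts for the four sketches above
modules = {
    "Adapter (bottleneck 64)": AdapterLayer(768, adapter_size=64),
    "Prefix Tuning (length 10)": PrefixTuning(prefix_length=10, hidden_size=768),
    "P-Tuning v2 (length 100)": PTuningV2(num_layers=12, hidden_size=768, prompt_length=100),
    "(IA)^3": IA3Layer(768),
}
for name, module in modules.items():
    count = sum(p.numel() for p in module.parameters() if p.requires_grad)
    print(f"{name}: {count:,} trainable parameters")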

Method Comparison

Performance vs Efficiency Trade-off

method_comparison = {
    "Full Fine-tuning": {
        "trainable_params": "100%",
        "memory": "High", 
        "performance": "Best",
        "storage": "High"
    },
    "LoRA": {
        "trainable_params": "0.1-1%",
        "memory": "Medium",
        "performance": "Very Good", 
        "storage": "Low"
    },
    "QLoRA": {
        "trainable_params": "0.1-1%", 
        "memory": "Low",
        "performance": "Good",
        "storage": "Very Low"
    },
    "Adapters": {
        "trainable_params": "2-5%",
        "memory": "Medium",
        "performance": "Good",
        "storage": "Low"
    },
    "Prefix Tuning": {
        "trainable_params": "0.01-0.1%",
        "memory": "Low", 
        "performance": "Fair",
        "storage": "Very Low"
    }
}

Task-Specific Recommendations

Natural Language Understanding (NLU):

  • LoRA: Target the attention layers (q_proj, v_proj)
  • Rank: 8-16 is usually sufficient
  • Tasks: Classification, NER, sentiment analysis

Natural Language Generation (NLG):

  • LoRA: Target attention + FFN layers (see the example config after these lists)
  • Rank: 16-64 for better generation quality
  • Tasks: Summarization, translation, creative writing

Code Generation:

  • QLoRA: For large models (70B+ parameters)
  • Higher rank: r=64-128 for complex reasoning
  • Target: All linear layers in the attention blocks

Domain Adaptation:

  • Full LoRA: Attention + FFN layers
  • Higher alpha: α=64-128 for aggressive adaptation
  • Additional: Consider vocabulary expansion
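As a concrete starting point for generation-oriented fine-tuning, the sketch below shows a LoraConfig covering both attention and FFN projections; the module names assume a LLaMA-style architecture, and the r/α choice follows the α = 2r rule of thumb from earlier:

from peft import LoraConfig

# NLG-oriented LoRA config (module names assume a LLaMA-style model)
nlg_lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",   # attention projections
        "gate_proj", "up_proj", "down_proj",      # FFN projections
    ],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)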

Training Best Practices

1. Learning Rate Scheduling

# LoRA typically needs a higher learning rate than full fine-tuning
lr_config = {
    "LoRA layers": 2e-4,      # Higher LR for adaptation
    "Base model": 2e-5,       # Lower LR if any base params trained
    "Warmup steps": 100,
    "Scheduler": "cosine"
}
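A minimal sketch of this schedule using transformers' built-in helper, assuming the optimizer from earlier and a total_training_steps value computed from the dataloader:

from transformers import get_cosine_schedule_with_warmup

# Cosine decay with linear warmup, matching the config above
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=100,
    num_training_steps=total_training_steps,  # assumed: len(dataloader) * epochs
)

# Call scheduler.step() after each optimizer.step() in the training loop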

2. Gradient Accumulation

# For memory-constrained training
training_config = {
    "per_device_batch_size": 1,
    "gradient_accumulation_steps": 8,  # Effective batch size = 8
    "dataloader_pin_memory": False,    # Save memory
    "gradient_checkpointing": True     # Trade compute for memory
}
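The Trainer applies this automatically, but the underlying loop is simple: divide each loss by the accumulation factor and step the optimizer only every N micro-batches. A minimal sketch, assuming model, dataloader, and optimizer are already defined:

accumulation_steps = 8

for step, batch in enumerate(dataloader):
    outputs = model(**batch)
    # Scale the loss so gradients average over the effective batch
    loss = outputs.loss / accumulation_steps
    loss.backward()

    # Step and reset only after accumulating N micro-batches
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()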

3. Mixed Precision Training

# Enable automatic mixed precision
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()

for batch in dataloader:
    optimizer.zero_grad()
    
    with autocast():
        outputs = model(**batch)
        loss = outputs.loss
    
    # Scale the loss to avoid fp16 gradient underflow
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()

Practical Implementation Guide

Setup Environment

# Install required packages
pip install transformers peft bitsandbytes accelerate
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

Complete QLoRA Training Script

from transformers import TrainingArguments, Trainer
import torch

# Model and tokenizer setup
model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# Training arguments
training_args = TrainingArguments(
    output_dir="./qlora-output",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    num_train_epochs=3,
    learning_rate=2e-4,
    bf16=True,   # matches bnb_4bit_compute_dtype=torch.bfloat16 above
    logging_steps=10,
    save_steps=500,
    evaluation_strategy="steps",
    eval_steps=500,
    warmup_steps=100,
    remove_unused_columns=False,
    group_by_length=True,
    dataloader_pin_memory=False
)

# Trainer setup; train_dataset, eval_dataset, and data_collator are
# assumed to have been prepared beforehand
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator
)

# Start training
trainer.train()

# Save LoRA weights only
model.save_pretrained("./qlora-weights")

Inference with LoRA

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto"
)

# Load and apply LoRA weights
from peft import PeftModel
model = PeftModel.from_pretrained(base_model, "./qlora-weights")

# Inference (inputs must live on the same device as the model)
inputs = tokenizer("Your prompt here", return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_length=100)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
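For deployment, the adapter can also be folded into the base weights with PEFT's merge_and_unload(), so inference pays no extra cost for the LoRA path. Note that merging requires the base weights in fp16/bf16 form rather than 4-bit, so load the base model without quantization before merging:

# Merge LoRA weights into the base model for adapter-free inference
merged_model = model.merge_and_unload()
merged_model.save_pretrained("./merged-model")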

Troubleshooting Common Issues

Memory Problems

# Solutions for OOM errors
solutions = {
    "Reduce batch size": "per_device_batch_size=1",
    "Increase gradient accumulation": "gradient_accumulation_steps=16", 
    "Enable gradient checkpointing": "gradient_checkpointing=True",
    "Use smaller rank": "r=8 instead of r=64",
    "Reduce sequence length": "max_length=512 instead of 2048"
}

Poor Performance

# Debugging poor adaptation results
debugging_checklist = [
    "Check target modules - include FFN layers for generation tasks",
    "Increase rank - try r=32 or r=64",
    "Adjust alpha - try alpha=2*rank",
    "Verify learning rate - LoRA needs higher LR (1e-4 to 2e-4)",
    "Check data quality - ensure proper formatting",
    "Monitor training loss - should decrease consistently"
]

Conclusion

Parameter-efficient fine-tuning has made LLM adaptation accessible to researchers and practitioners with limited resources. Key takeaways:

LoRA is the sweet spot between efficiency and performance, a good fit for most use cases given reasonable GPU resources.

QLoRA opens up fine-tuning of very large models (70B+) on consumer hardware, albeit with a slight performance trade-off.

Other methods (Adapters, Prefix Tuning, etc.) offer alternative approaches with different trade-offs for specific scenarios.

Best practices: Start with LoRA at r=16, α=32 on the attention layers, then adjust based on task complexity and available resources.

Future directions: Combining multiple PEFT methods, adaptive rank selection, and integration with model compression techniques will keep pushing this field forward.


Parameter-efficient fine-tuning is not just about saving resources; it also democratizes LLM customization for the broader community.

