Introduction
Parameter-Efficient Fine-Tuning (PEFT) has transformed how we adapt Large Language Models (LLMs) to task-specific applications. With models growing ever larger (GPT-4, Claude, LLaMA), full fine-tuning has become impractical and expensive. This article covers modern techniques that make fine-tuning possible with minimal resources while preserving strong results.
Why Parameter-Efficient Fine-Tuning?
The Problems with Full Fine-Tuning
- Memory Requirements: a 7B-parameter model already takes ~28GB just to hold fp32 weights; gradients and optimizer states multiply that several times over during training
- Storage Overhead: every task requires a complete copy of the model
- Computational Cost: long training times and high energy consumption
- Catastrophic Forgetting: the model can lose general capabilities after fine-tuning
Advantages of PEFT
- Memory Efficiency: only 0.1-1% of the total parameters are updated
- Storage Efficiency: adapter weights take a few MB versus GBs for a full model copy
- Faster Training: fewer trainable parameters typically mean faster convergence
- Multi-Task: one base model plus multiple adapters serves many tasks (see the sketch below)
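As a hedged sketch of that multi-task workflow: the PEFT library can attach several adapters to a single base model and switch between them at runtime. The adapter paths below are placeholders for adapters you have already trained.

from transformers import AutoModelForCausalLM
from peft import PeftModel

# One shared base model
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# Attach a first adapter, then load a second one alongside it
model = PeftModel.from_pretrained(base, "./adapters/summarization",
                                  adapter_name="summarization")
model.load_adapter("./adapters/qa", adapter_name="qa")

# Switch the active adapter per request, without reloading the base model
model.set_adapter("qa")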
LoRA (Low-Rank Adaptation)
Basic Concept
LoRA rests on the hypothesis that the weight updates learned during fine-tuning have low intrinsic rank. Instead of updating the weight matrix W directly, LoRA decomposes the update into two low-rank matrices.
Mathematical Formulation
W_new = W_original + ΔW
ΔW = (α / r) · A × B
where:
- A: an m×r matrix (m = input dimension, r = rank)
- B: an r×n matrix (n = output dimension)
- r << min(m, n): the rank is far smaller than the original dimensions
- α / r: a scaling factor (see the hyperparameter section below) that keeps the update magnitude stable as r varies
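To make the savings concrete, here is a quick parameter count for a square 4096-dimensional projection, typical of 7B-class models (a sketch, not a benchmark):

m, n, r = 4096, 4096, 16
full_update = m * n           # 16,777,216 parameters for a dense ΔW
lora_update = m * r + r * n   # 131,072 parameters for A and B combined
print(f"Reduction: {full_update / lora_update:.0f}x")  # ~128x fewer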
LoRA Implementation
import torch
import torch.nn as nn
class LoRALayer(nn.Module):
def __init__(self, in_features, out_features, rank=16, alpha=32):
super().__init__()
self.rank = rank
self.alpha = alpha
# Original weights (frozen)
self.original_layer = nn.Linear(in_features, out_features, bias=False)
self.original_layer.weight.requires_grad = False
# LoRA matrices
self.lora_A = nn.Parameter(torch.randn(rank, in_features) * 0.01)
self.lora_B = nn.Parameter(torch.zeros(out_features, rank))
self.scaling = alpha / rank
def forward(self, x):
# Original forward pass
original_output = self.original_layer(x)
# LoRA adaptation
lora_output = (x @ self.lora_A.T @ self.lora_B.T) * self.scaling
return original_output + lora_output
# Usage example (imports and checkpoint name adjusted to be runnable)
import transformers

model = transformers.AutoModel.from_pretrained("meta-llama/Llama-2-7b-hf")

# Collect target layers first: mutating the module tree while iterating
# over named_modules() is unsafe. Note that LLaMA names its attention
# blocks "self_attn", so we match on both "attn" and "attention".
targets = [
    (name, module) for name, module in model.named_modules()
    if isinstance(module, nn.Linear)
    and any(key in name.lower() for key in ("attn", "attention"))
]
for name, module in targets:
    lora_layer = LoRALayer(module.in_features, module.out_features, rank=16)
    lora_layer.original_layer.weight.data = module.weight.data.clone()
    # setattr must target the direct parent; a dotted path cannot be set directly
    parent_name, _, child_name = name.rpartition(".")
    parent = model.get_submodule(parent_name) if parent_name else model
    setattr(parent, child_name, lora_layer)
LoRA Hyperparameters
Rank (r):
- r=1-4: very lightweight, suitable for simple tasks
- r=8-16: the standard choice, balancing efficiency and performance
- r=32-64: higher capacity, for complex tasks
Alpha (α):
- α = r: conservative scaling
- α = 2r: the standard recommendation (e.g., r=16 with α=32 gives a scaling factor of 2)
- α = 4r: aggressive adaptation
Target Modules:
# Common LoRA targets
lora_config = {
"target_modules": [
"q_proj", # Query projection
"v_proj", # Value projection
"k_proj", # Key projection
"o_proj", # Output projection
# Optional: "gate_proj", "up_proj", "down_proj" for FFN
],
"rank": 16,
"alpha": 32,
"dropout": 0.1
}
QLoRA (Quantized LoRA)
The QLoRA Innovation
QLoRA combines LoRA with 4-bit quantization, making it feasible to fine-tune very large models on consumer-grade GPUs.
Key Components
1. 4-bit NormalFloat (NF4)
# Conceptual NF4 quantization. Illustrative only: the real NF4 codebook
# uses 16 levels placed at quantiles of a normal distribution, while this
# symmetric 15-level list just demonstrates the idea.
def quantize_nf4(tensor):
    # Normalize to the [-1, 1] range
    scale = tensor.abs().max()
    normalized = tensor / scale
    nf4_values = [-1.0, -0.6962, -0.5250, -0.3906, -0.2739, -0.1647,
                  -0.0618, 0.0, 0.0618, 0.1647, 0.2739, 0.3906,
                  0.5250, 0.6962, 1.0]
    # Snap each element to the closest codebook value
    quantized = torch.tensor([min(nf4_values, key=lambda v: abs(v - val))
                              for val in normalized.flatten().tolist()])
    return quantized.reshape(tensor.shape), scale
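Dequantization is then just a rescale; a short round-trip check (with the same conceptual caveats as above):

def dequantize_nf4(quantized, scale):
    # Recover approximate values from codebook entries and the stored scale
    return quantized * scale

w = torch.randn(8, 8)
q, s = quantize_nf4(w)
w_hat = dequantize_nf4(q, s)
print((w - w_hat).abs().mean())  # mean absolute quantization error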
2. Double Quantization
Quantize the quantization constants themselves for additional memory savings:
class DoubleQuantization:
    def __init__(self, blocksize=64):
        self.blocksize = blocksize

    def quantize(self, tensor):
        # First level: quantize the tensor block by block
        # (assumes tensor.numel() is divisible by blocksize)
        blocks = tensor.view(-1, self.blocksize)
        scales_1 = []
        quantized_blocks = []
        for block in blocks:
            scale = block.abs().max()
            # Map each block onto the signed 4-bit range [-8, 7]
            quantized = (block / scale * 7).round().clamp(-8, 7)
            scales_1.append(scale)
            quantized_blocks.append(quantized)
        # Second level: quantize the per-block scales to 8 bits
        scales_tensor = torch.stack(scales_1)
        scale_scale = scales_tensor.abs().max()
        quantized_scales = (scales_tensor / scale_scale * 255).round()
        return quantized_blocks, quantized_scales, scale_scale
3. Paged Optimizers
# Memory-efficient optimizer for QLoRA
import bitsandbytes as bnb
optimizer = bnb.optim.PagedAdamW32bit(
model.parameters(),
lr=2e-4,
betas=(0.9, 0.999),
weight_decay=0.01
)
QLoRA Implementation
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig
)
from peft import LoraConfig, get_peft_model
# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16
)
# Load quantized model
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-7b-hf",
quantization_config=bnb_config,
device_map="auto"
)
# LoRA configuration
lora_config = LoraConfig(
r=64,
lora_alpha=16,
target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
lora_dropout=0.1,
bias="none",
task_type="CAUSAL_LM"
)
# Apply LoRA to quantized model
model = get_peft_model(model, lora_config)
print(f"Trainable parameters: {model.num_parameters(only_trainable=True):,}")
print(f"Total parameters: {model.num_parameters():,}")
Memory Comparison
# Approximate GPU memory for a 7B-parameter model (rough figures; actual
# usage also depends on batch size, sequence length, and optimizer states)
memory_usage = {
"Full Fine-tuning": "~28GB",
"LoRA (r=16)": "~12GB",
"QLoRA (4-bit)": "~6GB",
"QLoRA + Gradient Checkpointing": "~4GB"
}
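To verify these figures on your own hardware, PyTorch's peak-memory counters give a quick check; a minimal sketch:

import torch

torch.cuda.reset_peak_memory_stats()
# ... run one training step here ...
peak_gb = torch.cuda.max_memory_allocated() / 1e9
print(f"Peak GPU memory: {peak_gb:.1f} GB")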
Other Parameter-Efficient Methods
1. Adapter Layers
Concept: insert small bottleneck layers into the pre-trained model
class AdapterLayer(nn.Module):
def __init__(self, hidden_size, adapter_size=64):
super().__init__()
self.adapter_down = nn.Linear(hidden_size, adapter_size)
self.adapter_up = nn.Linear(adapter_size, hidden_size)
self.activation = nn.ReLU()
self.dropout = nn.Dropout(0.1)
def forward(self, x):
residual = x
x = self.adapter_down(x)
x = self.activation(x)
x = self.dropout(x)
x = self.adapter_up(x)
return x + residual # Residual connection
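A brief usage sketch: in practice the adapter is inserted after each attention and feed-forward sublayer; here it simply wraps a dummy hidden state to show the shape-preserving residual path.

hidden = torch.randn(2, 128, 768)   # (batch, seq_len, hidden_size)
adapter = AdapterLayer(hidden_size=768)
out = adapter(hidden)               # same shape, thanks to the residual
assert out.shape == hidden.shape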
2. Prefix Tuning
Concept: optimize continuous task-specific vectors that are prepended to the input. (The sketch below is a simplified input-level variant; the original prefix-tuning method prepends trainable vectors to the keys and values of every attention layer.)
class PrefixTuning(nn.Module):
def __init__(self, prefix_length=10, hidden_size=768):
super().__init__()
self.prefix_length = prefix_length
self.prefix_embeddings = nn.Parameter(
torch.randn(prefix_length, hidden_size) * 0.01
)
def forward(self, input_embeds):
batch_size = input_embeds.size(0)
prefix = self.prefix_embeddings.unsqueeze(0).expand(batch_size, -1, -1)
return torch.cat([prefix, input_embeds], dim=1)
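A short usage sketch; note that the attention mask must also be extended by prefix_length, which is assumed to be handled by the caller.

prefix_tuner = PrefixTuning(prefix_length=10, hidden_size=768)
input_embeds = torch.randn(4, 32, 768)   # (batch, seq_len, hidden)
extended = prefix_tuner(input_embeds)    # shape: (4, 42, 768)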
3. P-Tuning v2
Concept: add trainable prompts at every layer, not only at the input
class PTuningV2(nn.Module):
def __init__(self, num_layers, hidden_size, prompt_length=100):
super().__init__()
self.prompt_length = prompt_length
# Prompt embeddings for each layer
self.prompt_embeddings = nn.ParameterList([
nn.Parameter(torch.randn(prompt_length, hidden_size) * 0.01)
for _ in range(num_layers)
])
def get_prompt(self, layer_idx, batch_size):
return self.prompt_embeddings[layer_idx].unsqueeze(0).expand(batch_size, -1, -1)
4. (IA)³: Infused Adapter by Inhibiting and Amplifying Inner Activations
Concept: scale activations with learned vectors (in the original paper, applied to the attention keys and values and to the FFN intermediate activations)
class IA3Layer(nn.Module):
def __init__(self, hidden_size):
super().__init__()
# Learned scaling factors
self.scale_factor = nn.Parameter(torch.ones(hidden_size))
def forward(self, x):
return x * self.scale_factor
Method Comparison
Performance vs Efficiency Trade-off
method_comparison = {
"Full Fine-tuning": {
"trainable_params": "100%",
"memory": "High",
"performance": "Best",
"storage": "High"
},
"LoRA": {
"trainable_params": "0.1-1%",
"memory": "Medium",
"performance": "Very Good",
"storage": "Low"
},
"QLoRA": {
"trainable_params": "0.1-1%",
"memory": "Low",
"performance": "Good",
"storage": "Very Low"
},
"Adapters": {
"trainable_params": "2-5%",
"memory": "Medium",
"performance": "Good",
"storage": "Low"
},
"Prefix Tuning": {
"trainable_params": "0.01-0.1%",
"memory": "Low",
"performance": "Fair",
"storage": "Very Low"
}
}
Task-Specific Recommendations
Natural Language Understanding (NLU):
- LoRA: target the attention layers (q_proj, v_proj)
- Rank: 8-16 is usually sufficient
- Tasks: classification, NER, sentiment analysis
Natural Language Generation (NLG):
- LoRA: target attention + FFN layers (see the config sketch after these lists)
- Rank: 16-64 for better generation quality
- Tasks: summarization, translation, creative writing
Code Generation:
- QLoRA: for very large models (70B+ parameters)
- Higher rank: r=64-128 for complex reasoning
- Target: all linear layers in the attention blocks
Domain Adaptation:
- Full LoRA: attention + FFN layers
- Higher alpha: α=64-128 for aggressive adaptation
- Additionally: consider vocabulary expansion
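As a hedged sketch of the NLG recommendation above (module names follow the LLaMA convention used elsewhere in this article; adjust them for other architectures):

from peft import LoraConfig

nlg_lora_config = LoraConfig(
    r=32,                                        # mid-range rank for generation quality
    lora_alpha=64,                               # alpha = 2 * r
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # attention projections
        "gate_proj", "up_proj", "down_proj",     # FFN projections
    ],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)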
Training Best Practices
1. Learning Rate Scheduling
# LoRA typically needs a higher learning rate than full fine-tuning
lr_config = {
"LoRA layers": 2e-4, # Higher LR for adaptation
"Base model": 2e-5, # Lower LR if any base params trained
"Warmup steps": 100,
"Scheduler": "cosine"
}
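A minimal sketch of wiring up those split learning rates with optimizer parameter groups, assuming LoRA parameters are identifiable by name (true for the LoRALayer above and for PEFT's "lora_" naming convention):

lora_params = [p for n, p in model.named_parameters()
               if p.requires_grad and "lora_" in n]
base_params = [p for n, p in model.named_parameters()
               if p.requires_grad and "lora_" not in n]

optimizer = torch.optim.AdamW([
    {"params": lora_params, "lr": 2e-4},  # higher LR for the adapters
    {"params": base_params, "lr": 2e-5},  # lower LR for any trainable base params
])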
2. Gradient Accumulation
# For memory-constrained training
training_config = {
"per_device_batch_size": 1,
"gradient_accumulation_steps": 8, # Effective batch size = 8
"dataloader_pin_memory": False, # Save memory
"gradient_checkpointing": True # Trade compute for memory
}
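For reference, a hedged sketch of what gradient accumulation does under the hood (the HF Trainer handles this automatically; dataloader, model, and optimizer are assumed to be defined):

accumulation_steps = 8
optimizer.zero_grad()
for step, batch in enumerate(dataloader):
    loss = model(**batch).loss / accumulation_steps  # scale to keep the average
    loss.backward()                                  # gradients accumulate in .grad
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()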
3. Mixed Precision Training
# Enable automatic mixed precision
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()
for batch in dataloader:
    optimizer.zero_grad()
    with autocast():
        outputs = model(**batch)
        loss = outputs.loss
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
Practical Implementation Guide
Setup Environment
# Install required packages
pip install transformers peft bitsandbytes accelerate
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
Complete QLoRA Training Script
from transformers import AutoTokenizer, TrainingArguments, Trainer
import torch

# Model and tokenizer setup (`model` is the quantized, LoRA-wrapped model
# from the QLoRA section above)
model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
# Training arguments
training_args = TrainingArguments(
output_dir="./qlora-output",
per_device_train_batch_size=1,
gradient_accumulation_steps=8,
num_train_epochs=3,
learning_rate=2e-4,
bf16=True,  # match bnb_4bit_compute_dtype=torch.bfloat16 above
logging_steps=10,
save_steps=500,
evaluation_strategy="steps",
eval_steps=500,
warmup_steps=100,
remove_unused_columns=False,
group_by_length=True,
dataloader_pin_memory=False
)
# Trainer (train_dataset, eval_dataset, and data_collator are assumed
# to be prepared elsewhere)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=train_dataset,
eval_dataset=eval_dataset,
tokenizer=tokenizer,
data_collator=data_collator
)
# Start training
trainer.train()
# Save LoRA weights only
model.save_pretrained("./qlora-weights")
Inference with LoRA
# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-7b-hf",
quantization_config=bnb_config
)
# Load and apply LoRA weights
from peft import PeftModel
model = PeftModel.from_pretrained(base_model, "./qlora-weights")
# Inference
inputs = tokenizer("Your prompt here", return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=100)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
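For deployment without the PEFT wrapper, the adapter can optionally be merged into the base weights using PEFT's merge_and_unload; note that merging into a 4-bit quantized base may require reloading the base model in full precision first, depending on the PEFT version.

# Merge LoRA weights into the base model and drop the PEFT wrapper
merged_model = model.merge_and_unload()
merged_model.save_pretrained("./merged-model")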
Troubleshooting Common Issues
Memory Problems
# Solutions for OOM errors
solutions = {
"Reduce batch size": "per_device_batch_size=1",
"Increase gradient accumulation": "gradient_accumulation_steps=16",
"Enable gradient checkpointing": "gradient_checkpointing=True",
"Use smaller rank": "r=8 instead of r=64",
"Reduce sequence length": "max_length=512 instead of 2048"
}
Poor Performance
# Debugging poor adaptation results
debugging_checklist = [
"Check target modules - include FFN layers for generation tasks",
"Increase rank - try r=32 or r=64",
"Adjust alpha - try alpha=2*rank",
"Verify learning rate - LoRA needs higher LR (1e-4 to 2e-4)",
"Check data quality - ensure proper formatting",
"Monitor training loss - should decrease consistently"
]
Conclusion
Parameter-efficient fine-tuning has made LLM adaptation accessible to researchers and practitioners with limited resources. Key takeaways:
LoRA is the sweet spot between efficiency and performance, and a good fit for most use cases given reasonable GPU resources.
QLoRA opens up fine-tuning of very large models (70B+) on consumer hardware, albeit with a slight performance trade-off.
Other methods (Adapters, Prefix Tuning, and so on) offer alternative approaches with different trade-offs for specific scenarios.
Best practices: start with LoRA at r=16, α=32 on the attention layers, then adjust based on task complexity and available resources.
Future directions: combining multiple PEFT methods, adaptive rank selection, and integration with model compression techniques will keep pushing the field forward.
Parameter-efficient fine-tuning is not just about saving resources; it also democratizes LLM customization for the broader community.