AI/ML
DPO (Direct Preference Optimization)
Also known as: Direct Preference Optimization
What it is
A simpler alternative to RLHF that aligns LLM outputs with human preferences without training a separate reward model or running a reinforcement learning loop. DPO optimizes the policy directly on pairs of preferred and dispreferred outputs, which makes it computationally cheaper and more stable than RLHF's multi-stage pipeline. It was widely adopted in 2024-2025 for fine-tuning open-source models.
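To make the idea of optimizing directly on preference pairs concrete, here is a minimal sketch of the per-pair DPO loss in plain Python. It assumes you already have the summed log-probability of each response under the policy being trained and under a frozen reference model. The names `dpo_loss`, `softplus`, and the default `beta=0.1` are illustrative choices, not taken from any particular library.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # illustrative default; controls how far the policy may drift
) -> float:
    """DPO loss for one (preferred, dispreferred) pair of responses.

    Each argument is the summed log-probability of a full response,
    under either the trainable policy or the frozen reference model.
    """
    # How much more the policy favours each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp

    # The loss is -log(sigmoid(logits)), which equals softplus(-logits).
    logits = beta * (chosen_margin - rejected_margin)
    return softplus(-logits)


if __name__ == "__main__":
    # Policy identical to the reference: the loss is log(2).
    print(dpo_loss(-11.0, -11.0, -11.0, -11.0))
    # Policy has shifted probability toward the preferred response: lower loss.
    print(dpo_loss(-10.0, -12.0, -11.0, -11.0))
```

The loss falls as the policy raises the preferred response's likelihood relative to the reference and lowers the dispreferred one's. No reward model appears anywhere in the computation. In practice the same expression would be computed over batches of tensors with an autodiff framework.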
Related Terms
RLHF (Reinforcement Learning from Human Feedback) | AI/ML
A training technique that aligns LLM outputs with human preferences. Process: (1) train a …
Fine-Tuning | AI/ML
The process of further training a pre-trained model on a specialized dataset to improve pe…
Training (ML) | AI/ML
The process of optimizing a model's parameters by exposing it to data and adjusting weight…