AI/ML

DPO (Direct Preference Optimization)

Also known as: Direct Preference Optimization

What it is

A simplified alternative to RLHF that aligns LLM outputs with human preferences without training a separate reward model or running a reinforcement learning loop. DPO optimizes the policy directly on pairs of preferred and dispreferred outputs, which makes it computationally cheaper and more stable than RLHF's multi-stage pipeline. It was widely adopted in 2024-2025 for fine-tuning open-source models.
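To make "directly optimizes a policy using pairs" more concrete, here is a minimal sketch of the per-pair DPO loss as it is usually written: the negative log-sigmoid of a scaled difference of log-probability ratios. It assumes a frozen reference model (typically the supervised fine-tuned checkpoint), which the definition above does not mention. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are illustrative choices, not part of any particular library.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss (hypothetical helper, for illustration only).

    Each argument is the summed log-probability of a full response
    under either the trainable policy or the frozen reference model.
    beta controls how strongly the policy is tied to the reference;
    0.1 is an assumed default, not a prescribed value.
    """
    # How much more (or less) likely each response became under the
    # policy, relative to the reference model.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # Positive margin: the policy favors the preferred response more
    # than the reference does.
    margin = beta * (chosen_logratio - rejected_logratio)

    # -log(sigmoid(margin)), computed as softplus(-margin) in a
    # numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


if __name__ == "__main__":
    # Policy identical to the reference: margin is 0, loss is log(2).
    print(dpo_loss(-5.0, -5.0, -5.0, -5.0))
    # Policy shifted toward the preferred response: loss drops below log(2).
    print(dpo_loss(-4.0, -6.0, -5.0, -5.0))
```

In practice the log-probabilities would come from forward passes of two language models over batches of preference pairs, and the loss would be averaged and backpropagated through the policy only. No reward model or sampling loop appears anywhere in the computation, which is where the cost and stability advantages over RLHF come from.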


Related Terms

RLHF (Reinforcement Learning from Human Feedback) [AI/ML]

A training technique that aligns LLM outputs with human preferences. Process: (1) train a …


Fine-Tuning [AI/ML]

The process of further training a pre-trained model on a specialized dataset to improve pe…


Training (ML) [AI/ML]

The process of optimizing a model's parameters by exposing it to data and adjusting weight…

