
DPO (Direct Preference Optimization)

Also known as: Direct Preference Optimization
📖 What it is

A simpler alternative to RLHF (reinforcement learning from human feedback) that aligns LLM outputs with human preferences without training a separate reward model or running a reinforcement learning loop. DPO optimizes the policy directly on pairs of preferred and dispreferred outputs, using a classification-style loss measured against a frozen reference model. This makes it computationally cheaper and more stable than RLHF's multi-stage pipeline. It was widely adopted in 2024-2025 for fine-tuning open-source models.
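
To make the "pairs of preferred and dispreferred outputs" idea concrete, here is a minimal sketch of the per-pair DPO loss in plain Python. It assumes the sequence log-probabilities of each response under the policy and under a frozen reference model have already been computed. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are illustrative choices, not taken from any particular library.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen_logratio - rejected_logratio))).

    All four inputs are summed token log-probabilities of a full response.
    beta is a hypothetical default controlling how far the policy may
    drift from the reference model.
    """
    # How much more (or less) likely each response is under the policy
    # than under the frozen reference model.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # Implicit reward margin between the preferred and dispreferred response.
    margin = beta * (chosen_logratio - rejected_logratio)

    # -log(sigmoid(margin)) equals log(1 + exp(-margin)); branch on the
    # sign of the margin so exp() never overflows.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# When the policy matches the reference, the margin is 0 and the loss is log(2).
baseline = dpo_loss(-5.0, -7.0, -5.0, -7.0)

# Raising the preferred response's likelihood relative to the reference lowers the loss.
improved = dpo_loss(-4.0, -8.0, -5.0, -7.0)
```

No reward model appears anywhere in this computation: the log-ratio against the reference model plays the role of the reward, which is why the approach reduces to a single supervised-style training stage. In practice the loss would be averaged over a batch and differentiated by a deep learning framework; this sketch only shows the arithmetic.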
