What is RLHF?
Reinforcement Learning from Human Feedback: training a model based on human preference rankings of its outputs.
By Anish · Founder · Vedwix
Definition
RLHF trains a model to align with human preferences. After supervised fine-tuning (SFT), human annotators rank multiple model outputs for the same prompt, a reward model is trained to predict those rankings, and the LLM is then fine-tuned with reinforcement learning (typically PPO) to maximize that reward. RLHF, along with alternatives like DPO, is how frontier models get their helpfulness and safety behavior.
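The hinge of the pipeline is the reward model: a scorer trained on pairwise human rankings so that preferred responses score higher than rejected ones. Here is a minimal sketch of that step, assuming PyTorch; the RewardModel class, the embedding size, and the random stand-in data are illustrative, not how any particular lab implements it.

```python
# Minimal sketch of the reward-model step in RLHF (illustrative, not a
# production recipe). A real reward model is usually a fine-tuned LLM
# backbone with a scalar head, trained on large sets of human rankings.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RewardModel(nn.Module):
    """Toy scorer: maps an embedding of a (prompt, response) pair to a scalar reward."""

    def __init__(self, embed_dim: int = 768):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(embed_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, pair_embedding: torch.Tensor) -> torch.Tensor:
        return self.scorer(pair_embedding).squeeze(-1)


def preference_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise (Bradley-Terry) loss: push the chosen response's reward above the rejected one's."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()


# One illustrative training step on random stand-in embeddings.
model = RewardModel()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

chosen_emb = torch.randn(8, 768)    # embeddings of human-preferred responses
rejected_emb = torch.randn(8, 768)  # embeddings of dispreferred responses

optimizer.zero_grad()
loss = preference_loss(model(chosen_emb), model(rejected_emb))
loss.backward()
optimizer.step()
```

Once trained, the reward model stands in for the human annotator in the RL loop: the policy is updated (typically with PPO, usually with a KL penalty toward the SFT model) to produce responses the reward model rates highly.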
Example
OpenAI's post-training pipeline for GPT-4 uses RLHF extensively to align the model with human preferences.
How Vedwix uses RLHF in client work
RLHF is rare in client work because it needs scale: large preference datasets, a trained reward model, and significant RL compute. For smaller alignment tasks we occasionally use DPO, which skips the reward model and learns directly from preference pairs, as sketched below.
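For reference, this is what DPO optimizes, as a minimal sketch assuming PyTorch. The log-probability inputs stand in for per-response log-probs from the policy being trained and from a frozen reference (SFT) model; the numbers and the beta value are illustrative only.

```python
# Minimal sketch of the DPO objective (illustrative values, not real log-probs).
import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logp: torch.Tensor,
             policy_rejected_logp: torch.Tensor,
             ref_chosen_logp: torch.Tensor,
             ref_rejected_logp: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO: widen the policy's chosen-vs-rejected margin relative to the
    reference model, with no separate reward model or RL loop."""
    policy_margin = policy_chosen_logp - policy_rejected_logp
    ref_margin = ref_chosen_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()


# Toy call with stand-in log-probabilities for a batch of 4 preference pairs.
loss = dpo_loss(torch.tensor([-3.0, -2.5, -4.0, -3.2]),
                torch.tensor([-3.5, -3.0, -3.8, -4.0]),
                torch.tensor([-3.1, -2.7, -4.1, -3.3]),
                torch.tensor([-3.4, -2.9, -3.9, -3.9]))
```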
Building with RLHF?
We ship this.
If you're building with RLHF in production, we can help — from architecture review to full implementation.
Brief us