RLHF: Helpful & Harmless
Bai, Y., Jones, A., Ndousse, K., Askell, A., Chen, A., ... & Kaplan, J. (2022). Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862.
In 2022, the 'Helpful and Harmless' paper from Anthropic deepened the understanding of how Reinforcement Learning from Human Feedback (RLHF) can be used to align AI behavior. While previous work had focused on following simple instructions, this paper explored the inherent trade-offs between being useful to a user and avoiding harmful content. The researchers argued that alignment is not a single target but a multi-dimensional space that requires careful data collection and model tuning, positioning safety as a core design requirement rather than an afterthought.
Preference Modeling
Figure: Crowdworker preferences across different model types and training methods.
The technical shift was the use of large-scale preference modeling to capture the nuances of what humans consider 'helpful' and 'harmless.' By showing crowdworkers multiple model responses and asking them to rank them, the researchers built a preference model that could then be used to train the main assistant. As the paper states, 'We find that RLHF leads to significant improvements in both helpfulness and harmlessness across all model sizes.' This revealed that the model's ability to be a good assistant is directly tied to the quality of the feedback it receives during training.
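The ranking setup described above is commonly formalized with a Bradley-Terry model: the probability that response A is preferred over response B is the sigmoid of the difference in their scalar reward scores. Below is a minimal sketch of that objective using a toy linear reward over hand-made feature vectors; the paper's actual preference model is a large transformer, and the feature encoding here is purely illustrative.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def reward(weights, features):
    # Scalar preference score for one response (toy linear model).
    return sum(w * f for w, f in zip(weights, features))

def train_preference_model(comparisons, dim, lr=0.1, epochs=200):
    # comparisons: list of (features_chosen, features_rejected) pairs,
    # standing in for crowdworker rankings of model responses.
    w = [0.0] * dim
    for _ in range(epochs):
        for chosen, rejected in comparisons:
            # Bradley-Terry: P(chosen preferred) = sigmoid(r_c - r_r).
            p = sigmoid(reward(w, chosen) - reward(w, rejected))
            # Gradient step on -log p (maximize preference likelihood).
            for i in range(dim):
                w[i] += lr * (1.0 - p) * (chosen[i] - rejected[i])
    return w

# Toy features: index 0 crudely encodes "answers the question",
# index 1 encodes "contains harmful content".
comparisons = [
    ([1.0, 0.0], [0.0, 0.0]),  # helpful beats unhelpful
    ([1.0, 0.0], [1.0, 1.0]),  # harmless beats harmful
]
w = train_preference_model(comparisons, dim=2)
print(reward(w, [1.0, 0.0]) > reward(w, [1.0, 1.0]))  # True: harmful ranked lower
```

Once trained, such a model can score any new response, which is what makes it usable as the reward signal for the subsequent RL stage.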
The Alignment Tax
The reasoning behind this work was the observation of an 'alignment tax': the phenomenon where making a model safer or more helpful can sometimes lead to a decrease in its performance on other tasks. The researchers found that larger models were more robust to this tax, suggesting that scale provides the necessary capacity to handle conflicting objectives. This suggested that building a safe AI is not just about constraints, but about having a model large enough to understand the complexity of human values.
Iterative Online Learning
The success of this approach highlighted the importance of iterative, 'online' learning, where the model is continuously updated based on new human interactions. This creates a feedback loop that allows the model to adapt to increasingly subtle and difficult scenarios. It raises the question of how we can scale this human-in-the-loop process to a global level, and whether the values of a small group of crowdworkers can ever truly represent the diverse needs of all users.
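The loop described above can be sketched schematically. Every component below is a toy stand-in: responses are scalars, "quality" is the number itself, and the labeler simply prefers the higher one. The point is the structure, which is to deploy the current policy, collect fresh comparisons on its own outputs, refit the preference target, and update the policy, then repeat.

```python
import random

random.seed(0)

def sample_responses(policy_mean, n=2):
    # Toy policy: responses are scalars drawn near the policy's mean.
    return [random.gauss(policy_mean, 1.0) for _ in range(n)]

def human_label(a, b):
    # Stand-in crowdworker: prefers the higher-quality response.
    return (a, b) if a > b else (b, a)

def online_rlhf(rounds=5, pairs_per_round=50):
    policy_mean = 0.0
    for _ in range(rounds):
        # 1. Deploy the current policy and collect fresh comparisons.
        data = [human_label(*sample_responses(policy_mean))
                for _ in range(pairs_per_round)]
        # 2. "Refit" the preference model: here, just the average
        #    chosen response defines the new target.
        target = sum(chosen for chosen, _ in data) / len(data)
        # 3. Update the policy toward what labelers preferred.
        policy_mean += 0.5 * (target - policy_mean)
    return policy_mean

print(online_rlhf() > 0.0)  # True: policy drifts toward preferred behavior
```

Because each round's comparisons come from the policy's own latest outputs, the labeling effort is always spent on the current frontier of model behavior, which is the key efficiency argument for the online setup.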
Dive Deeper
Anthropic RLHF Blog
Anthropic ⢠article
Explore ResourceRLHF Concept Guide
Hugging Face ⢠article
Explore Resource