[0:00]It's a mouthful, but you've almost certainly seen the impact of reinforcement learning from human feedback, that's abbreviated to R L H F.
[0:12]And you've seen it whenever you interact with a large language model. R L H F is a technique used to enhance the performance and alignment of AI systems with human preferences and values.
[0:26]You see, LLMs are trained, and they learn all sorts of stuff. And we need to be careful how some of that stuff surfaces to the user.
[0:36]So for example, if I ask an LLM how I can get revenge on somebody who's wronged me, then without the benefit of R L H F, we might get a response that says something like, spread rumors about them to their friends.
[0:49]But it's much more likely an LLM will respond with something like this.
[0:56]Now, this is a bit more of a boring standard LLM response, but it is better aligned to human values.
[1:02]That's the impact of R L H F. So, let's get into what R L H F is, how it works, and where it can be helpful or a hindrance.
[1:13]And we'll start by defining the R L in R L H F, which is reinforcement learning.
[1:20]Now, conceptually, reinforcement learning aims to emulate the way that human beings learn.
[1:27]AI agents learn holistically through trial and error, motivated by strong incentives to succeed.
[1:32]It's actually a mathematical framework, which consists of a few components.
[1:37]So let's take a look at some of those. So first of all, we have a component called the state space.
[1:46]The state space is all available information about the task at hand that is relevant to decisions the AI agent might make.
[1:54]The state space usually changes with each decision the agent makes. Another component is the action space.
[2:07]The action space contains all of the decisions the AI agent might make.
[2:12]Now, in the context of, let's say, a board game, the action space is discrete and well-defined.
[2:18]It's all the legal moves available to the AI player at a given moment.
[2:23]For text generation, well, the action space is massive, the entire vocabulary of all of the tokens available to a large language model.
[2:32]Another component is the reward function and this one really is key to reinforcement learning.
[2:41]It's the measure of success or progress that incentivizes the AI agent.
[2:47]So for the board game, it's to win the game, easy enough.
[2:51]But when the definition of success is nebulous, designing an effective reward function, it can be a bit of a challenge.
[2:58]There's also constraints that we need to be concerned about here.
[3:04]Constraints where the reward function could be supplemented by penalties for actions deemed counterproductive to the task at hand, like the chatbot telling its users to spread rumors.
[3:16]And then underlying all of this, we have policy.
[3:22]Policy is essentially the strategy or the thought process that drives an AI agent's behavior.
[3:28]In mathematical terms, a policy is a function that takes a state as input and returns an action.
[3:36]The goal of an R L algorithm is to optimize a policy to yield maximum reward.
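These components fit together in a toy example. Here's a minimal sketch, assuming a hypothetical five-position line world where the agent must reach the goal at the right end. The state space, action space, reward function, and policy are all made concrete, and a simple Q-learning loop stands in for the trial-and-error learning described above (this is an illustration, not any production RL library):

```python
import random

random.seed(0)  # for reproducibility

states = range(5)    # state space: positions 0..4 on a line
actions = [-1, +1]   # action space: step left or step right

def reward(state):
    # reward function: +1 only for reaching the goal at position 4
    return 1.0 if state == 4 else 0.0

# Q-table; the policy below is derived from it
q = {(s, a): 0.0 for s in states for a in actions}

def policy(state, epsilon=0.1):
    # policy: a function that takes a state and returns an action
    if random.random() < epsilon:                      # explore
        return random.choice(actions)
    return max(actions, key=lambda a: q[(state, a)])   # exploit

# trial-and-error learning loop (simple Q-learning update)
for episode in range(500):
    s = 0
    while s != 4:
        a = policy(s)
        s_next = min(max(s + a, 0), 4)   # stay within the line
        r = reward(s_next)
        best_next = max(q[(s_next, b)] for b in actions)
        q[(s, a)] += 0.5 * (r + 0.9 * best_next - q[(s, a)])
        s = s_next

print(policy(0, epsilon=0.0))  # learned greedy action from the start state
```

After training, the greedy policy at the start state should choose +1, stepping toward the rewarded goal, which is exactly the "optimize a policy to yield maximum reward" idea in miniature.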
[3:42]Conventional R L has achieved impressive real-world results in many fields, but it can struggle to construct a good reward function for complex tasks where a clear-cut definition of success is hard to establish.
[4:00]So, enter us human beings with R L H F, with its ability to capture nuance and subjectivity by using positive human feedback in lieu of formally defined objectives.
[4:13]So, how does R L H F actually work?
[4:17]Well, in the realm of large language models, R L H F typically occurs in four phases.
[4:24]So let's take a brief look at each one of those. Now, phase one, where we're going to start here, is with a pre-trained model.
[4:41]We can't perform this process without it.
[4:44]Now, R L H F is generally employed to fine-tune and optimize existing models, so an existing pre-trained model rather than as an end-to-end training method.
[4:57]Now with a pre-trained model at the ready, we can move on to the next phase, which is supervised fine-tuning of this model.
[5:09]Now, supervised fine-tuning is used to prime the model to generate its responses in the format expected by users.
[5:16]The LLM pre-training process optimizes models for completion, predicting the next words in a sequence.
[5:23]Now sometimes LLMs won't complete a sequence in a way that the user wants.
[5:30]So for example, if a user prompt is teach me how to make a resume, the LLM might respond with using Microsoft Word.
[5:37]I mean, it's valid, but it's not really aligned with the user's goal.
[5:42]Supervised fine-tuning trains models to respond appropriately to different kinds of prompts, and this is where the humans come in because human experts create labeled examples to demonstrate how to respond to prompts for different use cases like question answering or summarization or translation.
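To make that concrete, a labeled example a human expert writes might look something like this. The schema here is hypothetical; real fine-tuning datasets vary by pipeline:

```python
# Hypothetical prompt/response pairs a human expert might author for
# supervised fine-tuning (the exact schema varies across pipelines).
sft_examples = [
    {"task": "question answering",
     "prompt": "Teach me how to make a resume",
     "response": "Happy to help! Start with your contact details, then "
                 "list your work experience in reverse chronological order."},
    {"task": "summarization",
     "prompt": "Summarize: The meeting covered Q3 revenue and hiring plans.",
     "response": "The meeting focused on Q3 revenue and hiring."},
]

# During fine-tuning, the model is trained to produce each response
# given its prompt, rather than merely continuing the prompt's text.
for ex in sft_examples:
    print(ex["task"], "->", len(ex["response"]), "chars")
```

The key point is that each pair teaches the model the expected response format for a use case, not just plausible next tokens.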
[6:01]Then we move to reward model training.
[6:09]So now we're actually going to train a reward model here.
[6:12]We need a reward model to translate human preferences into a numerical reward signal.
[6:20]The main purpose of this phase is to provide the reward model with sufficient training data, and what I mean by that is direct feedback from human evaluators.
[6:30]And that will help the model to learn to mimic the way that human preferences allocate rewards to different kinds of model responses.
[6:36]This lets training continue offline without the human in the loop.
[6:40]Now, a reward model must intake a sequence of text and output a single reward value that predicts numerically how much a user would reward or penalize that text.
[6:53]Now, while it might seem intuitive to simply have human evaluators express their opinion of each model response with a rating scale of, let's say, one for worst and ten for best, it's difficult to get all human raters aligned on the relative value of a given score.
[7:09]Instead, a rating system is usually built by comparing human feedback for different model outputs.
[7:15]Now often this is done by having users compare two text sequences, like the outputs of two different large language models, responding to the same prompt in head-to-head match-ups and then using an Elo rating system to generate an aggregated ranking of each bit of generated text relative to one another.
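As a rough sketch of the Elo idea (these are the standard Elo formulas, not any specific RLHF implementation): when output A beats output B in a head-to-head comparison, A's rating rises and B's falls, with larger updates when the result was an upset:

```python
def elo_update(rating_winner, rating_loser, k=32):
    # Expected score of the winner under the standard Elo model.
    expected_win = 1 / (1 + 10 ** ((rating_loser - rating_winner) / 400))
    # The winner gains more when the win was less expected.
    delta = k * (1 - expected_win)
    return rating_winner + delta, rating_loser - delta

# Two model outputs start at the same rating; one wins a comparison.
a, b = elo_update(1000, 1000)
print(a, b)  # 1016.0 984.0
```

Repeating this over many pairwise comparisons yields an aggregate ranking of outputs relative to one another, which is what the reward model then learns from.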
[7:34]Now a simple system might allow users to thumbs up or thumbs down each output with outputs then being ranked by their relative favorability.
[7:43]More complex systems might ask labelers to provide an overall rating and answer categorical questions about the flaws of each response, then aggregate this feedback into weighted quality scores.
[7:55]But either way, the outcomes of the ranking systems are ultimately normalized into a reward signal to inform reward model training.
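One common formulation in the RLHF literature for turning those rankings into a training signal is a Bradley-Terry style pairwise loss; a minimal sketch with scalar scores standing in for the reward model's outputs:

```python
import math

def preference_loss(score_preferred, score_rejected):
    # Bradley-Terry style objective often used for reward model training:
    # minimize -log(sigmoid(preferred - rejected)), which pushes the
    # reward model to score the human-preferred response higher.
    margin = score_preferred - score_rejected
    return -math.log(1 / (1 + math.exp(-margin)))

# The loss shrinks as the preferred response's score pulls ahead.
print(round(preference_loss(0.0, 0.0), 3))  # 0.693 (= log 2, undecided)
print(round(preference_loss(3.0, 0.0), 3))  # small: model agrees with humans
```

Minimizing this over many human-ranked pairs is what teaches the reward model to emit a single scalar that mimics human preferences.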
[8:06]Now, the final hurdle of R L H F is determining how, and how much, the reward model should be used to update the AI agent's policy.
[8:15]And that is called policy optimization.
[8:22]We want to maximize reward, but if the reward function is used to train the LLM without any guardrails, the language model may dramatically change its weights to the point of outputting gibberish in an effort to game the reward system.
[8:38]Now an algorithm such as P P O, or Proximal Policy Optimization, limits how much the policy can be updated in each training iteration.
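The core of that guardrail is P P O's clipped objective, a standard formulation from the PPO paper, simplified here to scalars for illustration:

```python
def ppo_clipped_objective(ratio, advantage, eps=0.2):
    # ratio: new_policy_prob / old_policy_prob for the action taken.
    # Clipping the ratio to [1 - eps, 1 + eps] caps how much credit a
    # single update can claim, preventing runaway weight changes.
    clipped_ratio = max(min(ratio, 1 + eps), 1 - eps)
    # Take the pessimistic (minimum) of the two estimates.
    return min(ratio * advantage, clipped_ratio * advantage)

# A large policy shift (ratio 3.0) gains no extra credit beyond the clip:
print(ppo_clipped_objective(3.0, advantage=1.0))  # 1.2
```

In practice this objective is maximized over batches of sampled responses, keeping each training iteration's policy close to the previous one, which is what stops the model from gaming the reward by drifting into gibberish.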
[8:51]Okay. Now, though R L H F models have demonstrated impressive results in training A I agents for all sorts of complex tasks, from robotics and video games to N L P, using R L H F is not without its limitations.
[9:04]So let's think about some of those. Now, gathering all of this first-hand human input, I think it's pretty obvious to say, can be quite expensive, and it can create a costly bottleneck that limits model scalability.
[9:21]Also, you know, us humans and our feedback, it's highly subjective.
[9:27]So we need to consider that as well. It's difficult, if not impossible, to establish firm consensus on what constitutes high-quality output, as human annotators will often disagree on what high-quality model behavior actually should mean.
[9:40]There is no human ground truth against which the model can be judged.
[9:46]Now we also have to be concerned about bad actors, so adversarial input.
[9:52]Now, adversarial input could be entered into this process, where human guidance to the model is not always provided in good faith.
[10:02]That would essentially be R L H F trolling. And R L H F also has risks of overfitting and bias, which, you know, we talk about a lot with machine learning.
[10:14]And in this case, if human feedback is gathered from a narrow demographic, the model may demonstrate performance issues when used by different groups or prompted on subject matters for which the human evaluators hold certain biases.
[10:28]Now, all of these limitations do beg a question.
[10:32]The question of can A I perform reinforcement learning for us?
[10:37]Can it do it without the humans? And there are proposed methods for something called R L A I F.
[10:44]That stands for reinforcement learning from A I feedback.
[10:50]It replaces some or all of the human feedback by having another large language model evaluate model responses, and it may help overcome some or all of these limitations.
[11:01]But at least for now, reinforcement learning from human feedback remains a popular and effective method for improving the behavior and performance of models, aligning them closer to our own desired human behaviors.



