Improving Large Language Models with Automatic Prompt Optimization (APO) from Microsoft Research
Many developers want to integrate ChatGPT into their products, but raw user prompts often cannot be used directly: they must be preprocessed before Large Language Models (LLMs) can generate correct output, and an LLM's performance remains highly dependent on its prompt. To ease prompt engineering, Microsoft researchers have developed a new prompt optimization method called Automatic Prompt Optimization (APO), inspired by numerical gradient descent and beam search, which can be leveraged in these scenarios. In this article, we discuss the algorithm.
Dataset Preparation
The dataset is created by generating N scenarios for every prompt. When a prompt is applied to an input x, it generates y as the output. The dataset therefore has size P x N, where P is the number of prompts and N the number of scenarios. Although it is assumed that every prompt shares the same scenarios, they can vary too.
Each record in the dataset can be represented as a tuple (p, x, y).
Example:
p = Is the following text hate speech?
x = "Do you know why he is smiling because there is no “excretion law” in New Zealand! The max sentence he will receive from a judge is no more than 27 years in prison! Is this justice? Or because Muslims lives don’t matter!??? :((("
y = No
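The (p, x, y) layout above can be sketched in a few lines of Python. This is a minimal illustration, not code from the paper; `build_dataset` is a hypothetical helper that crosses P prompts with N labeled scenarios.

```python
# Sketch of the dataset layout: every prompt p is paired with every
# (x, y) scenario, giving P x N (p, x, y) tuples.
def build_dataset(prompts, scenarios):
    """Cross P prompts with N (input, gold-label) scenario pairs."""
    return [(p, x, y) for p in prompts for (x, y) in scenarios]

prompts = ["Is the following text hate speech?"]
scenarios = [
    ("The max sentence he will receive from a judge is no more than "
     "27 years in prison! Is this justice?", "No"),
]
dataset = build_dataset(prompts, scenarios)  # here P x N = 1 x 1 = 1 tuple
```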
Algorithm
The algorithm utilizes small batches of data to create natural-language gradients that provide feedback on the current prompt. These gradients are then incorporated by editing the prompt in the opposite semantic direction indicated by the gradient. It uses a text-based Socratic dialogue approach that mirrors the steps of gradient descent: instead of differentiation, we leverage feedback from the LLM, and instead of backpropagation, we use LLM editing. Let us walk through the steps with an example.
Step 1: Gradient Descent with Prompts
Initially, we assess a prompt on a batch of data, as illustrated in Figure 2. By comparing predicted labels with the gold labels, we compute a local loss and collect the misclassified examples (the error string). The following template then guides the LLM to articulate the issues with p0 that might have caused these mistakes. These natural-language descriptions serve as our gradients for improvement.
# Prompt template to generate gradients from the error string
I'm trying to write a zero-shot classifier prompt.
My current prompt is: "{prompt}"
But this prompt gets the following examples wrong:
{error_string}
give {num_feedbacks} reasons why the prompt could have gotten these examples wrong.
Wrap each reason with <START> and <END>
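This gradient step can be sketched in Python. The helper names (`get_gradients`, `stub_llm`) are my own illustration, with a canned stub in place of a real chat-completion call; the template is the one shown above.

```python
import re

# The gradient-generation template above, as a format string.
GRADIENT_TEMPLATE = (
    'I\'m trying to write a zero-shot classifier prompt.\n'
    'My current prompt is: "{prompt}"\n'
    'But this prompt gets the following examples wrong:\n'
    '{error_string}\n'
    'give {num_feedbacks} reasons why the prompt could have gotten '
    'these examples wrong.\n'
    'Wrap each reason with <START> and <END>'
)

def get_gradients(llm, prompt, error_string, num_feedbacks=2):
    """Return the LLM's natural-language 'gradients' for a failing prompt."""
    response = llm(GRADIENT_TEMPLATE.format(
        prompt=prompt, error_string=error_string, num_feedbacks=num_feedbacks))
    # Each reason is wrapped in <START> ... <END> per the template.
    return re.findall(r"<START>(.*?)<END>", response, flags=re.DOTALL)

def stub_llm(text):  # placeholder for a real model call
    return ("<START>The prompt is too vague about what counts as hate speech.<END>"
            "<START>The prompt gives no output-format instructions.<END>")

gradients = get_gradients(stub_llm,
                          "Is the following text hate speech?",
                          "Text: ... / Predicted: Yes / Label: No")
```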
Step 2: Beam Search over Prompts: Expansion Step
After computing the error strings in the previous step according to Algorithm 2 (Figure 4, Line 2), we proceed to integrate the gradient feedback into the current prompt, p0, to generate successor candidates.
The gradients generated in the previous step are then fed to another LLM prompt, which instructs the LLM to edit the current prompt, p0, with the objective of resolving the problems described by the gradient.
The substrings in curly braces represent dynamically loaded variables: the current prompt, the error string, the text-feedback gradient, and the expansion factor.
# LLM prompt to expand prompts
I'm trying to write a zero-shot classifier.
My current prompt is:
"{prompt}"
But it gets the following examples wrong: {error_samples}
Based on these examples the problem with this prompt is that {gradient}
Based on the above information, I wrote
{steps_per_gradient} different improved prompts.
Each prompt is wrapped with <START> and <END>.
The {steps_per_gradient} new prompts are:
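The expansion step can be sketched the same way. Again, `expand_prompt` and `stub_llm` are hypothetical names of my own, with a canned stub standing in for a real model call; the editing template is the one shown above.

```python
import re

# The prompt-editing template above, as a format string.
EXPANSION_TEMPLATE = (
    'I\'m trying to write a zero-shot classifier.\n'
    'My current prompt is:\n"{prompt}"\n'
    'But it gets the following examples wrong: {error_samples}\n'
    'Based on these examples the problem with this prompt is that {gradient}\n'
    'Based on the above information, I wrote {steps_per_gradient} different '
    'improved prompts.\n'
    'Each prompt is wrapped with <START> and <END>.\n'
    'The {steps_per_gradient} new prompts are:'
)

def expand_prompt(llm, prompt, error_samples, gradient, steps_per_gradient=2):
    """Generate successor prompts that address one textual gradient."""
    response = llm(EXPANSION_TEMPLATE.format(
        prompt=prompt, error_samples=error_samples,
        gradient=gradient, steps_per_gradient=steps_per_gradient))
    return re.findall(r"<START>(.*?)<END>", response, flags=re.DOTALL)

def stub_llm(text):  # placeholder for a real model call
    return ("<START>Classify whether the text attacks a protected group.<END>"
            "<START>Answer Yes only if the text expresses hatred toward a group.<END>")

candidates = expand_prompt(stub_llm, "Is the following text hate speech?",
                           "Text: ... / Predicted: Yes / Label: No",
                           "the prompt is too vague")
```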
In addition to sampling from the prompts influenced by text gradients, the algorithm expands exploration by conducting a small Monte Carlo search in the local search space around the new prompt candidates and employs the following prompt to guide this search.
Generate a variation of the following instruction while keeping
the semantic meaning.
Input: {prompt_instruction}
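The Monte Carlo step simply re-applies the paraphrase prompt above to each successor candidate. A minimal sketch, with `monte_carlo_neighbors` and `stub_llm` as hypothetical helpers (a real model would return a different paraphrase on each call):

```python
# The paraphrase template above, as a format string.
PARAPHRASE_TEMPLATE = (
    "Generate a variation of the following instruction while keeping\n"
    "the semantic meaning.\n"
    "Input: {prompt_instruction}"
)

def monte_carlo_neighbors(llm, prompt, num_samples=3):
    """Sample paraphrases of `prompt` from its local search space."""
    query = PARAPHRASE_TEMPLATE.format(prompt_instruction=prompt)
    return [llm(query) for _ in range(num_samples)]

def stub_llm(text):  # placeholder; a real model samples varied paraphrases
    return "Does the text below constitute hate speech?"

neighbors = monte_carlo_neighbors(stub_llm, "Is the following text hate speech?")
```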
Step 3: Beam Search over Prompts: Selection Step
After the expansion process generates multiple successor candidates for each candidate prompt, the selection step determines which candidates are the most promising and should remain on the beam for the next iteration.
The specific method for selecting these candidates can vary depending on the problem at hand. The approaches adopted in the paper include UCB Bandits, Extended UCB Bandits, Successive Rejects, and Successive Halving.
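To illustrate the bandit view, here is a sketch of UCB1 selection under my own assumptions: each candidate prompt is an arm, a "pull" scores it (in practice, accuracy on a random minibatch, here a deterministic `fake_accuracy` table), and the evaluation budget is spent where the upper confidence bound is highest.

```python
import math

def ucb_select(candidates, score_fn, budget=100, c=2.0):
    """Spend `budget` evaluations across candidates using the UCB1 index;
    return the index of the candidate with the highest mean score."""
    counts = [0] * len(candidates)
    totals = [0.0] * len(candidates)
    for t in range(1, budget + 1):
        if t <= len(candidates):   # pull every arm once first
            arm = t - 1
        else:                      # then follow the UCB1 index
            arm = max(range(len(candidates)),
                      key=lambda j: totals[j] / counts[j]
                      + math.sqrt(c * math.log(t) / counts[j]))
        totals[arm] += score_fn(candidates[arm])
        counts[arm] += 1
    return max(range(len(candidates)), key=lambda j: totals[j] / counts[j])

# Deterministic toy scores in place of real minibatch accuracy.
fake_accuracy = {"prompt A": 0.55, "prompt B": 0.80, "prompt C": 0.62}
best = ucb_select(list(fake_accuracy), fake_accuracy.get)
```

With noisy minibatch scores the same loop concentrates pulls on promising prompts instead of evaluating every candidate on the full dataset.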
Result
In their empirical study, the research team conducted a comparison between their APO framework and three advanced prompt learning baselines, namely Monte-Carlo (MC, Zhou et al., 2022), RL, and AutoGPT. The comparison was performed across several NLP tasks including Jailbreak detection, Ethos (hate speech detection, Mollas et al., 2020), Liar (fake news detection, Wang, 2017), and Sarcasm detection (Farha and Magdy, 2020).
APO outperformed the baseline methods on all four tasks, with significant improvements of 3.9 percent and 8.2 percent over MC and RL, respectively. Notably, these gains were achieved without additional hyperparameter tuning or model training, highlighting APO's efficiency and effectiveness at enhancing prompts.