Improving Large Language Models with Automatic Prompt Optimization (APO) from Microsoft Research

Many developers want to integrate ChatGPT into their products, but user prompts are rarely usable as-is: they must be preprocessed before a Large Language Model (LLM) can generate correct output, and an LLM's performance remains highly sensitive to how its prompt is worded. To ease prompt engineering, Microsoft researchers have developed a new prompt optimization method called Automatic Prompt Optimization (APO), inspired by numerical gradient descent and beam search, which can be leveraged in these scenarios. In this article, we walk through the algorithm.

Dataset Preparation

The dataset is created by generating N contexts for every prompt. Applying a prompt p to a context x produces an output y. The dataset therefore has size P × N, where P is the number of prompts and N is the number of contexts (scenarios) per prompt. Although we assume here that every prompt shares the same scenarios, they may also vary.

Figure 1: Dataset

The dataset can be visualized as a set of tuples (p, x, y).

  • p represents the prompt
  • x represents the context of the question asked
  • y is the result

Example:

p = Is the following text hate speech?

x = "Do you know why he is smiling because there is no “excretion law” in New Zealand! The max sentence he will receive from a judge is no more than 27 years in prison! Is this justice? Or because Muslims lives don’t matter!??? :((("

y = No
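The (p, x, y) layout above can be sketched as a plain Python structure. This is a minimal illustration with toy placeholder texts (a real dataset would be loaded from disk):

```python
# A minimal sketch of the (p, x, y) dataset layout described above.
prompt = "Is the following text hate speech?"

dataset = [
    # (prompt p, context x, gold label y)
    (prompt, "Example context one ...", "No"),
    (prompt, "Example context two ...", "Yes"),
]

# P prompts x N contexts per prompt -> P * N (p, x, y) tuples overall.
for p, x, y in dataset:
    print(f"p={p!r}\nx={x!r}\ny={y!r}\n")
```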


Algorithm

The algorithm uses small batches of data to create natural-language "gradients" that provide feedback on the current prompt. These gradients are then incorporated by editing the prompt in the opposite semantic direction indicated by the gradient. This text-based, Socratic-dialogue approach mirrors the steps of gradient descent: feedback from the LLM replaces differentiation, and LLM-driven editing replaces backpropagation. Let us try to understand this with the help of an example.

Step 1: Gradient descent with Prompts

First, we evaluate a prompt p0 on a batch of data, as illustrated in Figure 2. By comparing the predicted labels with the gold labels, we compute a local loss and identify the errors. The following template then guides the LLM to articulate the issues with p0 that might have caused these mistakes. These natural-language error descriptions serve as our gradients for improvement.

# Prompt template to generate error strings

I'm trying to write a zero-shot classifier prompt.
My current prompt is: "{prompt}"
But this prompt gets the following examples wrong:
{error_string}
give {num_feedbacks} reasons why the prompt could have gotten these examples wrong.
Wrap each reason with <START> and <END>
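A gradient-generation step along these lines can be sketched in a few lines of Python. Here `llm` is a hypothetical stand-in for a real chat-completion API call; its canned reply exists only so the example runs end to end:

```python
import re

def llm(prompt: str) -> str:
    """Hypothetical stand-in for a chat-completion API call (illustration only)."""
    return ("<START>The prompt is too vague.<END>"
            "<START>It lacks label definitions.<END>")

# The feedback template from the article, filled in at call time.
FEEDBACK_TEMPLATE = (
    "I'm trying to write a zero-shot classifier prompt.\n"
    'My current prompt is: "{prompt}"\n'
    "But this prompt gets the following examples wrong:\n{error_string}\n"
    "give {num_feedbacks} reasons why the prompt could have gotten these "
    "examples wrong. Wrap each reason with <START> and <END>"
)

def get_gradients(prompt: str, error_string: str,
                  num_feedbacks: int = 2) -> list[str]:
    """Ask the LLM for natural-language 'gradients' describing the prompt's flaws."""
    reply = llm(FEEDBACK_TEMPLATE.format(
        prompt=prompt, error_string=error_string, num_feedbacks=num_feedbacks))
    # Each reason is wrapped in <START>...<END>, so a non-greedy regex extracts them.
    return re.findall(r"<START>(.*?)<END>", reply, flags=re.DOTALL)

gradients = get_gradients("Is the following text hate speech?",
                          "x: ... / predicted: Yes / gold: No")
```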
Figure 2: The text dialogue tree used to mirror gradient descent and overcome the discrete optimization barrier. First, a feedback prompt generates the gradient g from the input data (x, y), the starting prompt p0, and the prediction ŷ (left). Second, an editing prompt applies the gradient g to p0 to produce an improved prompt p' (right). Source: https://arxiv.org/pdf/2305.03495.pdf

Step 2: Beam Search over Prompts: Expansion Step

Figure 3 Source: https://arxiv.org/pdf/2305.03495.pdf
Figure 4 Source: https://arxiv.org/pdf/2305.03495.pdf

After computing the error strings in the previous step (Algorithm 2; Figure 4, Line 2), we integrate the gradient feedback into the current prompt, p0, to generate successor candidates.

The gradients generated in the previous step are then used by another LLM prompt, which instructs the LLM to edit the current prompt, p0, so as to resolve the problems described by the gradient.

The substrings in brackets represent dynamically loaded variables corresponding to the initial prompt, error string, text feedback gradient, and expansion factor.

# LLM prompt to expand prompts

I'm trying to write a zero-shot classifier.
My current prompt is: "{prompt}"
But it gets the following examples wrong: {error_samples}
Based on these examples the problem with this prompt is that {gradient}
Based on the above information, I wrote {steps_per_gradient} different improved prompts.
Each prompt is wrapped with <START> and <END>.
The {steps_per_gradient} new prompts are:
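The expansion step can be sketched the same way as gradient generation. Again, `llm` is a hypothetical stand-in for a real chat-completion call, with a canned reply so the example runs:

```python
import re

def llm(prompt: str) -> str:
    """Hypothetical stand-in for a chat-completion API call (illustration only)."""
    return ("<START>Classify whether the text contains hate speech; "
            "answer Yes or No.<END>"
            "<START>Does the text attack a person or group based on identity? "
            "Answer Yes or No.<END>")

# The editing template from the article, filled in at call time.
EDIT_TEMPLATE = (
    "I'm trying to write a zero-shot classifier.\n"
    'My current prompt is: "{prompt}"\n'
    "But it gets the following examples wrong: {error_samples}\n"
    "Based on these examples the problem with this prompt is that {gradient}\n"
    "Based on the above information, I wrote {steps_per_gradient} "
    "different improved prompts.\n"
    "Each prompt is wrapped with <START> and <END>.\n"
    "The {steps_per_gradient} new prompts are:"
)

def expand_prompt(prompt: str, error_samples: str, gradient: str,
                  steps_per_gradient: int = 2) -> list[str]:
    """Apply one text gradient to a prompt, yielding successor candidates."""
    reply = llm(EDIT_TEMPLATE.format(
        prompt=prompt, error_samples=error_samples, gradient=gradient,
        steps_per_gradient=steps_per_gradient))
    return re.findall(r"<START>(.*?)<END>", reply, flags=re.DOTALL)

candidates = expand_prompt(
    "Is the following text hate speech?",
    "x: ... / predicted: Yes / gold: No",
    "the prompt is too vague about what counts as hate speech")
```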

In addition to sampling from the prompts steered by text gradients, the algorithm broadens exploration by conducting a small Monte Carlo search in the local search space around the new prompt candidates, using the following prompt to guide the search.

Generate a variation of the following instruction while keeping 
the semantic meaning. 

Input: {prompt_instruction}        
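This paraphrasing step amounts to sampling a few semantically equivalent variants per candidate. A minimal sketch, with `llm` again a hypothetical stand-in for a chat-completion call (its fixed reply exists only so the example runs):

```python
def llm(prompt: str) -> str:
    """Hypothetical stand-in for a chat-completion API call (illustration only)."""
    return "Determine whether the text below constitutes hate speech."

PARAPHRASE_TEMPLATE = (
    "Generate a variation of the following instruction while keeping\n"
    "the semantic meaning.\n\n"
    "Input: {prompt_instruction}"
)

def monte_carlo_neighbors(candidate: str, n_samples: int = 3) -> list[str]:
    """Sample paraphrases in the local search space around one candidate prompt."""
    return [llm(PARAPHRASE_TEMPLATE.format(prompt_instruction=candidate))
            for _ in range(n_samples)]

neighbors = monte_carlo_neighbors("Is the following text hate speech?")
```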

Step 3: Beam Search over Prompts: Selection Step

Figure 5: Overview of the complete Automatic Prompt Optimization (APO) framework. Source: https://arxiv.org/pdf/2305.03495.pdf

After the expansion process generates multiple successor candidates for each candidate prompt, the selection step determines which candidates are the most promising and should remain on the beam for the next iteration.

The specific method for selecting these candidates can vary depending on the problem at hand. The approaches adopted in the paper include UCB Bandits, Extended UCB Bandits, Successive Rejects, and Successive Halving.
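As one concrete illustration, here is a minimal sketch of Successive Halving over prompt candidates. The `evaluate` scorer is a deterministic stand-in (it simply scores by prompt length so the example runs); a real implementation would return the prompt's accuracy on the sampled minibatch:

```python
import random

def evaluate(prompt: str, minibatch: list) -> float:
    """Stand-in scorer for illustration only: real code would measure the
    prompt's accuracy on the minibatch; here we deterministically use length."""
    return float(len(prompt))

def successive_halving(candidates: list[str], data: list) -> str:
    """Repeatedly re-score survivors on fresh minibatches and drop the worst half."""
    survivors = list(candidates)
    while len(survivors) > 1:
        minibatch = random.sample(data, min(4, len(data)))
        ranked = sorted(survivors,
                        key=lambda p: evaluate(p, minibatch), reverse=True)
        survivors = ranked[: max(1, len(ranked) // 2)]
    return survivors[0]

best = successive_halving(
    ["Is this hate speech?",
     "Classify the text as hate speech (Yes/No).",
     "Hate speech? Yes or No."],
    data=list(range(8)),
)
```

Evaluating on small random minibatches, rather than the full dataset, is what keeps selection cheap: weak candidates are discarded after only a few LLM calls.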

Result

In their empirical study, the research team conducted a comparison between their APO framework and three advanced prompt learning baselines, namely Monte-Carlo (MC, Zhou et al., 2022), RL, and AutoGPT. The comparison was performed across several NLP tasks including Jailbreak detection, Ethos (hate speech detection, Mollas et al., 2020), Liar (fake news detection, Wang, 2017), and Sarcasm detection (Farha and Magdy, 2020).

APO outperformed the baseline methods on all four tasks, demonstrating significant improvements of 3.9 percent and 8.2 percent over MC and RL, respectively. Notably, these improvements were achieved without additional hyperparameter tuning or model training, highlighting APO's efficiency and effectiveness at enhancing prompts.

References:

  • Pryzant et al., 2023. Automatic Prompt Optimization with "Gradient Descent" and Beam Search. https://arxiv.org/pdf/2305.03495.pdf
