Improving Large Language Models with Automatic Prompt Optimization (APO) from Microsoft Research
Many developers want to integrate ChatGPT into their products, but raw user prompts often cannot be used directly: they must be preprocessed before Large Language Models (LLMs) can generate correct output, and an LLM's performance remains highly dependent on its prompt. To ease prompt engineering, Microsoft researchers have developed a new prompt optimization method called Automatic Prompt Optimization (APO), inspired by numerical gradient descent and beam search, which can be leveraged in these scenarios. In this article, we discuss the algorithm.
Dataset Preparation
The dataset is created by generating N scenarios for every prompt. When a prompt is applied to an input x, it generates y as the output. The dataset therefore has size P x N, where P is the number of prompts and N the number of scenarios. Although it is assumed that every prompt shares the same scenarios, they can vary too.
Each record in the dataset can be represented as a tuple (p, x, y).
Example:
p = Is the following text hate speech?
x = "Do you know why he is smiling because there is no “excretion law” in New Zealand! The max sentence he will receive from a judge is no more than 27 years in prison! Is this justice? Or because Muslims lives don’t matter!??? :((("
y = No
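The (p, x, y) layout above can be sketched in a few lines of Python. This is a minimal illustration, not code from the paper; `build_dataset` is a hypothetical helper that crosses P prompts with N labeled scenarios.

```python
# Sketch of the dataset layout: every prompt p is paired with every
# (x, y) scenario, giving P x N (p, x, y) tuples.
def build_dataset(prompts, scenarios):
    """Cross P prompts with N (input, gold-label) scenario pairs."""
    return [(p, x, y) for p in prompts for (x, y) in scenarios]

prompts = ["Is the following text hate speech?"]
scenarios = [
    ("The max sentence he will receive from a judge is no more than "
     "27 years in prison! Is this justice?", "No"),
]
dataset = build_dataset(prompts, scenarios)  # here P x N = 1 x 1 = 1 tuple
```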
Algorithm
The algorithm utilizes small batches of data to create natural-language gradients that provide feedback on the current prompt. These gradients are then incorporated by editing the prompt in the opposite semantic direction indicated by the gradient. It uses a text-based Socratic dialogue approach that mirrors the steps of gradient descent: instead of differentiation, we leverage feedback from the LLM, and instead of backpropagation, we use LLM editing. Let us walk through the steps with an example.
Step 1: Gradient Descent with Prompts
Initially, we assess a prompt on a batch of data, as illustrated in Figure 2. By comparing predicted labels with the gold labels, we compute a local loss and collect the misclassified examples (the error string). The following template then guides the LLM to articulate the issues with p0 that might have caused these mistakes. These natural-language descriptions serve as our gradients for improvement.
# Prompt template to generate gradients from the error string
I'm trying to write a zero-shot classifier prompt.
My current prompt is: "{prompt}"
But this prompt gets the following examples wrong:
{error_string}
give {num_feedbacks} reasons why the prompt could have gotten these examples wrong.
Wrap each reason with <START> and <END>
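This gradient step can be sketched in Python. The helper names (`get_gradients`, `stub_llm`) are my own illustration, with a canned stub in place of a real chat-completion call; the template is the one shown above.

```python
import re

# The gradient-generation template above, as a format string.
GRADIENT_TEMPLATE = (
    'I\'m trying to write a zero-shot classifier prompt.\n'
    'My current prompt is: "{prompt}"\n'
    'But this prompt gets the following examples wrong:\n'
    '{error_string}\n'
    'give {num_feedbacks} reasons why the prompt could have gotten '
    'these examples wrong.\n'
    'Wrap each reason with <START> and <END>'
)

def get_gradients(llm, prompt, error_string, num_feedbacks=2):
    """Return the LLM's natural-language 'gradients' for a failing prompt."""
    response = llm(GRADIENT_TEMPLATE.format(
        prompt=prompt, error_string=error_string, num_feedbacks=num_feedbacks))
    # Each reason is wrapped in <START> ... <END> per the template.
    return re.findall(r"<START>(.*?)<END>", response, flags=re.DOTALL)

def stub_llm(text):  # placeholder for a real model call
    return ("<START>The prompt is too vague about what counts as hate speech.<END>"
            "<START>The prompt gives no output-format instructions.<END>")

gradients = get_gradients(stub_llm,
                          "Is the following text hate speech?",
                          "Text: ... / Predicted: Yes / Label: No")
```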
Step 2: Beam Search over Prompts: Expansion Step
After computing the error strings in the previous step according to Algorithm 2 (Figure 4, Line 2), we proceed to integrate the gradient feedback into the current prompt, p0, to generate successor candidates.
The gradients generated in the previous step are then fed to another LLM prompt, which instructs the LLM to edit the current prompt, p0, with the objective of resolving the problems described by the gradient.
The substrings in curly braces represent dynamically loaded variables: the current prompt, the error string, the text-feedback gradient, and the expansion factor.
# LLM prompt to expand prompts
I'm trying to write a zero-shot classifier.
My current prompt is:
"{prompt}"
But it gets the following examples wrong: {error_samples}
Based on these examples the problem with this prompt is that {gradient}
Based on the above information, I wrote
{steps_per_gradient} different improved prompts.
Each prompt is wrapped with <START> and <END>.
The {steps_per_gradient} new prompts are:
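The expansion step can be sketched the same way. Again, `expand_prompt` and `stub_llm` are hypothetical names of my own, with a canned stub standing in for a real model call; the editing template is the one shown above.

```python
import re

# The prompt-editing template above, as a format string.
EXPANSION_TEMPLATE = (
    'I\'m trying to write a zero-shot classifier.\n'
    'My current prompt is:\n"{prompt}"\n'
    'But it gets the following examples wrong: {error_samples}\n'
    'Based on these examples the problem with this prompt is that {gradient}\n'
    'Based on the above information, I wrote {steps_per_gradient} different '
    'improved prompts.\n'
    'Each prompt is wrapped with <START> and <END>.\n'
    'The {steps_per_gradient} new prompts are:'
)

def expand_prompt(llm, prompt, error_samples, gradient, steps_per_gradient=2):
    """Generate successor prompts that address one textual gradient."""
    response = llm(EXPANSION_TEMPLATE.format(
        prompt=prompt, error_samples=error_samples,
        gradient=gradient, steps_per_gradient=steps_per_gradient))
    return re.findall(r"<START>(.*?)<END>", response, flags=re.DOTALL)

def stub_llm(text):  # placeholder for a real model call
    return ("<START>Classify whether the text attacks a protected group.<END>"
            "<START>Answer Yes only if the text expresses hatred toward a group.<END>")

candidates = expand_prompt(stub_llm, "Is the following text hate speech?",
                           "Text: ... / Predicted: Yes / Label: No",
                           "the prompt is too vague")
```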
In addition to sampling from the prompts influenced by text gradients, the algorithm expands exploration by conducting a small Monte Carlo search in the local search space around the new prompt candidates and employs the following prompt to guide this search.
Generate a variation of the following instruction while keeping
the semantic meaning.
Input: {prompt_instruction}
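The Monte Carlo step simply re-applies the paraphrase prompt above to each successor candidate. A minimal sketch, with `monte_carlo_neighbors` and `stub_llm` as hypothetical helpers (a real model would return a different paraphrase on each call):

```python
# The paraphrase template above, as a format string.
PARAPHRASE_TEMPLATE = (
    "Generate a variation of the following instruction while keeping\n"
    "the semantic meaning.\n"
    "Input: {prompt_instruction}"
)

def monte_carlo_neighbors(llm, prompt, num_samples=3):
    """Sample paraphrases of `prompt` from its local search space."""
    query = PARAPHRASE_TEMPLATE.format(prompt_instruction=prompt)
    return [llm(query) for _ in range(num_samples)]

def stub_llm(text):  # placeholder; a real model samples varied paraphrases
    return "Does the text below constitute hate speech?"

neighbors = monte_carlo_neighbors(stub_llm, "Is the following text hate speech?")
```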
Step 3: Beam Search over Prompts: Selection Step
After the expansion process generates multiple successor candidates for each candidate prompt, the selection step determines which candidates are the most promising and should remain on the beam for the next iteration.
The specific method for selecting these candidates can vary depending on the problem at hand. The approaches adopted in the paper include UCB Bandits, Extended UCB Bandits, Successive Rejects, and Successive Halving.
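To illustrate the bandit view, here is a sketch of UCB1 selection under my own assumptions: each candidate prompt is an arm, a "pull" scores it (in practice, accuracy on a random minibatch, here a deterministic `fake_accuracy` table), and the evaluation budget is spent where the upper confidence bound is highest.

```python
import math

def ucb_select(candidates, score_fn, budget=100, c=2.0):
    """Spend `budget` evaluations across candidates using the UCB1 index;
    return the index of the candidate with the highest mean score."""
    counts = [0] * len(candidates)
    totals = [0.0] * len(candidates)
    for t in range(1, budget + 1):
        if t <= len(candidates):   # pull every arm once first
            arm = t - 1
        else:                      # then follow the UCB1 index
            arm = max(range(len(candidates)),
                      key=lambda j: totals[j] / counts[j]
                      + math.sqrt(c * math.log(t) / counts[j]))
        totals[arm] += score_fn(candidates[arm])
        counts[arm] += 1
    return max(range(len(candidates)), key=lambda j: totals[j] / counts[j])

# Deterministic toy scores in place of real minibatch accuracy.
fake_accuracy = {"prompt A": 0.55, "prompt B": 0.80, "prompt C": 0.62}
best = ucb_select(list(fake_accuracy), fake_accuracy.get)
```

With noisy minibatch scores the same loop concentrates pulls on promising prompts instead of evaluating every candidate on the full dataset.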
Result
In their empirical study, the research team conducted a comparison between their APO framework and three advanced prompt learning baselines, namely Monte-Carlo (MC, Zhou et al., 2022), RL, and AutoGPT. The comparison was performed across several NLP tasks including Jailbreak detection, Ethos (hate speech detection, Mollas et al., 2020), Liar (fake news detection, Wang, 2017), and Sarcasm detection (Farha and Magdy, 2020).
APO outperformed the baseline methods on all four tasks, with significant improvements of 3.9 percent and 8.2 percent over MC and RL, respectively. Notably, these gains were achieved without additional hyperparameter tuning or model training, highlighting APO's efficiency and effectiveness at enhancing prompts.