Using LangChain to Create Cost-Effective Datasets for Fine-Tuning Large Language Models
In today's digital age, the vast majority of information exists as free-form, unstructured text, from casual product descriptions to offhand travel plans. While such descriptions make perfect sense to humans, machines often struggle to derive meaningful insights from them. Transforming this unstructured information into a clear, organized format can unlock a wealth of opportunities for data analysis and automation.
Imagine someone casually mentioning that they want a lightweight smartphone with a large screen that fits snugly in their pocket. Or picture a traveler thinking aloud about a week-long trip to Italy, emphasizing their desire to explore Rome, indulge in the local cuisine, and fit in a day of hiking. The challenge lies in converting these loose descriptions into structured datasets, which can then be processed, analyzed, and acted on by algorithms and systems.
This article delves into the creation of Structured Output Datasets using two powerful tools: LangChain and the OpenAI API.
Example & Motivation: Travel Plans
Consider a scenario: We're designing a chatbot for a travel agency. The chatbot's task is to understand various travel plans described by users and to make relevant suggestions.
Input from User: "I'm thinking of a solo trip for about 2 weeks, primarily focusing on sightseeing and museum visits in Paris. I have a budget of around $5000."
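Now, this statement might seem simple, but there's a lot of structured data embedded within: a duration (2 weeks), a travel type (solo), activities (sightseeing and museum visits), a destination (Paris), and a budget ($5000).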
To train our chatbot, we need thousands, if not millions, of such sentences, each with its structured counterpart. Manually curating such data is cumbersome, hence the need for our structured output generator!
Running the code in the next section generates structured sentences like: "Duration: 2 weeks, Travel Type: solo, Activities: sightseeing, museum visits, Destinations: Paris, Budget: $5000."
Or maybe: "Booking Mode: online platform, Travelers: 1 person, Destinations: Tokyo, Rio, Cultural Interest: historical sites, modern attractions."
These sentences serve as the base data for our chatbot. With these, we can employ the OpenAI API to convert the unstructured input (user's sentence) into structured output, teaching our chatbot to understand and process diverse travel plans.
Structured Data Generation
import random
# Expanded Data Pools
DURATION_POOL = [f"{i} days" for i in range(1, 31)] + [f"{i} weeks" for i in range(1, 9)]
DESTINATIONS_POOL = ["Paris", "New York", "Tokyo", "London", "Sydney", "Cairo", "Rio", "Cape Town", "Moscow", "Beijing"]
ACTIVITIES_POOL = ["sightseeing", "trekking", "culinary experiences", "museum visits", "beach relaxation", "mountain climbing"]
BUDGET_POOL = [f"${i}000" for i in range(1, 21)]
ACCOMMODATION_POOL = ["hotel", "hostel", "B&B", "luxury resort", "rented apartment"]
TRANSPORTATION_POOL = ["bus", "train", "flight", "rented car", "bicycle"]
# Additional Elements
TRAVELERS_POOL = [f"{i} persons" for i in range(1, 11)]
SEASON_POOL = ["spring", "summer", "fall", "winter"]
MEAL_PREF_POOL = ["local cuisine", "vegetarian meals", "vegan options", "seafood delights", "fast food"]
TRAVEL_TYPE_POOL = ["solo", "couple", "family", "group", "business", "backpacking"]
BOOKING_MODE_POOL = ["travel agency", "online platform", "direct booking", "last minute deals"]
GUIDE_PREF_POOL = ["guided tours", "self-exploration", "audio guide", "local guide", "group tour", "private tour"]
LANG_PREF_POOL = ["English", "French", "Spanish", "local language", "multilingual"]
CULTURAL_INTEREST_POOL = ["historical sites", "modern attractions", "folk performances", "art galleries", "music concerts"]
def generate_travel_plan():
    # Pick a random value (or a few values) for every available field
    elements = {
        "Duration": random.choice(DURATION_POOL),
        "Destinations": ", ".join(random.sample(DESTINATIONS_POOL, random.randint(1, 3))),
        "Activities": ", ".join(random.sample(ACTIVITIES_POOL, random.randint(1, 3))),
        "Budget": random.choice(BUDGET_POOL),
        "Accommodation": random.choice(ACCOMMODATION_POOL),
        "Transportation": ", ".join(random.sample(TRANSPORTATION_POOL, random.randint(1, 3))),
        "Travelers": random.choice(TRAVELERS_POOL),
        "Season": random.choice(SEASON_POOL),
        "Meal Preference": random.choice(MEAL_PREF_POOL),
        "Travel Type": random.choice(TRAVEL_TYPE_POOL),
        "Booking Mode": random.choice(BOOKING_MODE_POOL),
        "Guide Preference": random.choice(GUIDE_PREF_POOL),
        "Language Preference": random.choice(LANG_PREF_POOL),
        "Cultural Interest": random.choice(CULTURAL_INTEREST_POOL)
    }
    # Drop some keys for variability
    num_elements_to_use = random.randint(6, 10)  # Using between 6 and 10 elements
    keys_to_use = random.sample(list(elements.keys()), num_elements_to_use)
    # Construct the travel plan
    plan_elements = [f"{key}: {elements[key]}" for key in keys_to_use]
    travel_plan = ", ".join(plan_elements) + '.'
    return travel_plan
# Generate 100k structured travel plans
data_points = 100_000
structured_data = [generate_travel_plan() for _ in range(data_points)]
# For demonstration, let's print the first 5 entries
for entry in structured_data[:5]:
    print(entry)
    print("---------------------")
# To save the structured data into a text file
with open("structured_data.txt", "w") as f:
    for entry in structured_data:
        f.write(entry + "\n---------------------\n")
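The generator above draws from Python's global random state, so every run produces a different dataset. Below is a minimal sketch of one way to make runs repeatable, assuming a seeded `random.Random` instance is passed to the generator. The trimmed pools and the names `generate_plan` and `generate_dataset` are illustrative and not part of the original code.

```python
import random

# Trimmed, illustrative pools (the article's pools are larger).
DESTINATIONS = ["Paris", "Tokyo", "Rio"]
ACTIVITIES = ["sightseeing", "trekking", "museum visits"]


def generate_plan(rng):
    """Build one plan string using the supplied random generator."""
    fields = {
        "Destinations": ", ".join(rng.sample(DESTINATIONS, rng.randint(1, 2))),
        "Activities": ", ".join(rng.sample(ACTIVITIES, rng.randint(1, 2))),
        "Duration": f"{rng.randint(1, 30)} days",
    }
    # Keep a random subset of the fields, as the article's generator does.
    keys = rng.sample(list(fields), rng.randint(2, 3))
    return ", ".join(f"{key}: {fields[key]}" for key in keys) + "."


def generate_dataset(n, seed):
    """Generate n plans from a dedicated, seeded generator."""
    rng = random.Random(seed)
    return [generate_plan(rng) for _ in range(n)]


first = generate_dataset(5, seed=42)
second = generate_dataset(5, seed=42)
print(first == second)
```

Using a dedicated generator rather than calling `random.seed()` globally keeps the dataset stable even if other code in the same session also draws random numbers.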
Understanding Structured Output Generation
As we delve into the world of natural language processing (NLP) and machine learning, one critical challenge we often face is the availability of structured datasets. These datasets serve as the foundation upon which many AI and NLP models are trained. The more diverse and expansive the dataset, the more our models can learn and adapt.
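The code in the previous section addresses this for the travel-planning domain. Using Python's random module, it produces travel plans with varied elements such as destinations, activities, and accommodations.
Let's dissect the core components: the data pools supply candidate values for each field, generate_travel_plan() picks a random value for every field and then keeps a random subset of 6 to 10 fields, and the final loop writes 100,000 plans to structured_data.txt. The first five generated entries look like this: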
Travelers: 5 persons, Travel Type: business, Meal Preference: fast food, Transportation: train, flight, Booking Mode: last minute deals, Cultural Interest: historical sites.
---------------------
Duration: 5 weeks, Accommodation: hotel, Guide Preference: self-exploration, Booking Mode: travel agency, Language Preference: Spanish, Activities: beach relaxation, culinary experiences, Budget: $11000, Travel Type: couple.
---------------------
Booking Mode: direct booking, Travelers: 9 persons, Transportation: bicycle, Destinations: Beijing, Cape Town, New York, Activities: beach relaxation, Accommodation: rented apartment, Duration: 3 weeks, Season: winter, Travel Type: backpacking, Meal Preference: local cuisine.
---------------------
Travel Type: business, Destinations: Rio, Activities: sightseeing, Transportation: train, flight, rented car, Guide Preference: private tour, Meal Preference: fast food, Cultural Interest: art galleries, Language Preference: French, Accommodation: luxury resort.
---------------------
Travelers: 1 persons, Accommodation: hostel, Booking Mode: online platform, Travel Type: backpacking, Duration: 2 weeks, Season: winter, Language Preference: English, Guide Preference: private tour, Destinations: Tokyo, Budget: $5000.
---------------------
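Because some values are themselves comma-separated lists (for example, several destinations), a plan line cannot be split on commas alone. The following is a rough sketch of how such a line could be parsed back into fields, assuming the field names used by the generator in this article. `parse_plan` is a hypothetical helper, not part of the original code.

```python
# Field names used by the generator in this article.
KNOWN_KEYS = {
    "Duration", "Destinations", "Activities", "Budget", "Accommodation",
    "Transportation", "Travelers", "Season", "Meal Preference",
    "Travel Type", "Booking Mode", "Guide Preference",
    "Language Preference", "Cultural Interest",
}


def parse_plan(plan):
    """Turn 'Key: value, Key: v1, v2.' into {'Key': ['value'], ...}."""
    parsed = {}
    current_key = None
    for chunk in plan.strip().rstrip(".").split(", "):
        key, sep, value = chunk.partition(": ")
        if sep and key in KNOWN_KEYS:
            # A recognised field name starts a new entry.
            current_key = key
            parsed[current_key] = [value]
        elif current_key is not None:
            # No field name: this chunk continues the previous list.
            parsed[current_key].append(chunk)
    return parsed


example = ("Travel Type: business, Destinations: Rio, "
           "Transportation: train, flight, rented car, Budget: $5000.")
result = parse_plan(example)
print(result)
```

Every value is stored as a list so that single-valued and multi-valued fields can be handled the same way downstream.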
Generating Unstructured Data using LangChain and OpenAI API
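LangChain provides a streamlined approach to programmatically crafting prompts for Large Language Models (LLMs). While LangChain supports various models, in this context we'll use one of OpenAI's GPT-3.5 models. Our goal is to design a prompt that transforms structured information into a fluid, natural-sounding narrative.
For instance, the structured plan "Duration: 2 weeks, Travel Type: solo, Activities: sightseeing, museum visits, Destinations: Paris, Budget: $5000." should turn into something like "I'm thinking of a solo trip for about 2 weeks, primarily focusing on sightseeing and museum visits in Paris. I have a budget of around $5000."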
Below, we will explore a comprehensive walkthrough of generating such unstructured data based on the given structured travel plans.
1. Installing Necessary Libraries
!pip install langchain
!pip install openai
2. Setup and Imports
import os
os.environ["OPENAI_API_KEY"] = "paste-the-key-here"
import numpy as np
import pandas as pd
from tqdm import tqdm
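# Note: these import paths match older LangChain releases; newer versions may have moved or deprecated them.
from langchain import PromptTemplate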
from langchain.llms import OpenAI
from langchain.chains import LLMChain
3. Loading Structured Data
with open("structured_data.txt", "r") as file:
    lines = file.readlines()
entries = [entry.strip() for entry in ''.join(lines).split("---------------------") if entry.strip() != ""]
structured_data_array = np.array(entries)
3.1 Writing style
Everyone has a unique way of expressing themselves in writing, which makes it crucial for our dataset to capture a wide spectrum of styles. This ensures that a model trained on such data is versatile and resilient to various stylistic nuances. While one could further refine this data by addressing spelling errors, grammatical mistakes, or even incorporating multiple languages, our focus here is primarily on capturing diverse writing styles.
import random
def select_style():
    # Pick one writing style at random
    styles = [
        "Narrative", "Persuasive", "Expository", "Journalistic",
        "Satirical", "Stream-of-Consciousness", "Epistolary", "Conversational",
        "Didactic", "Slang or Colloquial"
    ]
    return random.choice(styles)
sample_style = select_style()
# Uncomment to print: print(f"Randomly chosen style: {sample_style}")
4. Preparing Data for Processing
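# Pair every structured plan with a randomly chosen writing style
plan_list = [
    {"user_plan": plan, "style": select_style()} for plan in structured_data_array
]
Each structured plan is converted to a dictionary with two keys, `user_plan` and `style`, which match the placeholders in the prompt template. This format is convenient for processing with LangChain.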
5. Setting Up the Templating Process
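demo_template='''
I want you to come up with an unstructured text for the following plan: {user_plan}.
Use the writing style: {style}
for example: I'm thinking of a solo trip for about 2 weeks, primarily focusing on sightseeing and museum visits in Paris. I have a budget of around $5000.
'''
prompt=PromptTemplate(
    input_variables=['user_plan','style'],
    template=demo_template
)
Explanation: The template contains two placeholders, {user_plan} and {style}, which PromptTemplate fills in from each dictionary in plan_list. The example sentence at the end shows the model the kind of text we expect back.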
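Before spending API calls, it can be useful to look at the exact text the model will receive. Here is a small sketch that fills the same two placeholders with plain Python string formatting, so it runs without LangChain or an API key. `preview_prompt` is a hypothetical helper, and the template is copied from this step with the typo corrected.

```python
# Template copied from this step; {user_plan} and {style} are the placeholders.
demo_template = '''
I want you to come up with an unstructured text for the following plan: {user_plan}.
Use the writing style: {style}
for example: I'm thinking of a solo trip for about 2 weeks, primarily focusing on sightseeing and museum visits in Paris. I have a budget of around $5000.
'''


def preview_prompt(template, user_plan, style):
    """Fill the placeholders the same way the prompt template would."""
    return template.format(user_plan=user_plan, style=style)


rendered = preview_prompt(
    demo_template,
    user_plan="Travelers: 1 persons, Destinations: Tokyo, Budget: $5000.",
    style="Satirical",
)
print(rendered)
```

The preview also makes small formatting issues visible: each generated plan already ends with a period, and the template adds a second one right after `{user_plan}`.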
6. Setting Up the OpenAI Model
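llm=OpenAI(temperature=0.5)
chain=LLMChain(llm=llm, prompt=prompt)
Explanation: OpenAI(temperature=0.5) creates the model wrapper with a moderate sampling temperature, so the wording varies between calls without drifting too far from the plan. LLMChain ties the prompt template to the model.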
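Since the chain will be called once per plan, it may be worth estimating the bill before pointing it at all 100,000 entries. The sketch below is deliberately crude. It assumes roughly four characters per token, which is only a rule of thumb for English text, and the price and expected output length are placeholder values rather than real figures. `estimate_tokens` and `estimate_cost` are hypothetical helpers.

```python
def estimate_tokens(text, chars_per_token=4.0):
    """Very rough token count: assumes about 4 characters per token."""
    return max(1, round(len(text) / chars_per_token))


def estimate_cost(prompts, expected_output_tokens, price_per_1k_tokens):
    """Estimate total tokens and cost for one completion per prompt."""
    total_tokens = sum(
        estimate_tokens(prompt) + expected_output_tokens for prompt in prompts
    )
    return total_tokens, total_tokens / 1000 * price_per_1k_tokens


# Placeholder inputs: replace with real prompts and the current price list.
sample_prompts = ["Duration: 2 weeks, Travel Type: solo, Budget: $5000."] * 1000
tokens, cost = estimate_cost(
    sample_prompts,
    expected_output_tokens=120,   # assumed average length of a reply
    price_per_1k_tokens=0.002,    # placeholder, not an actual price
)
print(f"about {tokens} tokens, about ${cost:.2f}")
```

Running the pipeline on a few hundred plans first and comparing the real usage against this estimate is a cheap way to calibrate the numbers.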
7. Generating Unstructured Data
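summary = []
for item in tqdm(plan_list, desc='Processing topics'):
    # One API call per plan; the chain fills the template and returns the generated text
    summary.append(chain.run(item))
Explanation: We loop over plan_list, run the chain once per entry, and collect the generated text in summary. tqdm displays a progress bar, which is useful because each entry requires a separate API call.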
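With this many API calls, an occasional timeout or rate-limit error is likely, and a single unhandled exception would end the loop. One possible safeguard is a retry wrapper with exponential backoff, sketched below. `run_with_retry` is a hypothetical helper, and `flaky_generate` is a stand-in for `chain.run` so the example runs without an API key.

```python
import time


def run_with_retry(generate, item, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call generate(item), retrying with exponential backoff on failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return generate(item)
        except Exception:
            if attempt == max_attempts:
                raise  # Give up after the final attempt.
            # Wait base_delay, then 2x, then 4x, and so on.
            sleep(base_delay * 2 ** (attempt - 1))


# Stand-in for chain.run: fails twice, then succeeds.
calls = {"count": 0}


def flaky_generate(item):
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("temporary failure")
    return f"text for {item['user_plan']}"


delays = []  # Record the delays instead of actually sleeping.
result = run_with_retry(
    flaky_generate,
    {"user_plan": "Duration: 2 weeks.", "style": "Narrative"},
    sleep=delays.append,
)
print(result, delays)
```

In the loop above, `summary.append(chain.run(item))` would become `summary.append(run_with_retry(chain.run, item))`.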
8. Saving the Generated Data
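topic_list = [
    {"user_plan": user_plan, "user_input": user_input}
    for user_plan, user_input in zip(structured_data_array, summary)
]
df = pd.DataFrame(topic_list)
df.to_csv('dataset.csv', index=False, encoding='utf-8')
Explanation: Each structured plan is paired with its generated text, and the pairs are written to dataset.csv with the columns user_plan and user_input.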
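Fine-tuning tools often expect one JSON record per line rather than a CSV file. As a sketch of that conversion, the code below reads CSV text with the two columns written above and emits JSON Lines, with the free-form text as the input and the structured plan as the target. The field names `input` and `output` and the helper `csv_to_jsonl` are assumptions for illustration; the format required by a specific training tool may differ.

```python
import csv
import io
import json


def csv_to_jsonl(csv_text):
    """Convert rows with user_plan/user_input columns to JSON Lines."""
    reader = csv.DictReader(io.StringIO(csv_text))
    lines = []
    for row in reader:
        record = {
            "input": row["user_input"].strip(),   # free-form text
            "output": row["user_plan"].strip(),   # structured target
        }
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)


# A tiny in-memory stand-in for dataset.csv.
sample_csv = (
    "user_plan,user_input\n"
    '"Duration: 2 weeks, Budget: $5000.",'
    '"I have two weeks and about $5000 to spend."\n'
)
jsonl = csv_to_jsonl(sample_csv)
print(jsonl)
```

For the real file, the same function can be applied to the contents of dataset.csv, for example `csv_to_jsonl(open("dataset.csv", encoding="utf-8").read())`.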
Examples:
Structured: Travelers: 5 persons, Travel Type: business, Meal Preference: fast food, Transportation: train, flight, Booking Mode: last minute deals, Cultural Interest: historical sites.
Unstructured: Dear Friends, It's time to plan our next business trip! We are a group of five and we'd like to take advantage of some last minute deals. We are looking for a combination of train and flight transportation and our meal preference is fast food. We also have a cultural interest in visiting some historical sites. We have a budget of $5000 and are looking to travel for two weeks. We are all excited to explore and experience new cultures and places. Let's start planning our adventure today! Sincerely, The Travelers
Structured: Duration: 5 weeks, Accommodation: hotel, Guide Preference: self-exploration, Booking Mode: travel agency, Language Preference: Spanish, Activities: beach relaxation, culinary experiences, Budget: $11000, Travel Type: couple.
Unstructured: My partner and I have decided to embark on a 5-week journey through Spain. We've booked a hotel through a travel agency and plan to explore the country on our own. We want to experience the Spanish culture through its cuisine, so we'll be trying out all the local delicacies. We'll also be spending some time relaxing on the beach. Our budget for this trip is $11000. We're looking forward to this incredible journey and the memories we'll make together.
Structured: Booking Mode: direct booking, Travelers: 9 persons, Transportation: bicycle, Destinations: Beijing, Cape Town, New York, Activities: beach relaxation, Accommodation: rented apartment, Duration: 3 weeks, Season: winter, Travel Type: backpacking, Meal Preference: local cuisine.
Unstructured: If you are planning a 3-week backpacking trip for 9 people, then cycling is an ideal form of transportation. You can begin your journey in Beijing, then make your way to Cape Town and New York. During your travels, you can enjoy beach relaxation in some of the most beautiful places in the world. To save money, you can rent an apartment in each city you visit. As it is winter, you should bring plenty of warm clothes and blankets. To truly experience the local culture, it is best to sample the local cuisine. Bon voyage!
Structured: Travel Type: business, Destinations: Rio, Activities: sightseeing, Transportation: train, flight, rented car, Guide Preference: private tour, Meal Preference: fast food, Cultural Interest: art galleries, Language Preference: French, Accommodation: luxury resort.
Unstructured: I'm planning a business trip to Rio, and I'm looking to make the most of my time there. I want to see the sights, so I'm considering taking the train, flying, or renting a car. I'd like to have a private tour guide to get the most out of my experience, so I'm looking into that. I'm mostly interested in fast-food options for meals, but I also want to check out the art galleries and museums. I'd prefer to have a guide who speaks French, and I want to stay in a luxury resort. This is going to be an amazing trip!
Structured: Travelers: 1 persons, Accommodation: hostel, Booking Mode: online platform, Travel Type: backpacking, Duration: 2 weeks, Season: winter, Language Preference: English, Guide Preference: private tour, Destinations: Tokyo, Budget: $5000.
Unstructured: So, I'm looking for a winter backpacking adventure in Tokyo, and I'm not afraid to go alone! I'm hoping to find a hostel to stay in, and I'm going to book it through an online platform. I'm looking for a private tour guide who speaks English, and I'm sure I'll have a great time! After all, what could possibly go wrong with a budget of $5000?
Structured: Cultural Interest: folk performances, Budget: $15000, Destinations: Paris, Tokyo, Duration: 23 days, Season: spring, Activities: sightseeing, Travelers: 7 persons.
Unstructured: Ah, what a grand adventure I have planned! Seven of us are setting off to explore the wonders of Paris, Tokyo, and beyond in the glorious springtime! We shall be traveling for 23 days and have a budget of $15,000, so there will be plenty of room for sightseeing, folk performances, and other exciting activities. We shall be sure to take in the sights, sounds, and cultures of these two amazing cities! Who knows what wonders await us?
Conclusion
In conclusion, the versatility of written expression is paramount in the context of data science, particularly when training models on language. By harnessing the capabilities of LangChain and OpenAI, we've showcased how one can economically and efficiently create diverse datasets, ensuring our models are exposed to a myriad of writing styles. This foundation is not only pivotal for capturing the richness of human expression but also for making subsequent models as resilient as possible.
Stay tuned for our upcoming article, where we'll take this journey further. We will delve into the nuances of fine-tuning a Large Language Model (LLM) using QLoRA, employing the very dataset we've crafted today. The road to machine comprehension and versatility in language understanding is exciting, and we're thrilled to guide you through it.