Using LangChain to Create Cost-Effective Datasets for Fine-Tuning Large Language Models

In today's digital age, the vast majority of information exists in a free-form, unstructured manner — from casual product descriptions to offhand travel plans. While such descriptions might make perfect sense to humans, machines often struggle to derive meaningful insights from them. However, transforming this unstructured information into a clear, organized format can unlock a wealth of opportunities for data analysis and automation.

Imagine a scenario where someone casually mentions wanting a lightweight smartphone with a large screen that fits snugly in their pocket. Or picture a traveler thinking aloud about a week-long trip to Italy, emphasizing their desire to explore Rome, indulge in the local cuisine, and fit in a day of hiking. The challenge lies in converting these loose descriptions into structured datasets, which can then be processed, analyzed, and acted upon by algorithms and systems.

This article delves into the creation of Structured Output Datasets using two powerful tools: LangChain and the OpenAI API.

Example & Motivation: Travel Plans

Consider a scenario: We're designing a chatbot for a travel agency. The chatbot's task is to understand various travel plans described by users and to make relevant suggestions.

Input from User: "I'm thinking of a solo trip for about 2 weeks, primarily focusing on sightseeing and museum visits in Paris. I have a budget of around $5000."

Now, this statement might seem simple, but there's a lot of structured data embedded within:

Duration: 2 weeks
Travel Type: Solo
Activities: Sightseeing, Museum Visits
Destinations: Paris
Budget: $5000

To train our chatbot, we need thousands, if not millions, of such sentences, each with its structured counterpart. Manually curating such data is cumbersome, hence the need for our structured output generator!

By running the code in the next section, we generate structured lines like: "Duration: 2 weeks, Travel Type: solo, Activities: sightseeing, museum visits, Destinations: Paris, Budget: $5000."

Or maybe: "Booking Mode: online platform, Travelers: 1 person, Destinations: Tokyo, Rio, Cultural Interest: historical sites, modern attractions."

These lines serve as the base data for our chatbot. With these, we can employ the OpenAI API to write a matching free-form sentence for each structured line. The resulting pairs of unstructured input (the user's sentence) and structured output are what teach our chatbot to understand and process diverse travel plans.

Structured Data Generation

import random

# Expanded Data Pools

DURATION_POOL = [f"{i} days" for i in range(1, 31)] + [f"{i} weeks" for i in range(1, 9)]
DESTINATIONS_POOL = ["Paris", "New York", "Tokyo", "London", "Sydney", "Cairo", "Rio", "Cape Town", "Moscow", "Beijing"]
ACTIVITIES_POOL = ["sightseeing", "trekking", "culinary experiences", "museum visits", "beach relaxation", "mountain climbing"]
BUDGET_POOL = [f"${i}000" for i in range(1, 21)]
ACCOMMODATION_POOL = ["hotel", "hostel", "B&B", "luxury resort", "rented apartment"]
TRANSPORTATION_POOL = ["bus", "train", "flight", "rented car", "bicycle"]

# Additional Elements

TRAVELERS_POOL = [f"{i} persons" for i in range(1, 11)]
SEASON_POOL = ["spring", "summer", "fall", "winter"]
MEAL_PREF_POOL = ["local cuisine", "vegetarian meals", "vegan options", "seafood delights", "fast food"]
TRAVEL_TYPE_POOL = ["solo", "couple", "family", "group", "business", "backpacking"]
BOOKING_MODE_POOL = ["travel agency", "online platform", "direct booking", "last minute deals"]
GUIDE_PREF_POOL = ["guided tours", "self-exploration", "audio guide", "local guide", "group tour", "private tour"]
LANG_PREF_POOL = ["English", "French", "Spanish", "local language", "multilingual"]
CULTURAL_INTEREST_POOL = ["historical sites", "modern attractions", "folk performances", "art galleries", "music concerts"]

def generate_travel_plan():
    elements = {
        "Duration": random.choice(DURATION_POOL),
        "Destinations": ", ".join(random.sample(DESTINATIONS_POOL, random.randint(1, 3))),
        "Activities": ", ".join(random.sample(ACTIVITIES_POOL, random.randint(1, 3))),
        "Budget": random.choice(BUDGET_POOL),
        "Accommodation": random.choice(ACCOMMODATION_POOL),
        "Transportation": ", ".join(random.sample(TRANSPORTATION_POOL, random.randint(1, 3))),
        "Travelers": random.choice(TRAVELERS_POOL),
        "Season": random.choice(SEASON_POOL),
        "Meal Preference": random.choice(MEAL_PREF_POOL),
        "Travel Type": random.choice(TRAVEL_TYPE_POOL),
        "Booking Mode": random.choice(BOOKING_MODE_POOL),
        "Guide Preference": random.choice(GUIDE_PREF_POOL),
        "Language Preference": random.choice(LANG_PREF_POOL),
        "Cultural Interest": random.choice(CULTURAL_INTEREST_POOL)
    }

    # Drop some keys for variability
    num_elements_to_use = random.randint(6, 10)  # Using between 6 and 10 elements
    keys_to_use = random.sample(list(elements.keys()), num_elements_to_use)
    
    # Construct the travel plan
    plan_elements = [f"{key}: {elements[key]}" for key in keys_to_use]
    travel_plan = ", ".join(plan_elements) + '.'

    return travel_plan

# Generate 100k structured travel plans

data_points = 100_000
structured_data = [generate_travel_plan() for _ in range(data_points)]

# For demonstration, let's print the first 5 entries
for entry in structured_data[:5]:
    print(entry)
    print("---------------------")

# To save the structured data into a text file
with open("structured_data.txt", "w") as f:
    for entry in structured_data:
        f.write(entry + "\n---------------------\n")

Understanding Structured Output Generation

As we delve into the world of natural language processing (NLP) and machine learning, one critical challenge we often face is the availability of structured datasets. These datasets serve as the foundation upon which many AI and NLP models are trained. The more diverse and expansive the dataset, the more our models can learn and adapt.

The code in the previous section addresses this need: it is an example of generating structured output in the domain of travel planning. Using Python's random module, the code dynamically produces travel plans with varied elements like destinations, activities, accommodations, and more.

Let's dissect the core components:

Data Pools: The first step in our code is to define various data pools. These are Python lists of possible values for different attributes, like destinations, durations, or budgets.
Random Selection: Using random selection, our code ensures a diverse set of combinations. Whether it's choosing between Paris and Tokyo, a hotel or a hostel, or a trip lasting 2 days versus 2 weeks, the choices are vast.
Constructing Structured Outputs: Using a dictionary structure, we map each attribute (like "Duration" or "Destinations") to a value drawn from the relevant data pool. Not every attribute is used in every entry—again, ensuring diversity in our dataset.
Output Formatting: Finally, the selected attributes are joined into a single comma-separated line of "Key: value" pairs that ends with a period. Keeping each plan on one line makes it easy to store and to pass into a prompt later, where it is turned into narrative text.
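The four steps above can be condensed into a small sketch. The pools here are shortened for illustration, and the `make_plan` and `make_plans` names and the `seed` parameter are assumptions that do not appear in the original code, which does not seed the generator. Seeding is one way to make a run repeatable when a dataset needs to be regenerated.

```python
import random

# Shortened pools, for illustration only
POOLS = {
    "Duration": ["3 days", "2 weeks"],
    "Destinations": ["Paris", "Tokyo", "Rio"],
    "Budget": ["$2000", "$5000"],
    "Season": ["spring", "winter"],
}

def make_plan(rng, min_keys=2, max_keys=4):
    """Pick a random subset of attributes and join them as 'Key: value' pairs."""
    keys = rng.sample(list(POOLS), rng.randint(min_keys, max_keys))
    return ", ".join(f"{key}: {rng.choice(POOLS[key])}" for key in keys) + "."

def make_plans(n, seed):
    """Generate n plans from a private generator so the same seed repeats the same plans."""
    rng = random.Random(seed)
    return [make_plan(rng) for _ in range(n)]

for plan in make_plans(3, seed=42):
    print(plan)
```

Using a dedicated `random.Random` instance instead of the module-level functions keeps the seed local to the dataset generation.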

Travelers: 5 persons, Travel Type: business, Meal Preference: fast food, Transportation: train, flight, Booking Mode: last minute deals, Cultural Interest: historical sites.
---------------------
Duration: 5 weeks, Accommodation: hotel, Guide Preference: self-exploration, Booking Mode: travel agency, Language Preference: Spanish, Activities: beach relaxation, culinary experiences, Budget: $11000, Travel Type: couple.
---------------------
Booking Mode: direct booking, Travelers: 9 persons, Transportation: bicycle, Destinations: Beijing, Cape Town, New York, Activities: beach relaxation, Accommodation: rented apartment, Duration: 3 weeks, Season: winter, Travel Type: backpacking, Meal Preference: local cuisine.
---------------------
Travel Type: business, Destinations: Rio, Activities: sightseeing, Transportation: train, flight, rented car, Guide Preference: private tour, Meal Preference: fast food, Cultural Interest: art galleries, Language Preference: French, Accommodation: luxury resort.
---------------------
Travelers: 1 persons, Accommodation: hostel, Booking Mode: online platform, Travel Type: backpacking, Duration: 2 weeks, Season: winter, Language Preference: English, Guide Preference: private tour, Destinations: Tokyo, Budget: $5000.
---------------------
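Because some values contain commas themselves (for example "Transportation: train, flight"), splitting a line on ", " alone would break it apart incorrectly. The following is a minimal sketch of one way to turn a generated line back into a dictionary by splitting only where a known field name starts. The `parse_plan` helper and the `FIELDS` list are assumptions added here for illustration; they are not part of the original code.

```python
import re

# Field names used by generate_travel_plan()
FIELDS = [
    "Duration", "Destinations", "Activities", "Budget", "Accommodation",
    "Transportation", "Travelers", "Season", "Meal Preference", "Travel Type",
    "Booking Mode", "Guide Preference", "Language Preference", "Cultural Interest",
]

# Match "<Field>: " only at the start of the line or right after ", "
FIELD_RE = re.compile(r"(?:^|, )(" + "|".join(map(re.escape, FIELDS)) + r"): ")

def parse_plan(line):
    """Split a 'Key: value, Key: value.' line into a dict, keeping commas inside values."""
    line = line.strip().rstrip(".")
    matches = list(FIELD_RE.finditer(line))
    plan = {}
    for current, following in zip(matches, matches[1:] + [None]):
        end = following.start() if following else len(line)
        plan[current.group(1)] = line[current.end():end]
    return plan

parsed = parse_plan(
    "Travelers: 5 persons, Travel Type: business, Meal Preference: fast food, "
    "Transportation: train, flight, Booking Mode: last minute deals, "
    "Cultural Interest: historical sites."
)
print(parsed["Transportation"])  # train, flight
```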

Generating Unstructured Data using LangChain and OpenAI API

LangChain provides a streamlined approach to programmatically craft prompts for Large Language Models (LLMs). While LangChain supports various models, in this context we'll use OpenAI's GPT-3.5. Our goal is to design a prompt that can transform structured information into a fluid, natural-sounding narrative.

For instance:

Structured Data: 'Travelers: 5 persons, Travel Type: business, Meal Preference: fast food, Transportation: train, flight, Booking Mode: last minute deals, Cultural Interest: historical sites.'
Unstructured Narrative: "I'm organizing a business trip for a team of five. Given our preference for fast food and the need to travel by both train and flight, we're on the lookout for last-minute deals. Notably, we're eager to discover historical sites during our journey, ensuring a mix of business and culture."

Below, we will explore a comprehensive walkthrough of generating such unstructured data based on the given structured travel plans.

1. Installing Necessary Libraries

!pip install langchain
!pip install openai

2. Setup and Imports

import os
os.environ["OPENAI_API_KEY"] = "paste-the-key-here"

import numpy as np
import pandas as pd
from tqdm import tqdm

from langchain import PromptTemplate
from langchain.llms import OpenAI
from langchain.chains import LLMChain

3. Loading Structured Data

with open("structured_data.txt", "r") as file:
    lines = file.readlines()

entries = [entry.strip() for entry in ''.join(lines).split("---------------------") if entry.strip() != ""]
structured_data_array = np.array(entries)

3.1 Writing style

Everyone has a unique way of expressing themselves in writing, which makes it crucial for our dataset to capture a wide spectrum of styles. This ensures that a model trained on such data is versatile and resilient to various stylistic nuances. While one could further refine this data by addressing spelling errors, grammatical mistakes, or even incorporating multiple languages, our focus here is primarily on capturing diverse writing styles.

import random

def select_style():
    styles = [
        "Narrative", "Persuasive", "Expository", "Journalistic",
        "Satirical", "Stream-of-Consciousness", "Epistolary", "Conversational",
        "Didactic", "Slang or Colloquial"
    ]
    
    return random.choice(styles)

sample_style = select_style()
# Uncomment to print: print(f"Randomly chosen style: {sample_style}")

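# Pair every structured plan with a randomly chosen style;
# the "style" key matches the {style} placeholder in the prompt template
plan_list = [
    {"user_plan": plan, "style": select_style()} for plan in structured_data_array
]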

4. Preparing Data for Processing

plan_list = [
    {"user_plan": plan, "style": select_style()} for plan in structured_data_array
]

Each structured plan is converted to a dictionary with the keys `user_plan` and `style`, matching the two input variables of the prompt template. This format is useful for processing with LangChain.

5. Setting Up the Templating Process

demo_template='''
I want you to come up with an unstructured text for the following plan: {user_plan}.
Use the writing style: {style}
For example: I'm thinking of a solo trip for about 2 weeks, primarily focusing on sightseeing and museum visits in Paris. I have a budget of around $5000.
'''


prompt=PromptTemplate(
    input_variables=['user_plan','style'],
    template=demo_template
    )

Explanation:

A template is defined which provides a guideline for the OpenAI model to convert structured data into unstructured text.
The PromptTemplate object allows for dynamic insertion of structured data into the template, facilitating the conversion process.
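To make the substitution concrete, here is a minimal sketch that fills the same two placeholders with plain Python `str.format`. It approximates what happens to the template text before it is sent to the model and is not a call into LangChain itself; the shortened template and the sample `item` are made up for illustration.

```python
# Shortened stand-in for demo_template, with the same two placeholders
template = (
    "I want you to come up with an unstructured text for the following plan: {user_plan}.\n"
    "Use the writing style: {style}\n"
)

# One entry shaped like the items in plan_list (values are made up)
item = {
    "user_plan": "Duration: 2 weeks, Destinations: Paris",
    "style": "Conversational",
}

# Each {placeholder} is replaced by the value stored under the same key
filled = template.format(**item)
print(filled)
```

A missing key raises a `KeyError` in this sketch, which is why the dictionary keys and the placeholder names need to match.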

6. Setting Up the OpenAI Model

llm=OpenAI(temperature=0.5)
chain=LLMChain(llm=llm, prompt=prompt)

Explanation:

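The OpenAI model is initialized with a specific temperature, which controls the randomness of its output. Lower values make the output more deterministic and higher values make it more varied; 0.5 is a middle setting that keeps the text on topic while still allowing some variety.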
The LLMChain is established to chain the processes of templating and unstructured data generation.
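The effect of temperature can be illustrated with the textbook formulation, in which token scores are divided by the temperature before a softmax turns them into probabilities. This is a sketch under that assumption; the article does not state how the OpenAI API applies the parameter internally, and the scores below are made up.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide scores by the temperature, then normalize them into probabilities."""
    scaled = [x / temperature for x in logits]
    largest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
for t in (0.2, 0.5, 1.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

As the temperature drops, more of the probability mass concentrates on the highest-scoring token, which is why lower settings give more repeatable text.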

7. Generating Unstructured Data

summary = []
for item in tqdm(plan_list, desc='Processing topics'):
    # Each item supplies the user_plan and style values for the prompt
    summary.append(chain.run(item))

Explanation:

For each structured plan, the OpenAI model is invoked via LangChain to produce an unstructured text description.
The generated descriptions are stored in the summary list.
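With many API calls in a row, an occasional failed request is likely, and a single unhandled error would stop the whole loop. The following is a minimal sketch of retrying with exponential backoff. The `generate_with_retry` helper and its parameters are assumptions added for illustration, and `flaky_generate` is a stand-in for `chain.run` so the sketch runs without an API key.

```python
import time

def generate_with_retry(generate, item, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call generate(item), waiting longer after each failure before trying again."""
    for attempt in range(1, max_attempts + 1):
        try:
            return generate(item)
        except Exception:  # broad on purpose in this sketch; narrow it in real code
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))

# Stand-in for chain.run: fails twice, then succeeds
calls = {"n": 0}

def flaky_generate(item):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("temporary failure")
    return f"text for {item['user_plan']}"

result = generate_with_retry(
    flaky_generate,
    {"user_plan": "Duration: 2 weeks", "style": "Narrative"},
    sleep=lambda seconds: None,  # skip real waiting in this demonstration
)
print(result)
```

In the loop above, `summary.append(generate_with_retry(chain.run, item))` would be the corresponding call.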

8. Saving the Generated Data

topic_list = [
    {"user_plan": user_plan, "user_input": user_input}
    for user_plan, user_input in zip(structured_data_array, summary)
]

df = pd.DataFrame(topic_list)
df.to_csv('dataset.csv', index=False, encoding='utf-8')

Explanation:

The original structured data and the newly generated unstructured descriptions are combined to create a comprehensive dataset.
This dataset is converted to a pandas DataFrame and saved as a CSV file named dataset.csv.
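Fine-tuning tools often expect one JSON record per line rather than a CSV file. As a hedged sketch, the two columns can be reshaped into such records with the standard library alone. The `prompt` and `completion` key names are assumptions, since the format needed for the later fine-tuning step is not specified here, and the two rows are made up so the example runs without dataset.csv.

```python
import csv
import io
import json

# Two made-up rows standing in for dataset.csv
csv_text = io.StringIO()
writer = csv.DictWriter(csv_text, fieldnames=["user_plan", "user_input"])
writer.writeheader()
writer.writerow({
    "user_plan": "Duration: 2 weeks, Destinations: Paris.",
    "user_input": "I'd like to spend two weeks in Paris.",
})
writer.writerow({
    "user_plan": "Season: winter, Destinations: Tokyo.",
    "user_input": "Thinking about Tokyo this winter.",
})
csv_text.seek(0)

def to_jsonl(rows):
    """Free-form text becomes the input, the structured plan becomes the target."""
    return "\n".join(
        json.dumps({"prompt": row["user_input"], "completion": row["user_plan"]})
        for row in rows
    )

jsonl = to_jsonl(csv.DictReader(csv_text))
print(jsonl)
```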

Examples:

Structured: Travelers: 5 persons, Travel Type: business, Meal Preference: fast food, Transportation: train, flight, Booking Mode: last minute deals, Cultural Interest: historical sites.

Unstructured: Dear Friends, It's time to plan our next business trip! We are a group of five and we'd like to take advantage of some last minute deals. We are looking for a combination of train and flight transportation and our meal preference is fast food. We also have a cultural interest in visiting some historical sites. We have a budget of $5000 and are looking to travel for two weeks. We are all excited to explore and experience new cultures and places. Let's start planning our adventure today! Sincerely, The Travelers

Structured: Duration: 5 weeks, Accommodation: hotel, Guide Preference: self-exploration, Booking Mode: travel agency, Language Preference: Spanish, Activities: beach relaxation, culinary experiences, Budget: $11000, Travel Type: couple.

Unstructured: My partner and I have decided to embark on a 5-week journey through Spain. We've booked a hotel through a travel agency and plan to explore the country on our own. We want to experience the Spanish culture through its cuisine, so we'll be trying out all the local delicacies. We'll also be spending some time relaxing on the beach. Our budget for this trip is $11000. We're looking forward to this incredible journey and the memories we'll make together.

Structured: Booking Mode: direct booking, Travelers: 9 persons, Transportation: bicycle, Destinations: Beijing, Cape Town, New York, Activities: beach relaxation, Accommodation: rented apartment, Duration: 3 weeks, Season: winter, Travel Type: backpacking, Meal Preference: local cuisine.

Unstructured: If you are planning a 3-week backpacking trip for 9 people, then cycling is an ideal form of transportation. You can begin your journey in Beijing, then make your way to Cape Town and New York. During your travels, you can enjoy beach relaxation in some of the most beautiful places in the world. To save money, you can rent an apartment in each city you visit. As it is winter, you should bring plenty of warm clothes and blankets. To truly experience the local culture, it is best to sample the local cuisine. Bon voyage!

Structured: Travel Type: business, Destinations: Rio, Activities: sightseeing, Transportation: train, flight, rented car, Guide Preference: private tour, Meal Preference: fast food, Cultural Interest: art galleries, Language Preference: French, Accommodation: luxury resort.

Unstructured: I'm planning a business trip to Rio, and I'm looking to make the most of my time there. I want to see the sights, so I'm considering taking the train, flying, or renting a car. I'd like to have a private tour guide to get the most out of my experience, so I'm looking into that. I'm mostly interested in fast-food options for meals, but I also want to check out the art galleries and museums. I'd prefer to have a guide who speaks French, and I want to stay in a luxury resort. This is going to be an amazing trip!

Structured: Travelers: 1 persons, Accommodation: hostel, Booking Mode: online platform, Travel Type: backpacking, Duration: 2 weeks, Season: winter, Language Preference: English, Guide Preference: private tour, Destinations: Tokyo, Budget: $5000.

Unstructured: So, I'm looking for a winter backpacking adventure in Tokyo, and I'm not afraid to go alone! I'm hoping to find a hostel to stay in, and I'm going to book it through an online platform. I'm looking for a private tour guide who speaks English, and I'm sure I'll have a great time! After all, what could possibly go wrong with a budget of $5000?

Structured: Cultural Interest: folk performances, Budget: $15000, Destinations: Paris, Tokyo, Duration: 23 days, Season: spring, Activities: sightseeing, Travelers: 7 persons.

Unstructured: Ah, what a grand adventure I have planned! Seven of us are setting off to explore the wonders of Paris, Tokyo, and beyond in the glorious springtime! We shall be traveling for 23 days and have a budget of $15,000, so there will be plenty of room for sightseeing, folk performances, and other exciting activities. We shall be sure to take in the sights, sounds, and cultures of these two amazing cities! Who knows what wonders await us?
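Generated text can contain details that are absent from the structured plan. In the first example above, the narrative mentions a $5000 budget and a two-week duration, neither of which appears in its structured input. A rough check like the one below can flag such pairs for review before training. The `extra_dollar_amounts` helper is an assumption added for illustration, and it only looks at dollar amounts.

```python
import re

AMOUNT_RE = re.compile(r"\$[\d,]+")

def extra_dollar_amounts(plan, text):
    """Return dollar amounts that appear in the text but not in the structured plan."""
    def amounts(s):
        # Drop thousands separators so "$15,000" and "$15000" compare equal
        return {match.replace(",", "") for match in AMOUNT_RE.findall(s)}
    return sorted(amounts(text) - amounts(plan))

plan = "Travelers: 5 persons, Travel Type: business, Meal Preference: fast food."
text = "We are a group of five. We have a budget of $5000 and love fast food."
print(extra_dollar_amounts(plan, text))  # ['$5000']
```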

Conclusion

In conclusion, the versatility of written expression is paramount in the context of data science, particularly when training models on language. By harnessing the capabilities of LangChain and OpenAI, we've showcased how one can economically and efficiently create diverse datasets, ensuring our models are exposed to a myriad of writing styles. This foundation is not only pivotal for capturing the richness of human expression but also for making subsequent models as resilient as possible.

Stay tuned for our upcoming article, where we'll take this journey further. We will delve into the nuances of fine-tuning a Large Language Model (LLM) using QLoRA, employing the very dataset we've crafted today. The road to machine comprehension and versatility in language understanding is exciting, and we're thrilled to guide you through it.

Krishna Chaitanya Kosaraju, Ph. D.
