No Real Data? No Problem!

Mischa Van Oijen

Published Oct 15, 2024

Data plays an important role in every business. It helps make informed decisions, develop new products, and improve services. However, using real data comes with challenges. Privacy regulations restrict how data can be used, data can be hard to access, and sometimes there simply isn't enough data available. This is where synthetic data becomes valuable.

What is Synthetic Data?

Synthetic data is artificially generated data that behaves like real data but does not contain any personal information. This makes synthetic data a safe and practical option for many applications, allowing businesses to work more efficiently without the risks associated with real data.

For instance, hospitals can use synthetic data to analyse patient records while complying with privacy regulations. This allows them to gain insights and improve healthcare practices without putting sensitive information at risk. Similarly, a small retailer might use synthetic data to train an AI model that recommends products to customers. With synthetic data, they can improve their AI system more quickly and ultimately provide a better customer experience.

Generative AI: The Engine Behind Synthetic Data

Generative AI is well suited to generating synthetic data. It learns from real data and then creates new, similar data, which lets it produce realistic, privacy-safe versions of the data needed for training and testing. Tools like ChatGPT make this process easier: provide a few real-world examples, and the AI can generate more data that fits the same patterns.

Our generative AI platform, Liza, has proven very effective at generating synthetic datasets for our projects. Although our use of synthetic data in Liza is straightforward, it has proven very useful for accelerating customer onboarding. For example, when we were testing a Retrieval-Augmented Generation (RAG) system, we used Liza to generate 100 questions based on documents, user personas, and customer expectations.

We then used those questions to create answers grounded in the original documents, producing an extensive test dataset for the RAG system. This let us automate quality evaluation with these datasets, which gives us input on which model to choose and on model configuration and prompts. It also relieves customers of the burden of coming up with a large variety of questions their users might ask.
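
An automated evaluation of this kind can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how Liza works: `EvalCase`, `token_overlap`, and `evaluate` are hypothetical names, `answer_fn` stands in for whatever RAG system is under test, and the token-overlap score is a crude stand-in for a real quality metric.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EvalCase:
    """One generated question with its document-grounded reference answer."""
    question: str
    reference_answer: str


def token_overlap(candidate: str, reference: str) -> float:
    """Fraction of reference tokens that also appear in the candidate.

    A deliberately simple placeholder metric (assumption); a real setup
    would likely use a stronger scorer.
    """
    reference_tokens = set(reference.lower().split())
    if not reference_tokens:
        return 0.0
    candidate_tokens = set(candidate.lower().split())
    return len(reference_tokens & candidate_tokens) / len(reference_tokens)


def evaluate(answer_fn: Callable[[str], str], cases: List[EvalCase]) -> float:
    """Run every question through the system under test and average the scores."""
    if not cases:
        raise ValueError("at least one evaluation case is required")
    scores = [
        token_overlap(answer_fn(case.question), case.reference_answer)
        for case in cases
    ]
    return sum(scores) / len(scores)
```

Running `evaluate` once per candidate model, configuration, or prompt, each wrapped as its own `answer_fn`, gives comparable scores over the same synthetic test set.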

Let's create a simple example

Suppose we have created a new sales forecasting application for bike shops, and we need synthetic data to populate the app's database for testing. We could start by prompting a generative AI tool like OpenAI’s ChatGPT for some data:

Create a CSV file with 25 random sales records for a bike shop.
Each record should include the following fields:
  - id (incrementing integer starting at 1)
  - date (random date between 1/1/2022 and 15/10/2023)
  - time (random time between 6:00am and 9:00pm in 1-minute increments)
  - product_id (incrementing integer starting at 1)
  - product
  - brand (use real world bike brands)
  - type (bike, accessory etc)
  - price in EURO
  - quantity (random integer between 1 and 3)
  - amount (price * quantity)
  - payment type (cash, credit, debit, or gift card)

The generated output is as follows:

id,date,time,product_id,product,brand,type,price,quantity,amount,payment type
1,2022-03-15,10:23,1,Road Bike,Specialized,bike,899.99,2,1799.98,credit
2,2022-07-21,14:45,2,Mountain Bike,Trek,bike,749.50,1,749.50,debit
3,2022-11-09,09:15,3,Hybrid Bike,Giant,bike,599.99,3,1799.97,cash
4,2022-05-05,16:30,4,Helmet,Bontrager,accessory,49.99,2,99.98,gift card
5,2023-01-12,11:05,5,Water Bottle,CamelBak,accessory,19.99,1,19.99,cash
6,2023-08-22,18:20,6,Mountain Bike,Scott,bike,899.00,1,899.00,debit
7,2023-04-30,12:00,7,Repair Kit,Park Tool,accessory,29.99,3,89.97,credit
8,2022-09-14,15:30,8,Electric Bike,Rad Power,bike,1499.99,1,1499.99,credit
9,2023-06-01,08:45,9,Bike Lock,Kryptonite,accessory,39.99,2,79.98,debit
10,2022-10-17,13:00,10,Road Bike,Cannondale,bike,949.99,1,949.99,gift card
11,2023-02-20,17:10,11,Gloves,Pearl Izumi,accessory,29.99,2,59.98,cash
12,2023-05-25,14:00,12,Mountain Bike,Merida,bike,699.99,3,2099.97,credit
13,2022-12-05,09:30,13,Seat Pack,Topeak,accessory,24.99,1,24.99,debit
14,2023-03-18,11:50,14,Triathlon Bike,Pinarello,bike,2999.99,1,2999.99,cash
15,2023-07-11,19:40,15,Pedals,Shimano,accessory,49.99,2,99.98,gift card
16,2022-08-29,10:15,16,Bike Computer,Garmin,accessory,199.99,1,199.99,credit
17,2022-04-19,20:00,17,Road Bike,Colnago,bike,1199.99,1,1199.99,debit
18,2023-09-06,15:30,18,Helmet,Giro,accessory,59.99,3,179.97,cash
19,2022-01-28,11:20,19,Hybrid Bike,Fuji,bike,699.99,2,1399.98,gift card
20,2023-10-03,14:10,20,Water Bottle,Hydrapak,accessory,14.99,1,14.99,credit
21,2022-06-13,18:25,21,Mountain Bike,Merida,bike,749.00,2,1498.00,debit
22,2023-05-14,12:35,22,Repair Stand,Park Tool,accessory,199.99,1,199.99,cash
23,2023-10-01,09:50,23,Road Bike,Specialized,bike,849.99,3,2549.97,gift card
24,2022-03-07,16:15,24,Gloves,Castelli,accessory,29.99,2,59.98,debit
25,2023-09-30,13:00,25,Bike Lock,Abus,accessory,39.99,1,39.99,cash
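
Generated data is worth checking against the constraints in the prompt before it is used. The following is a minimal sketch of such a check, assuming the CSV text is available as a string; the function name `validate_sales_csv` is hypothetical, and the rules simply mirror the prompt (incrementing id, date and time ranges, quantity range, amount = price * quantity, allowed payment types).

```python
import csv
import io
from datetime import date, time
from decimal import Decimal

ALLOWED_PAYMENT_TYPES = {"cash", "credit", "debit", "gift card"}


def validate_sales_csv(csv_text):
    """Return a list of human-readable problems found in the CSV text."""
    problems = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for expected_id, row in enumerate(reader, start=1):
        label = f"row {expected_id}"
        if int(row["id"]) != expected_id:
            problems.append(f"{label}: id is {row['id']}, expected {expected_id}")
        sale_date = date.fromisoformat(row["date"])
        if not date(2022, 1, 1) <= sale_date <= date(2023, 10, 15):
            problems.append(f"{label}: date {row['date']} is out of range")
        sale_time = time.fromisoformat(row["time"])
        if not time(6, 0) <= sale_time <= time(21, 0):
            problems.append(f"{label}: time {row['time']} is out of range")
        quantity = int(row["quantity"])
        if not 1 <= quantity <= 3:
            problems.append(f"{label}: quantity {quantity} is out of range")
        # Decimal avoids floating-point rounding errors in money arithmetic.
        if Decimal(row["price"]) * quantity != Decimal(row["amount"]):
            problems.append(f"{label}: amount does not equal price * quantity")
        if row["payment type"] not in ALLOWED_PAYMENT_TYPES:
            problems.append(f"{label}: unknown payment type {row['payment type']!r}")
    return problems
```

An empty list means the records satisfy every rule that was checked. Any problems found can be fed back to the model as a follow-up prompt.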

If needed, we could refine this further, for example by having a vision model add images of the products. We can then ask the LLM to create SQL statements that add the data to our database.
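
As an alternative to asking the LLM for SQL, the CSV can be loaded with a short script. The sketch below assumes SQLite as the database and a table named `sales`; both are illustrative choices, since the article does not say which database the app uses. The function name `load_sales` is likewise hypothetical.

```python
import csv
import io
import sqlite3

CREATE_TABLE_SQL = """
CREATE TABLE IF NOT EXISTS sales (
    id INTEGER PRIMARY KEY,
    date TEXT NOT NULL,
    time TEXT NOT NULL,
    product_id INTEGER NOT NULL,
    product TEXT NOT NULL,
    brand TEXT NOT NULL,
    type TEXT NOT NULL,
    price REAL NOT NULL,
    quantity INTEGER NOT NULL,
    amount REAL NOT NULL,
    payment_type TEXT NOT NULL
)
"""

INSERT_SQL = """
INSERT INTO sales VALUES (
    :id, :date, :time, :product_id, :product, :brand,
    :type, :price, :quantity, :amount, :payment_type
)
"""


def load_sales(connection, csv_text):
    """Create the sales table if needed and insert every CSV record into it."""
    connection.execute(CREATE_TABLE_SQL)
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # The CSV header uses a space; the column name uses an underscore.
        row["payment_type"] = row.pop("payment type")
        rows.append(row)
    # Parameterized inserts avoid building SQL from untrusted strings.
    with connection:
        connection.executemany(INSERT_SQL, rows)
    return len(rows)
```

For a quick test, `sqlite3.connect(":memory:")` gives a throwaway in-memory database to pass as `connection`.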

The example above uses simple prompting techniques. With more advanced techniques, we could generate increasingly complex, realistic synthetic data for various use cases. This article is a simple introduction to synthetic data. Now that you know the basics, we will go a bit deeper in upcoming articles. Stay tuned.

Neil Milliken 1y

Synthetic data can potentially be used to support accessibility and disability inclusion. Generative Adversarial Networks can help fill in gaps as we try to build new models. The challenge is that not everyone is mindful of disability when creating personas that are used to generate new synthetic data. We risk amplifying exclusion if we don’t build disability into the process.

Mitko Kolarov 1y

Very informative
