No Real Data? No Problem!
Data plays an important role in every business. It helps make informed decisions, develop new products, and improve services. However, using real data comes with challenges. Privacy regulations restrict how data can be used, data can be hard to access, and sometimes there simply isn't enough data available. This is where synthetic data becomes valuable.
What is Synthetic Data?
Synthetic data is artificially generated data that behaves like real data but does not contain any personal information. This makes synthetic data a safe and practical option for many applications, allowing businesses to work more efficiently without the risks associated with real data.
For instance, hospitals can use synthetic data to analyse patient records while complying with privacy regulations. This allows them to gain insights and improve healthcare practices without putting sensitive information at risk. Similarly, a small retailer might use synthetic data to train an AI model that recommends products to customers. With synthetic data, they can improve their AI system more quickly, ultimately providing a better customer experience.
Generative AI: The Engine Behind Synthetic Data
Generative AI is well suited to generating synthetic data. It learns from real data and then creates new, similar data, so it can produce realistic, privacy-safe versions of the data needed for training and testing. Tools like ChatGPT make this process easier: just provide a few real-world examples, and the AI can generate more data that fits the same patterns.
Our generative AI platform, Liza, has proven very effective at generating synthetic datasets for our projects. Even though our use of synthetic data in Liza is straightforward, it has been very useful for accelerating customer onboarding. For example, when we were testing a Retrieval-Augmented Generation (RAG) system, we used Liza to generate 100 questions based on documents, user personas, and customer expectations.
We then used those questions to create answers grounded in the original documents, producing an extensive test dataset for the RAG system. With these datasets we can automate quality evaluation, which gives us input on which model, model configuration, and prompts to choose. It also relieves customers of the burden of coming up with a large variety of questions their users might ask.
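To make the evaluation step more concrete, here is a minimal sketch of how a synthetic question-and-answer set could be used to compare configurations. It is not how Liza works internally. The names `EvalCase`, `token_recall`, and `evaluate` are hypothetical, the test case is made up for illustration, and the word-overlap score is only a simple stand-in for whatever quality metric a real evaluation would use.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EvalCase:
    """One synthetic test case: a question and its grounded reference answer."""
    question: str
    reference_answer: str


def token_recall(candidate: str, reference: str) -> float:
    """Fraction of distinct reference words that also appear in the candidate.

    A deliberately simple stand-in metric (assumption, not from the article).
    """
    reference_words = set(reference.lower().split())
    if not reference_words:
        return 0.0
    candidate_words = set(candidate.lower().split())
    return len(reference_words & candidate_words) / len(reference_words)


def evaluate(answer_fn: Callable[[str], str], cases: List[EvalCase]) -> float:
    """Average score of one configuration (answer_fn) over all test cases."""
    if not cases:
        return 0.0
    scores = [token_recall(answer_fn(c.question), c.reference_answer) for c in cases]
    return sum(scores) / len(scores)


# Made-up example case; a real set would hold the generated questions.
cases = [EvalCase("What is the return period?", "Returns are accepted within 30 days")]


# Two hypothetical configurations, stubbed out instead of calling a real model.
def config_a(question: str) -> str:
    return "Returns are accepted within 30 days"


def config_b(question: str) -> str:
    return ""


score_a = evaluate(config_a, cases)
score_b = evaluate(config_b, cases)
```

Running the same set of cases against each candidate model, configuration, or prompt gives comparable scores, which is what makes the choice between them less of a guess.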
Let's create a simple example
Suppose we have built a new sales forecasting application for bike shops and need synthetic data to populate its database for testing. We could start by prompting a generative AI tool like OpenAI’s ChatGPT for some data:
Create a CSV file with 25 random sales records for a bike shop.
Each record should include the following fields:
- id (incrementing integer starting at 1)
- date (random date between 1/1/2022 and 15/10/2023)
- time (random time between 6:00am and 9:00pm in 1-minute increments)
- product_id (incrementing integer starting at 1)
- product
- brand (use real world bike brands)
- type (bike, accessory etc)
- price in EURO
- quantity (random integer between 1 and 3)
- amount (price * quantity)
- payment type (cash, credit, debit, or gift card)
The generated output is as follows:
id,date,time,product_id,product,brand,type,price,quantity,amount,payment type
1,2022-03-15,10:23,1,Road Bike,Specialized,bike,899.99,2,1799.98,credit
2,2022-07-21,14:45,2,Mountain Bike,Trek,bike,749.50,1,749.50,debit
3,2022-11-09,09:15,3,Hybrid Bike,Giant,bike,599.99,3,1799.97,cash
4,2022-05-05,16:30,4,Helmet,Bontrager,accessory,49.99,2,99.98,gift card
5,2023-01-12,11:05,5,Water Bottle,CamelBak,accessory,19.99,1,19.99,cash
6,2023-08-22,18:20,6,Mountain Bike,Scott,bike,899.00,1,899.00,debit
7,2023-04-30,12:00,7,Repair Kit,Park Tool,accessory,29.99,3,89.97,credit
8,2022-09-14,15:30,8,Electric Bike,Rad Power,bike,1499.99,1,1499.99,credit
9,2023-06-01,08:45,9,Bike Lock,Kryptonite,accessory,39.99,2,79.98,debit
10,2022-10-17,13:00,10,Road Bike,Cannondale,bike,949.99,1,949.99,gift card
11,2023-02-20,17:10,11,Gloves,Pearl Izumi,accessory,29.99,2,59.98,cash
12,2023-05-25,14:00,12,Mountain Bike,Merida,bike,699.99,3,2099.97,credit
13,2022-12-05,09:30,13,Seat Pack,Topeak,accessory,24.99,1,24.99,debit
14,2023-03-18,11:50,14,Triathlon Bike,Pinarello,bike,2999.99,1,2999.99,cash
15,2023-07-11,19:40,15,Pedals,Shimano,accessory,49.99,2,99.98,gift card
16,2022-08-29,10:15,16,Bike Computer,Garmin,accessory,199.99,1,199.99,credit
17,2022-04-19,20:00,17,Road Bike,Colnago,bike,1199.99,1,1199.99,debit
18,2023-09-06,15:30,18,Helmet,Giro,accessory,59.99,3,179.97,cash
19,2022-01-28,11:20,19,Hybrid Bike,Fuji,bike,699.99,2,1399.98,gift card
20,2023-10-03,14:10,20,Water Bottle,Hydrapak,accessory,14.99,1,14.99,credit
21,2022-06-13,18:25,21,Mountain Bike,Merida,bike,749.00,2,1498.00,debit
22,2023-05-14,12:35,22,Repair Stand,Park Tool,accessory,199.99,1,199.99,cash
23,2023-10-01,09:50,23,Road Bike,Specialized,bike,849.99,3,2549.97,gift card
24,2022-03-07,16:15,24,Gloves,Castelli,accessory,29.99,2,59.98,debit
25,2023-09-30,13:00,25,Bike Lock,Abus,accessory,39.99,1,39.99,cash
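Generated data is worth checking against the rules in the prompt before it goes into a test database. The following is a minimal sketch using only the Python standard library. The function name `validate_sales_csv` is hypothetical, and `SAMPLE` holds just the first three rows of the output above; in practice the full file would be read instead.

```python
import csv
import io
from datetime import date, time
from decimal import Decimal

# First three rows of the generated output, used here as sample input.
SAMPLE = """id,date,time,product_id,product,brand,type,price,quantity,amount,payment type
1,2022-03-15,10:23,1,Road Bike,Specialized,bike,899.99,2,1799.98,credit
2,2022-07-21,14:45,2,Mountain Bike,Trek,bike,749.50,1,749.50,debit
3,2022-11-09,09:15,3,Hybrid Bike,Giant,bike,599.99,3,1799.97,cash
"""

PAYMENT_TYPES = {"cash", "credit", "debit", "gift card"}


def validate_sales_csv(text):
    """Return a list of human-readable problems found in the CSV text."""
    problems = []
    rows = csv.DictReader(io.StringIO(text))
    for expected_id, row in enumerate(rows, start=1):
        label = f"row {expected_id}"
        # id must increment from 1.
        if int(row["id"]) != expected_id:
            problems.append(f"{label}: id is {row['id']}, expected {expected_id}")
        # date must fall between 1/1/2022 and 15/10/2023.
        day = date.fromisoformat(row["date"])
        if not date(2022, 1, 1) <= day <= date(2023, 10, 15):
            problems.append(f"{label}: date {day} is out of range")
        # time must fall between 6:00am and 9:00pm.
        sold_at = time.fromisoformat(row["time"])
        if not time(6, 0) <= sold_at <= time(21, 0):
            problems.append(f"{label}: time {sold_at} is out of range")
        # quantity must be an integer between 1 and 3.
        quantity = int(row["quantity"])
        if not 1 <= quantity <= 3:
            problems.append(f"{label}: quantity {quantity} is out of range")
        # amount must equal price * quantity; Decimal avoids float rounding.
        if Decimal(row["price"]) * quantity != Decimal(row["amount"]):
            problems.append(f"{label}: amount does not equal price * quantity")
        if row["payment type"] not in PAYMENT_TYPES:
            problems.append(f"{label}: unknown payment type {row['payment type']!r}")
    return problems


problems = validate_sales_csv(SAMPLE)
```

An empty list means every row satisfies the constraints from the prompt. Any entries point to rows that should be regenerated or corrected.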
If needed, we could refine this further. For example, a vision model could add images of the products, and we could then ask the LLM to create SQL statements that add the data to our database.
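As an alternative to asking the LLM for SQL statements, the CSV can be loaded with a few lines of code. The sketch below assumes an SQLite database and a hypothetical `sales` table whose columns mirror the CSV header; the real application's schema may look different. `SAMPLE` again contains only the first three generated rows.

```python
import csv
import io
import sqlite3

# First three rows of the generated output, used here as sample input.
SAMPLE = """id,date,time,product_id,product,brand,type,price,quantity,amount,payment type
1,2022-03-15,10:23,1,Road Bike,Specialized,bike,899.99,2,1799.98,credit
2,2022-07-21,14:45,2,Mountain Bike,Trek,bike,749.50,1,749.50,debit
3,2022-11-09,09:15,3,Hybrid Bike,Giant,bike,599.99,3,1799.97,cash
"""


def load_sales(conn, text):
    """Create the (hypothetical) sales table and insert every CSV row."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS sales (
               id INTEGER PRIMARY KEY,
               date TEXT NOT NULL,
               time TEXT NOT NULL,
               product_id INTEGER NOT NULL,
               product TEXT NOT NULL,
               brand TEXT NOT NULL,
               type TEXT NOT NULL,
               price REAL NOT NULL,
               quantity INTEGER NOT NULL,
               amount REAL NOT NULL,
               payment_type TEXT NOT NULL
           )"""
    )
    rows = [
        (
            int(r["id"]), r["date"], r["time"], int(r["product_id"]),
            r["product"], r["brand"], r["type"], float(r["price"]),
            int(r["quantity"]), float(r["amount"]), r["payment type"],
        )
        for r in csv.DictReader(io.StringIO(text))
    ]
    # Parameterized inserts avoid quoting problems in product or brand names.
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)", rows)
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")  # in-memory database, for illustration only
inserted = load_sales(conn, SAMPLE)
```

Using parameterized inserts rather than generated SQL text keeps the loading step deterministic, so the LLM is only responsible for the data itself.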
The example above uses simple prompting techniques. With more advanced techniques, we could generate increasingly complex and realistic synthetic data for various use cases. This article is a simple introduction to synthetic data. Now that you know the basics, we will go a bit deeper in upcoming articles. Stay tuned.
Synthetic data can potentially be used to support accessibility and disability inclusion. Generative Adversarial Networks can help fill in gaps as we try to build new models. The challenge is that not everyone is mindful of disability when creating personas that are used to generate new synthetic data. We risk amplifying exclusion if we don’t build disability into the process.