Stop Using “Hello World” Data: Generate Data Directly from Your Schemas
The "Cold Start" Data Problem
Building successful, robust data pipelines is nearly impossible without data. When architecting a new platform, tasks and user stories must be developed in parallel. Often, the core services that deliver raw data aren't available until shortly before the pipeline needs to be functional—because those downstream services are waiting on the very analytics your pipeline is supposed to provide.
Data engineers and architects are frequently forced to rely on the "hello world" equivalents of data objects. While there are great tools in Python, Go, Rust, etc., to generate random names or addresses, they usually provide simple, flat data. To create complex, nested structures that mirror what your new service will actually emit, there is often a lot of heavy lifting involved. Worse, there is no guarantee that your separate "fake data generator" will stay aligned with the evolving operational data models.
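To make that "heavy lifting" concrete, here is a minimal sketch (the field names, name pools, and helper are all hypothetical) of the hand-rolled generator teams end up writing when only a flat faker is available. Every nested field and every cardinality rule has to be wired up, and kept in sync with the real schema, by hand:

```python
import random
import uuid

# Hypothetical hand-rolled generator: every field below is maintained by
# hand and silently drifts the moment the real schema changes.
FIRST_NAMES = ["Alice", "Bruno", "Chloé"]
LAST_NAMES = ["Silva", "Martin", "Santos"]

def fake_user() -> dict:
    # A flat faker can give us names; the nesting and cardinality are on us.
    return {
        "id": str(uuid.uuid4()),
        "name": random.choice(FIRST_NAMES),
        "family_name": random.choice(LAST_NAMES),
        # Edge cases (e.g. zero phone numbers) must be remembered manually.
        "phone_numbers": [
            f"555-{random.randint(1000, 9999)}"
            for _ in range(random.randint(0, 3))
        ],
    }

print(fake_user())
```

Multiply this by every message in a large data model, and the maintenance burden becomes obvious.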
The Solution: Schema-Driven Generation
I’ve relied on protocol buffers for large-scale data models at several recent companies. If you are already using protocol buffers (or similar schemas like Avro), you have an ideal location to define your test data: the schema itself.
The benefits are clear:

- A single source of truth: the schema that defines your production data also defines your test data.
- No drift: when the data model evolves, the generated test data evolves with it.
- Complex, nested structures come for free, because generation follows the schema's own shape.
Introducing protoc-gen-fake
I built protoc-gen-fake, an open-source protoc plugin, to solve this exact problem. It allows you to use custom Protobuf options to define exactly how your test data should look.
Here is a standard, simple Protobuf message:
syntax = "proto3";
package examples;
message User {
  string id = 1;
  string name = 2;
  string family_name = 3;
  repeated string phone_numbers = 4;
}
This models a User, but it doesn't tell us much about the content. A UUID looks different from a name, and we need to test edge cases (e.g., what if a user has no phone numbers?).
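Why does the empty-list case matter? Here is a sketch (the function and data are hypothetical) of the kind of downstream code that only fails when a user arrives with no phone numbers, which is exactly the case hand-picked test data tends to omit:

```python
def primary_phone(user: dict) -> str:
    # Naive code would do `user["phone_numbers"][0]` and raise IndexError
    # for a user with no phone numbers. The guard below is the fix you
    # typically only write after test data has surfaced the empty case.
    phones = user.get("phone_numbers", [])
    if not phones:
        return ""
    return phones[0]

print(primary_phone({"phone_numbers": ["(215) 419-9077"]}))
print(primary_phone({"phone_numbers": []}))
```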
Here is that same schema decorated with protoc-gen-fake options:
syntax = "proto3";
package examples;
import "gen_fake/fake_field.proto";
message User {
  // Opt-in this message for generation
  option (gen_fake.fake_msg).include = true;

  string id = 1 [(gen_fake.fake_data).data_type = "SafeEmail"];

  string name = 2 [(gen_fake.fake_data) = {
    data_type: "FirstName"
    language: "FR_FR"
  }];

  string family_name = 3 [(gen_fake.fake_data) = {
    data_type: "LastName"
    language: "PT_BR"
  }];

  repeated string phone_numbers = 4 [(gen_fake.fake_data) = {
    data_type: "PhoneNumber"
    min_count: 0
    max_count: 3
  }];
}
With this configuration, we’ve mandated that the id should look like an email, the names should be drawn from specific language locales (French and Brazilian Portuguese), and the phone number list can range from empty to 3 entries.
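A nice side effect of declaring these rules in the schema is that they double as assertions. Here is a quick sketch (pure Python; the field names come from the example above, and the email regex is my own assumption, not part of the plugin) that checks a generated user against the declared constraints:

```python
import re

def check_user(user: dict) -> None:
    # id was declared as SafeEmail, so it should look like an email address.
    assert re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", user["id"]), user["id"]
    # phone_numbers was declared with min_count: 0 and max_count: 3.
    assert 0 <= len(user["phone_numbers"]) <= 3

check_user({
    "id": "regis@example.com",
    "name": "Régis",
    "family_name": "Batista",
    "phone_numbers": ["(215) 419-9077"],
})
print("constraints hold")
```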
Generating the Data
The plugin works like any other Protobuf plugin (e.g., buf validate or gen-bq-schema). To create a binary file of fake data based on the rules above, you run:
> protoc --fake_out . -I proto proto/examples/simple_user.proto --fake_opt output_path=./my_fake_data
If you have worked with Protobuf compilation before, you’ll recognize the invocation: the plugin is called exactly like the plugins that generate Go or Python parsing code.
Let’s inspect the output using Python to see the generated object:
> python
Python 3.12.11 (main, Jun 6 2025, 23:18:08) [Clang 16.0.0 (clang-1600.0.26.6)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import examples.simple_user_pb2 as user
>>> fake_user = user.User()
>>> with open("my_fake_data/simple_user.bin", "rb") as fh:
... data = fh.read()
...
>>> fake_user.ParseFromString(data)
74
>>> fake_user
name: "Régis"
family_name: "Batista"
phone_numbers: "1-377-506-3783 x179"
phone_numbers: "924.093.6602 x6795"
phone_numbers: "(215) 419-9077"
>>>
There is the fake data in the glory of its true schema. A few things to notice:

- ParseFromString returns the number of bytes consumed (74 here), confirming the whole message was parsed.
- name is a French first name and family_name a Brazilian Portuguese surname, exactly as the locale options specified.
- phone_numbers came back with three entries, the maximum allowed by max_count: 3; because min_count is 0, another run can just as easily produce an empty list.
Complex Structures
The tool can handle much more complex scenarios, and the repository contains examples demonstrating them.
Why this matters
This simple command can be run repeatedly to generate data at high volume, or to quickly produce datasets that span every shape your schema allows.
It is a lightweight, accurate way to seed your pipelines without needing a single upstream service to be online.
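In practice, seeding a pipeline from the generated binaries is a short loop. Here is a hedged sketch (the directory layout and helper name are my assumptions): any protoc-generated message class works, since they all expose the standard ParseFromString API shown in the REPL session above.

```python
from pathlib import Path

def load_fake_messages(directory: str, message_cls) -> list:
    """Parse every .bin file in `directory` into message_cls instances."""
    messages = []
    for path in sorted(Path(directory).glob("*.bin")):
        msg = message_cls()
        # Standard protobuf API: parse the serialized bytes in place.
        msg.ParseFromString(path.read_bytes())
        messages.append(msg)
    return messages

# Usage (assumes protoc has generated examples/simple_user_pb2.py):
# import examples.simple_user_pb2 as user
# users = load_fake_messages("my_fake_data", user.User)
```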
While I built this with Data Engineering in mind, it is immediately applicable to any engineer working with gRPC or Protobuf-based architectures.
Check out the repository here: https://github.com/lazarillo/protoc-gen-fake