Stop Using “Hello World” Data: Generate Data Directly from Your Schemas
The "Cold Start" Data Problem
Building successful, robust data pipelines is nearly impossible without data. When architecting a new platform, tasks and user stories must be developed in parallel. Often, the core services that deliver raw data aren't available until shortly before the pipeline needs to be functional—because those downstream services are waiting on the very analytics your pipeline is supposed to provide.
Data engineers and architects are frequently forced to rely on the "hello world" equivalents of data objects. While there are great tools in Python, Go, Rust, etc., to generate random names or addresses, they usually provide simple, flat data. To create complex, nested structures that mirror what your new service will actually emit, there is often a lot of heavy lifting involved. Worse, there is no guarantee that your separate "fake data generator" will stay aligned with the evolving operational data models.
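To make that "heavy lifting" concrete, here is a minimal sketch (the field names, name pools, and helper are all hypothetical) of the hand-rolled generator teams end up writing when only a flat faker is available. Every nested field and every cardinality rule has to be wired up, and kept in sync with the real schema, by hand:

```python
import random
import uuid

# Hypothetical hand-rolled generator: every field below is maintained by
# hand and silently drifts the moment the real schema changes.
FIRST_NAMES = ["Alice", "Bruno", "Chloé"]
LAST_NAMES = ["Silva", "Martin", "Santos"]

def fake_user() -> dict:
    # A flat faker can give us names; the nesting and cardinality are on us.
    return {
        "id": str(uuid.uuid4()),
        "name": random.choice(FIRST_NAMES),
        "family_name": random.choice(LAST_NAMES),
        # Edge cases (e.g. zero phone numbers) must be remembered manually.
        "phone_numbers": [
            f"555-{random.randint(1000, 9999)}"
            for _ in range(random.randint(0, 3))
        ],
    }

print(fake_user())
```

Multiply this by every message in a large data model, and the maintenance burden becomes obvious.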
The Solution: Schema-Driven Generation
I’ve relied on protocol buffers for large-scale data models at several recent companies. If you are already using protocol buffers (or similar schemas like Avro), you have an ideal location to define your test data: the schema itself.
The benefits are clear:

- A single source of truth: the schema that defines your production data also defines your test data.
- No drift: when the data model evolves, the generated test data evolves with it.
- Complex, nested structures come for free, because generation follows the schema's own shape.
Introducing protoc-gen-fake
I built protoc-gen-fake, an open-source protoc plugin, to solve this exact problem. It allows you to use custom Protobuf options to define exactly how your test data should look.
Here is a standard, simple Protobuf message:
syntax = "proto3";
package examples;
message User {
  string id = 1;
  string name = 2;
  string family_name = 3;
  repeated string phone_numbers = 4;
}
This models a User, but it doesn't tell us much about the content. A UUID looks different from a name, and we need to test edge cases (e.g., what if a user has no phone numbers?).
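Why does the empty-list case matter? Here is a sketch (the function and data are hypothetical) of the kind of downstream code that only fails when a user arrives with no phone numbers, which is exactly the case hand-picked test data tends to omit:

```python
def primary_phone(user: dict) -> str:
    # Naive code would do `user["phone_numbers"][0]` and raise IndexError
    # for a user with no phone numbers. The guard below is the fix you
    # typically only write after test data has surfaced the empty case.
    phones = user.get("phone_numbers", [])
    if not phones:
        return ""
    return phones[0]

print(primary_phone({"phone_numbers": ["(215) 419-9077"]}))
print(primary_phone({"phone_numbers": []}))
```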
Here is that same schema decorated with protoc-gen-fake options:
syntax = "proto3";
package examples;
import "gen_fake/fake_field.proto";
message User {
  // Opt-in this message for generation
  option (gen_fake.fake_msg).include = true;

  string id = 1 [(gen_fake.fake_data).data_type = "SafeEmail"];

  string name = 2 [(gen_fake.fake_data) = {
    data_type: "FirstName"
    language: "FR_FR"
  }];

  string family_name = 3 [(gen_fake.fake_data) = {
    data_type: "LastName"
    language: "PT_BR"
  }];

  repeated string phone_numbers = 4 [(gen_fake.fake_data) = {
    data_type: "PhoneNumber"
    min_count: 0
    max_count: 3
  }];
}
With this configuration, we’ve mandated that the id should look like an email, the names should be drawn from specific language locales (French and Brazilian Portuguese), and the phone number list can range from empty to 3 entries.
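A nice side effect of declaring these rules in the schema is that they double as assertions. Here is a quick sketch (pure Python; the field names come from the example above, and the email regex is my own assumption, not part of the plugin) that checks a generated user against the declared constraints:

```python
import re

def check_user(user: dict) -> None:
    # id was declared as SafeEmail, so it should look like an email address.
    assert re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", user["id"]), user["id"]
    # phone_numbers was declared with min_count: 0 and max_count: 3.
    assert 0 <= len(user["phone_numbers"]) <= 3

check_user({
    "id": "regis@example.com",
    "name": "Régis",
    "family_name": "Batista",
    "phone_numbers": ["(215) 419-9077"],
})
print("constraints hold")
```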
Generating the Data
The plugin works like any other Protobuf plugin (e.g., buf validate or gen-bq-schema). To create a binary file of fake data based on the rules above, you run:
> protoc --fake_out . -I proto proto/examples/simple_user.proto --fake_opt output_path=./my_fake_data
If you have worked with Protobuf compilation before, you’ll recognize the invocation: the plugin is called exactly like the plugins that generate Go or Python parsing code.
Let’s inspect the output using Python to see the generated object:
> python
Python 3.12.11 (main, Jun 6 2025, 23:18:08) [Clang 16.0.0 (clang-1600.0.26.6)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import examples.simple_user_pb2 as user
>>> fake_user = user.User()
>>> with open("my_fake_data/simple_user.bin", "rb") as fh:
... data = fh.read()
...
>>> fake_user.ParseFromString(data)
74
>>> fake_user
name: "Régis"
family_name: "Batista"
phone_numbers: "1-377-506-3783 x179"
phone_numbers: "924.093.6602 x6795"
phone_numbers: "(215) 419-9077"
>>>
There is the fake data in the glory of its true schema. A few things to notice:

- ParseFromString returns the number of bytes consumed (74 here), confirming the whole message was parsed.
- name is a French first name and family_name a Brazilian Portuguese surname, exactly as the locale options specified.
- phone_numbers came back with three entries, the maximum allowed by max_count: 3; because min_count is 0, another run can just as easily produce an empty list.
Complex Structures
The tool can handle much more complex scenarios, and the repository contains examples demonstrating them.
Why this matters
This simple command can be run repeatedly to generate data at high volume, or to quickly produce datasets that span every shape your schema allows.
It is a lightweight, accurate way to seed your pipelines without needing a single upstream service to be online.
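In practice, seeding a pipeline from the generated binaries is a short loop. Here is a hedged sketch (the directory layout and helper name are my assumptions): any protoc-generated message class works, since they all expose the standard ParseFromString API shown in the REPL session above.

```python
from pathlib import Path

def load_fake_messages(directory: str, message_cls) -> list:
    """Parse every .bin file in `directory` into message_cls instances."""
    messages = []
    for path in sorted(Path(directory).glob("*.bin")):
        msg = message_cls()
        # Standard protobuf API: parse the serialized bytes in place.
        msg.ParseFromString(path.read_bytes())
        messages.append(msg)
    return messages

# Usage (assumes protoc has generated examples/simple_user_pb2.py):
# import examples.simple_user_pb2 as user
# users = load_fake_messages("my_fake_data", user.User)
```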
While I built this with Data Engineering in mind, it is immediately applicable to any engineer working with gRPC or Protobuf-based architectures.
Check out the repository here: https://github.com/lazarillo/protoc-gen-fake