What is market data and how to generate it with Python

Abstract

Our trading software will certainly consume market data. The most straightforward use is to analyze the market and produce signals for our trading system. We can also collect large amounts of data to build analytical models, train ML algorithms, prepare reports, and more.

Market data usually comes from data providers, exchanges, and other sources. In this article we will learn how to generate synthetic market data. Later we will receive data from external providers and build our own data storage system for educational purposes.

What is market data

Broadly speaking, market data means prices. It consists of facts about an executed trade: the exchange, the asset, the price, the quantity, the trade timestamp, and more. We will look at the two most common types of market data: OHLCV and ticks.

OHLCV

First, some definitions. OHLCV is market data aggregated into blocks. It is not very precise, but it gives traders a general understanding of what happened in the market.

OHLCV stands for Open, High, Low, Close, Volume. It summarizes the trades executed during a period of time (e.g., 1 minute, 15 minutes, 1 hour, 1 day).

For example, take the BTC/USDT market with OHLCV data gathered over one-day periods. During such a period, thousands of trades can be processed on large exchanges like Binance. OHLCV gives us only a summary: the first trade was at $19,123 (O), the highest price during the period was $19,347 (H), the lowest price was $18,900 (L), the last price was $19,041 (C), and the total volume traded during the period was 223,500 (V).
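
The aggregation described above can be sketched in a few lines. This is a minimal illustration, not code from the article's repository: the function name `aggregate_ohlcv` and the sample trades are hypothetical, and the trades are assumed to be sorted by time.

```python
from typing import Dict, List, Tuple


def aggregate_ohlcv(trades: List[Tuple[float, float]]) -> Dict[str, float]:
    """Collapse chronologically ordered (price, quantity) trades into one OHLCV bar."""
    if not trades:
        raise ValueError("need at least one trade to build a bar")
    prices = [price for price, _ in trades]
    return {
        "open": prices[0],    # price of the first trade in the period
        "high": max(prices),  # highest traded price
        "low": min(prices),   # lowest traded price
        "close": prices[-1],  # price of the last trade in the period
        "volume": sum(quantity for _, quantity in trades),
    }


# Hypothetical trades, ordered by time: (price in USDT, quantity in BTC)
trades = [(19123.0, 0.5), (19347.0, 1.0), (18900.0, 0.25), (19041.0, 2.0)]
bar = aggregate_ohlcv(trades)
print(bar)
```

Note how much detail is lost: four trades (or four thousand) become five numbers, which is exactly the trade-off between OHLCV and tick data.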

Each block is also marked with a timestamp, so we can easily tell when those trades were executed. In our example, data was gathered from October 22, 2021, to October 20, 2022.

OHLCV data example from Binance

Ticks

Tick data is information about each individual trade that happened on an exchange. Those who need more detail about the market prefer tick data.

In this example, we connected to the dYdX decentralized exchange via WebSocket. Our tick data consists of the timestamp of each trade, the symbol (e.g., BTC/USD), the side (BUY or SELL), and the amount (in BTC).

tick data example from dYdX websocket

Synthetic data

As our goal is to understand the pros and cons of different data consumption, storage, and manipulation approaches, we will need a lot of data. We will use synthetic data generation as it’s pretty easy to generate 10 or 10 million data samples with the same function.

First of all, let’s define some basic data classes to work with.

class OHLCV:
    '''one OHLCV block for a single time period'''
    def __init__(self, timestamp, open, high, low, close, volume):
        self.timestamp = timestamp
        self.open = open
        self.high = high
        self.low = low
        self.close = close
        self.volume = volume


class Tick:
    '''one executed trade'''
    def __init__(self, timestamp, symbol, side, amount, price, exchange):
        self.timestamp = timestamp
        self.symbol = symbol
        self.side = side
        self.amount = amount
        self.price = price
        self.exchange = exchange
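
As an aside, the same containers could also be written with Python's standard `dataclasses` module, which generates `__init__`, `__repr__` and equality checks automatically. The sketch below is an alternative, not the article's code; the names `OHLCVRecord` and `TickRecord` and all sample values are hypothetical.

```python
from dataclasses import asdict, dataclass


@dataclass
class OHLCVRecord:
    timestamp: float
    open: float
    high: float
    low: float
    close: float
    volume: float


@dataclass
class TickRecord:
    timestamp: float
    symbol: str
    side: str
    amount: float
    price: float
    exchange: str


# Hypothetical values, for illustration only.
tick = TickRecord(timestamp=1666224000.0, symbol="BTC/USD", side="BUY",
                  amount=0.01, price=19041.0, exchange="dydx")
print(tick)          # readable repr generated by the dataclass
print(asdict(tick))  # plain dict, convenient for building tables later
```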


Now we will create one more class, DataSample. It will be used to generate synthetic data. We will implement three methods for now:

- generate() to generate a single data sample

- build() to generate a dataset by calling generate() repeatedly

- plot() to visualize our data

class DataSample:
    def __init__(self):
        pass


    def generate(self):
        '''generates 1 data sample'''
        pass


    def build(self):
        '''builds a dataframe of data samples'''
        pass


    def plot(self):
        '''visualizes our dataset'''
        pass        

Now we will build two more classes, DataSampleOHLCV and DataSampleTicks. Both implement the generate, build and plot methods.

The generate function will use the random library so that our prices look more natural: each newly generated price will differ slightly from the previous one.
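
The idea of each price being slightly different from the previous one is essentially a random walk. Here is a minimal sketch of it, assuming Python's standard `random` module; the function name `generate_prices`, the starting price and the 0.1% maximum step are illustrative choices, not values taken from the repository.

```python
import random
from typing import List, Optional


def generate_prices(n: int, start: float = 19000.0, max_step: float = 0.001,
                    seed: Optional[int] = None) -> List[float]:
    """Return n prices; each differs from the previous one by at most max_step (a fraction)."""
    rng = random.Random(seed)  # a seed makes the series reproducible
    prices = [start]
    while len(prices) < n:
        change = rng.uniform(-max_step, max_step)
        prices.append(prices[-1] * (1 + change))
    return prices[:n]


prices = generate_prices(10000, seed=7)
print(prices[:5])
```

Because the same function serves any `n`, producing 10 or 10 million samples only changes the argument. Multiplicative steps keep the price positive, which an additive walk would not guarantee.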

DataSampleOHLCV and DataSampleTicks code is available in my [GitHub](https://github.com/Jungle-Sven/efficient_trading_software/blob/main/4_synthetic_data.py) repo.

Finally, one more class: DataGenerator. It uses both the DataSampleOHLCV and DataSampleTicks classes and works as a single interface for generating synthetic data.

class DataGenerator:
    def __init__(self):
        self.ticks = DataSampleTicks()
        self.ohlcv = DataSampleOHLCV()


    def run_ticks(self):
        '''builds and plots 10k synthetic ticks'''
        data = self.ticks.build(n=10000)
        self.ticks.plot(data)
        return data


    def run_ohlcv(self):
        '''builds and plots 1k synthetic OHLCV samples'''
        data = self.ohlcv.build(n=1000)
        self.ohlcv.plot(data)
        return data
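
The real `build` methods live in the repository. To give a feel for what an OHLCV `build(n=1000)` call might do internally, here is a rough, standalone sketch. Everything in it is an assumption made for illustration: the function name `build_ohlcv`, the step sizes, the volume range, and the use of plain dicts instead of a dataframe.

```python
import random


def build_ohlcv(n, start_price=19000.0, seed=None):
    """Return n synthetic bars as dicts; each bar opens at the previous close."""
    rng = random.Random(seed)
    bars = []
    last_close = start_price
    for i in range(n):
        open_ = last_close
        close = open_ * (1 + rng.uniform(-0.01, 0.01))
        # Wicks extend past the open/close range, so high >= both and low <= both.
        high = max(open_, close) * (1 + rng.uniform(0, 0.005))
        low = min(open_, close) * (1 - rng.uniform(0, 0.005))
        bars.append({
            "timestamp": i,  # bar index used as a placeholder timestamp
            "open": open_,
            "high": high,
            "low": low,
            "close": close,
            "volume": rng.uniform(100, 1000),
        })
        last_close = close
    return bars


bars = build_ohlcv(1000, seed=42)
print(bars[0])
```

The two properties worth preserving in any generator of this kind are that consecutive bars connect (open equals the previous close) and that high and low always bracket open and close.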

Now let's run the code and see what happens!

Generated OHLCV, 1k samples.

synthetic OHLCV data

Generated tick data, 10k samples.

synthetic tick data

The code is available in my [GitHub](https://github.com/Jungle-Sven/efficient_trading_software/blob/main/4_synthetic_data.py) repo.

#python #marketdata #syntheticdata

More articles by Andrii Zakharkin
