9000% speed up in data access at no extra cost!

For the trading robot, I've now converted the trading data from CSV files into memory-mapped f32 array files. This has given us much higher speed and simplicity for back-testing.

Old format

$ exa --tree -sname
.
├── histdata.com-raw
│  └── EURUSD
│     ├── 2018
│     │  ├── DAT_ASCII_EURUSD_T_201801.csv
│     │  ├── DAT_ASCII_EURUSD_T_201802.csv
│     │  ├── DAT_ASCII_EURUSD_T_201803.csv
│     │  ├── DAT_ASCII_EURUSD_T_201804.csv
│     │  ├── DAT_ASCII_EURUSD_T_201805.csv
│     │  ├── DAT_ASCII_EURUSD_T_201806.csv
│     │  ├── DAT_ASCII_EURUSD_T_201807.csv
│     │  ├── DAT_ASCII_EURUSD_T_201808.csv
│     │  ├── DAT_ASCII_EURUSD_T_201809.csv
│     │  ├── DAT_ASCII_EURUSD_T_201810.csv
│     │  ├── DAT_ASCII_EURUSD_T_201811.csv
│     │  ├── DAT_ASCII_EURUSD_T_201812.csv
│     ├── 2019
│     │  ├── DAT_ASCII_EURUSD_T_201901.csv
│     │  ├── DAT_ASCII_EURUSD_T_201902.csv
│     │  ├── DAT_ASCII_EURUSD_T_201903.csv
...

New format

Now we have three files per currency pair.

└── mmaps
   └── EURUSD
      ├── asks.dat
      ├── bids.dat
      └── timestamps.dat
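The three files are parallel arrays: the i-th entry of each file describes the same tick. A minimal sketch of how one tick could be reassembled from the three slices (the `Tick` struct and `tick_at` helper are mine for illustration, not the project's actual code):

```rust
/// One tick reassembled from the three parallel arrays.
/// (Hypothetical struct; the real project may model this differently.)
#[derive(Debug, PartialEq)]
struct Tick {
    timestamp: f32,
    bid: f32,
    ask: f32,
}

/// Index i refers to the same tick in all three slices.
fn tick_at(timestamps: &[f32], bids: &[f32], asks: &[f32], i: usize) -> Tick {
    Tick {
        timestamp: timestamps[i],
        bid: bids[i],
        ask: asks[i],
    }
}

fn main() {
    let timestamps = [1.0f32, 2.0, 3.0];
    let bids = [1.0879f32, 1.0880, 1.0881];
    let asks = [1.0880f32, 1.0881, 1.0882];
    println!("{:?}", tick_at(&timestamps, &bids, &asks, 1));
}
```

This struct-of-arrays layout is what makes the whole scheme work: each file is a homogeneous run of f32s, so it can be mapped and indexed directly with no parsing.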

Space saving

We're still using ZFS compression, but it turns out the giant binary files actually compress better than the .csv files. Remember, this is just two years of data from one instrument so far:

$ du -hs histdata.com-raw/
707M	histdata.com-raw/

$ du -hs mmaps/
501M	mmaps/

Speed saving

This program generates the stats from the data. Before the conversion, parsing the CSV files (second run, so it gets some disk-caching advantage):

$ time cargo run --release
    Finished release [optimized] target(s) in 0.02s
     Running `/home/matiu/projects/trader/code/target/release/stats`
min_bid: 1.0879
max_bid: 1.2555
min_ask: 1.08793
max_ask: 1.25555
volume: 47546043

real	0m14.733s
user	0m13.668s
sys	0m1.064s

After the conversion, using the giant memory-mapped files:

$ time cargo run --release
    Finished release [optimized] target(s) in 0.04s
     Running `/home/matiu/projects/trader/code/target/release/stats`
min_bid: 1.0879
max_bid: 1.2555
min_ask: 1.08793
max_ask: 1.25555
volume: 47546043

real	0m0.162s
user	0m0.124s
sys	0m0.039s

That is a speed-up of about 90 times (or 9094%).

Mainly due to:

  1. Not having to UTF-8 parse the CSV files (Rust's fault :P)
  2. Memory map powers
  3. Not having to count the volume; we just derive it from the size of the file
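Point 3 is just arithmetic on the file length: each tick is one f32 (4 bytes), so the tick count is the byte length divided by `size_of::<f32>()`. A sketch (the file path and `volume_of` helper are hypothetical):

```rust
use std::fs;
use std::io::Write;

/// Number of f32 ticks in a raw array file: byte length / 4.
fn volume_of(path: &str) -> std::io::Result<u64> {
    Ok(fs::metadata(path)?.len() / std::mem::size_of::<f32>() as u64)
}

fn main() -> std::io::Result<()> {
    // Write 5 floats (20 bytes) to a scratch file to demonstrate.
    let path = "/tmp/bids_demo.dat";
    let mut f = fs::File::create(path)?;
    for x in [1.0f32, 2.0, 3.0, 4.0, 5.0] {
        f.write_all(&x.to_le_bytes())?;
    }
    println!("volume: {}", volume_of(path)?); // prints "volume: 5"
    fs::remove_file(path)?;
    Ok(())
}
```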

Reading and Writing

To write the data, I read one month of data (from one CSV file) into a Vec, then cast the slice of f32s to &[u8] and append it to the output file:

let slice = data.as_slice();
let ptr: *const u8 = slice.as_ptr().cast();
let element_size = std::mem::size_of::<f32>();
let size = element_size * slice.len();
log::debug!(
    "Creating slice: len: {} element_size: {} size: {}",
    slice.len(),
    element_size,
    size
);
let data: &[u8] = unsafe { std::slice::from_raw_parts(ptr, size) };
out_file.write_all(data)?;
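One caveat: the raw cast writes the floats in native byte order, so the files are only portable between machines of the same endianness. A safe, endianness-explicit variant (my sketch, not the project's code) would serialise each float with `to_le_bytes`:

```rust
use std::io::{self, Write};

/// Append a slice of f32s to `out` as little-endian bytes,
/// without any unsafe pointer casting.
fn write_f32s<W: Write>(out: &mut W, data: &[f32]) -> io::Result<()> {
    for x in data {
        out.write_all(&x.to_le_bytes())?;
    }
    Ok(())
}

fn main() -> io::Result<()> {
    let mut buf: Vec<u8> = Vec::new();
    write_f32s(&mut buf, &[1.0f32, 2.5])?;
    println!("{} bytes written", buf.len()); // 8 bytes for two f32s
    Ok(())
}
```

The per-element loop is slower than the single bulk write, which is presumably why the project takes the unsafe cast route for write-once conversion.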

Then to read them back as memory-mapped files (using the vmap crate):

// Map the whole file read-only and reinterpret it as a slice of f32s.
let map = Map::open(file_name)?;
let len = map.len() / std::mem::size_of::<f32>();
// Hint to the OS that we'll read the pages sequentially.
map.advise(AdviseAccess::Sequential, AdviseUsage::WillNeed)?;
let p: *const f32 = map.as_ptr().cast();
let data: &[f32] = unsafe { std::slice::from_raw_parts(p, len) };

Now we have 182 MB of floats that we can read as if it were all in memory, while the OS pages it in from disk as needed.
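With the data exposed as an ordinary &[f32], the stats above reduce to simple iterator folds. A sketch of how min_bid/max_bid could be computed (the real stats program may differ):

```rust
/// Min and max over a slice of f32s via a fold.
/// (f32 isn't Ord, so Iterator::min()/max() don't apply directly.)
fn min_max(data: &[f32]) -> Option<(f32, f32)> {
    data.iter().fold(None, |acc, &x| match acc {
        None => Some((x, x)),
        Some((lo, hi)) => Some((lo.min(x), hi.max(x))),
    })
}

fn main() {
    let bids = [1.0879f32, 1.2555, 1.1000];
    if let Some((lo, hi)) = min_max(&bids) {
        println!("min_bid: {} max_bid: {}", lo, hi);
    }
}
```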

Next I'll start working on some actual strategies and back test them.
