9000% speed up in data access at no extra cost!

For the trading robot, I've now converted the trading data from CSV files into memory-mapped f32 array files. This has given us much higher speed and simplicity for back-testing.

Old format

$ exa --tree -sname
.
├── histdata.com-raw
│  └── EURUSD
│     ├── 2018
│     │  ├── DAT_ASCII_EURUSD_T_201801.csv
│     │  ├── DAT_ASCII_EURUSD_T_201802.csv
│     │  ├── DAT_ASCII_EURUSD_T_201803.csv
│     │  ├── DAT_ASCII_EURUSD_T_201804.csv
│     │  ├── DAT_ASCII_EURUSD_T_201805.csv
│     │  ├── DAT_ASCII_EURUSD_T_201806.csv
│     │  ├── DAT_ASCII_EURUSD_T_201807.csv
│     │  ├── DAT_ASCII_EURUSD_T_201808.csv
│     │  ├── DAT_ASCII_EURUSD_T_201809.csv
│     │  ├── DAT_ASCII_EURUSD_T_201810.csv
│     │  ├── DAT_ASCII_EURUSD_T_201811.csv
│     │  ├── DAT_ASCII_EURUSD_T_201812.csv
│     ├── 2019
│     │  ├── DAT_ASCII_EURUSD_T_201901.csv
│     │  ├── DAT_ASCII_EURUSD_T_201902.csv
│     │  ├── DAT_ASCII_EURUSD_T_201903.csv
...

New format

Now we have three files per currency pair.

└── mmaps
   └── EURUSD
      ├── asks.dat
      ├── bids.dat
      └── timestamps.dat
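The three files are parallel arrays: the i-th entry of each file describes the same tick. A minimal sketch of how one tick could be reassembled from the three slices (the `Tick` struct and `tick_at` helper are mine for illustration, not the project's actual code):

```rust
/// One tick reassembled from the three parallel arrays.
/// (Hypothetical struct; the real project may model this differently.)
#[derive(Debug, PartialEq)]
struct Tick {
    timestamp: f32,
    bid: f32,
    ask: f32,
}

/// Index i refers to the same tick in all three slices.
fn tick_at(timestamps: &[f32], bids: &[f32], asks: &[f32], i: usize) -> Tick {
    Tick {
        timestamp: timestamps[i],
        bid: bids[i],
        ask: asks[i],
    }
}

fn main() {
    let timestamps = [1.0f32, 2.0, 3.0];
    let bids = [1.0879f32, 1.0880, 1.0881];
    let asks = [1.0880f32, 1.0881, 1.0882];
    println!("{:?}", tick_at(&timestamps, &bids, &asks, 1));
}
```

This struct-of-arrays layout is what makes the whole scheme work: each file is a homogeneous run of f32s, so it can be mapped and indexed directly with no parsing.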

Space saving

We're still using ZFS compression, but it turns out the giant binary files actually compress better than the .csv files. Remember, this is just two years of data from one instrument so far:

$ du -hs histdata.com-raw/
707M	histdata.com-raw/

$ du -hs mmaps/
501M	mmaps/

Speed saving

This program generates the stats from the data. Before the conversion, parsing the CSV files (second run, so it gets some disk-caching advantage):

$ time cargo run --release
    Finished release [optimized] target(s) in 0.02s
     Running `/home/matiu/projects/trader/code/target/release/stats`
min_bid: 1.0879
max_bid: 1.2555
min_ask: 1.08793
max_ask: 1.25555
volume: 47546043

real	0m14.733s
user	0m13.668s
sys	0m1.064s

After the conversion, using the giant memory-mapped files:

$ time cargo run --release
    Finished release [optimized] target(s) in 0.04s
     Running `/home/matiu/projects/trader/code/target/release/stats`
min_bid: 1.0879
max_bid: 1.2555
min_ask: 1.08793
max_ask: 1.25555
volume: 47546043

real	0m0.162s
user	0m0.124s
sys	0m0.039s

That is a speed-up of about 90 times (or 9094%).

Mainly due to:

  1. Not having to UTF-8 parse the CSV files (Rust's fault :P)
  2. Memory map powers
  3. Not having to count the volume; we just derive it from the size of the file
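Point 3 is just arithmetic on the file length: each tick is one f32 (4 bytes), so the tick count is the byte length divided by `size_of::<f32>()`. A sketch (the file path and `volume_of` helper are hypothetical):

```rust
use std::fs;
use std::io::Write;

/// Number of f32 ticks in a raw array file: byte length / 4.
fn volume_of(path: &str) -> std::io::Result<u64> {
    Ok(fs::metadata(path)?.len() / std::mem::size_of::<f32>() as u64)
}

fn main() -> std::io::Result<()> {
    // Write 5 floats (20 bytes) to a scratch file to demonstrate.
    let path = "/tmp/bids_demo.dat";
    let mut f = fs::File::create(path)?;
    for x in [1.0f32, 2.0, 3.0, 4.0, 5.0] {
        f.write_all(&x.to_le_bytes())?;
    }
    println!("volume: {}", volume_of(path)?); // prints "volume: 5"
    fs::remove_file(path)?;
    Ok(())
}
```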

Reading and Writing

To write the data, I read one month of data (from one CSV file) into a Vec, then cast the slice of f32s to &[u8] and append it to the output file:

let slice = data.as_slice();
let ptr: *const u8 = slice.as_ptr().cast();
let element_size = std::mem::size_of::<f32>();
let size = element_size * slice.len();
log::debug!(
    "Creating slice: len: {} element_size: {} size: {}",
    slice.len(),
    element_size,
    size
);
let data: &[u8] = unsafe { std::slice::from_raw_parts(ptr, size) };
out_file.write_all(data)?;
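One caveat: the raw cast writes the floats in native byte order, so the files are only portable between machines of the same endianness. A safe, endianness-explicit variant (my sketch, not the project's code) would serialise each float with `to_le_bytes`:

```rust
use std::io::{self, Write};

/// Append a slice of f32s to `out` as little-endian bytes,
/// without any unsafe pointer casting.
fn write_f32s<W: Write>(out: &mut W, data: &[f32]) -> io::Result<()> {
    for x in data {
        out.write_all(&x.to_le_bytes())?;
    }
    Ok(())
}

fn main() -> io::Result<()> {
    let mut buf: Vec<u8> = Vec::new();
    write_f32s(&mut buf, &[1.0f32, 2.5])?;
    println!("{} bytes written", buf.len()); // 8 bytes for two f32s
    Ok(())
}
```

The per-element loop is slower than the single bulk write, which is presumably why the project takes the unsafe cast route for write-once conversion.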

Then to read them back as memory-mapped files (using the vmap crate):

// Map the whole file read-only and reinterpret it as a slice of f32s.
let map = Map::open(file_name)?;
let len = map.len() / std::mem::size_of::<f32>();
// Hint to the OS that we'll read the pages sequentially.
map.advise(AdviseAccess::Sequential, AdviseUsage::WillNeed)?;
let p: *const f32 = map.as_ptr().cast();
let data: &[f32] = unsafe { std::slice::from_raw_parts(p, len) };

Now we have 182 MB of floats that we can read as if it were all in memory, while the OS pages it in from disk as needed.
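With the data exposed as an ordinary &[f32], the stats above reduce to simple iterator folds. A sketch of how min_bid/max_bid could be computed (the real stats program may differ):

```rust
/// Min and max over a slice of f32s via a fold.
/// (f32 isn't Ord, so Iterator::min()/max() don't apply directly.)
fn min_max(data: &[f32]) -> Option<(f32, f32)> {
    data.iter().fold(None, |acc, &x| match acc {
        None => Some((x, x)),
        Some((lo, hi)) => Some((lo.min(x), hi.max(x))),
    })
}

fn main() {
    let bids = [1.0879f32, 1.2555, 1.1000];
    if let Some((lo, hi)) = min_max(&bids) {
        println!("min_bid: {} max_bid: {}", lo, hi);
    }
}
```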

Next I'll start working on some actual strategies and back test them.
