9000% speed up in data access at no extra cost!
For the trading robot, I've now converted the trading data from CSV files into memory-mapped f32 array files. This has given us much higher speed and simplicity for back testing.
Old format
$ exa --tree -sname .
├── histdata.com-raw
│   └── EURUSD
│       ├── 2018
│       │   ├── DAT_ASCII_EURUSD_T_201801.csv
│       │   ├── DAT_ASCII_EURUSD_T_201802.csv
│       │   ├── DAT_ASCII_EURUSD_T_201803.csv
│       │   ├── DAT_ASCII_EURUSD_T_201804.csv
│       │   ├── DAT_ASCII_EURUSD_T_201805.csv
│       │   ├── DAT_ASCII_EURUSD_T_201806.csv
│       │   ├── DAT_ASCII_EURUSD_T_201807.csv
│       │   ├── DAT_ASCII_EURUSD_T_201808.csv
│       │   ├── DAT_ASCII_EURUSD_T_201809.csv
│       │   ├── DAT_ASCII_EURUSD_T_201810.csv
│       │   ├── DAT_ASCII_EURUSD_T_201811.csv
│       │   ├── DAT_ASCII_EURUSD_T_201812.csv
│       ├── 2019
│       │   ├── DAT_ASCII_EURUSD_T_201901.csv
│       │   ├── DAT_ASCII_EURUSD_T_201902.csv
│       │   ├── DAT_ASCII_EURUSD_T_201903.csv
...
New format
Now we have just three files per currency pair:
└── mmaps
└── EURUSD
├── asks.dat
├── bids.dat
└── timestamps.dat
Space saving
We're still using ZFS compression, but it seems the giant files actually compress better than the .csv files. Remember, this is just two years of data from one instrument so far:
$ du -hs histdata.com-raw/
707M    histdata.com-raw/
$ du -hs mmaps/
501M    mmaps/
Speed saving
A small stats program reads through all the data and prints a few summary numbers. First, with the old CSV files (second run, so it gets some disk-caching advantage):
$ time cargo run --release
Finished release [optimized] target(s) in 0.02s
Running `/home/matiu/projects/trader/code/target/release/stats`
min_bid: 1.0879
max_bid: 1.2555
min_ask: 1.08793
max_ask: 1.25555
volume: 47546043
real 0m14.733s
user 0m13.668s
sys 0m1.064s
Using the giant memory-mapped files:
$ time cargo run --release
Finished release [optimized] target(s) in 0.04s
Running `/home/matiu/projects/trader/code/target/release/stats`
min_bid: 1.0879
max_bid: 1.2555
min_ask: 1.08793
max_ask: 1.25555
volume: 47546043
real 0m0.162s
user 0m0.124s
sys 0m0.039s
That is a speed-up of about 90 times (14.733 s / 0.162 s ≈ 9094%).
Mainly due to:
- Not having to UTF-8 parse the CSV file (Rust's fault :P)
- Memory-map powers
- We don't have to count the volume; it's just the file size divided by the element size (see the sketch below)
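To make that last point concrete, here's a minimal sketch of what the stats pass can look like once the bids and asks are just two &[f32] slices (how they get mapped is shown below under Reading and Writing). The function and variable names here are mine, not the actual stats binary:

/// Hypothetical stats pass over the memory-mapped arrays.
/// bids and asks are the &[f32] slices obtained from the mapped files.
fn print_stats(bids: &[f32], asks: &[f32]) {
    // Volume is just the tick count, i.e. the slice length,
    // which falls straight out of the file size / 4 bytes.
    let volume = bids.len();

    let mut min_bid = f32::INFINITY;
    let mut max_bid = f32::NEG_INFINITY;
    for &b in bids {
        min_bid = min_bid.min(b);
        max_bid = max_bid.max(b);
    }

    let mut min_ask = f32::INFINITY;
    let mut max_ask = f32::NEG_INFINITY;
    for &a in asks {
        min_ask = min_ask.min(a);
        max_ask = max_ask.max(a);
    }

    println!("min_bid: {}", min_bid);
    println!("max_bid: {}", max_bid);
    println!("min_ask: {}", min_ask);
    println!("max_ask: {}", max_ask);
    println!("volume: {}", volume);
}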
Reading and Writing
To write the data, I read one month of data (one CSV file) into a Vec<f32>, then cast the slice of f32s to &[u8] and write (append) it to the output file:
let slice = data.as_slice();
let ptr: *const u8 = slice.as_ptr().cast();
let element_size = std::mem::size_of::<f32>();
let size = element_size * slice.len();
log::debug!(
    "Creating slice: len: {} element_size: {} size: {}",
    slice.len(),
    element_size,
    size
);
// Reinterpret the f32 slice as raw bytes and append them to the output file.
let data: &[u8] = unsafe { std::slice::from_raw_parts(ptr, size) };
out_file.write_all(data)?;
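For reference, the whole append step wrapped up as a self-contained helper looks roughly like this (append_f32s is a name I'm using for illustration, not something from the trader code). write_all is used rather than write so a partial write can't silently drop bytes:

use std::fs::OpenOptions;
use std::io::Write;
use std::path::Path;

/// Hypothetical helper: append a slice of f32s to a .dat file as raw bytes.
fn append_f32s(path: &Path, data: &[f32]) -> std::io::Result<()> {
    let mut out_file = OpenOptions::new().create(true).append(true).open(path)?;
    // Reinterpret the &[f32] as &[u8]; every byte of an f32 is a valid u8.
    let bytes: &[u8] = unsafe {
        std::slice::from_raw_parts(
            data.as_ptr().cast::<u8>(),
            data.len() * std::mem::size_of::<f32>(),
        )
    };
    // Keeps writing until the whole buffer is on disk.
    out_file.write_all(bytes)
}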
Then to read them back as memory-mapped files (using the vmap crate):
use vmap::{AdviseAccess, AdviseUsage, Map};

let map = Map::open(file_name)?;
let len = map.len() / std::mem::size_of::<f32>();
// Hint to the OS that we'll read the whole file front to back.
map.advise(AdviseAccess::Sequential, AdviseUsage::WillNeed)?;
// Reinterpret the mapped bytes as f32s; map must outlive this slice.
let p: *const f32 = map.as_ptr().cast();
let data: &[f32] = unsafe { std::slice::from_raw_parts(p, len) };
Now we have 182 MB of floats that we can read as if they were already in memory (the OS pages them in as needed).
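As a small taste of what that buys us for back testing, here's a toy example (again, the name is mine, not real strategy code): scanning for the widest bid/ask spread is just a zip over the two mapped slices, with the OS streaming the pages in behind the scenes.

/// Toy example: widest bid/ask spread over the whole data set.
/// bids and asks are the mapped slices, same length, same tick order.
fn max_spread(bids: &[f32], asks: &[f32]) -> f32 {
    bids.iter()
        .zip(asks.iter())
        .map(|(bid, ask)| ask - bid)
        .fold(f32::NEG_INFINITY, f32::max)
}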
Next I'll start working on some actual strategies and back test them.