1.1 Structured Data: Standard Bars

This post describes how to process and store high-frequency financial data to support your machine learning algorithm, based on statistical properties of the data.

The post is based directly on content from the book "Advances in Financial Machine Learning" by Marcos Lopez de Prado.

Physical meaning:
High-frequency financial data is voluminous and demanding to store and process. For example, 20 days' worth of high-frequency (tick-by-tick, or trade-by-trade) financial data comes to about 5 million records and a file size of about 300MB. For comparison, an Excel worksheet is limited to roughly 1 million rows.

Unless the data is summarised, we will quickly run into data processing limitations as we process months or years of data.

The data that appears on financial charts on popular websites is summarised in this way. Visually, we lose little of significance by working with the summarised data.

Algorithm description:

The basic idea is to cut the raw high-frequency data into slices, based on a consistent rule. Once we have prepared a slice, we summarise it using statistics. A common choice is OHLC [Open Price, High Price, Low Price, Close Price]: within the slice we just took, the price at which the slice started, the highest price reached, the lowest price reached, and the price at which the slice ended.

These summary statistics are then stored, and form the basis for training a Machine Learning model.
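As a minimal sketch of the summarising step, the snippet below computes OHLC for a single slice. The function name `ohlc` and the sample prices are made up for illustration and are not from the book.

```python
def ohlc(prices):
    """Summarise one slice of trade prices as (open, high, low, close)."""
    if not prices:
        raise ValueError("a slice must contain at least one trade")
    # Open is the first trade, close is the last trade in the slice
    return prices[0], max(prices), min(prices), prices[-1]

# Hypothetical slice of five consecutive trade prices
slice_prices = [100.0, 100.5, 99.75, 100.25, 100.0]
bar = ohlc(slice_prices)
print(bar)
```

Five trades are reduced to four numbers here; with slices of thousands of trades the saving is far larger.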

1.1.1 Time bars: We take slices at a constant frequency, e.g. every 10 minutes

1.1.2 Tick bars: We take slices after a constant number of trades has passed, e.g. every 10,000 trades

1.1.3 Volume bars: We take slices after a constant traded volume is exchanged, e.g. every 10,000 shares 

1.1.4 Dollar bars: We take slices after a constant financial amount is exchanged, e.g. every 10,000 dollars' worth of shares
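The code later in this post covers tick, volume, and dollar bars but not time bars. A rough sketch of time bars is shown here, assuming the trades carry a timestamp index. The `trades` DataFrame and its values are hypothetical; the sketch uses the pandas `resample` and `ohlc` methods.

```python
import pandas as pd

# Hypothetical trades: timestamps, prices, and volumes are made up for illustration
trades = pd.DataFrame(
    {"Price": [100.0, 101.0, 99.0, 102.0, 103.0, 101.5],
     "Volume": [5, 3, 2, 4, 1, 6]},
    index=pd.to_datetime([
        "2020-01-02 09:30:05", "2020-01-02 09:34:10", "2020-01-02 09:39:59",
        "2020-01-02 09:41:00", "2020-01-02 09:47:30", "2020-01-02 09:52:00",
    ]),
)

# Time bars: one OHLC row per 10-minute window; windows without trades are dropped
time_bars = trades["Price"].resample("10min").ohlc().dropna()
print(time_bars)
```

The other three bar types replace the clock with a running count of trades, shares, or dollars, which is what the loop-based code in this post does.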



The original time series had 5 million records. The sampled series has only 800 records, yet we have lost hardly any detail. The plots of the tick bars, volume bars, and dollar bars are slightly different but roughly consistent.


Python Code:

Are there any coding tips to improve the speed of the algorithm? Let me know in the comments.

# Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Load data
df = pd.read_csv(r'C:\Users\josde\OneDrive\Denny\Deep-learning\Data-sets\Trade-data\ES_Trades.csv')
df = df.iloc[:,0:5]
df['Dollar'] = df['Price']*df['Volume']
print(df.columns)

# Identify threshold
def thresh(df, freq):
    # Daily totals: number of trades, traded volume, and traded dollar amount
    tmp1 = pd.pivot_table(df, index='Date', values='Symbol', aggfunc='count')
    tmp2 = pd.pivot_table(df, index='Date', values='Volume', aggfunc='sum')
    tmp3 = pd.pivot_table(df, index='Date', values='Dollar', aggfunc='sum')
    # Threshold = average daily total divided by the desired number of bars per day (freq)
    tick_thresh = np.round((1/freq)*np.average(tmp1['Symbol']))
    volume_thresh = np.round((1/freq)*np.average(tmp2['Volume']))
    dollar_thresh = np.round((1/freq)*np.average(tmp3['Dollar']))
    return tick_thresh, volume_thresh, dollar_thresh

# Aim for roughly 50 bars per day of each type
tick_thresh, volume_thresh, dollar_thresh = thresh(df, 50)

# Generate bars
def bargen(df, tick_thresh, volume_thresh, dollar_thresh):
    # Prices seen so far in the bar currently being built, one list per bar type
    tickprice, volumeprice, dollarprice = [], [], []
    # Completed bars, stored as (index of last trade, open, high, low, close)
    tick_tmp, volume_tmp, dollar_tmp = [], [], []
    # Running totals since the last completed bar
    tick_count = 0
    vol_count = 0
    dol_count = 0
    for idx, (price, vol, dol) in enumerate(zip(df['Price'], df['Volume'], df['Dollar'])):
        tickprice.append(price)
        volumeprice.append(price)
        dollarprice.append(price)

        tick_count = tick_count + 1
        vol_count = vol_count + vol
        dol_count = dol_count + dol

        # Close a tick bar once enough trades have passed
        if tick_count >= tick_thresh:
            tick_tmp.append((idx, tickprice[0], max(tickprice), min(tickprice), tickprice[-1]))
            tick_count = 0
            tickprice = []
        # Close a volume bar once enough shares have been traded
        if vol_count >= volume_thresh:
            volume_tmp.append((idx, volumeprice[0], max(volumeprice), min(volumeprice), volumeprice[-1]))
            vol_count = 0
            volumeprice = []
        # Close a dollar bar once enough dollar value has been traded
        if dol_count >= dollar_thresh:
            dollar_tmp.append((idx, dollarprice[0], max(dollarprice), min(dollarprice), dollarprice[-1]))
            dol_count = 0
            dollarprice = []

    # Trades after the last completed bar of each type are not emitted
    cols = ['Index', 'Open', 'High', 'Low', 'Close']
    tick_bar = pd.DataFrame(tick_tmp, columns=cols)
    volume_bar = pd.DataFrame(volume_tmp, columns=cols)
    dollar_bar = pd.DataFrame(dollar_tmp, columns=cols)
    return tick_bar, volume_bar, dollar_bar

tick_bar, volume_bar, dollar_bar = bargen(df, tick_thresh, volume_thresh, dollar_thresh)
print(tick_bar.shape, volume_bar.shape, dollar_bar.shape)
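
One possible speed-up, offered as a sketch rather than a drop-in replacement: instead of looping over every trade in Python, assign each trade a bar number from a cumulative sum and let pandas compute OHLC per group. The helper `bars_from_cumsum` and the sample numbers are hypothetical. Its output can differ from the loop above in two ways: any overshoot past the threshold is carried into the next bar instead of being reset to zero, and the final incomplete bar is kept.

```python
import numpy as np
import pandas as pd

def bars_from_cumsum(prices, amounts, threshold):
    """Group trades into bars by the running total of `amounts`, then take OHLC.

    Hypothetical helper, not from the book. A trade belongs to bar k when the
    running total *before* that trade lies in [k*threshold, (k+1)*threshold),
    so the trade that crosses a threshold closes the current bar.
    """
    prices = pd.Series(np.asarray(prices, dtype=float))
    amounts = np.asarray(amounts, dtype=float)
    total_before = np.cumsum(amounts) - amounts
    bar_id = (total_before // threshold).astype(int)
    bars = prices.groupby(bar_id).agg(["first", "max", "min", "last"])
    bars.columns = ["Open", "High", "Low", "Close"]
    return bars

# Made-up trades for illustration
prices = [10, 11, 9, 12, 13, 12]
volumes = [4, 3, 5, 2, 6, 1]

# Volume bars: a new bar after every 8 shares traded
vol_bars = bars_from_cumsum(prices, volumes, threshold=8)
# Tick bars: count each trade as 1, a new bar after every 2 trades
tick_bars = bars_from_cumsum(prices, np.ones(len(prices)), threshold=2)
print(vol_bars)
print(tick_bars)
```

For dollar bars, pass price times volume as `amounts` with a dollar threshold.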
