The secret to labeling financial data

This post describes the secret to labeling financial data.

To avoid the curse of dimensionality, it is advisable to reframe a regression problem into a classification problem where possible.

Fixed horizon method:

The fixed horizon method is a popular way to do this. It has two components: (a) generating the fixed horizon, and (b) labeling the data. The horizon is measured in bars, which may be time bars, tick bars, volume bars or dollar bars. Bars other than time bars are preferred because their return series is closer to a normal distribution. The data is labeled by setting a return threshold, t. If the return, r, over a fixed number of bars is less than -t, the bar is labeled -1; if it lies between -t and t, the bar is labeled 0; and if it is greater than t, the bar is labeled 1.
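The labeling rule can be sketched in a few lines of pandas. This is a minimal illustration, not the code from the listings further down: the function name fixed_horizon_labels, the parameter names and the sample prices are all made up for the example.

```python
import numpy as np
import pandas as pd

def fixed_horizon_labels(close, horizon, threshold):
    """Label each bar -1, 0 or 1 from the return over the next `horizon` bars."""
    close = pd.Series(close, dtype=float)
    # Return from bar i to bar i + horizon; the last `horizon` bars have no value
    fwd_ret = close.shift(-horizon) / close - 1.0
    labels = pd.Series(0.0, index=close.index)
    labels[fwd_ret > threshold] = 1
    labels[fwd_ret < -threshold] = -1
    labels[fwd_ret.isna()] = np.nan  # horizon runs past the end of the data
    return labels

# Hypothetical closing prices of six consecutive bars
close = [100.0, 102.0, 100.5, 99.0, 99.2, 101.5]
labels = fixed_horizon_labels(close, horizon=2, threshold=0.01)
print(labels.tolist())  # [0.0, -1.0, -1.0, 1.0, nan, nan]
```

The last two bars stay unlabeled because there are not enough bars after them to measure a two-bar return.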

An appealing variation of this method replaces the raw return r with the standardised return z, which is the return scaled by the volatility predicted over the interval of bars in question. The labels are assigned as before: if the standardised return, z, over a fixed number of bars is less than -t, the bar is labeled -1; if it lies between -t and t, the bar is labeled 0; and if it is greater than t, the bar is labeled 1.
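A rough sketch of the standardised variant follows. The post does not say how the volatility is predicted, so the forecast here is an assumption: an exponentially weighted standard deviation of past one-bar returns, scaled by the square root of the horizon. The function name, the span of 20 bars and the simulated prices are likewise only illustrative.

```python
import numpy as np
import pandas as pd

def standardised_labels(close, horizon, z_threshold, span=20):
    """Label bars from the forward return divided by a volatility estimate."""
    close = pd.Series(close, dtype=float)
    one_bar_ret = close.pct_change()
    # Assumed volatility forecast: EWM std of returns seen so far,
    # scaled to the horizon with the square-root-of-time rule
    sigma = one_bar_ret.ewm(span=span).std() * np.sqrt(horizon)
    fwd_ret = close.shift(-horizon) / close - 1.0
    z = fwd_ret / sigma
    labels = pd.Series(np.nan, index=close.index)
    labels[z.abs() <= z_threshold] = 0
    labels[z > z_threshold] = 1
    labels[z < -z_threshold] = -1
    return labels, z

# Simulated prices, used only to exercise the function
rng = np.random.default_rng(0)
close = 100 * np.exp(np.cumsum(rng.normal(0, 0.001, 200)))
labels, z = standardised_labels(close, horizon=5, z_threshold=1.0)
print(labels.value_counts(dropna=False))
```

Because z is measured in units of volatility, the same threshold can be used in calm and in turbulent periods, which is the appeal of this variation.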

A drawback of the fixed horizon method is that it ignores the path the return takes within the horizon. In practice, if the return rises above a profit-taking level or falls below a stop-loss level set by the risk management department, the position is closed before the end of the horizon.

Triple barrier method:

In the triple barrier method, we enter a position at bar t0 and define the maximum number of bars, h, for which we will hold it, i.e. until t0 + h. We also define a threshold, t, from the expected volatility of returns over this horizon. We then open the position and monitor the return at each bar. If the return exceeds t, we exit the position (profit taking) and label it +1. If the return falls below -t, we exit the position (stop loss) and label it -1. As long as the return stays within these thresholds we keep holding; if we run out of time, i.e. we reach t0 + h, we exit the position and label it 0.
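The procedure can be sketched for a single long position as follows. The function name triple_barrier_label and the sample prices are hypothetical, and the sketch uses symmetric barriers at +t and -t as described above.

```python
import numpy as np

def triple_barrier_label(prices, t0, h, t):
    """Return (label, exit_index) for a long position opened at index t0."""
    entry = prices[t0]
    last = min(t0 + h, len(prices) - 1)  # vertical barrier: out of time
    for i in range(t0 + 1, last + 1):
        r = prices[i] / entry - 1.0
        if r > t:        # upper barrier: profit taking
            return 1, i
        if r < -t:       # lower barrier: stop loss
            return -1, i
    return 0, last       # neither barrier was touched within h bars

# Hypothetical closing prices of six consecutive bars
prices = np.array([100.0, 100.4, 99.7, 98.8, 101.5, 102.0])
stopped = triple_barrier_label(prices, t0=0, h=5, t=0.01)
timed_out = triple_barrier_label(prices, t0=0, h=2, t=0.01)
profit = triple_barrier_label(prices, t0=3, h=2, t=0.01)
print(stopped, timed_out, profit)  # (-1, 3) (0, 2) (1, 4)
```

In the first call the position is stopped out at index 3, although the price five bars after entry is 2% above the entry price. A fixed horizon label with the same h and t would therefore be +1, which is exactly the drawback described earlier.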

Meta-labeling:

Labels can be used to train a model that analyses current conditions and predicts whether the return will exceed a threshold t within the next h bars. Before we place a bet on this outcome, it is useful to train a separate model that assesses how often the first model has been right when it predicted a positive or negative outcome. This brings precision and recall into the picture and gives an estimate of the rate of true positives, which in turn gives us the confidence to size the bet (trade size) with which we act on the prediction.
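As a small illustration of the idea, the sketch below builds meta-labels (1 where the first model's call was right, 0 where it was wrong) and measures precision and recall of the first model for the +1 class. The function names and the toy prediction and outcome arrays are made up for the example; the meta-labels would serve as the training target of the second model, which is not shown.

```python
import numpy as np

def meta_labels(primary_pred, true_label):
    """Return a mask of bars where the first model proposed a trade,
    and for those bars a 0/1 flag saying whether it was right."""
    primary_pred = np.asarray(primary_pred)
    true_label = np.asarray(true_label)
    acted = primary_pred != 0
    meta = (primary_pred[acted] == true_label[acted]).astype(int)
    return acted, meta

def precision_recall(pred, truth, positive=1):
    """Precision and recall of `pred` for the class `positive`."""
    pred, truth = np.asarray(pred), np.asarray(truth)
    tp = np.sum((pred == positive) & (truth == positive))
    fp = np.sum((pred == positive) & (truth != positive))
    fn = np.sum((pred != positive) & (truth == positive))
    precision = tp / (tp + fp) if tp + fp else float("nan")
    recall = tp / (tp + fn) if tp + fn else float("nan")
    return precision, recall

# Toy data: predictions of the first model and the realised labels
primary_pred = [1, 1, -1, 0, 1, -1, 1, 0]
true_label = [1, 0, -1, 1, 1, 1, -1, 0]

acted, meta = meta_labels(primary_pred, true_label)
precision, recall = precision_recall(primary_pred, true_label, positive=1)
print(meta.tolist())      # [1, 0, 1, 1, 0, 0]
print(precision, recall)  # 0.5 0.5
```

A second model trained on these 0/1 targets outputs a probability that the first model is right, and that probability is one possible input for sizing the bet.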

For an excellent tutorial on precision, recall and F1-score, visit this link: https://machinelearningmastery.com/precision-recall-and-f-measure-for-imbalanced-classification/

Physical meaning:

Practical use:

Compute using Python:

File: Fixed-Hor.py

# Import Libraries
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# Import Data
df = pd.read_csv(r'C:\Users\Denny Joseph\OneDrive\Denny\Deep-learning\Data-sets\Trade-data\ES_Trades.csv')
df = df.iloc[:,0:5] # Remove unwanted columns

# Add attributes to the dataframe
def add_attributes(df):
    df['Dollar'] = df['Price'] * df['Volume']
    df['Datetime'] = pd.to_datetime(df['Date'] + ' ' + df['Time'])
    #df['DateTime'] = pd.to_datetime(df['Date'] + ' ' + df['Time'])
    df['Hour'] = df['Datetime'].dt.hour
    df['Minute'] = df['Datetime'].dt.minute
    df['Second'] = df['Datetime'].dt.second
    return df

# Generate thresholds for bars
def thresh_gen(df):
    tt = pd.DataFrame(pd.pivot_table(df,index='Date',values='Symbol',aggfunc='count'))
    vt = pd.DataFrame(pd.pivot_table(df,index='Date',values='Volume',aggfunc='sum'))
    dt = pd.DataFrame(pd.pivot_table(df,index='Date',values='Dollar',aggfunc='sum'))
    tt = round((1/50)*np.average(tt['Symbol']))

    vt = round((1 / 50) * np.average(vt['Volume']))
    dt = round((1 / 50) * np.average(dt['Dollar']))
    return tt, vt, dt

# Create bars
def bar_gen(df,tt,vt,dt):
    tc,vc,dc = 0,0,0
    timetmp, ticktmp, voltmp, doltmp = [], [], [], []
    timebar, tickbar, volumebar, dollarbar = [],[],[],[]
    for i, (p,v,d,tstamp) in enumerate(zip(df['Price'],df['Volume'],df['Dollar'], df['Datetime'])):
        tc = tc + 1
        vc = vc + v
        dc = dc + d
        tickbar.append(p)
        volumebar.append(p)
        dollarbar.append(p)
        if tc == tt:
            tickbar = np.array(tickbar)
            open = tickbar[0]
            low = np.min(tickbar)
            high = np.max(tickbar)
            close = tickbar[-1]
            ticktmp.append((i,tstamp, open,high, low, close))
            tickbar = []
            tc = 0
        if vc >= vt:
            volumebar = np.array(volumebar)
            open = volumebar[0]
            low = np.min(volumebar)
            high = np.max(volumebar)
            close = volumebar[-1]
            voltmp.append((i,tstamp, open,high, low, close))
            volumebar = []
            vc = 0

        if dc >= dt:
            dollarbar = np.array(dollarbar)
            open = dollarbar[0]
            low = np.min(dollarbar)
            high = np.max(dollarbar)
            close = dollarbar[-1]
            doltmp.append((i,tstamp, open,high, low, close))
            dc = 0
            dollarbar = []

    df = df.set_index('Datetime')
    # 10-minute time bars; 'min' replaces the 'T' alias, which is deprecated in recent pandas
    time_bar = pd.DataFrame(df['Price'].resample('10min').ohlc().bfill().between_time('09:00', '16:30'))

    cols = ['Index','Time stamp' ,'Open', 'High', 'Low', 'Close']
    return (time_bar, pd.DataFrame(ticktmp, columns=cols), pd.DataFrame(voltmp, columns=cols), pd.DataFrame(doltmp, columns=cols))

df = add_attributes(df)
tt, vt, dt = thresh_gen(df)
timebar, tickbar, volumebar, dollarbar = bar_gen(df,tt,vt,dt)

def fixed_hor(df,rc):
    # Label each bar by its return relative to the previous bar's close (one-bar horizon)
    p0 = 0
    marker = []
    res = []
    for i, p in enumerate(df['Close']):
        if i == 0:
            p0 = p
        r = ((p-p0)/p0)
        res.append(r)
        if r > rc:
            m = 1
        elif r < -rc:  # elif, so that each bar receives exactly one label
            m = -1
        else:
            m = 0
        marker.append(m)
        p0 = p
    df['Marker'] = marker
    return df, res

rc = 0.0007685632562659526
tickbar, res = fixed_hor(tickbar,rc)
print(tickbar)
sns.histplot(res, kde=True)  # distplot is deprecated in recent seaborn releases
plt.show()

File: Triple-Barrier.py

# Import Libraries
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# Import Data
df = pd.read_csv(r'C:\Users\Denny Joseph\OneDrive\Denny\Deep-learning\Data-sets\Trade-data\ES_Trades.csv')
df = df.iloc[:,0:5] # Remove unwanted columns

# Add attributes to the dataframe
def add_attributes(df):
    df['Dollar'] = df['Price'] * df['Volume']
    df['Datetime'] = pd.to_datetime(df['Date'] + ' ' + df['Time'])
    #df['DateTime'] = pd.to_datetime(df['Date'] + ' ' + df['Time'])
    df['Hour'] = df['Datetime'].dt.hour
    df['Minute'] = df['Datetime'].dt.minute
    df['Second'] = df['Datetime'].dt.second
    return df

# Generate thresholds for bars
def thresh_gen(df):
    tt = pd.DataFrame(pd.pivot_table(df,index='Date',values='Symbol',aggfunc='count'))
    vt = pd.DataFrame(pd.pivot_table(df,index='Date',values='Volume',aggfunc='sum'))
    dt = pd.DataFrame(pd.pivot_table(df,index='Date',values='Dollar',aggfunc='sum'))
    tt = round((1/50)*np.average(tt['Symbol']))

    vt = round((1 / 50) * np.average(vt['Volume']))
    dt = round((1 / 50) * np.average(dt['Dollar']))
    return tt, vt, dt

# Create bars
def bar_gen(df,tt,vt,dt):
    tc,vc,dc = 0,0,0
    timetmp, ticktmp, voltmp, doltmp = [], [], [], []
    timebar, tickbar, volumebar, dollarbar = [],[],[],[]
    for i, (p,v,d,tstamp) in enumerate(zip(df['Price'],df['Volume'],df['Dollar'], df['Datetime'])):
        tc = tc + 1
        vc = vc + v
        dc = dc + d
        tickbar.append(p)
        volumebar.append(p)
        dollarbar.append(p)
        if tc == tt:
            tickbar = np.array(tickbar)
            open = tickbar[0]
            low = np.min(tickbar)
            high = np.max(tickbar)
            close = tickbar[-1]
            ticktmp.append((i,tstamp, open,high, low, close))
            tickbar = []
            tc = 0
        if vc >= vt:
            volumebar = np.array(volumebar)
            open = volumebar[0]
            low = np.min(volumebar)
            high = np.max(volumebar)
            close = volumebar[-1]
            voltmp.append((i,tstamp, open,high, low, close))
            volumebar = []
            vc = 0

        if dc >= dt:
            dollarbar = np.array(dollarbar)
            open = dollarbar[0]
            low = np.min(dollarbar)
            high = np.max(dollarbar)
            close = dollarbar[-1]
            doltmp.append((i,tstamp, open,high, low, close))
            dc = 0
            dollarbar = []

    df = df.set_index('Datetime')
    # 10-minute time bars; 'min' replaces the 'T' alias, which is deprecated in recent pandas
    time_bar = pd.DataFrame(df['Price'].resample('10min').ohlc().bfill().between_time('09:00', '16:30'))

    cols = ['Index','Time stamp' ,'Open', 'High', 'Low', 'Close']
    return (time_bar, pd.DataFrame(ticktmp, columns=cols), pd.DataFrame(voltmp, columns=cols), pd.DataFrame(doltmp, columns=cols))

df = add_attributes(df)
tt, vt, dt = thresh_gen(df)
timebar, tickbar, volumebar, dollarbar = bar_gen(df,tt,vt,dt)

def fixed_hor(tickbar,rc):
    # Label each bar by its return relative to the previous bar's close (one-bar horizon)
    p0 = 0
    marker = []
    res = []
    for i, p in enumerate(tickbar['Close']):
        if i == 0:
            p0 = p
        r = ((p-p0)/p0)
        res.append(r)
        if r > rc:
            m = 1
        elif r < -rc:  # elif, so that each bar receives exactly one label
            m = -1
        else:
            m = 0
        marker.append(m)
        p0 = p
    tickbar['Marker'] = marker
    return tickbar, res

rc = 0.0007685632562659526
tickbar, res = fixed_hor(tickbar,rc)
print(tickbar)


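def triple(tickbar,df,rc):
    # For each bar, walk through its ticks and record the first barrier that is touched
    marker = []
    prices = df['Price'].to_numpy()  # tick prices, addressed by position
    for i, (idx, base) in enumerate(zip(tickbar['Index'],tickbar['Open'])):
        # The ticks of bar i run from just after the previous bar's last tick up to idx
        if i == 0:
            start = 0
        else:
            start = tickbar.iloc[i-1,0] + 1
        end = idx

        for j in range(start,end+1):
            p = prices[j]
            r = (p-base)/base
            if r > rc:  # upper barrier: profit taking
                m = 1
                marker.append(m)
                break
            elif r < -rc:  # lower barrier: stop loss
                m = -1
                marker.append(m)
                break
            elif j == end:  # end of the bar reached without touching a barrier
                m = 0
                marker.append(m)
                break
    tickbar['Triple'] = marker
    return tickbar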

tickbar = triple(tickbar,df,rc)
volumebar = triple(volumebar,df,rc)
dollarbar = triple(dollarbar,df,rc)
