Hyper Optimized Algorithmic Strategy Vs/+ Machine Learning Models Part -3 (XGBoost Classifier , LGBM Classifier, CatBoost Classifier, SVC, LSTM with XGB and Multi level Hyper-optimization)


Level Up Your Algo Trading Game: Introducing XGBoost and Hyperparameter Optimization

Introduction:

In my algorithmic trading journey, I’ve explored various strategies and machine learning models to improve performance. This article delves into my experience using XGBoost, a powerful classifier, and hyperparameter optimization to refine my trading signals.

Why XGBoost?

While my previous strategies yielded promising results, they faced challenges like imbalanced classes and subpar precision for non-neutral signals. XGBoost boasts several advantages:

  • Tools for imbalanced data: It supports per-sample weights (and scale_pos_weight for binary problems), which help keep the rarer long and short signals from being drowned out by neutral ones.
  • Robustness and flexibility: Its built-in regularization helps limit overfitting, and it works well on tabular features such as indicators, making it versatile for algorithmic trading.
  • Hyperparameter optimization potential: With a rich set of hyperparameters, XGBoost allows me to fine-tune its performance, potentially unlocking significant gains.

Unleashing XGBoost’s Power:

To maximize XGBoost’s potential, I meticulously selected hyperparameters using techniques like:

  • Grid search: Systematically evaluating combinations of values to identify promising candidates.
  • Random search: Exploring the hyperparameter space more broadly to potentially discover hidden gems.
  • Bayesian optimization: Leveraging statistical methods to efficiently navigate the search space, focusing on promising regions.

My goal is to:

  • Improve signal classification accuracy and precision: Accurately identifying profitable trends and avoiding false positives.
  • Boost overall trading performance: Enhance profitability, reduce stop losses, and manage risk effectively.

Join me on this exploration!

In this article, I’ll:

  • Deep dive into XGBoost, its workings, and its suitability for algorithmic trading.
  • Provide a hands-on guide to implementing XGBoost with hyperparameter optimization in your trading strategy.
  • Share the results of my experiments, showcasing the potential performance gains.
  • Discuss the practical considerations and challenges of using XGBoost in real-world trading.

The data: Bitcoin OHLCV candles on a 15-minute timeframe from January 1, 2021 to October 10, 2023 (1,000+ days, roughly 97,000 rows to train on). You can use any data that suits your needs.

Get ready to witness the transformative power of XGBoost and hyperparameter optimization in your algorithmic trading endeavors!

XGBoost, CatBoost, LGBM (Image Source — google search)

Our Algorithmic Trading Vs/+ Machine Learning Journey so far?

Stage 1:

We developed a crypto algorithmic strategy that, when run across multiple crypto assets (138+), showed profits of 8787%+ in backtests over a span of almost 3 years.

“The 8787%+ ROI Algo Strategy Unveiled for Crypto Futures! Revolutionized With Famous RSI, MACD, Bollinger Bands, ADX, EMA” — Link

We then ran the strategy live in dry-run mode for 7 days; the details are shared in another article.

“Freqtrade Revealed: 7-Day Journey in Algorithmic Trading for Crypto Futures Market” — Link

After successful backtest results and forward testing (live trading in dry-run mode), we set out to improve the odds of making more profit with the same strategy: fewer stop-losses, a higher win rate, lower risk, and other improvements.

Stage 2:

We developed a standalone strategy without the Freqtrade setup, that is, without the trailing stop-loss, parallel multi-asset execution, and richer risk management that Freqtrade provides for free (it is a free, open-source platform). We tested it in the market, optimized it with hyperparameters, and got some positive profits from the strategy.

“How I achieved 3000+% Profit in Backtesting for Various Algorithmic Trading Bots and how you can do the same for your Trading Strategies — Using Python Code” — Link

Stage 3:

Because we had tested our strategy on only one asset, BTC/USDT, we wanted to know whether the full set of assets used earlier for the Freqtrade strategy could be segregated into clusters based on volatility. Trading by volatility cluster makes it easier to handle the more volatile assets and avoids hitting large stop-losses on the others.

We used K-nearest Neighbors (KNN Means) to identify different clusters among the 138 crypto assets we use in our Freqtrade strategy, which gave us 8000+% profits in backtests.

“Hyper Optimized Algorithmic Strategy Vs/+ Machine Learning Models Part -1 (K-Nearest Neighbors)” — Link

Stage 4:

Next, we introduced an unsupervised machine learning model, the Hidden Markov Model (HMM), to identify market trends so that we trade only during profitable trends and avoid sudden pumps, dumps, and negative trends. The article below explains the approach.

“Hyper Optimized Algorithmic Strategy Vs/+ Machine Learning Models Part -2 (Hidden Markov Model — HMM)” — Link

Stage 5 (Present):

Now we will use an XGBoost classifier to identify long and short trades from our existing signal. Before doing so, we make sure the signal algorithm we built previously is hyper-optimized. We also introduce a different stop-loss and take-profit for this setup, so the target values change accordingly, and so do the parameters that produce profitable trades (we hyper-optimize them anyway to get the best results). We then test a basic XGBClassifier setup. To improve the results, we add re-sampling methods: our target classes are 0 (neutral), 1 (long trades), and 2 (short trades), and they are imbalanced because most of the time the strategy is waiting for the right moment to execute, so 0s far outnumber the rest. After re-sampling, we hyper-optimize the classifier and check whether other models (SVC, CatBoost, LightGBM, and an LSTM combined with XGBoost) improve on it. Finally, we conclude with the results and look at feature importance to see which features contribute most.

The Code Explanation

# Remove Future Warnings
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

# General
import json
from datetime import datetime
import numpy as np

# Data Management
import pandas as pd
from pandas_datareader.data import DataReader

# Machine Learning
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import RandomizedSearchCV, cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold

# Binary Classification Specific Metrics
# (RocCurveDisplay is a class; the alias keeps the older plot_roc_curve name)
from sklearn.metrics import RocCurveDisplay as plot_roc_curve

# General Metrics
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import confusion_matrix, classification_report, roc_curve, roc_auc_score
from sklearn.metrics import ConfusionMatrixDisplay

# Reporting
import matplotlib.pyplot as plt
from matplotlib.pylab import rcParams
from xgboost import plot_tree

# Backtesting
from backtesting import Backtest
from backtesting import Strategy

# Hyperopt
from hyperopt import fmin, tpe, hp

# Technical analysis and exchange access
import talib as ta
import ccxt

1. Setting the Stage:

Suppressing Future Warnings:

  • import warnings
  • warnings.simplefilter(action='ignore', category=FutureWarning)
  • This prevents distracting messages from potential changes in future library versions, keeping the output focused.

2. Importing Essential Tools:

NumPy for numerical computations:

  • import numpy as np

Pandas for data manipulation:

  • import pandas as pd

Sklearn for model selection and evaluation:

  • from sklearn.model_selection import train_test_split
  • from sklearn.model_selection import RandomizedSearchCV, cross_val_score
  • from sklearn.model_selection import RepeatedStratifiedKFold

Metrics for performance assessment:

  • from sklearn.metrics import ... (various metrics)

Visualization tools for insights:

  • import matplotlib.pyplot as plt
  • from matplotlib.pylab import rcParams
  • from xgboost import plot_tree

Backtesting framework for historical performance evaluation:

  • from backtesting import Backtest, Strategy

Hyperparameter optimization:

  • from hyperopt import fmin, tpe, hp

Data retrieval and technical analysis:

  • from pandas_datareader.data import DataReader
  • import json
  • from datetime import datetime
  • import talib as ta
  • import ccxt (for potential live trading interactions)

# Define the path to your JSON file
file_path = "../BTC_USDT_USDT-15m-futures.json"

# Open the file and read the data
with open(file_path, "r") as f:
    data = json.load(f)

# Check the data structure
# print(data) # Should be a list of rows: [timestamp_ms, open, high, low, close, volume]

# When using heavy data, open the notebook with this command:
# jupyter notebook --NotebookApp.iopub_data_rate_limit=100000000

df = pd.DataFrame(data)

# Name the OHLCV columns (adjust as needed); "Adj Close" holds the candle's close price
# ohlc_data = df[["date","open", "high", "low", "close", "volume"]]
df.rename(columns={0: "Date", 1: "Open", 2: "High", 3: "Low", 4: "Adj Close", 5: "Volume"}, inplace=True)

# Convert millisecond timestamps to datetime objects
df["Date"] = pd.to_datetime(df['Date'] / 1000, unit='s')

df.set_index("Date", inplace=True)

# Format the date index as human-readable strings
df.index = df.index.strftime("%m-%d-%Y %H:%M")

# print(df.dropna(), df.describe(), df.info())

data = df

data

Loading and Prepping the Data:

Locating the Data:

  • I first specified the path to my JSON file containing historical market data for BTC_USDT futures on a 15-minute timeframe: file_path = '../BTC_USDT_USDT-15m-futures.json'

Reading the JSON:

  • I used Python’s json library to read the data from the file and stored it in a variable named data.

Creating a DataFrame:

  • I converted the JSON data into a Pandas DataFrame, a powerful tool for data analysis and manipulation, using df = pd.DataFrame(data).

Renaming Columns:

  • I renamed the columns for clarity: df.rename(columns={0: 'Date', 1: 'Open', 2: 'High', 3: 'Low', 4: 'Adj Close', 5: 'Volume'}, inplace=True).

Handling Timestamps:

  • I converted the millisecond timestamps in the ‘Date’ column to datetime objects for easier time-based analysis: df['Date'] = pd.to_datetime(df['Date'] / 1000, unit='s').

Setting the Index:

  • I set the ‘Date’ column as the index, making it easier to access and manipulate time-based data: df.set_index('Date', inplace=True).

Formatting the Index:

  • I formatted the index to display a human-readable date and time format: df.index = df.index.strftime('%m-%d-%Y %H:%M').

Assigning the DataFrame:

  • I stored the modified DataFrame back into the data variable for further use.
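
The loading steps above can be tried on a couple of rows without the JSON file. This is a minimal sketch, assuming each row follows the [timestamp_ms, open, high, low, close, volume] layout; the sample values and the name candles are made up for illustration. It passes unit="ms" to pd.to_datetime, which is an alternative to dividing by 1000 first.

```python
import pandas as pd

# Hypothetical sample rows: [timestamp_ms, open, high, low, close, volume]
rows = [
    [1609459200000, 28923.63, 29031.34, 28690.17, 28995.13, 2311.81],
    [1609460100000, 28995.13, 29050.00, 28900.00, 28950.50, 1800.25],
]

candles = pd.DataFrame(rows, columns=["Date", "Open", "High", "Low", "Close", "Volume"])

# unit="ms" converts epoch milliseconds directly, with no manual division
candles["Date"] = pd.to_datetime(candles["Date"], unit="ms")
candles = candles.set_index("Date")

print(candles.index[0])                     # open time of the first candle
print(candles.index[1] - candles.index[0])  # spacing between consecutive candles
```

Checking the spacing between consecutive candles this way is a quick sanity test that the file really holds 15-minute data.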

OHLCV data of Bitcoin on the 15-minute timeframe

# Add Returns and Range
target_prediction_number = 2
df = data.copy()
df["Returns"] = (df["Adj Close"] / df["Adj Close"].shift(target_prediction_number)) - 1
df["Range"] = (df["High"] / df["Low"]) - 1
df["Volatility"] = df['Returns'].rolling(window=target_prediction_number).std()
df.dropna(inplace=True)
print("Length: ", len(df))
df
# df_returns

Incorporating Returns, Range, and Volatility:

The code begins by defining a target_prediction_number of 2. This is the horizon, in candles, that the later prediction target is built around (2 periods, or 30 minutes on this timeframe); the features below look back over the same horizon. A copy of the DataFrame is then created to avoid modifying the original data.

Next, the code calculates percentage returns for each period using the formula df['Returns'] = (df['Adj Close'] / df['Adj Close'].shift(target_prediction_number)) - 1. This measures the price change relative to the price 2 periods ago.

The range of price movement within each period is then calculated using df['Range'] = (df['High'] / df['Low']) - 1. This captures the volatility within each 15-minute timeframe.

A rolling volatility measure is introduced using df['Volatility'] = df['Returns'].rolling(window=target_prediction_number).std(). This captures the degree of price fluctuations over the past 2 periods, providing insights into market uncertainty.

Finally, any rows with missing data are removed using df.dropna(inplace=True); these are the first few rows, where the shifted and rolling values are not yet defined. The length of the DataFrame is then printed to show how much data remains for analysis.

Key Points:

  • Returns measure price changes over time, reflecting asset performance.
  • Range captures volatility within each period, indicating price swings.
  • Volatility assesses overall market uncertainty, crucial for risk management.
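
To make the three features concrete, here is a small worked example on five made-up candles. The prices and the names toy and horizon are illustrative assumptions; the formulas are the ones used above.

```python
import pandas as pd

horizon = 2  # plays the role of target_prediction_number

# Hypothetical prices, chosen so the arithmetic is easy to follow
toy = pd.DataFrame({
    "High":      [101.0, 103.0, 106.0, 108.0, 110.0],
    "Low":       [99.0, 100.0, 102.0, 104.0, 105.0],
    "Adj Close": [100.0, 102.0, 105.0, 107.1, 105.0],
})

# Price change relative to the close `horizon` candles earlier
toy["Returns"] = toy["Adj Close"] / toy["Adj Close"].shift(horizon) - 1
# High-to-low spread within each candle
toy["Range"] = toy["High"] / toy["Low"] - 1
# Sample standard deviation of the last `horizon` return values
toy["Volatility"] = toy["Returns"].rolling(window=horizon).std()

print(toy.round(4))
print("rows left after dropna:", len(toy.dropna()))
```

The third row's return is 105 / 100 - 1 = 0.05. With a horizon of 2, the first three rows lack a defined volatility, so dropna() removes them.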
# Add Moving Average
df["MA_12"] = df["Adj Close"].rolling(window=12).mean()
df["MA_21"] = df["Adj Close"].rolling(window=21).mean()

def trade_signal(dataframe=df, rsi_tp=19, bb_tp=16, vol_long=42, vol_short=29):
    # Compute indicators
    dataframe['RSI'] = ta.RSI(dataframe['Adj Close'], timeperiod=rsi_tp)
    dataframe['upper_band'], dataframe['middle_band'], dataframe['lower_band'] = ta.BBANDS(dataframe['Adj Close'], timeperiod=bb_tp)
    dataframe['macd'], dataframe['signal'], _ = ta.MACD(dataframe['Adj Close'])

    # Long: bullish momentum, price in the upper half of the bands,
    # a green candle with a small upper wick, and above-average volume
    conditions_long = ((dataframe['RSI'] > 50) &
                       (dataframe['Adj Close'] > dataframe['middle_band']) &
                       (dataframe['Adj Close'] < dataframe['upper_band']) &
                       (dataframe['macd'] > dataframe['signal']) &
                       ((dataframe['High'] - dataframe['Adj Close']) < (dataframe['Adj Close'] - dataframe['Open'])) &
                       (dataframe['Adj Close'] > dataframe['Open']) &
                       (dataframe['Volume'] > dataframe['Volume'].rolling(window=vol_long).mean()))

    # Short: the mirror image of the long conditions
    conditions_short = ((dataframe['RSI'] < 50) &
                        (dataframe['Adj Close'] < dataframe['middle_band']) &
                        (dataframe['Adj Close'] > dataframe['lower_band']) &
                        (dataframe['macd'] < dataframe['signal']) &
                        ((dataframe['Adj Close'] - dataframe['Low']) < (dataframe['Open'] - dataframe['Adj Close'])) &
                        (dataframe['Adj Close'] < dataframe['Open']) &
                        (dataframe['Volume'] > dataframe['Volume'].rolling(window=vol_short).mean()))

    # 0 = neutral, 1 = long, 2 = short
    dataframe['trend'] = 0
    dataframe.loc[conditions_long, 'trend'] = 1
    dataframe.loc[conditions_short, 'trend'] = 2

    # Drops the warm-up rows in place, so the caller's DataFrame is modified
    dataframe.dropna(inplace=True)

    return dataframe

# trading_signal = trade_signal(dataframe=df, rsi_tp=19, bb_tp=16, vol_long=42, vol_short=29)
df['trade_signal'] = trade_signal(dataframe=df, rsi_tp=19, bb_tp=16, vol_long=42, vol_short=50)['trend']
df['Close'] = df['Adj Close']

df.info()

# Check for inf or nan values
inf_mask = np.isinf(df) | np.isnan(df)

# Remove rows containing inf or nan values
df_cleaned = df[~inf_mask.any(axis=1)]

# Remove columns containing inf or nan values (.copy() avoids a SettingWithCopyWarning below)
df = df_cleaned.loc[:, ~inf_mask.any(axis=0)].copy()
df.dropna(inplace=True)

df

Incorporating Moving Averages and Technical Indicators:

Adding Moving Averages:

  • The code calculates 12-period and 21-period moving averages of the adjusted closing price using df["MA_12"] = df["Adj Close"].rolling(window=12).mean() and df["MA_21"] = df["Adj Close"].rolling(window=21).mean(). These moving averages smooth out price fluctuations and help identify trends.

Defining a Trade Signal Function:

  • The trade_signal function takes a DataFrame, along with the RSI period, the Bollinger Bands period, and the rolling windows for the long and short volume averages, and generates trading signals based on a combination of technical indicators:
  • RSI (Relative Strength Index): Measures momentum and overbought/oversold conditions.
  • Bollinger Bands: Provide information about volatility and potential price reversals.
  • MACD (Moving Average Convergence Divergence): Identifies trend changes and momentum shifts.
  • Volume: Confirms the strength of price movements; a candle qualifies only if its volume exceeds the rolling average.
  • The function assigns a “trend” value of 1 for long signals, 2 for short signals, and 0 for neutral signals.

Applying the Trade Signal Function:

  • The function is applied to the DataFrame to generate trading signals and assign them to a new “trade_signal” column.

Data Cleaning:

  • The code checks for infinite or NaN (Not a Number) values and removes any rows or columns containing them to ensure data integrity.

Key Points:

  • Moving averages help identify trends and potential support/resistance levels.
  • Technical indicators provide insights into momentum, volatility, and potential trend reversals.
  • The trade_signal function combines multiple indicators to create trading signals, aiming to capture opportunities based on technical analysis principles.
  • Data cleaning is essential to ensure accurate analysis and avoid errors.
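
For readers without TA-Lib installed, two of the long-side filters can be approximated in plain pandas. This is a rough sketch rather than the article's implementation: the helper bollinger_bands is hypothetical, the 2-standard-deviation width and the population standard deviation (ddof=0) are assumptions chosen to resemble TA-Lib's defaults, and the price and volume series are random.

```python
import numpy as np
import pandas as pd

def bollinger_bands(close: pd.Series, period: int = 16, num_std: float = 2.0):
    """SMA-based bands; ddof=0 is an assumption meant to mirror TA-Lib's BBANDS."""
    middle = close.rolling(window=period).mean()
    spread = close.rolling(window=period).std(ddof=0) * num_std
    return middle + spread, middle, middle - spread

# Synthetic random-walk prices and random volumes, for illustration only
rng = np.random.default_rng(0)
close = pd.Series(100 + rng.normal(0, 1, 200).cumsum())
volume = pd.Series(rng.uniform(50, 150, 200))

upper, middle, lower = bollinger_bands(close, period=16)

# Two of the long-side filters from trade_signal, isolated for inspection
inside_upper_half = (close > middle) & (close < upper)
volume_confirmed = volume > volume.rolling(window=42).mean()

print("bars in the upper half of the bands:", int(inside_upper_half.sum()))
print("bars passing both filters:", int((inside_upper_half & volume_confirmed).sum()))
```

Counting how many bars survive each filter gives a feel for why the combined signal is neutral most of the time.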
df = df.reset_index(inplace=False)
df['Date'] = pd.to_datetime(df['Date'])
df.set_index('Date', inplace=True)
df

def SIGNAL(df):
    # Expose the precomputed 'trend' column to the strategy
    return df.trend

class MyCandlesStrat(Strategy):
    def init(self):
        super().init()
        self.signal1 = self.I(SIGNAL, self.data)

    def next(self):
        super().next()
        # [-1] reads the signal for the most recent bar
        if self.signal1[-1] == 1:
            sl1 = self.data.Close[-1] - 500
            tp1 = self.data.Close[-1] + 250
            #tp1 = self.data.upper_band[-1]
            #sl1 = self.data.lower_band[-1]
            self.buy(sl=sl1, tp=tp1)
        elif self.signal1[-1] == 2:
            sl1 = self.data.Close[-1] + 500
            tp1 = self.data.Close[-1] - 250
            #sl1 = self.data.upper_band[-1]
            #tp1 = self.data.lower_band[-1]
            self.sell(sl=sl1, tp=tp1)

bt = Backtest(df, MyCandlesStrat, cash=100000, commission=.001)
stat = bt.run()
stat

Preparing for Backtesting:

Resetting Index:

  • The code resets the index of the DataFrame to make it compatible with the backtesting framework, ensuring proper alignment of data and signals.

Converting Datetime:

  • It ensures the “Date” column is in a datetime format for accurate time-based operations in the backtest.

Setting Index:

  • The “Date” column is set as the index again for time-based analysis.

Defining a Signal Function:

  • SIGNAL Function: This simple function returns the “trend” column from the DataFrame to provide trading signals to the backtesting strategy.

Creating a Backtesting Strategy:

  • MyCandlesStrat Class: This class represents the backtesting strategy, defining how it reacts to signals and executes trades.
  • init Method: Initializes the strategy by retrieving the trading signals using self.signal1 = self.I(SIGNAL, self.data).
  • next Method: Called for each time step in the backtest. It checks the current signal:

  • If self.signal1 is 1 (long signal): Places a buy order with a stop-loss (sl) 500 points below the current close price and a take-profit (tp) 250 points above.
  • If self.signal1 is 2 (short signal): Places a sell order with a stop-loss 500 points above the current close price and a take-profit 250 points below.

Running the Backtest:

  • Initializing Backtest: An instance of the Backtest class is created, taking the DataFrame, strategy, initial cash, and commission as parameters.
  • Executing Backtest: The bt.run() method runs the backtest, simulating trades based on the strategy and market data.
  • Storing Results: The performance statistics of the backtest are stored in the stat variable for analysis.

Key Points:

  • Backtesting allows for evaluating trading strategies on historical data before real-world deployment.
  • The strategy in this code uses technical analysis signals to generate buy and sell orders.
  • Stop-losses and take-profits are implemented for risk management.
  • Performance statistics from the backtest provide insights into the strategy’s potential profitability and risk profile.
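
One consequence of the fixed levels is worth working out: with a take-profit of 250 and a stop-loss of 500, each loss costs twice what a win earns, so the strategy needs a high win rate just to break even. The sketch below computes that threshold. The helper breakeven_win_rate and the 60-point round-trip cost are hypothetical, and it does not model how backtesting.py actually applies commission.

```python
def breakeven_win_rate(take_profit: float, stop_loss: float, round_trip_cost: float = 0.0) -> float:
    """Win rate at which expected profit per trade is zero for fixed TP/SL distances."""
    net_win = take_profit - round_trip_cost   # what a winning trade keeps
    net_loss = stop_loss + round_trip_cost    # what a losing trade costs
    return net_loss / (net_win + net_loss)

print(f"no costs: {breakeven_win_rate(250, 500):.1%}")
print(f"with a hypothetical 60-point round-trip cost: {breakeven_win_rate(250, 500, 60):.1%}")
```

Without costs the threshold is 500 / 750, about 66.7%, which is a useful yardstick when reading the Win Rate [%] figure in the backtest output.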

Backtest results before hyper-optimization, showing a loss

Here’s a breakdown of the information provided:

  1. Start: The start date and time of the backtest period.
  2. End: The end date and time of the backtest period.
  3. Duration: The duration of the backtest period, which is 1015 days and 22 hours.
  4. Exposure Time [%]: The percentage of time the strategy was exposed to the market.
  5. Equity Final [$]: The final equity value at the end of the backtest period.
  6. Equity Peak [$]: The peak equity value reached during the backtest period.
  7. Return [%]: The percentage return generated by the strategy over the backtest period.
  8. Buy & Hold Return [%]: The percentage return that would have been achieved if the assets were held without trading.
  9. Return (Ann.) [%]: The annualized percentage return of the strategy.
  10. Volatility (Ann.) [%]: The annualized volatility of returns.
  11. Sharpe Ratio: The risk-adjusted return measure, calculated as the ratio of the strategy’s excess return to its volatility.
  12. Sortino Ratio: A modified version of the Sharpe ratio that only considers the downside risk.
  13. Calmar Ratio: The ratio of the annualized return to the maximum drawdown, providing a measure of risk-adjusted return.
  14. Max. Drawdown [%]: The maximum percentage decline in equity from a peak value.
  15. Avg. Drawdown [%]: The average percentage decline in equity during drawdown periods.
  16. Max. Drawdown Duration: The duration of the longest drawdown period.
  17. Avg. Drawdown Duration: The average duration of drawdown periods.
  18. # Trades: The total number of trades executed by the strategy during the backtest.
  19. Win Rate [%]: The percentage of trades that were profitable.
  20. Best Trade [%]: The highest percentage return achieved by a single trade.
  21. Worst Trade [%]: The lowest percentage return achieved by a single trade.
  22. Avg. Trade [%]: The average percentage return per trade.
  23. Max. Trade Duration: The duration of the longest trade.
  24. Avg. Trade Duration: The average duration of trades.
  25. Profit Factor: The ratio of gross profits to gross losses.
  26. Expectancy [%]: The average expected return per trade.
  27. SQN: The System Quality Number, which measures the relationship between the mean (average) and the standard deviation of the R-multiple distribution.
  28. _strategy: The name or description of the strategy used for the backtest.
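
A few of these statistics are simple enough to compute by hand, which helps when sanity-checking a report. The sketch below uses textbook definitions on made-up numbers; the helper names are hypothetical, and backtesting.py's exact formulas may differ in detail.

```python
import numpy as np

def max_drawdown_pct(equity: np.ndarray) -> float:
    """Largest percentage fall of the equity curve from a running peak."""
    running_peak = np.maximum.accumulate(equity)
    drawdowns = (equity - running_peak) / running_peak
    return float(drawdowns.min() * 100)

def profit_factor(trade_pnls: np.ndarray) -> float:
    """Gross profit divided by gross loss."""
    gross_profit = trade_pnls[trade_pnls > 0].sum()
    gross_loss = -trade_pnls[trade_pnls < 0].sum()
    return float(gross_profit / gross_loss) if gross_loss > 0 else float("inf")

def win_rate_pct(trade_pnls: np.ndarray) -> float:
    """Share of trades that closed with a profit, in percent."""
    return float((trade_pnls > 0).mean() * 100)

# Hypothetical equity curve and per-trade results
equity = np.array([100_000, 104_000, 101_000, 98_800, 103_000, 107_000], dtype=float)
trades = np.array([250.0, -500.0, 250.0, 250.0, -500.0, 250.0])

print(f"max drawdown: {max_drawdown_pct(equity):.2f}%")
print(f"profit factor: {profit_factor(trades):.2f}")
print(f"win rate: {win_rate_pct(trades):.1f}%")
```

In this toy case, four wins of 250 against two losses of 500 give a profit factor of exactly 1: a two-thirds win rate only breaks even when losses are twice the size of wins.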

  • We got a -25% return with arbitrarily chosen values for the function's input parameters.
  • We need hyperparameter optimization to get the best possible results for the strategy developed here.

Hyper-Optimization of Strategy

def objective(params):
    # hp.quniform yields floats, so cast to int before using them as periods
    rsi_tp = int(params['rsi_tp'])
    bb_tp = int(params['bb_tp'])
    vol_long = int(params['vol_long'])
    vol_short = int(params['vol_short'])

    df_copy = df.copy()  # Make a copy to avoid modifying the original DataFrame
    df_copy.dropna(inplace=True)

    # Recompute the signals on the copy with the candidate parameters
    trade_signal(df_copy, rsi_tp, bb_tp, vol_long, vol_short)

    # Create a backtest instance
    bt = Backtest(df_copy, MyCandlesStrat, cash=100000, commission=.001)

    # Run the backtest and get the statistics
    stats = bt.run()

    # Choose the 'Return [%]' key for optimization
    key_to_optimize = 'Return [%]'

    # fmin minimizes, so return the negative of the metric to maximize it
    metric_to_optimize = -stats[key_to_optimize]

    return metric_to_optimize


space = {
    'rsi_tp': hp.quniform('rsi_tp', 5, 30, 1),  # Adjust the range and step as needed
    'bb_tp': hp.quniform('bb_tp', 2, 20, 1),
    'vol_long': hp.quniform('vol_long', 5, 50, 1),
    'vol_short': hp.quniform('vol_short', 5, 50, 1)
}

best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=210)

best

# int() works whether hyperopt returns Python floats or NumPy floats
bb_tp = int(best['bb_tp'])
rsi_tp = int(best['rsi_tp'])
vol_long = int(best['vol_long'])
vol_short = int(best['vol_short'])

print(f' bb_tp = {bb_tp}\n rsi_tp = {rsi_tp}\n vol_long = {vol_long}\n vol_short = {vol_short} ')

Optimizing Parameters for Enhanced Performance:

Defining the Objective Function:

  • The objective function takes a set of parameters as input and evaluates the performance of the trading strategy using those parameters.
  • It creates a copy of the DataFrame to avoid modifications, calculates trading signals based on the given parameters, runs a backtest, and returns the negative value of the “Return [%]” statistic, aiming to maximize the return.

Creating the Parameter Space:

  • The space dictionary defines the ranges of parameters to be explored during optimization:
  • RSI time period (rsi_tp) from 5 to 30
  • Bollinger Bands time period (bb_tp) from 2 to 20
  • Rolling window for the long-side volume average (vol_long) from 5 to 50
  • Rolling window for the short-side volume average (vol_short) from 5 to 50

Performing Optimization:

  • The fmin function from the hyperopt library is used to find the best parameter combination within the specified space. It employs the Tree of Parzen Estimators (TPE) algorithm to guide the search efficiently.
  • The goal is to minimize the metric_to_optimize (negative return), effectively maximizing the strategy's return.

Retrieving Best Parameters:

  • The optimal parameters are extracted from the best result and printed for reference.

Key Points:

  • Parameter optimization is crucial for tailoring trading strategies to specific market conditions and data.
  • The hyperopt library provides tools for efficient hyperparameter tuning.
  • The TPE algorithm is a powerful technique for guiding the search process.
  • Optimization aims to maximize the strategy’s return, leading to potentially improved performance.
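
The "minimize the negative return" convention can be seen in isolation with a toy objective. In this sketch, plain random search stands in for TPE, and toy_return_pct is a made-up function replacing the real backtest; only the negation and the integer casting reflect the pattern used above.

```python
import random

def toy_return_pct(rsi_tp: int, bb_tp: int) -> float:
    """Hypothetical stand-in for a backtest; peaks at rsi_tp=14, bb_tp=10."""
    return 50.0 - (rsi_tp - 14) ** 2 - 2 * (bb_tp - 10) ** 2

def objective(params: dict) -> float:
    # Minimizers look for the smallest value, so the return is negated
    return -toy_return_pct(int(params["rsi_tp"]), int(params["bb_tp"]))

random.seed(42)
best_params, best_loss = None, float("inf")
for _ in range(200):
    # Floats on an integer grid, similar in spirit to hp.quniform(..., q=1)
    candidate = {
        "rsi_tp": float(random.randint(5, 30)),
        "bb_tp": float(random.randint(2, 20)),
    }
    loss = objective(candidate)
    if loss < best_loss:
        best_params, best_loss = candidate, loss

print("best params:", {name: int(value) for name, value in best_params.items()})
print("best return [%]:", -best_loss)
```

The lowest loss found corresponds to the highest return, which is why the sign is flipped back when reporting. TPE differs from this loop in how it proposes candidates: it uses the results seen so far instead of sampling blindly.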

Results After Hyper-Optimization:


def trade_signal(dataframe=df, rsi_tp=rsi_tp, bb_tp=bb_tp, vol_long=vol_long, vol_short=vol_short):

    # Compute indicators
    dataframe['RSI'] = ta.RSI(dataframe['Adj Close'], timeperiod=rsi_tp)
    dataframe['upper_band'], dataframe['middle_band'], dataframe['lower_band'] = ta.BBANDS(dataframe['Adj Close'], timeperiod=bb_tp)
    dataframe['macd'], dataframe['signal'], _ = ta.MACD(dataframe['Adj Close'])

    conditions_long = ((dataframe['RSI'] > 50) &
                       (dataframe['Adj Close'] > dataframe['middle_band']) &
                       (dataframe['Adj Close'] < dataframe['upper_band']) &
                       (dataframe['macd'] > dataframe['signal']) &
                       ((dataframe['High'] - dataframe['Adj Close']) < (dataframe['Adj Close'] - dataframe['Open'])) &
                       (dataframe['Adj Close'] > dataframe['Open']) &
                       (dataframe['Volume'] > dataframe['Volume'].rolling(window=vol_long).mean()))

    conditions_short = ((dataframe['RSI'] < 50) &
                        (dataframe['Adj Close'] < dataframe['middle_band']) &
                        (dataframe['Adj Close'] > dataframe['lower_band']) &
                        (dataframe['macd'] < dataframe['signal']) &
                        ((dataframe['Adj Close'] - dataframe['Low']) < (dataframe['Open'] - dataframe['Adj Close'])) &
                        (dataframe['Adj Close'] < dataframe['Open']) &
                        (dataframe['Volume'] > dataframe['Volume'].rolling(window=vol_short).mean()))

    dataframe['trend'] = 0
    dataframe.loc[conditions_long, 'trend'] = 1
    dataframe.loc[conditions_short, 'trend'] = 2

    dataframe.dropna(inplace=True)

    return dataframe

df['trade_signal_2'] = trade_signal(dataframe=df, rsi_tp=rsi_tp, bb_tp=bb_tp, vol_long=vol_long, vol_short=vol_short)['trend']


def SIGNAL(df):
    return df.trend

class MyCandlesStrat(Strategy):
    def init(self):
        super().init()
        self.signal1 = self.I(SIGNAL, self.data)

    def next(self):
        super().next()
        # [-1] reads the signal for the most recent bar
        if self.signal1[-1] == 1:
            sl1 = self.data.Close[-1] - 500
            tp1 = self.data.Close[-1] + 250
            #tp1 = self.data.upper_band[-1]
            #sl1 = self.data.lower_band[-1]
            self.buy(sl=sl1, tp=tp1)
        elif self.signal1[-1] == 2:
            sl1 = self.data.Close[-1] + 500
            tp1 = self.data.Close[-1] - 250
            #sl1 = self.data.upper_band[-1]
            #tp1 = self.data.lower_band[-1]
            self.sell(sl=sl1, tp=tp1)

bt = Backtest(df, MyCandlesStrat, cash=100000, commission=.001)
stat = bt.run()
stat

Generating Refined Trading Signals and Backtesting:

Refining Trading Signals:

  • The trade_signal function is called again, incorporating the optimized parameters obtained earlier.
  • It recalculates the technical indicators and trading signals based on the updated parameters, adding a new “trade_signal_2” column to the DataFrame.

Signal Retrieval:

  • The SIGNAL function remains unchanged, still returning the "trend" column (now containing the refined signals) for backtesting.

Backtesting Strategy:

  • The MyCandlesStrat strategy class and backtesting process are consistent with the previous code, using the refined signals to generate buy and sell orders.

Key Points:

  • The code emphasizes the importance of refining trading signals based on optimization results.
  • It demonstrates how to incorporate optimized parameters into the signal generation process.
  • The backtesting process remains essential for evaluating the performance of the strategy using the refined signals.

Results after hyper-optimization: positive return from backtesting Bitcoin data for 2021–2023

Overall, the strategy seems to have performed well, with a positive return of 54.5% over the backtest period. However, it’s essential to consider other metrics like drawdown, volatility, and risk-adjusted measures such as the Sharpe and Sortino ratios to assess the strategy’s risk and return profile comprehensively.

Data- Preprocessing — Setting up “Target” value for estimating future predictive values:

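We label each candle with the most profitable outcome available on a look-ahead basis. The labeling function takes two main parameters: barsupfront, the number of bars to look ahead, and df1, the DataFrame. For each candle it looks forward over those bars and assigns the Target a trade opened there would have needed, so that a bot following the labels would extract the best possible profit from the data provided. Because the labels are built from future bars, they serve as the training target, not as an input feature.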

# Target, flexible way
pipdiff = 250    # take-profit distance in price units
SLTPRatio = 0.5  # pipdiff / SLTPRatio gives the stop-loss distance (500 here)

def mytarget(barsupfront, df1):
    length = len(df1)
    high = list(df1['High'])
    low = list(df1['Low'])
    close = list(df1['Close'])
    open_ = list(df1['Open'])  # trailing underscore avoids shadowing the built-in open()
    trendcat = [None] * length
    for line in range(0, length - barsupfront - 2):
        valueOpenLow = 0   # largest drop below the entry price seen so far
        valueOpenHigh = 0  # largest rise above the entry price, stored as a negative number
        for i in range(1, barsupfront + 2):
            value1 = open_[line + 1] - low[line + i]
            value2 = open_[line + 1] - high[line + i]
            valueOpenLow = max(value1, valueOpenLow)
            valueOpenHigh = min(value2, valueOpenHigh)
            if (valueOpenLow >= pipdiff) and (-valueOpenHigh <= (pipdiff / SLTPRatio)):
                trendcat[line] = 2  # downtrend
                break
            elif (valueOpenLow <= (pipdiff / SLTPRatio)) and (-valueOpenHigh >= pipdiff):
                trendcat[line] = 1  # uptrend
                break
            else:
                trendcat[line] = 0  # no clear trend

    return trendcat


#!!! pitfall one category high frequency
df['Target'] = mytarget(3, df)
#df.tail(20)
df.replace([np.inf, -np.inf], np.nan, inplace=True)
df.dropna(axis=0, inplace=True)

# Convert only the label column to integers; casting the whole DataFrame
# would truncate float features such as Returns and Range
df['Target'] = df['Target'].astype(int)
df['Target'].hist()

target_counts = df['Target'].value_counts()
count_of_twos_target = target_counts.get(2, 0)
count_of_zeros_target = target_counts.get(0, 0)
count_of_ones_target = target_counts.get(1, 0)
# Share of non-neutral labels (1s and 2s) among all labels, in percent
percent_of_zeros_over_ones_and_twos = (100 - (count_of_zeros_target / (count_of_zeros_target + count_of_ones_target + count_of_twos_target)) * 100)
print(f' count_of_zeros = {count_of_zeros_target}\n count_of_twos_target = {count_of_twos_target}\n count_of_ones_target={count_of_ones_target}\n percent_of_zeros_over_ones_and_twos = {round(percent_of_zeros_over_ones_and_twos,2)}%')


# Check for NaN values:
has_nan = df['Target'].isnull().values.any()
print("NaN values present:", has_nan)

# Check for infinite values:
has_inf = df['Target'].isin([np.inf, -np.inf]).values.any()
print("Infinite values present:", has_inf)

# Count the number of NaN and infinite values:
nan_count = df['Target'].isnull().sum()
inf_count = (df['Target'] == np.inf).sum() + (df['Target'] == -np.inf).sum()
print("Number of NaN values:", nan_count)
print("Number of infinite values:", inf_count)

# Get the indices of NaN and infinite values:
nan_indices = df['Target'].index[df['Target'].isnull()]
inf_indices = df['Target'].index[df['Target'].isin([np.inf, -np.inf])]
print("Indices of NaN values:", nan_indices)
print("Indices of infinite values:", inf_indices)
df['Target']

Parameters and Constants:

  • pipdiff: This variable defines the minimum price difference required to trigger a target label. It's used as the threshold for determining whether the price movement constitutes a trend.
  • SLTPRatio: This variable defines the ratio used to calculate the stop loss (SL) level relative to the take profit (TP) level. It's used to determine the stop loss distance based on the pipdiff.
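
To make the relationship between the two constants concrete, here is a small worked example. The numeric values are hypothetical and chosen only for illustration (they are not the article's settings), and is_uptrend_bar is a simplified single-bar helper introduced here, not the mytarget function itself.

```python
# Hypothetical values, for illustration only (not the article's settings).
pipdiff = 300.0    # minimum favourable move, used as the take-profit distance
SLTPRatio = 2.0    # take-profit distance divided by stop-loss distance

tp_distance = pipdiff
sl_distance = pipdiff / SLTPRatio

print(f"TP distance: {tp_distance}, SL distance: {sl_distance}")


def is_uptrend_bar(open_price, high, low):
    """True when price reaches the TP distance upward while staying within the SL distance downward."""
    return (high - open_price) >= tp_distance and (open_price - low) <= sl_distance


print(is_uptrend_bar(open_price=40_000.0, high=40_320.0, low=39_900.0))
print(is_uptrend_bar(open_price=40_000.0, high=40_320.0, low=39_800.0))
```

A larger SLTPRatio tightens the stop loss relative to the take profit, so fewer bars qualify as an uptrend or downtrend and more end up labelled 0.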

Function mytarget:

  • This function takes two arguments: barsupfront and df1.
  • barsupfront: This parameter indicates the number of bars to look ahead for determining the trend.
  • df1: This parameter is assumed to be a DataFrame containing financial data with columns 'High', 'Low', 'Close', and 'Open'.
  • The function iterates over the rows of the DataFrame and calculates the difference between the open price and the high/low prices for each subsequent bar (barsupfront+2 bars ahead).
  • Based on these differences and the predefined conditions (pipdiff and SLTPRatio), it assigns a target label to each row indicating whether there is an uptrend (1), downtrend (2), or no clear trend (0).

Applying the Function:

  • The function is applied to the DataFrame df with a barsupfront value of 3, indicating that it will look ahead by 3 bars to determine the trend.
  • Any rows containing infinite or NaN values are dropped from the DataFrame.
  • All columns are cast to integers with df.astype(int), and a histogram of the ‘Target’ column is plotted to visualize the distribution of target labels.

Checking for NaN and Infinite Values:

  • It verifies if there are any NaN (Not a Number) or infinite values present in the ‘Target’ column of the DataFrame df.
  • The isnull() function checks for NaN values, while the isin([np.inf, -np.inf]) function checks for infinite values.
  • The .values.any() method returns True if any NaN or infinite values are found.

Counting NaN and Infinite Values:

  • If NaN or infinite values are present, the code counts the number of occurrences using the sum() function.
  • For infinite values, it sums the occurrences of both positive and negative infinity.

Getting Indices of NaN and Infinite Values:

  • It retrieves the indices (row numbers) where NaN or infinite values occur in the ‘Target’ column.
  • This helps identify the specific rows where these values are present.

Printing Information:

  • The code prints whether NaN or infinite values are present and the count of each type of value.
  • If values are found, it also prints the indices where they occur.

Analysis:

  • The code calculates and prints the counts of zeros, ones, and twos, along with the percentage of non-neutral labels (ones and twos) among all labels. Despite its name, percent_of_zeros_over_ones_and_twos holds this non-neutral share.
More 0’s (neutral value), than 1’s (Longs) and 2’s (Shorts) combined

We can see a clear imbalance between the classes 0, 1 and 2. We will continue with our calculations anyway and see how far the results can be improved under such a large imbalance. (In the coming article, we will create an ideal data distribution, use various technical indicators instead of our own signal, and then see the magic happen.)
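
The class shares can also be read off in one step with value_counts. The sketch below uses a small toy Series in place of df['Target']; the data is hypothetical and only meant to show the shape of the output.

```python
import pandas as pd

# Toy labels standing in for df['Target'] (hypothetical data).
target = pd.Series([0, 0, 0, 0, 0, 0, 1, 1, 2, 0])

counts = target.value_counts().sort_index()
shares = target.value_counts(normalize=True).sort_index() * 100

summary = pd.DataFrame({"count": counts, "percent": shares.round(2)})
print(summary)

# Share of non-neutral labels (classes 1 and 2) among all labels.
non_neutral_pct = 100 - shares.get(0, 0.0)
print(f"Non-neutral labels (1 and 2): {non_neutral_pct:.2f}%")
```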

Backtesting the “Target” to see Best Possible Profits

def SIGNAL(df):
    return df['Target']

from backtesting import Backtest, Strategy

class MyCandlesStrat(Strategy):
    def init(self):
        super().init()
        # Expose the Target column as an indicator
        self.signal1 = self.I(SIGNAL, self.data)

    def next(self):
        super().next()
        if self.signal1 == 1:  # long signal
            sl_pct = 0.02 # 2% stop-loss
            tp_pct = 0.05 # 5% take-profit
            sl_price = self.data.Close[-1] * (1 - sl_pct)
            tp_price = self.data.Close[-1] * (1 + tp_pct)
            self.buy(sl=sl_price, tp=tp_price)
        elif self.signal1 == 2:  # short signal
            sl_pct = 0.02 # 2% stop-loss
            tp_pct = 0.05 # 5% take-profit
            sl_price = self.data.Close[-1] * (1 + sl_pct)
            tp_price = self.data.Close[-1] * (1 - tp_pct)
            self.sell(sl=sl_price, tp=tp_price)

bt = Backtest(df, MyCandlesStrat, cash=100000, commission=.001)
stat = bt.run()
stat

and the output we got is

Amazing Billion+ % profit (from target Data) from Backtesting 2021–2023 data of bitcoin 15min time frame

Because the Target labels are built by looking ahead at future bars, this backtest contains look-ahead bias and shows the best possible outcome rather than an achievable one, which is why the profit is so large. Its only purpose is to validate the TARGET data, so that our machine learning models have profitable patterns to learn from.
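
The stop-loss and take-profit arithmetic used in the strategy can be checked in isolation. The helper below is a hypothetical convenience function (bracket_prices is not part of backtesting.py), and the entry price is made up for illustration.

```python
def bracket_prices(close, side, sl_pct=0.02, tp_pct=0.05):
    """Return (stop_loss, take_profit) prices for a long or short entry."""
    if side == "long":
        return close * (1 - sl_pct), close * (1 + tp_pct)
    if side == "short":
        return close * (1 + sl_pct), close * (1 - tp_pct)
    raise ValueError("side must be 'long' or 'short'")


# Hypothetical entry price, for illustration only.
long_sl, long_tp = bracket_prices(40_000.0, "long")
short_sl, short_tp = bracket_prices(40_000.0, "short")
print(long_sl, long_tp)
print(short_sl, short_tp)
```

For a long entry the stop sits below the close and the target above it; for a short entry the two are mirrored.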

attributes = ['trend', 'Target']
df_model= df[attributes].copy()

df_model['signal1'] = pd.Categorical(df_model['trend'])
dfDummies = pd.get_dummies(df_model['signal1'], prefix = 'signalcategory')
df_model= df_model.drop(['signal1'], axis=1)
df_model= df_model.drop(['trend'], axis=1)
df_model = pd.concat([df_model, dfDummies.astype(int)], axis=1)
df_model

Select Attributes:

  • attributes = ['trend', 'Target']: This list names the columns to keep: 'trend', the input feature, and 'Target', the label.

Create Copy of DataFrame:

  • df_model = df[attributes].copy(): This line creates a copy of the original DataFrame df containing only the selected attributes 'trend' and 'Target'.

Encode Categorical Variable:

  • df_model['signal1'] = pd.Categorical(df_model['trend']): This line converts the 'trend' column into a categorical variable.
  • dfDummies = pd.get_dummies(df_model['signal1'], prefix = 'signalcategory'): This line creates dummy variables for the categorical variable 'signal1' and prefixes their column names with 'signalcategory'.

Drop Original and Categorical Columns:

  • df_model = df_model.drop(['signal1'], axis=1): This line drops the temporary categorical column 'signal1' from the DataFrame.
  • df_model = df_model.drop(['trend'], axis=1): This line drops the original 'trend' column from the DataFrame.

Concatenate DataFrames:

  • df_model = pd.concat([df_model, dfDummies.astype(int)], axis=1): This line concatenates the original DataFrame df_model with the dummy variables DataFrame dfDummies, converting the dummy variables to integers.

The resulting DataFrame df_model contains the encoded categorical variable 'trend' as dummy variables along with the original 'Target' column, which is ready for use in modeling.
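
As a compact illustration of the same encoding, the sketch below calls get_dummies directly on the 'trend' column of a toy DataFrame. The data is hypothetical; the column names follow the article.

```python
import pandas as pd

# Toy data standing in for df[['trend', 'Target']] (hypothetical values).
toy = pd.DataFrame({"trend": [0, 1, 2, 1, 0], "Target": [0, 1, 2, 0, 0]})

# One column per trend category, stored as 0/1 integers.
dummies = pd.get_dummies(toy["trend"], prefix="signalcategory").astype(int)
toy_model = pd.concat([toy[["Target"]], dummies], axis=1)
print(toy_model)
```

Each row has exactly one of the signalcategory columns set to 1, so the model only ever sees three distinct input rows.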

Also, instead of scaling the raw values (the usual preparation when using ML models or analyzing time series data), we have reduced everything here to the categories 0, 1 and 2.

Prediction of Our Data — Simple way (For accuracy)

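from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, ConfusionMatrixDisplay
from sklearn.utils.class_weight import compute_class_weight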

attributes = ['signalcategory_0', 'signalcategory_1', 'signalcategory_2']
X = df_model[attributes]
y = df_model['Target']

train_pct_index = int(0.7 * len(X))
X_train, X_test = X[:train_pct_index], X[train_pct_index:]
y_train, y_test = y[:train_pct_index], y[train_pct_index:]

# Calculate class weights
class_labels = np.unique(y_train)
class_weight = compute_class_weight('balanced', classes=class_labels, y=y_train)
class_weight_dict = dict(zip(class_labels, class_weight))
combined_weight = sum(class_weight[1:]) / sum(class_weight) # Combine weights for classes 1 and 2

# Adjust class weights manually to give more weight to classes 1 and 2
class_weight_dict[0] = 1.0 # weight for class 0
class_weight_dict[1] = 5.0 # weight for class 1
class_weight_dict[2] = 5.0 # weight for class 2

model = XGBClassifier(objective="multi:softmax", eval_metric="merror", scale_pos_weight=class_weight_dict, n_estimators=400, max_depth=3, early_stopping_rounds=10, gamma=0.2, reg_alpha=0.5, reg_lambda=1.0)
model.fit(X_train, y_train, eval_set=[(X_test, y_test)], verbose=False)
pred_train = model.predict(X_train)
pred_test = model.predict(X_test)

acc_train = accuracy_score(y_train, pred_train)
acc_test = accuracy_score(y_test, pred_test)
print('****Train Results****')
print("Accuracy: {:.4%}".format(acc_train))
print('****Test Results****')
print("Accuracy: {:.4%}".format(acc_test))

# Predict probabilities
y_prob = model.predict_proba(X_test)

# Adjust the threshold
threshold = 0.3 # Adjust this threshold as needed
y_pred = (y_prob[:,1] > threshold).astype(int)

matrix_train = confusion_matrix(y_train, pred_train)
matrix_test = confusion_matrix(y_test, pred_test)

print(matrix_train)
print(matrix_test)

report_train = classification_report(y_train, pred_train)
report_test = classification_report(y_test, pred_test)

print(report_train)
print(report_test)
print(model.get_booster().feature_names)

# Training Confusion Matrix
cm_train = confusion_matrix(y_train, pred_train)
ConfusionMatrixDisplay(cm_train).plot()

# Test Confusion Matrix
cm_test = confusion_matrix(y_test, pred_test)
ConfusionMatrixDisplay(cm_test).plot()

Addressing Class Imbalance:

  • Class Weight Calculation: The code calculates class weights using Scikit-learn’s compute_class_weight function to counteract imbalanced target categories.
  • Manual Adjustment: The computed weights are then overwritten by hand, with 1.0 for class 0 and 5.0 for classes 1 and 2, to give the trading signals more emphasis.
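
The sketch below shows what the 'balanced' weights look like on a toy label array and how a per-class dictionary can be expanded into one weight per row. The labels are hypothetical; compute_class_weight and compute_sample_weight are scikit-learn functions.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight, compute_sample_weight

# Toy labels standing in for y_train (hypothetical data): 6 zeros, 3 ones, 1 two.
y_toy = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 2])

classes = np.unique(y_toy)
# 'balanced' weight = n_samples / (n_classes * count_of_class)
balanced = compute_class_weight("balanced", classes=classes, y=y_toy)
print(dict(zip(classes.tolist(), balanced.round(3).tolist())))

# Manual weights as in the article: 1.0 for class 0, 5.0 for classes 1 and 2.
manual = {0: 1.0, 1: 5.0, 2: 5.0}
sample_weight = compute_sample_weight(class_weight=manual, y=y_toy)
print(sample_weight)
```

An array like sample_weight can be passed to XGBClassifier.fit through its sample_weight argument, which is the usual way to weight classes in a multi-class XGBoost model.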

Model Training and Evaluation:

  • XGBoost Classifier: A multi-class XGBoost classifier is trained with parameters tailored for multi-class classification and imbalanced data.
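  • Early Stopping: Training halts after 10 rounds without improvement on the evaluation set, which limits overfitting. Because the evaluation set here is the test split, the test scores may be slightly optimistic.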
  • Accuracy Metrics: Accuracy scores are calculated for both training and test sets to assess model performance.
  • Confusion Matrices: Confusion matrices visualize model performance for each class, highlighting misclassification patterns.
  • Classification Reports: Detailed classification reports provide precision, recall, F1-score, and support for each class.
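
To show how the figures in a classification report relate to a confusion matrix, the sketch below derives precision, recall and accuracy by hand. The matrix is hypothetical and not taken from the article's results.

```python
import numpy as np

# Hypothetical 3-class confusion matrix: rows = true class, columns = predicted class.
cm = np.array([
    [80, 15, 5],
    [20, 10, 0],
    [25, 5, 0],
])

true_pos = np.diag(cm).astype(float)
predicted_totals = cm.sum(axis=0)   # column sums: how often each class was predicted
actual_totals = cm.sum(axis=1)      # row sums: how often each class actually occurred

# Guard against division by zero when a class is never predicted or never present.
precision = np.divide(true_pos, predicted_totals,
                      out=np.zeros_like(true_pos), where=predicted_totals > 0)
recall = np.divide(true_pos, actual_totals,
                   out=np.zeros_like(true_pos), where=actual_totals > 0)
accuracy = true_pos.sum() / cm.sum()

print("precision:", precision.round(3))
print("recall:   ", recall.round(3))
print("accuracy: ", round(accuracy, 4))
```

In this toy matrix the overall accuracy looks acceptable while class 2 is never predicted correctly, which is the pattern accuracy alone hides on imbalanced data.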

Probability Prediction and Threshold Adjustment:

  • Probability Prediction: The model predicts probabilities for each class, enabling fine-grained decision-making.
  • Threshold Adjustment: A threshold of 0.3 is applied to the class 1 probability to produce a binary prediction (class 1 or not), adjusting the balance between sensitivity and specificity. The resulting y_pred is not used in the reports that follow, which rely on model.predict.
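
The thresholding in the code acts on class 1 only. One possible three-class variant is sketched below; it is an assumption for illustration rather than the article's method, and threshold_predict is a hypothetical helper name.

```python
import numpy as np


def threshold_predict(proba, threshold=0.3):
    """Pick class 1 or 2 when its probability exceeds the threshold, else class 0."""
    proba = np.asarray(proba)
    non_neutral = proba[:, 1:]                      # probabilities for classes 1 and 2
    best = non_neutral.argmax(axis=1) + 1           # stronger of the two signals
    confident = non_neutral.max(axis=1) > threshold
    return np.where(confident, best, 0)


# Hypothetical predict_proba output: columns are classes 0, 1, 2.
toy_proba = np.array([
    [0.80, 0.15, 0.05],
    [0.50, 0.35, 0.15],
    [0.45, 0.20, 0.35],
    [0.60, 0.25, 0.15],
])
print(threshold_predict(toy_proba))
```

Lowering the threshold produces more long and short signals at the cost of more false positives, so it is worth checking the effect on precision before trading on it.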

XGBClassifier:

  • XGBClassifier is a powerful machine learning algorithm from the XGBoost library designed for tree-based ensemble learning.
  • It performs well on a wide range of classification tasks, especially with highly structured data.
  • In this code, it’s used for multi-class classification with three possible target categories (0, 1, and 2).

Parameters Used:

  • objective="multi:softmax": Specifies the multi-class classification objective using the softmax function.
  • eval_metric="merror": Uses the multi-class classification error rate as the evaluation metric.
  • scale_pos_weight=class_weight_dict: Intended to apply class weights to the imbalanced data. Note that scale_pos_weight is designed for binary classification and expects a single number, so a dictionary passed here does not act as per-class weights in a multi-class model. Per-row weights passed through the sample_weight argument of fit are the usual route.
  • n_estimators=400: Sets the maximum number of boosting rounds to 400.
  • max_depth=3: Limits the maximum depth of each individual tree to 3 to prevent overfitting.
  • early_stopping_rounds=10: Stops training once the evaluation metric has not improved for 10 consecutive rounds.
  • gamma=0.2: Sets the minimum loss reduction required to make a further split on a leaf node; larger values make the model more conservative.
  • reg_alpha=0.5, reg_lambda=1.0: L1 and L2 regularization terms on the leaf weights, which penalize complex models and improve generalization.

Benefits of Parameters:

  • Addressing Class Imbalance: scale_pos_weight and the manually adjusted weights are meant to handle the imbalanced target distribution, although in a multi-class model the weights only take effect when supplied as sample weights.
  • Preventing Overfitting: max_depth, early_stopping_rounds, gamma, reg_alpha, and reg_lambda work together to prevent overfitting and improve model generalizability.
  • Tailoring the Model: Parameters like n_estimators and max_depth can be fine-tuned based on the specific dataset and task.

Additional Notes:

  • XGBClassifier offers many other parameters that can be explored for further optimization.
  • The chosen parameters represent a starting point and may require adjustments based on your specific data and objectives.
  • It’s crucial to evaluate different parameter combinations and compare their performance using cross-validation techniques.
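
For time-ordered data such as price bars, cross-validation folds should keep the training rows earlier than the validation rows. A minimal sketch with scikit-learn's TimeSeriesSplit follows; the arrays are toy stand-ins for X and y, and the number of splits is an arbitrary choice.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Toy data standing in for X and y (hypothetical values), ordered by time.
X_toy = np.arange(20).reshape(-1, 1)
y_toy = np.arange(20) % 3

tscv = TimeSeriesSplit(n_splits=4)
folds = []
for fold, (train_idx, test_idx) in enumerate(tscv.split(X_toy), start=1):
    folds.append((train_idx, test_idx))
    print(f"fold {fold}: train 0..{train_idx[-1]}, test {test_idx[0]}..{test_idx[-1]}")
```

Each fold trains on an expanding window of past rows and validates on the rows that follow, which mirrors the chronological 70/30 split used in the article.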

Key Points:

  • The code applies common techniques for handling class imbalance in multi-class classification: class weights, per-class evaluation and threshold adjustment.
  • XGBoost is a strong baseline for multi-class problems, but as the results below show, imbalanced data still needs explicit handling.
  • Confusion matrices and classification reports offer valuable insights into model performance and potential areas for improvement.
  • Adjusting probability thresholds can be a useful strategy for tailoring predictions to specific needs.
clear imbalance in results for recall, precision, f1 score
confusion matrix of train data
confusion matrix of test data

We can see a clear imbalance in the precision, recall and F1 scores of all 3 classes, and class 2 scored 0 on every metric. Let’s see whether we can improve the results.

y_pred_entire_dataset = model.predict(X)

# Evaluate the loaded model
print("Classification Report (Entire Dataset):")
print(classification_report(y, y_pred_entire_dataset))
print("Confusion Matrix (Entire Dataset):")
print(confusion_matrix(y, y_pred_entire_dataset))
cm_entire_dataset = confusion_matrix(y, y_pred_entire_dataset)
print("Confusion Matrix - Entire Dataset:")
ConfusionMatrixDisplay(cm_entire_dataset).plot()

Key Steps:

Predictions on Entire Dataset:

  • The code uses a previously trained model (model) to generate predictions on the entire dataset X.
  • These predictions are stored in y_pred_entire_dataset.

Evaluation on Entire Dataset:

  • The code evaluates the model’s performance on the entire dataset using three key metrics:
  • Classification Report: Provides detailed precision, recall, F1-score, and support for each class.
  • Confusion Matrix: Visualizes the model’s performance in a table format, highlighting correct and incorrect predictions for each class.
  • Confusion Matrix Plot: Creates a visual representation of the confusion matrix for easier interpretation.

Implications:

  • Comprehensive Evaluation: Evaluating on the entire dataset gives an overall picture, but since 70% of these rows were used for training, the figures are dominated by training performance and are not a substitute for the test set.
  • Overfitting Detection: If performance on the entire dataset (mostly training rows) is significantly better than on the test set, it suggests overfitting, indicating the model is too tailored to the training data.
  • Class Imbalance Insights: The classification report and confusion matrix reveal how well the model handles different classes, especially in imbalanced datasets.
  • Visualization: The confusion matrix plot aids in identifying patterns of misclassifications and potential areas for improvement.

As with the train and test data, the results for the entire dataset are almost the same. Next, we will re-sample the data and check whether the model improves.

Improving XGBClassifier using Re Sampling Methods

Understanding Class Imbalance in Time Series:

Common Occurrence: Imbalanced classes frequently arise in time series datasets, where certain events or patterns are less frequent than others.

Challenges:

  • Models often prioritize the majority class, leading to poor predictive performance on minority classes.
  • This can have significant consequences in domains like anomaly detection or fraud prevention, where identifying rare events is crucial.

RandomOverSampler to the Rescue:

  • Purpose: RandomOverSampler is a technique that addresses class imbalance by adding extra copies of existing minority class samples (it does not create synthetic ones).
  • Mechanism:
  • It randomly replicates existing minority class samples to increase their representation in the training data.
  • This helps the model learn more about the characteristics of minority classes, improving its ability to detect them accurately.
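
The idea can be sketched in a few lines of NumPy. This is a simplified illustration of random oversampling under the assumption that every smaller class is topped up to the size of the largest one; random_oversample is a hypothetical helper, not the imblearn implementation.

```python
import numpy as np


def random_oversample(X, y, seed=0):
    """Duplicate rows of smaller classes (with replacement) up to the largest class size."""
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X), np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    keep = [np.arange(len(y))]                       # all original rows stay in place
    for cls, count in zip(classes, counts):
        if count < target:
            pool = np.flatnonzero(y == cls)
            keep.append(rng.choice(pool, size=target - count, replace=True))
    idx = np.concatenate(keep)
    return X[idx], y[idx]


# Toy one-hot features and labels (hypothetical data): 6 zeros, 3 ones, 1 two.
X_toy = np.array([[1, 0, 0]] * 6 + [[0, 1, 0]] * 3 + [[0, 0, 1]] * 1)
y_toy = np.array([0] * 6 + [1] * 3 + [2] * 1)

X_res, y_res = random_oversample(X_toy, y_toy)
print(np.unique(y_res, return_counts=True))
```

No new information is created: the minority rows simply appear more often, which changes how much each class contributes to the training loss.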

Key Advantages in Time Series:

  • Preserves Temporal Structure: It leaves the original rows and their order untouched and only adds duplicates. Applied to the training split alone, it does not move any test-period data into training.
  • Straightforward Implementation: It’s relatively simple to integrate into machine learning pipelines.
  • Effectiveness for Common Imbalance Scenarios: It often works well for time series datasets where minority classes are underrepresented but not extremely rare.

Cautions and Considerations:

  • Overfitting Risk: Oversampling can increase the risk of overfitting, as it introduces duplicate samples. Employ regularization techniques and cross-validation to mitigate this risk.
  • Alternative Techniques: For more complex class distributions or rare minority classes, consider other techniques such as:
  • SMOTE (Synthetic Minority Oversampling Technique): Generates synthetic samples along the lines connecting minority class instances.
  • ADASYN (Adaptive Synthetic Sampling): Prioritizes generating synthetic samples for harder-to-learn minority class instances.

Undersampling Majority Class: Balancing oversampling with undersampling the majority class can be a viable strategy in some cases.

Remember: The optimal approach depends on the specific dataset and problem characteristics. Experiment with different techniques and meticulously evaluate their impact to achieve the best results in handling class imbalance in time series data.

from imblearn.over_sampling import RandomOverSampler

# Oversample minority classes
ros = RandomOverSampler(sampling_strategy='not majority')
X_train_resampled, y_train_resampled = ros.fit_resample(X_train, y_train)

# Train the model on the resampled data
model.fit(X_train_resampled, y_train_resampled, eval_set=[(X_test, y_test)], verbose=False)


pred_train = model.predict(X_train_resampled)
pred_test = model.predict(X_test)

acc_train = accuracy_score(y_train_resampled, pred_train)
acc_test = accuracy_score(y_test, pred_test)
print('****Train Results****')
print("Accuracy: {:.4%}".format(acc_train))
print('****Test Results****')
print("Accuracy: {:.4%}".format(acc_test))

# Predict probabilities
y_prob = model.predict_proba(X_test)

# Adjust the threshold
threshold = 0.3 # Adjust this threshold as needed
y_pred = (y_prob[:,1] > threshold).astype(int)

matrix_train = confusion_matrix(y_train_resampled, pred_train)
matrix_test = confusion_matrix(y_test, pred_test)

print(matrix_train)
print(matrix_test)

report_train = classification_report(y_train_resampled, pred_train)
report_test = classification_report(y_test, pred_test)

print(report_train)
print(report_test)
print(model.get_booster().feature_names)


# Training Confusion Matrix
cm_train = confusion_matrix(y_train_resampled, pred_train)
ConfusionMatrixDisplay(cm_train).plot()

# Test Confusion Matrix
cm_test = confusion_matrix(y_test, pred_test)
ConfusionMatrixDisplay(cm_test).plot()

Oversampling Minority Classes:

  • The code imports RandomOverSampler from the imblearn library to address class imbalance.
  • It creates an instance of RandomOverSampler with a strategy to oversample minority classes (not the majority class).
  • It applies oversampling to the training data (X_train, y_train) to create a more balanced distribution of classes.

Model Training:

  • The XGBoost model is trained on the resampled training data (X_train_resampled, y_train_resampled).

Predictions and Evaluation:

  • The model makes predictions on both the resampled training set and the original test set.
  • Accuracy scores, confusion matrices, and classification reports are calculated for both sets to assess performance.
  • Confusion matrices are visually plotted for better understanding.

Implications of Oversampling:

  • Improved Handling of Imbalanced Data: Oversampling increases representation of minority classes, assisting the model in learning their patterns more effectively.
  • Potential Positive Impact on Performance: It can often lead to better overall performance, especially for minority classes.
  • Potential for Overfitting: Oversampling can create duplicates of minority class instances, potentially increasing the risk of overfitting. Regularization techniques and careful evaluation are essential.

Additional Considerations:

  • Explore Other Oversampling Techniques: Consider alternative techniques like SMOTE or ADASYN, which generate synthetic minority class samples for potentially better results.
  • Combine Oversampling with Undersampling: Balancing oversampling with undersampling the majority class can sometimes yield better performance.
  • Tailor Approach to Specific Problem: The best approach to handling class imbalance depends on the dataset and problem characteristics. Experiment with different techniques and evaluate their impact carefully.
train data confusion matrix
test data confusion matrix

From the above, we can see a small improvement in the results after re-sampling. Let’s check another re-sampling technique, SMOTE.

Re-sample with SMOTE on XGBoost Classifier

from imblearn.over_sampling import SMOTE

smote = SMOTE(sampling_strategy='not majority')
X_train_resampled, y_train_resampled = smote.fit_resample(X_train_resampled, y_train_resampled)

# Train the model on the resampled data
model.fit(X_train_resampled, y_train_resampled, eval_set=[(X_test, y_test)], verbose=False)

pred_train = model.predict(X_train_resampled)
pred_test = model.predict(X_test)

acc_train = accuracy_score(y_train_resampled, pred_train)
acc_test = accuracy_score(y_test, pred_test)
print('****Train Results****')
print("Accuracy: {:.4%}".format(acc_train))
print('****Test Results****')
print("Accuracy: {:.4%}".format(acc_test))

# Predict probabilities
y_prob = model.predict_proba(X_test)

# Adjust the threshold
threshold = 0.3 # Adjust this threshold as needed
y_pred = (y_prob[:,1] > threshold).astype(int)

matrix_train = confusion_matrix(y_train_resampled, pred_train)
matrix_test = confusion_matrix(y_test, pred_test)

print(matrix_train)
print(matrix_test)

report_train = classification_report(y_train_resampled, pred_train)
report_test = classification_report(y_test, pred_test)

print(report_train)
print(report_test)
print(model.get_booster().feature_names)


# Training Confusion Matrix
cm_train = confusion_matrix(y_train_resampled, pred_train)
ConfusionMatrixDisplay(cm_train).plot()


# Test Confusion Matrix
cm_test = confusion_matrix(y_test, pred_test)
ConfusionMatrixDisplay(cm_test).plot()

Oversampling with SMOTE:

  • The code imports SMOTE from the imblearn library for oversampling.
  • It creates an instance of SMOTE with the strategy to oversample minority classes (not the majority class).
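  • It applies SMOTE to the previously oversampled training data (X_train_resampled, y_train_resampled). Since RandomOverSampler has already balanced these classes, SMOTE with sampling_strategy='not majority' has no minority samples left to generate.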

Model Training and Evaluation:

  • The XGBoost model is trained on the SMOTE-resampled data.
  • Predictions are made on both the resampled training set and the original test set.
  • Accuracy scores, confusion matrices, and classification reports are calculated to assess performance.
  • Confusion matrices are visually plotted for better understanding.

Implications of Using SMOTE:

  • Synthetic Minority Oversampling: SMOTE creates synthetic samples for minority classes, potentially enhancing model performance, especially for rare events.

Advantages over Random Sampling:

  • Generates more diverse and informative synthetic samples by interpolating between existing minority class instances.
  • Can lead to better generalization and ability to detect minority classes in unseen data.
  • Potential for Overfitting: As with any oversampling technique, SMOTE can increase overfitting risk. Use regularization and careful evaluation.
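
The interpolation step behind SMOTE can be sketched as follows. This is a simplified illustration that creates a single synthetic point, not the imblearn implementation; smote_like_sample and its parameters are hypothetical names.

```python
import numpy as np


def smote_like_sample(X_minority, k=2, seed=0):
    """Create one synthetic point between a random minority row and one of its k nearest neighbours."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    i = rng.integers(len(X_minority))
    dists = np.linalg.norm(X_minority - X_minority[i], axis=1)
    dists[i] = np.inf                                # exclude the point itself
    neighbours = np.argsort(dists)[:k]
    j = rng.choice(neighbours)
    lam = rng.random()                               # interpolation factor in [0, 1)
    return X_minority[i] + lam * (X_minority[j] - X_minority[i])


# Toy minority-class rows (hypothetical data).
X_min = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.5], [3.0, 3.0]])
new_point = smote_like_sample(X_min)
print(new_point)
```

The synthetic point always lies on the segment between two existing minority rows, which is why SMOTE adds variety only when those rows actually differ from each other.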

SMOTE gave the same results as RandomOverSampler. This is expected: SMOTE was applied to training data that was already balanced, so it had nothing left to oversample.

When we run the model on the entire dataset to check for any visible improvement, we get the results below.

We can observe a slight improvement compared to the plain classification run. Let’s try another sampling technique.

TomekLinks with XGBoost Classifier

from imblearn.under_sampling import TomekLinks

tl = TomekLinks(sampling_strategy='not majority')
X_train_undersampled, y_train_undersampled = tl.fit_resample(X_train_resampled, y_train_resampled)


model.fit(X_train_undersampled, y_train_undersampled, eval_set=[(X_test, y_test)], verbose=False)

pred_train = model.predict(X_train_undersampled)
pred_test = model.predict(X_test)

acc_train = accuracy_score(y_train_undersampled, pred_train)
acc_test = accuracy_score(y_test, pred_test)
print('****Train Results****')
print("Accuracy: {:.4%}".format(acc_train))
print('****Test Results****')
print("Accuracy: {:.4%}".format(acc_test))

# Predict probabilities
y_prob = model.predict_proba(X_test)

# Adjust the threshold
threshold = 0.2 # Adjust this threshold as needed
y_pred = (y_prob[:,1] > threshold).astype(int)

matrix_train = confusion_matrix(y_train_undersampled, pred_train)
matrix_test = confusion_matrix(y_test, pred_test)

print(matrix_train)
print(matrix_test)

report_train = classification_report(y_train_undersampled, pred_train)
report_test = classification_report(y_test, pred_test)

print(report_train)
print(report_test)
print(model.get_booster().feature_names)

# Training Confusion Matrix
cm_train = confusion_matrix(y_train_undersampled, pred_train)
ConfusionMatrixDisplay(cm_train).plot()


# Test Confusion Matrix
cm_test = confusion_matrix(y_test, pred_test)
ConfusionMatrixDisplay(cm_test).plot()

Undersampling with TomekLinks:

  • The code imports TomekLinks from imblearn for undersampling.
  • It creates an instance of TomekLinks with sampling_strategy='not majority', which removes samples from every class except the majority class (the library default, 'auto', targets every class except the minority instead).
  • It applies TomekLinks to the previously oversampled training data (X_train_resampled, y_train_resampled).

Model Training and Evaluation:

  • The XGBoost model is trained on the TomekLinks-undersampled data.
  • Predictions are made on both the undersampled training set and the original test set.
  • Accuracy scores, confusion matrices, and classification reports are calculated to assess performance.
  • Confusion matrices are visually plotted for better understanding.

Implications of Using TomekLinks:

  • Undersampling for Noise Reduction: TomekLinks focuses on removing potentially noisy or borderline majority class instances that are close to minority class samples.

Mechanism:

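  • Identifies Tomek links, which are pairs of instances from different classes that are each other’s nearest neighbors.
  • Removes instances from these pairs, typically those of the majority class, to create a potentially cleaner dataset. With sampling_strategy='not majority', as used here, the non-majority members are removed instead.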
  • Complementing Oversampling: It often complements oversampling techniques by reducing noise and enhancing the focus on informative minority class samples.
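
The definition of a Tomek link can be made concrete with a brute-force search. The sketch below is an illustration on toy data, not the imblearn implementation; find_tomek_links is a hypothetical helper name.

```python
import numpy as np


def find_tomek_links(X, y):
    """Return index pairs (i, j) that are mutual nearest neighbours with different labels."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)                  # a point is not its own neighbour
    nearest = dists.argmin(axis=1)
    links = []
    for i, j in enumerate(nearest):
        if i < j and nearest[j] == i and y[i] != y[j]:
            links.append((i, int(j)))
    return links


# Toy one-dimensional data (hypothetical): points 2 and 3 sit on the class border.
X_toy = np.array([[0.0], [0.1], [1.0], [1.05], [3.0]])
y_toy = np.array([0, 0, 0, 1, 1])
print(find_tomek_links(X_toy, y_toy))
```

Which member of each pair is dropped depends on the sampling strategy; the search itself only identifies the border pairs.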

So, after working with 3 re-sampling techniques, we got the same results for all 3 on the XGBoost classifier. With only three distinct input rows possible (the one-hot encoded trend), re-sampling can change the class proportions the model sees but cannot add new information.

The output for the overall dataset was also the same, so I’m not pasting it here.

Next, we will try different prediction models and see whether the results improve. For these new models we will use SMOTE re-sampling throughout, as we have already observed very poor results for classes 1 and 2 when running without re-sampling.

Trying with Different Prediction Models (CatBoost, LGBM, SVC)

With CatBoostClassifier

CatBoostClassifier is the classifier class of CatBoost, a library for gradient boosting on decision trees with built-in support for categorical features.

It offers several advantages over traditional gradient boosting setups such as XGBClassifier, including:

  • Automatic handling of categorical features: CatBoostClassifier handles categorical features natively (via the cat_features argument) without the need for one-hot encoding, which can improve performance and efficiency.
  • Ordered boosting: CatBoost can train with a permutation-based scheme in which each row’s target statistics and gradient estimates are computed only from rows that precede it in a random ordering. This reduces target leakage and the overfitting it causes, especially on smaller datasets.
  • Imbalanced class handling: CatBoostClassifier has built-in options for imbalanced class problems, such as the class_weights and auto_class_weights parameters.
  • Regularization: CatBoostClassifier offers regularization options to prevent overfitting, such as L2 regularization of leaf values (l2_leaf_reg).

Key Differences from XGBoostClassifier:

  • Handling of categorical features: CatBoostClassifier handles categorical features natively, while XGBClassifier has traditionally relied on encodings such as one-hot (recent XGBoost versions add enable_categorical support).
  • Ordered boosting: CatBoostClassifier supports ordered boosting, while XGBClassifier does not.
  • Imbalanced class handling: CatBoostClassifier accepts per-class weights directly (class_weights), while XGBClassifier needs them supplied as per-row sample weights for multi-class problems.
  • Regularization: Both expose regularization settings: l2_leaf_reg in CatBoost; reg_alpha, reg_lambda and gamma in XGBoost.

Use Cases:

  • CatBoostClassifier is well-suited for tasks involving categorical features, imbalanced datasets, and ordered data.
  • It is popular in various applications, including:
  • Text classification
  • Image classification
  • Customer churn prediction
  • Fraud detection
  • Time series forecasting
  • Natural language processing

from catboost import CatBoostClassifier

# from imblearn.over_sampling import RandomOverSampler

from imblearn.over_sampling import SMOTE

smote = SMOTE(sampling_strategy='not majority')
# from imblearn.under_sampling import TomekLinks

# tl = TomekLinks(sampling_strategy='not majority')
X_train_resampled, y_train_resampled = smote.fit_resample(X_train_resampled, y_train_resampled)

# Class weights (assuming class 0 is the majority class)
class_weights = {0: 1, 1: 5, 2: 5} # Adjust weights as needed

# CatBoost model with class weights
model = CatBoostClassifier(loss_function="MultiClass", class_weights=class_weights)

# Train the model on the resampled data
model.fit(X_train_resampled, y_train_resampled, eval_set=[(X_test, y_test)], verbose=False)

pred_train = model.predict(X_train_resampled)
pred_test = model.predict(X_test)

acc_train = accuracy_score(y_train_resampled, pred_train)
acc_test = accuracy_score(y_test, pred_test)
print('****Train Results****')
print("Accuracy: {:.4%}".format(acc_train))
print('****Test Results****')
print("Accuracy: {:.4%}".format(acc_test))

# Predict probabilities
y_prob = model.predict_proba(X_test)

# Adjust the threshold
threshold = 0.3 # Adjust this threshold as needed
y_pred = (y_prob[:,1] > threshold).astype(int)

# Evaluate the model
print(classification_report(y_test, y_pred))


matrix_train = confusion_matrix(y_train_resampled, pred_train)
matrix_test = confusion_matrix(y_test, pred_test)

print(matrix_train)
print(matrix_test)

report_train = classification_report(y_train_resampled, pred_train)
report_test = classification_report(y_test, pred_test)

print(report_train)
print(report_test)
# print(model.get_booster().feature_names)


y_pred_entire_dataset = model.predict(X)

# Evaluate the loaded model
print("Classification Report (Entire Dataset):")
print(classification_report(y, y_pred_entire_dataset))
print("Confusion Matrix (Entire Dataset):")
print(confusion_matrix(y, y_pred_entire_dataset))
cm_entire_dataset = confusion_matrix(y, y_pred_entire_dataset)
print("Confusion Matrix - Entire Dataset:")
ConfusionMatrixDisplay(cm_entire_dataset).plot()

The code implements a multi-class classification model using CatBoost with the following key steps:

1. Addressing Class Imbalance:

  • SMOTE oversampling: The code first employs SMOTE to oversample minority classes in the training data (X_train_resampled, y_train_resampled). This helps balance the representation of all classes, potentially improving the model's ability to learn patterns from less frequent categories.

2. Incorporating Class Weights:

  • Assigning weights: Class weights are assigned to each class (class_weights), emphasizing the importance of minority classes (1 and 2 in this case). This encourages the model to focus on accurate predictions for those classes during training.

3. Building the CatBoost Model:

  • Model instantiation: A CatBoost multi-class classification model is created (model) with the specified loss function ("MultiClass").
  • Class weights integration: The model incorporates the defined class weights during training.

4. Training and Evaluation:

  • Training: The model is trained on the oversampled and weighted training data.
  • Evaluation: Model performance is assessed on both the training set (X_train_resampled, y_train_resampled) and the test set (X_test, y_test). Accuracy scores are calculated and printed.

5. Generating Predictions and Evaluating Class-Wise Performance:

  • Predictions: Class probabilities are predicted for the test set (y_prob).
  • Thresholding: A threshold is applied to the class 1 probability, giving a binary class-1-versus-rest prediction (y_pred).
  • Classification report: A detailed classification report (classification_report) is generated, providing metrics like precision, recall, F1-score, and support for each class, offering insights into class-wise performance.

6. Generating Confusion Matrices:

  • Confusion matrices: The code creates confusion matrices for the train, test, and entire datasets, visualizing the distribution of correctly and incorrectly predicted instances for each class.

7. Additional Analysis:

  • Predictions on entire dataset: Predictions are made on the entire dataset (X) and evaluated using a classification report and confusion matrix.
  • Confusion matrix visualization: The confusion matrix for the entire dataset is plotted for further analysis.

Overall, the provided code demonstrates a structured approach to multi-class classification, addressing class imbalance through oversampling and class weights. By analyzing the generated outputs (accuracy scores, confusion matrices, classification reports), you can gain valuable insights into model performance, identify potential areas for improvement, and refine the model based on your specific goals and dataset characteristics.
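
To make the link between a confusion matrix and the numbers in a classification report concrete, here is a minimal sketch that derives per-class precision, recall, and overall accuracy by hand. The helper name per_class_metrics and the matrix values are hypothetical and are not the results of this article.

```python
import numpy as np

def per_class_metrics(cm):
    """Return per-class precision, per-class recall and overall accuracy.

    Rows are true classes and columns are predicted classes, which is the
    layout returned by sklearn.metrics.confusion_matrix.
    """
    cm = np.asarray(cm, dtype=float)
    true_pos = np.diag(cm)
    predicted_totals = cm.sum(axis=0)  # column sums: how often each class was predicted
    actual_totals = cm.sum(axis=1)     # row sums: how often each class really occurs
    # Leave the metric at 0 for classes that were never predicted or never occur
    precision = np.divide(true_pos, predicted_totals,
                          out=np.zeros_like(true_pos), where=predicted_totals > 0)
    recall = np.divide(true_pos, actual_totals,
                       out=np.zeros_like(true_pos), where=actual_totals > 0)
    accuracy = true_pos.sum() / cm.sum()
    return precision, recall, accuracy

# Hypothetical 3-class matrix (0 = neutral, 1 and 2 = trade signals)
example_cm = [[50, 30, 20],
              [5, 10, 5],
              [4, 6, 10]]
precision, recall, accuracy = per_class_metrics(example_cm)
print("precision:", precision)
print("recall:", recall)
print("accuracy:", accuracy)
```

In this made-up example every class has a recall of 50%, yet precision for classes 1 and 2 is far lower than for class 0, because most predictions of 1 and 2 are really neutral rows. That is the same pattern of false positives discussed in this article.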

On the train and test data, accuracy is around 34% and 21%, which is too low. The model now has a higher chance of identifying true positives for 1’s and 2’s, but it also produces many false positives (alongside many true negatives), which makes it unsuitable for real-time trading. From an ML point of view, though, improving performance on the smaller classes is good progress. Next, we will add class_weights for the XGBoost classifier too and see whether accuracy, precision, recall, or F1 score increases.

For the entire dataset, accuracy is 21% (weighted avg. from the classification report). Precision for 1’s and 2’s is 26% and 21%; recall is poor for 1’s and good for 2’s. Overall, although performance on classes 1 and 2 has improved, the model is not suitable for real-time trading. In the coming article, I will explain with practical reasons why this particular trading setup is not suitable, and how we can identify the best trading setup using the same code.

SVC and Time Series Data: A Closer Look (not suitable for time series forecasting but executed for reference purpose)

While Support Vector Machines (SVMs) are primarily used for classification tasks in various domains, including image recognition and text analysis, their application in time series data analysis requires careful consideration. Here’s a breakdown of using SVCs with time series data, their advantages, and potential limitations:

Using SVCs for Time Series Classification:

  • Suitable Scenarios: SVCs can be effective for specific time series classification tasks, such as:
      • Anomaly detection: Identifying unusual patterns deviating from the regular time series behavior.
      • Predicting categorical events: Classifying future data points into predefined categories based on historical patterns.
      • Segmenting time series: Grouping similar sections of the data based on characteristics.
  • Data Representation: SVCs typically require fixed-length input vectors. Time series data, however, is inherently sequential. To use SVCs, you need to represent the time series as feature vectors that capture relevant information. This can involve:
      • Feature engineering: Extracting statistical features (e.g., mean, standard deviation) from sliding windows along the time series.
      • Dynamic Time Warping (DTW): Measuring similarity between time series based on their shapes, even if they have different lengths.
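
As an illustration of the sliding-window feature engineering mentioned above, the following sketch turns a 1-D series into fixed-length rows that a classifier such as SVC could consume. The function name window_features, the choice of three statistics, and the prices are assumptions made for this example only.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def window_features(series, window):
    """Turn a 1-D series into rows of (mean, std, last - first).

    Row i summarises series[i : i + window], so each row only uses
    values up to and including its own last time step.
    """
    values = np.asarray(series, dtype=float)
    windows = sliding_window_view(values, window_shape=window)
    return np.column_stack([
        windows.mean(axis=1),            # average level inside the window
        windows.std(axis=1),             # volatility inside the window
        windows[:, -1] - windows[:, 0],  # net change across the window
    ])

prices = [10.0, 11.0, 12.0, 11.0, 13.0]  # hypothetical closing prices
features = window_features(prices, window=3)
print(features.shape)
print(features)
```

Five prices with a window of 3 give three feature rows. The label for each row should come from a time step after the window ends, otherwise the features leak the answer.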

Advantages of SVCs for Time Series Data:

  • Interpretability: With a linear kernel, the learned coefficients indicate which features contribute most to the classification. Non-linear kernels such as RBF do not expose per-feature weights directly.
  • Efficiency: The decision function depends only on the support vectors, so prediction can be cheap when few training points end up as support vectors. Training cost, however, grows quickly with the number of samples.
  • Multi-class classification: SVCs can handle problems with multiple output classes, which can be relevant for certain time series classification tasks.

Limitations and Considerations:

  • Feature Engineering Dependence: The performance of SVCs heavily relies on the quality of the chosen features. Careful feature engineering is crucial for capturing relevant information and avoiding the “curse of dimensionality.”
  • Kernel Selection: Choosing the right kernel function significantly impacts performance. The RBF kernel is a common choice, but experimentation with others like linear or polynomial kernels might be necessary.
  • Handling Long Sequences: Representing very long time series as fixed-length vectors can be challenging and lead to information loss. Segmentation or dimensionality reduction techniques might be needed.
  • Not Ideal for Forecasting: SVCs are primarily for classification, not directly for predicting future values in a time series. Consider other models like ARIMA, Prophet, or LSTMs for forecasting tasks.

In conclusion, while SVCs have potential applications in specific time series classification tasks, carefully evaluate their suitability based on your data characteristics, desired outcomes, and potential limitations.

from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, ConfusionMatrixDisplay


model = SVC(kernel="rbf", decision_function_shape="ovr", probability=True) # Consider other kernels like "poly"

X_train_resampled, y_train_resampled = (X_train, y_train)

model.fit(X_train_resampled, y_train_resampled)

pred_train = model.predict(X_train_resampled)
pred_test = model.predict(X_test)

acc_train = accuracy_score(y_train_resampled, pred_train)
acc_test = accuracy_score(y_test, pred_test)
print('****Train Results****')
print("Accuracy: {:.4%}".format(acc_train))
print('****Test Results****')
print("Accuracy: {:.4%}".format(acc_test))

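# Predict class probabilities (one column per class)
y_prob = model.predict_proba(X_test)

# Threshold the class 1 probability to get a class-1-versus-rest prediction
threshold = 0.3 # Adjust this threshold as needed
y_pred = (y_prob[:, 1] > threshold).astype(int)

# y_pred is binary, so compare it against a binary version of the labels
print(classification_report((y_test == 1).astype(int), y_pred))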


matrix_train = confusion_matrix(y_train_resampled, pred_train)
matrix_test = confusion_matrix(y_test, pred_test)

print(matrix_train)
print(matrix_test)

report_train = classification_report(y_train_resampled, pred_train)
report_test = classification_report(y_test, pred_test)

print(report_train)
print(report_test)

# Training Confusion Matrix
cm_train = confusion_matrix(y_train_resampled, pred_train)
ConfusionMatrixDisplay(cm_train).plot()

# Test Confusion Matrix
cm_test = confusion_matrix(y_test, pred_test)
ConfusionMatrixDisplay(cm_test).plot()

Here’s a breakdown of the code and key points to consider:

1. Model Implementation:

  • SVC from sklearn: The code constructs a multi-class SVM model using the SVC class from scikit-learn.
  • Kernel and Decision Function: It employs an RBF kernel (kernel="rbf"). decision_function_shape="ovr" returns a decision function in one-vs-rest shape, although scikit-learn’s SVC trains one-vs-one classifiers internally for multi-class problems.
  • Probability Calculation: probability=True enables probability prediction for further thresholding.

2. Training:

  • No Oversampling: The code doesn’t apply any oversampling techniques to address class imbalance.
  • Model Fitting: The model is trained on the original training data (X_train, y_train).

3. Evaluation:

  • Accuracy Calculation: Accuracy scores are computed for both train and test sets.
  • Probability Prediction: Class probabilities are predicted on the test set.
  • Thresholding: A threshold of 0.3 is applied to the class 1 probability, giving a binary class-1-versus-rest prediction.
  • Performance Metrics: Classification reports and confusion matrices are generated for both train and test sets, offering insights into class-wise performance.
  • Visualization: Confusion matrices for both train and test sets are plotted for visual analysis.

Key Points:

  • Kernel Choice: The RBF kernel is a common choice for SVMs, but experimentation with other kernels (e.g., “poly”) might be beneficial.
  • Class Imbalance: The lack of oversampling could impact minority class performance, especially if the dataset is highly imbalanced.
  • Thresholding: Adjusting the threshold might influence results, so consider optimizing it based on your specific goals.
  • Feature Importance: Not calculated here. A linear-kernel SVM exposes coefficients (coef_) that can be read as feature weights; for the RBF kernel used here, a model-agnostic method such as permutation importance would be needed.
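
The thresholding step in the code only looks at the class 1 probability. One possible way to keep all three classes while still lowering the bar for trade signals is sketched below; it is an alternative to the article's approach, not what the code above does, and the function name predict_with_threshold and the probabilities are hypothetical.

```python
import numpy as np

def predict_with_threshold(proba, threshold=0.3, neutral_class=0):
    """Pick the most likely non-neutral class if its probability exceeds
    `threshold`; otherwise fall back to the neutral class.

    `proba` has one row per sample and one column per class, as returned
    by predict_proba.
    """
    proba = np.asarray(proba, dtype=float)
    signal_proba = proba.copy()
    signal_proba[:, neutral_class] = -np.inf  # exclude neutral from the argmax
    best_signal = signal_proba.argmax(axis=1)
    best_signal_proba = proba[np.arange(len(proba)), best_signal]
    return np.where(best_signal_proba > threshold, best_signal, neutral_class)

# Hypothetical probabilities for classes 0, 1, 2
proba = [[0.80, 0.15, 0.05],
         [0.50, 0.35, 0.15],
         [0.30, 0.25, 0.45]]
print(predict_with_threshold(proba, threshold=0.3))
```

Because the output still contains the labels 0, 1 and 2, it can be passed to classification_report together with the original y_test without converting anything to binary.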

From the results, we got accuracy of 47% and 87% on the train and test data, but that is mostly because the model identifies a lot of 0’s, which are not useful for our actual prediction. The 0’s are neutral values, and predicting them won’t make us any money in real time. Class 1 has 0 precision and recall, which is not at all ideal. SVC is generally not used for time series forecasting; I have run it as an example to compare it with the boosting algorithms that we have.

I’m not going to run this on the entire dataset, as it is clearly visible that the model did not perform well at all for classes 1 and 2.

LGBMClassifier and Time Series Forecasting:

While LGBMClassifier primarily serves for classification tasks, it’s not directly intended for time series forecasting. However, with suitable adaptations and considerations, it can be applied to specific forecasting problems related to categorical events within time series data. Here’s an overview:

  • Core Functionality: This classifier, based on Gradient Boosting Decision Trees (GBDTs), excels at identifying relationships between features and predicting class labels.
  • Forecasting Adaptation: For forecasting categorical events (e.g., predicting stock price movement being up, down, or neutral), you can frame the problem as a classification task:
      • Define categories for the future event (e.g., price increase, decrease, or stay stable).
      • Use historical data (features) to predict the category for future time points.
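
A minimal sketch of how such categories could be derived from prices is shown below. The mapping (1 = up, 2 = down, 0 = neutral), the 1% band, the one-bar horizon, and the function name make_direction_labels are assumptions for illustration; they are not necessarily how the labels in this article were built.

```python
import numpy as np

def make_direction_labels(close, horizon=1, band=0.01):
    """Label each bar by its forward return over `horizon` bars:
    1 = up by more than `band`, 2 = down by more than `band`, 0 = neutral.

    The last `horizon` bars have no future price and get no label.
    """
    close = np.asarray(close, dtype=float)
    forward_return = close[horizon:] / close[:-horizon] - 1.0
    labels = np.zeros(len(forward_return), dtype=int)
    labels[forward_return > band] = 1
    labels[forward_return < -band] = 2
    return labels

close = [100.0, 102.0, 101.9, 99.0, 99.5]  # hypothetical prices
labels = make_direction_labels(close, horizon=1, band=0.01)
print(labels)
```

The feature rows must be trimmed by the same horizon so that each row lines up with the label computed from its own future price.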

Advantages in Specific Scenarios:

  • Handling categorical variables: If your target variable in the time series represents discrete categories, LGBMClassifier can be suitable for predicting those categories based on past observations.
  • Interpretability: Similar to other GBDT models, LGBMClassifier offers feature importance insights, helping you understand which factors most influence predictions.
  • Efficiency: For large datasets, LGBMClassifier’s parallelization capabilities can enable faster training and prediction compared to some other models.

Limitations and Considerations:

  • Not Direct Forecasting: Remember, LGBMClassifier isn’t designed for numerical forecasting directly. It predicts categories, not specific future values.
  • Feature Engineering: Carefully choosing features that capture relevant historical information is crucial for successful prediction. Consider using lagged values, technical indicators, or other time series-specific features.
  • Model Complexity: GBDT models can become complex, potentially leading to overfitting. Regularization techniques and careful hyperparameter tuning are essential.
  • Category Definition: Defining meaningful categories for the target variable and ensuring sufficient data within each category are crucial for model performance.

Alternative Models for Time Series Forecasting:

For direct numerical forecasting of continuous values in time series, consider other models such as:

  • ARIMA: Suitable for series that are stationary or can be made stationary by differencing; the seasonal variant (SARIMA) handles seasonality.
  • Prophet: Effective for forecasting general time series trends and seasonality.
  • LSTMs: Powerful deep learning models capable of capturing complex non-linear relationships in time series data.

In conclusion, while LGBMClassifier has niche applications in forecasting categorical events within time series, carefully consider its limitations and compare it with other models designed specifically for numerical time series forecasting.

from lightgbm import LGBMClassifier

# Class weights (assuming class 0 is the majority class)
class_weights = {0: 1, 1: 5, 2: 5} # Adjust weights as needed

model = LGBMClassifier(
    objective="softmax", # Multi-class classification
    num_class=3, # Number of classes
    n_estimators=3500, # Large number of trees
    learning_rate=0.05, # Low learning rate
    num_leaves=20, # Upper bound on leaves; max_depth=3 already caps a tree at 8 leaves
    max_depth=3, # Limit tree depth
    min_data_in_leaf=3000, # High minimum number of samples per leaf
    feature_fraction=0.8, # Randomly select 80% of features per tree
    bagging_fraction=0.7, # Randomly select 70% of rows; only takes effect when bagging_freq > 0
    reg_alpha=0.1, # L1 regularization
    reg_lambda=0.2, # L2 regularization
    class_weight=class_weights
)

from imblearn.over_sampling import SMOTE

smote = SMOTE(sampling_strategy='not majority')
# from imblearn.under_sampling import TomekLinks

# tl = TomekLinks(sampling_strategy='not majority')
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

model.fit(X_train_resampled, y_train_resampled, eval_set=[(X_test, y_test)])

pred_train = model.predict(X_train_resampled)
pred_test = model.predict(X_test)

acc_train = accuracy_score(y_train_resampled, pred_train)
acc_test = accuracy_score(y_test, pred_test)
print('****Train Results****')
print("Accuracy: {:.4%}".format(acc_train))
print('****Test Results****')
print("Accuracy: {:.4%}".format(acc_test))

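# Predict class probabilities (one column per class)
y_prob = model.predict_proba(X_test)

# Threshold the class 1 probability to get a class-1-versus-rest prediction
threshold = 0.3 # Adjust this threshold as needed
y_pred = (y_prob[:, 1] > threshold).astype(int)

# y_pred is binary, so compare it against a binary version of the labels
print(classification_report((y_test == 1).astype(int), y_pred))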


matrix_train = confusion_matrix(y_train_resampled, pred_train)
matrix_test = confusion_matrix(y_test, pred_test)

print(matrix_train)
print(matrix_test)

report_train = classification_report(y_train_resampled, pred_train)
report_test = classification_report(y_test, pred_test)

print(report_train)
print(report_test)

# Training Confusion Matrix
cm_train = confusion_matrix(y_train_resampled, pred_train)
ConfusionMatrixDisplay(cm_train).plot()

# Test Confusion Matrix
cm_test = confusion_matrix(y_test, pred_test)
ConfusionMatrixDisplay(cm_test).plot()

Here’s a breakdown of the code’s key aspects and considerations:

1. Model Setup:

  • LGBMClassifier: Employs the LightGBM library for multi-class classification.
  • Hyperparameters: Configures model settings to potentially improve performance:
      • objective="softmax" for multi-class.
      • num_class=3 for three classes.
      • A large n_estimators, a low learning_rate, capped num_leaves and max_depth, a high min_data_in_leaf, feature/bagging fractions, L1/L2 regularization, and class weights.

2. Oversampling:

  • SMOTE: Addresses potential class imbalance with SMOTE oversampling, creating synthetic samples for minority classes.
  • Sampling Strategy: Focuses on oversampling minority classes ('not majority').

3. Model Training:

  • Fitting: Trains the model on oversampled training data (X_train_resampled, y_train_resampled).
  • Evaluation: Uses a test set (X_test, y_test) for evaluation during training.

4. Prediction and Evaluation:

  • Predictions: Generates predictions on both training and test sets.
  • Accuracy: Calculates accuracy scores for both sets.
  • Probability Prediction: Obtains class probabilities for test set examples.
  • Thresholding: Applies a threshold (0.3) to the class 1 probability, giving a binary class-1-versus-rest prediction.
  • Classification Reports: Produces detailed performance metrics for both sets.
  • Confusion Matrices: Creates and visualizes confusion matrices for both sets, aiding in error analysis.

[Figure: Train and test confusion matrices before adding class_weight]

[Figure: Train and test confusion matrices after adding class_weights]

[Figure: Classification report for train and test data for LGBMClassifier after adding class_weights]

Here is the classification report for train and test data for LGBMClassifier after adding class_weights. It gives results similar to what we got for CatBoostClassifier after adding weights, so there is not much difference between CatBoost and LGBM. Tuning hyperparameters may help improve the result; next, we will hyper-optimize the XGBoost classifier and compare the results with the previous ones.

[Figure: LGBM on the entire dataset]

After running on the entire dataset, we got around 21% accuracy, with more true negatives and false positives.

This is the same result we got for CatBoostClassifier. Let’s hyper-optimize the model and check whether the results improve further.

Prediction of Our Data — Hyper-optimizing XGBoost Classifier (For better accuracy)

from xgboost import XGBClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import precision_score

# Select the type of model to optimize for.
# Our model is multi-class, so the binary and precision options do not apply; we are focusing on accuracy.

is_binary = False # Should be True for binary (0/1) classification
is_optimise_for_precision = True # Only used when is_binary is True

# Determine Objective and Eval Metrics
if is_binary:
    objective = "binary:logistic"
    eval_metric = "logloss"
    eval_metric_list = ["error", "logloss", eval_metric]
else:
    objective = "multi:softmax"
    eval_metric = "merror"
    eval_metric_list = ["merror", "mlogloss", eval_metric]

# Refine Eval Metric
if is_binary and is_optimise_for_precision:
    eval_metric = "aucpr"
    scoring = "precision"
elif is_binary and not is_optimise_for_precision:
    eval_metric = "auc"
    scoring = "f1"
else:
    scoring = "accuracy"

# Build First Classifier Model 0
classifier_0 = XGBClassifier(
    objective=objective,
    booster="gbtree",
    eval_metric=eval_metric,
    subsample=0.8,
    colsample_bytree=1,
    random_state=1,
    use_label_encoder=False
)


# Provide Grid for Hyperparams
param_grid = {
    "gamma": [0, 0.1, 0.2, 0.5, 1, 1.5, 2, 3, 6, 12, 20],
    "learning_rate": [0.01, 0.02, 0.03, 0.05, 0.1, 0.2, 0.3, 0.5, 0.7, 0.8],
    "max_depth": [1, 2, 3, 4, 5, 6, 8, 12],
    "n_estimators": [50, 80, 100, 115, 200, 300, 400, 600, 800, 1000],
    "early_stopping_rounds": [10, 20, 30, 40, 50],
    "reg_alpha": [0.5, 0.8, 1.0, 1.2, 1.5, 1.8, 2.0],
    "reg_lambda": [1.0, 1.2, 1.4, 1.6, 1.8, 2.0]
}

# Perform Random Search for Best Hyperparams (tries 10 random combinations by default)
grid_search = RandomizedSearchCV(estimator=classifier_0, param_distributions=param_grid, scoring=scoring)
# Define validation dataset
eval_set = [(X_test, y_test)]


# Fit the model with validation dataset
best_model = grid_search.fit(X_train, y_train, eval_set=eval_set)

# Retrieve best hyperparameters
hyperparams = best_model.best_params_
ne = hyperparams["n_estimators"]
lr = hyperparams["learning_rate"]
md = hyperparams["max_depth"]
gm = hyperparams["gamma"]
#esr = hyperparams["early_stopping_rounds"]
ra = hyperparams["reg_alpha"]
rl = hyperparams["reg_lambda"]
print("Recommended Params >>", f"ne: {ne},", f"lr: {lr}", f"md: {md}", f"gm: {gm}", f"ra: {ra}", f"rl: {rl}")



# Build Classification Model 1 based on recommended parameters from the random search
classifier_1 = XGBClassifier(
    objective=objective,
    booster="gbtree",
    eval_metric=eval_metric_list,
    n_estimators=ne,
    learning_rate=lr,
    max_depth=md,
    gamma=gm,
    reg_alpha=ra,
    reg_lambda=rl,
    subsample=0.8,
    colsample_bytree=1,
    random_state=1,
    use_label_encoder=False
)
# Recommended Params >> ne: 1000, lr: 0.7 md: 3 gm: 0.1 ra: 0.8 rl: 2.0


# Fit Model
# Note: newer XGBoost releases expect early_stopping_rounds in the constructor rather than in fit()
eval_set = [(X_train, y_train), (X_test, y_test)]
classifier_1.fit(
    X_train,
    y_train,
    #eval_metric=eval_metric_list,
    eval_set=eval_set,
    verbose=False,
    early_stopping_rounds=10
)

# Convert DataFrame indices to integers
X_train_resampled = X_train_resampled.reset_index(drop=True)
y_train_resampled = y_train_resampled.reset_index(drop=True)
X_test_resampled = X_test.reset_index(drop=True)
y_test_resampled = y_test.reset_index(drop=True)


# Retrain the model on the resampled training data (without cross-validation).
# Both sets go into eval_set so that evals_result() holds the train (validation_0) and test (validation_1) curves.
classifier_1.fit(
    X_train_resampled,
    y_train_resampled,
    eval_set=[(X_train_resampled, y_train_resampled), (X_test_resampled, y_test_resampled)],
    verbose=False
)

# Get predictions for the train set
train_yhat = classifier_1.predict(X_train_resampled)

# Get predictions for the test set
test_yhat = classifier_1.predict(X_test_resampled)

# Calculate evaluation metrics (e.g., accuracy, precision, etc.) for the test set
test_accuracy = accuracy_score(y_test_resampled, test_yhat)
test_precision = precision_score(y_test_resampled, test_yhat, average=None, zero_division=0)

# Print or store the evaluation results for the entire test set
print("Test Accuracy (Entire Test Set):", test_accuracy)
print("Test Precision (Entire Test Set):", test_precision)

# Print classification report and confusion matrix for train data
print("Classification Report - Train Data:")
print(classification_report(y_train_resampled, train_yhat))
cm_train = confusion_matrix(y_train_resampled, train_yhat)
print("Confusion Matrix - Train Data:")
ConfusionMatrixDisplay(cm_train).plot()

# Print classification report and confusion matrix for test data
print("Classification Report - Test Data:")
print(classification_report(y_test_resampled, test_yhat))
cm_test = confusion_matrix(y_test_resampled, test_yhat)
print("Confusion Matrix - Test Data:")
ConfusionMatrixDisplay(cm_test).plot()

Here’s a breakdown of the code’s key steps and considerations:

1. Model Setup:

  • Multi-class Classification: Configures XGBoost for multi-class classification using objective="multi:softmax" and eval_metric="merror".
  • Accuracy Focus: Sets scoring="accuracy" for hyperparameter tuning, as the priority is overall accuracy.

2. Hyperparameter Tuning:

  • Random Search: Employs RandomizedSearchCV to explore hyperparameter combinations efficiently.
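  • Validation Dataset: Passes X_test and y_test as the eval_set during tuning. Because the same test set is also used for the final evaluation, the reported test metrics are likely optimistic; a separate validation split would avoid this.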
  • Best Hyperparameters: Retrieves and prints the optimal hyperparameters found through tuning.

3. Class Weights:

  • Not Applied Yet: This step does not use class weights; they are added together with re-sampling later in the article.

4. Model Building and Training:

  • Model Creation: Constructs a new XGBoost classifier with the recommended hyperparameters.
  • Training: Fits the model on X_train, y_train, using X_test, y_test for evaluation and early stopping to prevent overfitting.

5. Data Resampling:

  • Oversampling: Reuses the SMOTE-resampled training data (X_train_resampled, y_train_resampled) created earlier, which contains synthetic samples for minority classes.
  • Index Reset: Ensures integer indices for compatibility with XGBoost.

6. Final Evaluation:

  • Training on Resampled Data: Trains the model on the oversampled training set (X_train_resampled, y_train_resampled).
  • Predictions: Generates predictions for both training and test sets.
  • Evaluation Metrics: Calculates accuracy, precision, classification reports, and confusion matrices for both sets to assess model performance.
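
To show what a randomized search does conceptually, here is a simplified pure-Python sketch. It is not how RandomizedSearchCV is implemented (that also runs cross-validation for every candidate); the function random_search, the toy grid, and the toy scoring function are all hypothetical.

```python
import random

def random_search(param_grid, score_fn, n_iter=10, seed=0):
    """Sample `n_iter` random combinations from `param_grid` and return the
    best-scoring one. `score_fn` maps a params dict to a number (higher is better).
    """
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        # Draw one value per hyperparameter, independently of the others
        params = {name: rng.choice(values) for name, values in param_grid.items()}
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Toy grid and toy score: in practice the score would be a cross-validated
# metric of a model trained with `params`.
toy_grid = {"max_depth": [1, 2, 3, 4, 5, 6], "learning_rate": [0.01, 0.05, 0.1, 0.3]}

def toy_score(params):
    return -abs(params["max_depth"] - 3) - abs(params["learning_rate"] - 0.1)

best_params, best_score = random_search(toy_grid, toy_score, n_iter=20)
print(best_params, best_score)
```

For scale: the param_grid in this article spans 11 × 10 × 8 × 10 × 5 × 7 × 6 = 1,848,000 combinations, while RandomizedSearchCV tries only 10 of them unless n_iter is raised. The recommended parameters are therefore one sample from a very large space.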

Here is an explanation of the key parameters in XGBClassifier:

Core Parameters:

objective: The type of optimization objective for the model. Here, objective="multi:softmax" is suitable for multi-class classification.

  • Value range: Depends on the chosen objective function. In multi:softmax, the number of outputs is determined by the number of classes in the training data.

booster: Specifies the type of base learner. With booster="gbtree", gradient boosted trees are used.

  • Value range: “gbtree” (gradient boosted trees), “gblinear” (linear models), “dart” (trees with dropout).

eval_metric: The evaluation metric used during training. Built-in choices for multi-class classification include mlogloss (multi-class logarithmic loss) and merror (multi-class classification error); auc (area under the ROC curve) is mainly used for binary problems. Precision (proportion of true positives out of predicted positives) is not a built-in eval_metric and is computed separately, as done here with scikit-learn.

  • Value range: Depends on the objective function, data characteristics, and evaluation focus.

n_estimators: The number of trees to be built in the ensemble. More trees often improve accuracy but can lead to overfitting.

  • Value range: Positive integer; typically in the range of 100–1000.

learning_rate: The shrinkage factor applied to each new tree’s contribution. Lower values usually need more trees and take longer to train but can help prevent overfitting.

  • Value range: 0.01–0.3 is a common starting point; lower values often require more iterations.

max_depth: The maximum depth of each tree in the ensemble. Higher values allow for more complex models but are more prone to overfitting.

  • Value range: Positive integer; typically in the range of 3–8.

gamma: Minimum loss reduction required to make a split in a tree. Higher values prevent making split decisions for minimal improvements.

  • Value range: Non-negative real number; default is 0.

reg_alpha and reg_lambda: L1 and L2 regularization parameters. Higher values reduce model complexity and prevent overfitting.

  • Value range: Non-negative real numbers; usually start with small values (e.g., 0.1) and increase gradually.

Advanced Parameters:

subsample: Randomly sample a subset of training data at each tree iteration to enhance diversity and prevent overfitting.

  • Value range: 0–1; typically 0.8–1.

colsample_bytree: Randomly sample a subset of features for each tree to reduce the impact of correlated features and improve diversity.

  • Value range: 0–1; typically 0.8–1.

early_stopping_rounds: Number of consecutive rounds without improvement in the evaluation metric to trigger early stopping and prevent overfitting.

  • Value range: Positive integer; default is None (no early stopping).

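class_weight: Dictionary assigning weights to classes for addressing class imbalance, with higher weights for under-represented classes. XGBClassifier does not take this dictionary directly; for multi-class targets the weights are expanded to per-row values and passed to fit() as sample_weight (scale_pos_weight covers the binary case).

  • Value range: Dictionary with class index as key and weight as value.

use_label_encoder: A deprecated flag that controlled encoding of the target labels (not categorical features).

  • Value range: Boolean; set to False here.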

Remember:

  • The optimal settings for these parameters depend on your specific dataset, desired outcome, and computational resources.
  • Experimentation and careful evaluation are crucial to select the best configuration for your problem.
  • Consider using tools like Hyperopt or Optuna for more advanced hyperparameter tuning.

Beyond These Parameters:

  • XGBoost provides additional parameters for fine-tuning tree growth, regularization, and optimization strategies. Refer to the documentation for details.
  • Feature engineering and data preprocessing are often essential for optimal XGBoost performance.
  • Class imbalanced problems can benefit from oversampling/undersampling or cost-sensitive learning techniques.

By understanding these parameters and considering their usage with care, we can leverage XGBoostClassifier effectively for powerful multi-class classification tasks.
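
As an illustration of cost-sensitive learning, the sketch below derives "balanced" class weights from the label counts and expands them to one weight per row, which is the form that fit(..., sample_weight=...) expects. The helper names and the labels are hypothetical; scikit-learn's compute_class_weight and compute_sample_weight offer the same functionality.

```python
import numpy as np

def balanced_class_weights(y):
    """Weight each class by n_samples / (n_classes * class_count)."""
    classes, counts = np.unique(np.asarray(y), return_counts=True)
    weights = len(y) / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))

def expand_to_sample_weights(y, class_weights):
    """Look up the class weight of every row in `y`."""
    return np.array([class_weights[label] for label in np.asarray(y).tolist()],
                    dtype=float)

y_example = [0, 0, 0, 0, 1, 2]  # hypothetical labels, class 0 is the majority
class_weights = balanced_class_weights(y_example)
sample_weights = expand_to_sample_weights(y_example, class_weights)
print(class_weights)
print(sample_weights)
```

Hand-picked weights such as {0: 1, 1: 3, 2: 3} can be passed to expand_to_sample_weights in the same way. The balanced formula is just a data-driven starting point.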

[Figure: XGBoostClassifier after hyper-optimizing the parameters, without class_weight applied, for train and test data]

We will further add a re-sampling technique and class_weight to see if the model performance increases. But before that, let’s look at the evaluation metrics to see whether the model has loss and overfitting issues.

import matplotlib.pyplot as plt

# Retrieve performance metrics
results = classifier_1.evals_result()
#results = model.evals_result()
epochs = len(results["validation_0"]["merror"])
x_axis = range(0, epochs)


# Plot Log Loss
fig, ax = plt.subplots()
ax.plot(x_axis, results["validation_0"]["mlogloss"], label="Train")
ax.plot(x_axis, results["validation_1"]["mlogloss"], label="Test")
ax.legend()
plt.ylabel("mLogloss")
plt.title("XGB mLogloss")
plt.show()

# Plot Classification Error
fig, ax = plt.subplots()
ax.plot(x_axis, results["validation_0"]["merror"], label="Train")
ax.plot(x_axis, results["validation_1"]["merror"], label="Test")
ax.legend()
plt.ylabel("mError")
plt.title("XGB Error")
plt.show()

[Figure: Overfitting check for the XGBoost classifier after hyper-optimization]

The train and test curves are far apart from each other, which is a clear indication of overfitting. We may have to modify the weights and other parameters to reduce this.
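
One way to read such curves is to find the boosting round where the test loss stops improving. The sketch below mimics the early-stopping rule on a plain list of losses; the function early_stopping_round and the loss values are hypothetical. In the article's setup the list would come from results["validation_1"]["mlogloss"].

```python
def early_stopping_round(val_losses, patience=10):
    """Return (best_round, stop_round) for a list of per-round validation losses.

    Training stops once `patience` consecutive rounds pass without a new minimum.
    """
    best_round, best_loss = 0, float("inf")
    for round_idx, loss in enumerate(val_losses):
        if loss < best_loss:
            best_round, best_loss = round_idx, loss
        elif round_idx - best_round >= patience:
            return best_round, round_idx
    # Patience never ran out: training used every round
    return best_round, len(val_losses) - 1

# Hypothetical test mlogloss: improves at first, then rises as the model overfits
val_mlogloss = [1.10, 1.02, 0.98, 0.97, 0.99, 1.01, 1.04, 1.08]
print(early_stopping_round(val_mlogloss, patience=3))
```

Here the best round is 3 and training would stop at round 6. Trees added after the best round only fit the training data more closely.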

Adding a re-sampling technique to the hyper-optimized XGBClassifier model

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier
from sklearn.utils.class_weight import compute_sample_weight
from imblearn.over_sampling import ADASYN
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, ConfusionMatrixDisplay


# Split the data into training and testing sets
# Note: this split shuffles rows at random; pass shuffle=False to keep time order
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Class weights that give the minority classes more importance
class_weights = {0: 1, 1: 3, 2: 3} # Adjust weights as needed

# ADASYN oversampling
adasyn = ADASYN(sampling_strategy='not majority')
X_train_resampled, y_train_resampled = adasyn.fit_resample(X_train, y_train)

sample_weights = compute_sample_weight(class_weight=class_weights, y=y_train_resampled)

# Standardize features
scaler = StandardScaler()
X_train_resampled = scaler.fit_transform(X_train_resampled)
X_test = scaler.transform(X_test)

# Initialize and train the XGBoost classifier
classifier = XGBClassifier(
    objective=objective,
    booster="gbtree",
    eval_metric=eval_metric_list,
    n_estimators=ne,
    learning_rate=lr,
    max_depth=md,
    gamma=gm,
    reg_alpha=ra,
    reg_lambda=rl,
    subsample=0.8,
    colsample_bytree=1,
    random_state=1,
    use_label_encoder=False
)
classifier.fit(X_train_resampled, y_train_resampled, sample_weight=sample_weights)

# Predict on the test set
y_pred = classifier.predict(X_test)


# Evaluate the model
print("Classification Report:")
print(classification_report(y_test, y_pred))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
cm_test = confusion_matrix(y_test, y_pred)
print("Confusion Matrix - Test Data:")
ConfusionMatrixDisplay(cm_test).plot()

# Predict on the training set
train_pred = classifier.predict(X_train_resampled)


# Evaluate the model on the training data
print("Classification Report:")
print(classification_report(y_train_resampled, train_pred))
print("Confusion Matrix:")
print(confusion_matrix(y_train_resampled, train_pred))
cm_train = confusion_matrix(y_train_resampled, train_pred)
print("Confusion Matrix - Train Data:")
ConfusionMatrixDisplay(cm_train).plot()

Here’s a breakdown of the code’s key steps and considerations:

1. Data Preparation:

  • Splitting: Divides data into training (70%) and testing (30%) sets for model development and evaluation.
  • Class Weights: Defines class weights ({0: 1, 1: 3, 2: 3}) that give more importance to minority classes during training.
  • ADASYN Oversampling: Addresses class imbalance using ADASYN, a technique that generates synthetic samples for minority classes, potentially enhancing model performance for those classes.
  • Sample Weights: Computes sample weights for the resampled data, ensuring correct application of weights even after oversampling.
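  • Feature Scaling: Standardizes features using StandardScaler, fitted on the training data only. Tree-based models such as XGBoost are largely insensitive to feature scaling, so this step mainly keeps the preprocessing consistent with scale-sensitive models.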

2. Model Building and Training:

  • XGBoost Classifier: Initializes an XGBoostClassifier with specified hyperparameters for model complexity, regularization, and learning behavior.
  • Training: Fits the model to the resampled training data, incorporating both oversampling and sample weights to address class imbalance effectively.

3. Evaluation:

  • Predictions: Generates predictions on both the training and test sets using the trained model.
  • Classification Reports: Prints detailed performance metrics (precision, recall, F1-score, support) for each class, allowing in-depth analysis of model performance across classes.
  • Confusion Matrices: Visualizes correct and incorrect predictions, highlighting areas for potential improvement.
XGBoost Classifier after hyperparameter optimization and resampling, without class_weights (train and test data)
XGBoost Classifier after hyperparameter optimization and resampling, with class_weights (train data)
XGBoost Classifier after hyperparameter optimization and resampling, with class_weights (test data)

After hyperparameter optimization, resampling, and adding class_weights for the underrepresented classes, the XGBClassifier results were still not impressive. The model did improve in one respect: it shifted some accuracy and precision away from the 0’s and towards the 1’s and 2’s. Overall accuracy, however, fell to as low as 21% when the model was run on the entire dataset.

XGBoost Classifier after hyperparameter optimization and resampling, with class_weights (entire data)

Class 1 performs poorly, and overall accuracy sits at only 21%. Next, we will add an LSTM and check whether it brings any improvement.
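To make the trade-off between overall accuracy and per-class metrics more concrete, here is a minimal sketch that derives accuracy, precision, and recall from a 3x3 confusion matrix. The counts in `toy_matrix` are hypothetical and are not the results reported in this article; the function name `per_class_metrics` is also just an illustrative choice.

```python
def per_class_metrics(matrix):
    """Return overall accuracy plus precision and recall per class.

    matrix[i][j] counts samples whose true class is i and predicted class is j,
    the same layout scikit-learn's confusion_matrix uses.
    """
    n_classes = len(matrix)
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[k][k] for k in range(n_classes))
    accuracy = correct / total
    precision, recall = [], []
    for k in range(n_classes):
        predicted_k = sum(matrix[i][k] for i in range(n_classes))
        actual_k = sum(matrix[k])
        precision.append(matrix[k][k] / predicted_k if predicted_k else 0.0)
        recall.append(matrix[k][k] / actual_k if actual_k else 0.0)
    return accuracy, precision, recall


# Hypothetical counts, NOT the article's results.
toy_matrix = [
    [20, 40, 40],  # true class 0
    [2, 6, 2],     # true class 1
    [2, 2, 6],     # true class 2
]
accuracy, precision, recall = per_class_metrics(toy_matrix)
print(accuracy, precision, recall)
```

In this toy case, recall on classes 1 and 2 is fairly high while their precision and the overall accuracy are low, because many class 0 samples are pushed into the minority classes. This is one way a heavily reweighted model can look better on minority recall and worse overall.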

LSTM with XGBClassifier Model

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
from sklearn.preprocessing import StandardScaler
from sklearn.utils.class_weight import compute_sample_weight
from xgboost import XGBClassifier

# Assuming the feature matrix is stored in X and the corresponding labels in y

# Scale the features before reshaping them into sequences
scaler = StandardScaler()
X_seq_scaled = scaler.fit_transform(X)

# Define the number of timesteps and features
n_timesteps = 1 # Define the number of time steps
n_features = X_seq_scaled.shape[1] # Number of features in the sequential data

# Calculate sample weights based on class imbalance
class_weights = {0: 1, 1: 3, 2: 3} # Adjust weights as needed

# Determine the number of samples and calculate the number of batches
n_samples = len(X_seq_scaled)
n_batches = n_samples // n_timesteps

# Reshape the sequential data to have 3 dimensions (batch_size, timesteps, features)
# Ensure that the reshaped array has the correct number of elements
X_seq_reshaped = X_seq_scaled[:n_batches * n_timesteps].reshape(n_batches, n_timesteps, n_features)

sample_weights = compute_sample_weight(class_weight=class_weights, y=y[:n_batches * n_timesteps])

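# Define and train the LSTM model
# Note: the labels take three values (0, 1, 2), while this output layer is a single
# sigmoid unit trained with binary cross-entropy; a 3-unit softmax output with
# sparse_categorical_crossentropy would match the labels more closely.
model_lstm = Sequential()
model_lstm.add(LSTM(units=50, input_shape=(n_timesteps, n_features)))
model_lstm.add(Dense(units=1, activation='sigmoid'))
model_lstm.compile(optimizer='adam', loss='binary_crossentropy')
model_lstm.fit(X_seq_reshaped, y[:n_batches * n_timesteps], epochs=100, batch_size=50)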

# Extract features from LSTM model
features_lstm = model_lstm.predict(X_seq_reshaped)

# Concatenate or stack LSTM features with original features
X_combined = np.concatenate((X, features_lstm), axis=1) # Adjust axis as per your data shape

# Train XGBoostClassifier on combined features
classifier_xgb_lstm = XGBClassifier(
objective=objective,
booster="gbtree",
eval_metric=eval_metric_list,
n_estimators=ne,
learning_rate=lr,
max_depth=md,
gamma=gm,
reg_alpha=ra,
reg_lambda=rl,
subsample=0.8,
colsample_bytree=1,
random_state=1,
use_label_encoder=False)
classifier_xgb_lstm.fit(X_combined, y[:n_batches * n_timesteps], sample_weight=sample_weights)

# Evaluate the model, make predictions, etc.

from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix, ConfusionMatrixDisplay

# Make predictions on the test data
X_test_seq_scaled = scaler.transform(X_test)
X_test_seq_reshaped = X_test_seq_scaled.reshape(X_test_seq_scaled.shape[0], n_timesteps, n_features)
features_lstm_test = model_lstm.predict(X_test_seq_reshaped)
X_test_combined = np.concatenate((X_test, features_lstm_test), axis=1)
y_pred = classifier_xgb_lstm.predict(X_test_combined)

# Calculate evaluation metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
conf_matrix = confusion_matrix(y_test, y_pred)

# Print the evaluation metrics
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("Confusion Matrix:")
print(conf_matrix)
print("Confusion Matrix - Test Data:")
ConfusionMatrixDisplay(conf_matrix).plot()

Here is a breakdown of the code’s key steps and considerations:

1. Data Preparation:

  • Scaling: Standardizes sequential data using StandardScaler for better model convergence.
  • Reshaping: Reshapes the data into a 3D format (batch_size, timesteps, features) suitable for LSTM input.
  • Sample Weights: Calculates sample weights to address class imbalance, giving more importance to minority classes during training.

2. LSTM Feature Extraction:

  • Model Definition: Creates a Sequential model with an LSTM layer (50 units) to capture temporal patterns and a single-unit sigmoid output layer. That output is designed for binary classification, so with three target classes (0, 1, 2) a 3-unit softmax output would be the more conventional choice.
  • Training: Trains the LSTM model on the reshaped data for 100 epochs, using binary cross-entropy loss and the Adam optimizer.
  • Feature Extraction: Passes the reshaped data through the trained LSTM and uses its predictions as an additional feature.

3. Feature Combination and XGBoost Training:

  • Concatenation: Combines the original features with the extracted LSTM features, creating a richer feature set that incorporates both raw data and learned temporal patterns.
  • XGBoost Model: Initializes an XGBClassifier with specified hyperparameters for model complexity, regularization, and learning behavior.
  • Training: Fits the XGBoost model to the combined features, leveraging both the original data and the additional insights from the LSTM features.

Key Considerations:

  • Time Steps: Carefully choose the number of timesteps (n_timesteps) to adequately represent temporal dependencies in your data.
  • Batch Size: Experiment with different batch sizes to optimize training efficiency and generalization.
  • Hyperparameter Tuning: Find optimal hyperparameters for both LSTM and XGBoost models using techniques like cross-validation.
  • Evaluation: Thoroughly evaluate the combined model’s performance using appropriate metrics, considering class imbalance and problem goals.
  • Interpretability: While this approach can be effective, the combined model might be less interpretable than using LSTM or XGBoost alone.
  • Alternative Approaches: Explore other techniques for combining LSTMs and tree-based models, such as stacking or ensembling, to potentially enhance performance further.
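On the Time Steps point: with `n_timesteps = 1`, each sample handed to the LSTM is a single row, so there is no history for it to learn from. The sketch below shows one common way to build overlapping windows for a larger `n_timesteps`. It uses plain Python lists and made-up data; the helper name `make_windows` and the choice to label each window with its last row are assumptions for illustration, not the article’s method.

```python
def make_windows(rows, labels, n_timesteps):
    """Build overlapping windows of n_timesteps consecutive rows.

    Each window is paired with the label of its last row, so the number of
    windows and the number of labels always match.
    """
    windows, window_labels = [], []
    for end in range(n_timesteps, len(rows) + 1):
        windows.append(rows[end - n_timesteps:end])
        window_labels.append(labels[end - 1])
    return windows, window_labels


# Made-up data: 5 candles with 2 features each.
rows = [[1, 10], [2, 20], [3, 30], [4, 40], [5, 50]]
labels = [0, 0, 1, 2, 0]
windows, window_labels = make_windows(rows, labels, n_timesteps=3)
print(len(windows), window_labels)
```

The nested list has the layout (windows, timesteps, features), so converting it to an array would give the 3D input an LSTM layer expects.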
LSTM with XGBClassifier and class_weights (test data)
LSTM with XGBClassifier and class_weights (entire data)

The results were not impressive; this approach probably needs more feature engineering and optimization.

Saving The Model for Future Use

import joblib
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay
from sklearn.utils.class_weight import compute_sample_weight

# Assuming you have your time series data stored in X and y
# X should contain your features, and y should contain your target variable

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Calculate sample weights based on class imbalance
class_weights = {0: 1, 1: 3, 2: 3} # Adjust weights as needed

# Apply SMOTE to the training data only to avoid data leakage
smote = SMOTE(sampling_strategy='not majority')
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

sample_weights = compute_sample_weight(class_weight=class_weights, y=y_train_resampled)

# Standardize features
scaler = StandardScaler()
X_train_resampled = scaler.fit_transform(X_train_resampled)
X_test = scaler.transform(X_test)

# Initialize and train the XGBoost classifier
classifier = XGBClassifier(
objective=objective,
booster="gbtree",
eval_metric=eval_metric_list,
n_estimators=ne,
learning_rate=lr,
max_depth=md,
gamma=gm,
reg_alpha=ra,
reg_lambda=rl,
subsample=0.8,
colsample_bytree=1,
random_state=1,
use_label_encoder=False)
classifier.fit(X_train_resampled, y_train_resampled, sample_weight=sample_weights)

# Predict on the test set
y_pred = classifier.predict(X_test)



# Save the trained model to a file
joblib.dump(classifier, 'xgboost_model.pkl')

# Later, when you want to retest with the entire dataset:
# Load the saved model from the file
loaded_model = joblib.load('xgboost_model.pkl')

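# Use the loaded model to predict on the entire dataset
# (the model was trained on standardized features, so apply the same fitted scaler first)
y_pred_entire_dataset = loaded_model.predict(scaler.transform(X))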

# Evaluate the loaded model
print("Classification Report (Entire Dataset):")
print(classification_report(y, y_pred_entire_dataset))
print("Confusion Matrix (Entire Dataset):")
print(confusion_matrix(y, y_pred_entire_dataset))
cm_entire_dataset = confusion_matrix(y, y_pred_entire_dataset)
print("Confusion Matrix - Entire Dataset:")
ConfusionMatrixDisplay(cm_entire_dataset).plot()

1. Data Preparation:

  • Splitting: Divides data into training (70%) and testing (30%) sets for model development and evaluation.
  • Class Weights: Calculates sample weights based on class imbalance, giving more importance to minority classes during training ({0: 1, 1: 3, 2: 3}).
  • SMOTE Oversampling: Addresses class imbalance using SMOTE, a technique that generates synthetic samples for minority classes, potentially enhancing model performance for those classes.
  • Sample Weights: Computes sample weights for the resampled data, ensuring correct application of weights even after oversampling.
  • Feature Scaling: Standardizes features using StandardScaler, often improving model accuracy and convergence.

2. Model Building and Training:

  • XGBoost Classifier: Initializes an XGBClassifier with specified hyperparameters for model complexity, regularization, and learning behavior.
  • Training: Fits the model to the resampled training data, incorporating both oversampling and sample weights to address class imbalance effectively.

3. Evaluation:

  • Predictions: Generates predictions on the test set and, after reloading the model, on the entire dataset.
  • Classification Reports: Prints detailed performance metrics (precision, recall, F1-score, support) for each class, allowing in-depth analysis of model performance across classes.
  • Confusion Matrices: Visualizes correct and incorrect predictions, highlighting areas for potential improvement.

4. Model Saving and Retesting:

  • Saving: Saves the trained model to a file using joblib for future use.
  • Loading: Loads the saved model from the file when needed.
  • Retesting: Predicts on the entire dataset using the loaded model to assess its performance on a larger scale.
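One practical detail when reloading a model is that new data needs the same scaling that was learned on the training data. The sketch below illustrates the idea with the standard library only: it uses `pickle` as a stand-in for joblib, and the helpers `fit_standardizer` and `apply_standardizer` and the bundle layout are hypothetical. The data is made up.

```python
import pickle
from statistics import mean, pstdev


def fit_standardizer(rows):
    """Learn per-column mean and population standard deviation from training rows."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    stds = [pstdev(col) or 1.0 for col in columns]  # avoid dividing by zero
    return {"means": means, "stds": stds}


def apply_standardizer(params, rows):
    """Scale rows with statistics learned earlier (never refit on new data)."""
    return [
        [(value - m) / s for value, m, s in zip(row, params["means"], params["stds"])]
        for row in rows
    ]


# Made-up training rows with two features.
train_rows = [[1.0, 10.0], [3.0, 30.0]]
params = fit_standardizer(train_rows)

# Persist the scaling statistics in one bundle; a trained model could sit next to them.
blob = pickle.dumps({"scaler": params})

# Later: restore the bundle and scale new rows exactly as during training.
restored = pickle.loads(blob)
scaled = apply_standardizer(restored["scaler"], [[2.0, 20.0]])
print(scaled)
```

Keeping the preprocessing statistics together with the model is a design choice that makes a later retest less likely to feed unscaled features into a model trained on scaled ones.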

Feature Selection for Optimal Usage of Data for Prediction (To Reduce Memory Usage)

# Plot Feature Importances
fig = plt.figure(figsize=(42, 10))
importance_labels = X.columns
#importance_features = classifier_1.feature_importances_
importance_features = model.feature_importances_
plt.bar(importance_labels, importance_features)
plt.show()
I have used 3 features for this model

The bar graph above shows how much each feature contributed to the model’s predictions of the target value. Usually, we start from many features (10 or more) and look for the ones that were actually relevant to the model’s predictions. Because we used very few features here, we were unable to get the best possible results.

If we want to shortlist the best of 100+ features, we can keep only those whose importance is above the mean of feature_importances_, as below:

# Select Best Features
mean_feature_importance = importance_features.mean()
i = 0
recommended_feature_labels = []
recommended_feature_score = []
for fi in importance_features:
    if fi > mean_feature_importance:
        recommended_feature_labels.append(importance_labels[i])
        recommended_feature_score.append(fi)
    i += 1


# Plot Recommended Features
fig = plt.figure(figsize=(15, 5))
plt.bar(recommended_feature_labels, recommended_feature_score)
plt.show()

This helps us shortlist the most useful of the features that were used. If we retrain the model on this reduced set, we save a lot of memory and training becomes faster.
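As a small worked example of the step that follows selection, the sketch below applies the same above-the-mean rule and then drops the unselected columns before retraining. The feature names, importance scores, and helper names are all hypothetical.

```python
from statistics import mean


def select_above_mean(importances):
    """Keep the features whose importance exceeds the mean importance."""
    threshold = mean(importances.values())
    return [name for name, score in importances.items() if score > threshold]


def keep_columns(rows, selected):
    """Drop every column that was not selected."""
    return [{name: row[name] for name in selected} for row in rows]


# Hypothetical feature names and importance scores.
importances = {"rsi": 0.45, "ema_gap": 0.35, "volume": 0.15, "hour": 0.05}
selected = select_above_mean(importances)

# One made-up data row, reduced to the selected features only.
rows = [{"rsi": 55.0, "ema_gap": 0.2, "volume": 1200.0, "hour": 14}]
reduced_rows = keep_columns(rows, selected)
print(selected, reduced_rows)
```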

Final shortlisted features that are most relevant for training the model

Comparison of Models for Predicting Bitcoin Price on 15-Minute Timeframes with Class Weights and Resampling

When comparing various machine learning algorithms for time series prediction on Bitcoin price data with a 15-minute timeframe, it’s essential to consider several factors, including model performance, computational efficiency, interpretability, and ease of implementation. Here’s a breakdown of the comparison among CatBoost, LightGBM, Support Vector Classifier (SVC), XGBoost, and XGBoost with LSTM:

CatBoost:

Advantages:

  • Robust handling of categorical features without preprocessing.
  • Built-in support for handling missing data.
  • Good performance with default hyperparameters.

Considerations:

  • Slower training speed compared to some other models.
  • May require more memory due to its internal algorithms.

LightGBM:

Advantages:

  • Faster training speed and lower memory usage compared to traditional gradient boosting methods.
  • Good performance with large datasets and high-dimensional features.

Considerations:

  • May require tuning of hyperparameters for optimal performance.
  • Less robust to overfitting than CatBoost.

Support Vector Classifier (SVC):

Advantages:

  • Effective in high-dimensional spaces.
  • Can capture complex relationships in data using different kernel functions.

Considerations:

  • Training time can be relatively slow, especially with large datasets.
  • Requires careful selection of hyperparameters and kernel functions.

XGBoost:

Advantages:

  • High performance and scalability.
  • Good handling of missing data and feature importance estimation.

Considerations:

  • Hyperparameter tuning may be required for optimal performance.
  • Limited support for categorical features compared to CatBoost.

XGBoost with LSTM:

Advantages:

  • Ability to capture temporal dependencies in sequential data.
  • Suitable for modeling complex time series patterns.

Considerations:

  • Longer training time and potentially higher computational cost compared to traditional machine learning models.
  • Requires careful architecture design and tuning of LSTM parameters.

Incorporating resampling techniques like SMOTE and using class weights can help models in scenarios with imbalanced classes, such as predicting Bitcoin price movements. By oversampling minority classes and assigning higher weights to them, models can better capture rare events, although, as the XGBoost results above show, this can come at the cost of overall accuracy.

Several machine learning models have been evaluated for their ability to predict Bitcoin prices on 15-minute timeframes, with a particular focus on handling imbalanced data and incorporating class weights. The models under consideration include CatBoostClassifier, LGBMClassifier, SVC (Support Vector Classifier), XGBoostClassifier, and a hybrid approach combining XGBoost with an LSTM for capturing temporal dependencies.

Resampling and Class Weights:

The study employs SMOTE as a resampling technique to address the imbalanced nature of the data, where the long and short signal classes (1’s and 2’s) are underrepresented. Additionally, class weights are assigned to each class, giving more importance to the underrepresented ones during model training.
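For intuition about what SMOTE adds to the training set: it creates synthetic minority samples by interpolating between a minority sample and one of its nearest minority-class neighbours. The sketch below shows only that interpolation step with made-up points; the neighbour search is left out, and the function name `interpolate_sample` is hypothetical.

```python
import random


def interpolate_sample(sample, neighbor, rng):
    """Create one synthetic point on the segment between two minority samples."""
    gap = rng.random()  # uniform in [0, 1)
    return [a + gap * (b - a) for a, b in zip(sample, neighbor)]


rng = random.Random(42)  # fixed seed so the sketch is repeatable
minority_sample = [1.0, 10.0]
minority_neighbor = [3.0, 14.0]
synthetic = interpolate_sample(minority_sample, minority_neighbor, rng)
print(synthetic)
```

Since the synthetic point always lies between two existing minority samples, oversampling adds density to the minority region rather than genuinely new information.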

Model Performance:

Observations suggest that CatBoostClassifier and LGBMClassifier achieved superior performance when compared to the other models, particularly when combined with class weights and resampling techniques like SMOTE. While these models demonstrated effectiveness in handling imbalanced data, it is crucial to remember that the optimal choice depends on various factors, including the specific dataset, evaluation metrics, and available computational resources.

Key Considerations:

  • Hyperparameter Tuning: Experimenting with different hyperparameter settings for each model is essential to optimize their performance. Techniques like cross-validation can aid in this process.
  • Alternative Resampling Techniques: While SMOTE offers a good starting point, exploring other techniques like ADASYN might yield further improvements depending on the data characteristics.
  • Time Series Specificity: If temporal patterns significantly influence Bitcoin price movements, employing specialized time series models like ARIMA, Prophet, or LSTMs could potentially enhance prediction accuracy.
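For time-ordered data such as 15-minute candles, one common variant of cross-validation keeps every validation fold strictly after its training fold, so the model is never tuned on information from the future. The sketch below is a minimal, hypothetical expanding-window splitter in plain Python (scikit-learn’s `TimeSeriesSplit` offers a similar idea); the parameter names are illustrative.

```python
def walk_forward_splits(n_samples, n_folds, min_train):
    """Yield (train_indices, validation_indices) with validation always later in time."""
    fold_size = (n_samples - min_train) // n_folds
    for fold in range(n_folds):
        train_end = min_train + fold * fold_size
        valid_end = train_end + fold_size
        yield list(range(train_end)), list(range(train_end, valid_end))


splits = list(walk_forward_splits(n_samples=10, n_folds=3, min_train=4))
for train_idx, valid_idx in splits:
    print(train_idx, valid_idx)
```

Each fold trains on everything up to a cut-off and validates on the block right after it, which mimics how a trading model would be used on unseen candles.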

Conclusion:

While CatBoostClassifier and LGBMClassifier appear to have performed well in the specific context presented, thorough testing and comparisons with other models under various configurations are recommended before drawing definitive conclusions. Additionally, incorporating domain knowledge through feature engineering and carefully selecting evaluation metrics based on the problem goals are crucial steps for achieving optimal prediction performance.

Final Summary:

Throughout our analysis, we’ve navigated a diverse landscape of machine learning techniques to optimize our trading strategy for Bitcoin price prediction on a 15-minute timeframe. Leveraging a custom signal generator producing 0s (Neutral), 1s (Long), and 2s (Short), we engineered corresponding target values and subjected them to rigorous evaluation.

Our journey began with the XGBClassifier, a robust model renowned for its versatility. To enhance its performance, we introduced resampling techniques like SMOTE, RandomOverSampler, TomekLinks, and ADASYN, alongside class_weights to amplify the influence of the underrepresented classes (1s and 2s). Despite our efforts, the XGBClassifier failed to meet expectations, prompting deeper analysis.

Undeterred, we further refined our approach by layering hyperparameter optimization onto XGBoostClassifier, combined with resampling and class_weights. Exploring LSTM atop XGBoostClassifier yielded intriguing possibilities, driving our exploration into alternative models.

CatBoostClassifier and LGBMClassifier emerged as standout performers, particularly when augmented with SMOTE resampling and class_weights. Their adaptive nature and superior handling of imbalanced data underscored their suitability for our task.

However, our exploration illuminated critical insights into our methodology’s limitations. Our reliance on a limited feature set and imbalanced target values posed challenges, hindering model effectiveness. Additionally, our predictive horizon — three candles ahead — exceeded our model’s capacity, impeding timely decision-making.

Moving forward, we recognize the imperative of diversifying our feature set and adopting a more holistic approach to signal generation. Integrating multiple technical indicators spanning volatility, momentum, volume, and trend analysis across various timeframes promises to enrich our model’s predictive capabilities.

Our journey has just begun. Future endeavors will explore advanced techniques such as RNNs, CNNs, LSTMs, and Prophet models, as we strive to unlock the full potential of our trading strategy and chart a profitable course in the dynamic world of cryptocurrency markets.

Suggestions for Learning:

Books:

  • “Python for Finance” by Yves Hilpisch
  • “Machine Learning Yearning” by Andrew Ng
  • “Deep Learning” by Ian Goodfellow

Courses:

  • Coursera’s “Machine Learning” by Andrew Ng
  • Udacity’s “Deep Learning Nanodegree”

Resources:

  • Kaggle for real-world datasets and competitions.
  • Towards Data Science on Medium for insightful articles.

Financial Analysis:

  • “Quantitative Financial Analytics with Python” on edX.
  • “Financial Markets” on Coursera by Yale University.

Programming Practice:

  • LeetCode and HackerRank for general programming challenges.
  • GitHub repositories with open-source finance and machine learning projects.

These resources will provide a comprehensive foundation for understanding the technical aspects of algo trading and the application of Python in finance. Additionally, participating in online forums and communities such as Stack Overflow, GitHub, and Reddit’s r/algotrading can offer practical insights and peer support.

Thank you, Readers.

I hope you have found this article on algorithmic strategy informative and helpful. As a creator, I am dedicated to providing valuable insights and analysis on cryptocurrency, the stock market, and other asset management.

If you have enjoyed this article and would like to support my ongoing efforts, I would be honored to have you as a member of my Patreon community. As a member, you will have access to exclusive content, early access to new analysis, and the opportunity to be a part of shaping the direction of my research.

Membership starts at just $10, and you can choose to contribute on a bi-monthly basis. Your support will help me to continue to produce high-quality content and bring you the latest insights on financial analytics.

Patreon https://patreon.com/pppicasso

Regards,

Puranam Pradeep Picasso

LinkedIn https://www.linkedin.com/in/puranampradeeppicasso/

Facebook https://www.facebook.com/puranam.p.picasso/

Twitter https://twitter.com/picasso_999

Puranam Pradeep Picasso - ImbueDesk Profile

Algorithmic Trader, AI/ML & Crypto Enthusiast, Certified Blockchain Architect, Certified Lean Six SIgma Green Belt, Certified SCRUM Master and Entrepreneur