Selection and Data Mining Bias in Trading System Development

Selection and data-mining bias in trading system development often lead to random results. In a nutshell, selection bias corrupts the validation process and falsely attributes significance to random results. The only way to identify robust trading systems that exploit persistent market anomalies and provide an edge is to minimize selection and data-mining bias. Even then, other factors, known or unknown, may limit system potential.

In the book Hedge Fund Market Wizards (May 2012), there is an interview with Jaffray Woodriff, the founder of QIM. When asked about data mining, he said, among other things, that “…the human tendency is to take the models that continue to do well in the out-of-sample data and choose those models for trading.” He went on to explain how “this process simply turns the out-of-sample data into part of the training data because it cherry-picks the models that did best in the out-of-sample period.” He then argued that one must instead look for patterns “where, on average, all the models out-of-sample continue to do well” and that “you are really getting somewhere if the out-of-sample results are more than 50 percent of the in-sample.”

The widely (ab)used practice of selection bias

Among other things, selection bias involves ranking results according to some metrics and then selecting the best performer (most stock rotation strategies fall under this category and thus suffer from this bias). If the selection is made only from in-sample results, then this is just selection bias. But if the selection is also based on out-of-sample results, then data-snooping is introduced as well. In both cases, selection is bad practice. In the former case, the process usually continues until a system selected from the in-sample happens to perform well in the out-of-sample. Trading system developers often get fooled by randomness through selection bias. If the development process involves testing a large number of alternative combinations (securities, indicators, factors, etc.), then the probability that some random selection will validate successfully is, for all practical purposes, 1; in other words, this process guarantees that something will be found to satisfy the performance criteria defined by a set of metrics in both the in-sample and the out-of-sample.
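The claim that the probability of validating something random approaches 1 is easy to demonstrate with a small Monte Carlo experiment: generate many strategies whose trades are pure coin flips, pick the best in-sample performer, and observe that its apparent edge evaporates out-of-sample. This is a generic sketch, unrelated to any particular software.

```python
import random

def simulate_selection_bias(n_strategies=10_000, n_in=500, n_out=250, seed=7):
    """Create strategies whose per-trade results are pure coin flips,
    pick the best in-sample performer, and report its out-of-sample mean."""
    rng = random.Random(seed)
    best_mean, best_out = float("-inf"), None
    for _ in range(n_strategies):
        mean_in = sum(rng.choice((-1.0, 1.0)) for _ in range(n_in)) / n_in
        mean_out = sum(rng.choice((-1.0, 1.0)) for _ in range(n_out)) / n_out
        if mean_in > best_mean:
            # "Select" the strategy with the best in-sample record.
            best_mean, best_out = mean_in, mean_out
    return best_mean, best_out

mean_in, mean_out = simulate_selection_bias()
print(f"best in-sample mean per trade: {mean_in:+.3f}")
print(f"same strategy out-of-sample:   {mean_out:+.3f}")
```

The best of 10,000 coin-flip strategies shows a clearly positive in-sample mean, yet its out-of-sample mean hovers near zero, which is exactly the fooled-by-randomness effect described above.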

Minimizing selection bias

Selection bias can be minimized by not selecting anything and accepting all results. This is the simplest and most straightforward solution, although the possibility of curve-fitting remains. However, this is not a practical approach in most cases, and it is impossible with processes that can internally generate billions or even trillions of combinations, such as machine learning algorithms. In those cases we can relax the strict condition of no selection by requiring that ALL systems that satisfy the development objectives are validated as a single system. Although this does not completely eliminate the effects of selection bias and curve-fitting, it may nevertheless minimize their negative effects.
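The idea of validating all qualifying systems as a single system can be sketched as follows: the per-trade return series of every system that met the objectives are equally weighted into one combined series, and only that combined series is validated. The system names and numbers below are hypothetical.

```python
def combined_system_returns(per_system_returns):
    """Equal-weight the return series of every qualifying system so that
    validation covers the whole set instead of a cherry-picked best one."""
    n = len(per_system_returns)
    length = len(next(iter(per_system_returns.values())))
    return [sum(r[i] for r in per_system_returns.values()) / n
            for i in range(length)]

# Three hypothetical systems that all satisfied the in-sample objectives:
systems = {
    "sys_a": [0.01, -0.02, 0.03],
    "sys_b": [0.00, 0.01, -0.01],
    "sys_c": [0.02, 0.00, 0.01],
}
print(combined_system_returns(systems))
```

The combined series is then cross-validated out-of-sample as one system; no individual member is ever ranked or selected.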

Selection bias is not involved when a specific system is conceived as a hypothesis without looking at the data. This must be a specific system with specific parameters that completely describe its entry and exit points. An example of such a system is the 50-200 SMA cross. Although the process that conceived the system is possibly subject to data-mining bias, there is no selection bias involved. Three important points must be made here: (A) Some confuse data-mining and selection bias, attributing the latter to a single, well-defined trading system hypothesis. However, selection bias is only involved when the same set of well-defined criteria gives rise to multiple hypotheses and a selection is made based on maximum performance, minimum drawdown, etc. (B) The assumption that all conceivable systems, even in the form of single arbitrary hypotheses, are the result of selection bias is unsound because (1) such a claim is a universally quantified proposition of the form "All swans are white" and it cannot be proven but only falsified, and (2) the process that gives rise to the hypotheses must be explicitly defined, and in the absence of such a definition no such claim can be made, i.e., it is a vacuous claim. (C) The data-mining bias involved in the case of a single hypothesis can be minimized using out-of-sample validation. According to some notable experts in the field, for example Jaffray Woodriff, the founder of QIM mentioned above, if the out-of-sample results are more than 50 percent of the in-sample, then there is a good chance that the hypothesis is not random. But this only holds in the case of a unique hypothesis and not in the case of multiple testing.
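Woodriff's rule of thumb can be expressed as a simple check. The per-bar normalization below is an assumption made for illustration (the interview does not specify how the comparison is normalized, and the two samples usually differ in length), and all numbers are hypothetical.

```python
def oos_to_is_ratio(in_sample_profit, out_sample_profit,
                    in_sample_bars, out_sample_bars):
    """Compare out-of-sample to in-sample profit per bar; the rule of
    thumb looks for a ratio above 0.5 for a single hypothesis."""
    is_rate = in_sample_profit / in_sample_bars
    oos_rate = out_sample_profit / out_sample_bars
    return oos_rate / is_rate

# Hypothetical figures: $12,000 over 3,000 in-sample bars,
# $3,500 over 800 out-of-sample bars.
ratio = oos_to_is_ratio(12_000, 3_500, 3000, 800)
print(f"ratio = {ratio:.2f} ->", "passes" if ratio > 0.5 else "fails")
```

Remember the caveat from the text: this check is only meaningful for a unique hypothesis, not after multiple testing.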

It is also very important, in my opinion, that the data-mining process be inherently deterministic in the sense that it always generates the same output when supplied with the same data. Otherwise, the results are random by design and any application of cross-validation may be affected by the randomness. If one tests trillions of combinations of patterns, it is guaranteed that several random combinations will pass cross-validation and even portfolio backtesting. Every random combination essentially amounts to a random selection and introduces bias. One way to minimize data-mining bias and eliminate selection bias involves identifying all possible combinations and applying cross-validation to the combined set. But this is an impossible task, especially in the case of evolutionary algorithms. A solution may involve limiting the available set to just a few indicators and patterns and trying to establish convergence of the algorithm. But this defeats the purpose for which some of these algorithms were developed, essentially rendering them useless, because something like that can be done either manually or with combinatorics.
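As a toy illustration of a deterministic search, a small pattern alphabet can be enumerated exhaustively, which guarantees that repeated runs on the same data test exactly the same combinations and that cross-validation can be applied to the complete set. This is a generic sketch, not Price Action Lab's algorithm.

```python
from itertools import product

def enumerate_pattern_space(lookback=3):
    """Deterministically enumerate every up/down bar pattern of a given
    lookback; no random sampling means no run-to-run variation."""
    return list(product(("U", "D"), repeat=lookback))

space = enumerate_pattern_space()
print(len(space), space[0], space[-1])  # 8 ('U', 'U', 'U') ('D', 'D', 'D')
```

With 2^3 = 8 patterns the full set is trivially testable; the point of the paragraph above is that this exhaustive approach stops scaling long before the billions of combinations that evolutionary algorithms explore.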

Example of data-mining application to trading system development with no selection bias involved

Price Action Lab is a program that identifies price patterns in historical data based on user-defined performance criteria. When development of its algorithm began in the early 1990s, it was a deliberate choice to consider only price patterns, because by that time it was becoming evident to most experienced traders that the effectiveness of most technical analysis indicators that were derivatives of price and involved parameters was decreasing rapidly. Limiting the choices to price patterns reduced data-mining bias to start with. In addition, price patterns involve no free parameters, and any curve-fitting through optimization was limited to the size of the exits, something controlled by the user and not by the data-mining process.

What was left to deal with was selection bias, and this can be greatly reduced by selecting all patterns generated by the program above a certain threshold established by a handful of key metrics, such as the win rate, the number of trades, the number of consecutive losers and the profit factor. The limited choice of metrics also significantly reduces data-mining bias by reducing the possibilities of identifying spurious formations. In fact, none of the patterns found by Price Action Lab are spurious in the sense that they have already occurred in the in-sample at a rate defined by their win rate. The issue is not one of spurious formations, because no real price action should be called spurious. The issue is whether these formations are robust enough to maintain their profitability in the future.

This question can be partly answered by cross-validating a system consisting of all patterns in the results in the out-of-sample. Of course, cross-validation is always subject to TYPE-I and TYPE-II errors. Portfolio backtests can be used to minimize TYPE-II errors. In my opinion, no effort should be made to minimize TYPE-I errors, as that would conflict with the very essence of the backtesting hypothesis.

Step 1. Defining the market, timeframe and profit target and stop loss

In this example a trading system will be developed for DIA using an in-sample from the ETF inception (01/28/1998) to 12/31/2009. The out-of-sample will cover the period 01/04/2010 to 03/28/2013.

The profit target is set to 4% and the stop-loss is set at 3%, resulting in a reward/risk ratio of about 1.33 on average. The exit levels are determined based on historical average volatility. A T/S file is created and saved in Price Action Lab as follows:

  [Image: DIAT4S3_TS]
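For fixed-percentage exits like these, the breakeven win rate follows directly from the target and stop sizes. A quick sketch, ignoring commissions and slippage:

```python
def breakeven_win_rate(target_pct, stop_pct):
    """Win rate at which the expected profit per trade is zero:
    W * target = (1 - W) * stop  =>  W = stop / (target + stop)."""
    return stop_pct / (target_pct + stop_pct)

def expected_return_per_trade(win_rate, target_pct, stop_pct):
    """Expected percent return per trade for fixed target/stop exits."""
    return win_rate * target_pct - (1 - win_rate) * stop_pct

print(f"breakeven win rate: {breakeven_win_rate(4, 3):.2%}")        # 42.86%
print(f"E[r] at a 55% win rate: "
      f"{expected_return_per_trade(0.55, 4, 3):.2f}% per trade")    # 0.85%
```

Any pattern set that clears the 50% minimum win rate used later in the search therefore carries a comfortable margin over the roughly 43% breakeven level, before costs.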

Step 2. Creation of a pattern search workspace

[Image: DIAT4S3_WKS]

The workspace is created by selecting the following:

a) The T/S file we created, named T4S3.trs
b) The in-sample data file DIA.txt
c) The trade parameters: "%" is checked to indicate that the values in the selected T/S file stand for percentages of the entry price, which is selected as the open of the next bar under Inputs. The Delay input is left unchecked. (Using delay inputs will be the subject of another post.)
d) Under search parameters we enter 50.00 for the minimum win rate (in percent) for both long and short patterns, 29 for the minimum number of trades, 1.50 for the minimum profit factor and 3 for the maximum number of consecutive losers. The other parameters stay at their default values.
e) The date range in the data file is shown under File Date Range; in this case it corresponds to the in-sample range. The Search Range is left at 500, which means that Price Action Lab will require every pattern that satisfies the performance criteria set in (a) - (d) to have at least one historical trade in the most recent 500 bars. Finally, the Extended search option is checked.

Step 3. In-sample results

Price Action Lab will run for a length of time that depends on CPU speed; in this particular case the search completes in about 15-20 minutes on average. The output should look like the one below (sorted by highest win rate):

[Image: DIAT4S3_IN]

Each line in the above results corresponds to a price pattern that satisfies the performance parameters specified by the user. Index and Index Date are used internally to classify patterns. Trade on is the entry point, in this case the Open of the next bar. P is the success rate of the pattern, PF is the profit factor, Trades is the number of historical trades, CL is the maximum number of consecutive losers, Type is LONG for long patterns and SHORT for short patterns, Target is the profit target, Stop is the stop-loss, and C indicates whether the exits are in % or points, in this case "%". Last Date and First Date are the last and first dates in the historical data file.
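For readers who want to reproduce the key columns, here is a minimal sketch of how P, PF, Trades and CL could be computed from a list of per-trade returns. This is a generic illustration with made-up numbers, not Price Action Lab's internal code.

```python
def pattern_metrics(trade_returns):
    """Compute the report's key columns from per-trade returns:
    P (win rate, %), PF (profit factor), Trades, CL (max consecutive losers)."""
    wins = [r for r in trade_returns if r > 0]
    losses = [r for r in trade_returns if r <= 0]
    gross_profit = sum(wins)
    gross_loss = -sum(losses)
    cl = run = 0
    for r in trade_returns:
        run = run + 1 if r <= 0 else 0   # extend or reset the losing streak
        cl = max(cl, run)
    return {
        "P": 100.0 * len(wins) / len(trade_returns),
        "PF": gross_profit / gross_loss if gross_loss else float("inf"),
        "Trades": len(trade_returns),
        "CL": cl,
    }

# Hypothetical trade sequence: +4% winners, -3% losers.
print(pattern_metrics([4, -3, 4, 4, -3, 4, -3, -3, 4]))
```

For this toy sequence the win rate is 5/9 (about 55.6%), the profit factor is 20/12 (about 1.67) and the longest losing streak is 2.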

It may be seen from the results that Price Action Lab found 10 distinct patterns, 8 long and 2 short, which satisfied the performance criteria specified in the workspace for the in-sample. One could argue that these patterns are random, a result of survivorship bias, i.e., they survived by chance alone and have no predictive power. This is the reason an in-sample was used to search for the patterns and the out-of-sample will be used for cross-validation. Please note that:

– All patterns will be used in the out-of-sample testing as components of a trading system. There is no selection made from the results obtained.
– Price Action Lab does not look at the out-of-sample when searching the in-sample; it does not even know whether such a sample is available. This is very important, because a program that selects patterns from the in-sample that also worked well in the out-of-sample engages in an extremely bad (and even deceptive) practice known as data-snooping. Unfortunately, several programs sold to traders for discovering trading systems do allow this bad practice.
– Price Action Lab does not find systems whose parameters can later be varied to fit their performance in the out-of-sample. All patterns found are parameter-free, eliminating a serious concern about curve-fitting.
– Price Action Lab does not use genetic programming, neural networks, permutations or other methods employed by some other programs that very often produce random and curve-fitted systems. It is based on a proprietary deterministic algorithm that produces the same output each time it encounters the same conditions, in compliance with the standards of scientific testing and analysis.

The Test Patterns function is used next to get a quick calculation of key performance statistics in both the in-sample and the out-of-sample.

In-sample results 

[Image: DIAT4S3_IN_RES]

As expected by design, the system of 10 patterns is profitable in the in-sample when no multiple open positions are allowed, and the equity curve is upward sloping and relatively smooth. The win rate is 57.02%, the profit factor is 1.79 and the payoff ratio (reward/risk ratio) is slightly better than expected at 1.35. A more detailed backtesting analysis can be obtained by generating code for one of the supported platforms and testing the system of patterns there.
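These three statistics can be cross-checked against each other, since the profit factor implied by a win rate W and an average payoff ratio is PF = payoff x W / (1 - W). A quick sanity check on the reported in-sample figures:

```python
def implied_profit_factor(win_rate, payoff_ratio):
    """Profit factor implied by a win rate W and average payoff ratio:
    PF = payoff * W / (1 - W)."""
    return payoff_ratio * win_rate / (1 - win_rate)

# Reported in-sample stats: win rate 57.02%, payoff ratio 1.35.
print(f"{implied_profit_factor(0.5702, 1.35):.2f}")  # 1.79
```

The implied value matches the reported profit factor of 1.79, so the three in-sample statistics are internally consistent.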

Out-of-sample results

The system of 10 patterns is tested in the out-of-sample for no multiple positions after selecting the relevant data file:

[Image: DIAT4S3_OUT_RES]

The win rate is 53.70%, the profit factor is 1.51 and the payoff ratio (reward/risk ratio) is roughly as expected at 1.30. The out-of-sample equity curve is quite acceptable, with an upward slope. A more detailed backtesting analysis can be obtained by generating code for one of the supported platforms and testing the system of patterns there.

In the out-of-sample, 3 patterns failed, two long and one short, as shown below:

[Image: DIAT4S3_OUT]

One of the long patterns failed only marginally, with a profit factor of 0.98. Thus, 7 of the 10 patterns performed well in the out-of-sample.

Portfolio backtest

A portfolio backtest on 12 popular ETFs (DBC, DIA, EEM, GLD, IWM, QQQ, SLV, SPY, TLT, USO, XLE, XLF) is used to minimize the probability of a TYPE-II error, although this probability can never become zero. The results are shown below:

[Image: DIAT4S3_OUT_PB]

Two of the patterns that failed the out-of-sample test also fail the portfolio backtest, maintaining consistency. In addition, most of the profit factor values are close to or above 1.30.
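A portfolio check of this kind can be sketched as a simple screen over per-symbol profit factors. The figures below are illustrative placeholders, not the actual backtest results.

```python
ETFS = ["DBC", "DIA", "EEM", "GLD", "IWM", "QQQ",
        "SLV", "SPY", "TLT", "USO", "XLE", "XLF"]

# Hypothetical profit factors from backtesting the pattern set on each ETF.
results = dict(zip(ETFS, [1.4, 1.8, 1.2, 1.5, 1.3, 1.6,
                          0.9, 1.7, 1.3, 1.1, 1.4, 1.2]))

def portfolio_check(per_symbol_pf, threshold=1.0):
    """Flag symbols where the pattern set's profit factor falls below the
    threshold; widespread failure would suggest the patterns are noise."""
    return {sym: pf for sym, pf in per_symbol_pf.items() if pf < threshold}

print(portfolio_check(results))  # {'SLV': 0.9}
```

An isolated failure on one or two symbols is tolerable; failure across most of the portfolio would argue that the in-sample patterns were random after all.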

Comments and FAQ

How long will these patterns remain profitable? This is unknown. However, the real value of software like Price Action Lab is that the search can be repeated every six months or every year, for example, and the systems can be updated. This is not like buying a black box and having to live with it.

Can software tools make traders a lot of money? No tool can make anyone any money by itself. It is the knowledge of how to use the tool properly, and the hard work put in, that makes the money. For example, one person can use a hammer to build a beautiful house while another uses it to vandalize a work of art.

Related posts

http://www.priceactionlab.com/Blog/2012/06/fooled-by-randomness-through-selection-bias/
http://www.priceactionlab.com/Blog/2013/04/passive-investing-in-stock-indices-involves-substantial-risks/
http://www.priceactionlab.com/Blog/2012/08/fooled-by-multiple-comparisons-when-developing-systems-for-single-securities-or-portfolios/
http://www.priceactionlab.com/Blog/2011/09/curve-fitting-and-optimization/
http://www.priceactionlab.com/Blog/2010/10/proper-use-of-back-testing/

Disclosure: no relevant position at the time of this post.
