It has been more than 30 years since the first software programs for backtesting trading strategies were made available to the public, yet misconceptions about this analysis process are still widespread, even among quants.
Software programs for backtesting trading strategies first became commercially available in the late 1980s. They offered quants advanced capabilities and significantly reduced strategy development and testing timeframes. But those capabilities came with the risk of data-mining bias and the rapid production of spurious strategies: testing a large number of strategies and keeping the few that performed best ignores the majority that failed along the way. This, in a nutshell, is the problem of data mining: strategies that pass all validation tests may in reality be random, and their performance may be accidental. Only actual use provides the ultimate test.
Despite the many articles and papers written about the perils of backtesting, posts in financial media make clear that many people, even quants, do not understand them. For most readers, the message of academic papers is obscured by dense mathematics and terminology. The purpose of this article is to present the problems of backtesting in "plain language." There are many, but the three below are the most serious, the ones everyone should know, and also the ones rarely discussed in articles, books, or academic papers.
1. Backtesting can only be used to reject strategies but never accept them
Some people apparently believe that if they follow all the proper validation steps, good backtest performance implies something about future performance. It does not. Backtesting can only tell you whether to reject a strategy, and even that decision may involve an error, as discussed below in more detail. The underlying idea is that if a strategy did not perform well over a sufficiently long period in the past, it has low chances of performing well in the future. This is a reasonable assumption, but it fails when there is a regime change: a strategy that performed well in the past may start failing, or a strategy that performed badly in the past may start performing well. The former is a false positive, or Type-I error; the latter is a false negative, or Type-II error. Both errors can be detrimental to the future success of quants: trading a bad strategy may be as costly as missing a good one. For more information see this paper.
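To make the regime-change failure mode concrete, here is a minimal Python sketch. It assumes synthetic Gaussian daily returns whose drift flips sign between the "backtest" and "live" windows; all parameter values are hypothetical and chosen purely for illustration.

```python
import random
import statistics

# Toy illustration of a Type-I error caused by a regime change:
# an always-long rule looks good in the backtest window (positive
# drift) and fails "live" after the drift flips sign. The drift
# and volatility values below are assumptions, not market data.
random.seed(42)

N = 2000
backtest = [random.gauss(0.001, 0.01) for _ in range(N)]   # regime 1: upward drift
live = [random.gauss(-0.001, 0.01) for _ in range(N)]      # regime 2: drift reverses

print(f"backtest mean daily return: {statistics.mean(backtest):+.5f}")
print(f"live mean daily return:     {statistics.mean(live):+.5f}")
```

Under these assumed parameters the rule shows a positive mean return in the backtest and a negative one live, even though the validation procedure was identical in both windows; only the regime changed.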
2. Backtesting cannot be used for point forecasts in most cases
I see this error repeated almost daily in financial social media: people backtest a pattern to decide whether the next trade will be profitable. Ensemble averages calculated by backtesting, such as the profit factor and expectancy, apply only in the limit of sufficient samples, i.e., in the ensemble domain, and usually say very little about the next step in the time domain. For example, even if the win rate is 80% and the annualized return is 10%, the next trade may be a loser, and that loss may barely affect the longer-term ensemble averages. Backtesting results may therefore support point forecasts only in a specific context that depends on prevailing market conditions. Those who try to obtain short-term forecasts from backtests often do not understand these limitations.
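The gap between the ensemble domain and the time domain can be shown with a short simulation. The 80% win rate and the payoff sizes below are assumed for illustration, not taken from any real strategy.

```python
import random

# Toy illustration: ensemble statistics from a backtest (win rate,
# profit factor) describe long-run averages, not the next trade.
# Payoffs and win probability are assumed values.
random.seed(7)

WIN, LOSS = 0.01, -0.03   # +1% winners, -3% losers (assumed)
P_WIN = 0.80

trades = [WIN if random.random() < P_WIN else LOSS for _ in range(10_000)]

wins = [t for t in trades if t > 0]
losses = [t for t in trades if t < 0]
win_rate = len(wins) / len(trades)
profit_factor = sum(wins) / abs(sum(losses))

print(f"win rate:      {win_rate:.3f}")
print(f"profit factor: {profit_factor:.3f}")
```

Even with a backtested win rate near 80% and a profit factor above 1, the probability that the very next trade is a loser remains about 20%; the ensemble average carries almost no information about any single point in the time domain.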
3. The higher the frequency of backtesting, the higher the probability of spurious results
If one keeps backtesting strategies, there is a high probability of being fooled by one that passes all validation tests (out-of-sample, cross-validation, etc.). This is counter-intuitive to most people, who expect that the more they try, the higher their chances of success, but that is not how backtesting works. In fact, infrequent backtesting of unique hypotheses, where out-of-sample data are used only once, has the highest probability of success. Frequent backtesting runs into the problem of multiple comparisons, and the odds of spurious results increase rapidly. This is also the main problem with using genetic algorithms and machine learning to identify trading strategies.
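The multiple-comparisons effect can be quantified with a back-of-the-envelope calculation: if each spurious strategy independently has probability alpha of passing a validation test by chance, then after testing N candidates the probability of at least one false pass is 1 - (1 - alpha)^N. A short sketch, using alpha = 0.05 purely as an illustrative significance level:

```python
# Family-wise error rate under repeated, independent backtests.
# alpha is the assumed chance that one spurious strategy passes
# validation by luck; the candidate counts are illustrative.
ALPHA = 0.05

for n_tests in (1, 10, 50, 100, 300):
    p_false_pass = 1 - (1 - ALPHA) ** n_tests
    print(f"{n_tests:4d} strategies tested -> P(>=1 spurious pass) = {p_false_pass:.3f}")
```

With these assumptions, testing just 10 candidates already gives roughly a 40% chance of a spurious pass, and by 100 candidates a false discovery is nearly certain; this is why using out-of-sample data only once on a unique hypothesis matters.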
The following summarizes backtesting outcomes and their approximate frequencies according to the experience of this author.
Positive backtest results (after validation) can lead to either positive or negative actual performance; the positive case has low frequency, about 2.5%. Negative backtest results can also lead to either positive or negative performance, and the positive case again has a frequency of about 2.5%. Overall, actual results are negative about 95% of the time and positive only 5% of the time. In reality, since Type-II errors (missed discoveries) are known only in hindsight, the longer-term success rate of backtesting is 2.5% or lower, due to high Type-I error rates (false discoveries) and the fact that many strategies are spurious from the start.
Finding strategies via backtesting is both an art and a science. It is not a programming problem, nor a matter of special skill with some software language and its libraries. It is a difficult problem that goes beyond data science. In fact, an experienced trader with only Excel or BASIC skills can succeed where quants with excellent knowledge of data science may fail. Let that sink in.