More on the Perils of Statistical Hypothesis Testing – Part II

Hypothesis tests based on one-sample t-tests and on bootstraps can be misleading, especially when returns are generated by an over-fitted trading strategy designed via machine learning and multiple comparisons.

In Part I, I used an example to show how the paired-sample t-test can be abused to generate misleading results. In that example, the hypothesis test involved the difference between the mean returns of the in-sample and out-of-sample tests. The result was not significant, but the strategy was risky due to excessive drawdown. The point was that when presented with hypothesis testing results, we may also have to look at other important performance parameters, such as the maximum drawdown. We should also have an idea of how the strategy was generated. If the generation process involved multiple comparisons, then any results based on hypothesis tests can be misleading unless proper adjustments are made to reduce the bias due to data-snooping.

In this part, I use another simple example to discuss how one-sample t-tests and non-parametric bootstrap tests can be misleading in the case of over-fitted systems.

Let us consider the example of a 50-200 moving average crossover system, taking long and short positions in SPY, with data from inception to 10/02/2015. This system is profitable, but the maximum drawdown is -35%, which is highly undesirable. The equity performance and related statistics are shown on the chart below:


The above equity curve was generated by 20 trades with a mean return of 14.16% and a standard deviation of 31.82%. Although CAR is 9.55%, which is above the SPY buy and hold total return CAR of 8.80% in the same period, the large maximum drawdown makes this a risky strategy. Normally, one would like to decrease the drawdown without affecting the return, or even while increasing it.
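For readers who want to see the mechanics, the crossover rule behind a system of this kind can be sketched in a few lines of Python. This is only an illustration under assumptions of mine: the function names are hypothetical, the position is simply +1 (long) when the fast average is above the slow one and -1 (short) otherwise, and details such as execution on the next bar, commissions and dividends are ignored. It is not the Amibroker code used for the backtest.

```python
def sma(prices, window):
    """Simple moving average; None until `window` prices are available."""
    out = [None] * len(prices)
    running = 0.0
    for i, price in enumerate(prices):
        running += price
        if i >= window:
            running -= prices[i - window]  # drop the price leaving the window
        if i >= window - 1:
            out[i] = running / window
    return out


def crossover_positions(prices, fast, slow):
    """+1 (long) when fast SMA > slow SMA, -1 (short) otherwise, 0 during warm-up.

    Assumption: a tie between the two averages is treated as short.
    """
    fast_ma = sma(prices, fast)
    slow_ma = sma(prices, slow)
    positions = []
    for f, s in zip(fast_ma, slow_ma):
        if f is None or s is None:
            positions.append(0)   # not enough history yet
        elif f > s:
            positions.append(1)
        else:
            positions.append(-1)
    return positions


# Tiny made-up price series, with 2/3 windows instead of 50/200 for brevity.
toy_prices = [1, 2, 3, 4, 5, 4, 3, 2, 1]
toy_positions = crossover_positions(toy_prices, fast=2, slow=3)
print(toy_positions)
```

With real daily closes, the same function would be called with fast=50 and slow=200; a trade begins each time the position changes sign.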

Nowadays there is software that applies evolutionary algorithms for parameter optimization and even synthesizes strategies so that return is maximized and drawdown is minimized, among other possibilities. Note that this can also be done by exhaustive permutations. Actually, evolutionary algorithms and permutation methods are related in some important ways, but this will be the subject of another blog.

After doing the optimization, we find that a fast moving average of 28 days and a slow moving average of 295 days produce a higher CAR of 12.40% with a maximum drawdown of -19.90%. The Sharpe ratio increases from 0.43 to 0.53. The mean trade return is 25.66% and the standard deviation is 40.31% for a sample of 14 trades. The equity curve with the statistics is shown below:


The results of a one-tailed hypothesis test using the one-sample t-test and a bootstrap of trade returns are shown in the table below:

Test               T-value    P-value
One-sample t-test  2.382035   0.016591
Bootstrap          –          0.0117
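The t-test row can be checked from the summary statistics quoted earlier (mean 25.66%, standard deviation 40.31%, 14 trades). The sketch below uses only the Python standard library; the function names are mine, the standard deviation is assumed to be the sample standard deviation, and the one-tailed P-value is obtained by numerically integrating the Student t density with 13 degrees of freedom. Because the inputs are rounded, the output agrees with the table only to about three decimals.

```python
import math


def t_pdf(x, df):
    """Density of the Student t distribution with `df` degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def upper_tail_p(t, df, steps=2000):
    """P(T >= t), using Simpson's rule on [0, |t|] (steps must be even)."""
    h = abs(t) / steps
    total = t_pdf(0.0, df) + t_pdf(abs(t), df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3          # probability mass between 0 and |t|
    upper = 0.5 - area            # mass beyond |t| on one side
    return upper if t >= 0 else 1 - upper


def one_sample_t(mean, sd, n, mu0=0.0):
    """t statistic and one-tailed P-value for H0: population mean = mu0."""
    t = (mean - mu0) / (sd / math.sqrt(n))
    return t, upper_tail_p(t, n - 1)


t_value, p_value = one_sample_t(mean=25.66, sd=40.31, n=14)
print(f"t = {t_value:.4f}, one-tailed p = {p_value:.4f}")
```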

It may be seen from the above table that the result is significant. The bootstrap p-value is smaller than that obtained by the t-test, as expected, because non-parametric tests tend to produce more optimistic results in most cases.
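For reference, one common way to run a bootstrap test of the zero-mean hypothesis is sketched below. I am not claiming this is the exact procedure behind the P-value in the table: the recipe (shift the sample so that its mean is zero, resample with replacement, count how often the resampled mean reaches the observed mean) is an assumption, and the trade returns are made-up numbers used only to make the example runnable, since the 14 individual trade returns are not listed in this post.

```python
import random


def bootstrap_p_value(returns, n_resamples=10000, seed=0):
    """One-tailed bootstrap P-value for H0: mean return = 0.

    The sample is centered so that the null hypothesis holds, then resampled
    with replacement. The P-value is the fraction of resampled means that are
    at least as large as the observed mean.
    """
    rng = random.Random(seed)          # fixed seed for reproducibility
    n = len(returns)
    observed = sum(returns) / n
    centered = [r - observed for r in returns]
    hits = 0
    for _ in range(n_resamples):
        resample_mean = sum(rng.choice(centered) for _ in range(n)) / n
        if resample_mean >= observed:
            hits += 1
    return hits / n_resamples


# Hypothetical trade returns in percent (NOT the actual trades of the system).
hypothetical_returns = [58.0, -9.5, 14.2, 91.3, -11.8, 6.4, 37.5,
                        -4.9, 102.7, 11.1, -13.6, 29.8, 24.3, -7.2]
p_boot = bootstrap_p_value(hypothetical_returns)
print(f"bootstrap one-tailed p = {p_boot:.4f}")
```

With only 14 trades, each resample reuses the same handful of values, so the bootstrap distribution is coarse and the P-value should be read with that in mind.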

However, despite the rejection of the null hypothesis that the returns were drawn from a population with zero mean, the fact is that the strategy was over-fitted. Actually, the results are not significant due to data-snooping. In practice, the P-values must be adjusted to account for that. Methods for accomplishing this have been proposed, but I doubt their effectiveness, because some over-fitted systems can produce P-values equal to 0, as my extensive research has shown.

In my opinion, in the era of machine learning and data-mining, hypothesis tests have little or no value and what matters is the size of the samples: the larger the sample, the higher the exposure to market conditions and the higher the probability that averages will converge to the mean of the population, also known as the expectation. Therefore, I believe that many are being fooled by hypothesis tests, including well-known quant traders and book authors, when the objective should be to increase trade samples to several thousand for the purpose of testing significance. That imposes severe constraints on the type of strategies that are viable. For example, trend-following strategies cannot attain significance no matter how many markets are included, due to small trade samples. That does not mean that winning trend-following strategies do not exist, but that their significance is very hard to establish. I discuss practical ways of increasing sample size in my new book, Fooled by Technical Analysis: The perils of charting, backtesting and data-mining.
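The effect of data-snooping on an unadjusted t-test can be illustrated with a small simulation. The sketch below is a simplification under assumptions of mine: every "strategy" is just 14 independent draws from a normal distribution with zero mean (so none has any edge), the strategies are independent of each other (real parameter variants are correlated), and 1.771 is the standard one-tailed 5% critical value for 13 degrees of freedom. It estimates how often the best of N such worthless strategies passes the test.

```python
import math
import random

T_CRIT = 1.771  # one-tailed 5% critical value of Student t with 13 df


def t_stat(sample):
    """One-sample t statistic for H0: mean = 0."""
    n = len(sample)
    mean = sum(sample) / n
    var = sum((x - mean) ** 2 for x in sample) / (n - 1)
    return mean / math.sqrt(var / n)


def best_t_of_many(rng, n_strategies, n_trades):
    """Largest t statistic among strategies that have zero true mean."""
    return max(
        t_stat([rng.gauss(0.0, 1.0) for _ in range(n_trades)])
        for _ in range(n_strategies)
    )


def rate_best_looks_significant(n_strategies, n_trades=14, n_trials=200, seed=1):
    """Fraction of trials in which the selected strategy exceeds T_CRIT."""
    rng = random.Random(seed)
    hits = sum(
        best_t_of_many(rng, n_strategies, n_trades) > T_CRIT
        for _ in range(n_trials)
    )
    return hits / n_trials


single = rate_best_looks_significant(1)      # no selection
snooped = rate_best_looks_significant(100)   # best of 100 candidates
print(f"no selection: {single:.2f}, best of 100: {snooped:.2f}")
```

Under these assumptions the theoretical rates are 5% with no selection and 1 - 0.95^100, or about 99%, for the best of 100 candidates, which is why an unadjusted P-value says little about a strategy that was selected by optimization.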

Charting program: Amibroker

© 2015 Michael Harris. All Rights Reserved. We grant a revocable permission to create a hyperlink to this blog subject to certain terms and conditions. Any unauthorized copy, reproduction, distribution, publication, display, modification, or transmission of any part of this blog is strictly prohibited without prior written permission. 