Backtesting Conundrum

About a week ago I ran into a backtesting conundrum while analyzing a simple relative strength ETF rotation strategy. A call to quants to compare results received only one answer, but the cooperation led to a solution and to a deeper understanding.

In the process of studying the impact of selecting the close as the entry point instead of the open, I realized that I could not reproduce the results of a popular web-based backtester. I then wrote an article about it and challenged quants on Twitter to run their own tests to determine the source of the discrepancy. I have no problem being proved wrong and I consider this part of a learning process. But it appears that not many quants are interested in challenges, even though they frequently use the word “community”. Probably, the word community has a different meaning for different people, depending on their objectives.

The only response was from Cesar Alvarez. He also posted his analysis in an article. We exchanged several emails with results and comments and finally discovered the reason for the discrepancies.

I could not have done it without Cesar's help, because it had not occurred to me to look at the value of the score together with the trade results, as he suggested. Whether negative score values are considered was a hidden assumption that becomes evident only when one focuses on this particular aspect of the strategy in relation to the results. The problem was that I was considering only positive values of relative strength when ranking ETFs, while the web-based backtester, Portfolio Visualizer, also considers negative values if no other options are set (actually, there is only one other option, a moving average filter). As an extreme example, if all ETFs are down 10% over a three-month period, my code will not allocate any capital and will stay flat, but the web-based backtester will still choose the top two ETFs.
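The difference between the two ranking rules can be illustrated with a minimal sketch. The function name `select_top`, its parameters, and the ticker symbols are hypothetical and used only for illustration; this is not the code used in either backtest.

```python
def select_top(scores, n=2, positive_only=False):
    """Return up to n tickers ranked by score, highest first.

    scores: dict mapping ticker -> relative strength score
            (for example, a three-month return).
    positive_only: if True, tickers with a score of zero or less
            are dropped before ranking.
    """
    candidates = list(scores.items())
    if positive_only:
        candidates = [(t, s) for t, s in candidates if s > 0]
    ranked = sorted(candidates, key=lambda item: item[1], reverse=True)
    return [ticker for ticker, _ in ranked[:n]]


# Hypothetical scores: every ETF is down over the lookback period.
scores = {"AAA": -0.10, "BBB": -0.04, "CCC": -0.07}

flat = select_top(scores, positive_only=True)    # no positions, stay flat
forced = select_top(scores, positive_only=False)  # least negative two
print(flat)
print(forced)
```

With the positive-only rule the list is empty and no capital is allocated, while the unrestricted rule still returns the two least negative ETFs.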

Does it make sense to invest in ETFs that are in a downtrend? The answer is that it may or may not, and the effect on the results actually appears to be random. With the particular ETF portfolio I considered, the results are better, but with another portfolio, such as the IVY 5, they are not.

Below are the updated results for the strategy. The close is used as the entry point. The period of the backtest is from 01/2007 to 10/31/2015:

Parameter	Score > 0	All score values	Portfolio Visualizer
CAR	11.89%	14.51%	14.52%
Max. DD	-18.29%	-19.5%	-20.25%
Sharpe	0.60	0.64	0.91

It may be seen that when all values of the score are taken into account, both positive and negative, the backtest results match closely those from Portfolio Visualizer. In this case it is better to also consider negative score values when ranking ETFs. Note that the differences in Sharpe can be attributed to the use of different values for the risk-free interest rate. I used 0 for the value.
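As a rough illustration of how the risk-free rate assumption enters the Sharpe ratio, here is a minimal sketch that annualizes the ratio from monthly returns. The convention used (mean monthly excess return over its sample standard deviation, scaled by the square root of 12) and the sample returns are assumptions for illustration only; neither backtester is known to compute the ratio this way, and the sketch does not attempt to reproduce the figures in the table.

```python
import math
import statistics


def annualized_sharpe(monthly_returns, annual_risk_free=0.0):
    """Annualized Sharpe ratio from monthly returns (assumed convention)."""
    # Convert the annual risk-free rate to an equivalent monthly rate.
    monthly_rf = (1.0 + annual_risk_free) ** (1.0 / 12.0) - 1.0
    excess = [r - monthly_rf for r in monthly_returns]
    return statistics.mean(excess) / statistics.stdev(excess) * math.sqrt(12)


# Hypothetical monthly returns, not taken from the backtests above.
returns = [0.02, -0.01, 0.03, 0.00, 0.015, -0.02,
           0.025, 0.01, -0.005, 0.02, 0.01, 0.005]

sharpe_rf0 = annualized_sharpe(returns, annual_risk_free=0.0)
sharpe_rf2 = annualized_sharpe(returns, annual_risk_free=0.02)
print(round(sharpe_rf0, 2), round(sharpe_rf2, 2))
```

The same return series produces different Sharpe values depending on the risk-free rate supplied, so the ratio is comparable across backtesters only when that input and the calculation convention are known.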

Below are the details of the same strategy with a different ETF portfolio, also known as the IVY 5 in the blogosphere:

Monthly Data: 01/2008 – 10/31/2015
ETFs: BND, DBC, VEU, VNQ, VTI
Score: 3-month ROC
Two open positions only
50% allocation
Long-only
No rebalancing
No commissions
$100K starting capital
Adjusted data from Yahoo Finance
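The settings above can be sketched as a small monthly rotation loop. This is a simplified illustration under stated assumptions, not the code behind the results: the function names, the toy price series and the ticker symbols are hypothetical, and the sketch resets each position to a 50% weight every month, which is a simplification of the "No rebalancing" setting.

```python
def roc(closes, i, lookback=3):
    """Rate of change of closes[i] relative to closes[i - lookback]."""
    return closes[i] / closes[i - lookback] - 1.0


def rotate(monthly_closes, lookback=3, n_positions=2, positive_only=False):
    """Return the list of monthly portfolio returns of a simple rotation.

    monthly_closes: dict ticker -> list of month-end closes, equal lengths.
    At each month-end the tickers are ranked by ROC and the top
    n_positions are held for the next month at 1/n_positions weight each.
    Unfilled slots stay in cash and earn nothing.
    """
    tickers = list(monthly_closes)
    n_months = len(monthly_closes[tickers[0]])
    weight = 1.0 / n_positions
    portfolio_returns = []
    for i in range(lookback, n_months - 1):
        scores = {t: roc(monthly_closes[t], i, lookback) for t in tickers}
        if positive_only:
            scores = {t: s for t, s in scores.items() if s > 0}
        picks = sorted(scores, key=scores.get, reverse=True)[:n_positions]
        month_return = sum(
            weight * (monthly_closes[t][i + 1] / monthly_closes[t][i] - 1.0)
            for t in picks
        )
        portfolio_returns.append(month_return)
    return portfolio_returns


# Hypothetical month-end closes for three made-up tickers.
closes = {
    "X": [100, 100, 100, 110, 121],
    "Y": [100, 100, 100, 90, 99],
    "Z": [100, 100, 100, 80, 72],
}

all_scores = rotate(closes, positive_only=False)
positive_scores = rotate(closes, positive_only=True)
print(all_scores, positive_scores)
```

In this toy data only X has a positive three-month ROC, so the positive-only variant holds one position and keeps the other half in cash, while the unrestricted variant also buys Y.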

Parameter	Score > 0	All score values	Portfolio Visualizer
CAR	9%	8.04%	8.17%
Max. DD	-24.73%	-31.87%	-32.2%
Sharpe	0.54	0.44	0.60

Also in this case, the backtest results for all score values are close to those generated by Portfolio Visualizer.

However, in the above case the results for positive-only scores are better in terms of both CAR and maximum drawdown. Thus, the hidden assumption of considering all score values does not always result in better performance. Nevertheless, it is important to know whether there are hidden assumptions in backtests.

This article presents further evidence that backtesting results may hide assumptions that are often not directly evident or even understood. A more serious corollary is that the backtest results reported in papers or blogs often may not actually correspond to a well-defined strategy. It is a fact that a significant percentage of backtest results in popular books, trading magazines and blogs about the attractiveness of certain strategies are pre-fabricated, in the sense that they are the outcome of data-mining and curve-fitting. Some relative strength ETF rotation strategies fit this category: both the parameters of the score function and the universe of ETFs used are selected post hoc after testing possibly thousands of combinations. Therefore, despite any impressive statistics and outperformance of benchmarks, these strategies do not reflect a potential edge but only random artifacts of the dangerous practice of data-mining and selection bias.

Charting program: Amibroker

© 2015 Michael Harris. All Rights Reserved. We grant a revocable permission to create a hyperlink to this blog subject to certain terms and conditions. Any unauthorized copy, reproduction, distribution, publication, display, modification, or transmission of any part of this blog is strictly prohibited without prior written permission.
