System Parameter Permutation and its recent variant, System Parameter Randomization (collectively called SPx here), are methods of evaluating trading strategies that I mention in my paper “Limitations of Quantitative Claims About Trading Strategy Evaluation.” In this article I respond to comments I received from Dave Walton, the developer of these methods.
The link to a paper by Dave Walton can be found in the reference section of my paper.
In my paper, I state three potential problem areas for SPx:
- The first problem with SPx is that it requires selecting a range of system parameters ex-ante, which can cause selection bias.
- The second problem with SPx is that if it is used repeatedly under multiple trials, it loses its effectiveness due to data snooping.
- The third and more serious problem with SPx is that all tests are conditioned on historical data, so the probability of a Type I error (false discovery) is high.
Dave Walton (DW) made the following comments regarding points 1–3 above in an email to me. I include part of his email here with his permission:
1. DW: While this is true, selection bias would only be an issue if repeated SPx runs were done and then the “optimal” range was chosen. If one follows Braga’s advice, as you state, this would not be done. Stochastic modeling is just a tool and can be misused just like any tool can.
I agree about the last part. My point is that while repeated trials introduce selection bias when parameters are chosen, the upfront selection of parameters is also a source of bias, especially when it is made with knowledge of the performance of other systems and of what parameters have worked in the past. One problem is that a tight range of parameters may exclude performance spoilers, i.e., strategies with highly undesirable performance, while a wide range may include a large number of them. There is no way of knowing the best range of parameters to use, and any specific selection may ignore other selections that would not have worked well, or include only models with good performance, which causes bias. Needless to say, restarting the process after a failure to identify a suitable parameter range introduces data-snooping bias, and some users will be tempted to do that more than once.
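The selection-bias mechanism can be illustrated with a minimal simulation. The sketch below is hypothetical and assumes a strategy with zero true edge: evaluating many parameter variants of it and then keeping the best one inflates apparent performance, even though no variant has any real merit.

```python
import random

random.seed(0)

def simulated_mean_return(n_days=250):
    """Mean daily return of a strategy variant with zero true edge (pure noise)."""
    return sum(random.gauss(0, 0.01) for _ in range(n_days)) / n_days

# Evaluate many parameter variants of a no-edge strategy, then select the best.
n_variants = 50
means = [simulated_mean_return() for _ in range(n_variants)]

best = max(means)
average = sum(means) / n_variants

# Picking the best variant inflates apparent performance even though every
# variant has a true mean of zero: selection bias in miniature.
print(f"average variant mean: {average:+.5f}")
print(f"best variant mean:    {best:+.5f}")
```

The same effect operates, more subtly, when a parameter range is chosen upfront with knowledge of what has worked in the past: the selection has already been made, just outside the backtest.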
2. DW: Same comment as above.
If SPx is not repeated under multiple trials, then validation on out-of-sample data is required to determine the significance of the strategy. A unique hypothesis that is not data-mined requires out-of-sample testing, often in multiple blocks of out-of-sample data, to establish its statistical significance. Out-of-sample testing makes no sense with multiple trials, for example when trying to develop a strategy via machine learning.
In the case of a unique hypothesis, SPx, or a subset of it, could be applied in the out-of-sample, but the power of any tests will be lower. If the developer believes that a set of parameters is appropriate, SPx can be used to evaluate the robustness of the strategy in the full sample. Since the probability of a Type I error (false discovery) is high, this method is probably most useful for asserting the null hypothesis. In fact, I have used similar methods in my book “Fooled By Technical Analysis” in the process of trying to debunk strategies. In that respect, SPx can be useful, but not for deciding which strategy to use, because strategies fail mostly when market conditions change, which is also the message of my paper. The SPx process cannot determine when market conditions will change.
3. DW: Absolutely correct; however, this is not a problem specific to SPP/SPR.
This is true, but SPx was presented as a method of establishing future performance limits, which it essentially cannot do, and this was my point. With a change in market conditions the probability of a false positive (false discovery) is close to 1, but this is true of all evaluation methods.
Dave Walton also made the following comments:
“As you correctly assert, no method of strategy evaluation can guard against this. With that said, I don’t see how your recommendation of “limiting the use of backtesting to a small class of models that have similar characteristics” addresses this either. With that said, I do discuss a method of detecting a major regime shift (which I call strategy failure) in the paper. The method involves breaking the historical data into blocks (say 6 months worth) and running stochastic modeling on these blocks to build an empirical probability distribution of performance attributes. Then in forward testing over 6-months if your worst case scenario (decided ex-ante) is violated you would call the strategy failed-the market conditions have shifted significantly enough that performance expectations cannot be met. To me this is a better approach than trying to determine another metric (you use autocorrelation in the paper) that would be indicative of the market conditions leading to acceptable performance.”
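DW's failure-detection procedure can be sketched in a few lines. The sketch below is a simplified, hypothetical illustration: it omits the parameter-randomization step and uses the raw historical blocks, the 126-day block length stands in for "6 months," and maximum drawdown is an assumed choice of performance attribute. The ex-ante worst case here is simply the worst drawdown observed in any historical block.

```python
import random

random.seed(1)

def max_drawdown(returns):
    """Largest peak-to-trough decline of the cumulative equity curve."""
    equity, peak, worst = 1.0, 1.0, 0.0
    for r in returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = max(worst, (peak - equity) / peak)
    return worst

# Hypothetical daily strategy returns: 5 years of history, split into
# 126-day blocks (roughly six months of trading days each).
history = [random.gauss(0.0005, 0.01) for _ in range(1260)]
block = 126
blocks = [history[i:i + block] for i in range(0, len(history), block)]

# Empirical distribution of a performance attribute (max drawdown) per block.
dd_dist = sorted(max_drawdown(b) for b in blocks)

# Ex-ante worst case: the largest drawdown seen in any historical block.
worst_case = dd_dist[-1]

# Forward test: if the next six months violate the worst case, declare failure.
forward = [random.gauss(-0.002, 0.015) for _ in range(block)]  # simulated regime shift
failed = max_drawdown(forward) > worst_case
print(f"worst historical block drawdown: {worst_case:.2%}")
print(f"forward drawdown: {max_drawdown(forward):.2%}  -> failed: {failed}")
```

In practice the threshold would be decided ex-ante from the stochastic-modeling distribution rather than from the raw blocks, but the comparison logic is the same.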
Limiting backtesting to a small class of models with similar characteristics is a variation of one of Aronson’s heuristics for limiting data-mining bias. Let us keep in mind that SPx will not be used only once and then forgotten forever; the developer will probably try another strategy, or several. The probability of a Type I error (false positive) grows fast with multiple trials involving uncorrelated strategies, and this was the purpose of the recommendation.
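How fast the Type I error grows is easy to quantify for the idealized case of independent trials, each tested at significance level alpha: the probability of at least one false positive is 1 - (1 - alpha)^n. The numbers below assume the conventional alpha = 0.05 for illustration.

```python
# Probability of at least one false positive across n independent trials,
# each tested at significance level alpha: 1 - (1 - alpha)^n.
alpha = 0.05
for n in (1, 5, 10, 20, 50):
    p_any = 1.0 - (1.0 - alpha) ** n
    print(f"{n:3d} uncorrelated trials -> P(at least one false positive) = {p_any:.2f}")
```

Already at 10 uncorrelated trials the chance of at least one false discovery is about 40%, and at 50 trials it exceeds 90%, which is why restricting the class of models tested matters.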
Furthermore, strategy failure on historical data may or may not be due to a change in market conditions. When we refer to a change in market conditions, we mean conditions that were not encountered during testing. One can always fit a strategy to any past market conditions in the in-sample, and even in the out-of-sample if repeated trials are made with enough parameters. Another problem is that one may wait six months, everything will appear to be working fine, and then the strategy may fail in the following six months and even reach the uncle point (ruin).
As a last note, academic papers with results from (re)sampling can be easily criticized, and for this reason passing peer review is hard (this comes from a comment by Prof. V. Zakamulin in one of our conversations). One problem is that it is not clear what (re)sampling does to the original price series. Results may be too optimistic or too pessimistic. For example, if a strategy depends on negative serial correlation to perform well, as is the case with the strategy in Section 3.2 of my paper, then (re)sampling may destroy that correlation and the strategy may fail the test, resulting in a Type II error (missed discovery) if negative serial correlation persists for much longer.
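The destruction of serial correlation by resampling can be demonstrated directly. The sketch below is hypothetical: it generates an AR(1) series with strong negative serial correlation as a stand-in for the kind of return series such a strategy exploits, then shuffles the observations (an i.i.d. resample of the same values) and compares the lag-1 autocorrelation before and after.

```python
import random

random.seed(2)

def lag1_autocorr(x):
    """Sample lag-1 autocorrelation of a sequence."""
    n = len(x)
    m = sum(x) / n
    num = sum((x[i] - m) * (x[i + 1] - m) for i in range(n - 1))
    den = sum((v - m) ** 2 for v in x)
    return num / den

# AR(1) series with strong negative serial correlation (phi = -0.6).
phi, series, prev = -0.6, [], 0.0
for _ in range(2000):
    prev = phi * prev + random.gauss(0, 1)
    series.append(prev)

# Shuffling keeps the marginal distribution but destroys the time structure.
shuffled = series[:]
random.shuffle(shuffled)

print(f"original lag-1 autocorrelation:  {lag1_autocorr(series):+.3f}")
print(f"resampled lag-1 autocorrelation: {lag1_autocorr(shuffled):+.3f}")
```

A strategy whose edge lives in that negative autocorrelation will look unprofitable on the shuffled series, which is exactly the Type II error scenario described above.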
Nevertheless, I find value in SPx as a framework for testing strategies for robustness, but only for the purpose of asserting the null hypothesis of no profitability. Similar tests are already popular, but traders often misuse them because they think the results say something about the future. Traders should always start with a sound process or idea and have good knowledge of parameter ranges.
For someone who is skilled in the art and science of trading strategy development, SPx may be more useful than traditional statistical significance tests because, among other things, it can assist in determining capital requirements from worst-case drawdown conditions and in testing the performance of various position sizing schemes. Of course, academic papers always strive to prove the efficient market hypothesis and assert the null hypothesis, and therefore never reach the practical level practitioners are interested in when applying their strategies.
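The capital-requirement use can be sketched as follows. This is a hypothetical illustration, not the SPx procedure itself: it resamples a backtest's per-trade returns many times, builds a distribution of maximum drawdowns, and sizes a capital buffer from a high percentile of that distribution. The 95th percentile and the 2x buffer rule are illustrative assumptions.

```python
import random

random.seed(3)

def max_drawdown(returns):
    """Largest peak-to-trough decline of the cumulative equity curve."""
    equity, peak, worst = 1.0, 1.0, 0.0
    for r in returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = max(worst, (peak - equity) / peak)
    return worst

# Hypothetical per-trade returns from a backtest of the strategy.
trades = [random.gauss(0.004, 0.02) for _ in range(200)]

# Resample the trade sequence many times and record each run's max drawdown.
dds = sorted(max_drawdown(random.choices(trades, k=len(trades)))
             for _ in range(1000))

# Size capital from a high percentile of the drawdown distribution, e.g.
# require the 95th-percentile drawdown to consume at most half of equity.
dd95 = dds[int(0.95 * len(dds))]
capital_multiple = 2 * dd95  # illustrative buffer rule, not a recommendation
print(f"95th-percentile max drawdown: {dd95:.1%}")
print(f"capital buffer (2x rule):     {capital_multiple:.1%} of notional")
```

Repeating the same loop with different position sizing rules applied to the resampled trade sequences is one way to compare sizing schemes on the same footing.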
Finally, you can find an article about SPP in the QUSMA blog. There are some interesting comments at the end.