System Parameter Permutation and its recent variant, System Parameter Randomization, called SPx, are methods of evaluating trading strategies that I mention in my paper “Limitations of Quantitative Claims About Trading Strategy Evaluation.” In this article I respond to comments I received from Dave Walton, the developer of these methods.

The link to a paper by Dave Walton can be found in the reference section of my paper.

In my paper, I state three potential problem areas for SPx:

- The first problem with SPx is that it requires selecting ex-ante a range of system parameters and this can cause selection bias.
- The second problem with SPx is that if it is repeatedly used under multiple trials, then it loses its effectiveness due to data snooping.
- The third and more serious problem with SPx is that all tests are conditioned on historical data and the probability of a Type I error (false discovery) is high.

Dave Walton (DW) made the following comments regarding 1 – 3 above in an email to me. I include part of his email here with his permission:

**1. DW:** While this is true, selection bias would only be an issue if repeated SPx runs were done and then the “optimal” range was chosen. If one follows Braga’s advice, as you state, this would not be done. Stochastic modeling is just a tool and can be misused just like any tool can.

I agree about the last part. My point is that while repeated trials will introduce selection bias if parameters are then chosen, upfront selection of parameters is also a source of bias, especially when it is made with knowledge of the performance of other systems and of what parameters have worked in the past. One problem here is that a tight range of parameters may exclude performance spoilers, i.e., strategies with highly undesirable performance, while a wide range may include a large number of them. There is no way of knowing the best range of parameters to use, and any specific selection may ignore other selections that might not have worked well, or include only models with good performance, which causes bias. Needless to say, restarting the process after a failure to identify a suitable parameter range will introduce data-snooping bias, and some users will be tempted to do that more than once.
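To make the range-selection issue concrete, here is a minimal sketch of a parameter-permutation run. It is not the SPx procedure from Dave Walton's paper: the moving-average crossover system, the random-walk price series, the two parameter ranges, and every function name are assumptions chosen only for illustration. It evaluates every parameter pair in a tight and in a wide range and prints the worst, median, and best total return for each.

```python
import itertools
import random
import statistics

def synthetic_prices(n=300, seed=1):
    """Random-walk prices; a stand-in for real data (assumption)."""
    rng = random.Random(seed)
    prices = [100.0]
    for _ in range(n - 1):
        prices.append(prices[-1] * (1.0 + rng.gauss(0.0, 0.01)))
    return prices

def sma(prices, window, t):
    """Simple moving average of the `window` prices ending at index t."""
    return sum(prices[t - window + 1 : t + 1]) / window

def crossover_return(prices, fast, slow):
    """Total return of a long-only fast/slow SMA crossover (hypothetical system)."""
    equity = 1.0
    for t in range(slow - 1, len(prices) - 1):
        # The signal uses prices up to t and is applied to the move from t to t+1.
        if sma(prices, fast, t) > sma(prices, slow, t):
            equity *= prices[t + 1] / prices[t]
    return equity - 1.0

def permutation_results(prices, fast_range, slow_range):
    """Evaluate every valid (fast, slow) pair; return sorted total returns."""
    return sorted(
        crossover_return(prices, f, s)
        for f, s in itertools.product(fast_range, slow_range)
        if f < s
    )

prices = synthetic_prices()
# Both ranges are arbitrary ex-ante choices, which is exactly the issue.
tight = permutation_results(prices, range(8, 12), range(45, 55, 2))
wide = permutation_results(prices, range(2, 30, 3), range(20, 100, 10))

for name, res in (("tight", tight), ("wide", wide)):
    print(name, len(res), round(res[0], 4),
          round(statistics.median(res), 4), round(res[-1], 4))
```

The two summaries come from the same system on the same data; only the ex-ante range differs, and nothing in the output indicates which range is the right one to use.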

**2. DW:** Same comment as above.

If SPx is not repeated under multiple trials, then validation on out-of-sample data is required for determining the significance of the strategy. A unique hypothesis that is not data-mined requires out-of-sample testing, often in multiple blocks of out-of-sample data, to establish its statistical significance. Out-of-sample testing makes no sense with multiple trials, for example when trying to develop a strategy via machine learning.

In the case of a unique hypothesis, SPx, or a subset of it, could be applied to the out-of-sample data, but the power of any tests will be lower. If the developer believes that a set of parameters is appropriate, SPx can be used to evaluate the robustness of the strategy in the full sample. Since the probability of a Type I error (false discovery) is high, this method is probably useful in asserting the null hypothesis. Actually, I have used similar methods in my book “Fooled By Technical Analysis” in the process of trying to debunk strategies. In that respect, SPx can be useful, but not for deciding which strategy to use. This is because strategies fail mostly when market conditions change, which is also the message in my paper. The SPx process cannot determine when market conditions will change.

**3. DW:** Absolutely correct however not a specific problem to SPP/SPR.

This is true, but SPx was presented as a method of establishing future performance limits although it essentially cannot do that and this was the point. With a change in market conditions the probability of a false positive (false discovery) is close to 1. But this is true with all evaluation methods.

Dave Walton also made the following comments:

“As you correctly assert, no method of strategy evaluation can guard against this. With that said, I don’t see how your recommendation of “limiting the use of backtesting to a small class of models that have similar characteristics” addresses this either. With that said, I do discuss a method of detecting a major regime shift (which I call strategy failure) in the paper. The method involves breaking the historical data into blocks (say 6 months worth) and running stochastic modeling on these blocks to build an empirical probability distribution of performance attributes. Then in forward testing over 6-months if your worst case scenario (decided ex-ante) is violated you would call the strategy failed-the market conditions have shifted significantly enough that performance expectations cannot be met. To me this is a better approach than trying to determine another metric (you use autocorrelation in the paper) that would be indicative of the market conditions leading to acceptable performance.”

Limiting backtesting to a small class of models with similar characteristics is a variation of one of Aronson’s heuristics for limiting data-mining bias. Let us keep in mind that SPx will not be used once and then forgotten; the developer will probably go on to try another strategy, or several. The probability of a Type I error (false positive) grows fast with multiple trials involving uncorrelated strategies, and this was the purpose of the recommendation.
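How fast the error grows can be illustrated with the standard family-wise error rate. The sketch below assumes the trials are fully independent and each is tested at a 5% significance level; both are simplifying assumptions, and the function name is illustrative.

```python
def familywise_error(alpha, trials):
    """Probability of at least one false positive across independent
    trials, each tested at significance level alpha."""
    return 1.0 - (1.0 - alpha) ** trials

# Assumed per-trial significance level of 5%.
for n in (1, 5, 10, 20, 50):
    print(n, round(familywise_error(0.05, n), 3))
```

Under these assumptions, ten independent trials already carry roughly a 40% chance of at least one false positive. Strategies with similar characteristics produce correlated results, so the effective number of independent trials is smaller, which is the rationale behind the heuristic.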

Furthermore, strategy failure on historical data may or may not be due to a change in market conditions. When we refer to a change in market conditions, we mean conditions that were not encountered during testing. One can always fit a strategy on any past market conditions in the in-sample, and even in the out-of-sample, if repeated trials are made with enough parameters. Another problem is that one may wait for six months during which everything appears to work fine, and the strategy may then fail in the following six months and even reach the uncle point (ruin).
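The block-based check described in Dave Walton's comment can be sketched roughly as follows. This is a simplification: it skips the stochastic modeling step and uses raw compounded returns per block, and the 126-day block length, the 5% quantile, the simulated return series, and all function names are assumptions made for illustration.

```python
import random

def block_returns(returns, block_len):
    """Compound returns over consecutive, non-overlapping blocks."""
    out = []
    for start in range(0, len(returns) - block_len + 1, block_len):
        equity = 1.0
        for r in returns[start : start + block_len]:
            equity *= 1.0 + r
        out.append(equity - 1.0)
    return out

def worst_case_threshold(blocks, quantile=0.05):
    """Ex-ante threshold: an empirical low quantile of block performance."""
    ordered = sorted(blocks)
    idx = int(quantile * (len(ordered) - 1))
    return ordered[idx]

def strategy_failed(forward_block_return, threshold):
    """Flag failure when forward performance is below the ex-ante threshold."""
    return forward_block_return < threshold

rng = random.Random(7)
# Simulated daily strategy returns: 20 blocks of about six months each (assumption).
history = [rng.gauss(0.0005, 0.01) for _ in range(126 * 20)]
blocks = block_returns(history, 126)
threshold = worst_case_threshold(blocks)

# Two consecutive forward blocks: the second has a deliberately worse drift.
first = block_returns([rng.gauss(0.0005, 0.01) for _ in range(126)], 126)[0]
second = block_returns([rng.gauss(-0.004, 0.01) for _ in range(126)], 126)[0]
for label, value in (("first", first), ("second", second)):
    print(label, round(value, 4), strategy_failed(value, threshold))
```

The check is applied one forward block at a time, so a pass in the first block says nothing about the second, and the loss in a failing block has already been taken by the time the flag is raised.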

As a last note, academic papers with results from (re)sampling can be easily criticized, and for this reason passing peer review is hard. (This comes from a comment by Prof. V. Zakamulin in one of our conversations.) One problem is that it is not clear what (re)sampling does to the original price series. Results may be overly conservative or too pessimistic. For example, if a strategy depends on negative serial correlation to perform well, as is the case with the strategy in Section 3.2 of my paper, then (re)sampling may destroy that correlation and the strategy may fail, resulting in a Type II error (missed discovery) if negative serial correlation persists for much longer.
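A small experiment shows the effect. The sketch below generates returns from an AR(1) process with a negative coefficient and then shuffles them, which is the simplest form of resampling; the coefficient, volatility, sample size, and function names are assumptions for illustration, not values from my paper.

```python
import random

def lag1_autocorr(x):
    """Sample lag-1 autocorrelation of a sequence."""
    n = len(x)
    mean = sum(x) / n
    num = sum((x[t] - mean) * (x[t - 1] - mean) for t in range(1, n))
    den = sum((v - mean) ** 2 for v in x)
    return num / den

def ar1_returns(n, phi, sigma, seed):
    """Returns following r[t] = phi * r[t-1] + noise (assumed model)."""
    rng = random.Random(seed)
    r = [rng.gauss(0.0, sigma)]
    for _ in range(n - 1):
        r.append(phi * r[-1] + rng.gauss(0.0, sigma))
    return r

original = ar1_returns(5000, phi=-0.3, sigma=0.01, seed=3)

# Shuffling keeps the same set of returns but discards their ordering.
shuffled = original[:]
random.Random(4).shuffle(shuffled)

print("original:", round(lag1_autocorr(original), 3))
print("shuffled:", round(lag1_autocorr(shuffled), 3))
```

The shuffled series has the same mean and variance as the original, but its lag-1 autocorrelation is close to zero, so a strategy that relies on the negative dependence has nothing left to exploit. Block-based resampling schemes retain part of the dependence, but how much depends on the block length chosen.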

Nevertheless, I find value in SPx in providing a framework for testing strategies for robustness but only for the purpose of asserting the null hypothesis of no profitability. Similar tests are already popular but traders often misuse them because they think they say something about the future. Traders should always start with a sound process or idea and have a good knowledge of parameter ranges.

For someone who is skilled in the art and science of trading strategy development, SPx may be more useful than traditional statistical significance because among other things it can assist with determining capital requirements from worst case drawdown conditions and testing the performance of various position sizing schemes. Of course, academic papers always strive to prove the efficient market hypothesis and assert the null hypothesis and therefore never get to that practical level practitioners are interested in when applying their strategies.

Finally, you can find an article about SPP in the QUSMA blog. There are some interesting comments at the end.


Michael, thanks for doing a blog post on this. Rather than jumping into your comments line-by-line, I want to assert something directly to frame the problem.

There are many reasons why trading system performance moving forward could be significantly different (poor) than simulated historic performance. There are factors in the control of developers/evaluators, factors that can only be estimated, and there are factors beyond control and estimation. As an illustrative point, good developers eliminate factors such as lookahead bias and survivorship bias. Great developers take steps to reduce data mining bias (which as you mention is a combination of selection bias, overfitting bias), and estimate slippage/market impact. All these steps simply close the gap between simulated results and what might have occurred if the system was traded in the past. Stochastic modeling (SPR as one method) is a tool to help in the second case.

However, there is no method of system development (and likewise evaluation) that can ensure positive future performance and that is because market conditions can shift in the future in ways that have not been seen in the past. There is no way around this. Markets are driven by decisions of people (or by algorithms in proxy) and their aggregate behavior can change. The best we can do is try to exploit some human bias or hard-to-change institutional behavior. We can only hope these biases and behaviors don't change enough to render trading systems failures. There are no guarantees such a change won't happen and result in a major shift in market behavior.

Now to the individual points…

#1: I discuss these points in a podcast at BetterSystemTrader: http://bettersystemtrader.com/051-dave-walton/ . The net is that this is where the art of the process comes in.

#2 (and your last comments): My preferred method is to use SPx in a Bayesian context. The initial SPx distribution would be the prior. As the strategy is forward tested, the subsequent data is incorporated and becomes the posterior. This is an ongoing process, not a one-shot like out-of-sample. This specifically addresses your point about the first 6-month period being OK and then failing later. Note I would also test performance metrics that are absolute vs. over time such as max drawdown. In my opinion having both is superior to detect failure.
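As an illustration only, and not taken from the SPx papers, the prior-to-posterior step could be sketched with a conjugate normal update on the mean monthly return. The normal model, the known observation variance, the sample numbers, and the function name are all assumptions.

```python
import statistics

def normal_update(prior_mean, prior_var, observations, obs_var):
    """Conjugate normal update for a mean with known observation variance.
    Returns the posterior mean and posterior variance."""
    n = len(observations)
    if n == 0:
        return prior_mean, prior_var
    sample_mean = sum(observations) / n
    post_precision = 1.0 / prior_var + n / obs_var
    post_mean = (prior_mean / prior_var + n * sample_mean / obs_var) / post_precision
    return post_mean, 1.0 / post_precision

# Hypothetical SPx output: mean monthly return of each parameter variant.
spx_monthly_means = [0.012, 0.008, 0.015, 0.004, 0.010, 0.007, 0.011]
prior_mean = statistics.mean(spx_monthly_means)
prior_var = statistics.variance(spx_monthly_means)

# Hypothetical forward-test monthly returns, folded in one at a time.
forward = [0.006, -0.010, -0.004, -0.012]
obs_var = 0.03 ** 2  # assumed variance of a single monthly return

mean, var = prior_mean, prior_var
for r in forward:
    mean, var = normal_update(mean, var, [r], obs_var)
    print(round(mean, 5), round(var, 8))
```

Each forward month shifts the estimate and narrows it, so the assessment is ongoing rather than a single pass or fail. The normal assumption is the weak point of this sketch: with heavy-tailed returns the update can look far more certain than it should.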

#3: It is not false discovery if market conditions change and the trading system no longer performs. SPx provides a range of outcomes from historical simulation to which forward performance can be compared.

Dave, thanks for the reply. You wrote:

"However, there is no method of system development (and likewise evaluation) that can ensure positive future performance and that is because market conditions can shift in the future in ways that have not been seen in the past. There is no way around this. "

I could not agree more with this. I mentioned SPx in my paper because I thought it represents an alternative method of evaluating historical performance of strategies and investigating the presence of bias from the practitioner side.

I have used parts of the method in my book "Fooled By Technical Analysis" to investigate the "intelligence" of various strategies, for example replacing the entry logic with a random signal in a moving average crossover system, removing the exit rule of the RSI2 system and replacing it with a fixed target and stop, performing a randomization study and adding additional filters, and also using comparable securities in the case of the z-score system to introduce a variety of market conditions.

In my opinion the proper use of any evaluation method is for providing support to the null hypothesis of zero profitability. I guess we agree that future profitability cannot be supported from historical data by either statistical significance tests, resampling methods or stochastic modeling because the probability of an error is high.

Now, about #2 and #3 and forward performance, I have the following objection: Forward testing with real money can become a costly endeavor. Obviously, there are many candidate strategies and forward testing all of them is not economically justifiable. Traders need to have a good idea of whether a strategy represents a robust alternative before risking any money.

As far as the Bayesian process, I really like that but whether it is successful depends on the nature of the tail of the distribution of returns. The problem with probabilities computed as ratios is that they may be meaningless although the ratio theoretically exists. See this article for more details:

http://www.priceactionlab.com/Blog/2016/03/traders-abuse-probability/

Therefore, these limitations seriously limit the effectiveness of all evaluation methods, not only of SPx, and my main point is that evaluation can only be used to support the null, i.e., not to decide which strategy is good to use but which strategy to discard. As you also say, this is an art and science. Stochastic modeling is a broad concept and its application will vary depending on the strategy configuration and objectives. But as I say in the article, it is a much better method than classical statistical significance of limited scope, although its application demands a high level of quantitative skills and experience.

Best.