Fooled by Multiple Comparisons When Developing Systems For Single Securities or Portfolios

The idea that systems developed on historical data of a portfolio of securities have a better chance of being non-random than systems developed for a single security rests on the assumption that the rule set used in the data-mining process is minimal. When this assumption does not hold, the multiple comparisons involved increase data-mining bias, and the resulting systems have no chance of maintaining their profitability even if they pass an out-of-sample test.

In a post titled “Fooled by Randomness Through Selection Bias” I showed randomly generated equity curves, like the one above, that look quite similar to the equity curves obtained when one applies a very large number of combinations of rules to historical data and then selects one or more of them that validate on out-of-sample data. This is related to selection bias and it is common sense: if one applies many combinations of rules to the same data many times, then the probability of finding a specific combination of those rules that passes both the in-sample and the out-of-sample test is high, and if the number of rule combinations is very high, then for all practical purposes the probability is equal to 1.

But what happens if one tries the same set of rules and their combinations on a portfolio of securities? In this case, the rules and their combinations will be selected so that the portfolio equity performance is positive and, possibly in addition, the equity performance for each individual security is also positive. Does this process increase the probability of finding a statistically significant, non-random system?

Actually, the portfolio case is no different from the single security case if the assumption of a minimal rule set does not hold. This is related to the notorious problem of multiple comparisons in statistics. It is also known as the problem of multiple testing.

The single security case

Let us assume, for modelling purposes only, that random systems can be represented by a fair coin. We toss the coin many times and we observe a high win rate for heads. Is this an unfair coin? Similarly, if we test a single system in-sample and the performance is satisfactory, and then we test it out-of-sample and it generates 100 trades out of which 60 are winning (60% win rate) with a high profit factor, is this a non-random system?

In the coin toss case, one can calculate that the probability of tossing 60 or more heads (winning trades) in a total of 100 trials (trades) is equal to 0.02844. This is a low probability, which indicates high statistical significance for the system, i.e., the hypothesis that the system is random (the coin is fair) can be rejected.
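The figure of 0.02844 is the upper tail of a binomial distribution with 100 trials and success probability 0.5. A minimal sketch of that calculation follows; the function name `binomial_tail` is chosen here for illustration and is not from the original post.

```python
from math import comb


def binomial_tail(k_min: int, n: int, p: float = 0.5) -> float:
    """Probability of at least k_min successes in n independent trials."""
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(k_min, n + 1)
    )


# Fair coin (random system): chance of 60 or more wins in 100 trades.
p_value = binomial_tail(60, 100)
print(f"P(at least 60 wins in 100 trades) = {p_value:.5f}")
```

The same function can be used to see how the result changes with a different number of trades or a different win-rate threshold, for example `binomial_tail(55, 100)`.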

Let us now consider a system that is developed after multiple tests resulting from combining indicators and exit strategies from a large set. This is modelled by tossing many coins. Suppose, for simplicity, that the process evaluates n=10 equity curves in the out-of-sample, each with 100 trades, and flags as non-random any curve with a win rate of 60% or higher. If all 10 systems are in fact random and the tests are independent, the probability that all coins are identified as fair is (1-0.02844)^10 = 0.75. Therefore, there is a probability of about 0.25 that one or more fair coins will be identified as unfair (biased), which is much higher than the 0.02844 of a single test.

It is evident that the probability that all coins are identified as fair decreases as the number of equity curves considered increases. For example, if the number of equity curves increases to 100, this probability decreases to about 0.056. Thus, the probability that some equity curves (systems) will be identified as unfair is about 0.94, which is close to 1.
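The effect of the number of tests can be reproduced with a few lines of code. This is a minimal sketch that assumes the tests are independent, which is a simplification; the function names are illustrative only.

```python
from math import comb


def binomial_tail(k_min: int, n: int, p: float = 0.5) -> float:
    """Probability of at least k_min successes in n independent trials."""
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(k_min, n + 1)
    )


def prob_all_identified_fair(n_systems: int, p_single: float) -> float:
    """Probability that none of n_systems random systems looks significant.

    Assumption: the n_systems tests are independent of each other.
    """
    return (1 - p_single) ** n_systems


p_single = binomial_tail(60, 100)  # about 0.02844 for one fair coin
for n_systems in (1, 10, 100, 1000):
    p_fair = prob_all_identified_fair(n_systems, p_single)
    print(
        f"n = {n_systems:>4}: P(all look fair) = {p_fair:.3f}, "
        f"P(at least one looks unfair) = {1 - p_fair:.3f}"
    )
```

In practice, systems built from overlapping rules on the same data are correlated, so the exact numbers will differ, but the direction of the effect is the same.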

The portfolio case

Now, if we repeat the same process on N securities, there is no change in principle. If the set of rules is relatively large, resulting in millions or even billions of combinations of entries and exits, some may work on all securities in both the in-sample and the out-of-sample. When we get a system that performs well on all securities, in both the in-sample and the out-of-sample, we are still faced with the same problem: is this system random or not?
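A rough toy calculation illustrates why a very large number of combinations can defeat the portfolio requirement. It extends the coin model under strong assumptions that are not from the original analysis: every rule combination is a fair coin on every security, outcomes are independent across securities and across combinations, and "working" means 60 or more wins in 100 trades on each security. Real securities and real rule combinations are correlated, so this is only a sketch of the order of magnitude.

```python
from math import comb, expm1, log1p


def binomial_tail(k_min: int, n: int, p: float = 0.5) -> float:
    """Probability of at least k_min successes in n independent trials."""
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(k_min, n + 1)
    )


def prob_some_random_system_passes(
    n_combinations: int, n_securities: int, p_single: float
) -> float:
    """Probability that at least one random combination passes on every security.

    Assumption: independence across securities and across combinations.
    """
    p_all_securities = p_single**n_securities
    # 1 - (1 - q)^M, computed stably for a tiny q and a huge M.
    return -expm1(n_combinations * log1p(-p_all_securities))


p_single = binomial_tail(60, 100)
n_securities = 5  # hypothetical portfolio size
for n_combinations in (10**3, 10**6, 10**9):
    p_any = prob_some_random_system_passes(n_combinations, n_securities, p_single)
    print(
        f"{n_combinations:>13,} combinations: "
        f"P(at least one passes on all {n_securities} securities) = {p_any:.6f}"
    )
```

Under these assumptions a thousand combinations almost never produce a random system that passes on all five securities, while a billion combinations almost certainly do.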

Due to the complexity of the problem, a closed form solution is difficult to obtain. We have to go back to the basic assumption and see if it is satisfied. The most basic assumption about the bias arising from the repeated application of many combinations of rules to the same data, until one or more of them satisfy some metric or function of metrics, is that it increases with the number of rules employed. This is obvious when compared to the single rule case: as more rules are added, bias increases because the probability of finding random combinations that are identified as significant increases. Therefore, one way to ensure the integrity of the process of finding non-random systems is to reduce the number of rules. But what is a reasonable number of rules to employ?

Let us start from the exit side. Using many exit rules increases the possibility of curve-fitting. When developing trading systems we are interested in seeing whether their signals have some predictive capacity. This can be determined with a small profit target and stop-loss placed outside the daily volatility range. Anything besides that, for example trailing stops, will only contribute to curve-fitting.

The number of rules on the entry side is a much more complicated issue. I do not have a complete answer to this question, but I would argue that using no more than 3 non-correlated, non-interacting rules may avoid increased bias. For example, one moving average, one RSI and one DMI, but not an average of an RSI, etc.

Price patterns that involve no mathematical operation other than Boolean comparisons are essentially a single rule with a parameter vector that determines which of the OHLC prices to consider within a tight range of bars. If mathematical operations are added and the number of bars is left unconstrained, the number of resulting rules becomes virtually infinite and curve-fitting is highly probable. This type of naïve application of price patterns has been the cause of failure for many systematic traders, but also for chartists, because a chart pattern can be thought of as a price pattern with no limit on the number of bars and with additional operations besides Boolean comparisons.


Developing systems on a portfolio of securities does not eliminate any of the fundamental problems of data-mining unless the employed rule set is absolutely minimal. If one tests billions of rule combinations, getting one or more at the end of the run that work well in the out-of-sample for any security, or for a portfolio of any size, is highly probable. The significance of rules discovered through data-mining processes of any kind depends on the quantity and quality of the rule set employed. For example, one should not expect to find anything significant when combining lagging indicators, because these indicators have no predictive capacity and are useless for market timing, and they cannot gain such capacity when combined with other indicators that also lack it. Therefore, it is pointless to even include them in the process.

When I started developing a program in 1993 that could discover trading systems automatically, I used everything that one could find in a technical analysis book. I slowly started discarding lagging indicators and related rules, and after 3 years I was left only with price patterns with a small profit target and stop-loss. This happened because I was actually trading the systems I was developing and I was able to see what works and what does not. Some indicators may work for longer-term trading, with the probability of a large drawdown, but this was a style of trading I was never interested in for the simple reason that it places a heavy discipline burden on a trader and requires large capital. On the other hand, short-term and intraday trading using lagging indicators is doomed because markets are faster than those indicators. The market and our experience of it tell us in what ways to restrict the rule set when data-mining for profitable systems.