Data mining is widely used nowadays in trading algo development. There are several myths about how to deal with data-mining bias; this blog exposes five of them.

**What is data-mining bias?**

At the highest level, data-mining bias results from testing multiple hypotheses on historical data. As the number of tested hypotheses increases, so do the chances of accepting a random rule as genuine. Data-mining bias has a huge adverse effect on the quality of the data-mining process. The myths this article exposes are the result of efforts to deal with a qualitative feature of a process from a quantitative perspective.

**Myth No. 1: Data-mining bias can be measured**

Data-mining bias cannot be effectively measured because it refers to the quality of a process and not to some specific parameter. Any definition of data-mining bias that attributes to it a specific nature, or measure, is the result of an unjustified attempt to quantify a non-quantifiable notion. For example, some quants attempt to measure data-mining bias by generating random data and then applying data mining to it. They hope that by repeating this process many times and obtaining a distribution of some metric of the random results, they will be able to correct the original result from the actual data and compensate for the data-mining bias. However, such methods essentially rank the performance of an algo with respect to the performance of a large number of algos mined on random data, and nothing more than that. There is no justification for the claim that an algo that ranks high will perform better in actual trading. For example, if the original data-mined algo was well fitted to historical data, then it still has a low chance of performing well if market conditions change, regardless of whether it ranked high with respect to algos obtained from random data.
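To make the criticized procedure concrete, here is a minimal sketch of it. Everything in it is an illustrative assumption, not anyone's actual method: a naive momentum "rule family" stands in for the data-mining process, shuffled returns stand in for the random data, and the actual result is ranked against the random-data distribution.

```python
import random

def best_rule_return(returns, lookbacks=range(2, 20)):
    """Toy data mining: pick the lookback whose naive momentum rule
    maximizes total return on this series (illustrative only)."""
    def rule_return(lb):
        total = 0.0
        for i in range(lb, len(returns)):
            # go long if the sum of the last lb returns is positive, else short
            signal = 1 if sum(returns[i - lb:i]) > 0 else -1
            total += signal * returns[i]
        return total
    return max(rule_return(lb) for lb in lookbacks)

random.seed(0)
actual = [random.gauss(0, 0.01) for _ in range(500)]  # stand-in for real returns
actual_best = best_rule_return(actual)

# Distribution of the same metric from rules mined on shuffled (random) data
null = []
for _ in range(100):
    shuffled = actual[:]
    random.shuffle(shuffled)
    null.append(best_rule_return(shuffled))

rank = sum(1 for x in null if x < actual_best) / len(null)
print(f"actual best-rule return ranks above {rank:.0%} of random-data results")
```

Note that the output is exactly what the article describes: a rank relative to random-data results, which by itself says nothing about how the mined rule will perform if market conditions change.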

**Myth No. 2: Out-of-sample testing can deal with data-mining bias**

Out-of-sample tests are part of an overall algo development process that is plagued by data-mining bias. Therefore, they cannot be used to deal with it. If many hypotheses are tested on historical data, some of them will pass out-of-sample tests by chance alone. Therefore, out-of-sample testing does not suffice to guarantee that an algo is genuine and not random. In addition, it is false to call out-of-sample testing a “cross-validation” method. This type of testing is only a validation method that cannot deal with the data-mining bias in the presence of multiple hypothesis testing. Note that “cross-validation” methods are in general hard or impossible to use in trading algo machine learning due to the nature of the process and data used.
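The claim that some random hypotheses will pass out-of-sample tests by chance alone can be illustrated with a toy simulation. All choices below (random long/short signals as "rules", a positive total return as the pass criterion) are illustrative assumptions:

```python
import random

random.seed(1)
n_rules, n_days = 2000, 250

# Random market returns, split into in-sample and out-of-sample halves
returns = [random.gauss(0, 0.01) for _ in range(2 * n_days)]
ins, oos = returns[:n_days], returns[n_days:]

def total_return(signals, rets):
    # sum of daily signed returns for a long/short signal sequence
    return sum(s * r for s, r in zip(signals, rets))

survivors = 0
for _ in range(n_rules):
    # a "rule" here is just a random long/short position each day
    sig_in = [random.choice([-1, 1]) for _ in range(n_days)]
    sig_out = [random.choice([-1, 1]) for _ in range(n_days)]
    # "pass" = profitable both in-sample and out-of-sample
    if total_return(sig_in, ins) > 0 and total_return(sig_out, oos) > 0:
        survivors += 1

print(f"{survivors} of {n_rules} purely random rules passed both tests")
```

Roughly a quarter of completely worthless rules survive both tests here, simply because each independent test is passed by luck about half the time.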

**Myth No. 3: Forward testing can deal with data-mining bias**

When machine learning software vendors are confronted by customers who have used out-of-sample testing without success, they additionally recommend forward testing. However, forward testing is just another form of out-of-sample testing and essentially its continuation. If one tests many hypotheses, then the probability of finding a random one that passes all tests in both the out-of-sample and the forward sample is high. As the number of hypotheses grows large, this probability tends to 1, i.e., finding a random algo that passes all tests becomes virtually certain.
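Under the simplifying assumption that each worthless rule has an independent probability p of passing the full battery of tests by luck, the probability that at least one of n rules survives is 1 - (1 - p)^n, which approaches 1 as n grows:

```python
# Probability that at least one worthless rule survives all tests,
# assuming each has an independent chance p of passing by luck.
p = 0.05  # assumed per-rule probability of passing in-sample + OOS + forward
for n in (1, 10, 100, 1000):
    prob = 1 - (1 - p) ** n
    print(f"{n:>5} rules tested -> P(at least one survivor) = {prob:.4f}")
```

With a modest 5% per-rule pass rate, testing 100 rules already makes a lucky survivor almost certain.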

**Myth No. 4: Monte Carlo Analysis can deal with data-mining bias**

Some algo developers use Monte Carlo analysis to determine the effect of algo parameter variations on performance. If the algo is robust enough when subjected to such variations, then it is perceived as genuine. However, Monte Carlo analysis becomes part of the data-mining process as soon as it is applied. Even if Monte Carlo analysis is performed on unseen data, when a large number of hypotheses are tested, the probability of finding one that passes all out-of-sample, forward sample and Monte Carlo analysis tests is high. Actually, the probability gets close to 1 over time as the same data are reused with the same rules. Any machine learning process, when used repeatedly and extensively, will generate algos that are random but pass all the tests, even when Monte Carlo analysis is used. In addition, note that this type of analysis has many flaws and is inapplicable to trading algo analysis except in the case of a small set of algos.
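For readers unfamiliar with the parameter-perturbation style of Monte Carlo analysis described above, here is a minimal sketch. The backtest function, parameter values, and robustness criterion are all illustrative assumptions:

```python
import random

def backtest(fast, slow, returns):
    """Toy crossover backtest: long when the mean of the last `fast`
    returns exceeds the mean of the last `slow` returns, else short."""
    total = 0.0
    for i in range(slow, len(returns)):
        fast_mean = sum(returns[i - fast:i]) / fast
        slow_mean = sum(returns[i - slow:i]) / slow
        total += returns[i] if fast_mean > slow_mean else -returns[i]
    return total

random.seed(2)
rets = [random.gauss(0, 0.01) for _ in range(750)]  # stand-in return series
base_fast, base_slow = 10, 50

# Perturb the parameters many times and count how often the result stays positive
trials, robust = 300, 0
for _ in range(trials):
    f = max(2, base_fast + random.randint(-3, 3))
    s = max(f + 1, base_slow + random.randint(-10, 10))
    if backtest(f, s, rets) > 0:
        robust += 1

print(f"strategy profitable in {robust / trials:.0%} of perturbed runs")
```

The article's point is that even a high robustness percentage from such an analysis is no protection: repeat the whole loop over enough candidate algos and some random one will look robust.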

**Myth No. 5: If you do not use data-mining, then there is no data-mining bias**

Leda Braga, a famous trading algo developer, repeated this myth in a recent interview. However, the only difference between thinking of a hypothesis and having a computer generate it is the speed of the process. Data-mining bias is always present as long as the hypothesis is tested on historical data. Unless one can think of a **truly unique** hypothesis that no one else has, chances are it has already been data-mined by a computer. Researchers started applying machine learning to the markets in the mid-1980s and since then they have mined data relentlessly. Only those who suddenly discovered machine learning in the last 5-10 years are impressed by it. Most abandoned it in the mid-2000s. Machine learning has little application in developing algos for trading the markets. **One of the most naive things one could do nowadays is to try to combine some indicators with exit rules using some machine learning algorithm** hoping to find gold. Although some results may work for a period of time, they are inherently random, because when one adjusts for the multiple comparisons, the significance of the results is diminished.
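The effect of adjusting for multiple comparisons can be seen with a simple Bonferroni correction, which multiplies a raw p-value by the number m of comparisons tried (the numbers below are purely illustrative):

```python
# Bonferroni adjustment: with m comparisons, a raw p-value p_raw
# corresponds to an adjusted value of min(1, m * p_raw).
p_raw = 0.01  # an impressive-looking backtest result in isolation
for m in (1, 10, 100, 1000):  # number of indicator/exit-rule combinations tried
    p_adj = min(1.0, m * p_raw)
    verdict = "significant" if p_adj < 0.05 else "not significant"
    print(f"m = {m:>4}: adjusted p = {p_adj:.2f} ({verdict})")
```

A result that looks highly significant in isolation loses all significance once a few hundred combinations have been tried, which is exactly the adjustment the article refers to.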

**Methods for dealing with data-mining bias**

Nowadays, methods for dealing with data-mining bias define the nature of the process that can be used to discover trading algos. Therefore, those methods constitute both the edge and the data-mining process. For example, the founder of a well-known fund has promoted a specific method for measuring data-mining bias, and because of that many quants are trying to replicate it. However, if one looks at the performance of that fund, it has been flat or negative in the last several years. Thus, it is evident that the fund has no edge.

The conclusion is that machine learning without an underlying rationale and philosophy for dealing with data-mining bias, one that is not based on methods that become part of the loop as discussed above, is indistinguishable from a random process and from gambling.



© 2015 Michael Harris. All Rights Reserved. We grant a revocable permission to create a hyperlink to this blog subject to certain terms and conditions. Any unauthorized copy, reproduction, distribution, publication, display, modification, or transmission of any part of this blog is strictly prohibited without prior written permission.

Awesome post. I've had numerous arguments with peers along similar lines, especially regarding walk forward testing.

Thanks David. Have a good weekend!

Hi,

Regarding myth 5, you said:

"One of the most naive things one could do nowadays is to try to combine some indicators with exit rules using some machine learning algorithm hoping to find gold. Although some results may work for a period of time, inherently they are random because when one adjusts for the multiple comparisons, the significance of the results is diminished."

So the question is: how can you prove it, and distinguish whether a system stopped working because of data-mining bias rather than, e.g., a decreased signal/noise ratio in the market?

Krzysztof

Hello Krzysztof,

Remember what I wrote: data-mining bias refers to the quality of the process. The adverse effect of data-mining bias is that one may, with high probability, select a random model that will later fail for any number of reasons, including the low signal/noise ratio you mentioned. Therefore, what I'm trying to clarify here is that data-mining bias is not some specific cause but an inherent flaw of a process of developing trading systems.

Great post, thanks! So, how can someone avoid data mining bias?

Disregard my last comment, I didn't see the last part of the article.

Hello Nikita,

Data-mining bias cannot be avoided. It is always there as an integral part of the process and continuously increases. One can only try to minimize it. For example, minimizing the number of different rules tried has a large positive impact, as does trying only rules of a similar "nature", for example moving averages. Price Action Lab, my software, also suffers from data-mining bias, but the effect is orders of magnitude smaller than in cases where many different indicators are available or price patterns are identified by algos with more degrees of freedom, such as some genetic programming and NN tools. See the first link in the article text for more details:

http://www.priceactionlab.com/Blog/2012/06/fooled-by-randomness-through-selection-bias/

Hi,

I too agree that DMB cannot be measured, even approximately.

I also looked briefly at Price Action Lab and I'm impressed with it. It lacks some of the bells and whistles of other programs, but it's based on a sound approach.

Thanks for the interesting articles!