Sharpe Ratio

Mar 17, 2019

Symmetric Confidence Intervals, and Choosing Sides

Consider the problem of computing confidence intervals on the Signal-Noise ratio, which is the population quantity $\zeta = \mu/\sigma$, based on the observed Sharpe ratio $\hat{\zeta} = \hat{\mu}/\hat{\sigma}$. If returns are Gaussian, one can compute 'exact' confidence intervals by inverting the CDF of the non-central $t$ distribution with respect to its non-centrality parameter. More typically one uses an approximate standard error, either the formula published by Johnson & Welch (and much later by Andrew Lo), or the one using higher order moments given by Mertens, and then constructs Wald-test confidence intervals.

Using standard errors yields symmetric intervals of the form $$ \hat{\zeta} \pm z_{\alpha/2} s, $$ where $s$ is the approximate standard error, and $z_{\alpha/2}$ is the normal $\alpha/2$ quantile. As typically constructed, the 'exact' confidence intervals based on the non-central $t$ distribution are not symmetric in general, but are very close, and can be made symmetric. The symmetry condition can be expressed as $$ \mathcal{P}\left(|\zeta - \hat{\zeta}| \ge c\right) = \alpha, $$ where $c$ is some constant.
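
To make the two constructions concrete, here is a minimal sketch in base R, assuming a vector x of i.i.d. Gaussian daily returns; the helper name sr_confint is mine, not from any package. The 'exact' interval inverts the non-central $t$ CDF in its non-centrality parameter via uniroot, while the approximate interval uses the $\sqrt{(1 + \hat{\zeta}^2/2)/n}$ standard error.

# minimal sketch: 'exact' and approximate CIs on the Signal-Noise ratio,
# assuming x holds i.i.d. Gaussian daily returns. (sr_confint is a made-up name.)
sr_confint <- function(x,alpha=0.05) {
    n  <- length(x)
    sr <- mean(x) / sd(x)
    tstat <- sqrt(n) * sr    # non-central t statistic with ncp = sqrt(n) * zeta
    # 'exact': invert the non-central t CDF in its ncp parameter
    invert_ncp <- function(p) {
        uniroot(function(ncp) pt(tstat,df=n-1,ncp=ncp) - p,
                interval=tstat + c(-8,8))$root / sqrt(n)
    }
    exact <- c(invert_ncp(1 - alpha/2),invert_ncp(alpha/2))
    # approximate: Wald interval with the Johnson & Welch / Lo standard error
    se <- sqrt((1 + 0.5 * sr^2) / n)
    approx <- sr + qnorm(c(alpha/2,1 - alpha/2)) * se
    list(sharpe=sr,exact=exact,approx=approx)
}
# e.g. sr_confint(rnorm(504,mean=0.0005,sd=0.01))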

Picking sides

Usually I think of the Sharpe ratio as a tool to answer the question: Should I invest a predetermined amount of capital (long) in this asset? The Sharpe ratio can be used to construct confidence intervals on the Signal-Noise ratio to help answer that question.

Pretend instead that you are more opportunistic: instead of considering a predetermined side to the trade, you will observe historical returns of the asset. Then if the Sharpe ratio is positive, you will consider investing in the asset, and if the Sharpe is negative, you will consider shorting the asset. Can we rely on our standard confidence intervals now? After all, we are now trying to perform inference on $\operatorname{sign}\left(\hat{\zeta}\right) \zeta$, which is not a population quantity. Rather it mixes up the population Signal-Noise ratio with information from the observed sample (the sign of the Sharpe). (Because of this mixing of a population quantity with information from the sample, real statisticians get a bit indignant when you try to call this a "confidence interval". So don't do that.)

It turns out that you can easily adapt the symmetric confidence intervals to this problem. Because you can multiply the inside of $\left|\zeta - \hat{\zeta}\right|$ by $\pm 1$ without affecting the absolute value, we have $$ \left|\zeta - \hat{\zeta}\right| \ge c \Leftrightarrow \left| \operatorname{sign}\left(\hat{\zeta}\right) \zeta - \left|\hat{\zeta}\right|\right| \ge c. $$ Thus $$ \left|\hat{\zeta}\right| \pm z_{\alpha/2} s $$ are $1-\alpha$ confidence intervals on $\operatorname{sign}\left(\hat{\zeta}\right) \zeta$.

Although the type I error rate is maintained, the 'violations' of the confidence interval can be asymmetric. When the Signal-Noise ratio is large (in absolute value), type I errors tend to occur on both sides of the confidence interval equally, because the Sharpe is usually the same sign as the Signal-Noise ratio. When the Signal-Noise ratio is near zero, however, typically the type I errors occur only on the lower side. (This must be the case when the Signal-Noise ratio is exactly zero, since then $\operatorname{sign}\left(\hat{\zeta}\right) \zeta = 0$, which can never exceed the upper bound $\left|\hat{\zeta}\right| + z_{\alpha/2} s \ge 0$.) Of course, since the Signal-Noise ratio is the unknown population parameter, you do not know which situation you are in, although you have some hints from the observed Sharpe ratio.

Before moving on, here we test the symmetric confidence intervals. We vary the Signal-Noise ratio from 0 to 2.5 in 'annual units', draw two years of daily normal returns with that Signal-Noise ratio, pick a side of the trade based on the sign of the Sharpe ratio, then build symmetric confidence intervals using the standard error estimator $\sqrt{(1 + \hat{\zeta}^2/2)/n}$. We build the 95% confidence intervals, then note any breaches of the upper and lower confidence bounds. We repeat this 100,000 times for each choice of SNR.

We then plot the type I rate for the lower bound of the CI, the upper bound and the total type I rate, versus the Signal Noise ratio. We see that the total empirical type I rate is very near the nominal rate of 5%, and this is entirely attributable to violations of the lower bound up until a Signal Noise ratio of around 1.4 per square root year. At around 2.5 per square root year, the type I errors are observed in equal proportion on both sides of the CI.

suppressMessages({
  library(dplyr)
  library(tidyr)
  # https://cran.r-project.org/web/packages/doFuture/vignettes/doFuture.html
  library(doFuture)
  registerDoFuture()
  plan(multicore)
})
# run one simulation of normal returns and CI violations
onesim <- function(n,pzeta,zalpha=qnorm(0.025)) {
    x <- rnorm(n,mean=pzeta,sd=1)
    sr <- mean(x) / sd(x)
    se <- sqrt((1+0.5*sr^2)/n)
    # symmetric CI around the absolute Sharpe
    cis <- abs(sr) + se * abs(zalpha) * c(-1,1)
    # the quantity of interest: sign of the Sharpe times the SNR
    pquant <- sign(sr) * pzeta
    # return indicators of lower and upper violations
    c(pquant < cis[1],pquant > cis[2])
}
# do a bunch of sims, then sum the violations of low and high;
repsim <- function(nrep,n,pzeta,zalpha) {
  jumble <- replicate(nrep,onesim(n=n,pzeta=pzeta,zalpha=zalpha))
  retv <- t(jumble)
  colnames(retv) <- c('nlo','nhi')
  retv <- as.data.frame(retv) %>%
        summarize_all(.funs=sum)
    retv$nrep <- nrep
    invisible(retv)
}
manysim <- function(nrep,n,pzeta,zalpha,nnodes=7) {
  if (nrep > 2*nnodes) {
    # do in parallel.
    nper <- table(1 + ((0:(nrep-1) %% nnodes))) 
    retv <- foreach(i=1:nnodes,.export = c('n','pzeta','zalpha','onesim','repsim')) %dopar% {
      repsim(nrep=nper[i],n=n,pzeta=pzeta,zalpha=zalpha)
    } %>%
      bind_rows() %>%
            summarize_all(.funs=sum) 
  } else {
    retv <- repsim(nrep=nrep,n=n,pzeta=pzeta,zalpha=zalpha)
  }
    # turn sums into means
    retv %>%
        mutate(vlo=nlo/nrep,vhi=nhi/nrep) %>%
        dplyr::select(vlo,vhi)
}

# run a bunch
ope <- 252
nyr <- 2
alpha <- 0.05

# simulation params
params <- data_frame(zetayr=seq(0,2.5,by=0.0625)) %>%
    mutate(pzeta=zetayr/sqrt(ope)) %>%
    mutate(n=round(ope*nyr))

# run a bunch
nrep <- 100000
set.seed(4321)
system.time({
results <- params %>%
  group_by(zetayr,pzeta,n) %>%
    summarize(sims=list(manysim(nrep=nrep,nnodes=7,
                                pzeta=pzeta,n=n,zalpha=qnorm(alpha/2)))) %>%
  ungroup() %>%
  tidyr::unnest() 
})
suppressMessages({
  library(dplyr)
  library(tidyr)
  library(ggplot2)
})
ph <- results %>%
    mutate(vtot=vlo+vhi) %>%
    gather(key=series,value=violations,vlo,vhi,vtot) %>%
    mutate(series=case_when(.$series=='vlo' ~ 'below lower CI',
                                                    .$series=='vhi' ~ 'above upper CI',
                                                    .$series=='vtot' ~ 'outside CI',
                                                    TRUE ~ 'error')) %>%
    ggplot(aes(zetayr, violations, colour=series)) + 
    geom_line() + geom_point(alpha=0.5) + 
    geom_hline(yintercept=c(alpha/2,alpha),linetype=2,alpha=0.5) +
    labs(x='SNR (per square root year)',y='type I rate',
             color='error type',title='rates of type I error when trade side is sign of Sharpe')
print(ph)
plot of chunk error_plot: rates of type I error when trade side is sign of Sharpe

A Bayesian Donut?

Of course, this strategy seems a bit unrealistic: what's the point of constructing confidence intervals if you are going to trade the asset no matter what the evidence? Instead, consider a fund manager whose trading strategies are all above average: she/he observes the Sharpe ratio of a backtest, then only trades a strategy if $|\hat{\zeta}| \ge c$ for some sufficiently large $c$, and picks a side based on $\operatorname{sign}\left(\hat{\zeta}\right)$. This is a 'donut'.

Conditional on observing $|\hat{\zeta}| \ge c$, can one construct a reliable confidence interval on $\operatorname{sign}\left(\hat{\zeta}\right) \zeta$? Perhaps our fund manager thinks there is no point in doing so if $c$ is sufficiently large. I think to do so you have to make some assumptions about the distribution of $\zeta$ and rely on Bayes' law. We did not say what would happen if the junior quant at this shop developed a strategy where $|\hat{\zeta}| < c$, but presumably the junior quants were told to keep working until they beat the magic threshold. If the junior quants only produce strategies with small $\zeta$, one suspects that the $c$ threshold does very little to reject bad strategies; rather, it just slows down their deployment. (In response the quants will surely beef up their backtesting infrastructure, or invent automatic strategy generation.)
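
As a rough illustration of why some prior on $\zeta$ is needed, here is a minimal Monte Carlo sketch. Everything in it is an assumption for illustration: a made-up Gaussian prior on the annualized Signal-Noise ratio, a threshold of 1 per square root year, and the usual normal approximation to the sampling distribution of the Sharpe.

# rough sketch of the 'donut': condition on |Sharpe| >= c, then look at
# sign(Sharpe) * zeta under an assumed (made-up) prior on zeta.
set.seed(1234)
ope  <- 252
n    <- 2 * ope                 # two years of daily returns
cthr <- 1 / sqrt(ope)           # threshold of 1 'per square root year', in daily units
prior_sd <- 0.5 / sqrt(ope)     # assumed prior: annualized zeta ~ N(0, 0.5^2)
nsim  <- 5e4
zeta  <- rnorm(nsim,mean=0,sd=prior_sd)
# normal approximation to the sampling distribution of the Sharpe given zeta
srhat <- rnorm(nsim,mean=zeta,sd=sqrt((1 + 0.5 * zeta^2) / n))
keep  <- abs(srhat) >= cthr
# the quantity the fund manager cares about, among strategies actually traded
traded_snr <- sign(srhat[keep]) * zeta[keep]
quantile(sqrt(ope) * traded_snr, probs=c(0.05,0.50,0.95))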

Generalizing to higher dimensions

The really interesting question is what this looks like in higher dimensions. Now one observes $p$ assets, and is to construct a portfolio on those assets. Can we construct good confidence intervals on the Sharpe ratio of the chosen portfolio? In this setting we have many more possible choices, so a general purpose analysis seems unlikely. However, if we restrict ourselves to the Markowitz portfolio, I suspect some progress can be made. (Although I have been very slow to make it!) I hope to pursue this in a followup blog post.


Mar 03, 2019

A Sharper Sharpe: It's Biased.

Note: This blog post previously analyzed Damien Challet's 'Sharper estimator' of the Signal-Noise Ratio. In a series of blog posts, I had found that the implementation of this estimator appeared to be biased, giving perhaps illusory gains in efficiency. I had noted in this post that I contacted Challet to present my concerns about his code. Since the time I wrote that post, Challet updated his sharpeRratio package to version 1.2. We will perform an investigation of the fixed drawdown estimator and link to it here.


Jul 14, 2018

Distribution of Maximal Sharpe, the Markowitz Approximation

In a previous blog post we looked at a statistical test for overfitting of trading strategies proposed by Lopez de Prado, which essentially uses a $t$-test threshold on the maximal Sharpe of backtested returns based on assumed independence of the returns. (Actually it is not clear if Lopez de Prado suggests a $t$-test or relies on approximate normality of the $t$, but they are close enough.) In that blog post, we found that in the presence of mutual positive correlation of the strategies, the test would be somewhat conservative. It is hard to say just how conservative the test would be without making some assumptions about the situations in which it would be used.

This is a trivial point, but needs to be mentioned: to create a useful test of strategy overfitting, one should consider how strategies are developed and overfit. There are numerous ways that trading strategies are, or could be developed. I will enumerate some here, roughly in order of decreasing methodological purity:

  1. Alice the Quant goes into the desert on a Vision Quest. She emerges three days later with a fully formed trading idea, and backtests it a single time to satisfy the investment committee. The strategy is traded unconditional on the results of that backtest.

  2. Bob the Quant develops a black box that generates, on demand, a quantitative trading strategy, and performs a backtest on that strategy to produce an unbiased estimate of the historical performance of the strategy. All strategies are produced de novo, without any relation to any other strategy ever developed, and all have independent returns. The black box can be queried ad infinitum. (This is essentially Lopez de Prado's assumed mode of development.)

  3. The same as above, but the strategies possibly have correlated returns, or were possibly seeded by published anomalies or trading ideas.

  4. Carole the Quant produces a single new trading idea, in a white box, that is parametrized by a number of free parameters. The strategy is backtested on many settings of those parameters, which are chosen by some kind of design, and the settings which produce the maximal Sharpe are selected.

  5. The same as above, except the parameters are optimized based on backtested Sharpe using some kind of hill-climbing heuristic or an optimizer.

  6. The same as above, except the trading strategy was generally known and possibly overfit by other parties prior to publication as "an anomaly".

  7. Doug the Quant develops a gray box trading idea, adding and removing parameters while backtesting the strategy and debugging the code, mixing machine and human heuristics, and leaving no record of the entire process.

  8. A small group of Quants separately develop a bunch of trading strategies, using common data and tools, but otherwise independently hillclimb the in-sample Sharpe, adding and removing parameters, each backtesting countless unknown numbers of times, all in competition to have money allocated to their strategies.

  9. The same, except the fund needs to have a 'good quarter', otherwise investors will pull their money, and they really mean it this time.

The first development mode is intentionally ludicrous. (In fact, these modes are also roughly ordered by increasing realism.) It is the only development model that might result in underfitting. The division between the second and third modes is loosely quantifiable by the mutual correlation among strategies, as considered in the previous blog post. But it is not at all clear how to approach the remaining development modes with the maximal Sharpe statistic. Perhaps a "number of pseudo-independent backtests" could be estimated and then used with the proposed test, but one cannot say how this would work with in-sample optimization, or the diversification benefit of looking in multidimensional parameter space.

The Markowitz Approximation

Perhaps the maximal Sharpe test can be salvaged, but I come to bury Caesar, not to resuscitate him. Some years ago, I developed a test for overfitting based on an approximate portfolio problem. I am ashamed to say, however, that while writing this blog post I have discovered that this approximation is not as accurate as I had remembered! It is interesting enough to present, I think, warts and all.

Suppose you could observe the time series of backtested returns from all the backtests considered. By 'all' I mean to be very inclusive, counting even the backtests implicitly evaluated if, say, the parameters were somehow optimized by some closed form equation. Let $Y$ be the $n \times k$ matrix of returns, with each row a date, and each column one of the backtests. We suppose we have selected the strategy which maximizes Sharpe, which corresponds to picking the column of $Y$ with the largest Sharpe.

Now perform some kind of dimensionality reduction on the matrix $Y$ to arrive at $$ Y \approx X W, $$ where $X$ is an $n \times l$ matrix, and $W$ is an $l \times k$ matrix, and where $l \ll k$. The columns of $X$ approximately span the columns of $Y$. Picking the strategy with maximal Sharpe now approximately corresponds to picking a column of $W$ that has the highest Sharpe when multiplied by $X$. That is, our original overfitting approximately corresponded to the optimization problem $$ \max_{w \in W} \operatorname{Sharpe}\left(X w\right). $$
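
Concretely, a rank-$l$ factorization of this form can be read off the truncated SVD of $Y$. A minimal sketch, assuming Y holds the $n \times k$ matrix of backtested returns (low_rank_factor is just an illustrative name):

# rank-l approximation Y ~ X %*% W via the truncated SVD
low_rank_factor <- function(Y,l=2) {
    s <- svd(Y,nu=l,nv=l)
    X <- s$u %*% diag(s$d[1:l],nrow=l)   # n x l factor returns
    W <- t(s$v)                          # l x k loadings
    list(X=X,W=W)
}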

The unconstrained version of this optimization problem is solved by the Markowitz portfolio. Moreover, if the returns $X$ are multivariate normal with independent rows, then the distribution of the (squared) Sharpe of the Markowitz portfolio is known, both under the null hypothesis (columns of $X$ are all zero mean), and the alternative (the maximal achievable population Sharpe is non-zero), via Hotelling's $T^2$ statistic.

If $\hat{\zeta}$ is the (in-sample) Sharpe of the (in-sample) Markowitz portfolio on $X$, assumed i.i.d. Normal, then $$ \frac{n (n-l) \hat{\zeta}^2}{l (n - 1)} $$ follows an F distribution with $l$ and $n-l$ degrees of freedom. I wrote the psropt and qsropt functions in SharpeR to compute the CDF and quantile of the maximal in-sample Sharpe to support this kind of analysis.
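
Following that F relationship, the nominal upper cutoff on the in-sample maximal Sharpe can be computed directly in base R. Here is a minimal sketch for $l=2$ and three years of daily data; it should approximately agree with the SharpeR::qsropt call used in the plotting code below.

# nominal 0.05 upper cutoff on the in-sample Sharpe of the Markowitz portfolio,
# following the F relationship above, with l=2 factors and 3 years of daily data.
n   <- ceiling(3 * 252)
l   <- 2
ope <- 252
Fcut <- qf(0.95,df1=l,df2=n-l)
zeta_cut <- sqrt(l * (n - 1) * Fcut / (n * (n - l)))   # daily units
sqrt(ope) * zeta_cut                                   # annualized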

I should note there are a few problems with this approximation:

  1. There is no strong theoretical basis for this approximation: we do not have a model for how correlated returns should arise for a particular population, nor what the dimension $l$ should be, nor what to expect under the alternative, when the true optimal strategy has positive Sharpe. (I suspect that posing overfit of backtests as a Gaussian Process might be fruitful.)

  2. We have to estimate the dimensionality, $l$, which is about as odious as estimating the number of 'pseudo-observations' in the maximal Sharpe test. I had originally suspected that $l$ would be 'obvious' from the application, but this is not apparently so.

  3. Although the returns may live nearly in an $l$ dimensional subspace, we might have selected a suboptimal combination of them in our overfitting process. This would be of no consequence if the $l$ were accurately estimated, but it will stymie our testing of the approximation.

Despite these problems, let us press on.

An example: a two window Moving Average Crossover

While writing this blog post, I went looking for examples of 'classical' technical strategies which would be ripe for overfitting (and which I could easily simulate under the null hypothesis). I was surprised to find that freely available material on Technical Analysis was even worse than I could imagine. Nowhere among the annotated plots with silly drawings could I find a concrete description of a trading strategy, possibly with free parameters to be fit to the data. Rather than wade through that swamp any longer, I went with an old classic, the Moving Average Crossover.

The idea is simple: compute two moving averages of the price series with different windows. When one is greater than the other, hold the asset long, otherwise hold it short. The choice of two windows must be overfit by the quant. Here I perform that experiment, but under the null hypothesis, with zero mean simulated returns generated independently of each other. Any realization of this strategy, with any choice of the windows, will have zero mean returns and thus zero Sharpe.

First I collect 'backtests' (sans any trading costs) of two window MAC for a single realization of returns where the two windows were allowed to vary from 2 to around 1000. The backtest period is three years of daily data. I compute the singular value decomposition of the returns, then present a scree plot of the singular values.

suppressMessages({
  library(dplyr)
  library(fromo)
  library(svdvis)
  library(ggplot2)
})
# return time series of *all* backtests
backtests <- function(windows,rel_rets) {
    nwin <- length(windows)
    nc <- choose(nwin,2) 

    fwd_rets <- dplyr::lead(rel_rets,1)
    # log returns
    log_rets <- log(1 + rel_rets)
    # price series
    psers <- exp(cumsum(log_rets))
    avgs <- lapply(windows,fromo::running_mean,v=psers)

    X <- matrix(0,nrow=length(rel_rets),ncol=2*nc)

    idx <- 1
    for (iii in 1:(nwin-1)) {
        for (jjj in (iii+1):nwin) {
            position <- sign(avgs[[iii]] - avgs[[jjj]])
            myrets <- position * fwd_rets
            X[,idx] <- myrets
            X[,idx+1] <- -myrets
            # advance two columns: the long and the short version
            idx <- idx + 2
        }
    }
    # trim the last row, which has the last NA
    X <- X[-nrow(X),]
    X
}
geomseq <- function(from=1,to=1,by=(to/from)^(1/(length.out-1)),length.out=NULL) {
    if (missing(length.out)) { 
        lseq <- seq(log(from),log(to),by=log(by)) 
    } else {
        lseq <- seq(log(from),log(to),length.out=length.out)
    }
    exp(lseq)
}

# which windows to test
windows <- unique(ceiling(geomseq(2,1000,by=1.15))) 
nobs <- ceiling(3 * 252)
maxwin <- max(windows)
rel_rets <- rnorm(maxwin + 10 + nobs,mean=0,sd=0.01)

XX <- backtests(windows,rel_rets)
# grab the last nobs rows
XX <- XX[(nrow(XX)-nobs+1):(nrow(XX)),]

# perform svd
blah <- svd(x=XX,nu=11,nv=11)
# look at it
ph <- svdvis::svd.scree(blah) +
    labs(x='Singular Vectors',y='Percent Variance Explained') 
print(ph)
plot of chunk mac_one_sim_scree: scree plot of the singular values

I think we can agree that nobody knows how to interpret a scree plot. However, in this case a large proportion of the explained variance seems to be encoded in the first two singular values, which is consistent with my a priori guess that $l=2$ in this case because of the two free parameters.

Next I simulate overfitting, performing that same experiment, but picking the largest in-sample Sharpe ratio. I create a series of independent zero mean returns, then backtest a bunch of MAC strategies, and save the maximal Sharpe over a 3 year window of daily data. I repeat this experiment ten thousand times, and then look at the distribution of that maximal Sharpe.

suppressMessages({
  library(dplyr)
  library(tidyr)
  library(tibble)
  library(SharpeR)
  library(future.apply)
  library(ggplot2)
})
ope <- 252

geomseq <- function(from=1,to=1,by=(to/from)^(1/(length.out-1)),length.out=NULL) {
    if (missing(length.out)) { 
        lseq <- seq(log(from),log(to),by=log(by)) 
    } else {
        lseq <- seq(log(from),log(to),length.out=length.out)
    }
    exp(lseq)
}
# one simulation. returns maximal Sharpe
onesim <- function(windows,n=1000) {
    maxwin <- max(windows)
    rel_rets <- rnorm(maxwin + 10 + n,mean=0,sd=0.01)
    fwd_rets <- dplyr::lead(rel_rets,1)
    # log returns
    log_rets <- log(1 + rel_rets)
    # price series
    psers <- exp(cumsum(log_rets))
    avgs <- lapply(windows,fromo::running_mean,v=psers)

    nwin <- length(windows)
    maxsr <- 0

    for (iii in 1:(nwin-1)) {
        for (jjj in (iii+1):nwin) {
            position <- sign(avgs[[iii]] - avgs[[jjj]])
            myrets <- position * fwd_rets
            # compute Sharpe on some part of this
            compon <- myrets[(length(myrets)-n):(length(myrets)-1)]
            thissr <- SharpeR::as.sr(compon,ope=ope)$sr
            # we are implicitly testing both combinations of long and short here,
            # so we take the absolute Sharpe, since we will always overfit to
            # the better of the two:
            maxsr <- max(maxsr,abs(thissr))
        }
    }
    maxsr
}

windows <- unique(ceiling(geomseq(2,1000,by=1.15))) 
nobs <- ceiling(3 * 252)
nrep <- 10000
plan(multicore)
set.seed(1234)
system.time({
    simvals <- future_replicate(nrep,onesim(windows,n=nobs))
})
    user   system  elapsed 
1214.175    3.439  228.226 

Here I plot the empirical quantiles of the maximal (annualized) Sharpe versus theoretical quantiles under the Markowitz approximation, assuming $l=2$. I also plot the $y=x$ line, and horizontal and vertical lines at the nominal upper $0.05$ cutoff based on the Markowitz approximation.

# plot max value vs quantile
library(ggplot2)
apxdf <- 2.0
ph <- data.frame(simvals=simvals) %>%
    ggplot(aes(sample=simvals)) + 
    geom_vline(xintercept=SharpeR::qsropt(0.95,df1=apxdf,df2=nobs,zeta.s=0,ope=ope),linetype=3) +
    geom_hline(yintercept=SharpeR::qsropt(0.95,df1=apxdf,df2=nobs,zeta.s=0,ope=ope),linetype=3) +
    stat_qq(distribution=SharpeR::qsropt,dparams=list(df1=apxdf,df2=nobs,zeta.s=0,ope=ope)) +
    geom_abline(intercept=0,slope=1,linetype=2) + 
    labs(title='empirical quantiles of maximal Sharpe versus Markowitz Approximation',
             x='theoretical quantile',y='empirical quantile (Sharpe in annual units)')
print(ph)
plot of chunk mac_sims_qq: empirical quantiles of maximal Sharpe versus Markowitz Approximation

This approximation is clearly no good. The empirical rate of type I errors at the $0.05$ level is around 60%, and the Q-Q line is just off. I must admit that when I previously looked at this approximation (and in the vignette for SharpeR!) I used the qqline function in base R, which fits a line based on the first and third quartiles of the empirical distribution. That corresponds to an affine shift of the line we see here, so nothing seemed amiss.

So perhaps the Markowitz approximation can be salvaged, if I can figure out why this shift occurs. Perhaps we have only traded picking a maximal $t$ for picking a maximal $T^2$, and there still has to be a mechanism to account for that. Or perhaps in this case, despite the 'obvious' setting of $l=2$, we should have chosen $l=7$, for which the empirical rate of type I errors is much closer to the nominal rate, though we have no way of seeing that 7 from the scree plot or by looking at the mechanism for generating strategies. Or perhaps the problem is that we have not actually picked a maximal strategy over the subspace, and this technique can only be used to provide a possibly conservative test. In this regard, our test would be no more useful than the maximal Sharpe test described in the previous blog post.


Jun 14, 2018

Distribution of Maximal Sharpe

I recently ran across what Marcos Lopez de Prado calls "The most important plot in Finance". As I am naturally antipathetic to such outsized, self-aggrandizing claims I was resistant to drawing attention to it. However, what it purports to correct is a serious problem in quantitative trading, namely backtest overfit (variously known elsewhere as data-dredging, p-hacking, etc.). Suppose you had some process that would, on demand, generate a trading strategy, backtest it, and present you (somehow) with an unbiased estimate of the historical performance of that strategy. This process might be random generation a la genetic programming, some other automated process, or a small army of very eager quants ("grad student descent"). If you had access to such a process, surely you would query it hundreds or even thousands of times, much like a slot machine, to get the best strategy.

Before throwing money at the best strategy, first you have to identify it (probably via the Sharpe ratio on the backtested returns, or some other heuristic objective), then you should probably assess whether it is any good, or simply the result of "dumb luck". More formally, you might perform a hypothesis test under the null hypothesis that all the generated strategies have non-positive expected returns, or you might try to construct a confidence interval on the Signal-Noise ratio of the strategy with best in-sample Sharpe.

The "Most Important Plot" in all finance is, apparently, a representation of the distribution of the maximal in-sample Sharpe ratio of $B$ different backtests over strategies that are zero mean, and have independent returns, versus that $B$. As presented by its author it is a heatmap, though I imagine boxplots or violin plots would be easier to digest. To generate this plot, note that the smallest value of $B$ independent uniform random variates takes a Beta distribution with parameters $1$ and $B$. So to find the $q$th quantile of the maximum Sharpe, compute the $1-q$ quantile of the $\beta\left(1,B\right)$ distribution, then plug one minus that value into the quantile function of the Sharpe distribution with the right degrees of freedom and zero Signal-Noise parameter. (See Exercise 3.29 in my Short Sharpe Course.)
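
Here is a minimal sketch of that recipe in base R, assuming $B$ independent backtests, each of $n$ i.i.d. Gaussian daily returns with zero Signal-Noise ratio (max_sharpe_quantile is just an illustrative name):

# q-th quantile of the maximal Sharpe over B independent zero-mean backtests,
# each of n i.i.d. Gaussian daily returns, annualized.
max_sharpe_quantile <- function(q,B,n,ope=252) {
    # the min of B uniforms is Beta(1,B); one minus its (1-q) quantile equals q^(1/B),
    # the quantile of a single Sharpe corresponding to the q quantile of the maximum.
    pone <- 1 - qbeta(1 - q,shape1=1,shape2=B)
    sqrt(ope) * qt(pone,df=n-1) / sqrt(n)
}
# e.g. the nominal 0.05 significance threshold for B=1000 backtests over one year:
max_sharpe_quantile(0.95,B=1000,n=252)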

In theory this should give you a significance threshold against which you can compare your observed maximal Sharpe. If yours is higher, presumably you should trade the strategy (uhh, after you pay Marcos for using his patented technology), otherwise give up quantitative trading and become an accountant. One huge problem with this Most Important Plot method (besides its ignorance of entire fields of research on Multiple Hypothesis Testing, False Discovery Rate, Estimation After Selection, White's Reality Check, Romano-Wolf, Politis, Hansen's SPA, inter alia) is the assumption of independence. "Assuming quantities are independent which are not independent" is the downfall of many a statistical procedure applied in the wild (much more so than non-normality), and here is no different. And we have plenty of reason to believe returns from any real strategy generation process would fail independence:

  1. Strategies are typically generated on a limited universe of assets, and using a limited set of predictive 'features', and are tested on a single, often relatively short, history.
  2. Most strategy generation processes (synthetic and human) have a very limited imagination.
  3. Most strategy generation processes (synthetic and human) tend to work incrementally, generating new strategies after having observed the in-sample performance of existing strategies. They "hill-climb".

My initial intuition was that dependence among strategies, especially of the hill-climbing variety, would cause this Most Important Test to have a much higher rate of type I errors than advertised. (This would be bad, since it would effectively still pass dud trading strategies while selling you a false sense of security.) However, it seems that the more innocuous correlation of random generation on a limited set of strategies and assets causes this test to be conservative. (It is similar to Bonferroni's correction in this regard.)

To establish this conservatism, you can use Slepian's Lemma. This lemma is a kind of stochastic dominance result for multivariate normals. It says that if $X$ and $Y$ are $B$-variate normal random variables, where each element is zero mean and unit variance, and if the covariance of any pair of elements of $X$ is no less than the covariance of the corresponding pair of elements of $Y$, then $Y$ stochastically dominates $X$, in the multivariate sense. This is actually a stronger result than what we need, which is stochastic dominance of the maximal element of $Y$ over the maximal element of $X$, which it implies.

Here I simply illustrate this dominance empirically. I create a $B$-variate normal with zero mean and unit variance of marginals, for $B=1000$. The elements are all correlated to each other with correlation $\rho$. I compute the maximum over the $B$ elements. I perform this simulation 100 thousand times, then compute the $q$th empirical quantile over the $10^5$ maximum values. I vary $\rho$ from 0 to 0.75. Here is the code:

suppressMessages({
  library(dplyr)
  library(tidyr)
  library(tibble)
  library(doFuture)
  library(broom)
  library(ggplot2)
})
# one (bunch of) simulation.
onesim <- function(B,rho=0,propcor=1.0,nsims=100L) {
    propcor <- min(1.0,max(propcor,1-propcor))
    rho <- abs(rho)
    # (anti)correlated part; each row is a simulation
    X0 <- outer(array(rnorm(nsims)),array(2*rbinom(B,size=1,prob=propcor)-1))
    # idiosyncratic part
    XF <- matrix(rnorm(B*nsims),nrow=nsims,ncol=B)
    XX <- sqrt(rho) * X0 + sqrt(1-rho) * XF
    data_frame(maxval=apply(XX,1,FUN="max")) 
}
# many sims.
repsim <- function(B,rho=0,propcor=1.0,nsims=1000L) {
    maxper <- 100L
    nreps <- ceiling(nsims / maxper)
  jumble <- replicate(nreps,onesim(B=B,rho=rho,propcor=propcor,nsims=maxper),simplify=FALSE) %>%
        bind_rows()
}
manysim <- function(B,rho=0,propcor=1.0,nsims=10000L,nnodes=7) {
  if ((nsims > 100*nnodes) && require(doFuture)) {
        registerDoFuture()
        plan(multicore)
    # do in parallel.
        nper <- ceiling(nsims / nnodes)
    retv <- foreach(i=1:nnodes,.export = c('B','rho','propcor','nper','onesim','repsim')) %dopar% {
      repsim(B=B,rho=rho,propcor=propcor,nsims=nper) 
    } %>%
      bind_rows()
        retv <- retv[1:nsims,]
  } else {
        retv <- repsim(B=B,rho=rho,propcor=propcor,nsims=nsims)
  }
  retv
}

params <- tidyr::crossing(data_frame(rho=c(0,0.25,0.5,0.75)),
                                                    data_frame(propcor=c(0.5,1.0))) %>%
    filter(rho > 0 | propcor == 1)

# run a bunch; 
nrep <- 1e5
set.seed(1234)
system.time({
    results <- params %>%
        group_by(rho,propcor) %>%
            summarize(sims=list(manysim(B=1000,rho=rho,propcor=propcor,nsims=nrep))) %>%
        ungroup() %>%
        tidyr::unnest() 
})
   user  system elapsed 
 73.487   0.231  73.398 
# aggregate the results
do_aggregate <- function(results) {
    results %>%
        group_by(rho,propcor) %>%
            summarize(qs=list(broom::tidy(quantile(maxval,probs=seq(0.50,0.9975,by=0.0025))))) %>%
        ungroup() %>%
        tidyr::unnest() %>%
        rename(qtile=names,value=x) 
}
sumres <- results %>% do_aggregate()

Here I plot the empirical $q$th quantile of the maximum versus $q$, with different lines for the different values of $\rho$. I include a vertical line at the 0.95 quantile, to show where the nominal 0.05 level threshold is. The takeaway, as implied by Slepian's lemma, is that the maximum over $B$ Gaussian elements decreases stochastically as the correlation of elements increases. Thus when you assume independence of your backtests, you use the red line as your significance threshold (typically where it intersects the dashed vertical line), while your processes live on the green or blue or purple lines. Your test will be too conservative, and your true type I rate will be lower than the nominal rate.

# plot max value vs quantile
library(ggplot2)
ph <- sumres %>%
    filter(propcor==1) %>%
    mutate(rho=factor(rho)) %>%
    mutate(xv=as.numeric(gsub('%$','',qtile))) %>%
    ggplot(aes(x=xv,y=value,color=rho,group=rho)) +
    geom_line() + geom_point(alpha=0.2) + 
    geom_vline(xintercept=95,linetype=2,alpha=0.5) +    # the 0.05 significance cutoff
    labs(title='quantiles of maximum of B=1000 multivariate normal, various correlations',
             x='quantile',y='maximum value')
print(ph)
plot of chunk just_normal_plot: quantiles of maximum of B=1000 multivariate normal, various correlations

But wait, these simulations were over Gaussian vectors (and Slepian's Lemma is only applicable in the Gaussian case), while the Most Important Test is to be applied to the Sharpe ratio. They have different distributions. It turns out, however, that when the population mean is the zero vector, the vector of Sharpe ratios of returns with correlation matrix $R$ is approximately normal with variance-covariance matrix $\frac{1}{n} R$. (This is in section 4.2 of Short Sharpe Course.) Here I establish that empirically, by performing the same simulations again, this time spawning 252 days of normally distributed returns with zero mean, for $B$ possibly correlated strategies, computing their Sharpes, and taking the maximum. I only perform $10^4$ simulations because this one is quite a bit slower:

# really just one simulation.
onesim <- function(B,nday,rho=0,propcor=1.0,pzeta=0) {
    propcor <- min(1.0,max(propcor,1-propcor))
    rho <- abs(rho)
    # correlated part; each row is one day
    X0 <- outer(array(rnorm(nday)),array(2*rbinom(B,size=1,prob=propcor)-1))
    # idiosyncratic part
    XF <- matrix(rnorm(B*nday),nrow=nday,ncol=B)
    XX <- pzeta + (sqrt(rho) * X0 + sqrt(1-rho) * XF)
    sr <- colMeans(XX) / apply(XX,2,FUN="sd")
    # if you wanted to look at them cumulatively:
    #data_frame(maxval=cummax(sr),iterate=1:B)
    # otherwise just the maxval
    data_frame(maxval=max(sr),iterate=B)
}
# many sims.
repsim <- function(B,nday,rho=0,propcor=1.0,pzeta=0,nsims=1000L) {
  jumble <- replicate(nsims,onesim(B=B,nday=nday,rho=rho,propcor=propcor,pzeta=pzeta),simplify=FALSE) %>%
        bind_rows()
}
manysim <- function(B,nday,rho=0,propcor=1.0,pzeta=0,nsims=1000L,nnodes=7) {
  if ((nsims > 10*nnodes) && require(doFuture)) {
        registerDoFuture()
        plan(multiprocess)
    # do in parallel.
        nper <- as.numeric(table(1:nsims %% nnodes))
    retv <- foreach(iii=1:nnodes,.export = c('B','nday','rho','propcor','pzeta','nper','onesim','repsim')) %dopar% {
      repsim(B=B,nday=nday,rho=rho,propcor=propcor,pzeta=pzeta,nsims=nper[iii]) 
    } %>%
      bind_rows()
  } else {
        retv <- repsim(B=B,nday=nday,rho=rho,propcor=propcor,pzeta=pzeta,nsims=nsims)
  }
  retv
}

params <- tidyr::crossing(data_frame(rho=c(0,0.25,0.5,0.75)),
                                                    data_frame(propcor=c(1.0))) %>%
    filter(rho > 0 | propcor == 1)

# run a bunch; 
nday <- 252
numbt <- 1000
nrep <- 1e4
# should take around 8 minutes on 7 cores
set.seed(1234)
system.time({
    sh_results <- params %>%
        group_by(rho,propcor) %>%
            summarize(sims=list(manysim(B=numbt,nday=nday,propcor=propcor,pzeta=0,rho=rho,nsims=nrep))) %>%
        ungroup() %>%
        tidyr::unnest() 
})
    user   system  elapsed 
1354.399    4.685 1353.082 
# aggregate
sh_sumres <- sh_results %>% 
    filter(iterate==1000) %>%
    do_aggregate() %>%
    mutate(value=sqrt(252) * value)   # annualize!

Again, I plot the empirical $q$th quantile of the maximum Sharpe, in annualized units, versus $q$, with different lines for the different values of $\rho$. Because we take one year of returns and annualize the Sharpe, the test statistics should be approximately normal with approximately unit marginal variances. This plot should look eerily similar to the one above, so I overlay the Sharpe simulation results with the results of the Gaussian experiment above to show how close they are. Very little has been lost in the normal approximation to the sample Sharpe, but the maximal Sharpes are slightly elevated compared to the Gaussian case.

# plot max value vs quantile
library(ggplot2)
ph <- sh_sumres %>% 
    mutate(simulation='sharpe') %>% 
    rbind(sumres %>% mutate(simulation='gaussian')) %>%
    filter(propcor==1) %>% 
    mutate(rho=factor(rho)) %>%
    mutate(xv=as.numeric(gsub('%$','',qtile))) %>%
    ggplot(aes(x=xv,y=value,color=rho,linetype=simulation)) +
    geom_line() + geom_point(alpha=0.2) + 
    geom_vline(xintercept=95,linetype=2,alpha=0.5) +    # the 0.05 significance cutoff
    labs(title='maximum versus quantile',x='quantile',y='maximum value')
print(ph)
plot of chunk basic_both_plot: maximum versus quantile


The simulations here show what can happen when many strategies have mutual positive correlation. One might wonder what would happen if there were many strategies with a significant negative correlation. It turns out this is not really possible. In order for the correlation matrix to be positive definite, you cannot have too many strongly negative off-diagonal elements. Perhaps there is a Pigeonhole Principle argument for this, but a simple expectation argument suffices.
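
For instance, here is one version of that expectation argument: if the $B$ elements all have unit variance and common pairwise correlation $\rho$, then $$ 0 \le \operatorname{Var}\left(\sum_i X_i\right) = B + B(B-1)\rho, $$ so $\rho \ge -1/(B-1)$; with $B=1000$ strategies, the common correlation can be no more negative than about $-0.001$.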

Just in case, however, above I also simulated the case where you flip a coin to determine whether an element of the Gaussian has a positive or negative correlation to the common element. When the coin is completely biased to heads, you get the simulations shown above. When the coin is fair, the elements of the Gaussian are expected to be divided evenly into two groups, positively correlated within each group and negatively correlated across the two. Here is the plot of the $q$th quantile of the maximum versus $q$, with different colors for $\rho$ and different lines for the probability of heads. Allowing negatively correlated elements does stochastically increase the maximum element, but never above the $\rho=0$ case. And so the Most Important Test still appears conservative.

# plot max value vs quantile
library(ggplot2)
ph <- sumres %>%
    mutate(rho=factor(rho)) %>%
    mutate(`prob. heads`=factor(propcor)) %>%
    mutate(xv=as.numeric(gsub('%$','',qtile))) %>%
    ggplot(aes(x=xv,y=value,color=rho,linetype=`prob. heads`)) +
    geom_line() + geom_point(alpha=0.2) + 
    geom_vline(xintercept=95,linetype=2,alpha=0.5) +    # the 0.05 significance cutoff
    labs(title='quantiles of maximum of B=1000 multivariate normal, various (anti) correlations',
             x='quantile',y='maximum value')
print(ph)
plot of chunk flippy_normal_plot: quantiles of maximum of B=1000 multivariate normal, various (anti) correlations

In a followup post I will attempt to address the conservatism and hill-climbing issues, using the Markowitz approximation.


Copyright © 2018-2025, Steven E. Pav.  
The above references an opinion and is for information purposes only. It is not offered as investment advice.