A post-hoc test for the Sharpe ratio
Suppose you observe the historical returns of \(p\) different fund managers, and wish to test whether any of them has a superior Signal-Noise ratio (SNR) compared to the others. The first test you might perform is the test of pairwise equality of all SNRs. This test relies on the multivariate delta method and the central limit theorem, resulting in a chi-square test, as described by Wright et al. and outlined in section 4.3 of our Short Sharpe Course. This test is analogous to ANOVA, where one tests different populations for unequal means, assuming equal variance. (The equal Sharpe test, however, deals naturally with the case of paired observations, which is commonly the case in testing asset returns.)
In the analogous procedure, if one rejects the null of equal means in an ANOVA, one can perform pairwise tests for equality. This is called a post hoc test, since it is performed conditional on a rejection in the ANOVA. The basic post hoc test is Tukey's range test, sometimes called 'Honest Significant Differences'. It is natural to ask whether we can extend the same procedure to testing the SNR. Here we will propose such a procedure for a crude model of correlated returns.
The Tukey test gains power by pooling all populations together to estimate the overall variance. The test statistic then becomes something like
\[
\frac{Y_{(p)} - Y_{(1)}}{\sqrt{S^2 / n}},
\]
where \(Y_{(1)}\) is the smallest mean observed, \(Y_{(p)}\) is the largest, \(S^2\) is the pooled estimate of variance, and \(n\) is the number of observations per population. The difference between the maximal and minimal \(Y\) is why this is called the 'range' test, since this is the range of the observed means.
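To make the classical statistic concrete, here is a minimal sketch in R on made-up data. The number of populations, the sample size, and all variable names are illustrative assumptions, not part of the analysis in this post. The sketch assumes a balanced design, so the pooled variance is just the average of the per-group variances.

```r
# Illustrative only: simulated data for p populations with equal means and variances.
set.seed(101)
p <- 5     # number of populations (assumed)
n <- 30    # observations per population (assumed)
Y <- matrix(rnorm(n * p, mean = 1, sd = 2), nrow = n, ncol = p)

ybar <- colMeans(Y)             # observed means
S2 <- mean(apply(Y, 2, var))    # pooled variance (equal group sizes)
qstat <- (max(ybar) - min(ybar)) / sqrt(S2 / n)

# p-value under the null of equal means, with p * (n - 1) degrees of freedom
pval <- ptukey(qstat, nmeans = p, df = p * (n - 1), lower.tail = FALSE)
print(c(statistic = qstat, p.value = pval))
```

The p-value comes from ptukey, the distribution function that pairs with the qtukey quantile function used later for the Sharpe version of the test.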
Switching back to our problem, we should not have to assume that our tested returns series have the same volatility. Moreover, the standard error of the Sharpe ratio is only weakly dependent on the unknown population parameters, so we will not pool variances. In our paper on testing the asset with maximal Sharpe, we established that the vector of Sharpes, for normal returns and when the SNRs are small, is approximately asymptotically normal:
\[
\hat{\zeta} \approx \mathcal{N}\left(\zeta, \frac{1}{n} R\right).
\]
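Here \(R\) is the correlation matrix of returns, and \(n\) is the number of observations. See our previous blog post for more details. Under the null hypothesis that all SNRs are equal to \(\zeta_0\), we can express this as
\[
z = \sqrt{n} \left(R^{1/2}\right)^{-1} \left(\hat{\zeta} - \zeta_0 1\right) \approx \mathcal{N}\left(0, I\right),
\]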
where \(R^{1/2}\) is a matrix square root of \(R\).
Now assume the simple rank-one model for correlation, where assets are correlated to a single common latent factor, but are otherwise independent:
\[
R = \left(1-\rho\right) I + \rho 1 1^{\top}.
\]
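Under this model of \(R\) we computed the inverse square root of \(R\) as
\[
R^{-1/2} = \left(1-\rho\right)^{-1/2} I + c\, 1 1^{\top},
\qquad
c = \frac{1}{p}\left[\left(1 + \left(p-1\right)\rho\right)^{-1/2} - \left(1-\rho\right)^{-1/2}\right].
\]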
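Picking two distinct indices \(i, j\), let \(v = \left(e_i - e_j\right)\) be the contrast vector. We have
\[
v^{\top} z = \sqrt{n}\, v^{\top} R^{-1/2} \left(\hat{\zeta} - \zeta_0 1\right) = \sqrt{\frac{n}{1-\rho}} \left(\hat{\zeta}_i - \hat{\zeta}_j\right),
\]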
because \(v^{\top}1=0\). Thus the range of the observed Sharpe ratios is a scalar multiple of the range of a set of \(p\) independent standard normal variables. This is akin to the 'monotonicity' principle that we abused earlier when performing inference on the asset with maximum Sharpe.
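The cancellation can be checked numerically. The sketch below assumes the equicorrelation form \(R = (1-\rho) I + \rho 1 1^{\top}\) for the rank-one model and takes the symmetric inverse square root from an eigendecomposition. The values of \(p\), \(\rho\) and the two indices are arbitrary choices for illustration.

```r
# Numerical check of the contrast identity under the rank-one model.
p <- 6       # number of assets (illustrative)
rho <- 0.8   # common correlation (illustrative)
R <- (1 - rho) * diag(p) + rho * matrix(1, p, p)

# symmetric inverse square root via the eigendecomposition
eig <- eigen(R, symmetric = TRUE)
Rinvhalf <- eig$vectors %*% diag(1 / sqrt(eig$values)) %*% t(eig$vectors)

# contrast between assets i and j
i <- 2
j <- 5
v <- rep(0, p)
v[i] <- 1
v[j] <- -1

# v' R^{-1/2} should equal v' / sqrt(1 - rho), because v' 1 = 0
lhs <- as.numeric(t(v) %*% Rinvhalf)
rhs <- v / sqrt(1 - rho)
print(max(abs(lhs - rhs)))
```

The printed discrepancy should be at the level of floating point error, whatever the values of \(p\) and \(\rho\) in \((0,1)\).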
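Under the normal approximation and the rank-one correlation model, we should then see
\[
\hat{\zeta}_{(p)} - \hat{\zeta}_{(1)} \ge HSD = q_{1-\alpha,p,\infty} \sqrt{\frac{1-\rho}{n}}
\]
with probability \(\alpha\), where \(\hat{\zeta}_{(1)}\) and \(\hat{\zeta}_{(p)}\) are the smallest and largest observed Sharpe ratios, and \(q_{1-\alpha,m,\nu}\) is the upper \(\alpha\)-quantile of the Tukey distribution for \(m\) means and \(\nu\) degrees of freedom. This is computed by qtukey in R.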
Alternatively, one can construct a confidence interval \(\hat{\zeta}_i \pm HSD\) around each \(\hat{\zeta}_i\), whereby if another \(\hat{\zeta}_j\) does not fall within it, the two are said to be Honestly Significantly Different. The familywise error rate should be no more than \(\alpha\).
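As a rough sketch of how the pairwise comparison might be coded, the helper below computes the threshold and flags pairs whose Sharpe ratios differ by more than it. The function name sharpe_hsd, the manager labels, and the Sharpe values are all hypothetical, and the common correlation is assumed to be known.

```r
# Hypothetical helper: HSD threshold for per-period Sharpe ratios,
# assuming a known common correlation rho and n observations.
sharpe_hsd <- function(n, p, rho, alpha = 0.05) {
  sqrt((1 - rho) / n) * qtukey(alpha, nmeans = p, df = Inf, lower.tail = FALSE)
}

# made-up Sharpe ratios for four managers, in per-period units
zetahat <- c(A = 0.090, B = 0.075, C = 0.030, D = 0.010)
hsd <- sharpe_hsd(n = 1008, p = length(zetahat), rho = 0.8)

# pairwise absolute differences, flagged when they exceed the threshold
differs <- abs(outer(zetahat, zetahat, FUN = "-")) > hsd
print(hsd)
print(differs)
```

The logical matrix has FALSE on its diagonal and is symmetric, which is the form expected by tools that summarize multiple comparisons.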
Testing
Let's test this under the null. We simulate 4 years of correlated daily returns from 16 managers, then compare the range of the observed Sharpe ratios (maximum minus minimum) to the test value of \(HSD\). Assume that the correlation is known to have value \(\rho=0.8\). (More realistically, it would have to be estimated.) Note that for this many fund managers we have
\[
q_{0.95,16,\infty} \approx 4.85,
\]
and thus, taking into account the \(\sqrt{1-\rho}\) term,
\[
HSD \approx 2.17\, n^{-1/2}.
\]
This is only slightly bigger than the naive approximate confidence intervals one would typically apply to the Sharpe ratio, which in this case would be around
\[
1.96\, n^{-1/2}.
\]
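The quantities compared here can be computed directly. This is a small sketch, assuming the naive interval is the usual normal-quantile half-width \(z_{1-\alpha/2}\, n^{-1/2}\); both values are expressed in units of \(n^{-1/2}\).

```r
nman <- 16     # number of managers
rho <- 0.8     # assumed known common correlation
alpha <- 0.05

# upper alpha-quantile of the range of nman independent standard normals
qcrit <- qtukey(alpha, nmeans = nman, df = Inf, lower.tail = FALSE)
# HSD in units of n^(-1/2)
hsd_scaled <- sqrt(1 - rho) * qcrit
# naive two-sided normal half-width, also in units of n^(-1/2)
naive_scaled <- qnorm(1 - alpha / 2)

print(c(qcrit = qcrit, hsd = hsd_scaled, naive = naive_scaled))
```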
We perform 10 thousand simulations, computing the Sharpe over all managers, and collecting the ranges. We compute the empirical type I error rate, and find it to be nearly equal to the nominal value of 0.05:
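suppressMessages({
  library(mvtnorm)
})
nman <- 16    # number of managers
nyr <- 4      # years of history
ope <- 252    # observations per year
SNR <- 0.8    # annual units
rho <- 0.8    # common correlation
nday <- round(nyr * ope)
# ones on the diagonal, rho everywhere else
R <- pmin(diag(nman) + rho,1)
# daily mean return; daily volatility is one
mu <- rep(SNR / sqrt(ope),nman)
nsim <- 10000
set.seed(1234)
# range of the observed Sharpe ratios in each simulation
ranges <- replicate(nsim,{
  X <- mvtnorm::rmvnorm(nday,mean=mu,sigma=R)
  zetahat <- colMeans(X) / apply(X,2,sd)
  max(zetahat) - min(zetahat)
})
alpha <- 0.05
HSDval <- sqrt((1-rho) / nday) * qtukey(alpha,lower.tail=FALSE,nmeans=nman,df=Inf)
# empirical type I error rate
mean(ranges > HSDval)
## [1] 0.0541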
Compact Letter Display
The results of Tukey's test can be difficult to summarize. You might observe, for example, that managers 1 and 2 have significantly different SNRs, but not have enough evidence to say that 1 and 3 have different SNRs, nor 2 and 3. How, then, should you think about manager 3? They perhaps have the same SNR as 2, and perhaps the same as 1, but you have evidence that 1 and 2 have different SNRs. You might label 1 as being among the 'high performers' and 2 among the 'average performers'; in which group should you place 3?
One answer would be to put manager 3 in both groups. This is a solution you might see as the result of compact letter displays, which are a commonly used way of communicating the results of multiple comparison procedures like Tukey's test. The idea is to put managers into multiple groups, each group identified by a letter, such that if two managers are in a common group, the HSD test fails to find they have significantly different SNR. The assignment to groups is not unique, and so is chosen to optimize certain criteria, like minimizing the total number of groups; cf. Gramm et al. For our purposes here, we use Piepho's algorithm, which is conveniently provided by the multcompView package in R.
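As a toy illustration of the three-manager example, the following sketch feeds multcompLetters a hand-built logical matrix in which only managers 1 and 2 are flagged as different. The manager labels and the matrix are made up for illustration.

```r
library(multcompView)

# Toy input: TRUE marks a pair found significantly different.
mgrs <- c("mgr1", "mgr2", "mgr3")
sigdiff <- matrix(FALSE, 3, 3, dimnames = list(mgrs, mgrs))
sigdiff["mgr1", "mgr2"] <- TRUE
sigdiff["mgr2", "mgr1"] <- TRUE

# compact letter display for the toy comparison
cld <- multcompLetters(sigdiff)
print(cld$Letters)
```

By construction of a compact letter display, manager 3 should share a letter with each of the other two, who share none with each other.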
Here we apply the technique to the series of monthly returns of 5 industry factors, as compiled by Ken French and published in his data library. We have almost 1200 months of data for these 5 returns. The returns are highly positively correlated, and we find that their common correlation is very close to 0.8. For this setup, with \(n\) the number of months and measuring the Sharpe in annualized units, the critical value at the 0.05 level is
\[
HSD = \sqrt{12}\, q_{0.95,5,\infty} \sqrt{\frac{1-\rho}{n}} \approx 1.73 \sqrt{\frac{12}{n}}.
\]
For comparison, the half-width of the two sided confidence interval on the Sharpe in this case would be
\[
1.96 \sqrt{\frac{12}{n}},
\]
which is a bit bigger. We have actually gained resolving power in our comparison of industries because of the high level of correlation.
Below we compute the observed Sharpe ratios of the five industries, finding them to range from around \(0.49\,\mbox{year}^{-1/2}\) to \(0.67\,\mbox{year}^{-1/2}\). We compute the HSD threshold, then call Piepho's method and print the compact letter display, shown below. In the output shown, all five industries receive the single letter 'a': the largest gap, between Healthcare and Other, is about \(0.178\,\mbox{year}^{-1/2}\), which does not exceed the HSD threshold, so the post hoc test finds no honest significant differences. Had that gap exceeded the threshold, Healthcare and Other would have been assigned to two different groups, 'a' and 'b', with Consumer, Manufacturing and Technology lumped into both.
# this is just a package of some data:
# if (!require(aqfb.data)) { remotes::install_github('shabbychef/aqfb_data') }
library(aqfb.data)
data(mind5)
# Sharpe ratio of each industry, in monthly units
mysr <- colMeans(mind5) / apply(mind5,2,FUN=sd)
# sort decreasing for convenience later
mysr <- sort(mysr,decreasing=TRUE)
# annualize it
ope <- 12
mysr <- sqrt(ope) * mysr
# show
print(mysr)
##    Healthcare      Consumer Manufacturing    Technology         Other
##      0.664421      0.649723      0.602318      0.584746      0.486777
# matrix of pairwise differences in Sharpe
srdiff <- outer(mysr,mysr,FUN='-')
R <- cov2cor(cov(mind5))
# this ends up being around 0.8:
myrho <- median(R[row(R) < col(R)])
alpha <- 0.05
# HSD threshold in annualized units
HSD <- sqrt(ope) * sqrt((1-myrho) / nrow(mind5)) * qtukey(alpha,lower.tail=FALSE,nmeans=ncol(mind5),df=Inf)
library(multcompView)
# TRUE marks a pair that is honestly significantly different
lets <- multcompLetters(abs(srdiff) > HSD)
print(lets)
## $Letters
##    Healthcare      Consumer Manufacturing    Technology         Other
##           "a"           "a"           "a"           "a"           "a"
##
## $LetterMatrix
##                  a
## Healthcare    TRUE
## Consumer      TRUE
## Manufacturing TRUE
## Technology    TRUE
## Other         TRUE