Date: Jan 17, 2013 6:33 PM
Author: David Jones
Subject: Re: Mahalanobis_distance and Gaussian distribution

"Rich Ulrich" wrote in message

news:75hgf8t3orrka7gjpqehrm89vcdmi2g8b6@4ax.com...

On Thu, 17 Jan 2013 11:21:24 -0000, "David Jones"

<dajhawk@hotmail.co.uk> wrote:

[snip, a bunch]

>

>The Mahalanobis distances may be dimensionless with respect to the units of

>the underlying observations but that does not mean that they are immediately

>comparable across different sources of data. Even if the number of

>dimensions is the same you still need to look at context. For example, if

>used in some formal testing procedure, the power of such tests can be

>different. Consider two different sets of observations on the underlying

>quantity, one with rather more random observation error than the other.

>

>For different dimensions, consider the case where the dimensions are much

>more different, say 2 and 100. Then a typical value of the (squared)

>Mahalanobis distance for a point from the second population would be 100,

>but this would be a very unusual value for a point from the first

>population. In fact the sets of values of distances for the two populations

>would hardly overlap. If this is meaningful for whatever way you intend to

>use the distances then OK. But

>many uses are of the kind where you are looking for datapoints that are

>unusual with respect to an initial distribution ... the Mahalanobis distance

>is not (without some transformation) directly usable in a comparison between

>sets of data with different dimensions, as exemplified in the case above

>where a value of 100 is unusual for one population but not the other.
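
To put rough numbers on the 2-versus-100 example quoted above, here is a minimal sketch. It assumes the squared Mahalanobis distance is referred to a chi-squared distribution with df equal to the dimension (which holds for Gaussian data when the mean and covariance are known, not estimated). The function name chi2_upper_tail_even_df is purely illustrative; it relies on the standard identity that, for even df, the chi-squared upper tail is a Poisson sum, so only the Python standard library is needed.

```python
import math


def chi2_upper_tail_even_df(x, df):
    """P(X > x) for X ~ chi-squared with an even number of degrees of freedom.

    For df = 2m the upper tail equals the probability that a Poisson(x/2)
    variable is at most m - 1, which is a finite sum.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this sketch only handles positive even df")
    half = x / 2.0
    term = math.exp(-half)  # k = 0 term of the Poisson sum
    total = term
    for k in range(1, df // 2):
        term *= half / k    # Poisson term for k from the term for k - 1
        total += term
    return total


d2 = 100.0  # squared Mahalanobis distance from the quoted example
p_dim2 = chi2_upper_tail_even_df(d2, 2)
p_dim100 = chi2_upper_tail_even_df(d2, 100)
print(f"df =   2: P(D^2 > {d2:g}) = {p_dim2:.3g}")
print(f"df = 100: P(D^2 > {d2:g}) = {p_dim100:.3g}")
```

Under these assumptions the tail probability for 2 dimensions is exp(-50), which is vanishingly small, while for 100 dimensions it comes out a little under one half. So the same raw value is wildly unusual in one case and entirely ordinary in the other.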

David,

I'm asking myself -- to judge which is more of an outlier, why

can't we compare the "p-values" of the two distances, each taken from its

own chi-squared distribution with its own df?

I'm not saying that this is a good idea. -- I *suspect* that there

is something shaky about it, or I might have heard of it being

done before, and it doesn't seem familiar. Or, is that just

because the circumstances are too rare in my reading?

--------------------------------------------------------------------

The answer here is that you can always compare the p-values of

test-statistics to see how unusual the observed statistics are compared to

their corresponding distributions. But that doesn't tell you anything about

a comparison of the worth of those test statistics. For example, given the

assumption of a normal distribution with unknown mean you can use either the

usual t-statistic to test for a given mean, or one based on the number of

values on either side of that mean. The p-values for the test statistics

would be valid and comparable in a sense. Yet you know that one test is more

powerful than the other. (Here the question concerns not two statistics

derived from the same data but two statistics derived from different data,

though this doesn't add much extra to the question.)
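
The t-test versus sign-test point can be checked with a small simulation. This is only a sketch under stated assumptions: samples of size 20 from a unit-variance normal, a two-sided test of mean zero, the tabulated 5% critical value 2.093 for Student's t with 19 df, and a sign test that rejects when the number of positive values is at most 5 or at least 15. The sign test's exact size is therefore about 0.041 rather than 0.05, so the two tests are only approximately matched in level. All names (rejection_rates and so on) are illustrative.

```python
import math
import random

N = 20
T_CRIT = 2.093             # two-sided 5% point of Student's t, 19 df (table value)
SIGN_LO, SIGN_HI = 5, 15   # sign test rejects if positives <= 5 or >= 15


def sign_test_size(n, lo, hi):
    """Exact null rejection probability of the two-sided sign test."""
    lower = sum(math.comb(n, k) for k in range(0, lo + 1))
    upper = sum(math.comb(n, k) for k in range(hi, n + 1))
    return (lower + upper) / 2 ** n


def rejection_rates(shift, n_sims=4000, seed=1):
    """Simulated rejection rates (t-test, sign test) for a true mean `shift`."""
    rng = random.Random(seed)
    t_rej = sign_rej = 0
    for _ in range(n_sims):
        x = [rng.gauss(shift, 1.0) for _ in range(N)]
        mean = sum(x) / N
        var = sum((v - mean) ** 2 for v in x) / (N - 1)
        t_stat = mean / math.sqrt(var / N)
        if abs(t_stat) > T_CRIT:
            t_rej += 1
        positives = sum(v > 0 for v in x)
        if positives <= SIGN_LO or positives >= SIGN_HI:
            sign_rej += 1
    return t_rej / n_sims, sign_rej / n_sims


null_t, null_sign = rejection_rates(0.0)
alt_t, alt_sign = rejection_rates(0.5)
print(f"exact sign-test size: {sign_test_size(N, SIGN_LO, SIGN_HI):.4f}")
print(f"under the null:  t = {null_t:.3f}, sign = {null_sign:.3f}")
print(f"mean shift 0.5:  t = {alt_t:.3f}, sign = {alt_sign:.3f}")
```

Both tests give valid p-values under the null, but with a true shift of half a standard deviation the t-test should reject noticeably more often, which is the sense in which p-values alone say nothing about the worth of the statistics.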

Thus in the Mahalanobis distance case, there really needs to be some

consideration of power. If one thinks of plotting "probability of

rejection" against some representative quantification of departure from the

mean, it seems that this is most naturally the covariance-matrix-weighted

quadratic form based on the difference of the means of null and alternative.

These x-axis variables are therefore different for two Mahalanobis distances

which are derived from different basic variables or from variables with

different measurement errors contributing to the covariance matrix and

certainly from observation vectors of different lengths. Of course, one

could start from a univariate measure of "size of departure" and map this

through some model of how this affects the means of the "alternative" models

in each case: then one would be able to compare the powers of the tests on

the same scale, but this would clearly depend on how the changes in means

are modelled for each space. Of course, if there really is just one

underlying variable controlling the change being sought by a test, it would

be possible to devise some optimal test statistic aligned to the given

direction, which would argue for not using the Mahalanobis distance.
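
As a rough illustration of why vector length matters for power, the sketch below holds the quadratic form (the noncentrality, i.e. the covariance-weighted squared distance between null and alternative means) fixed at 9 and compares 2 with 100 dimensions. The assumptions are mine, not the thread's: identity covariance, known mean and covariance, a 5% chi-squared test on the squared distance, and the whole mean shift placed along the first coordinate. The function names are illustrative, and the critical value is found by bisection on the even-df tail formula so that nothing beyond the standard library is needed.

```python
import math
import random


def chi2_upper_tail_even_df(x, df):
    """P(X > x) for chi-squared with even df, via the Poisson-sum identity."""
    half = x / 2.0
    term = math.exp(-half)
    total = term
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return total


def chi2_critical_value(df, alpha=0.05):
    """Bisection for c such that P(chi2_df > c) = alpha (even df only)."""
    lo, hi = 0.0, 10.0 * df + 100.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if chi2_upper_tail_even_df(mid, df) > alpha:
            lo = mid   # tail still too heavy, move right
        else:
            hi = mid
    return (lo + hi) / 2.0


def simulated_power(dim, noncentrality, n_sims=2000, seed=1):
    """Rejection rate of the chi-squared test on D^2 under a shifted mean."""
    rng = random.Random(seed)
    crit = chi2_critical_value(dim)
    shift = math.sqrt(noncentrality)  # all of the shift on coordinate 1
    rejections = 0
    for _ in range(n_sims):
        d2 = (rng.gauss(0.0, 1.0) + shift) ** 2
        d2 += sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(dim - 1))
        if d2 > crit:
            rejections += 1
    return rejections / n_sims


power_dim2 = simulated_power(2, 9.0)
power_dim100 = simulated_power(100, 9.0)
print(f"dim =   2, noncentrality 9: power ~ {power_dim2:.2f}")
print(f"dim = 100, noncentrality 9: power ~ {power_dim100:.2f}")
```

With the same quadratic form on the x-axis, the 100-dimensional test should show far lower power than the 2-dimensional one, because the same departure is spread against many more noise coordinates. A test aligned with the known direction of the shift would avoid that dilution, which is the argument made above for not using the Mahalanobis distance in that situation.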

It is not clear what the OP's context for asking actually was, as there are

a number of possibilities even for a single population case:

(i) comparison of Mahalanobis distances based on different subspaces

(ii) comparison of p-values for Mahalanobis distances based on different

subspaces

(iii) choice of which of several Mahalanobis distances based on different

subspaces to use as a single test statistic

(iv) possible benefits of combining subspaces to create a better test

statistic.

There is also the possibility that there are multiple populations, in which

case there may be a clustering problem, and the apparent separation of

populations may be important.