Topic: Mahalanobis_distance and Gaussian distribution
Replies: 6   Last Post: Jan 18, 2013 12:10 PM

 David Jones Posts: 80 Registered: 2/9/12
Re: Mahalanobis_distance and Gaussian distribution
Posted: Jan 17, 2013 6:21 AM

"MBALOVER" wrote in message

On Sunday, January 13, 2013 12:03:42 PM UTC-8, David Jones wrote:
> "MBALOVER" wrote in message
>
> > From Wiki, http://en.wikipedia.org/wiki/Mahalanobis_distance
> > the Mahalanobis distance measures how probable it is that a point
> > belongs to a distribution.
> >
> > 1. Do we have to assume that the distribution is Gaussian for the
> > Mahalanobis distance to be meaningful?
>
> There are some results for the distribution of the statistic that do rely
> on the assumption that the initial distribution is Gaussian. Assuming that
> you mean the case where the mean and covariance matrix are assumed known,
> there is the result that the statistic has a chi-squared distribution,
> which does rely on the Gaussian assumption. But there is a related result
> that does not rely on the Gaussian assumption: specifically, the mean
> value of the statistic is known, though this does rely on having used the
> right covariance matrix. The variance of the statistic requires rather
> more information, but can be evaluated theoretically from the first four
> joint moments of the initial distribution. This is probably too
> complicated for practical use. There are several possibilities.
>
> (i) use probabilities derived from the chi-squared result, but don't treat
> them as anything more than a rough guide;
>
> (ii) create a standardised statistic by subtracting off the mean and
> dividing by the standard deviation, both derived from the chi-squared
> result ... this at least would give something more easily related to the
> underlying data and would not depend so heavily on the Gaussian
> distribution, which could otherwise yield fictitious or incorrect
> probabilities;
>
> (iii) use some sort of resampling technique or simulations to get a better
> grip on the properties of the distribution.
>
> Of course, if you are using a version of the statistic where the mean and
> variance of the initial distribution have to be estimated from data, the
> situation is more complicated.
>
> The Mahalanobis distance is a multivariate version of judging the distance
> of a point from a univariate distribution by scaling the distance from the
> mean by the population standard deviation, and is applicable to any
> distribution for which these moments exist. Of course, there may always be
> something better ... this would depend both on the distribution being used
> as the initial distribution and on the sort of departures from this
> distribution that are important in any particular context. There are
> several difficulties in defining general measures of how far a point is
> from a distribution, not least because of the potential effects of even
> simple 1-1 transformations of a multivariate space (and the Mahalanobis
> distance isn't immune from these difficulties, but at least one can look
> for a transformation yielding something close to a multivariate Gaussian
> distribution).
>
> > 2. I have two distributions in different coordinate spaces: say Space A,
> > which is 3D, and Space B, which is 2D. I have two points, P1 with
> > coordinates [x, y, z] in Space A and P2 with coordinates [u, v] in
> > Space B. I wonder if I can apply the Mahalanobis distance to compare
> > which one (either P1 or P2) is closer to its corresponding distribution.
> > Does the comparison make sense? Do I have to do anything to normalize
> > between the two distributions?
>
> Would you be prepared to compare the simple shifted and scaled (i.e. using
> mean and standard deviation) versions of measurements from two different
> univariate distributions? The answer isn't obviously yes. If you look at
> properties of the Mahalanobis distance you will see that these do depend
> on the number of dimensions involved in the initial distributions, but you
> might choose to proceed either by converting to chi-squared exceedance
> probabilities (or equivalent), or by standardising by the mean and
> standard deviation of the Mahalanobis distance.

Thanks David. When you say "If you look at properties of the Mahalanobis
distance you will see that these do depend on the number of dimensions
involved in the initial distributions", I guess you mean that if we look at
the distribution of the Mahalanobis distance, for Space A it will be
chi-squared with 3 dof and for Space B chi-squared with 2 dof, and thus
depends on the number of dimensions. But I wonder why we have to care about
the distributions of the Mahalanobis distances. Why don't we just directly
compare the Mahalanobis distance (say D1) from P1 to the distribution in
Space A and the Mahalanobis distance (say D2) from P2 to the distribution in
Space B to see which one is larger? For example, if D1 > D2 we can say P1 is
farther away from its corresponding distribution in Space A than P2 is from
its corresponding distribution in Space B.

--------------------------------------------------------------------------------------------------------

The Mahalanobis distances may be dimensionless with respect to the units of
the underlying observations, but that does not mean that they are immediately
comparable across different sources of data. Even if the number of
dimensions is the same you still need to look at context. For example, if
used in some formal testing procedure, the power of such tests can be
different. Consider two different sets of observations on the same underlying
quantity, one with rather more random observation error than the other.
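
As a rough illustration of the observation-error point, the sketch below
assumes two hypothetical instruments measuring the same pair of quantities,
with independent errors of standard deviation 1 and 2 respectively. These
numbers, the diagonal covariance matrix and the function name
`squared_mahalanobis_diag` are all assumptions chosen for simplicity, not
anything taken from the thread.

```python
def squared_mahalanobis_diag(point, mean, variances):
    """Squared Mahalanobis distance, assuming a diagonal covariance matrix."""
    return sum((x - m) ** 2 / v for x, m, v in zip(point, mean, variances))


mean = [0.0, 0.0]
point = [3.0, 0.0]  # the same physical departure from the mean in both cases

# Hypothetical precise instrument: error variance 1 in each coordinate.
d2_precise = squared_mahalanobis_diag(point, mean, [1.0, 1.0])

# Hypothetical noisy instrument: error variance 4 (sd 2) in each coordinate.
d2_noisy = squared_mahalanobis_diag(point, mean, [4.0, 4.0])

print(d2_precise, d2_noisy)
```

The same departure of 3 units gives a squared distance of 9 with the precise
instrument but only 2.25 with the noisy one, so a test based on the distance
would be far less likely to flag it in the second case, even though both
data sets have 2 dimensions.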

For different dimensions, consider the case where the numbers of dimensions
differ greatly, say 2 and 100. Then a typical value of the Mahalanobis
distance (in its squared, chi-squared form) for a point from the second
population would be 100, but this would be a very unusual value for a point
from the first population. In fact the sets of values of distances for the
two populations would hardly overlap. If this is meaningful for whatever way
you intend to use the distances then OK. But many uses are of the kind where
you are looking for datapoints that are unusual with respect to an initial
distribution ... the Mahalanobis distance is not (without some
transformation) directly usable in a comparison between sets of data with
different dimensions, as exemplified in the case above where a value of 100
is unusual for one population but not the other.
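
A minimal sketch of the two transformations mentioned earlier in the thread
(chi-squared exceedance probabilities, and standardisation by mean and
standard deviation) may make the 2-versus-100 example concrete. It assumes a
known mean and covariance matrix and a Gaussian initial distribution, so that
the squared distance is chi-squared with as many degrees of freedom as there
are dimensions. The function names `chi2_sf` and `standardise` are
hypothetical, and only the Python standard library is used.

```python
import math


def chi2_sf(x, dof):
    """P(X > x) for X chi-squared with integer `dof` >= 1 degrees of freedom."""
    if dof < 1 or dof != int(dof):
        raise ValueError("dof must be a positive integer")
    if x <= 0:
        return 1.0
    half = x / 2.0
    # Closed-form starting values for 2 and 1 degrees of freedom.
    if dof % 2 == 0:
        total, k = math.exp(-half), 2
    else:
        total, k = math.erfc(math.sqrt(half)), 1
    # Recurrence: Q(k + 2) = Q(k) + (x/2)^(k/2) * exp(-x/2) / Gamma(k/2 + 1),
    # evaluated on the log scale to avoid overflow for large dof.
    while k < dof:
        total += math.exp((k / 2.0) * math.log(half) - half
                          - math.lgamma(k / 2.0 + 1.0))
        k += 2
    return min(total, 1.0)


def standardise(d2, dof):
    """(d2 - mean) / sd using the chi-squared mean (dof) and variance (2 * dof)."""
    return (d2 - dof) / math.sqrt(2.0 * dof)


for dof in (2, 100):
    print(dof, chi2_sf(100.0, dof), standardise(100.0, dof))
```

Under these assumptions a squared distance of 100 has an exceedance
probability of about 2e-22 in 2 dimensions but roughly 0.48 in 100
dimensions, and standardised values of 49 and 0 respectively. Either
transformed quantity can be compared across the two populations, which the
raw distance cannot. If the Gaussian assumption is doubtful, the
probabilities should be treated only as a rough guide.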
