Topic: Is there a closed-form expression for evaluating predictive power using the power curve, Gini coefficient, ROC, Lorenz curve, etc.?
Replies: 1   Last Post: Jul 5, 2006 2:28 AM

 networm Posts: 327 Registered: 10/6/05
Posted: Jul 4, 2006 4:43 PM

Hi all,

I am working with a spam data set, and there are several models I'd
like to try out.

Once trained, each model generates a spam probability, which is used
to predict whether an incoming email is spam or not. The models have
parameters that need to be tuned during training. Normally, I can use
training-testing cross-validation together with the power curve, Gini
coefficient, ROC, etc. to tune the training parameters.
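
For reference, here is a minimal sketch of the empirical measures meant above, assuming labels coded 1 for spam and 0 for non-spam. The function names are made up for illustration, and the relation Gini = 2*AUC - 1 is the usual convention for a binary classifier rather than anything specific to these models.

```python
def auc_from_scores(scores, labels):
    """Empirical area under the ROC curve.

    Equals the fraction of (spam, non-spam) pairs in which the spam
    message receives the higher score; ties count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one spam and one non-spam example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def gini_from_scores(scores, labels):
    """Gini coefficient under the assumed convention Gini = 2*AUC - 1."""
    return 2.0 * auc_from_scores(scores, labels) - 1.0
```

The double loop is quadratic in the number of emails; a rank-based version would be faster, but this form makes the pairwise definition explicit.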

However, I ran into difficulty with one model:

This model views the incoming training sequence as a stochastic
process, and it has parameters that depend on the training sequence
viewed as a time series. If the order of the training sequence changes,
the parameters change as well.

Thus each training sequence, taken as a time series, has its own
particular set of parameters, and for the purpose of tuning that set
the whole training sequence counts as only one sample. That is why I
cannot use training-testing cross-validation to tune my parameters.

I do have a bunch of such sequences, which can be treated as tens of
samples if each sequence is deemed one sample. The problem is that the
parameters for each sequence, i.e. each sample, are different.

Let's say I have 10 such input sequences, and for each one there are 6
parameters to tune.

Each sequence will generate a set of parameters to predict the
probability of the next incoming email in that sequence being a spam...

So if I use the normal approach with the ROC, the Gini coefficient, or
the area under the curve, I have to tune 10x6=60 parameters altogether.

I have a closed-form expression for the spam probability generated by
this model, and for the test cases I can calculate the power curve,
Gini coefficient, ROC, Lorenz curve, and false alarm rate. If I had a
closed-form expression for the Gini coefficient in terms of my
predicted spam probability, I would not need to pool all 60 parameters
together.

I have to pool all 10 sequences and 60 parameters together because I
have to use the empirical data and do something similar to "Monte
Carlo" in order to generate the ROC curve.

But if I can wrap the ROC, the Gini coefficient, and the other measures
of predictive power into one closed-form expression, then I don't need
to pool all the empirical data together. I can carry out the tuning
analytically, and then maybe I can tune the 6 parameters for each
sequence individually, so the workload drops drastically.
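
One way to make such a measure usable analytically, offered only as a sketch and not as part of the model described here, is to replace the hard pairwise comparison behind the ROC area with a smooth sigmoid. The result is a differentiable function of the scores, so it could be optimized for one sequence at a time. The name `soft_auc` and the temperature parameter `tau` are hypothetical.

```python
import math


def _sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def soft_auc(scores, labels, tau=0.1):
    """Smooth surrogate for the area under the ROC curve.

    Averages sigmoid((s_spam - s_ham) / tau) over all (spam, non-spam)
    pairs. As tau shrinks toward 0 the value approaches the empirical
    AUC (assuming no tied scores); larger tau gives a smoother surface.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one spam and one non-spam example")
    total = sum(_sigmoid((p - n) / tau) for p in pos for n in neg)
    return total / (len(pos) * len(neg))
```

The choice of `tau` trades fidelity to the true ROC area against smoothness, and it would itself need to be chosen with some care.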

So my questions are:

1. How can I improve the predictive power of the model when the
parameters have to be tuned for each time series input, which as a
whole is deemed one sample, so that different samples need different
parameters?

2. Is there a closed-form expression for the Gini coefficient etc. that
is based on the predicted spam probability (this is the output of my
model) and that can be used analytically, without "Monte Carlo"?
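
To illustrate the kind of expression question 2 has in mind: if one is willing to assume that the predicted probabilities are well calibrated and that the labels are independent given those probabilities, the ROC area can be approximated from the probabilities alone, with no observed labels and no simulation. The sketch below uses a ratio of expectations, which is an approximation to the expected AUC and not an exact identity; the function names are hypothetical.

```python
def expected_auc(probs):
    """Approximate expected AUC from predicted spam probabilities only.

    Assumes (strongly) that probs[i] is the true chance that email i is
    spam and that labels are independent. Each ordered pair (i, j) is
    weighted by probs[i] * (1 - probs[j]), the chance that i is spam
    and j is not; the pair counts as a win if i is scored higher.
    """
    num = 0.0
    den = 0.0
    for i, pi in enumerate(probs):
        for j, pj in enumerate(probs):
            if i == j:
                continue
            w = pi * (1.0 - pj)
            den += w
            if pi > pj:
                num += w
            elif pi == pj:
                num += 0.5 * w
    if den == 0.0:
        raise ValueError("probabilities admit no (spam, non-spam) pair")
    return num / den


def expected_gini(probs):
    """Corresponding Gini value, using the convention Gini = 2*AUC - 1."""
    return 2.0 * expected_auc(probs) - 1.0
```

Because this depends only on one sequence's predicted probabilities, it could in principle be evaluated per sequence, though it is only as trustworthy as the calibration assumption.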

Thanks a lot!
