I am playing with the spam data. There are several models I'd like to try out.
Once trained, these models generate a spam probability and predict whether an incoming email is spam. They have parameters that need to be tuned during training. Normally, I can use train/test cross-validation together with the power curve, Gini coefficient, ROC curve, etc. to tune those parameters.
However, I ran into difficulty with one model:
This model treats the incoming training sequence as a stochastic process, and its parameters depend on the training sequence viewed as a time series. If the order of the training sequence changes, the parameters change as well.
Thus each training sequence has its own particular set of parameters, and for the purpose of tuning one set of parameters the whole training sequence counts as only one sample. I cannot use train/test cross-validation to tune the parameters, because the whole input sequence is just one sample.
I do have a number of such sequences; they can be regarded as tens of samples if each sequence is deemed one sample. The problem is that the parameters for each sequence (i.e. each sample) are different.
Let's say I have 10 such input sequences, and for each one there are 6 parameters to tune.
Each sequence yields its own set of parameters, which is used to predict the probability that the next incoming email in that sequence is spam.
So if I take the usual approach based on the ROC curve, the Gini coefficient, or the area under the curve, I have to tune 10 x 6 = 60 parameters altogether.
I have a closed-form expression for the spam probability generated by this model, and for the test cases I can calculate the power curve, Gini coefficient, ROC curve, Lorenz curve, and false alarm rate. If I had a closed-form expression for the Gini coefficient in terms of my predicted spam probability, I would not need to pool all 60 parameters together.
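For reference, the empirical quantities involved can be written down as follows. This assumes the usual convention that the Gini coefficient is twice the area under the ROC curve minus one. The symbols are notation introduced only for this illustration: $s_i$ is the predicted spam probability of email $i$, $y_i \in \{0, 1\}$ is its true label, and $n_1$, $n_0$ are the numbers of spam and non-spam emails in the test set.

$$
\widehat{\mathrm{AUC}} = \frac{1}{n_1 n_0} \sum_{i:\, y_i = 1} \; \sum_{j:\, y_j = 0} \left[ \mathbf{1}(s_i > s_j) + \tfrac{1}{2}\,\mathbf{1}(s_i = s_j) \right],
\qquad
\widehat{G} = 2\,\widehat{\mathrm{AUC}} - 1
$$

This is an exact finite sum over the test emails, but it depends on the predicted probabilities only through their ordering, so it is not a smooth function of the model parameters.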
I have to pool all 10 sequences and 60 parameters together because I have to use the empirical data and do something similar to "Monte Carlo" in order to generate the ROC curve.
But if I could wrap the ROC curve, the Gini coefficient, and the other measures of predictive power into one closed-form expression, I would not need to pool all the empirical data together. I could carry out the tuning analytically, and then perhaps tune the 6 parameters for each sequence individually, which would reduce the workload drastically.
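To make the pooling concrete, here is a minimal sketch of the empirical calculation, assuming the Gini coefficient is defined as 2 x AUC - 1. The data are a hypothetical stand-in (randomly generated scores and labels), not output from the actual model, and the names `auc_pairwise`, `gini`, and `sequences` are made up for illustration.

```python
import random


def auc_pairwise(scores, labels):
    """Empirical AUC: share of (spam, non-spam) pairs in which the spam
    email received the higher score; tied pairs count as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one spam and one non-spam email")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def gini(scores, labels):
    """Gini coefficient under the assumed convention Gini = 2 * AUC - 1."""
    return 2.0 * auc_pairwise(scores, labels) - 1.0


# Hypothetical stand-in for the model output: 10 sequences of 50 emails,
# each email a (predicted_spam_probability, true_label) pair.
rng = random.Random(0)
sequences = []
for _ in range(10):
    seq = []
    for _ in range(50):
        label = 1 if rng.random() < 0.3 else 0
        noisy = 0.5 + 0.25 * (2 * label - 1) + rng.gauss(0.0, 0.2)
        seq.append((min(1.0, max(0.0, noisy)), label))
    sequences.append(seq)

# Pooled evaluation: one Gini over all emails from all sequences.
pooled_scores = [p for seq in sequences for p, _ in seq]
pooled_labels = [y for seq in sequences for _, y in seq]
pooled_gini = gini(pooled_scores, pooled_labels)

# Per-sequence evaluation: one Gini per sequence, no pooling.
per_sequence_gini = [
    gini([p for p, _ in seq], [y for _, y in seq]) for seq in sequences
]

print("pooled Gini:", round(pooled_gini, 3))
print("per-sequence Gini:", [round(g, 3) for g in per_sequence_gini])
```

In the pooled version every score is compared against scores from all other sequences, which is why all 60 parameters end up coupled in a single objective. The per-sequence version only compares emails within the same sequence.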
So my questions are:
1. How can I improve the predictive power of the model when there are parameters to be tuned for each time-series input, which is treated as a single sample as a whole, so that different samples need different parameters?
2. Is there a closed-form expression for the Gini coefficient etc. that is based on the predicted spam probability (the output of my model) and that can be used analytically, without "Monte Carlo"?