First of all, please excuse me if this post reappears in a couple of hours - I've sent it via the news server I use at work, which sometimes posts messages with huge delay, or doesn't post them at all, and by waiting till tomorow to see what happens I would miss a whole working day of US-time - and since I would like to get help as soon as possible, I'm posting it via another server. The rush is also the excuse for posting to two sci.stat newsgroups. ---------- I would greatly appreciate help on the following matter - preferably from someone familiar with the topic of relative survival, but the issue (and my ignorance) is rather general so that anyone dealing with GLM might be able to help.
We are trying to familiarise ourselves with the Hakulinen approach to modelling relative survival and first we would like to replicate the results of Prof. Hakulinen (who held a course on the topic at our institute last year) and associates. Unfortunately, all their publications and analyses relate to code in GLIM, which we have no access to; but we have SPSS (soon we should upgrade to ver. 11 which, I believe, handles such models), Stata (a colleague, actually a colleaguess :) is mastering it) and of course R (which I am starting to struggle with), so we would like to be able to implement the regression approach to relative survival using one of these packages. (preferably Stata :)
To use the model, the patients are first grouped into strata, one stratum for each combination of predictors of interest (say: age group[4 values], gender, period of diagnosis and follow-up year[1..5]). A life table is then estimated for each stratum (k), whereby the number of deaths (d_ki) in a given follow-up interval (i) among the l'_ki patients at risk is counted and assumed to be binomially distributed; p_ki is defined as the interval-specific observed survival rate while p*_ki denotes the expected survival rate (obtained from mortality tables for general population). The additive hazards model, also known as the relatve survival model, can be expressed as
ln ( -ln ( p_ki/p*_ki)) = x_vector beta_vector
The manual says: "This implies a generalised linear model, the outcome being l'_ki - d_ki (the number of patients surviving the interval), the error structure binomial with denominator l'_ki and the link function complementary log-log combined with a division by p*_ki."
Now, for the dataset of interest (the one on melanoma refered to in http://www.cancerregistry.fi/surv2/manual.pdf) we have obtained p_ki and p*_ki for each stratum and we would like to fit the model, hoping to get results identical to, e.g., Table 11.2 in the abovementioned manual). The colleague figured that in Stata the right procedure should be cloglog, while in R I imagine it would be glm with family=binomial; but exactly how to arrange the data??? And exactly what syntax to use??? - It would be sooo nice if someone told me this!
We have constructed the dataset with 80 cases (1 per stratum: 4x2x2x5=80). I understand things to the point that if we use the ratio p_ki/p*_ki for each stratum as the dependent variable we can not feed such grouped data into the program (Stata or R, whichever), because if nothing else the GLM routine should somehow be told what l'_ki corresponds to each case. Reading help on Stata's cloglog function I saw the weighting option, which includes weighf to weight by frequency, so I imagine we might construct two cases for each stratum, one with y=1 and weight=l'_ki (p_ki/p*_ki) and one with y=0 and weight=l'_ki (1-(p_ki/p*_ki)), and then use cloglog with dependent variable y, independents group, gender, period of diagnosis and follow-up year, and weightf by weight option. Is this correct??? Or should we now produce l'_ki (p_ki/p*_ki) cases with y=1 and l'_ki (1-(p_ki/p*_ki)) cases with y=0, all covariates being equal, from each stratum??? And then run which command???