SECTION 1
Heart Disease Dataset [15 marks]
The dataset heart in the SAS course library contains data on 303 tests for the diagnosis of heart disease. The variables are Age (in years), sex, chest pain type, resting blood pressure, serum cholesterol, fasting blood sugar, resting electrocardiographic results, maximum heart rate achieved during an exercise test, exercise induced angina, ST depression induced by exercise, the number of major blood vessels coloured by fluoroscopy, blood disorder and the presence of disease.
Variable Description
Age Numeric age
Sex 1 = Male; 0 = Female
Cp Chest Pain Type (1 = typical angina; 2 = atypical angina; 3 = non-anginal pain; 4 = asymptomatic)
Trestbps Resting blood pressure (in mm Hg on admission to the hospital)
Chol Serum cholesterol in mg/dl
Fbs Fasting blood sugar > 120 mg/dl (1 = yes; 0 = no)
Restecg Resting electrocardiographic results (0 = normal; 1 = ST-T wave abnormality; 2 = left ventricular hypertrophy)
Exang Exercise induced angina (1 = yes; 0 = no)
Oldpeak ST depression induced by exercise relative to rest
Slope Slope of the peak exercise ST segment (1 = upsloping; 2 = flat; 3 = downsloping)
ca Number of major vessels (0-3) coloured by fluoroscopy
Thal Type of blood disorder (3 = normal; 6 = fixed defect; 7 = reversible defect)
Disease The rating of heart disease (on a scale from 0-4), where 4 is most severe
Disease_binary Is heart disease present (1 = yes; 0 = no)
The dataset is obtained and modified from UCI Machine Learning Repository. The full dataset is described and analysed in:
Detrano, R., Janosi, A., Steinbrunn, W., Pfisterer, M., Schmid, J., Sandhu, S., Guppy, K., Lee, S., & Froelicher, V. (1989). International application of a new probability algorithm for the diagnosis of coronary artery disease. American Journal of Cardiology, 64, 304--310.
Our OBJECTIVE is to explain the presence of severe heart disease based on the variables available. We would like to develop a model to assist clinicians and hospital staff in determining a patient’s outcome. This allows us to treat at-risk patients, whilst reducing the need for unnecessary treatments for other patients.
[8 marks] Present your final model by pasting the table from SAS that contains the coefficients. Briefly discuss any decisions you made in refining your models – this includes variables selection, the assumption of proportional odds, assessing model fit and any smaller models you tried out. You are not required to find a model that is a perfect fit to the data (which may not be possible using the techniques you have covered so far in this course). Simply choose ONE reasonable model to present and discuss.
The model should predict the probability that a patient has severe heart disease from the final predictor variables. The disease variable was taken as the response, with the least severe category (0) as the reference category (i.e. using descending in the code). As this variable has a natural order and more than two categories, the first model fitted was an ordinal logistic regression model. A bivariate analysis of the variables was conducted in SAS, followed by a manual forward selection. Smaller models were built from the original variables together with interaction and polynomial terms such as age*age, slope*oldpeak, sex*trestbps and many more. Each of these smaller models was evaluated on whether its terms were statistically significant and whether they added useful information about the response variable.
Hence, the final model contains the predictor variables sex, ca, oldpeak and cp. Each is statistically significant, as its p-value in the Type 3 Analysis of Effects table is less than 0.05: for each effect we reject the null hypothesis that all of its coefficients are zero, in favour of the alternative that at least one is non-zero. Additionally, for an ordinal logistic regression model to be appropriate, the assumption of proportional odds must hold.
H0: The proportional odds assumption holds
HA: The proportional odds assumption does not hold
Based on the score test, the chi-square test statistic is 90.0104 and the p-value is <0.0001. As the p-value is less than 0.05, we reject the null hypothesis in favour of the alternative and conclude that the proportional odds assumption does not hold for this model. As a result, we fit a multinomial logistic model with the same predictor variables. We then have to check that this model fits the data well compared with the fully saturated model.
H0: The model with sex, ca, oldpeak and cp as predictors fits the data well
HA: The model does not fit the data well, so the fully saturated model is needed
The p-values for both the Deviance test (test statistic 391.0841) and the Pearson test (test statistic 512.3130) are 1.00. As 1.00 > 0.05, we do not reject the null and conclude that the model with these predictors fits the data well.
[5 marks] Interpret the impact of each variable on the probability of a patient experiencing heart disease, and presenting with more severe heart disease.
When looking at severe heart disease we use the highest disease rating, which is 4. The regression equation for disease (4) is:
Log(pMostSEVERE / pNODISEASE) = -3.6237 + 1.3464*sex + 0.5299*oldpeak + 1.8095*ca(1) + 1.3314*ca(2) + 1.4632*ca(3) + 0.1411*cp(1) - 0.4436*cp(2) + 1.6584*cp(3)
The less severe cases of disease have the following regression equations:
Log(pQuiteSEVERE / pNODISEASE) = -6.9025 + 1.2242*sex + 1.1205*oldpeak + 2.7206*ca(1) + 2.6358*ca(2) + 2.7666*ca(3) + 0.9111*cp(1) + 0.5607*cp(2) + 3.4073*cp(3)
Log(pMediumSEVERE / pNODISEASE) = -18.5534 + 1.5347*sex + 1.2328*oldpeak + 2.2496*ca(1) + 3.2904*ca(2) + 3.0858*ca(3) + 13.0740*cp(1) + 11.7678*cp(2) + 14.4226*cp(3)
Log(pNotSEVERE / pNODISEASE) = -7.9377 + 1.8978*sex + 1.361*oldpeak + 2.4402*ca(1) + 2.2799*ca(2) + 4.2662*ca(3) - 9.7202*cp(1) - 1.2646*cp(2) + 2.208*cp(3)
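The four log-odds equations can be turned into predicted probabilities for all five disease categories. The following is a minimal Python sketch of that conversion, assuming the coefficients as printed above; the function name, the category labels and the example patient are hypothetical and only serve as illustration.

```python
import math

# Coefficients copied from the four equations above.
# Order: intercept, sex, oldpeak, ca(1), ca(2), ca(3), cp(1), cp(2), cp(3).
COEFS = {
    "MostSevere":   [-3.6237, 1.3464, 0.5299, 1.8095, 1.3314, 1.4632, 0.1411, -0.4436, 1.6584],
    "QuiteSevere":  [-6.9025, 1.2242, 1.1205, 2.7206, 2.6358, 2.7666, 0.9111, 0.5607, 3.4073],
    "MediumSevere": [-18.5534, 1.5347, 1.2328, 2.2496, 3.2904, 3.0858, 13.0740, 11.7678, 14.4226],
    "NotSevere":    [-7.9377, 1.8978, 1.361, 2.4402, 2.2799, 4.2662, -9.7202, -1.2646, 2.208],
}


def category_probabilities(x):
    """Return the probability of each category for one patient.

    x holds the eight predictor values in the order listed above
    (without the intercept). "NoDisease" is the reference category.
    """
    logits = {
        name: b[0] + sum(bj * xj for bj, xj in zip(b[1:], x))
        for name, b in COEFS.items()
    }
    denom = 1.0 + sum(math.exp(v) for v in logits.values())
    probs = {"NoDisease": 1.0 / denom}
    probs.update({name: math.exp(v) / denom for name, v in logits.items()})
    return probs


# Hypothetical patient: male, oldpeak = 1.0, ca(1) = 1, cp(3) = 1.
example = [1, 1.0, 1, 0, 0, 0, 0, 1]
probs = category_probabilities(example)
for name, p in probs.items():
    print(f"{name}: {p:.3f}")
```

Because every equation uses no disease as its reference, the five probabilities always sum to one.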
For the regression equation for severe heart disease (4) we can interpret the coefficients as follows:
The odds of severe heart disease (4) vs no heart disease (0) for males are 3.844 times the odds for females
A one unit increase in oldpeak multiplies the odds of severe heart disease (4) vs no heart disease (0) by a factor of 1.699 (a 69.9% increase)
The odds of severe heart disease (4) vs no heart disease (0) for those with 1 major vessel coloured by fluoroscopy are 6.108 times the odds for those with 0 major vessels coloured
The odds of severe heart disease (4) vs no heart disease (0) for those with 2 major vessels coloured by fluoroscopy are 3.786 times the odds for those with 0 major vessels coloured
The odds of severe heart disease (4) vs no heart disease (0) for those with 3 major vessels coloured by fluoroscopy are 4.320 times the odds for those with 0 major vessels coloured
The odds of severe heart disease (4) vs no heart disease (0) for those with atypical angina are 1.152 times the odds for those with typical angina
The odds of severe heart disease (4) vs no heart disease (0) for those with non-anginal pain are 0.642 times the odds for those with typical angina (35.8% lower)
The odds of severe heart disease (4) vs no heart disease (0) for those with asymptomatic chest pain are 5.251 times the odds for those with typical angina
The effects with the largest odds ratios in this equation are ca(1) [one major vessel coloured by fluoroscopy] (6.108), cp(3) [asymptomatic chest pain] (5.251), ca(3) [three major vessels coloured by fluoroscopy] (4.320) and sex (3.844).
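The odds ratios quoted above are the exponentials of the coefficients in the disease (4) equation. A short Python sketch of that arithmetic, assuming the rounded coefficients as printed (the dictionary name is illustrative only):

```python
import math

# Coefficients of the disease (4) vs no disease (0) equation, as printed above.
coefs_severe = {
    "sex": 1.3464,
    "oldpeak": 0.5299,
    "ca(1)": 1.8095,
    "ca(2)": 1.3314,
    "ca(3)": 1.4632,
    "cp(1)": 0.1411,
    "cp(2)": -0.4436,
    "cp(3)": 1.6584,
}

# An odds ratio is exp(coefficient); values below 1 mean lower odds.
odds_ratios = {name: math.exp(b) for name, b in coefs_severe.items()}

for name, ratio in sorted(odds_ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {ratio:.3f}")
```

Sorting by size makes it easy to see which effects change the odds the most.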
[2 marks] In order to prevent severe heart disease, what characteristics indicate that the patient should receive increased treatment from medical staff?
To prevent severe heart disease, early treatment from medical staff is advised when the following characteristics occur together: the patient is male, has a high oldpeak (ST depression induced by exercise), has three major vessels coloured by fluoroscopy, and has asymptomatic chest pain. The same holds for heart disease overall, as ca(3) and cp(3) have large positive coefficients in all four equations.
SECTION 2
Citations Dataset [15 marks]
The dataset equationcitations in the SAS course library contains the number of citations of papers published in three top Evolutionary Biology journals, including self-citations, and citations from non-theoretical papers, along with the number of equations each paper has used.
Journal
Factor. Journal in which the paper was published (The American Naturalist, Evolution, Proceedings of the Royal Society of London B: Biological Sciences).
Authors
Character. Names of authors.
Volume
Volume in which the paper was published.
Startpage
Starting page of publication.
Pages
Number of pages.
Equations
Number of equations in total.
Mainequations
Number of equations in main text.
Appequations
Number of equations in appendix.
Cites
Number of citations in total.
Selfcites
Number of citations by the authors themselves.
Othercites
Number of citations by other authors.
Theocites
Number of citations by theoretical papers.
Nontheocites
Number of citations by nontheoretical papers.
The original analysis of the dataset can be found in:
Fawcett, T.W. and Higginson, A.D. (2012). Heavy Use of Equations Impedes Communication among Biologists. PNAS -- Proceedings of the National Academy of Sciences of the United States of America, 109, 11735--11739. http://dx.doi.org/10.1073/pnas.1205259109
[8 marks] Present your final model by pasting the table from SAS that contains the coefficients. Briefly discuss any decisions you made in refining your models. You are not required to find a model that is a perfect fit to the data (which may not be possible using the techniques you have covered so far in this course). Simply choose ONE reasonable model to present and discuss.
The final model predicts the number of citations of a paper by other authors (othercites) from the predictor variables provided. As othercites is a count variable, a Poisson regression was used. This response was chosen because it is more informative to predict how often other authors cite the work than how often the authors cite their own work. As papers with more pages tend to receive more citations, pages was used as an offset term so that expected citations are compared at the same exposure. In effect, the model describes citations per page.
Overall model fit was evaluated using the Criteria For Assessing Goodness Of Fit table. Several smaller models with different combinations of predictor variables were tried; the final model had the lowest AIC, AICC and BIC values, as well as the lowest deviance, of the models compared. We must now check that this model fits the data well compared with the fully saturated model.
H0: The model with predictor variables fits the data well
HA: The model does not fit the data well, so the fully saturated model is needed
As the Value/DF for both Deviance tests is 1.1440, which is close to 1, we do not reject the null; a ratio this close to 1 suggests that the p-value is greater than 0.05. Hence, we can conclude that the model with these predictors fits the data well.
The final model has the following predictor variables: pages, pages*pages, journal, mainequations, mainequations*pages, selfcites, appequations and appequations*pages. The interaction terms between mainequations/appequations and pages were included because some equations may span multiple pages. Although these interactions were deemed statistically insignificant, they were kept in the final model, as they suggest that it is better to have equations that do not span multiple pages. The predictor variable selfcites was included because authors may cite their own articles to raise awareness of the current article. The polynomial term pages*pages was included to stabilise the variance in the model. It also improved the overall model and the estimated effects of the other variables.
Based on the ‘Analysis of Maximum Likelihood Parameter Estimates’ table, most of the p-values for the predictor variables were <0.05, indicating that these predictors are statistically significant: for each of them we can reject the null hypothesis that its coefficient is zero. However, some predictor variables in the final model had p-values greater than 0.05. These were kept because the information they provide in explaining the response variable takes precedence over the p-value.
[7 marks] Suppose you are co-author on an evolutionary biology paper, and you would like to communicate your work as widely as possible. What advice do you give to your co-author on the presentation of equations in your work?
From the above table we get the regression equation:
log(µ) = 0.6259 + 0.2055*pages - 0.0044*(pages^2) - 0.0373*journal(AmNat) - 0.0513*journal(Evolu) - 0.0052*mainequations - 0.0082*selfcites + 0.0003*(pages*mainequations) + 0.0570*appequations - 0.0023*(pages*appequations)
The intercept is the log of the expected number of othercites (e^0.6259, about 1.87) when there are zero pages, zero equations (mainequations, appequations), zero selfcites, and the journal is ProcB.
To judge the impact of each coefficient on othercites we use the following formula:
percentage change in the rate of othercites = (e^(beta coefficient) - 1) * 100
A one unit increase in pages is associated with a 22.81% increase in the rate of othercites when all other predictor variables are controlled.
A one unit increase in pages*pages is associated with a 0.44% decrease in the rate of othercites when all other predictor variables are controlled.
Journal(AmNat) is associated with a 3.67% lower rate of othercites than journal(ProcB) when all other predictor variables are controlled.
Journal(Evolu) is associated with a 5% lower rate of othercites than journal(ProcB) when all other predictor variables are controlled.
A one unit increase in mainequations is associated with a 0.52% decrease in the rate of othercites when all other predictor variables are controlled.
A one unit increase in selfcites is associated with a 0.82% decrease in the rate of othercites when all other predictor variables are controlled.
A one unit increase in the interaction pages*mainequations is associated with a 0.03% increase in the rate of othercites when all other predictor variables are controlled.
A one unit increase in appequations is associated with a 5.87% increase in the rate of othercites when all other predictor variables are controlled.
A one unit increase in the interaction pages*appequations is associated with a 0.23% decrease in the rate of othercites when all other predictor variables are controlled.
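The percentage changes above all come from the same transformation of a Poisson coefficient. A minimal Python sketch, assuming the rounded coefficients from the regression equation (the helper name percent_change is hypothetical); small differences in the last digit can arise because the reported figures may have used unrounded estimates:

```python
import math

# Rounded coefficients from the Poisson regression equation above.
poisson_coefs = {
    "pages": 0.2055,
    "pages*pages": -0.0044,
    "journal(AmNat)": -0.0373,
    "journal(Evolu)": -0.0513,
    "mainequations": -0.0052,
    "selfcites": -0.0082,
    "pages*mainequations": 0.0003,
    "appequations": 0.0570,
    "pages*appequations": -0.0023,
}


def percent_change(beta):
    """Percentage change in the expected count for a one unit increase."""
    return (math.exp(beta) - 1.0) * 100.0


changes = {name: percent_change(b) for name, b in poisson_coefs.items()}

for name, pct in changes.items():
    print(f"{name}: {pct:+.2f}%")
```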
To demonstrate the model's predictions, we take an example from the data set and compute the expected number of citations.
The 394th data point has 10 pages, zero equations, 10 self-cites and is in the journal ProcB.
Othercites = e^(0.6259 + 10*0.2055 - 10*10*0.0044 - 10*0.0082) = 8.67
This paper had 12 othercites in reality.
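The worked prediction can be reproduced in a few lines of Python. This is a sketch under stated assumptions: it uses the rounded coefficients from the regression equation, takes the input values that the exponent in the calculation appears to use (10 pages, a self-cites value of 10, no equations, journal ProcB as the baseline), and follows the text in not adding an offset term. The function name is hypothetical.

```python
import math


def expected_othercites(pages, selfcites, mainequations=0, appequations=0,
                        journal="ProcB"):
    """Expected othercites from the rounded regression equation above."""
    journal_effect = {"ProcB": 0.0, "AmNat": -0.0373, "Evolu": -0.0513}[journal]
    log_mu = (
        0.6259
        + 0.2055 * pages
        - 0.0044 * pages ** 2
        + journal_effect
        - 0.0052 * mainequations
        - 0.0082 * selfcites
        + 0.0003 * pages * mainequations
        + 0.0570 * appequations
        - 0.0023 * pages * appequations
    )
    return math.exp(log_mu)


# Inputs assumed from the worked example: 10 pages, selfcites = 10, no equations.
prediction = expected_othercites(pages=10, selfcites=10)
print(f"Expected othercites: {prediction:.2f}")
```

With rounded coefficients the result can differ from the reported value in the second decimal place.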
To reach the broadest audience with a journal article (represented by othercites) and have it cited as much as possible, a higher page count helps, but only up to a point because of the negative pages^2 term. With the rounded coefficients above and ignoring the interaction terms, expected citations peak at about 23 pages (0.2055 / (2*0.0044)), and beyond about 47 pages (0.2055 / 0.0044) the combined page effect becomes negative. Another suggestion is to include equations in the main body only if the article has many pages, as mainequations lowers the expected number of citations while pages*mainequations raises it.
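The effect of the quadratic pages term can be checked directly from the two page coefficients. The sketch below assumes the rounded values 0.2055 and -0.0044 and a paper with no equations, so the interaction terms drop out; the variable names are illustrative only.

```python
# Rounded coefficients for pages and pages^2 from the regression equation.
B_PAGES = 0.2055
B_PAGES_SQ = -0.0044


def pages_effect(pages):
    """Contribution of pages and pages^2 to log(mu), ignoring interactions."""
    return B_PAGES * pages + B_PAGES_SQ * pages ** 2


# The quadratic b1*p + b2*p^2 peaks at p = -b1 / (2*b2) ...
peak_pages = -B_PAGES / (2 * B_PAGES_SQ)
# ... and returns to zero (the same effect as zero pages) at p = -b1 / b2.
zero_effect_pages = -B_PAGES / B_PAGES_SQ

print(f"Page effect peaks at about {peak_pages:.1f} pages")
print(f"Page effect turns negative beyond about {zero_effect_pages:.1f} pages")
```

With equations present the interaction terms shift these points, so the figures are only indicative.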
Sample Solution