STAT 3022 Applied Linear Models
Question 1 (45 marks)
An analyst studies the relationship between the salaries (Y) of academics at a university in the US and X1 = number of years since PhD, X2 = number of years of service, X3 = gender, which is a categorical variable with 2 levels (Female and Male), and X4 = academic rank, which is a categorical variable with 3 levels (A, B, and C). The analyst fitted several models in R, whose selected outputs are given at the end of this document.
Based on the outputs, answer the following questions. Note that “not enough information to answer” can be the correct response.
-
(i) First, the analyst fitted the model m1 that only contains the two quantitative (continuous) variables X1 and X2. Obtain the ANOVA table for this model.
-
(ii) The model m2 in the output contains the two quantitative variables X1, X2 and the categorical variable X3, without any interaction. Complete the summary table of the model m2 below.
-
(iii) For the model m2, state the models under the null and alternative hypotheses for yrs.service in (1) the summary table (outlined in part (ii) of this question) and (2) the ANOVA table (given in the output).
-
(iv) From the model m2, obtain the point prediction and 95% prediction intervals for (1) a female academic with 5 years since PhD and 2 years of service, and (2) a male academic with 3 years since PhD and 2 years of service.
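As a reminder of what such an interval requires, the standard prediction interval for a new observation in a linear model with p regression coefficients and n observations can be written as below. The symbols x_h (the covariate vector of the new academic, including the intercept and the gender indicator), X (the design matrix), and MSE are notation assumed here, not taken from the output.

```latex
\hat{y}_h \;\pm\; t_{1-\alpha/2,\; n-p}\,
\sqrt{\mathrm{MSE}\,\bigl(1 + x_h^{\top}(X^{\top}X)^{-1}x_h\bigr)},
\qquad \hat{y}_h = x_h^{\top}\hat{\beta}
```

The point prediction needs only the estimated coefficients, while the interval also needs MSE and the quantity involving (X'X)^{-1}.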
-
(v) Consider the model m3 with the additive effects of X1, X2, X4, as well as the interaction effect between X4 and X1 and the interaction effect between X4 and X2. Given that the normality assumption is reasonable, conduct an appropriate F-test to conclude whether all the interaction terms can be dropped from the model. Please specify the models under the null and alternative hypotheses, the value of the test statistic, and the p-value of the test.
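For reference, the general linear (partial) F-test comparing a reduced model R nested in a full model F has the form below. The subscripts R and F and the symbols SSE and df (error sum of squares and error degrees of freedom) are notation assumed here.

```latex
F^{*} \;=\;
\frac{\bigl(\mathrm{SSE}_R - \mathrm{SSE}_F\bigr)\,/\,\bigl(df_R - df_F\bigr)}
     {\mathrm{SSE}_F \,/\, df_F}
\;\sim\; F_{\,df_R - df_F,\; df_F}
\quad \text{under } H_0
```

The p-value is the upper-tail probability of this F distribution at the observed F*, which in R can be obtained with `pf(Fstar, df1, df2, lower.tail = FALSE)`.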
Question 2 (45 marks)
The Scholastic Aptitude Test, or SAT, is a standard test used throughout the United States to determine college entrance. When it was first introduced in 1982, the large variation among average SAT scores between the states became an area of great concern for some states and great pride for others. But what causes such variation? A scientist set out to determine the extent to which demographic variables influenced SAT scores, and carried out a large study to address this issue. They measured, for each state (state): the average total SAT score (Y = total); the total state expenditure on secondary schools, expressed in hundreds of dollars per student (X1 = expend); the average students/teacher ratio (X2 = ratio); the average teacher salary (X3 = salary), and percentage of SAT takers (X4 = perc). The dataset SATscore is available on Canvas.
-
(i) Obtain the pairwise scatterplots and a correlation matrix among the variables in the dataset (note that state should be only treated as the row name, not a variable in this dataset).
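A minimal sketch of the two base-R calls involved, using the built-in `mtcars` data as a stand-in. The column choice and the object names `dat` and `cm` are illustrative assumptions, not the SATscore data.

```r
# Stand-in data: built-in mtcars, NOT the SATscore dataset.
# mtcars already stores the car names as row names, mirroring how
# `state` should be a row name rather than a variable.
dat <- mtcars[, c("mpg", "disp", "hp", "wt")]

pairs(dat)        # pairwise scatterplot matrix
cm <- cor(dat)    # Pearson correlation matrix
round(cm, 3)
```

If an identifier is stored as an ordinary column, it can be moved to the row names first, for example with `rownames(df) <- df$id; df$id <- NULL` (with `df` and `id` as hypothetical names).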
-
(ii) Fit the multiple linear regression of the outcome on all the covariates. Obtain the summary table, and based on it, write the fitted regression equation.
-
(iii) Conduct appropriate model diagnostics to check the normality and the constant variance assumption of the model.
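One common way to carry out these checks in base R is sketched below on the built-in `mtcars` data. The model formula and the names `fit` and `sw` are illustrative assumptions; the Shapiro-Wilk test is one possible supplement to the plots, not necessarily the one used in the course.

```r
# Stand-in model on built-in data, NOT the SAT regression.
fit <- lm(mpg ~ disp + hp + wt, data = mtcars)

par(mfrow = c(1, 2))
plot(fit, which = 1)   # residuals vs fitted: look for funnel shapes (non-constant variance)
plot(fit, which = 2)   # normal Q-Q plot of standardized residuals

# Optional formal check of normality of the residuals
sw <- shapiro.test(residuals(fit))
sw
```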
-
(iv) Can any state be considered an influential observation for the model? For any state that you consider influential, is it influential mostly because it is an outlier, a high-leverage point, or both? If it is a high-leverage point, what causes the high leverage?
Clearly provide the evidence (e.g. numbers, plots, etc.) that supports your conclusion.
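The usual numerical evidence (leverage, studentized residuals, Cook's distance) can be collected in base R as sketched below on the built-in `mtcars` data. The cutoffs 2p/n, 2, and 4/n are common rules of thumb assumed here; the lecture notes may use different ones.

```r
# Stand-in model on built-in data, NOT the SAT regression.
fit <- lm(mpg ~ disp + hp + wt, data = mtcars)
n <- nrow(mtcars)
p <- length(coef(fit))       # number of coefficients, including the intercept

diag_tab <- data.frame(
  leverage = hatvalues(fit),       # diagonal of the hat matrix
  rstudent = rstudent(fit),        # externally studentized residuals
  cooks    = cooks.distance(fit)   # Cook's distance
)

# Rows flagged by at least one (assumed) rule-of-thumb cutoff
flagged <- with(diag_tab, leverage > 2 * p / n | abs(rstudent) > 2 | cooks > 4 / n)
diag_tab[flagged, ]
```

Comparing which column triggers the flag for a given row indicates whether the point stands out as an outlier, as a high-leverage point, or both.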
-
(v) One common measure of multicollinearity in a model with all continuous covariates is the variance inflation factor (VIF), which is defined as follows. To compute the VIF for Xk, we treat Xk as the response, and then fit a multiple linear regression of Xk on all the other covariates in the model. Denoting by $R_k^2$ the multiple R-squared of this model, we have $\mathrm{VIF}_k = 1/(1 - R_k^2)$. Using this definition, obtain the VIF for all the covariates in the model. You are not allowed to use any additional package in R to compute it.
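The procedure described above can be sketched in base R as follows, using the standard formula VIF_k = 1 / (1 - R_k^2). The helper name `vif_manual` and the `mtcars` covariates are hypothetical stand-ins, not the SATscore variables.

```r
# Hypothetical helper: VIF from auxiliary regressions, base R only.
vif_manual <- function(data, covariates) {
  sapply(covariates, function(k) {
    others <- setdiff(covariates, k)
    # Regress covariate k on all the other covariates
    aux <- lm(reformulate(others, response = k), data = data)
    1 / (1 - summary(aux)$r.squared)
  })
}

# Stand-in data: built-in mtcars, NOT the SATscore dataset.
vif_manual(mtcars, c("disp", "hp", "wt"))
```

The function returns a named vector with one VIF per covariate, so the rule-of-thumb comparison reduces to checking which entries exceed the chosen cutoff.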
-
(vi) A common rule of thumb is that if any covariate has a VIF greater than 5, then multicollinearity is serious in the model. Based on that rule, (1) comment on whether the model has serious multicollinearity, and (2) relate it to the pairwise correlation plot.
Question 3 (10 marks)
Consider the linear model with outcome Y and three quantitative (continuous) covariates X1, X2 and X3. Let rjk be the sample correlation between Xj and Xk. Assume X1 and X2 are uncorrelated, i.e. r12 = 0. From the lecture notes, we know that SSR(X1, X2) = SSR(X1) + SSR(X2) in this case.
Now, assume both X1 and X2 are correlated with X3, i.e. r13 ≠ 0, r23 ≠ 0. In this case, is the claim that SSR(X1, X2 | X3) = SSR(X1 | X3) + SSR(X2 | X3) always true? If so, prove it. If not, provide at least one counterexample (numerically or theoretically) where the above equality does not hold.
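For reference, assuming the lecture notes use the standard extra-sum-of-squares notation, the conditional quantities in the claim are defined through the error sums of squares (SSE) of nested models:

```latex
\begin{aligned}
\mathrm{SSR}(X_1, X_2 \mid X_3) &= \mathrm{SSE}(X_3) - \mathrm{SSE}(X_1, X_2, X_3),\\
\mathrm{SSR}(X_1 \mid X_3)      &= \mathrm{SSE}(X_3) - \mathrm{SSE}(X_1, X_3),\\
\mathrm{SSR}(X_2 \mid X_3)      &= \mathrm{SSE}(X_3) - \mathrm{SSE}(X_2, X_3).
\end{aligned}
```

Each SSE can be read from an `lm` fit in R with `sum(residuals(fit)^2)`, which is one way to examine the claim numerically.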