
Question 11.6

Consider the pizza delivery data described in Appendix A.4.

(a) Read the data into R. Fit a multiple linear regression model with delivery time as the outcome and temperature, branch, day, operator, driver, bill, number of ordered pizzas, and discount customer as covariates. Give a summary of the coefficients.

(b) Use R to calculate the 95% confidence intervals of all coefficients. Hint: the standard errors of the coefficients can be accessed either via the covariance matrix or the model summary.
(c) Reproduce the least squares estimate of σ². Hint: use residuals() to obtain the residuals.
(d) Now use R to estimate both R² and R^{2}_{adj}. Compare the results with the model output from (a).
(e) Use backward selection by means of the stepAIC function from the library MASS to find the best model according to AIC.
(f) Obtain R^{2}_{adj} from the model identified in (e) and compare it to the full model from (a).
(g) Identify whether the model assumptions are satisfied or not.
(h) Are all variables from the model in (e) causing the delivery time to be either delayed or improved?
(i) Test whether it is useful to add a quadratic polynomial of temperature to the model.
(j) Use the model identified in (e) to predict the delivery time of the last captured delivery (i.e. number 1266). Use the predict() command to ease the calculation of the prediction.

Step-by-Step

(a) The multiple linear regression model is fitted with the lm() command, separating the covariates with the + sign. Applying summary() to the model returns the comprehensive summary.
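The code below assumes the data have already been read into R and attached. A minimal sketch of this first step, assuming the data set is available as a CSV file named pizza.csv (the file name and format are assumptions; Appendix A.4 describes the data):

pizza <- read.csv("pizza.csv")  # file name is an assumption
attach(pizza)  # makes the columns (time, temperature, ...) directly accessible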

mp <- lm(time ~ temperature + branch + day + operator + driver
+ bill + pizzas + discount_customer)
summary(mp)

The output shows that lower temperature, higher bills, and more ordered pizzas increase the delivery times. The branch in the East is the fastest, and so is the driver Domenico. While there are differences with respect to day, discount customers, and the operator, they are not significant at the 5 % level.

(b) The confidence intervals are calculated as \hat{\beta}_{i} \pm t_{n-p-1;1-\alpha/2} \cdot \hat{\sigma}_{\hat{\beta}_{i}}. We know from the model output in (a) that there are 1248 degrees of freedom (1266 observations − 18 estimated coefficients). The respective quantile of the t-distribution is obtained with the qt() function. The coefficients are accessed via coefficients(mp) (alternatively: mp$coefficients); the standard errors of the coefficients are obtained either as the square roots of the diagonal elements of the covariance matrix (sqrt(diag(vcov(mp)))) or from the model summary (summary(mp)[[4]][,2]), both of which are somewhat laborious. The coefficients, lower confidence limits (lcl), and upper confidence limits (ucl) can be summarized in a matrix, e.g. by merging the individual columns with the cbind command.

lcl <- coefficients(mp) - qt(0.975,1248)*sqrt(diag(vcov(mp)))
ucl <- coefficients(mp) + qt(0.975,1248)*sqrt(diag(vcov(mp)))
cbind(coefficients(mp),lcl,ucl)
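The same intervals are returned by R's built-in confint() function, which provides a convenient cross-check of the manual calculation:

confint(mp, level = 0.95)  # should reproduce lcl and ucl from above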

(c) The variance is estimated as the residual sum of squares divided by the degrees of freedom, see also (11.27). Applying the residuals command to the model and using other basic operations yields an estimated variance of 28.86936.

\hat{\sigma}^{2} = \frac{(y-X\hat{\beta})^{\prime}(y-X\hat{\beta})}{n-(p+1)} = \frac{\hat{e}^{\prime}\hat{e}}{n-(p+1)} = \frac{1}{n-(p+1)}\sum\limits_{i=1}^{n}\hat{e}_{i}^{2}. \qquad (11.27)

sum(residuals(mp)^2)/(mp$df.residual)

Taking the square root of the result yields \sqrt{28.86936} \approx 5.373, which is also reported in the model output from (a) under “Residual standard error”.
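Alternatively, the residual standard error can be extracted from the model summary directly, so squaring it reproduces \hat{\sigma}^{2}:

summary(mp)$sigma^2  # approx. 28.87, matching the manual calculation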

(d) The sum of squares error is defined as \sum_{i=1}^{n}\left(y_{i}-\hat{y}_{i}\right)^{2}. The total sum of squares is \sum_{i=1}^{n}\left(y_{i}-\bar{y}\right)^{2}. Both can easily be calculated in R. The goodness of fit is then obtained as R² = 1 − SQ_{Error}/SQ_{Total} = 0.3178. Dividing SQ_{Error} by n − p − 1 = 1266 − 17 − 1 = 1248 and SQ_{Total} by n − 1 = 1265 yields R^{2}_{adj} = 0.3085.
This corresponds to the model output from (a).

SQE <- sum(residuals(mp)^2)
SQT <- sum((time-mean(time))^2)
1-(SQE/SQT)
1-((SQE/1248)/(SQT/1265))
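Both quantities can also be read off the fitted model object, confirming the manual computation:

summary(mp)$r.squared      # R^2 = 0.3178
summary(mp)$adj.r.squared  # adjusted R^2 = 0.3085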

(e) Applying stepAIC to the fitted model (with the option direction="back" for backward selection) executes the model selection by means of AIC.

library(MASS)
stepAIC(mp, direction="back")

The output shows that the full model has an AIC of 4275.15. The smallest AIC is achieved by removing the operator variable from the model.

The reduced model has an AIC of 4273.37. Removing the discount customer variable from the model yields an improved AIC (4272.0 < 4273.37).

The model selection procedure stops here as removing any variable would only increase the AIC, not decrease it.

The final model, based on backward selection with AIC, includes day, driver, branch, number of pizzas ordered, temperature, and bill.
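Note that stepAIC() returns the final fitted model object, so assigning its result would make the explicit refit in (f) below unnecessary; a sketch (trace=FALSE merely suppresses the step-by-step output):

mps <- stepAIC(mp, direction = "back", trace = FALSE)  # mps is the selected model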

(f) Fitting the linear model with the variables selected in (e) and applying summary() to it yields an R^{2}_{adj} of 0.3092.

mps <- lm(time ~ temperature + branch + day + driver + bill +
pizzas)
summary(mps)

This is only marginally higher than the goodness of fit of the full model (0.3092 > 0.3085). While the selected model is better than the model with all variables, both with respect to AIC and R^{2}_{adj}, the results are very close and remind us of the possible instability of automated model selection procedures.

(g) Both the normality assumption and heteroscedasticity can be checked by applying plot() to the model. From the many graphs provided we concentrate on the second and third of them:

plot(mps, which=2)
plot(mps, which=3)

Figure B.23a shows that the residuals are approximately normally distributed because the dots lie approximately on the bisecting line. There are some smaller deviations at the tails but they are not severe. The plot of the fitted values versus the square root of the absolute values of the residuals shows no pattern; it is a random plot (Fig. B.23b). There seems to be no heteroscedasticity.
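Formal tests can complement these graphical checks; a sketch (neither test is part of the original solution, and bptest() assumes the add-on package lmtest is installed):

# Shapiro-Wilk test on the residuals; H0: normality
shapiro.test(residuals(mps))
# Breusch-Pagan test; H0: homoscedasticity
library(lmtest)
bptest(mps)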

(h) Not all variables identified in (e) necessarily represent a “cause” of delayed or improved delivery time. It makes sense to speculate that the delivery time increases because many pizzas are being delivered (and need to be made!). There might also be reasons why a certain driver improves the delivery time: maybe he does not care about red lights. This could be investigated further given the results of the model above. However, high temperature does not cause the delivery time to be shorter; it is likely the other way around: the temperature is higher because the delivery time is shorter. All of these considerations remain speculation: a regression model only exhibits associations. If there is a significant association, we know that, given an accepted error level (e.g. 5 %), values of x tend to be higher when values of y are higher. This is useful, but it does not tell us whether x caused y or vice versa.

(i) To check whether it is worthwhile to add a quadratic polynomial, we simply add the squared temperature to the model. To make R understand that we apply a transformation, we need to use I().

mps2 <- lm(time ~ temperature + I(temperature^2) + branch + day
+ driver + bill + pizzas)
summary(mps2)

It can be seen that the null hypothesis H_{0} : β_{temp^{2}} = 0 is rejected. This indicates that it is worthwhile to assume a quadratic relationship between temperature and delivery time.
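An equivalent check is an F-test comparing the two nested models with anova(); since only a single parameter is added, this reproduces the t-test of the squared term:

anova(mps, mps2)  # H0: the quadratic term is not needed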

(j) The prediction can be obtained by the predict command as follows:

predict(mps, pizza[1266,])

The prediction is 36.5 min and therefore 0.8 min higher than the real delivery time.
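predict() can also quantify the uncertainty of this value; for a single new observation, the prediction interval is the relevant one:

predict(mps, pizza[1266,], interval = "prediction")  # point prediction with 95% limits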

Selected R output referenced in the solution follows.

Output for (a), summary(mp):

Estimate Std. Error t value Pr(>|t|)
(Intercept) 40.42270 2.00446 20.166 < 2e-16 ***
temperature -0.20823 0.02594 -8.027 2.28e-15 ***
branchEast -1.60263 0.42331 -3.786 0.000160 ***
branchWest -0.11912 0.37330 -0.319 0.749708
dayMonday -1.15858 0.63300 -1.830 0.067443 .
daySaturday 0.88163 0.50161 1.758 0.079061 .
daySunday 1.01655 0.56103 1.812 0.070238 .
dayThursday 0.78895 0.53006 1.488 0.136895
dayTuesday 0.79284 0.62538 1.268 0.205117
dayWednesday 0.25814 0.60651 0.426 0.670468
operatorMelissa -0.15791 0.34311 -0.460 0.645435
driverDomenico -2.59296 0.73434 -3.531 0.000429 ***
driverLuigi -0.80863 0.58724 -1.377 0.168760
driverMario -0.39501 0.43678 -0.904 0.365973
driverSalvatore -0.50410 0.43480 -1.159 0.246519
bill 0.14102 0.01600 8.811 < 2e-16 ***
pizzas 0.55618 0.11718 4.746 2.31e-06 ***
discount_customer -0.28321 0.36848 -0.769 0.442291
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
Residual standard error: 5.373 on 1248 degrees of freedom
Multiple R-squared: 0.3178, Adjusted R-squared: 0.3085
F-statistic: 34.2 on 17 and 1248 DF, p-value: < 2.2e-16
Output for (b), first rows of cbind(coefficients(mp), lcl, ucl):

lcl ucl
(Intercept) 40.4227014 36.4902223 44.3551805
temperature -0.2082256 -0.2591146 -0.1573366
branchEast -1.6026299 -2.4331162 -0.7721436
branchWest -0.1191190 -0.8514880 0.6132501
dayMonday -1.1585828 -2.4004457 0.0832801
Output for (e), stepAIC(mp, direction="back"):

Start: AIC=4275.15
time ~ temperature + branch + day + operator + driver + bill +
pizzas + discount_customer
Df Sum of Sq RSS AIC
- operator 1 6.11 36035 4273.4
- discount_customer 1 17.05 36046 4273.8
<none> 36029 4275.2
- day 6 448.79 36478 4278.8
- driver 4 363.91 36393 4279.9
- branch 2 511.10 36540 4289.0
- pizzas 1 650.39 36679 4295.8
- temperature 1 1860.36 37889 4336.9
- bill 1 2241.30 38270 4349.6
Step: AIC=4273.37
time ~ temperature + branch + day + driver + bill + pizzas +
discount_customer
Df Sum of Sq RSS AIC
- discount_customer 1 17.57 36053 4272.0
<none> 36035 4273.4
- day 6 452.00 36487 4277.1
- driver 4 364.61 36400 4278.1
- branch 2 508.57 36544 4287.1
- pizzas 1 649.54 36685 4294.0
- temperature 1 1869.98 37905 4335.4
- bill 1 2236.19 38271 4347.6
Step: AIC=4271.98
time ~ temperature + branch + day + driver + bill + pizzas
Df Sum of Sq RSS AIC
<none> 36053 4272.0
- day 6 455.62 36508 4275.9
- driver 4 368.18 36421 4276.8
- branch 2 513.17 36566 4285.9
- pizzas 1 657.07 36710 4292.8
- temperature 1 1878.24 37931 4334.3
- bill 1 2228.88 38282 4345.9
Output for (i), summary(mps2) (excerpt):

Estimate Std. Error t value Pr(>|t|)
(Intercept) -18.954965 8.795301 -2.155 0.03134 *
temperature 1.736692 0.282453 6.149 1.05e-09 ***
I(temperature^2) -0.015544 0.002247 -6.917 7.36e-12 ***
branchEast -1.429772 0.416107 -3.436 0.00061 ***
