THE EFFECT OF SHELF HEIGHT ON CEREAL SALES (CONTINUED)
Recall that the Midway supermarket chain ran a study on 125 stores to see whether shelf height, set at five different levels, has any effect on sales of a popular brand of cereal. (See the file Cereal Sales.xlsx.) Does Midway get the same results as before if it analyzes the data with regression?
Objective To see how Midway can analyze its data with regression, using only dummy variables for the treatment levels.
Before we can run a regression, we must first reorganize the data. Recall that the original data in the file are in unstacked form—one sales column for each shelf height. For regression, the data must be in stacked form. This is easy to accomplish with StatTools. First, select Stack from the Data Utilities group in StatTools. In the resulting dialog box (not shown), check all five variables, and specify Shelf Height as the Category Name and Sales as the Value Name. This creates a new worksheet with two long variables called Shelf Height and Sales. Next, create a new StatTools data set for the stacked data, and then use StatTools to create dummies for the different shelf heights, based on the Shelf Height variable. The results for a few of the stores appear in Figure 19.13.
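For readers working outside StatTools, the stacking and dummy-creation steps can be sketched in plain Python. This is only an illustration under stated assumptions: the sales numbers are made up (they are not from Cereal Sales.xlsx), only three of the five heights are shown, and the level names other than Lowest, as well as the helper names stack and add_dummies, are hypothetical.

```python
# Hypothetical unstacked data: one sales column per shelf height.
# Made-up numbers for illustration only; not the Midway data.
unstacked = {
    "Lowest": [10, 12],
    "Middle": [15, 17],
    "Highest": [20, 22],
}


def stack(columns):
    """Return (category, value) rows, one row per original cell."""
    return [(name, value) for name, values in columns.items() for value in values]


def add_dummies(rows, reference):
    """Append a 0/1 dummy for every category except the reference level."""
    seen = dict.fromkeys(category for category, _ in rows)  # keeps first-seen order
    levels = [name for name in seen if name != reference]
    table = [
        (category, value, *[int(category == level) for level in levels])
        for category, value in rows
    ]
    return levels, table


stacked = stack(unstacked)  # two long columns: Shelf Height and Sales
levels, table = add_dummies(stacked, reference="Lowest")

print(("Shelf Height", "Sales", *levels))
for row in table:
    print(row)
```

Note that the reference level (Lowest here) gets no dummy of its own; its stores are the rows where every dummy equals 0.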
We now run a multiple regression with the Sales variable as the dependent variable and the Shelf Height dummies as the explanatory variables. We used Lowest as the reference level, although any level could have been used. The regression output is shown in Figure 19.14.
The first thing to notice is that the ANOVA table from the regression output is identical to the ANOVA table from traditional ANOVA. (See Figure 19.5.) This will always be the case. You can infer, because of the extremely low p-value in this table, that the population regression coefficients are not all 0. Moreover, because these regression coefficients are really mean differences between the various levels and the reference level, you can infer that these mean differences are not all 0. Specifically, the mean sales for at least one of the upper heights differs from the mean sales for the lowest height. The estimates of the mean differences, given in the range B20:B23, are the observed average differences in sales between the upper heights and the lowest height. Also, the constant in cell B19 is the observed average sales for the lowest height.
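The two claims in this paragraph can be checked numerically: the dummy-variable regression reproduces the one-way ANOVA F-ratio, and its coefficients are the reference mean and the mean differences. The following is a minimal sketch using NumPy rather than StatTools. The data are made up (three hypothetical shelf heights with three stores each), so the numbers do not correspond to Figure 19.14.

```python
import numpy as np

# Hypothetical sales for three shelf heights (made-up numbers, not the Midway data).
groups = {
    "Lowest": np.array([10.0, 12.0, 14.0]),
    "Middle": np.array([15.0, 19.0, 20.0]),
    "Highest": np.array([11.0, 13.0, 18.0]),
}
reference = "Lowest"
others = [name for name in groups if name != reference]

# Stack the data, then build the design matrix:
# a column of 1s (the constant) plus one 0/1 dummy per non-reference level.
y = np.concatenate(list(groups.values()))
labels = np.repeat(list(groups), [len(v) for v in groups.values()])
X = np.column_stack(
    [np.ones(len(y))] + [(labels == name).astype(float) for name in others]
)

# Least-squares fit: coef[0] is the constant, coef[1:] are the dummy coefficients.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# ANOVA table quantities computed from the regression.
fitted = X @ coef
ss_regression = np.sum((fitted - y.mean()) ** 2)
ss_error = np.sum((y - fitted) ** 2)
df_model = len(others)
df_error = len(y) - X.shape[1]
f_regression = (ss_regression / df_model) / (ss_error / df_error)

# The same quantities computed the traditional one-way ANOVA way.
ss_between = sum(len(v) * (v.mean() - y.mean()) ** 2 for v in groups.values())
ss_within = sum(np.sum((v - v.mean()) ** 2) for v in groups.values())
f_anova = (ss_between / (len(groups) - 1)) / (ss_within / (len(y) - len(groups)))

# Observed reference mean and mean differences, for comparison with coef.
mean_diffs = [groups[name].mean() - groups[reference].mean() for name in others]

print("coefficients:", coef)
print("reference mean and differences:", groups[reference].mean(), mean_diffs)
print("F from regression:", f_regression, " F from ANOVA:", f_anova)
```

The agreement is not an accident of these numbers: the fitted value for every store is simply its group mean, so the regression and the traditional ANOVA split the total variation in exactly the same way.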
If you compare the confidence intervals in the range F20:G23 of the regression output to the corresponding confidence intervals for the ANOVA output in Figure 19.5, you will see that they are somewhat different. For example, the confidence interval for μ_2 − μ_1 from Figure 19.14 extends from 1.41 to 86.11, whereas the corresponding confidence interval in Figure 19.5 extends from −15.50 to 103.02. (We had to reverse the signs to get the confidence interval for μ_2 − μ_1, not μ_1 − μ_2.) In particular, the confidence interval from regression, although centered on the same mean difference (43.76, the midpoint of both intervals), is much narrower. In fact, it is entirely positive, leading us to conclude that this mean difference is significant. The ANOVA output led us to the opposite conclusion. The reason for this apparent discrepancy is the subject of the next section. Briefly, the Tukey intervals quoted in the ANOVA output are more “conservative” (wider) and therefore typically lead to fewer significant differences.
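To see where the extra width comes from, it helps to write the two intervals side by side. The formulas below are the standard textbook forms, not something read off the StatTools output. The symbols are our own notation: I is the number of shelf heights, N the total number of stores, n_i the number of stores at height i, and s_p the pooled standard deviation (the square root of the mean square error). We assume both intervals are at the 95% level.

```latex
% Individual (regression) interval for \mu_i - \mu_1:
(\bar{Y}_i - \bar{Y}_1) \;\pm\; t_{\alpha/2,\,N-I}\; s_p \sqrt{\frac{1}{n_i} + \frac{1}{n_1}}

% Tukey interval for the same difference:
(\bar{Y}_i - \bar{Y}_1) \;\pm\; \frac{q_{\alpha;\,I,\,N-I}}{\sqrt{2}}\; s_p \sqrt{\frac{1}{n_i} + \frac{1}{n_1}}

% Same center and same standard error, so the half-widths differ only by the multiplier:
\frac{\text{Tukey half-width}}{\text{individual half-width}}
  = \frac{q_{\alpha;\,I,\,N-I}/\sqrt{2}}{t_{\alpha/2,\,N-I}}
```

For this example N − I = 125 − 5 = 120. Approximate table values are t ≈ 1.98 and q ≈ 3.92, so the ratio of multipliers is about 2.77/1.98 ≈ 1.40. This agrees with the intervals quoted above, whose half-widths are (86.11 − 1.41)/2 = 42.35 and (103.02 + 15.50)/2 = 59.26, a ratio of about 1.40.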
One final comment about the regression output concerns its R² value. We see that differences in shelf height account for 13.25% of the variation in sales. This means that although shelf height has some effect on sales, there is a lot of “random” variation in sales across stores that cannot be accounted for by shelf height.
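A small R² and a very significant F-ratio are not contradictory, because the two are tied together by the degrees of freedom. As a rough check, the sketch below recovers the F-ratio implied by the R² value quoted above, using the standard identity F = (R²/df_model) / ((1 − R²)/df_error). The function name f_ratio_from_r2 is hypothetical. Because 13.25% is a rounded figure, the result should agree with the ANOVA table only approximately.

```python
def f_ratio_from_r2(r2, n_obs, n_levels):
    """F-ratio implied by R^2 for a regression on (n_levels - 1) dummies plus a constant."""
    df_model = n_levels - 1      # one dummy per non-reference level
    df_error = n_obs - n_levels  # observations minus estimated coefficients
    return (r2 / df_model) / ((1.0 - r2) / df_error)


# Values quoted in the text: R^2 = 13.25%, 125 stores, 5 shelf heights.
f_ratio = f_ratio_from_r2(0.1325, n_obs=125, n_levels=5)
print(round(f_ratio, 2))  # prints 4.58
```

With 120 error degrees of freedom, even a modest share of explained variation produces an F-ratio well above 1, which is why the overall test can be significant while most of the store-to-store variation remains unexplained.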