This section outlines some further analyses using proc glm, going beyond the one way ANOVA discussed in Proc GLM.
The following data is from Hicks and is found in the file tomato.dat. Tomato plants were grown in a greenhouse under treatments consisting of combinations of soil type and fertilizer type. A completely randomized two factor design was used with two replications per cell. The following data on yield in kilograms of tomatoes were obtained for the 30 plants under study.
                  Fertilizer Type
Soil Type       1        2        3
I             5, 7     5, 5     3, 5
II            5, 9     1, 3     2, 2
III           6, 8     4, 8     2, 4
IV            7, 11    7, 9     3, 7
V             6, 9     4, 6     3, 5
The data are analyzed here by a two factor analysis of variance. Interaction may be present, but the two factor additive model (no interaction term) is fitted first.
There are 2 qualitative factors: soil and fertilizer. The data is put
into a SAS data set toms with 3 variables: soil, fert,
and yield. The analysis proceeds as follows.
proc glm
data =toms;
class soil fert;
model yield=soil fert;
means soil fert /bon;
run;
This provides the usual analysis of variance table and, for each qualitative factor, a grouping of similar means using the Bonferroni method of multiple comparison with alpha=0.05 (the default level). Under the assumed additive model, this information is sufficient for a complete analysis of the data.
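The data set toms used above has to exist before proc glm is run. Since the layout of tomato.dat is not shown here, the following is only a sketch that enters the yields from the table as in-line data. Storing soil as a character variable with values I to V is an assumption made for illustration.

```sas
/* Sketch: build the data set toms from the table of yields.        */
/* Assumption: soil is stored as character (I..V), fert as numeric. */
data toms;
   /* the trailing @@ lets several observations share one data line */
   input soil $ fert yield @@;
   datalines;
I   1 5  I   1 7   I   2 5  I   2 5  I   3 3  I   3 5
II  1 5  II  1 9   II  2 1  II  2 3  II  3 2  II  3 2
III 1 6  III 1 8   III 2 4  III 2 8  III 3 2  III 3 4
IV  1 7  IV  1 11  IV  2 7  IV  2 9  IV  3 3  IV  3 7
V   1 6  V   1 9   V   2 4  V   2 6  V   3 3  V   3 5
;
run;

/* Quick check that all 30 observations were read */
proc print data=toms;
run;
```

With this coding the class levels of soil sort as I, II, III, IV, V, so soil IV is the fourth level.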
If the model with interaction is to be used instead of the additive model, the model statement must be modified. There are two equivalent ways of writing the statement. One is
model yield= soil fert soil*fert;
This introduces the soil-fertilizer interaction term. The other, shorthand, way of obtaining this model is
model yield= soil | fert;
The vertical bar creates main and crossed effects for the variables
in question. This method generalizes to more than two variables. The
notation y=a | b | c would generate a model containing the main
effects, the two way interactions, and the three way interaction of
the factors a, b, and c, i.e., the 3 factor full model.
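As a sketch of what the bar notation expands to, the following uses a hypothetical data set mydata with response y and factors a, b, and c. None of these names come from the tomato example.

```sas
/* Hypothetical data set mydata with response y and factors a, b, c */
proc glm data=mydata;
   class a b c;
   /* Equivalent to: model y = a b a*b c a*c b*c a*b*c; */
   model y = a | b | c;
run;

/* Appending @2 limits the expansion to effects involving at most   */
/* two factors, i.e. it drops a*b*c from the model above.           */
proc glm data=mydata;
   class a b c;
   model y = a | b | c @2;
run;
```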
Nested models are specified using parentheses. A model with effects A and B, with B nested within A, is specified with a model statement
model y=a b(a);
If the initial F tests show that interaction is indeed present, a means statement of the form
means soil*fert /bon;
will NOT produce Bonferroni groupings since the bon (and other) options in the means statement apply only to main effects. The p-values for pairwise t-tests comparing the treatments in the interaction case are obtained with
lsmeans soil*fert /pdiff;
This will provide all pairwise comparisons and the corresponding p-values for the test of no difference. A Bonferroni comparison can then be obtained by judging each p-value against the overall level of significance divided by the number of comparisons made.
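The Bonferroni correction can also be requested directly through the adjust= option of the lsmeans statement. The following is a sketch for the tomato data, assuming the data set toms described earlier.

```sas
proc glm data=toms;
   class soil fert;
   model yield = soil | fert;
   /* 15 soil-fertilizer cells give 15*14/2 = 105 pairwise comparisons. */
   /* adjust=bon reports Bonferroni adjusted p-values, which can be     */
   /* compared directly with the overall level, e.g. 0.05.              */
   lsmeans soil*fert / pdiff adjust=bon;
run;
```

Without adjust=bon, the same conclusion is reached by comparing each unadjusted p-value with 0.05/105, which is about 0.00048.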
Especially in the case of the interaction model it may be desirable to do customized tests of hypotheses. These can be performed by using the contrast statement. As an example, suppose it is desired to test the hypothesis that there is no difference in yield between soil IV with fertilizer 1 and soil IV with fertilizer 2. This hypothesis can be written as a test that a particular contrast of the model parameters is zero. (To be sure that the contrast specified is the desired one, it is a good idea to use the solution option of the model statement so that the order of the parameters is known.) Recall that the General Linear Model can be written in the form Y=XB+E, where Y is the n x 1 vector of observations, X is an n x k matrix of known constants, B is a k x 1 vector of unknown parameters, and E is the n x 1 vector of random errors. The hypothesis to be tested is assumed to have the form LB=0, where the matrix L is known. In this case the form for the complete analysis would be
proc glm data=toms;
class soil fert;
model yield=soil | fert;
contrast 'IV-1 vs. IV-2'
fert 1 -1 0
soil*fert 0 0 0 0 0 0 0 0 0 1 -1 0;
run;
Note that trailing zeroes need not be specified and that effects entering the test with zero coefficients can be omitted.
Multiple contrast statements can be specified.
Using an estimate statement rather than a contrast statement produces an estimate of LB. The syntax is otherwise the same as for the contrast statement.
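As a sketch, the comparison of soil IV with fertilizer 1 against soil IV with fertilizer 2 can be written as an estimate statement. The label is arbitrary, and the coefficients assume that the soil levels sort as I, II, III, IV, V, so that soil IV occupies positions 10 to 12 of the soil*fert effect.

```sas
proc glm data=toms;
   class soil fert;
   /* solution prints the parameter estimates, which shows the   */
   /* order of the parameters assumed by the coefficients below  */
   model yield = soil | fert / solution;
   estimate 'IV-1 minus IV-2'
            fert 1 -1 0
            soil*fert 0 0 0 0 0 0 0 0 0 1 -1 0;
run;
```

Because the model with interaction fits each cell mean exactly, the estimate should equal the difference of the two observed cell means, (7+11)/2 - (7+9)/2 = 1.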
A final important option in proc glm is the output statement. This statement, which must follow the model statement, creates a SAS data set containing the variables in the original data set together with new variables as specified in the output statement. An illustration of some of the common options is
output
out=results
predicted=pred
residual=resid
L95M=lowmean
U95M=highmean
L95=lowpred
U95=highpred;
The out= option gives the name of the new SAS dataset.
The predicted= option gives the name of the variable in the
out= data set which contains the predicted value of the dependent
variable. By adding records to the original data set which
specify values of the INDEPENDENT variables in the model but set
the corresponding value of the DEPENDENT variable to
missing, one can obtain predictions given by the model
for unobserved settings of the independent variables. This is
particularly useful in regression analysis.
The residual= option gives the name of the variable in the
out= data set which contains the value of the residual.
The L95M= and U95M= options give the names of the variables in
the out= data set which contain the lower and upper endpoints of
a 95% confidence interval for the mean of the dependent variable
at the given values of the independent variables.
The L95= and U95= options give the names of the variables in
the out= data set which contain the lower and upper endpoints of
a 95% confidence interval for an individual predicted value (a
prediction interval), which is wider than the interval for the mean.
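The following sketch shows how a record with a missing dependent variable yields a prediction in the out= data set. The data set names extra and toms2 and the choice of the cell soil II with fertilizer 3 are assumptions made for illustration, and soil is assumed to be a character variable in toms.

```sas
/* One extra record: independent variables set, yield missing */
data extra;
   length soil $ 8;
   soil = 'II';
   fert = 3;
   yield = .;
run;

/* Append the extra record to the original data */
data toms2;
   set toms extra;
run;

proc glm data=toms2;
   class soil fert;
   model yield = soil fert;
   output out=results
          predicted=pred residual=resid
          l95m=lowmean u95m=highmean
          l95=lowpred u95=highpred;
run;

/* The appended record has a value for pred but a missing resid */
proc print data=results;
run;
```

The record with missing yield is not used in fitting the model, so the analysis of variance is the same as for the original 30 observations.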
Repeated measures analysis can be done by using the repeated statement.
See Proc GLM for MANOVA for details.
Copyright © 1997 by Jerry Alan Veeh. All rights reserved.