Statistics to support soil research and their presentation
R. WEBSTER
Rothamsted Experimental Station, Harpenden, Hertfordshire AL5 2JQ, UK
Summary
Much soil research needs statistics to support and confirm impressions and interpretations of investigations in field and laboratory. Many soil scientists have not been trained in statistical method and as a result apply quite elementary techniques out of context and without understanding.
This article concentrates on the most common abuses and misunderstandings and points authors to proper use. It distinguishes variance and standard deviation for measuring dispersion from standard error to indicate confidence in estimates of means. It describes the strictly limited context in which to use the coefficient of variation. It stresses the importance of quoting means and differences between them in contrast to statistical significance, which is at best of secondary interest. It guides readers to inspect and explore their data before deciding to transform them for analysis and illustrates what can be achieved by taking logarithms of single variates and by principal component analysis of multivariate data.
Résumé
Much research in soil science uses statistics to confirm impressions and interpretations in the field and in the laboratory. Many scientists lack training in statistical methods, and as a consequence apply the most elementary techniques in the wrong context and without understanding.
This article concentrates on the most common abuses and misunderstandings and indicates correct use. It distinguishes the variance and the standard deviation, which measure dispersion, from the standard error, which indicates the confidence one may have in the means. It stresses the importance of presenting the means and the differences between them as opposed to statistical significance, which is at best of secondary interest. It encourages readers to examine their data closely and to explore them before deciding to carry out any transformation for more formal analyses, and it shows what can be achieved by using logarithms of individual variables and principal components of multivariate data.
Prologue
Readers of both this Journal and other journals of soil science will have noticed an increasing polarization in the use of mathematics and statistics in soil science. On the one hand we have masters of modern advanced theory applying both it and the not-so-new with powerful computers and software. They have made substantial progress in geostatistics, fractal geometry, and mathematical morphology, to name a few fields of application, in recent years. Long may they continue to advance the technology and our understanding and ability to predict arising from it.
On the other hand, there are many field pedologists and laboratory scientists still struggling with elementary statistics and experimental design. Their plight is not helped by the emphasis on inferential tests in many students' texts, such as that by Dytham (1999), which proudly proclaims that it contains no equations (though I did espy one skulking in the corner of a graph). Its author evidently thinks it enough to tell students how to test for significance and which buttons to press on their personal computers for the purpose. Matters are made worse by misleading introductory chapters in what appear as authoritative works on soil analysis and laboratory practice. For example, the new Practical Environmental Analysis by Radojevic & Bashkin (1999), published by the Royal Society of Chemistry, omits mention of standard error, perpetuates the muddle over regression, and invokes the Central Limit Theorem out of context. With books of this kind it is small wonder that I spend much of my editorial time explaining to authors how to present and summarize their data properly and correcting their statistical mistakes.
Evidently authors need guidance on elementary statistics and summarizing data, and my purpose here is to provide that.

E-mail: richard.webster@bbsrc.ac.uk
Received 16 October 2000; accepted 1 December 2000
European Journal of Soil Science, June 2001, 52, 331–340
© 2001 Blackwell Science Ltd
Dispersion, standard deviation, standard error, and confidence
`Variety is the spice of life', and soil science would be pretty uninteresting if there were no variation. How would field pedologists entertain us if all soil were uniformly drab grey loam to 3 m? What would they have to argue about? They would be out of work, and so would most of the rest of us. But variation there is aplenty, and coping with it demands our intellect, ingenuity and resources to the full. The natural variation that has taxed systematists and soil surveyors is one kind. Farmers have created variation by enclosing, reclaiming, clearing, and fertilizing the land, though within fields they have removed some by cultivation and drainage. Data on soil contain additional sources of variation arising from our observations; these can be attributed to the people who make the observations, to the imprecision of instruments, to laboratory technique, and to sampling fluctuation.
What to present
Whatever the source of the variation we want to be able to express it quantitatively.
The variation in a set of N measurements, z_1, z_2, …, z_N, is usually expressed as variance:

S^2 = \frac{1}{N} \sum_{i=1}^{N} (z_i - \bar{z})^2,   (1)

where \bar{z} is the mean of the data. It is the average squared difference between the observations and their mean. Its square root, S, is the standard deviation, which is often preferred because it is in the same units as the measurements and is therefore more intelligible.
You may also find it helpful to visualize the variation in your data. If you have enough observations then draw a histogram of them, as in Figure 1, and see how the standard deviation relates to the spread of values.
Having measured the variation we want to use our measurement properly in context, and here I encounter the first common misunderstanding. In many instances we should like the variance obtained from a set of data to estimate the variance of some larger population, of which the N observations are a sample. This variance is by definition

\sigma^2 = E[(z - \mu)^2],   (2)

where \mu is the mean of z in the population and E denotes expectation. Equation (1) gives us a biased result, however; S^2 is a biased estimator. The reason is that \bar{z} in the equation, the estimate of \mu from the same data, is itself more or less in error. To remove the bias we must replace N by N - 1 in the denominator:

s^2 = \frac{1}{N-1} \sum_{i=1}^{N} (z_i - \bar{z})^2.   (3)

The result, s^2, is now unbiased, provided the sampling was unbiased in the first place.
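The distinction between the two divisors is easy to check numerically. A minimal sketch in Python (the data are hypothetical; NumPy's `ddof` argument selects the divisor):

```python
import numpy as np

z = np.array([4.2, 5.1, 3.8, 6.0, 4.9])  # hypothetical measurements

S2 = z.var(ddof=0)  # divisor N: Equation (1), biased low
s2 = z.var(ddof=1)  # divisor N - 1: Equation (3), unbiased

# the two estimates differ exactly by the factor N/(N - 1)
```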
Whichever we choose, S or s measures the dispersion in our observations; it does not immediately tell us how reliable is our estimate of \mu. Intuitively we should know that the smaller is either in relation to \bar{z} the more confident we can be in our estimate, but that is not the only factor involved. To assess the confidence we may have in an estimate of the mean we want the expected squared deviation of it from the true mean, i.e. E[(\bar{z} - \mu)^2]. Its estimate is derived simply from s^2 by

s^2(\bar{z}) = s^2 / N.   (4)

This is the estimation variance, and its square root is the standard error, which is the standard deviation of means of samples of size N. The larger is our sample the smaller is this error, other things being equal, and the more confident we can be in our estimate.
The point I stress here is the distinction between the standard deviation and the standard error. The former describes the variation in the sample; it should be included in any summary of a large body of data, as in Table 1. The latter is the error associated with a mean. Taking the results from the first column of Table 1, for example, we should report,
`mean phosphorus concentration 4.86 mg l⁻¹, with standard error 0.25 mg l⁻¹'.
Similarly, standard errors should accompany means in tables compiled from replicated measurements, and they may be shown by error bars on graphs. Many authors use the ± sign to express error. This itself is a source of ambiguity, however. Some authors use ± to mean the standard error, others use it for the standard deviation. From now on I shall discourage its use and ask authors to spell out `standard error' in the first instance and abbreviate it to `SE' thereafter if they wish.
Authors often imply some kind of confidence interval when they put error bars on graphs. If the data are drawn from a normal (Gaussian) distribution and there are many of them then a bar of length 2s/\sqrt{N} centred on \bar{z} spans a symmetric confidence interval of approximately 68%. Wider confidence intervals are readily calculated by multiplying by factors that can be found in standard texts: 1.64 for 90%, 1.96 for 95%, for example.
Usually, however, authors have few data, and so the above factors need to be replaced by Student's t for the number of degrees of freedom, f. Thus, the lower and upper symmetric confidence limits about a mean \bar{z} are

\bar{z} - t_f s/\sqrt{N} \quad \text{and} \quad \bar{z} + t_f s/\sqrt{N},   (5)

and bars should be drawn to these limits to express confidence. Standard deviations themselves estimated from small sets of data have wide confidence intervals, and they are of little interest therefore.
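As a sketch of Equations (4) and (5), with hypothetical data and SciPy supplying Student's t:

```python
import numpy as np
from scipy import stats

z = np.array([5.2, 4.8, 6.1, 5.5, 4.9, 5.8, 5.1, 5.6])  # hypothetical sample

zbar = z.mean()
se = z.std(ddof=1) / np.sqrt(len(z))   # standard error, Equation (4)

# 95% limits use Student's t with f = N - 1 degrees of freedom, Equation (5)
t = stats.t.ppf(0.975, df=len(z) - 1)
lower, upper = zbar - t * se, zbar + t * se
```

Note that with only eight data t (about 2.36) is appreciably larger than the large-sample factor 1.96.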
There are formulae for calculating the confidence limits for other theoretical distributions. In many instances, however, data are transformed to approximate a normal distribution before analysis (see below). The computed standard errors then apply to the transformed scale and are not easily transformed back into the original units. If you want to show error bounds on the original scale then compute the confidence limits on the transformed scale and then back-transform them.
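For example (a sketch with hypothetical positively skewed data): compute the limits on the log scale and exponentiate them, which yields an asymmetric interval about the back-transformed (geometric) mean:

```python
import numpy as np
from scipy import stats

p = np.array([2.1, 3.4, 1.8, 9.6, 4.2, 2.7, 15.3, 3.1])  # hypothetical data

logp = np.log10(p)                                  # work on the transformed scale
m = logp.mean()
se = logp.std(ddof=1) / np.sqrt(len(logp))
t = stats.t.ppf(0.975, df=len(logp) - 1)

# back-transform the limits, not the standard error itself
lower, upper = 10 ** (m - t * se), 10 ** (m + t * se)
gmean = 10 ** m                                     # the geometric mean
```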
Which words?
Two words are used for variation in the scripts I receive. One
is `variation' itself, the other is `variability', and many authors
evidently regard them as synonyms and use them interchange-
ably. Yet synonyms they are not.
Let us start with the adjective `variable', which, according
to my (Oxford) dictionary, means liable to change or
changeable. A variable (noun) is a quantity that is subject to
change or has the ability to change. From there it is a small
step to the mathematical meaning of a quantity that may take
more than one value. It could be the organic carbon content of
the soil or the soil's hydraulic conductivity, as examples. From
these follow the noun `variability', which means the potential
to vary.
Statisticians distinguish a variable as defined above from a
`variate', which is a set of observed values of a variable. The
433 values of available phosphorus (P) in the soil summarized
in Table 1 constitute a variate of the variable, available P. The
data on seven trace metals in the Swiss Jura used to illustrate
principal component analysis, below, are seven variates. What
we observe and represent as variates is therefore `variation'.
Geostatisticians formalize this in the distinction between the
actuality on the ground and the random process that is assumed
to generate that actuality. The random process, which is a
model of the reality, embodies variability. When we draw a
realization from that process what we obtain is variation.
So, in most instances `variation' is the word you want to
describe either what you have observed and recorded as one or
more variates or the population that you sampled or could
sample. Keep the word `variability' for those situations where
it is the potential that is of concern, as in many models.
Another word that is more often used incorrectly than
correctly is `parameter'. A parameter is a quantity that is
constant in the case being considered. In statistics it is
generally reserved for quantities such as means and variances
of populations. `Parameter' is not synonymous with `variable'.
Further, we do not measure parameters; rather we estimate
them from measurements of the variables that interest us. In
computing, a parameter has a similar meaning of a quantity that
is held constant for a particular run of a program or model.
Modelling physical and biological processes introduces a
duality, however. The models contain parameters, and their
creators and users have in many instances to estimate those
Figure 1 Histograms of the full set of data on available phosphorus (a) in mg l⁻¹, and (b) transformed to common logarithms. The curve in (b) is that of the fitted normal distribution, and that in (a) shows the lognormal distribution.

Table 1 Summary statistics of available phosphorus at Broom's Barn of both the full set of data (433 points) and subset (44 points). Values of χ², with 18 degrees of freedom for the full set of data and 9 degrees for the subset, are added for the hypothesis that the data or their logarithms are from a normal distribution

                          Full set               Subset
Statistic            P/mg l⁻¹   log10 P    P/mg l⁻¹   log10 P
Mean                   4.86      0.546       4.59      0.510
Variance              26.52      0.1142     27.60      0.1197
Standard deviation     5.15      0.338       5.25      0.346
Skewness               3.95      0.23        3.43      0.33
χ², 18 d.f.          368.2      26.7
χ², 9 d.f.                                  64.5       9.1
parameters from data at specific places and times. The result is that the estimates may vary in either space or time or both, and so these quantities can become variables in their own right with their own distributions.
Coefficient of variation
The coefficient of variation (CV) is the standard deviation divided by the mean, i.e. s/\bar{z}. It is often multiplied by 100 and quoted as a percentage. Its merit is that it expresses variation as pure numbers independent of the scales of measurement. It enables investigators and those reading their reports to appreciate quickly the degree of variation present and to compare one region with another and one experiment with another.
Unfortunately, this facility is too readily used to compare
variation in different variables, even ones having different
dimensions. This is usually an abuse.
The coef®cient should be restricted to variables measured on
scales with an absolute zero. Otherwise the arbitrary choice of
the zero affects it. A few examples will illustrate the matter.
Suppose we want to express variation in temperature. Usually
we measure temperature in degrees Celsius, and the CV
directly obtained from the measurements will be determined
by the mean above 0°C. If we were to record the temperatures
in Kelvin then the standard deviation would remain the same,
but the CV would be smaller, and a lot smaller if we were
concerned with ambient soil temperatures. If we measured
them in degrees Fahrenheit (Heaven forbid, but it still
happens) then the standard deviation would be larger by 9/5,
but the zero would be different again, so that results expressed
as CV would not be comparable.
Other variables with arbitrary zeros are colour hue, which is approximately circular and for which we could choose any particular hue we like as our zero, and pH. The zero of the pH scale is set at -log10 of the hydrogen ion concentration in mol l⁻¹. Although it is the convention, it is arbitrary. We could choose some other concentration, and were we to do so its logarithm would be different, as would be its CV. If, for example, we were to define pH as -log10[H⁺] in mol ml⁻¹ we should increase the current conventional values by 3. The standard deviation would remain unchanged, but the CV would be diminished.
For some soil properties physics sets limits on the utility of the CV. Consider, for example, the soil's bulk density. Its minimum is determined by the physical structures that keep particles apart. Particles must touch one another, otherwise the soil collapses. For mineral soils on dry land a working minimum bulk density is around 1 g cm⁻³. At the other end of the scale the bulk density cannot exceed the average density of the mineral particles, approximately 2.7 g cm⁻³. So the CV of bulk density is fairly tightly constrained, and it makes no sense to compare it with that of, say, the available phosphorus content of the soil.
The sensible use of the coefficient of variation for comparing two variables, say y and z, relies on the assumption that they are the same apart from some multiplying factor, b, so that

y = bz.   (6)

Then the mean of y is \bar{y} = b\bar{z}, its variance is s_y^2 = b^2 s_z^2, and its standard deviation is s_y = b s_z. From there simple arithmetic shows that their CVs are the same. This principle offers a better way of comparing variation by taking logarithms of the observations. Equation (6) becomes

\log y = \log b + \log z.   (7)

Since \log b is a constant the variances, s_{\log y}^2 and s_{\log z}^2, are equal, as are their standard deviations. Thus we have a measure of variation that is independent of the original scale of measurement.
We can use it to compare variation in two groups of observations. To return to our example of pH, let us suppose that we wish to compare the variation in acidity of a group of Luvisols with one of Podzols. We treat the hydrogen ion concentration as the original variable, transform it to pH, and compute the variances of pH. Whichever has the larger variance is the more variable, regardless of the mean. Further, we can make a formal significance test by computing F = s_{\log y}^2 / s_{\log z}^2 and comparing the result with F for N_y - 1 and N_z - 1 degrees of freedom, see below. Lewontin (1966) explains the matter at greater length.
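A sketch of that comparison in Python (the two groups of hydrogen ion concentrations are hypothetical; the probability of the variance ratio comes from the F distribution):

```python
import numpy as np
from scipy import stats

# hypothetical hydrogen ion concentrations, in mol l^-1
h_luvisols = np.array([3.2e-6, 5.1e-6, 2.8e-6, 7.9e-6, 4.4e-6, 6.0e-6])
h_podzols = np.array([1.1e-4, 4.7e-4, 8.9e-5, 2.6e-4, 6.3e-4, 1.8e-4])

# pH is already a logarithmic transform of the original variable
ph_luv, ph_pod = -np.log10(h_luvisols), -np.log10(h_podzols)

s2_luv, s2_pod = ph_luv.var(ddof=1), ph_pod.var(ddof=1)

# put the larger variance on top so that F >= 1; both groups here have
# 6 observations, hence 5 degrees of freedom each
F = max(s2_luv, s2_pod) / min(s2_luv, s2_pod)
p_value = stats.f.sf(F, len(h_podzols) - 1, len(h_luvisols) - 1)
```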
Finally in this section I comment on the practice of averaging coefficients of variation from two or more sets of data. Variances are additive; but their square roots, the corresponding standard deviations, are not. So if you want an average measure of variation from several sets of data then you should compute the arithmetic mean of their variances, weighted as appropriate by the numbers of degrees of freedom, and then take the square root of it to give an `average' standard deviation. You can then divide it by the mean to obtain an `average' CV. More generally, the additive nature of variances confers great flexibility in analysis, enabling investigators to distinguish variation from two or more sources and estimate their components according to the design, as by the analysis of variance. So, you should convert any outcome to errors only at the end of the analysis.
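A sketch of that recipe with hypothetical summary statistics from three sets of data:

```python
import numpy as np

# hypothetical variances, degrees of freedom and means of three sets of data
variances = np.array([26.5, 19.8, 31.2])
dof = np.array([43, 43, 43])
means = np.array([4.9, 4.1, 5.6])

# variances are additive, so average them, weighted by degrees of freedom ...
pooled_var = (dof * variances).sum() / dof.sum()

# ... and only then take the square root for an 'average' standard deviation
avg_sd = np.sqrt(pooled_var)
avg_cv = avg_sd / means.mean()          # an 'average' CV

# averaging the standard deviations directly understates the variation
naive_sd = np.sqrt(variances).mean()
```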
Significance tests
One of the commonest misunderstandings of statistics arises with significance testing. In the minds of many soil scientists no statistical analysis is complete without an inferential test at the end, and all too often the only results reported are statements such as `Treatment A yielded significantly more than did treatment B (P = 0.05)', `The topsoil is significantly more acid than the subsoil (P < 0.01)', and `The regression of CEC on organic matter was highly significant (P < 0.001)'.
These statements may be supported by tables in which the values of P, signifying probability, are replaced by sets of one, two and three stars. They are of virtually no interest to readers unless the author tells them how much more the yield was or how much more acid is the soil. When we come to regression we want to know at least two things about it: how steep it is (given by the regression coefficient) and how close the observations lie to the regression line (given by the product–moment correlation coefficient or the variance of the residuals). These are potentially interesting, and statistical significance is of secondary importance.
There is more to the matter than this, however.
First, let us recognize what significance means in a statistical context. Before making an inferential test we put forward a hypothesis. This is usually that there is no real difference between populations or treatments and that any differences among the means of our observations are due to sampling fluctuation. That is the `null hypothesis', which our tests are designed to reject (not confirm).
To judge whether two means differ we compute from the sampling error the probability of obtaining the observed difference if the true means were identical, assuming that we know the form of the distribution. So the sensitivity of the test depends on the precision with which we have estimated the means, i.e. on their standard errors or on the standard error of their difference. Other things being equal, and in particular the variances being constant, larger samples result in smaller standard errors and hence more sensitive comparisons. If you take large enough samples you can establish that any soil is different from almost any other for whatever property of them you care to choose.
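That sensitivity to sample size is easy to demonstrate by simulation. In this sketch (hypothetical values throughout) two soils differ in true mean pH by a trivial 0.05, yet with enough data the t test declares the difference `significant':

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

pvals = {}
for n in (10, 10_000):
    topsoil = rng.normal(6.70, 0.5, n)   # true mean 6.70, sd 0.5
    subsoil = rng.normal(6.75, 0.5, n)   # true mean only 0.05 higher
    pvals[n] = stats.ttest_ind(topsoil, subsoil).pvalue

# with 10 000 data per group the trivial difference is 'highly significant';
# with 10 per group it is not detectable
```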
The significance test is valuable in preventing false claims on inadequate evidence. Thus, you might summarize a result as,
`The mean measured pH of the topsoil was 5.5 compared with 6.9 in the subsoil; but because the samples were small the difference was not statistically significant'.
However, when a difference is deemed statistically significant because you reject the null hypothesis that does not mean that it is important or physically or biologically meaningful. For example, if you found the mean pH of the topsoil to be 6.7 and that in the subsoil to be 6.9 you would be unlikely to regard the difference as of much consequence, whatever the probability of rejecting the null hypothesis. This point needs to be made in your reporting. You should also realize that, while you may regard a difference as significant only if P < 0.05, a reader may be willing to recognize one for which 0.05 < P < 0.1 or, more stringently, only if P < 0.01. It is to some extent a matter of personal choice, and if you give the standard errors then readers can reach their own judgements.
Finally in this section, bear in mind that the null hypothesis
is highly implausible when comparing different horizons and
different types of soil; they are different, full stop!
Significant figures and choice of units
`Significance' takes on another meaning in reporting quantitative results. I receive many papers in which measured values and means and errors are quoted with five and six figures, and some with even more. I get the impression that the authors simply and thoughtlessly copy the numbers from their computer output. In some instances they respond to my chiding with remarks such as `Yes, but the values are in the tens of thousands'.
First, let us be clear that few measurements of soil properties are accurate to more than three figures. Determining the concentration of an element in the soil typically incurs a laboratory error of 2–5% of the true value. Sampling fluctuation is likely to swell the error substantially, so that only the first two figures are likely to be meaningful and in that sense significant.
Mean values from replicated sampling are more precise, and you are generally justified in reporting them with three figures, but no more. Likewise, the quoted values of standard deviations and errors should be limited to three figures. Table 1 shows what is required.
You may feel justified in quoting variances with four or even five figures because you present them as intermediate results which readers may wish to process further by adding them, subtracting one from another, and by taking their square roots.
Your values may run into tens of thousands or more in the SI units. As an example take matric suction, which at wilting point is around 1 500 000 Pa. Rather than report a result as, say, 1 540 000 Pa, however, change the units to kPa and write 1540 kPa, or even 1.54 MPa. If the values are so large or so small that they lie well outside what we can express using familiar prefixes then use powers of 10 to scale them. In the above example the suction would be written `1.54 × 10⁶ Pa'. But please do not write `1.54E06' or `1540 * 10^3'; they are computer code.
So, to summarize this section: present your numerical results to three figures, unless you are sure that more are significant, and choose familiar units that enable you to do so conveniently. If you need further guidance then see Monteith's (1984) eminently sensible article on the matter.
Transformations
Many authors are vaguely aware of transformations. Their
textbooks tell them of the dire consequences of analysing non-
normal data, and that unless they do something about such data
damnation will surely follow. Would the consequences really
be so dire? Should they transform?
The reason put forward in most textbooks is for statistical inference. It is again the matter of statistical significance. The usual parametric tests are based on the normal distribution, and so if the data come from some other distribution then the tests might mislead. This may be true in some instances, but the usual t tests and F ratios computed by the analysis of variance are robust for comparing means; they are rather insensitive to departures from normality. Further, as above, statistical significance is of secondary importance anyway, so we must probe deeper.
The most serious departure from normality usually encountered with soil data is skewness, i.e. asymmetry, as in Figure 1. The normal distribution is symmetric, its mean is at its centre (its mode), and so the mean of the data estimates this central value without ambiguity. The mean of data from a skewed distribution does not estimate the mode, nor does the median (the central value in the data). We are therefore left uncertain as to the meaning of the statistics. A second feature of skewed data is that the variances of subsets depend on their means. If the data are positively skewed (again the usual situation) then the variances increase with increasing mean. This is undesirable when making comparisons. Third, estimation is `inefficient' where data are skewed, by which I mean that the errors are greater than they need be or, put another way, more data are required to achieve a given precision than would be if the distribution were normal. If we can transform data to approximate normality we overcome these disadvantages. We achieve symmetry, and hence remove ambiguity concerning the centre. We stabilize the variances. And we make estimation efficient. These are the principal reasons for it, and of these the second is perhaps the most important.
Of course, no real data are exactly normal; all deviate more or less from normality. So, when should we transform? How large a departure from normality should we tolerate? There is no firm answer. Certainly the solution is not to apply a significance test, as many authors do, a matter that I illustrate below. What is needed is a little exploration of the data plus judgement.
Start by drawing a histogram. Does it look symmetrical? If so superimpose on it a normal curve computed from the mean and variance of the data. The normal curve has the formula

y = \frac{1}{\sigma\sqrt{2\pi}} \exp\left\{-\frac{(z - \mu)^2}{2\sigma^2}\right\},   (8)

where y is the probability density. Scale it to fit the histogram by multiplying by the number of observations and by the width of the bins. Now, does the curve look as though it fits well? If it does then go no further along this road; you need not transform.
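Scaling the curve of Equation (8) to a frequency histogram can be sketched as follows (the data are hypothetical; the bin counts and the scaled curve are then directly comparable):

```python
import numpy as np

def scaled_normal_curve(z, mean, var, n_obs, bin_width):
    """Normal density of Equation (8), scaled to overlay a frequency histogram."""
    density = np.exp(-(z - mean) ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)
    return density * n_obs * bin_width   # multiply by N and by the bin width

rng = np.random.default_rng(1)
data = rng.normal(5.0, 2.0, 500)         # hypothetical, roughly normal data

counts, edges = np.histogram(data, bins=20)
centres = 0.5 * (edges[:-1] + edges[1:])
curve = scaled_normal_curve(centres, data.mean(), data.var(ddof=1),
                            len(data), edges[1] - edges[0])
```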
If the histogram is skewed then compute the skewness coefficient in addition to the mean and variance. This dimensionless quantity can be obtained via the third moment of the data about their mean:

g_1 = \frac{1}{N S^3} \sum_{i=1}^{N} (z_i - \bar{z})^3.   (9)

A symmetric histogram has g_1 = 0. Values of g_1 greater than zero signify positive skewness, i.e. long upper tails to the distribution and a mean that exceeds the median, which is common. Negative values of g_1 signify negative skewness, and are unusual.
Now, if g_1 is positive and less than 0.5 you should not need to transform the data. If 0.5 < g_1 < 1 then you might try taking square roots; and if g_1 > 1 then a transformation to logarithms is likely to give approximate normality, so try it.
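The coefficient of Equation (9) and the rule of thumb above can be sketched as follows (the thresholds are those quoted in the text; the rule is a guide, not a substitute for inspecting the histogram):

```python
import numpy as np

def skewness(z):
    """Skewness coefficient of Equation (9): third moment over S cubed."""
    z = np.asarray(z, dtype=float)
    s = z.std(ddof=0)                    # S, with divisor N as in Equation (1)
    return ((z - z.mean()) ** 3).mean() / s ** 3

def suggested_transform(g1):
    """Rule of thumb for positively skewed data."""
    if g1 < 0.5:
        return "none"
    if g1 <= 1.0:
        return "square root"
    return "logarithm"
```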
It will be helpful, and salutary, to illustrate this with an example. The data are 433 measurements of available phosphorus, P, in the topsoil (0–20 cm) of Broom's Barn Farm in 1960 (Webster & McBratney, 1987). Their histogram (Figure 1a) shows them to be strongly positively skewed, and this is corroborated by the skewness coefficient of 3.95 in Table 1. Transforming to logarithms makes the histogram (Figure 1b) more nearly symmetric, and as the skewness in the logarithms is now only 0.34 we can be satisfied. To emphasize the point I have drawn the normal curve on Figure 1(b), and it fits reasonably well. The curve on Figure 1(a) is that of the lognormal distribution.
Were we to judge the goodness of fit by a significance test
we might compute a χ² from the differences between the
observed frequencies and the theoretical ones and compare the
result with χ² for probability 0.05. We should reject the
hypothesis of normality on the original scale. But we might be
tempted to reject it for the logarithms too; the probability
associated with χ² = 26.7 with 18 degrees of freedom (Table 1)
is only 0.088. Suppose, however, we had only 44 data; how
should we judge then? Figure 2 shows the histograms (a) on
the original scale, and (b) for the logarithms. They correspond
well with the full set of data, as do the summary statistics. In
particular, the skewness coefficient is almost the same. If now
we compute χ² and apply our significance test we still
conclude that the distribution on the original scale is non-normal,
but we should accept without any hesitation that the
logarithms are near enough normally distributed; with χ² = 9.1
with 9 degrees of freedom the probability is 0.42. It illustrates
how sensitive significance tests can be when you have many
data and why you should not use such tests for judging whether
to transform.
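The sensitivity of χ² to sample size is easy to demonstrate with made-up frequencies (the proportions below and the critical value 7.815, for P = 0.05 with 3 degrees of freedom, are illustrative assumptions, not the Broom's Barn data):

```python
def chi_square_stat(observed, expected):
    """Pearson's chi-square statistic from observed and expected frequencies."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# The same modest departure from equal expected frequencies, seen with
# 44 data and with 4400: the statistic grows in proportion to N.
props_obs = [0.28, 0.26, 0.24, 0.22]   # hypothetical observed proportions
for n in (44, 4400):
    obs = [p * n for p in props_obs]
    exp = [n / 4.0] * 4
    stat = chi_square_stat(obs, exp)
    # 7.815 is the chi-square critical value for P = 0.05 with 3 d.f.
    print(n, round(stat, 3), "reject" if stat > 7.815 else "accept")
```

With 44 data the hypothesis is comfortably accepted; with the same proportions and 4400 data it is firmly rejected, which is precisely why such tests are a poor basis for deciding whether to transform.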
We can see how the transformation stabilizes the variances.
I have drawn subsets of 44 from the full set of data, computed
their means and variances, and plotted their variances against
their means. Figure 3(a) shows that the variance increases
strongly with increasing mean. Taking logarithms produces a
result in which there is virtually no relation (Figure 3b).
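The same exercise can be sketched with simulated data (a hypothetical lognormal-like sample standing in for the phosphorus measurements; none of the numbers are from the paper):

```python
import math
import random
import statistics

random.seed(0)
# Hypothetical lognormal-like sample standing in for the phosphorus data
data = [math.exp(random.gauss(1.0, 0.8)) for _ in range(440)]

def mean_var_of_subsets(values, size=44):
    """(mean, variance) pairs for consecutive subsets of the given size."""
    return [(statistics.mean(values[i:i + size]),
             statistics.variance(values[i:i + size]))
            for i in range(0, len(values), size)]

raw = mean_var_of_subsets(data)
logged = mean_var_of_subsets([math.log10(z) for z in data])

# Ratio of largest to smallest subset variance: large on the raw scale,
# much closer to 1 after taking logarithms (the variance is stabilized).
spread = lambda pairs: max(v for _, v in pairs) / min(v for _, v in pairs)
print(round(spread(raw), 1), round(spread(logged), 1))
```

Plotting the variances against the means of the two sets of pairs reproduces the pattern of Figure 3: a strong trend on the raw scale, virtually none after the transformation.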
336 R. Webster
© 2001 Blackwell Science Ltd, European Journal of Soil Science, 52, 331–340
Finally, you should realize that these simple functions change
only the general form of the distribution; they do not
change the detail. If you want to normalize the detail then you
will need the more elaborate normal score transform;
Goovaerts (1997) describes how to do it.
Principal components
Another kind of transformation is to principal components, and
it is one that has recently returned to fashion and brought with
it its own brand of misunderstanding.
When biologists and pedologists first gained access to
general-purpose computers in the 1960s principal component
analysis (PCA) was one of their delights. For the first time they
had a tool to enable them to analyse large sets of correlated
multivariate data and identify `structure' in them. The structure
of interest might be the relations of the units (individuals,
sampling points) to one another and perhaps clustering, or it
might be the general relations among the variates and how they
cluster. Other scientists, especially psychologists, sought to
interpret the principal components, and they elaborated the
technique for the purpose.
Once the pedologists had mastered PCA they put it into
their repertory, and they got it out and dusted it down
when desired. It did not make the news, even though it has
been instrumental in several penetrating studies. But its
popularity is increasing again as soil biologists seek to
understand the microbial ecology of soil. Samples of soil
are subject to a battery of tests, such as those of Biolog, or
their phospholipids are fractionated and their concentrations
measured – see the recent paper by Fritze et al. (2000) in
this Journal. The result is a set of p values (variates) for
each of N samples of soil (units). These are assembled into
a data matrix with N rows and p columns.
One may imagine this matrix as a distribution of units as N
points in a Euclidean space of p dimensions, as if it were a
scatter graph with p orthogonal axes. However, even the most
elastic imagination has difficulty in stretching to a space of 20
or 30 dimensions, and when there is correlation among the
variates we may rightly seek to reduce the dimensionality to
one that we can envisage and hope that the structure we seek is
revealed in just a few dimensions. Principal component
analysis helps us in that search. Pursuing the geometric
representation, the analysis finds new axes in the multidimensional
space such that the first lies in the direction of
maximum variance; the second, orthogonal to the first, lies in
the direction of maximum variance in the residuals from the
first; the third takes up the maximum variance in the residuals
Figure 3 Graphs of variance against mean
for 10 subsets of 44 phosphorus data (left)
on the original scale in mg l⁻¹ and (right)
after transformation to common logarithms.
Figure 2 Histograms of a subset of 44 data
on available phosphorus (a) in mg l⁻¹, and
(b) transformed to common logarithms. The
curves in (b) and (a) are of the normal and
lognormal distributions, respectively.
from the ®rst and second, to which it is orthogonal; and it
continues until there is no variance left.
The mathematics of PCA can be found in any good book on
multivariate statistics, e.g. Mardia et al. (1979) and Webster &
Oliver (1990), and it is not my purpose here to repeat that, or
even to summarize it. Further, it is easy to do the analyses now
in many statistical packages, and herein lie the first pitfalls for
the unwary.
Which matrix?
The analysis may be done on any one of three matrices: the
matrix of sums of squares and products, S; the variance–
covariance matrix, C; or the correlation matrix, R. The first two
are

S = XᵀX,  (10)

where matrix X contains the data from which the means have
been subtracted, i.e. centred data, and the superscript ᵀ
signifies the transpose, and

C = XᵀX/(N − 1).  (11)

The correlation matrix, R, is formed from the variance–
covariance matrix by

r_ij = c_ij/√(c_ii c_jj)  (12)

for all i and j = 1, 2, …, p. Analysing matrices S and C will give
sensibly the same result apart from a scaling factor, N − 1. In
general the results will differ from those obtained by analysing
matrix R, for the following reason. The outcomes in the first
two instances will be dominated by those original variates with
the largest variances. Thus, if our data comprise available P
with a variance of 26.5 (mg l⁻¹)² and pH with a variance of
0.25 then the P will swamp pH; we shall not notice the effect
of the latter. Working with matrix R gives them equal weight;
effectively it standardizes the variances to 1, and clearly this is
much more sensible. If you are in any doubt of what your
statistical package is doing then you should standardize the
variates to unit variance before doing the PCA.
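The advice can be made concrete in a short numpy sketch (an illustrative function, not the author's code): standardizing the variates to unit variance and then analysing their variance–covariance matrix is the same as analysing R.

```python
import numpy as np

def pca_on_correlation(X):
    """PCA of the correlation matrix R: centre each variate, scale it to
    unit variance, then take the eigendecomposition of the resulting
    variance-covariance matrix (which is exactly R)."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    R = Z.T @ Z / (len(X) - 1)              # correlation matrix
    eigvals, eigvecs = np.linalg.eigh(R)
    order = np.argsort(eigvals)[::-1]       # largest variance first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = Z @ eigvecs
    return eigvals, eigvecs, scores
```

Whatever the units of the original variates, the eigenvalues of R sum to p, so no single variate with large variance can swamp the rest.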
What to report
The output from a PCA should contain three sets of values,
namely the eigenvalues (or latent roots), the eigenvectors (or
latent vectors), and component scores. The eigenvalues,
denoted λ₁, λ₂, …, λ_p, are the variances along the new axes,
and they are usually tabulated in order from largest to smallest
with the proportions of the total variance for which they
account. You should always report the first few in a table, as in
Table 2 on the left and Table 6 of Fritze et al. (2000).
The eigenvectors, a₁, a₂, …, a_p, contain the cosines of the
angles between the original axes and the new ones. These are
also of interest, and you may tabulate them or graph their
elements as points in as many dimensions as you think fit. The
magnitude of the elements can help to give meaning to the
components; the elements are effectively weights, and the
larger they are in absolute value the stronger is their influence.
You will often find that interpretation is aided by converting
the eigenvectors to correlations, i.e. to the product–moment
correlation coefficients between the principal components and
the original data, by

b_ij = a_ij √(λ_j/σ_i²),  (13)

where a_ij denotes the ith element of the jth eigenvector, λ_j is
the jth eigenvalue, and σ_i² is the variance of the ith original
variate. A scatter diagram of these in a circle of unit radius
shows how strong they are (the nearer they plot to the
circumference the stronger they are in those dimensions), and
their proximity to one another shows how related the original
variables are to one another. Figure 4, of the relations of trace
metals in the soil of the Swiss Jura, is an example.
However, here is a second potential pitfall. Many packages
produce tables of output called `loadings'. These loadings may
be either eigenvectors or correlation coef®cients, and the
documentation does not always state which. Authors copy the
results into their papers without knowing, and they are
nonplussed when we editors insist on their telling us precisely
what they are reporting.
Multiplying the centred data by the eigenvectors, columns in
matrix A, gives the third set of values, the component scores,
Y:
Y = XA. (14)
The columns of Y are the new variates. They are best plotted
as scatter diagrams, one against another, for the ®rst few
components. These scatter graphs show the relations between
the units in the reduced dimensions.
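Equations (13) and (14) can be combined in one numpy sketch (a hypothetical helper, not the author's code), which returns both the component scores and the correlation form of the 'loadings', so that you know precisely which kind you are reporting:

```python
import numpy as np

def scores_and_correlations(X):
    """Component scores Y = XA (equation 14) and the correlations b_ij
    between components and original variates (equation 13)."""
    Xc = X - X.mean(axis=0)                  # centred data
    C = Xc.T @ Xc / (len(X) - 1)             # variance-covariance matrix
    lam, A = np.linalg.eigh(C)
    order = np.argsort(lam)[::-1]            # largest eigenvalue first
    lam, A = lam[order], A[:, order]
    Y = Xc @ A                               # component scores
    sigma2 = np.diag(C)                      # variances of original variates
    B = A * np.sqrt(lam[np.newaxis, :] / sigma2[:, np.newaxis])   # b_ij
    return Y, B
```

Each row of B has unit length, so plotting pairs of its columns inside the unit circle gives a diagram of the kind shown in Figure 4.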
Table 2 Eigenvalues from a principal component analysis of the
standardized values of seven trace metals in the Swiss Jura (from
Webster et al., 1994)

Order  Eigenvalue  Percentage  Accumulated percentage
1      4.123       58.90        58.90
2      1.342       19.17        78.07
3      0.681        9.72        87.79
4      0.344        4.91        92.70
5      0.229        3.27        95.97
6      0.162        2.31        98.28
7      0.120        1.72       100.00
How many dimensions?
As above, the main aim of PCA is to reduce the dimensionality
and obtain a few variates that capture most of the information
in the data. The question almost inevitably arises: how few? –
how many components should one retain? There is rarely an
unequivocal answer. Tests have been proposed for the purpose,
but their outcomes are at best guides, and we usually have to
make our own judgements on what is meaningful and good
sense.
The first component is often one of size. In the example
from which Tables 2 and 3 derive, trace metals in the soil of
the Swiss Jura, all seven metals lie well to the right of centre in
Figure 4, showing that units are either rich in all the metals or
poor in them all. The second component, however, discriminates
between the metals Cd, Co, Cr and Ni, and the group of
Cu, Pb and Zn. These together account for almost 80% of the
variance among the seven metals, and the authors thought it
not worth investigating the remaining 20%.
To conclude this section, principal component analysis may
be regarded as a rigid rotation of the data to new axes. It is no
more and no less. It is mathematical rather than statistical. It
embodies no distributional assumptions; it does not lead
directly to any tests of significance. It is also naïve in the sense
that it takes no account of anything you know apart from the
data themselves. If the results turn out to have pedological or
biological meaning then that is good fortune rather than a
matter of design. This is not to say that PCA is not valuable; on
the contrary, it has proved remarkably, even surprisingly,
helpful. There are often moderate to strong correlations in
data, so that the leading few principal axes account for large
proportions of the variance, and a projection from the full
space on to them will give a picture containing most of the
information in the data. We shall have reduced the dimensionality
to one that is manageable, we can examine the scatter in
those few dimensions for structure, we can transform the new
components further to extract more meaning, and we can
analyse them as representatives of the full set of data.
You should see PCA as the beginning of an investigation, an
exploration of your data, not as an end in itself.
Regression
I described regression, guided authors on its correct application,
and warned of its improper use in a previous article
(Webster, 1997). Yet misunderstanding and abuse continue,
and I do not know what more I can do to educate authors on
the subject. Papers in which regression has been applied
thoughtlessly continue to pour into my office, and it is no
exaggeration to say that in most the regression is inadequately
explained, inappropriate, unnecessary, or just plain wrong. The
only thing that seems to have changed since Mark & Church
(1977) reported the dismal record of regression in earth
science is that it is easier to do ± every computing system now
has a regression button. If you are contemplating a regression
analysis for this Journal then read my article first, and if you do
not understand it please consult a professional statistician.
Epilogue
Most soil research requires only well-established standard
statistics. The analyses have been programmed and are readily
available in computer packages, and we should be able to trust
the outcome. Soil scientists' main failings are now in choosing
the statistics appropriate for their purposes and in presenting
their results with understanding. Bear in mind finally that
statistical processing and analyses are means to ends in soil
research and that their outcomes must make pedological sense.
Figure 4 Scatter of the seven trace metals plotted as their
correlation coefficients between them and the first two principal
components in the unit circle (from Webster et al., 1994). See text
for explanation.
Table 3 First three eigenvectors from the PCA of the standardized
trace metal data

Order   1       2       3
Cd    0.396  −0.248  −0.558
Co    0.330  −0.327   0.722
Cr    0.388  −0.274  −0.303
Cu    0.327   0.580   0.178
Ni    0.416  −0.313   0.187
Pb    0.338   0.551  −0.034
Zn    0.457  −0.133  −0.086
References

Dytham, C. 1999. Choosing and Using Statistics: A Biologist's Guide.
Blackwell Science, Oxford.
Fritze, H., Pietikäinen, J. & Pennanen, T. 2000. Distribution of
microbial biomass and phospholipid fatty acids in Podzol profiles
under coniferous forest. European Journal of Soil Science, 51,
565–573.
Goovaerts, P. 1997. Geostatistics for Natural Resources Evaluation.
Oxford University Press, New York.
Lewontin, R.C. 1966. On the measurement of relative variability.
Systematic Zoology, 15, 171–172.
Mardia, K.V., Kent, J.T. & Bibby, J.M. 1979. Multivariate Analysis.
Academic Press, London.
Mark, D.M. & Church, M. 1977. On the misuse of regression in earth
science. Journal of the International Association of Mathematical
Geology, 9, 63–77.
Monteith, J.L. 1984. Consistency and convenience in the choice of
units for agricultural science. Experimental Agriculture, 20,
105–117.
Radojević, M. & Bashkin, V.N. 1999. Practical Environmental
Analysis. Royal Society of Chemistry, Cambridge.
Webster, R. 1997. Regression and functional relations. European
Journal of Soil Science, 48, 557–566.
Webster, R. & McBratney, A.B. 1987. Mapping soil fertility at
Broom's Barn by simple kriging. Journal of the Science of Food
and Agriculture, 38, 97–115.
Webster, R. & Oliver, M.A. 1990. Statistical Methods in Soil and
Land Resource Survey. Oxford University Press, Oxford.
Webster, R., Atteia, O. & Dubois, J.-P. 1994. Coregionalization of
trace metals in the soil in the Swiss Jura. European Journal of Soil
Science, 45, 205–218.
