Outliers
(extreme scores) the data set below names DATASET1.XLS is the same data you
may have downloaded from Lecture 1 called sample1.XLS
- you could step by step along with this file if you want the practice.
in review from Lecture #2, we have just calculated that
the mean AGE for 16 cases in the data set
is 25.1 years
the mean, however is quite sensitive to distribution scores
that are highly deviant from the mean, scores that are often called outliers
or extreme scores
in the next few Excel frames, we will see the impact of
a single additional case with a deviant age
the purpose for going through this exercise is to illustrate
that even a single outlier has a strong pull on the mean in the direction
of the outlier score
To our existing data set of 16 cases, let's add a 17th case
with an extreme score for AGE - see the yellow highlighted row below.
say we have another student for a total of 17 students in
the class
further let's say that the 17th student is retired and has
gone back to school & he has an AGE of 74
please note that the new data appears in the table above
with the yellow row 18 containing the additional data
below you will seee, as the next step, I have sorted this
array by AGE from youngest to oldest - it is not necessary to sort
the data to compute the mean, but it allows us to get a better grasp of score
frequency distribution
you can see for example that all but four of the AGE values
< 30 years
look at the effect on the mean AGE, which has increased
by almost 3 years from 25.1 to 28.0 yrs
another benefit of sorting by AGE is that it is also fairly
easy to eyeball the median AGE
the median for the original 16 students is 22 yrs
the median all 17 cases is also 22 yrs
what do you suppose would happen if we added an 18th
case with AGE = 74 and then computed the mean for all cases? perhaps you should
try this!
Now let's substitute an AGE for this 17th case to make the
one additional score more moderate
rather than have the 17th case be so extreme at 74 yrs let's
make the 17th case only 31 yrs
we substitute 31 for the value 74 and have Excel recompute
the mean AGE
we can see that the mean has increased from the original
25.1 by less than 0.5 yrs to 25.5 yrs
resorting the table using column E for AGE to eyeball the
median
also, look at the median with 17 cases using the 31 value
of AGE substituted for 74...the median has remained the same
Let's change gears a little and attempt to have Excel compute
the mean of another column
Let's shift our attention to another column
what do you suppose will happen if we attempt to compute
the mean of column J
column J contains the shoe width data, expressed in letters
ranging from A to E
we see that something is amiss; cell J20 contains the notation
"#DIV/0!"
this indicates that there has been an attempt to divide
by 0; as you recall from algebra, division by zero is undefined
Excel cannot deal with mean calculations for non-numeric
cells; Excel will do no better with either the median or the mode when alphabetic
codes are used to represent values.
by the by, what kind of measurement scale is represented
by the variable in column J?
Recoding the alphabetic
values to numbers?
in the instance of SWIDTH we could assign integers from
1 to 5 respectively to the letters A through E
it will take five steps to perform the replacement of all
five letters with numbers, one step per replacement
in this first step we will replace "A" with "1"
click at the top of column J, resulting in the highlighting
of the entire column (see below)
note that column J is shaded, and by going to the Edit menu
and....
by pulling down the Edit menu and selecting 'Replace..."
a pop-up is generated as above
when this pop-up initially appears, there is nothing in
either the "Find" or "Replace" boxes
in the "Find what:" box enter the character to
be replaced; this example has an "A"
in the "Replace with:" box indicate the desired
change; this example has a "1"
now by clicking the "Replace All" button the changes
above are made - in column J only, all cells with the letter A were replaced
with the number 1
we need to repeat this process 4 more times to replace
the Bs, Cs, Ds, and Es
in a small table such as this, we could perhaps make
these replacement faster by selecting each cell and changing the letter
to a number
however, in a larger table with hundreds of cases, this
process of using the Replace command is a real time save
since we highlighted only column J, we can click the "Replace
All" button and only the values in this column will be affected by this
replacement command
this "Replace" command is both POWERFUL
and DANGEROUS
failure to exercise caution will result in elements
being replaced that were not intended and/or failure to replace elements
that were intended.
if for example, we had not highlighted column J before
initiating the Replace, we would see that some unwanted replacements occurred
To illustrate the potential harm induced by careless use of
the Replace command let's look at the following illustration
below is the same table as just presented, but we are going
to replace Bs with 2s
rather than highlight column J as I should, I am going to
neglect this IMPORTANT step
please note that there is no portion of the table with highlighting;
thus the entire spreadsheet will be scanned for replacements when the "Replace
All" button is clicked
also please note, that I have deselected the box for "Find
entire cells only" - what do you suppose check box does?
below is the result of clicking the "Replace All"
button
you can see that not only did the Bs in column J get replaced
by 2s, but every other B in this table was also replaced with a 2
the variable title in column D was changed, but was
not intended
in column M for eye color, please note all the first
character 2s replacing Bs
it is extremely important to stay alert and check your work
when making these kinds of changes
obviously if you unintentionally and unknowingly make replacements
and then perform statistical operations, your results will probably not achieve
the proper solution
I would encourage you to spend a little time experimenting
with the Replace command to learn how it functions
Let's fast forward several steps to get this table completely
recoded
after replacing all the alphabetic codes in column J for
variable SWIDTH, five more columns F, K, L, M, & N also need attention
the Ms and Fs in column F will become a dichotomously coded
variable: M=0, F=1 - why choose these values?
columns K, L, N are yes/no responses and will also become
a dichotomously coded variable: N=0, Y=1 - why choose these values?
finally, column M contains eye colors, including blue, brown,
green, and hazel which we will arbitrarily assign the codes 1, 2, 3, &
4 respectively
the completely recoded data set, table, matrix or array
appears above
we should update our codebook to reflect the recoding changes
to illustrate both the original data and how the original data points were
transformed
Summary of Central Tendency
Use the
Use the
Use the
MODE for these 4
MEDIAN for these 3
MEAN for these 2
nominal
ordinal
interval
ordinal
interval
ratio
interval
ratio
ratio
Clearly there are multiple choices of when to use the mode,
median, and mean. Thus, the scale of measurement does not necessarily dictate
which measure of central tendency to use. Proper selection of central tendency
is determined by factors such as the ease of computation, the desired accuracy,
and need for drawing inferences.
Recently we discussed the importance of the mean in describing
a distribution. Although the mean is quite important, we would miss a great
deal if we did not bother to look at other important characteristics of a distribution.
For example, in two distributions, would it be possible to have the same mean,
but have other notable features differentiating the distributions?
There are two curves below. One is apparently a frequency distribution
for Group 1 and the second is a frequency
distribution for Group 2.
Compare and contrast these two curves.
both appear to be normal curves
both curves appear to have the same mean
there is a higher frequency of scores close to the mean
in Group 1 relative to Group
2
The standard deviation which dependent upon all the scores
in a distribution is both, the most widely used and most stable measure of variability.
the symbol used for standard deviation is S and the square
of S (S2) is variance
the formula for hand calculations
of standard deviation looks worse than it is to calculate
the deviation score formula is neat & compact, but
unless you have an integer mean the deviation scores about the mean are
cumbersome - so this formula is really of limited utility
the raw score formula is more imposing in appearance,
but really not so bad to calculate
X = raw score and N = sample size
remembering to manage the parentheses where most students
have difficulties
the example in the book is a bit contrived, but it illustrates
how simply this calculation can be made, particularly when there are only
five scores
not surprisingly, the resulting standard deviation = 2.0
returned by Excel agrees with the calculations
of a few moments ago
while it is IMPORTANT
to recognize the formula for the variance and standard deviation I would prefer
that you do two things rather than memorize the terms in the expression
be able to discriminate this formula from others we
will be examining
understand what is being calculated - a sort of average
- an average deviation about the mean
also IMPORTANT...
you should know that the value of standard deviation
can never be less than zero
standard deviations must be compared within the same
frame of reference - data that has been measured with the same instrument
the standard deviation is roughly equal to 1/5 the value
of the range in samples studied by behavioral scientists
subtracting a constant from each X in a data set does
not affect the calculation of the standard deviation
the value of the standard deviation is related to the
NORMAL DISTRIBUTION
the value of the standard deviation is equal to the
distance between the mean and and the point of inflection (sometimes referred
to as the point where the line slope changes)
Questions to think about...
does it make sense to obtain the mode on ratio, interval,
ordinal, and nominal data - why?
under what circumstances might one prefer to use the
mode rather than median or mean? why?
what happens to the mean if every component of the mean
has a constant added to it?
what happens to the mean if every component of the mean
is multiplied by a constant factor?
why do statisticians concern themselves about extreme
cases (sometimes called outliers) and agonize over whether or not extreme
cases should be omitted from data analyses?
what type of distribution is present when the mode, median,
and mean are equal?
why is variability important in the study of distributions?
how is variability similar to an average?
if you had to decide that central tendency or variability
was more important, which would you choose and why?
this checklist should help with preparing assignments
Text Reading & Text Problems
EXTREMELY IMPORTANT: Make sure that you have read
De Veaux through Chapter 6,
Also IMPORTANT: Read De Veaux's Review of Part I on
page
154 - if you do not understand this entire
page well, I highly recommend that you review content areas in the first
six chapters. Notice that there are
40 problems following this page - I am NOT
assigning the following list of problems, but I strongly suggest that you
be able to solve them without looking up the answer: 1, 2, 4, 5, 6, 7, 9,
11, 12, 14, 18, 21, 25, 29, 31, 33, 36 - this exercise
could be a sort of wake-up call as to how well you are understanding the
material
so far.
The indirect process of hypothesis testing always sends
a large segment of the class reeling. Thus, please also STUDY, not just
read,
the supplemental class notes on statistical power.
Textbook reading is very highly recommended on hypothesis
testing (pages 508 - 516)
and statistical power (pages 543 - 551).
Most students feel that simply knowing the definitions
for hypothesis
and
power
is sufficient.
These topics are highly related and much deeper than appearing on the surface.
It will be hard to over-prepare on these topics, and I suggest raising
as many questions as you can, as soon as you can.
The following are chapter 6 and additional problems to solve
and submit for assignment #3
A. The class data set is used to recode the variable EYE
1) recode the data to appear similarly (but not exactly)
as it does in the above display (for the varible
EYE) with all alphabetic code characters changed to numerics. Please recode
the variable EYE as listed to the codes supplied below:
EYE - BLU = 1; BRN = 2; GRN = 3.
submit as the solution [to A.1] two columns, including
your recoding, of the variable EYE, but make sure you also provide the
column ID (the first column)
2)With respect to this variable EYE, for each of the following
summary (descriptive statistics) explain how useful (& why or why not
each is useful) these statistics are: mean, median, mode, standard deviation
B. Frequency Distributions, & Mean, Median, Mode computations
1) use the Histogram Excel tool to plot a frequency polygon
graph of COOH. Make sure that you use the following bins: 0, 2, 4, 6, 8, 10
- do not forget to title the graph, and label the axes.
2) calculate the mean, median, & mode for COOH and draw
each of these lines onto the frequency polygon x-axis to intersect the polygon
3) describe the characteristics of the plot - how many
modes, what sort of symmetry, and identify outliers if any.
4) use the Histogram Excel tool to create a frequency polygon
graph of H2O. Make sure that you use the following bins: 0, 4, 8, 12 - do
not forget to title the graph, and label the axes.
5) calculate the mean, median, &
mode for H2O& draw corresponding lines reflecting each of these measures
of central tendency onto the frequency polygon x-axis to intersect the polygon
6) describe the characteristics of the plot - how many modes,
what sort of symmetry, identify outliers if any.
C. Comparing Means on column graphs
1) use Excel to compute the mean & SD of age
in the class data set for SMOKE = No
2) use Excel to compute the mean
& SD of age
in the class data set for SMOKE = Yes
3) plot a column graph illustrating mean age as a function
of SMOKE for the means computed in the previous two steps
4) interpret this graph to the best of your ability
5) make one or two relevant statistical comments about
the means, the SD, or the graph
D. For extra
credit - Have you noted from looking at the computational formulae
for the mean that the denominator is N? Have you noted that the denominator
of the standard deviation is N minus 1 (N-1). I have indicated that standard
deviation is an average of sorts. Is it curious as to why the denominator for
the mean is N, but the denominator for the standard deviation is N minus 1 (N-1).
There is however, an alternative formula for the standard deviation in which
the denominator is simply N.
1) Why are we using the formula for standard deviation with
N-1 in the denominator?
2) Why would one use the alternate formula for standard
devation with N in the denominator?