The following is a list of useful words and terms introduced
in Lecture #1. This list is important,
but it may take until the end of the semester before you feel comfortable with
each of these.
Definitions
Statistics
Descriptive Statistics
Inferential Statistics
Data
Information
Variability
Confidence
Sample
Population
Science
Operationalization
Measurement
Variable
Measurement Scales (This
list is every bit as important as, if not more important than, the one above!)
mean height data from the Fall 2004 class for 11
brown-eyed and 8 non-brown-eyed students
the x-axis contains the CATEGORICAL (or we can also say
NOMINAL) data & the y-axis is RATIO (or we can say quantitative)
the next graph below shows the same data as in the previous
plot
the appearance is strikingly different compared to the previous
plot - WHY?
which graph is better for displaying the present results
& WHY?
although this is an example of a NOMINAL variable on the x-axis,
there is usually little utility in looking at the way height
varies as a function of eye color - WHY?
Graphing Relationships
the following graph is a scatter plot from the
Fall 2004 class data set
both the x- & y-axis data are ratio scale
would it be possible for us to reverse the axes and put
height on the y- and weight on the x-axis?
how do you interpret this graph entitled Weight as a Function
of Height in Fall 2004 Students HFIT-565 Students?
the following details surrounding summation notation are not mandatory material,
but it could help your understanding of some of the formulae we encounter in
the text.
Single Summation
Mathematical and statistical notations are often looked at
with great trepidation, but there is really no reason for fright, particularly
if one will reduce a long expression to a few terms. In the brief Chapter 3,
we see the introduction of summation. The first formula presented is that of
an arithmetic mean.
$M = \frac{\sum X}{N}$
M = mean
$\sum$ = summation
X = the values for each case in a set; the set X
must be defined
N = number of cases in the set of X values
Say we have a set of four scores (X values), 1, 2, 3, 4, and
we want to apply the formula for the mean. Simply add each of the four X values
(1+2+3+4=10). Now we can divide by the number of cases (N=4), and we see that
M=2.5. You have probably used this formula hundreds of times, but never thought
of it in this form of expression.
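As a quick check on the arithmetic above, here is a minimal Python sketch of the same calculation. The variable names (scores, total, n, mean) are chosen only for illustration; they are not from the text or from Excel.

```python
# The four scores (X values) from the example above
scores = [1, 2, 3, 4]

total = sum(scores)   # summation of X: 1 + 2 + 3 + 4
n = len(scores)       # N, the number of cases
mean = total / n      # M = (sum of X) / N

print(total, n, mean)  # 10 4 2.5
```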
We next see the introduction of notation as follows.
$M = \frac{\sum_{i=1}^{N} X_i}{N}$
Please do not be confused by this. We will use this expression
only for a few problems. The subscript and superscript on the $\sum$
indicate the subset of values to which the sum operation should be applied.
If 1 and N are specified as in the above formula, we use all values of X inclusive,
from the first through the last member of the set.
For example, the introduction of parentheses can make a great
deal of difference in two expressions that appear similar except for the parentheses.
$\sum X^2$ - in this expression, sum the
square of each value, but in the following expression:
$(\sum X)^2$ - we need to complete the
summation before we can raise the sum to the second power.
In other words, it will be important to obey
the meaning of parentheses. Just as in algebra, if parentheses are not obeyed,
it is probable that you will derive a solution other than a correct one. We
can apply the same data that we used above to solve the formula for the arithmetic
mean and apply these data to each of these last two expressions requiring squares
of values. The X values are 1, 2, 3, 4.
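first, for the sum of squares we get: $\sum X^2 = 1^2 + 2^2 + 3^2 + 4^2 = 1 + 4 + 9 + 16 = 30$
next, for the sum squared we get: $(\sum X)^2 = (1 + 2 + 3 + 4)^2 = 10^2 = 100$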
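The difference between the sum of squares and the sum squared can also be checked in a few lines of Python. This is only a sketch using the same four X values; the variable names are illustrative.

```python
scores = [1, 2, 3, 4]

# Sum of squares: square each value first, then add the squares
sum_of_squares = sum(x ** 2 for x in scores)   # 1 + 4 + 9 + 16

# Sum squared: add the values first, then square the total
sum_squared = sum(scores) ** 2                 # 10 ** 2

print(sum_of_squares)  # 30
print(sum_squared)     # 100
```

The only difference between the two lines is where the squaring happens, which is exactly what the parentheses control in the written notation.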
I want you to be able to read and understand formulae presented
in this style because it will give you a more thorough sense of different statistics
than simply relying on computer outputs. However, in the same fashion that an
individual can get through life without knowing multiplication tables, as long
as they have a calculator, it will be possible to get through this course without
knowing statistical formulae, as long as one has a computer.
The Mode
the most frequent score for a given variable in a data set;
used for quick estimates of central tendency
the mode is easy to obtain, but subject to extreme fluctuation
with small changes in a few scores
if accuracy is not critical, but there is a need for speed,
the mode is your choice
the mode is the most appropriate measure of central tendency
for categorical data
as an example, let's turn to our class data spreadsheet
and determine the mode for the variable SMOKE
(the data below are from a previous semester, but they can be used to illustrate
several important points about central tendency)
I have selected only the first two columns of the class data
set
I have recoded the No and Yes responses for SMOKE
to 0 and 1 respectively
we will do this first by hand, but then using Excel's built-in
function
first, look at the depiction of a subset of the class spreadsheet
below; we sometimes call a spreadsheet of numbers like this a
data set, or a table, or a matrix, or an array
- for our purposes these terms are synonymous
please note that the variable we will work with is in column
B - SMOKE
at a glance it is somewhat difficult to determine the mode
since column B data is in no particular order
we can SORT the data
to make the task easier, and just below is the result of the sort
do you know how to sort? do you
know how to sort without corrupting your array?
we now have the mode for SMOKE (i.e., 0) displayed
in cell B24
clearly, it is overkill to use the Excel mode function in
a column of only 22 numbers
however, having even a few more values in a column necessitates
using a more reliable method than eyeballing
moreover, if we were performing a more complicated function,
like finding the median or mean, it would probably be best to let a function
perform the task...we will see more about this in a few moments
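For readers who want to see the same sort-then-count idea outside of Excel, here is a small Python sketch. The class spreadsheet is not reproduced in these notes, so the 22 values in smoke are hypothetical (0 = No, 1 = Yes); only the column length and the resulting mode of 0 mirror the example above.

```python
from collections import Counter
from statistics import mode

# HYPOTHETICAL recoded SMOKE column (0 = No, 1 = Yes), 22 cases, in no particular order
smoke = [0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1,
         0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0]

# By hand: sorting groups identical values so they are easy to count
print(sorted(smoke))

# Counting each value gives the frequency of No (0) and Yes (1)
counts = Counter(smoke)
print(counts[0], counts[1])  # 18 4

# The library function returns the most frequent value directly
print(mode(smoke))  # 0
```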
The Median
mid-point of a distribution for a given variable with 50%
of scores above & below this mid-point
the median is well suited when ordinal scaled data have
been used or we have a skewed distribution
the median can be extremely easy to determine by hand when
there are very few scores, especially when...
there are an odd number of unique, contiguous scores
there are an even number of unique, contiguous scores
finding the median becomes a little more challenging when
there are a large number of scores and/or...
one or more scores has a frequency greater than one
scores are not contiguous
when several scores are either missing, indeterminate, or
extreme, the median may be a good choice
does it make sense to obtain the median on ratio, interval,
and nominal data - why?
the median is of great importance when using non-parametric
statistics
let's move to an example with a bit of an expanded view
of our class data set to include the variables through COOH (number
of alcoholic drinks per week)
first by hand, and then with a function we will derive the
median of COOH
we can turn to the variable COOH in column F
the first step in the manual calculation will be to sort
the data
how do we know that the data have been sorted below in the
next frame?
the column F data have been sorted and we see that there
are 22 cases - please note that the first row is occupied with the variable names
and thus, there are only 22 cases despite the 23rd row being the last row
of the data set
since we need to have 11 scores both above and below the
median, we can see that this mid-point falls between rows 12 and 13
now, we will let Excel perform the median calculation on
the variable COOH
please note that entered in cell F24 is the Excel function
"=MEDIAN(F2:F23)"
again, simply clicking the green check symbol in the
formula bar or pressing the Enter key will result in Excel computing the
median for COOH and displaying the value (2.0) in cell F24
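The manual procedure (sort, then find the mid-point) can be sketched in Python as follows. The drinks values below are hypothetical, not the class data; they were picked so that there is an even number of scores, which means the median falls between the two middle scores and is taken as their average (Excel's MEDIAN does the same for an even count).

```python
from statistics import median

# HYPOTHETICAL drinks-per-week scores (not the class data), even number of cases
drinks = [3, 0, 14, 1, 6, 0, 4, 1]

ordered = sorted(drinks)           # step 1: sort the data
n = len(ordered)                   # 8 cases, so 4 scores lie on each side of the mid-point

lower_mid = ordered[n // 2 - 1]    # last score of the lower half
upper_mid = ordered[n // 2]        # first score of the upper half
by_hand = (lower_mid + upper_mid) / 2

print(ordered)                     # [0, 0, 1, 1, 3, 4, 6, 14]
print(by_hand, median(drinks))     # 2.0 2.0
```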
The Mean
the most commonly used measure of central tendency and the
most laborious to compute is the mean
we will use the mean more than the other measures and thus
we need to know more about its properties
it is best used when the distribution of scores is balanced
(i.e., symmetrical, normal)
extreme scores carry greater weight than central scores
or stated another way, extreme scores tend to pull the mean in the direction
of the extreme values - this is IMPORTANT
we will talk next week about how extreme scores impact
variability
now however, we will use Excel to compute a mean for COOH
as you can see, the Excel function for computing the mean
is specified by the name "AVERAGE"
since the variable COOH is in column F, the cell
references F2:F23 are specified for the range of values to be used in the
mean computation
just as before, by either clicking the green check symbol
or pressing the carriage return, the computation is made
the mean COOH for your class is 3.2 alcoholic drinks
per week
it can be a big undertaking to hand calculate the mean for
a large number of values, but Excel can do this with ease
however, as easy as it is to
calculate the mean with Excel, can you give a good definition of the mean?
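The earlier point that extreme scores pull the mean toward themselves can be illustrated with a short Python sketch. The numbers are made up for the illustration and are not from the class data set.

```python
from statistics import median

# HYPOTHETICAL balanced set of scores
balanced = [1, 2, 3, 4, 5]

# The same scores with one extreme value added
with_extreme = balanced + [45]

mean_balanced = sum(balanced) / len(balanced)          # 15 / 5
mean_extreme = sum(with_extreme) / len(with_extreme)   # 60 / 6

print(mean_balanced, median(balanced))        # mean 3.0, median 3
print(mean_extreme, median(with_extreme))     # mean 10.0, median 3.5
```

A single extreme score more than triples the mean here, while the median barely moves, which is why the median is often preferred for skewed distributions.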
Preview for Next Week (measures of variability)
having Excel determine the standard deviation
let's stick with the same variable COOH
follow the same method as above for central tendency, but
use the STDEV function
by clicking the green check mark or hitting the carriage
return, the standard deviation will appear in cell F24
you can see below that the standard deviation = 4.4 for
these data for the variable COOH
in other words, roughly speaking, each of the
22 values for COOH differs by about 4.4 drinks per week, on average, from
the mean (3.2 alcoholic drinks per week) of the COOH distribution
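As a preview of what the STDEV function is doing, here is a hedged Python sketch. Excel's STDEV computes the sample standard deviation (dividing by N - 1), and Python's statistics.stdev does the same. The values list is hypothetical and is not the class data.

```python
from math import sqrt
from statistics import stdev

# HYPOTHETICAL scores (not the class data)
values = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(values)
mean = sum(values) / n                                   # 40 / 8 = 5.0

# Sum of squared deviations from the mean
squared_deviations = sum((x - mean) ** 2 for x in values)  # 32.0

# Sample standard deviation: divide by N - 1, then take the square root
by_hand = sqrt(squared_deviations / (n - 1))

print(round(by_hand, 2), round(stdev(values), 2))  # both about 2.14
```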
Although I am interested in your knowing about the purpose
of using "stem & leaf" as well as how to calculate the
"IQR," and the intent of using "boxplots,"
you will not be assigned any problems in which any of these three tools are
used. Also, for "data re-expression" (transformation), we will
not practice this technique during this course. If covered on a quiz or exam,
this material will only show up in a multiple choice or true/false question
rather than in a display using Excel.
this checklist should help with preparing
assignments
please remember to include
an answer sheet of answers only
please remember not to
overflow the right-hand margin
Text Reading & Text Problems
Read De Veaux Chapters 4 - 6
Text Problems
Problems Chapter 3: 47 (please
make sure all answers are backed up with Excel calculations)
Problems Chapter 4: 6 (in Excel,
draw a distribution for each part 6.a - 6.d and answer text questions),
9, 10, 34 (ignore the questions 34.a - 34.d and
answer the following: 34.a make a column plot of game frequency with
the following bins [< 60, 61-65, 66-70, 71-75, 76-80, 81-85]).
Additional Problems
A. Now focus on your class data set.
If you had problems with your data entry and do not know what you did wrong,
you may download an Excel copy of the data by clicking here (not available until after Lecture #2). You
have already computed the means for height and age for all students in the class.
Here are a few more computations.
1) Now calculate the mean weight for all students in your class data spreadsheet.
2) Convert the height in inches to height in centimeters
(cm) and calculate the mean in cm. What relationship do you see between the
mean in inches and mean in cm?
3) Convert the weight in pounds to weight in kilograms (kg)
and calculate the mean in kg. What relationship do you see between the mean
in pounds and mean in kg?
B. Produce two graphs similar to
the one above showing weight as a function
of height with your class data set.
1) make one graph with weight in pounds and height in inches
2) make a second graph with weight in kg and height in cm
3) what can you observe about the comparison of these two
plots?
C. For extra credit, convert the weight of
all members of your class data set from pounds to stone
1) show the converted columns of values
expressed in stone
2) compute and display the mean
3) plot the weight data expressed in stone
as a function of case