non-parametric tests

Since I am dealing with a small number of samples (20 interviews), it is useful to consider non-parametric tests (Wolfowitz, 1942), which are also referred to as distribution-free tests. These tests have the obvious advantage of not requiring the assumption of normality or the assumption of homogeneity of variance. They work with ranks and are sensitive to medians rather than means; as a result, if the data have one or two outliers, their influence is greatly reduced.

Why are parametric tests preferred, in general, for the same number of observations?

They are more likely to lead to the rejection of a false null hypothesis (i.e., they have more statistical power).

Table 1. Non-parametric analogues of common parametric tests.

Parametric test                     Non-parametric analogue

  • One-sample t-test               nothing quite comparable (chi-square)
  • Paired-samples t-test           Wilcoxon T test
  • Independent-samples t-test      Mann-Whitney U test
  • Pearson’s correlation           Spearman’s correlation

Cases for considering non-parametric tests

  • Normality check: is the variable (or are the variables) non-normally distributed?
  • Sample size: with 100 or more observations, the sample is large enough that we can assume the sampling distribution is normal even if we are not sure that the variable is normally distributed in the population. If the sample is small, however, parametric tests can be used only if we are sure that the variable is normally distributed, and there is no reliable way to test this assumption when the sample is small.
  • Problems in measurement: most common statistical techniques, such as analysis of variance (and t-tests), regression, etc., assume that the underlying measurements are at least interval level, meaning that equally spaced intervals on the scale can be compared in a meaningful manner (e.g., B minus A is equal to D minus C). However, this assumption is very often not tenable, and the data represent a rank ordering of observations (ordinal) rather than precise measurements.
  • Parametric and nonparametric methods: considering these limits, there is an evident need for statistical procedures that enable us to process data of “low quality,” from small samples, on variables about whose distribution nothing is known. Specifically, nonparametric methods were developed for cases in which the researcher knows nothing about the parameters of the variable of interest in the population (hence the name nonparametric). Nonparametric methods do not rely on the estimation of parameters (such as the mean or the standard deviation) describing the distribution of the variable of interest in the population. Therefore, these methods are also sometimes called parameter-free or distribution-free methods.

Brief Overview of Nonparametric Methods

  • Tests of differences between groups (independent samples);
    • If we have two samples that we want to compare concerning their mean value for some variable of interest, we would usually use the t-test for independent samples; nonparametric alternatives for this test are the Wald-Wolfowitz runs test, the Mann-Whitney U test, and the Kolmogorov-Smirnov two-sample test. If there are multiple groups to compare, we would use analysis of variance (see ANOVA/MANOVA); the nonparametric equivalents to this method are the Kruskal-Wallis analysis of ranks and the Median test.
  • Tests of differences between variables (dependent samples);
  • Tests of relationships between variables

When to Use Which Method

The Wilcoxon matched pairs test assumes that one can rank order the magnitude of differences in matched observations in a meaningful manner. If this is not the case, use the Sign test instead.

The Kolmogorov-Smirnov two-sample test is not only sensitive to differences in the location of distributions (for example, differences in means) but is also greatly affected by differences in their shapes.

In general, if the result of a study is important (e.g., does a very expensive and painful drug therapy help people get better?), then it is always advisable to run several different nonparametric tests; should the results differ depending on which test is used, one should try to understand why.

On the other hand, nonparametric statistics are less statistically powerful (sensitive) than their parametric counterparts, and if it is important to detect even small effects (e.g., is this food additive harmful to people?), one should be very careful in the choice of a test statistic.

SPSS example ==>

Another good source that illustrates the difference between typical nonparametric tests and their parametric “equivalents” ==>

Nonparametric Statistics

I shall compare the Wilcoxon rank-sum statistic with the independent samples t-test to illustrate the differences between typical nonparametric tests and their parametric “equivalents.”

                Independent Samples t Test                Wilcoxon Rank-Sum Test

H0:             μ1 = μ2                                   Population 1 = Population 2
Assumptions:    Normal populations                        None for general test, but often assume:
                Homogeneity of variance                     Equal shapes
                (but not for separate variances test)       Equal dispersions

Both tests are appropriate for determining whether or not there is a significant association between a dichotomous variable and a continuous variable with independent samples data.  Note that with the independent samples t test the null hypothesis focuses on the population means.  If you have used the general form of the nonparametric hypothesis (without assuming that the populations have equal shapes and equal dispersions), rejection of that null hypothesis simply means that you are confident that the two populations differ on one or more of location, shape, or dispersion.  If, however, we are willing to assume that the two populations have identical shapes and dispersions, then we can interpret rejection of the nonparametric null hypothesis as indicating that the populations differ in location.  With these equal shapes and dispersions assumptions the nonparametric test is quite similar to the parametric test.  In many ways the nonparametric tests we shall study are little more than parametric tests on rank-transformed data.  The nonparametric tests we shall study are especially sensitive to differences in medians.

If your data indicate that the populations are not normally distributed, then a nonparametric test may be a good alternative, especially if the populations do appear to be of the same non-normal shape.  If, however, the populations are approximately normal but heterogeneous in variance, I would recommend a separate variances t-test over a nonparametric test.  If you cannot assume equal dispersions with the nonparametric test, then you cannot  interpret rejection of the nonparametric null hypothesis as due solely to differences in location.

Conducting the Wilcoxon Rank-Sum Test

Rank the data from lowest to highest. If you have tied scores, assign all of them the mean of the ranks for which they are tied. Find the sum of the ranks for each group. If n1 = n2, then the test statistic, WS, is the smaller of the two sums of ranks. Go to the table (starts on page 715 of Howell) and obtain the one-tailed (lower-tailed) p. For a two-tailed test (nondirectional hypotheses), double the p. If n1 ≠ n2, obtain both WS and WS′. WS is the sum of the ranks for the group with the smaller n. WS′ (see the rightmost column in the table) is the sum of the ranks that would have been obtained for the smaller group if we had ranked from high to low rather than low to high; equivalently, WS′ = ns(n1 + n2 + 1) - WS, where ns is the smaller of n1 and n2. The test statistic is the smaller of WS and WS′. If you have a directional hypothesis, to reject the null hypothesis not only must the one-tailed p be less than or equal to the criterion, but also the mean rank for the sample predicted (in H1) to come from the population with the smaller median must be less than the mean rank in the other sample (otherwise the exact p = one minus the p that would have been obtained were the direction correctly predicted).
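The ranking steps above can be sketched in a few lines of Python. This is only an illustration using the standard library: the function names `midranks` and `rank_sum_statistic` and the sample scores are made up for this post, and the sketch stops at the test statistic (the p still comes from the table or from software).

```python
def midranks(values):
    """Return 1-based ranks; tied values get the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for pos in range(i, j + 1):
            ranks[order[pos]] = mean_rank
        i = j + 1
    return ranks


def rank_sum_statistic(group_a, group_b):
    """Return (WS, WS', test statistic) for two independent samples."""
    # WS always refers to the group with the smaller n.
    if len(group_a) <= len(group_b):
        small, large = list(group_a), list(group_b)
    else:
        small, large = list(group_b), list(group_a)
    ranks = midranks(small + large)  # rank all scores together, low to high
    ws = sum(ranks[:len(small)])
    # Rank sum the smaller group would get if ranking ran from high to low.
    ws_prime = len(small) * (len(small) + len(large) + 1) - ws
    return ws, ws_prime, min(ws, ws_prime)


# Hypothetical scores, including one tie (2.4 appears in both groups).
ws, ws_prime, statistic = rank_sum_statistic([1.1, 2.4, 3.0], [2.4, 3.5, 4.2, 5.0])
print(ws, ws_prime, statistic)  # 7.5 16.5 7.5
```

When n1 = n2, WS′ is simply the rank sum of the other group, so taking the smaller of WS and WS′ covers both cases described above.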

If you have large sample sizes, you can use the normal approximation procedures explained on pages 675-677 of Howell. Computer programs generally do use such an approximation, but they may also make a correction for continuity (reducing the absolute value of the numerator by .5) and they may obtain the probability from a t-distribution rather than from a z-distribution. Please note that the rank-sum statistic is essentially identical to the Mann-Whitney U statistic (which is better known to psychologists), but the Wilcoxon is easier to compute. If someone insists on having U, you can always transform your W to U (see page 678 in Howell).
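As a rough sketch of that approximation: the formulas below are the standard textbook mean and variance of a rank sum under the null hypothesis, which I am assuming match the ones in Howell, and they include no correction for ties. The function names are my own, and the values 52, 8, and 10 are borrowed from the prenatal-care summary statement in this post purely as inputs.

```python
import math


def rank_sum_normal_approx(ws, n_small, n_large, continuity=True):
    """Return (z, two-tailed p) for a rank sum WS of the smaller group.

    Standard large-sample approximation without a tie correction.
    """
    mean = n_small * (n_small + n_large + 1) / 2
    sd = math.sqrt(n_small * n_large * (n_small + n_large + 1) / 12)
    diff = ws - mean
    if continuity:
        # Shrink the absolute value of the numerator by .5, keeping its sign.
        diff = math.copysign(max(abs(diff) - 0.5, 0.0), diff)
    z = diff / sd
    p_two_tailed = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p_two_tailed


def w_to_u(w, n):
    """Convert the rank sum W of a group of size n to Mann-Whitney U."""
    return w - n * (n + 1) / 2


z, p = rank_sum_normal_approx(52, 8, 10)
print(round(z, 3), round(p, 3))
print(w_to_u(52, 8))  # 16.0
```

The approximate p printed here will not exactly equal an exact p obtained from software, which is one reason to report which method was used.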

Here is a summary statement for the problem on page 676 of Howell (I obtained an exact p from SAS rather than using a normal approximation):  A Wilcoxon rank-sum test indicated that babies whose mothers started prenatal care in the first trimester weighed significantly more (N = 8, M = 3259 g, Mdn = 3015 g, s = 692 g) than did those whose mothers started prenatal care in the third trimester (N = 10, M = 2576 g, Mdn = 2769 g, s = 757 g), W = 52, p = .034.

Power of the Wilcoxon Rank Sums Test

You already know  that the majority of statisticians reject the notion that parametric tests require interval data and thus ordinal data need be analyzed with nonparametric methods (Gaito, 1980).  There are more recent simulation studies that also lead one to the conclusion that scale of measurement (interval versus ordinal) should not be considered when choosing between parametric and nonparametric procedures (see the references on page 57 of Nanna & Sawilowsky, 1998).  There are, however, other factors that could lead one to prefer nonparametric analysis with certain types of ordinal data.  Nanna and Sawilowsky (1998) addressed the issue of Likert scale data.  Such data typically violate the normality assumption and often the homogeneity of variance assumption made when conducting traditional parametric analysis.  Although many have demonstrated that the parametric methods are so robust to these violations that this is not usually a serious problem with respect to holding alpha at its stated level (but can be, as you know from reading Bradley’s articles in the Bulletin of the Psychonomic Society), one should also consider the power characteristics of parametric versus nonparametric procedures.

While it is generally agreed that parametric procedures are a little more powerful than nonparametric procedures when the assumptions of the parametric procedures are met, what about the case of data for which those assumptions are not met, for example, the typical Likert scale data?  Nanna and Sawilowsky demonstrated that with typical Likert scale data, the Wilcoxon rank sum test has a considerable power advantage over the parametric t test.  The Wilcoxon procedure had a power advantage with both small and large samples, with the advantage actually increasing with sample size.

Wilcoxon’s Signed-Ranks Test

This test is appropriate for matched pairs data, that is, for testing the significance of the relationship between a dichotomous variable and a continuous variable with related samples. It does assume that the difference scores are rankable, which is certain if the original data are interval scale. The parametric equivalent is the correlated t-test, and another nonparametric alternative is the binomial sign test. To conduct this test you compute a difference score for each pair, rank the absolute values of the difference scores, and then obtain two sums of ranks: the sum of the ranks of the difference scores which were positive and the sum of the ranks of the difference scores which were negative. The test statistic, T, is the smaller of these two sums for a nondirectional test (for a directional test it is the sum which you predicted would be smaller). Difference scores of zero are usually discarded from the analysis (prior to ranking), but it should be recognized that this biases the test against the null hypothesis. A more conservative procedure would be to rank the zero difference scores and count them as being included in the sum which would otherwise be the smaller sum of ranks. Refer to the table that starts on page 709 of Howell to get the exact one-tailed (lower-tailed) p, doubling it for a nondirectional test. Normal approximation procedures are illustrated on page 681 of Howell. Again, computer software may use a correction for continuity and may use t rather than z.
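The computation of T can be sketched as follows. This is a minimal illustration with made-up scores; it follows the usual convention of discarding zero differences before ranking (not the more conservative variant mentioned above), and the function names are my own.

```python
def midranks(values):
    """Return 1-based ranks; tied values get the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for pos in range(i, j + 1):
            ranks[order[pos]] = mean_rank
        i = j + 1
    return ranks


def signed_ranks_t(first, second):
    """Return (sum of positive ranks, sum of negative ranks, T) for paired scores."""
    # Difference score for each pair; zero differences are discarded before ranking.
    diffs = [b - a for a, b in zip(first, second) if b - a != 0]
    ranks = midranks([abs(d) for d in diffs])  # rank the absolute differences
    positive = sum(r for r, d in zip(ranks, diffs) if d > 0)
    negative = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return positive, negative, min(positive, negative)  # T for a nondirectional test


# Hypothetical paired scores; the third pair has a zero difference and is dropped.
before = [10, 12, 9, 14, 11, 8]
after = [13, 11, 9, 19, 13, 12]
print(signed_ranks_t(before, after))  # (14.0, 1.0, 1.0)
```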

Here is an example summary statement using the data on page 682 of Howell:  A Wilcoxon signed-ranks test indicated that participants who were injected with glucose had significantly better recall (M = 7.62, Mdn = 8.5, s = 3.69) than did subjects who were injected with saccharine (M = 5.81, Mdn = 6, s = 2.86), T(N = 16) = 14.5, p = .004.

Kruskal-Wallis ANOVA

This test is appropriate to test the significance of the association between a categorical variable (k ≥ 2 groups) and a continuous variable when the data are from independent samples. Although it could be used with 2 groups, the Wilcoxon rank-sum test would usually be used with two groups. To conduct this test you rank the data from low to high and for each group obtain the sum of ranks. These sums of ranks are substituted into the formula on page 683 of Howell. The test statistic is H, and the p is obtained as an upper-tailed area under a chi-square distribution on k-1 degrees of freedom. Do note that this one-tailed p is appropriately used for a nondirectional test. If you had a directional test (for example, predicting that Population 1 < Population 2 < Population 3), and the medians were ordered as predicted, you would divide that one-tailed p by k! before comparing it to the criterion.
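A sketch of the computation, assuming the formula in Howell is the standard one, H = 12 / (N(N + 1)) × Σ(Rj² / nj) - 3(N + 1), with no correction for ties. The function names are my own. The chi-square helper uses a closed form that is valid only for even degrees of freedom, which is enough for the common k = 3 case; for other cases use a statistics package.

```python
import math


def kruskal_wallis_h(rank_sums, group_sizes):
    """H from per-group rank sums R_j and group sizes n_j (no tie correction)."""
    n_total = sum(group_sizes)
    weighted = sum(r * r / n for r, n in zip(rank_sums, group_sizes))
    return 12.0 / (n_total * (n_total + 1)) * weighted - 3 * (n_total + 1)


def chi2_upper_tail_even_df(x, df):
    """Upper-tailed chi-square area; this closed form holds only for even df."""
    if df <= 0 or df % 2:
        raise ValueError("closed form requires a positive, even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i  # builds (x/2)^i / i!
        total += term
    return math.exp(-half) * total


# Toy data: ranks 1..6 split perfectly into three groups of two.
h = kruskal_wallis_h([3, 7, 11], [2, 2, 2])
print(round(h, 4))  # 4.5714

# Upper-tailed p for an H of 10.36 on k - 1 = 2 degrees of freedom.
p = chi2_upper_tail_even_df(10.36, 2)
print(round(p, 3))  # 0.006
print(p / math.factorial(3))  # directional version: divide by k!
```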

The null hypothesis here is:  Population 1 = Population 2 = … = Population k.  If you reject that null hypothesis you probably will still want to make “pairwise comparisons,” such as group 1 versus group 2, group 1 versus group 3, group 2 versus group 3, etc.  This topic is addressed in detail in Chapter 12 of Howell.  One may need to be concerned about inflating the “familywise alpha,” the probability of making one or more Type I errors in a family of c comparisons.  If k = 3, one can control this familywise error rate by using Fisher’s procedure (also known as “a protected test”):  Conduct the omnibus test (the Kruskal-Wallis) with the promise not to make any pairwise comparisons unless that omnibus test is significant.  If the omnibus test is not significant, you stop.  If the omnibus test is significant, then you are free to make the three pairwise comparisons with Wilcoxon’s rank-sum test.  If k > 3, Fisher’s procedure does not adequately control the familywise alpha.  One fairly conservative procedure is the Bonferroni procedure.  With this procedure one uses an adjusted criterion of significance: the maximum familywise alpha divided by the number of comparisons, c.  This procedure does not require that you first conduct the omnibus test, and should you first conduct the omnibus test, you may make the Bonferroni comparisons whether or not that omnibus test is significant.  Suppose that k = 4 and you wish to make all 6 pairwise comparisons (1-2, 1-3, 1-4, 2-3, 2-4, 3-4) with a maximum familywise alpha of .05.  Your adjusted criterion is .05 divided by 6, .0083.  For each pairwise comparison you obtain an exact p, and if that exact p is less than or equal to the adjusted criterion, you declare that difference to be significant.  Do note that the cost of such a procedure is a great reduction in power (you are trading an increased risk of Type II error for a reduced risk of Type I error).
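The Bonferroni arithmetic for the k = 4 example can be written out as a short sketch. The function name and the list of p values below are hypothetical, included only to show how each exact p is compared with the adjusted criterion.

```python
from itertools import combinations


def bonferroni_pairwise(k, familywise_alpha=0.05):
    """Return all pairwise comparisons among k groups and the adjusted criterion."""
    if k < 2:
        raise ValueError("need at least two groups")
    pairs = list(combinations(range(1, k + 1), 2))  # (1, 2), (1, 3), ...
    return pairs, familywise_alpha / len(pairs)


pairs, criterion = bonferroni_pairwise(4)
print(len(pairs), round(criterion, 4))  # 6 0.0083

# Hypothetical exact p values, one per comparison, in the same order as `pairs`.
p_values = [0.001, 0.020, 0.008, 0.300, 0.0083, 0.049]
for pair, p in zip(pairs, p_values):
    print(pair, "significant" if p <= criterion else "not significant")
```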

Here is a summary statement for the problem on page 684 of Howell:  Kruskal-Wallis ANOVA indicated that type of drug significantly affected the number of problems solved, H(2, N = 19) = 10.36, p = .006.  Pairwise comparisons made with Wilcoxon’s rank-sum test revealed that ………  Basic descriptive statistics (means, medians, standard deviations, sample sizes) would be presented in a table.

Friedman’s ANOVA

This test is appropriate to test the significance of the association between a categorical variable (k ≥ 2) and a continuous variable with randomized blocks data (related samples).  While Friedman’s test could be employed with k = 2, usually Wilcoxon’s signed-ranks test would be employed if there were only two groups.  Subjects have been matched (blocked) on some variable or variables thought to be correlated with the continuous variable of primary interest.  Within each block the continuous variable scores are ranked.  Within each condition (level of the categorical variable) you sum the ranks and substitute in the formula on page 685 of Howell.  As with the Kruskal-Wallis, obtain p from chi-square on k-1 degrees of freedom, using an upper-tailed p for nondirectional hypotheses, adjusting it with k! for directional hypotheses.  Pairwise comparisons could be accomplished employing Wilcoxon signed-ranks tests, with Fisher’s or Bonferroni’s procedure to guard against inflated familywise alpha.
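A sketch of the within-block ranking and the test statistic, assuming the formula in Howell is the standard one, chi-square = 12 / (nk(k + 1)) × ΣRj² - 3n(k + 1), where n is the number of blocks and k the number of conditions (no tie correction). The function names and the toy ratings are my own.

```python
def midranks(values):
    """Return 1-based ranks; tied values get the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for pos in range(i, j + 1):
            ranks[order[pos]] = mean_rank
        i = j + 1
    return ranks


def friedman_chi_square(blocks):
    """blocks: one row per block, each row holding k scores (one per condition)."""
    n, k = len(blocks), len(blocks[0])
    rank_sums = [0.0] * k
    for row in blocks:
        # Rank the scores within this block, then add them to the condition totals.
        for condition, rank in enumerate(midranks(row)):
            rank_sums[condition] += rank
    statistic = (12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums)
                 - 3 * n * (k + 1))
    return statistic, rank_sums


# Hypothetical ratings: 4 blocks (rows) by 3 conditions (columns).
ratings = [[1, 2, 3], [1, 3, 2], [1, 2, 3], [2, 1, 3]]
print(friedman_chi_square(ratings))  # (4.5, [5.0, 8.0, 11.0])
```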

Friedman’s ANOVA is closely related to Kendall’s coefficient of concordance.  For the example on page 685 of Howell, the Friedman test asks whether the rankings are the same for the three levels of visual aids.  Kendall’s coefficient of concordance, W, would measure the extent to which the blocks agree in their rankings.
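The relationship can be written down directly. The identity below is the standard one linking the two statistics (n blocks, k conditions); the numerical illustration is my own arithmetic on the chi-square, n, and k reported for this example, not a value taken from Howell.

```latex
W = \frac{\chi^2_F}{n(k-1)},
\qquad \text{e.g.}\quad
W = \frac{10.94}{17\,(3-1)} \approx 0.32
```

W ranges from 0 (no agreement among the blocks) to 1 (every block produces the same ranking).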

Here is a sample summary statement for the problem on page 685 of Howell:  Friedman’s ANOVA indicated that judgments of the quality of the lectures were significantly affected by the number of visual aids employed, χ²(2, n = 17) = 10.94, p = .004.  Pairwise comparisons with Wilcoxon signed-ranks tests indicated that …………………..  Basic descriptive statistics would be presented in a table.


It is commonly opined that the primary disadvantage of the nonparametric procedures is that they have less power than does the corresponding parametric test.  The reduction in power is not, however, great, and if the assumptions of the parametric test are violated, then the nonparametric test may be more powerful.

Everything You Ever Wanted to Know About Six But Were Afraid to Ask

You may have noticed that the numbers 2, 3, 4, 6, 12, and 24 commonly appear as constants in the formulas for nonparametric test statistics.  This results from the fact that the sum of the integers from 1 to n is equal to n(n + 1) / 2.
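A short sketch of where those constants come from, using standard identities for the ranks 1, ..., n (these are general results, not taken from Howell):

```latex
\begin{align*}
\sum_{i=1}^{n} i &= \frac{n(n+1)}{2}, &
\sum_{i=1}^{n} i^{2} &= \frac{n(n+1)(2n+1)}{6}, \\[4pt]
\bar{R} &= \frac{n+1}{2}, &
\sigma_R^{2} &= \frac{(n+1)(2n+1)}{6} - \frac{(n+1)^{2}}{4} = \frac{n^{2}-1}{12}.
\end{align*}
```

The 2 comes from the mean rank, the 6 from the sum of squared ranks, and the 12 from the variance of the ranks, which is why 12 appears in the Kruskal-Wallis and Friedman formulas. The 24 shows up in the variance of the signed-ranks statistic, n(n + 1)(2n + 1) / 24.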

Effect Size Estimation

As you know, the American Psychological Association now emphasizes the reporting of effect size estimates.  Since the unit of measure for most criterion variables used in psychological research is arbitrary, standardized effect size estimates, such as Hedges’ g, η², and ω² are popular.  What is one to use when the analysis has been done with nonparametric methods?  This query is addressed in the document “A Call for Greater Use of Nonparametric Statistics,” pages 13-15.  The authors (Leech & Onwuegbuzie) note that researchers who employ nonparametric analysis generally either do not report effect size estimates or report parametric effect size estimates such as g.  It is, however, known that these effect size estimates are adversely affected by departures from normality and heterogeneity of variances, so they may not be well advised for use with the sort of data which generally motivates a researcher to employ nonparametric analysis.

There are a few nonparametric effect size estimates (see Leech & Onwuegbuzie), but they are not well-known and they are not available in the typical statistical software package.  You can find SAS code for computing two nonparametric effect size estimates in the document “Robust Effect Size Estimates and Meta-Analytic Tests of Homogeneity” (Hogarty & Kromrey, SAS Users Group International Conference, Indianapolis, April, 2000).

Wilcoxon-Mann-Whitney test

The Wilcoxon-Mann-Whitney test is a non-parametric analog to the independent samples t-test and can be used when you do not assume that the dependent variable is a normally distributed interval variable (you only assume that the variable is at least ordinal).  You will notice that the SPSS syntax for the Wilcoxon-Mann-Whitney test is almost identical to that of the independent samples t-test.  We will use the same data file (the hsb2 data file) and the same variables in this example as we did in the independent t-test example above and will not assume that write, our dependent variable, is normally distributed.

npar test
 /m-w = write by female(0 1).

The results suggest that there is a statistically significant difference between the underlying distributions of the write scores of males and the write scores of females (z = -3.329, p = 0.001).
