Reliability and Item Analysis

This topic discusses the concept of reliability of measurement as used in social sciences (but not in industrial statistics or biomedical research). The term reliability as used in industrial statistics denotes a function describing the probability of failure (as a function of time).

General Introduction

In many areas of research, the precise measurement of hypothesized processes or variables (theoretical constructs) poses a challenge by itself. For example, in psychology, the precise measurement of personality variables or attitudes is usually a necessary first step before any theories of personality or attitudes can be considered. In general, in all social sciences, unreliable measurements of people's beliefs or intentions will obviously hamper efforts to predict their behavior. The issue of precision of measurement will also come up in applied research, whenever variables are difficult to observe. For example, reliable measurement of employee performance is usually a difficult task; yet, it is obviously a necessary precursor to any performance-based compensation system.

In all of these cases, Reliability & Item Analysis may be used to construct reliable measurement scales, to improve existing scales, and to evaluate the reliability of scales already in use. Specifically, Reliability & Item Analysis will aid in the design and evaluation of sum scales, that is, scales that are made up of multiple individual measurements (e.g., different items, repeated measurements, different measurement devices, etc.). You can compute numerous statistics that allow you to build and evaluate scales following the so-called classical testing theory model.

The assessment of scale reliability is based on the correlations between the individual items or measurements that make up the scale, relative to the variances of the items. If you are not familiar with the correlation coefficient or the variance statistic, we recommend that you review the respective discussions provided in the Basic Statistics section.

The classical testing theory model of scale construction has a long history, and there are many textbooks available on the subject. For additional detailed discussions, you may refer to, for example, Carmines and Zeller (1980), De Gruitjer and Van Der Kamp (1976), Kline (1979, 1986), or Thorndyke and Hagen (1977). A widely acclaimed "classic" in this area, with an emphasis on psychological and educational testing, is Nunally (1970).

Testing hypotheses about relationships between items and tests. Using Structural Equation Modeling and Path Analysis (SEPATH), you can test specific hypotheses about the relationship between sets of items or different tests (e.g., test whether two sets of items measure the same construct, analyze multi-trait, multi-method matrices, etc.).

Basic Ideas

Suppose we want to construct a questionnaire to measure people's prejudices against foreign-made cars. We could start out by generating a number of items such as: "Foreign cars lack personality," "Foreign cars all look the same," etc. We could then submit those questionnaire items to a group of subjects (for example, people who have never owned a foreign-made car). We could ask subjects to indicate their agreement with these statements on 9-point scales, anchored at 1 = disagree and 9 = agree.

True scores and error. Let us now consider more closely what we mean by precise measurement in this case. We hypothesize that there is such a thing (theoretical construct) as "prejudice against foreign cars," and that each item "taps" into this concept to some extent. Therefore, we may say that a subject's response to a particular item reflects two aspects: first, the response reflects the prejudice against foreign cars, and second, it will reflect some esoteric aspect of the respective question. For example, consider the item "Foreign cars all look the same." A subject's agreement or disagreement with that statement will partially depend on his or her general prejudices, and partially on some other aspects of the question or person. For example, the subject may have a friend who just bought a very different looking foreign car.

Classical Testing Model

To summarize, each measurement (response to an item) reflects to some extent the true score for the intended concept (prejudice against foreign cars), and to some extent esoteric, random error. We can express this in an equation as:

$$X = \tau + \text{error}$$

In this equation, $X$ refers to the respective actual measurement, that is, a subject's response to a particular item; $\tau$ is commonly used to refer to the true score, and error refers to the random error component in the measurement.


In this context the definition of reliability is straightforward: a measurement is reliable if it reflects mostly true score, relative to the error. For example, an item such as "Red foreign cars are particularly ugly" would likely provide an unreliable measurement of prejudices against foreign-made cars. This is because there probably are ample individual differences concerning the likes and dislikes of colors. Thus, this item would "capture" not only a person's prejudice but also his or her color preference. Therefore, the proportion of true score (for prejudice) in subjects' responses to that item would be relatively small.

Measures of reliability. From the above discussion, one can easily infer a measure or statistic to describe the reliability of an item or scale. Specifically, we may define an index of reliability in terms of the proportion of true score variability that is captured across subjects or respondents, relative to the total observed variability. In equation form, we can say:

$$\text{Reliability} = \frac{\sigma^2_{\text{true score}}}{\sigma^2_{\text{total observed}}}$$
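A small simulation can make this index concrete. The sketch below assumes normally distributed true scores and errors (an assumption made only for illustration; the text does not specify any distribution) and compares the theoretical ratio of true score variance to total variance with the ratio observed in simulated responses. The function name `simulate_reliability` and its parameters are hypothetical.

```python
import random
import statistics


def simulate_reliability(n_subjects=5000, true_sd=1.0, error_sd=1.0, seed=0):
    """Simulate X = tau + error; return (theoretical, empirical) reliability."""
    rng = random.Random(seed)
    # True scores (tau) vary between subjects.
    tau = [rng.gauss(0.0, true_sd) for _ in range(n_subjects)]
    # Observed responses add independent random error to each true score.
    x = [t + rng.gauss(0.0, error_sd) for t in tau]
    # Reliability = true score variance / total observed variance.
    theoretical = true_sd ** 2 / (true_sd ** 2 + error_sd ** 2)
    empirical = statistics.pvariance(tau) / statistics.pvariance(x)
    return theoretical, empirical


theoretical, empirical = simulate_reliability()
print(f"theoretical={theoretical:.3f}  empirical={empirical:.3f}")
```

With equal true score and error standard deviations, both values should be close to .50; in real data the true scores are not observable, which is why the estimates discussed in the following sections are needed.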

Sum Scales

What will happen when we sum up several more or less reliable items designed to measure prejudice against foreign-made cars? Suppose the items were written so as to cover a wide range of possible prejudices against foreign-made cars. If the error component in subjects' responses to each question is truly random, then we may expect that the different components will cancel each other out across items. In slightly more technical terms, the expected value or mean of the error component across items will be zero. The true score component remains the same when summing across items. Therefore, the more items are added, the more true score (relative to the error score) will be reflected in the sum scale.

Number of items and reliability. This conclusion describes a basic principle of test design. Namely, the more items there are in a scale designed to measure a particular concept, the more reliable the measurement (sum scale) will be. Perhaps a somewhat more practical example will further clarify this point. Suppose you want to measure the height of 10 persons, using only a crude stick as the measurement device. Note that we are not interested in this example in the absolute correctness of measurement (i.e., in inches or centimeters), but rather in the ability to distinguish reliably between the 10 individuals in terms of their height. If you measure each person only once in terms of multiples of lengths of your crude measurement stick, the resultant measurement may not be very reliable. However, if you measure each person 100 times, and then take the average of those 100 measurements as the summary of the respective person's height, then you will be able to make very precise and reliable distinctions between people (based solely on the crude measurement stick).
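The measuring-stick example can be simulated directly. The sketch below is illustrative only: the height range, the error standard deviation of the "crude stick," and the helper names are all assumptions. It correlates the true heights of 10 persons with a single noisy measurement and with the average of 100 noisy measurements.

```python
import random


def pearson(xs, ys):
    """Pearson correlation between two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def mean_of_measurements(true_height, n_measurements, error_sd, rng):
    """Average n noisy readings of one person's height."""
    readings = [true_height + rng.gauss(0.0, error_sd)
                for _ in range(n_measurements)]
    return sum(readings) / n_measurements


rng = random.Random(42)
true_heights = [rng.uniform(150.0, 195.0) for _ in range(10)]  # assumed range
error_sd = 15.0  # assumed imprecision of the crude stick

single = [mean_of_measurements(h, 1, error_sd, rng) for h in true_heights]
averaged = [mean_of_measurements(h, 100, error_sd, rng) for h in true_heights]

r_single = pearson(true_heights, single)
r_averaged = pearson(true_heights, averaged)
print(f"1 measurement: r={r_single:.2f}   100 measurements: r={r_averaged:.2f}")
```

Because the random errors tend to cancel when averaged, the averaged measurements should track the true heights much more closely than the single measurements do.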

Let's now look at some of the common statistics that are used to estimate the reliability of a sum scale.

Cronbach's Alpha

To return to the prejudice example, if there are several subjects who respond to our items, then we can compute the variance for each item, and the variance for the sum scale. The variance of the sum scale will be larger than the sum of item variances if the items measure the same variability between subjects, that is, if they measure some true score. Technically, the variance of the sum of two items is equal to the sum of the two variances plus (two times) the covariance, that is, the amount of true score variance common to the two items.

We can estimate the proportion of true score variance that is captured by the items by comparing the sum of item variances with the variance of the sum scale. Specifically, we can compute:

$$\alpha = \frac{k}{k-1}\left[1 - \frac{\sum_{i=1}^{k} s_i^2}{s_{\text{sum}}^2}\right]$$

This is the formula for the most common index of reliability, namely, Cronbach's coefficient alpha ($\alpha$). In this formula, the $s_i^2$ denote the variances for the $k$ individual items; $s_{\text{sum}}^2$ denotes the variance for the sum of all items. If there is no true score but only error in the items (which is esoteric and unique, and, therefore, uncorrelated across subjects), then the variance of the sum will be the same as the sum of variances of the individual items. Therefore, coefficient alpha will be equal to zero. If all items are perfectly reliable and measure the same thing (true score), then coefficient alpha is equal to 1. (Specifically, $1 - \sum s_i^2 / s_{\text{sum}}^2$ will become equal to $(k-1)/k$; if we multiply this by $k/(k-1)$ we obtain 1.)
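The formula translates into a few lines of code. The following is a minimal sketch, assuming the responses are stored as a list of rows (one row per subject, one value per item); the function name `cronbach_alpha` and the data layout are illustrative choices, not part of the original text.

```python
import statistics


def cronbach_alpha(scores):
    """Coefficient alpha for a list of rows (subjects) of k item responses."""
    k = len(scores[0])
    if k < 2:
        raise ValueError("alpha needs at least two items")
    # Variance of each individual item across subjects.
    item_vars = [statistics.variance([row[i] for row in scores])
                 for i in range(k)]
    # Variance of the sum scale across subjects.
    sum_var = statistics.variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / sum_var)


# Items that all measure exactly the same thing: alpha is 1.
identical = [[1, 1, 1], [2, 2, 2], [3, 3, 3]]
# Two items that are uncorrelated across subjects: alpha is 0.
uncorrelated = [[1, 1], [1, 2], [2, 1], [2, 2]]
print(cronbach_alpha(identical), cronbach_alpha(uncorrelated))
```

The two toy data sets reproduce the boundary cases described above: perfectly consistent items give an alpha of 1, and items that share no variance give an alpha of 0.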

Alternative terminology. Cronbach's alpha, when computed for binary (e.g., true/false) items, is identical to the so-called Kuder-Richardson-20 formula of reliability for sum scales. In either case, because the reliability is actually estimated from the consistency of all items in the sum scales, the reliability coefficient computed in this manner is also referred to as the internal-consistency reliability.
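The equivalence can be checked numerically. The sketch below uses the usual textbook form of KR-20, $\frac{k}{k-1}\left[1 - \sum p_i q_i / \sigma^2_{\text{sum}}\right]$, where $p_i$ is the proportion of subjects scoring 1 on item $i$ and $q_i = 1 - p_i$; this form is not spelled out in the text above and is stated here as an assumption. The function names and the small data set are made up for illustration.

```python
import statistics


def alpha_population(scores):
    """Coefficient alpha using population (divide-by-n) variances."""
    k = len(scores[0])
    item_vars = [statistics.pvariance([row[i] for row in scores])
                 for i in range(k)]
    sum_var = statistics.pvariance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / sum_var)


def kr20(scores):
    """Kuder-Richardson-20 for binary (0/1) item responses."""
    n, k = len(scores), len(scores[0])
    # p_i: proportion of subjects answering item i with 1.
    p = [sum(row[i] for row in scores) / n for i in range(k)]
    pq = sum(pi * (1 - pi) for pi in p)
    sum_var = statistics.pvariance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - pq / sum_var)


binary = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0], [1, 1, 1], [0, 1, 0]]
print(alpha_population(binary), kr20(binary))
```

The two results agree because the population variance of a 0/1 item equals $p_i q_i$. Using sample (divide-by-n-1) variances for both the items and the sum gives the same alpha, since the common factor cancels in the ratio.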

Split-Half Reliability

An alternative way of computing the reliability of a sum scale is to divide it in some random manner into two halves. If the sum scale is perfectly reliable, we would expect that the two halves are perfectly correlated (i.e., r = 1.0). Less than perfect reliability will lead to less than perfect correlations. We can estimate the reliability of the sum scale via the Spearman-Brown split-half coefficient:

$$r_{sb} = \frac{2 r_{xy}}{1 + r_{xy}}$$

In this formula, $r_{sb}$ is the split-half reliability coefficient, and $r_{xy}$ represents the correlation between the two halves of the scale.
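As a rough sketch of the procedure, the code below splits the items at random into two halves, sums each half per subject, correlates the two half scores, and applies the Spearman-Brown formula. The function names, the seed parameter, and the row-per-subject data layout are assumptions made for this example.

```python
import random


def pearson(xs, ys):
    """Pearson correlation between two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def spearman_brown(r_xy):
    """Split-half reliability from the correlation between the two halves."""
    return 2 * r_xy / (1 + r_xy)


def split_half_reliability(scores, seed=0):
    """Randomly split the items into two halves and estimate reliability."""
    k = len(scores[0])
    items = list(range(k))
    random.Random(seed).shuffle(items)  # random assignment of items to halves
    half_a, half_b = items[: k // 2], items[k // 2:]
    a = [sum(row[i] for i in half_a) for row in scores]
    b = [sum(row[i] for i in half_b) for row in scores]
    return spearman_brown(pearson(a, b))


print(spearman_brown(0.6))
```

A correlation of .60 between the halves corresponds to a split-half reliability of about .75 for the full scale. Note that different random splits will generally give somewhat different estimates.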

Correction for Attenuation

Let us now consider some of the consequences of less than perfect reliability. Suppose we use our scale of prejudice against foreign-made cars to predict some other criterion, such as subsequent actual purchase of a car. If our scale correlates with such a criterion, it would raise our confidence in the validity of the scale, that is, that it really measures prejudices against foreign-made cars, and not something completely different. In actual test design, the validation of a scale is a lengthy process that requires the researcher to correlate the scale with various external criteria that, in theory, should be related to the concept that is supposedly being measured by the scale.

How will validity be affected by less than perfect scale reliability? The random error portion of the scale is unlikely to correlate with some external criterion. Therefore, if the proportion of true score in a scale is only 60% (that is, the reliability is only .60), then the correlation between the scale and the criterion variable will be attenuated, that is, it will be smaller than the actual correlation of true scores. In fact, the validity of a scale is always limited by its reliability.

Given the reliability of the two measures in a correlation (i.e., the scale and the criterion variable), we can estimate the actual correlation of true scores in both measures. Put another way, we can correct the correlation for attenuation:

$$r_{xy,\text{corrected}} = \frac{r_{xy}}{\sqrt{r_{xx}\, r_{yy}}}$$

In this formula, $r_{xy,\text{corrected}}$ stands for the corrected correlation coefficient, that is, it is the estimate of the correlation between the true scores in the two measures $x$ and $y$. The term $r_{xy}$ denotes the uncorrected correlation, and $r_{xx}$ and $r_{yy}$ denote the reliability of measures (scales) $x$ and $y$. You can compute the attenuation correction based on specific values, or based on actual raw data (in which case the reliabilities of the two measures are estimated from the data).
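A worked example with specific values may help. The sketch below applies the correction to an observed correlation of .30, assuming a scale reliability of .60 and a criterion reliability of .90; these numbers and the function name are invented for illustration.

```python
import math


def correct_for_attenuation(r_xy, r_xx, r_yy):
    """Estimate the true score correlation from an observed correlation.

    r_xy: observed correlation between scale x and criterion y
    r_xx, r_yy: reliabilities of x and y (each in the interval (0, 1])
    """
    if not (0 < r_xx <= 1 and 0 < r_yy <= 1):
        raise ValueError("reliabilities must lie in (0, 1]")
    return r_xy / math.sqrt(r_xx * r_yy)


corrected = correct_for_attenuation(0.30, 0.60, 0.90)
print(f"observed r = .30, corrected r = {corrected:.3f}")
```

Here the corrected value is about .41. When both reliabilities equal 1 the correlation is returned unchanged. With sample estimates the corrected value can occasionally exceed 1, which should be read as a sign of estimation error rather than as a meaningful correlation.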

Designing a Reliable Scale

After the discussion so far, it should be clear that the more reliable a scale, the better (e.g., more valid) the scale. As mentioned earlier, one way to make a sum scale more reliable is by adding items. You can compute how many items would have to be added in order to achieve a particular reliability, or how reliable the scale would be if a certain number of items were added. However, in practice, the number of items on a questionnaire is usually limited by various other factors (e.g., respondents get tired, overall space is limited, etc.). Let us return to our prejudice example, and outline the steps that one would generally follow in order to design the scale so that it will be reliable:
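The text does not say which formula is used for these computations. A common choice in classical test theory is the Spearman-Brown prophecy formula, which is what the sketch below uses; it assumes that any added items are comparable in quality to the existing ones. The function names and the example values (a 10-item scale with reliability .60 and a target of .80) are hypothetical.

```python
import math


def reliability_after_lengthening(r, factor):
    """Predicted reliability when the scale is lengthened by `factor`.

    r: current reliability; factor: new length / current length.
    """
    return factor * r / (1 + (factor - 1) * r)


def items_needed(r_current, k_current, r_target):
    """Total number of items needed to reach r_target (rounded up)."""
    factor = r_target * (1 - r_current) / (r_current * (1 - r_target))
    return math.ceil(factor * k_current)


total = items_needed(0.60, 10, 0.80)
print(total, reliability_after_lengthening(0.60, total / 10))
```

Under these assumptions, a 10-item scale with a reliability of .60 would need 27 items in total (17 more) to reach .80. Doubling the scale (factor 2) gives the same result as the split-half formula: .75.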

Step 1:

Generating items. The first step is to write the items. This is essentially a creative process where the researcher makes up as many items as possible that seem to relate to prejudices against foreign-made cars. In theory, one should "sample items" from the domain defined by the concept. In practice, for example in marketing research, focus groups are often utilized to illuminate as many aspects of the concept as possible. For example, we could ask a small group of highly committed American car buyers to express their general thoughts and feelings about foreign-made cars. In educational and psychological testing, one commonly looks at other similar questionnaires at this stage of the scale design, again, in order to gain as wide a perspective on the concept as possible.

Step 2:

Choosing items of optimum difficulty. In the first draft of our prejudice questionnaire, we will include as many items as possible. We then administer this questionnaire to an initial sample of typical respondents, and examine the results for each item. First, we would look at various characteristics of the items, for example, in order to identify floor or ceiling effects. If all respondents agree or disagree with an item, then it obviously does not help us discriminate between respondents, and thus, it is useless for the design of a reliable scale. In test construction, the proportion of respondents who agree or disagree with an item, or who answer a test item correctly, is often referred to as the item difficulty. In essence, we would look at the item means and standard deviations and eliminate those items that show extreme means, and zero or nearly zero variances.
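This screening step can be sketched as a simple filter over item means and standard deviations. The cut-offs used below (`margin` and `min_sd`) are arbitrary illustrative values, not recommendations from the text, and the function name and data are hypothetical.

```python
import statistics


def screen_items(scores, scale_min, scale_max, margin=0.1, min_sd=0.5):
    """Flag items with extreme means (floor/ceiling) or near-zero spread.

    scores: list of rows (subjects), each a list of item responses.
    margin: fraction of the scale range treated as "extreme" (assumed).
    min_sd: smallest acceptable standard deviation (assumed).
    """
    span = scale_max - scale_min
    flagged = {}
    for i in range(len(scores[0])):
        column = [row[i] for row in scores]
        mean = statistics.mean(column)
        sd = statistics.pstdev(column)
        reasons = []
        if mean <= scale_min + margin * span:
            reasons.append("floor")
        if mean >= scale_max - margin * span:
            reasons.append("ceiling")
        if sd < min_sd:
            reasons.append("low variance")
        if reasons:
            flagged[i] = reasons
    return flagged


# Four items on a 9-point scale; item 1 is always 9, item 2 is nearly always 1.
responses = [[5, 9, 1, 3], [7, 9, 1, 6], [2, 9, 2, 8], [8, 9, 1, 4]]
print(screen_items(responses, scale_min=1, scale_max=9))
```

In this toy data set, the item that everyone rates 9 is flagged for a ceiling effect and the item that almost everyone rates 1 is flagged for a floor effect; both also show too little variance to discriminate between respondents.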

Step 3:

Choosing internally consistent items. Remember that a reliable scale is made up of items that proportionately measure mostly true score; in our example, we would like to select items that measure mostly prejudice against foreign-made cars, and few esoteric aspects we consider random error. To do so, we would look at the following:


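Summary for scale: Mean = 46.1100   Std. Dv. = 8.26444   Valid N: 100
Cronbach alpha: .794313   Standardized alpha: .800491
Average inter-item corr.: .297818

[Item statistics table; the per-item rows are not recoverable. Columns: Mean if deleted, Var. if deleted, StDv. if deleted, Item-total correlation, Squared multiple R, Alpha if deleted]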

Shown above are the results for 10 items. Of most interest to us are the three right-most columns. They show us the correlation between the respective item and the total sum score (without the respective item), the squared multiple correlation between the respective item and all others, and the internal consistency of the scale (coefficient alpha) if the respective item were deleted. Clearly, items 5 and 6 "stick out," in that they are not consistent with the rest of the scale. Their correlations with the sum scale are .05 and .12, respectively, while all other items correlate at .45 or better. In the right-most column, we can see that the reliability of the scale would be about .82 if either of the two items were to be deleted. Thus, we would probably delete the two items from this scale.
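Two of the item statistics discussed here, the item-total correlation (computed without the respective item) and alpha if the item is deleted, can be sketched as follows. The squared multiple correlation is omitted because it requires a regression of each item on all others. The function names and the small data set are invented for illustration and are not the data behind the summary above.

```python
import statistics


def pearson(xs, ys):
    """Pearson correlation between two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def cronbach_alpha(scores):
    """Coefficient alpha for a list of rows (subjects) of item responses."""
    k = len(scores[0])
    item_vars = [statistics.variance([row[i] for row in scores])
                 for i in range(k)]
    sum_var = statistics.variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / sum_var)


def item_analysis(scores):
    """Per item: correlation with the sum of the other items, alpha if deleted.

    Requires at least three items so that alpha is defined after deletion.
    """
    results = []
    for i in range(len(scores[0])):
        item = [row[i] for row in scores]
        rest = [[v for j, v in enumerate(row) if j != i] for row in scores]
        results.append({
            "item": i,
            "item_total_r": pearson(item, [sum(r) for r in rest]),
            "alpha_if_deleted": cronbach_alpha(rest),
        })
    return results


# Items 0-2 move together across subjects; item 3 is unrelated to them.
data = [[1, 2, 1, 5], [2, 2, 3, 1], [3, 4, 3, 4],
        [4, 4, 5, 2], [5, 6, 5, 5], [6, 6, 7, 1]]
for row in item_analysis(data):
    print(row)
print("alpha, all items:", cronbach_alpha(data))
```

In this toy example the last item correlates poorly with the rest of the scale, and deleting it raises alpha noticeably, which mirrors the pattern described for items 5 and 6.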

Step 4:

Returning to Step 1. After deleting all items that are not consistent with the scale, we may not be left with enough items to make up an overall reliable scale (remember that the fewer items, the less reliable the scale). In practice, one often goes through several rounds of generating items and eliminating items, until one arrives at a final set that makes up a reliable scale.

Tetrachoric correlations. In educational and psychological testing, it is common to use yes/no type items, that is, to prompt the respondent to answer either yes or no to a question. An alternative to the regular correlation coefficient in that case is the so-called tetrachoric correlation coefficient. Usually, the tetrachoric correlation coefficient is larger than the standard correlation coefficient; therefore, Nunally (1970, p. 102) discourages the use of this coefficient for estimating reliabilities. However, it is a widely used statistic (e.g., in mathematical modeling).
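To illustrate why the tetrachoric coefficient tends to be larger, the sketch below compares the ordinary correlation for two yes/no items (the phi coefficient) with a simple closed-form approximation to the tetrachoric correlation, often called the cosine-pi approximation. This approximation is a stand-in chosen for brevity: it is not the full estimation procedure, it works best when both items split close to 50/50, and it is not necessarily what any particular software computes. The cell labels and counts are hypothetical.

```python
import math


def phi_coefficient(a, b, c, d):
    """Pearson correlation for a 2x2 table.

    a: yes/yes, b: yes/no, c: no/yes, d: no/no (cell counts).
    """
    return (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))


def tetrachoric_approx(a, b, c, d):
    """Cosine-pi approximation to the tetrachoric correlation."""
    if min(a, b, c, d) == 0:
        raise ValueError("approximation is not usable with empty cells")
    return math.cos(math.pi / (1 + math.sqrt((a * d) / (b * c))))


# 100 respondents: 40 answer yes/yes, 40 no/no, 10 in each mixed cell.
phi = phi_coefficient(40, 10, 10, 40)
tet = tetrachoric_approx(40, 10, 10, 40)
print(f"phi = {phi:.3f}, tetrachoric (approx.) = {tet:.3f}")
```

For this table the ordinary correlation is .60 while the approximate tetrachoric correlation is about .81, which is consistent with the statement that the tetrachoric coefficient is usually the larger of the two.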

