In this paper, it is suggested that many of the difficulties associated with the use of open-ended and investigative tasks in ‘high-stakes’ assessments of mathematical achievement arise from an over-emphasis on interpreting these assessments in terms of an individual’s past, present or future capabilities (perlocutionary speech acts). As an alternative, it is proposed that high-stakes assessments of open-ended and investigative work in mathematics be regarded as illocutionary speech acts, which inaugurate individuals into communities of practice.
In the 1955 William James lectures, J. L. Austin discussed two different kinds of ‘speech acts’: illocutionary and perlocutionary (Austin, 1962). Perlocutionary speech acts are speech acts about what has been, is or will be. In contrast, illocutionary speech acts are performative (Butler, 1997); in other words, by their mere utterance they bring into being what John Searle calls social facts (Searle, 1995). For example, the verdict of a jury in a trial is an illocutionary speech act: by its utterance, it does what it says, since the defendant becomes innocent or guilty simply by virtue of the announcement of the verdict. The jury’s announcement creates a social fact (in this case, the guilt or innocence of the defendant). Once a jury has declared someone guilty, they are guilty, whether or not they really committed the act of which they are accused, until that verdict is set aside by another (illocutionary) speech act. What the judge says about the convict’s crime, however, is perlocutionary, since it is a speech act about the crime.
Another example of an illocutionary speech act is the wedding ceremony, where the speech act of one person (the person conducting the ceremony saying "I now pronounce you husband and wife") brings into being the social fact of the marriage.
In my view, a great deal of the confusion that currently surrounds educational assessment arises from a failure to distinguish these two kinds of speech acts. Put simply, most educational assessments are treated as if they were perlocutionary speech acts, whereas they are more properly regarded as illocutionary speech acts.
The validity of educational assessments
In the predominant view of educational assessment it is assumed that the individual to be assessed has a well-defined amount of knowledge, expertise or ability, and the purpose of the assessment task is to elicit evidence regarding the level of this knowledge, expertise or ability (Wiley & Haertel, 1996). This evidence must then be interpreted so that inferences about the underlying knowledge, expertise or ability can be made. The crucial relationship is therefore between the task outcome (typically the observed behaviour) and the inferences that are made on the basis of the task outcome. Validity is therefore not a property of tests, nor even of test outcomes, but a property of the inferences made on the basis of these outcomes. As Cronbach noted over forty years ago, "One does not validate a test, but only a principle for making inferences" (Cronbach & Meehl, 1955 p297).
Within this view, the use of assessment results is perlocutionary, because the inferences made from assessment outcomes are statements about the student. Inferences within the domain assessed (Wiliam, 1996a) can be classified broadly as relating to achievement or aptitude (Snow, 1980). Inferences about achievement are simply statements about what has been achieved by the student, while inferences about aptitudes make claims about the student’s skills or abilities. Other possible inferences relate to what the student will be able to do, and are often described as issues of predictive or concurrent validity (Anastasi, 1982 p145).
More recently, it has become more generally accepted that it is also important to consider the consequences of the use of assessments as well as the validity of inferences based on assessment outcomes. Some authors have argued that a concern with consequences, while important, goes beyond the concerns of validity; George Madaus, for example, uses the term impact (Madaus, 1988). Others, notably Samuel Messick in his seminal 100,000 word chapter in the third edition of Educational Measurement, have argued that consideration of the consequences of the use of assessment results is central to validity argument. In his view, "Test validation is a process of inquiry into the adequacy and appropriateness of interpretations and actions based on test scores" (Messick, 1989 p31).
Messick argues that this complex view of validity argument can be regarded as the result of crossing the basis of the assessment (evidential versus consequential) with the function of the assessment (interpretation versus use), as shown in figure 1.
The upper row of Messick’s table relates to traditional conceptions of validity, while the lower row relates to the consequences of assessment use. One of the consequences of the interpretations made of assessment outcomes is that those aspects of the domain that are assessed come to be seen as more important than those not assessed, resulting in implications for the values associated with the domain. For example, if open-ended and investigative work in mathematics is not formally assessed, this is often interpreted as an implicit statement that such aspects of mathematics are less important than those that are assessed. One of the social consequences of the use of such limited assessments is that teachers then place less emphasis on (or ignore completely) those aspects of the domain that are not assessed.
The incorporation of open-ended and investigative work into ‘high-stakes’ assessments such as school-leaving and university entrance examinations can be justified in each of the facets of validity argument identified by Messick.
fidelity: can we be sure that all the assessment evidence elicited by the task is actually ‘captured’ in some sense, either by being recorded in a permanent form, or by being observed by the individual making the assessment?
interpretation: can we be sure that the captured evidence is interpreted appropriately?
The other major threat to reliability arises from difficulties in interpretation. There is considerable evidence that different raters will often grade a piece of open-ended work differently, although, as Robert Linn has shown, this is in general a smaller source of unreliability than task variability.
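The comparison between rater variability and task variability can be made concrete with a small sketch. The scores below are invented purely for illustration; they are not data from any study cited here, and the function names are my own. The sketch simply compares the average variance between raters marking the same piece of work with the average variance between tasks for the same student. It is not a full generalizability analysis.

```python
from statistics import mean, pvariance

# Hypothetical scores, invented for illustration only.
# scores[student] is a list of tasks; each task is a list of marks, one per rater.
scores = {
    "student_a": [[6, 7], [3, 3], [8, 7]],
    "student_b": [[4, 4], [7, 6], [2, 3]],
    "student_c": [[5, 5], [5, 6], [9, 8]],
}


def rater_spread(scores):
    """Average variance between raters marking the same piece of work."""
    return mean(
        pvariance(marks)
        for tasks in scores.values()
        for marks in tasks
    )


def task_spread(scores):
    """Average variance between tasks for one student, after averaging over raters."""
    return mean(
        pvariance([mean(marks) for marks in tasks])
        for tasks in scores.values()
    )


print("between raters:", round(rater_spread(scores), 3))
print("between tasks: ", round(task_spread(scores), 3))
```

With these invented figures the raters rarely differ by more than a mark, while the same student's results move considerably from one task to the next, which is the pattern described above.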
Much effort has been expended in trying to reduce this variability amongst raters by the use of more and more detailed task specifications and scoring rubrics. I have argued elsewhere (Wiliam, 1994a) that these strategies are counterproductive. Specifying the task in detail removes from the student the need to define what, exactly, is to be attempted, thus rendering the task more like an exercise, or, at best, a problem (Reitman, 1965). The original impetus for open-ended work, that the student should have a role in what counts as a resolution of the task, is negated.
Similarly, developing more precise scoring rubrics does reduce the variability between raters, but only at the expense of restricting what is to count as an acceptable resolution of the task. If the students are given details of the scoring rubric, then their open-ended task is reduced to a straightforward exercise, and if they are not, they have to work out what it is the teacher wants. In other words they are playing a game of ‘guess what’s in teacher’s head’, again negating the original purpose of the open-ended task. Empirical demonstration of these assertions can be found by visiting almost any English school where lessons relating to the statutory ‘coursework’ tasks are taking place (Hewitt, 1992; Wiliam, 1993).
These difficulties are inevitable as long as the assessments are required to perform a perlocutionary function, making warrantable statements about the student’s previous performance, current state, or future capabilities. Attempts to ‘reverse engineer’ assessment results in order to make claims about what the individual can do have always failed, not least because of the effects of compensation between aspects of the assessments.
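The effect of compensation can be illustrated with a short sketch. Everything in it is hypothetical: the three aspect names, the maximum mark of 10 per aspect, and the two student profiles are assumptions introduced only for the example. It shows that two quite different profiles can yield the same total, and counts how many distinct profiles are consistent with that total.

```python
from itertools import product

MAX_MARK = 10  # hypothetical maximum mark for each aspect
ASPECTS = ("strategy", "communication", "generalisation")  # hypothetical aspect names

# Two invented students with very different strengths.
student_a = {"strategy": 8, "communication": 2, "generalisation": 5}
student_b = {"strategy": 3, "communication": 7, "generalisation": 5}


def total(profile):
    """Aggregate mark: strength in one aspect compensates for weakness in another."""
    return sum(profile[aspect] for aspect in ASPECTS)


def profiles_with_total(target):
    """All possible mark profiles (one mark per aspect) that add up to the target."""
    return [
        marks
        for marks in product(range(MAX_MARK + 1), repeat=len(ASPECTS))
        if sum(marks) == target
    ]


consistent = profiles_with_total(total(student_a))
print(total(student_a), total(student_b), len(consistent))
```

Under these assumptions both students receive the same total, and that total is compatible with many other profiles as well, so the aggregate alone supports no firm claim about what a particular individual can do on any one aspect.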
However, many of the difficulties raised above diminish considerably if the assessments are regarded as serving an illocutionary function. To see what this would entail, it is instructive to consider what might be regarded as one of the most prestigious of all educational assessments: the PhD.
Assessments as illocutionary speech acts
In most countries, the PhD is awarded for a ‘contribution to original knowledge’, and is awarded as a result of an examination of a thesis, usually involving an oral examination. Although the award is technically made by an institution, the decision to award a PhD is generally made on the recommendation of examiners. In some countries, this can be the judgement of a single examiner, while in others it will be the majority recommendation of a panel of as many as six. The important point for our purposes is that in effect the degree is awarded as the result of a speech act of a single person (i.e. the examiner where there is just one, or the chair of the panel where there is more than one). The perlocutionary content of this speech act is negligible, because, if we are told that someone has a PhD, there are very few inferences that are warranted. In other words, when we ask "What is it that we know about what this person has/can/will do now that we know they have a PhD?" the answer is "Almost nothing" simply because PhD theses are so varied. Instead, the award of a PhD is better thought of not as an assessment of aptitude or achievement, or even as a predictor of future capabilities, but rather as an illocutionary speech act that inaugurates an individual’s entry into a community of practice.
The notion of a community of practice is an extension of the notion of a speech community from sociolinguistics, and has been used by authors such as Jean Lave to describe a community that, to a greater or lesser extent, ‘does things the same way’ (Lave & Wenger, 1991). New members begin as peripheral participants in the community of practice, and over a period of time, by absorbing the values and norms of the community, move towards full participation.
Attempts to make sense of the assessment of open-ended tasks such as PhDs, and, more prosaically, mathematics portfolios, in terms of the traditional notions of norm-referenced and criterion-referenced assessments have been unsuccessful (Wiliam, 1994a). There is no well-defined norm group, and even if there were, there would be no way of ensuring that the norm group represented the range of all possibilities for a PhD. There are also no criteria, apart from the occasional set of ‘guidelines’ which are never framed precisely enough to ensure that they are interpreted similarly by different raters. Consistency in the assessment of PhDs, to the extent that it exists at all (and this, of course, is debatable), is not in any sense objective. There is no explicit reference to a norm group, nor is the judgement based on reference to a set of criteria. It might be argued that many universities require that a PhD represents "a contribution to original knowledge, either by the discovery of new facts or by the exercise of independent critical power", but this is far too imprecise to be regarded as a criterion, and in any case, the criterion is never interpreted literally. For example, the number of characters and words in this paper is not known to anyone at present, so a simple count of these would be ‘new facts’, but it is certain that this would not be awarded a PhD in any university. PhD assessments are therefore neither norm- nor criterion-referenced. Instead, any consistency in the judgement of PhD examiners exists by virtue of a shared construct within the community of practice. For this reason, I have termed these construct-referenced assessments. The judgements are neither objective nor subjective; rather, they are intersubjective, and the evidence is that they can be made dependable, even with relatively new members of the community (Wiliam, 1994b).
John Searle (op. cit.) illustrates this by an interview between a baseball umpire and a journalist who was seeking to establish whether the umpire’s judgements during his career had been objective or subjective:
Umpire: The way I called them was the way they were.
The arguments sketched out above apply equally well to mathematics education. Students’ open-ended and investigative work in mathematics can be assessed in the same way that an apprentice’s ‘work sample’ is assessed.
An apprentice carpenter, nearing the end of her apprenticeship, will be asked to demonstrate her capabilities in making (say) a chair or a chest, and a student nearing the end of a particular phase of their mathematical education could be asked to assemble a portfolio of their work. Decisions about how much time is allowed, how much support is given, and to what extent a mathematical portfolio is required to be the individual’s unaided work will vary from community to community (for an interesting discussion of the extent to which T.S. Eliot’s poem The Waste Land can be attributed to him, rather than regarded as a joint effort with Ezra Pound, see Wineberg, 1997). In some communities it may be felt important to establish an individual’s ability to act alone. In others, it will be far more appropriate to establish the individual’s ability to work with others in arriving at a solution.
While the portfolio will provide some information about the student’s past achievements and future capabilities, this will be limited by the variability in the circumstances under which the portfolio was prepared. However, the portfolio will be capable of indicating the extent to which the individual can be regarded as a member of a community of practice.
These aims do not conflict at all with the aims of certifying students for further stages of education or employment and they are often much more consistent with the demands of industry than the individualistic approaches so favoured in educational systems in Western societies. Indeed, if we take seriously the arguments emerging from work on socially-shared and socially-distributed cognition (for example Resnick, Levine, & Teasley, 1991; Salomon, 1993), we would be less interested in what an individual could achieve on their own, but more interested in what they could achieve as part of a community. If we further accept that it does not make sense to talk of knowledge being ‘inside the individual’s head’, but constituted in the social interactions between individuals, as is increasingly being accepted (Clark, 1997; Hendriks-Jansen, 1997), we would no longer speak of ‘intelligent individuals’ but ‘individuals intelligent in communities of practice’.
In this paper, I have argued that regarding the assessment of open-ended and investigative work in mathematics as illocutionary, rather than perlocutionary, speech acts substantially alleviates many of the problems commonly encountered in the assessment of such work. The score or mark given to a piece of work indicates the extent to which the individual (or the group) has acquired the values and norms of the community of practice, and therefore the extent to which they are full or peripheral participants in that community. Such judgements are neither norm- nor criterion-referenced, but rather construct-referenced, relying for their dependability on the existence of a shared construct of what it means to be a full participant.
Austin, J. L. (1962). How to do things with words. Oxford, UK: Clarendon Press.
Butler, J. (1997). Excitable speech. London, UK: Routledge.
Clark, A. (1997). Being there: putting brain, body and world together again. Cambridge, MA: MIT Press.
Cronbach, L. J. & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52(4), 281-302.
Hendriks-Jansen, H. (1997). Catching ourselves in the act: situated activity, interactive emergence, evolution and human thought. Cambridge, MA: MIT Press.
Hewitt, D. (1992). Train spotters’ paradise. Mathematics Teaching (140), pps 6-8.
Lave, J. & Wenger, E. (1991). Situated learning: legitimate peripheral participation. Cambridge, UK: Cambridge University Press.
Linn, R. L. & Baker, E. L. (1996). Can performance-based student assessment be psychometrically sound? In J. B. Baron & D. P. Wolf (Eds.), Performance-based assessment: challenges and possibilities: 95th yearbook of the National Society for the Study of Education part 1 (pp. 84-103). Chicago, IL: National Society for the Study of Education.
Madaus, G. F. (1988). The influence of testing on the curriculum. In L. N. Tanner (Ed.) Critical issues in curriculum: the 87th yearbook of the National Society for the Study of Education (part 1) (pp. 83-121). Chicago, IL: University of Chicago Press.
Messick, S. (1989). Validity. In R. L. Linn (Ed.) Educational measurement (pp. 13-103). Washington, DC: American Council on Education/Macmillan.
Reitman, W. R. (1965). Cognition and thought: an information processing approach. New York, NY: Wiley.
Resnick, L. B.; Levine, J. M. & Teasley, S. D. (1991). Perspectives on socially shared cognition. Washington, DC: American Psychological Association.
Salomon, G. (Ed.) (1993). Distributed cognitions: psychological and educational considerations. Cambridge, UK: Cambridge University Press.
Searle, J. R. (1995). The construction of social reality. London, UK: Allen Lane, The Penguin Press.
Shavelson, R. J.; Baxter, G. P. & Pine, J. (1992). Performance assessments: political rhetoric and measurement reality. Educational Researcher, 21(4), pps 22-27.
Snow, R. E. (1980). Aptitude and achievement. In W. B. Schrader (Ed.) New directions for testing and measurement: measuring achievement, progress over a decade: no 5 (pp. 39-59). San Francisco, CA: Jossey-Bass.
Wiley, D. E. & Haertel, E. H. (1996). Extended assessment tasks: purposes, definitions, scoring and accuracy. In M. B. Kane & R. Mitchell (Eds.), Implementing performance assessment: promises, problems and challenges (pp. 61-89). Mahwah, NJ: Lawrence Erlbaum Associates.
Wiliam, D. (1992). Some technical issues in assessment: a user’s guide. British Journal for Curriculum and Assessment, 2(3), pps 11-20.
Wiliam, D. (1993). Paradise postponed? Mathematics Teaching (144), pps 20-23.
Wiliam, D. (1994a). Assessing authentic tasks: alternatives to mark-schemes. Nordic Studies in Mathematics Education, 2(1), pps 48-68.
Wiliam, D. (1994b). Reconceptualising validity, dependability and reliability for national curriculum assessment. In D. Hutchison & I. Schagen (Eds.), How reliable is national curriculum assessment? (pp. 11-34). Slough, UK: National Foundation for Educational Research.
Wiliam, D. (1996a). National curriculum assessments and programmes of study: validity and impact. British Educational Research Journal, 22(1), pps 129-141.
Wiliam, D. (1996b). Standards in examinations: a matter of trust? The Curriculum Journal, 7(3), pps 293-306.
Wineberg, S. (1997). T.S. Eliot, collaboration, and the quandaries of assessment in a rapidly changing world. Phi Delta Kappan, 79(1), pps 59-65.