The Teacher's Most Difficult Job

CONTENTS
RATIONALE
COMPETENCIES
INFORMATION SECTION
1.1 Need for Different Forms of Evaluation
1.2 List Different Forms of Evaluation
1.3 Balanced As Well As Broad
2.1 Characteristics of Written Tests--Validity
2.2 Characteristics of Written Tests--Reliability
2.3 Relationship of Validity and Reliability
2.4 Fairness
3.1 Essay Tests
3.2 Short Answer Tests
3.3 True-False Tests
3.4 Matching Tests
3.5 Multiple Choice Tests
3.6 Factors Affecting the Total Test
4.1 Evaluating Student Laboratory Work
4.2 Project Evaluation
4.3 Experiments/Exercises
4.4 Role Playing Activities
5.1 Final Grades
Why Go To So Much Trouble?--A Brief Summary
There are three primary components of the education process: designing (goal setting and identifying content), presentation (teaching), and assessment (testing and evaluating). This monograph deals with the third component--assessment. Educational experiences, including courses, projects, assignments, and activities, are all selected or designed on the basis of goals and objectives. The degree to which the goals and objectives are met forms the basis for evaluation and, in many ways, dictates the methods of evaluation. Programs, teachers, facilities, administrators, courses, students, support services, and many other aspects of education must all be evaluated, but the topic of this particular publication is assessment of student achievement.
Evaluating students, especially that phase of evaluation which requires assigning grades, is the toughest part of a teacher's job. Fledgling teachers should not feel that the agony they experience the first few times they assign grades will go away as they gain experience. Indeed, as teachers gain experience they generally find evaluation to be a tougher job--requiring their utmost patience and concern. No part of a teacher's job is more important or demanding than evaluation, but the helpful tips and guidelines presented in this monograph will help teachers avoid some serious pitfalls.
In addition to being both important and difficult, evaluation of student achievement is also somewhat value laden. Every educator assumes a personal set of beliefs, hunches, and values which determine, to a large degree, what forms of evaluation will be used, the standards to be set, and how the evaluating criteria are to be applied. The reader should be aware that the author of this monograph has his own biases and "bag of tricks" which have served him well. Every attempt has been made to ensure that a well-balanced array of professionally accepted practices is presented, but there will certainly be those educators (with and without experience) who will disagree with some portions of this publication. Professional differences of opinion are certainly allowed and--as long as evaluation is systematic, frequent, broad in scope, and applied fairly--there is plenty of room for variations in technique. Large amounts of time are required to do evaluation correctly, so it is important to do the best possible job with the time we invest.
TASK 1.0 Explain the need for and importance of balanced and varied evaluation techniques.
PERFORMANCE OBJECTIVES:
1.1 After reading this monograph, you will be able to explain the need for different forms of evaluation in Technology Education courses.
1.2 After studying this package, you will be able to list several forms of evaluation appropriate for use in Technology Education courses.
1.3 After reading this monograph, you will be able to explain the importance of balance in evaluation of students.
TASK 2.0 List and explain the most important characteristics of good tests.
PERFORMANCE OBJECTIVES:
2.1 After studying this publication, you will be able to intelligently discuss face validity as it applies to teacher-made tests.
2.2 After studying this monograph, you will be able to explain reliability in educational measurement.
2.3 After reviewing this publication, you will be able to discuss the relationship between validity and reliability in testing.
2.4 After reading this monograph, you will be able to discuss other factors which contribute to the general fairness of a teacher-made test.
TASK 3.0 Write acceptable test items of several types.
PERFORMANCE OBJECTIVES:
3.1 With the help of this booklet, you will be able to write and correctly score essay test items.
3.2 With the help of this monograph, you will be able to prepare well designed short answer test items for teacher-made tests.
3.3 With the help of this monograph, you will be able to write improved true-false test items.
3.4 Using this monograph as an aid, you will be able to prepare matching test items and cite appropriate uses for them.
3.5 Following the guidelines in this monograph, you will be able to improve the quality of the multiple choice test items which you write for your teacher-made tests.
3.6 After studying this monograph, you will be able to list and explain several factors affecting the quality of the whole test.
TASK 4.0 Systematically evaluate student laboratory work.
PERFORMANCE OBJECTIVES:
4.1 After reading this publication, you will be able to discuss three major types of student lab work which need to be evaluated in Technology Education classes.
4.2 With the help of this monograph, you will be able to develop student self-evaluation forms for use in evaluating projects in your classes.
4.3 After reading this monograph, you will be able to properly evaluate student laboratory experiments and exercises.
4.4 After studying this publication, you will be able to establish improved procedures for evaluating role playing activities in your laboratory.
TASK 5.0 Explain how to fairly and objectively assign grades to students.
PERFORMANCE OBJECTIVES:
5.1 With the help of this monograph, you will be able to fairly and objectively assign term and course grades to students.
5.2 Using this monograph as a guide, you will be able to plan your evaluation system so that it assesses many aspects of student behavior.
1.1 Need for Different Forms of Evaluation
One of the most important facts that all teachers learn very early in their pre-service education is that each student is an individual. The importance of individual differences and their impact on the education process is expounded even further in professional, psychology, methods, and measurement courses. This emphasis and overlap were not caused by coincidence, lack of planning, or attempts to "beat a dead horse". The overlap was planned and is necessary because this is probably the one fact which all of our research on education has come close to proving: each person is a unique individual, and individuals learn in different ways and at different paces. This is the reason why assessment of student achievement must be broad in scope. Just as each student perceives and assimilates information differently, so each student is best able to demonstrate the attainment of knowledge in ways unique to himself or herself.
Assume that you have
taught a unit on identification and characteristics of woods. Bob
made an easy A on your written test about the characteristics of
woods while Laura scored a disappointing 40%. However, when they were
selecting wood to use in making water skis, Bob chose cherry for
appearance's sake. Laura used mahogany and ash laminated together. When
you asked Laura about her choice, she pointed out the straight grain
patterns and the excellent gluing and water-resistant qualities of
the woods. Which child has really achieved your instructional goals
in this area? Both students had knowledge (perhaps even the same
knowledge), but each was able to display the knowledge in a different
way. Technology Education is a special learning environment which
emphasizes learning in all three domains (affective, cognitive, and
psychomotor) more than most subjects and, since we teach in all three
domains, we must also test in all three domains!
1.2 List Different Forms of Evaluation
Technology teachers should be able to evaluate the work of students by many methods and in various settings--some traditional and some unique to our field. Some examples of ways to evaluate student achievement and growth are:
Written tests,
Written quizzes,
Papers,
Oral tests and quizzes,
Document application of knowledge in activities,
Portfolio review,
Review questions,
Notebooks,
Performance tests,
Observe student lab work,
Evaluate quality of products,
Evaluate quality of exercises,
Safety performance tests,
Monitor group activities,
Indirect questions to determine change of opinion,
Document evidence of attitudinal (value) changes,
Student self-evaluation of projects, and
Document leadership/followership qualities.
Surely, there are
still other methods which can be used to evaluate Technology
Education students in specific situations. The list above is not
considered to be exhaustive, but it should help to point out the
nature of broad evaluation techniques.
1.3 Balanced As Well As Broad
In the example given in section 1.1, if the teacher tested knowledge of the characteristics of woods only on the written test, Laura's knowledge would not have been discovered at all. Likewise, a teacher who did not use any written tests would not have observed Bob's knowledge. Between these two extremes, there are many possible ways to combine test and non-test assessment techniques. This is an area which will require professional judgement on the teacher's part. The simplistically obvious solution would be to make tests and written work count for fifty percent of the grade and let projects and lab work comprise the remaining fifty percent. Some educators disagree with this technique because they feel that it results in an academically "watered down" course and reduced standards. Others would argue that "doing" is far more important than "book learning" and would contend that tests do not show much. The advice given here is to strive for balance and use assessment techniques which truly reflect what is learned and done in your course. An evaluation schema that gives relatively equal emphasis to both academic learning and performance is generally desirable, but this is so only if the techniques used to assess knowledge and practical ability are valid, reliable, and fair. Much of the remainder of this monograph will deal with the development of measurement techniques which meet these criteria.
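As a minimal sketch of the fifty-fifty weighting mentioned above, the following Python function combines a written-work percentage and a lab-work percentage into one course percentage. The function name, the default weight, and the lab score used for Laura are assumptions made only for illustration; Laura's 40% written score comes from the example in section 1.1.

```python
def course_grade(written_pct, lab_pct, written_weight=0.5):
    """Combine written and lab percentages, giving written work the stated weight."""
    if not 0.0 <= written_weight <= 1.0:
        raise ValueError("written_weight must be between 0 and 1")
    return written_pct * written_weight + lab_pct * (1.0 - written_weight)


# Laura's written score (40%) is from section 1.1; her lab score (90%) is hypothetical.
laura_equal = course_grade(40, 90)
laura_tests_only = course_grade(40, 90, written_weight=1.0)
print(f"Equal weighting: {laura_equal:.1f}%")
print(f"Written tests only: {laura_tests_only:.1f}%")
```

Changing `written_weight` shows how strongly the choice of balance affects a student whose knowledge appears mainly in lab work. The weight itself remains a matter of the teacher's professional judgement.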
The time spent in evaluation is an important commodity. If administering, grading, and reviewing a test detracts from time which could be used for additional lectures or laboratory work, it might seem that students lose educational opportunity by taking tests. Some research findings, however, contradict this assumption--it has been demonstrated that students learn more when they are tested and that the act of taking the test is a beneficial learning activity. The mere threat of an upcoming test was only marginally effective, but the actual experience of the test was quite effective. This was true for both in-class and take-home tests. So, time spent in testing is neither lost nor wasted--it is a valuable aid to learning and serves an important evaluation function too.
2.1 Characteristics of Written Tests--Validity
There are several types of paper and pencil tests, each with unique characteristics and limitations which make it more or less useful in given situations. There are three characteristics, however, which are essential in all types of tests--whether they are written or not. These characteristics will be discussed under the headings validity, reliability, and fairness.
The first characteristic of a good test is validity. There are several types of validity which can be discussed in relation to educational and psychological measures, but the most important type of validity needed by teacher-made tests is "face validity". Face validity simply means that the test actually does test what it is supposed to test. This may seem silly and redundant, but it is very important. A test covering content about screen process printing which is written in English may be very valid in America, but this same test would cease to be valid if it were administered in China. In the latter case, the test would test the students' command of our language more than it would their knowledge about screen printing. This example is extreme, but it illustrates an important concept. The all-too-common "trick question", which requires students who know the subject matter to out-guess the teacher's attempt to be witty or cute, is the worst enemy of face validity. If you wish to write good tests, make every possible effort to avoid obscuring the point of a question with trickery and riddles.
Consider this item:
Which of the following parts of a car is held in the hand most often?
a. clutch
b. gearshift lever
c. light switch
d. stearing wheel
e. gas cap
Most students would
choose d, but that is wrong because it is misspelled! Have you
seen this sort of trick? Was the intent of the question to determine
how well students could identify the uses of parts of the car or how
well students could spell? Such trickery destroys face validity.
2.2 Characteristics of Written Tests--Reliability
The second characteristic of a good test, reliability, is related to the first. Reliability means, simply stated, that the test tests the same thing each time it is administered--that its meaning and measuring ability do not change. We could call validity "truthfulness" and reliability "consistency". A test may be said to be consistent (reliable) if it gives relatively the same results each time it is used with similar groups of people. It is fairly easy to establish the reliability of a large standardized test. A simplified method of determining the reliability of a test would be to administer it to a group of students, wait about two days, and then re-administer the same test to the same group. After scoring both tests and plotting the scores on a graph, you would likely find that the scores of the second testing would be a few points higher than those of the first testing; but, more importantly, you should find that the same general ranking of scores occurred in both testings. That is, the students who scored high on the original testing should also score high on the retest. A test yielding these results would be said to have some degree of reliability--it would be consistent. If, on the other hand, the ranks reversed (with poor students receiving the high scores in one of the two testings), then the test would not be very reliable. The reliability of teacher-made tests depends mostly on two factors: clarity of items and test length.
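The test-retest check described above compares the ranking of scores across the two testings. One common way to put a number on that comparison is a rank correlation (Spearman's coefficient), which is a technique brought in here for illustration and is not named in this monograph. The sketch below assumes two lists of scores in the same student order; the scores themselves are hypothetical. A value near +1 means the rankings agree, and a value near -1 means the ranks reversed.

```python
import math


def average_ranks(scores):
    """Rank scores from 1 (lowest) upward; tied scores share their average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j across a run of tied scores.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def rank_correlation(first, second):
    """Spearman rank correlation: the Pearson correlation of the two rank lists."""
    rx, ry = average_ranks(first), average_ranks(second)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("all scores are tied, so no ranking can be compared")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical scores for six students, listed in the same order each time.
first_testing = [55, 62, 70, 78, 85, 93]
retest = [58, 66, 73, 80, 90, 95]          # a few points higher, same ranking
reversed_retest = [95, 90, 80, 73, 66, 58]  # ranks reversed

print(round(rank_correlation(first_testing, retest), 2))
print(round(rank_correlation(first_testing, reversed_retest), 2))
```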
A test is made up of small parts called items (sometimes the term "questions" is inappropriately substituted). Just as a chain derives its quality from each link, so a test gains reliability (and validity) from each item. The most probable cause of low reliability in teacher-made tests is poorly worded items. Items which are unclear, confusing, and ambiguous can damage the consistency of a test very quickly, and it does not take many weak items to utterly destroy the reliability of the whole test. True-false items are particularly vulnerable to ambiguity and they frequently give the brightest students the most trouble. Avoid using words with vague meanings, double negatives, and complicated sentence structures.
Refer back to the example item about the car parts (section 2.1). If students caught the teacher's attempt to be sneaky, and avoided response d, then they would have to choose another answer based on their interpretation of the ambiguous phrase "held in the hand most often". Does that mean "grasped by the hand most frequently" or does it mean "picked up and enclosed in the hand most often?" Does "in the hand" eliminate the possibility of using both hands? This item is worded very poorly and its lack of clarity would certainly harm the reliability of the test.
The second factor
affecting reliability is test length. Tests with two or more items on
each topic/concept/fact are more reliable than tests with only one
item per bit of information. In fact, until the test becomes so long
that it creates serious fatigue problems, the longer the test is, the
more reliable it will be. It is easy to see how this works. Imagine
that you wished to test five concepts. You could make up a five item
test, a ten item test, or perhaps even a forty item test with
concepts duplicated by different types of items. Suppose also that,
regardless of which length test you use, there is one bad item which
discriminates reversely (the most capable students answer incorrectly
but the weak students get it right). A student who truly knows the
material, and who should normally get a score of 100%, but who
(wrongfully) missed this one ambiguous item would score as follows
depending on total test length:
5 item test 80%
10 item test 90%
40 item test 97.5%
Which score most
nearly estimates this student's true ability? The longer test would
have much greater reliability. When possible, try to have at least
two items per information bit.
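The arithmetic behind these three scores can be sketched in a few lines of Python. The function name is hypothetical; the calculation is simply the percentage of items answered correctly when one item is missed.

```python
def score_with_missed_items(total_items, missed=1):
    """Percentage score for a student who misses `missed` of `total_items` items."""
    return 100 * (total_items - missed) / total_items


for length in (5, 10, 40):
    print(f"{length} item test: {score_with_missed_items(length)}%")
```

The single bad item costs twenty percentage points on the shortest test but only two and a half on the longest, which is why the longer test gives the closer estimate of this student's true ability.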
2.3 Relationship of Validity and Reliability
A test that is not
reliable cannot be valid, but the converse of this statement is not
true. Alternately stated, a test must be reliable (consistent) in
order to be valid (to truthfully test the right thing) but a test
could easily test the wrong thing entirely (be quite invalid) and do
so very accurately and consistently (repeatedly, reliably). The
scores derived in test-retest situations using the printing test in
China (see section 2.1) would show remarkable reliability because the
rank orders of the scores would change very little, but it would
never be valid because it would always test knowledge of English
rather than knowledge of screen printing. Thus, a test must be
reliable to be valid, but it could be reliable without being
valid.
2.4 Fairness
Fairness is the third characteristic of a good test. Fairness includes some aspects of reliability and validity (as they would be discussed in a psychological measurement text) and other factors such as clarity of items, quality of the testing environment, and unbiased scoring. We have all felt, at one time or another, that a test was unfair because it was given after a week off, or when schedules were mixed up and time was cut short, or in a hot room, or with unclear instructions, or ...
Fairness is a
difficult characteristic to deal with because it actually is the
resulting effect of many small factors. Anything that reduces
reliability or validity will certainly detract from the fairness of a
test. Additionally, the following factors affect test fairness:
Time proximity to material presentation,
Temperature,
Lighting,
Length of test (fatigue factor),
Time available,
Objectivity of scoring,
Noise and disturbances,
Clarity of instructions,
Appropriateness of types of items used,
Quality of test reproduction, and
Amount of material covered on one test.
Teachers may increase
the fairness of their tests the most by applying a variation of the
"golden rule" to testing: Test as you would like to be tested
yourself. Consideration for students' needs, comfort, and even their
feelings combined with efforts to make a valid, reliable instrument
and to score it objectively and without bias will usually result in
fair testing. Educational psychology and measurement books list other
characteristics of tests and several subtle variations of validity
and reliability, but the three characteristics illustrated here are
sufficient for teachers to apply in practical situations.
3.1 Essay Tests
There are several types of written tests from which the teacher may choose. Each type of test has its own unique capabilities and limitations which make it more or less useful in given situations. Section 3 discusses these different types of tests beginning with essay tests.
Essay or discussion tests are truly free response tests. They are the easiest tests to write, but the most difficult to grade. They are the least objective of all tests but they do allow each student to tell what he or she knows. Their biggest advantage is that there are very few cues to jog students' memories, and they provide opportunities for students to write--thus reinforcing writing skills. They are difficult from the students' perspective because of the magnitude of the task. The main weakness is that, because scoring is difficult and typically subjective, students usually respond with their best actual answer and then "shoot the bull" a bit in an effort to grope for a few more points. The most unfortunate thing is that they generally get those points! Even a student who knows almost nothing about the subject can usually beat around the bush enough to get ten to forty percent of the points. The author has even seen papers on which students had freely admitted that they did not know the answer to the question provided but they did know xxx..., and they were awarded partial credit for this absolutely inappropriate response.
Teachers choosing to use essay items should plan for scoring while writing the items and assign a definite meaningful point value to each item--not just 10 or 20 points for the sake of tradition or to make the numbers total 100. This is done by making an outline of the desired perfect response while writing the item and then counting the number of elements required in the response. Assign points to these elements, usually one point each but sometimes differentially if some elements have far greater importance than others. Include additional points for order if a sequence is important (as in operating a machine). Indicate the point value of each item on the test so that students will know what is required of them.
When scoring the test, mark all students' responses to item one, then shuffle the papers and mark all students' responses to item two, then shuffle again, etc. Likewise, avoid looking at the students' names and try to ignore their handwriting. These techniques will aid objectivity by masking students' identities and by limiting the tendency of the scorer to assume that a student who performed well or poorly on one item will do the same on subsequent ones. To score the item response, simply scan the answer, looking quickly for the main words/phrases which appear on the key outline. As each element is found, mark a number behind it starting with one and following in order to the end of the answer. The last number you write will be the point value of the response. Do not give credit for any elements that were not on your key, regardless of how plausible they seem. This means that you must be very careful and complete when you write your perfect answer outlines for the key. If you did give credit for something that was not on your key (because one of your best students used it in place of a required element), then you would be ruining the little bit of objectivity that this scoring method has. To do so fairly would require two things: 1) that you would go back and re-score all papers with this new element counted correct and 2) that you could say with clear conscience and truthfulness that the new element is so obviously correct (more correct indeed than the one you originally intended!) that you would have made the same effort to re-score all the papers if the only student who had used the new element had been the weakest student in the class. Since very, very few student-discovered novel answers of this sort will be encountered in one's career--if the teacher is doing a good job in preparing the key, that is--the best advice is not to accept answers that are not on your key.
Here are an example item, the key, and a student's response which has been properly scored:
(5 points) 3. Briefly explain how to score an essay item.
Key For Item 3:
Outline key while writing items.
Score item 1 for all, then 2, etc.
Avoid names.
Mark points in sequence.
Do not credit any bull.
Student's Response:
Make an outline of the answer you want when you write the test.(1) The outline should be brief and easy to use but no one but you will see it. Grade everyone's question number one first and then do all question number two(2) without looking at names.(3) Don't give credit for extra garbage.(4) Check spelling and punctuation for each item and take points away for incomplete sentences.
The last element that
was marked was number 4 so the student's score is 4 points. If care
is taken to overcome the lack of objectivity in scoring, discussion
items can measure some things which are very difficult to test
otherwise.
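The counting step of this scoring method can be sketched in Python. This is only a rough illustration: it replaces the teacher's reading of the answer with simple phrase matching, and the cue phrases attached to each key element are assumptions chosen to fit the sample response above, not part of the method itself.

```python
# Key for item 3. Each key element is paired with hypothetical cue phrases
# that a scorer might scan for; the phrases are illustrative assumptions.
KEY_ITEM_3 = {
    "Outline key while writing items": ["outline"],
    "Score item 1 for all, then 2, etc.": ["number one first"],
    "Avoid names": ["names"],
    "Mark points in sequence": ["in sequence", "in order"],
    "Do not credit any bull": ["extra garbage", "bull"],
}


def score_essay_response(response, key):
    """Return (points, elements found), giving one point per key element present."""
    text = response.lower()
    found = [
        element
        for element, cues in key.items()
        if any(cue in text for cue in cues)
    ]
    return len(found), found


student_response = (
    "Make an outline of the answer you want when you write the test. "
    "The outline should be brief and easy to use but no one but you will see it. "
    "Grade everyone's question number one first and then do all question number "
    "two without looking at names. Don't give credit for extra garbage. "
    "Check spelling and punctuation for each item and take points away for "
    "incomplete sentences."
)

points, elements = score_essay_response(student_response, KEY_ITEM_3)
print(points, "of", len(KEY_ITEM_3), "points")
```

Nothing outside the key earns credit in this sketch, which mirrors the advice above. Phrase matching is far cruder than a teacher's judgement, so it should be read as a picture of the bookkeeping and not as a grading tool.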
3.2 Short Answer Tests
Short answer tests
are similar to essay tests in that they demand recall knowledge
rather than mere recognition. They are, however, easier to score than
discussion items, and they are not as difficult a task for students
who write poorly. Actually, short answer items could include anything
from completion items, in which the student fills in a blank with one
or two words, to actual questions which require mini essays or short
paragraphs as answers. In the latter case, the same general
guidelines apply as did with essay tests. When completion items are
used, it is essential that the item (stem) be worded so carefully
that there could be only one possible answer which would correctly
fill the blank. For example, consider these items:
1. The instrument used to draw horizontal lines is the ________.
2. The ________ is used to guide the pencil when drawing straight horizontal lines.
Which item is better?
Both items would usually elicit the desired response (T-square), but
there could be students who would be confused enough by the first
item to put "pencil" as their answer and, even though it was not the
intended answer, it would be very difficult to draw those horizontal
lines without a pencil! Another thing to avoid in writing completion
items is giving students unintended cues to the correct answer.
3. An ________ ________ is used to check the rough diameter of stock on the lathe.
If students were confused about whether the proper answer should be "micrometer" or "outside caliper", they have two very good cues to the answer which are totally unrelated to the true subject matter being tested: First, the fact that there are two blanks is a giveaway; and second, "an" is not the proper article to use with "micrometer". These foolish errors in test construction are easy to avoid if you will look for them and carefully proofread your tests.
Scoring completion items should be very straightforward. As with any other items, the point value of each item should be indicated for students on the test and a key should be developed very carefully as the test is written. If both of these steps are properly attended to, then scoring completion items may be done by merely reading each student's paper and marking through any answers that are incorrect with a simple horizontal line. After all items are marked, the teacher has only to count the number of lines marked on the paper to see how many points should be taken away from the completion section of the test. While marking the answers, it is a good idea to write the code "NT" in any blanks which the student skipped or items which he or she did not attempt to answer. The code will clearly indicate that the students did "Not Try" to answer those items. This measure is needed to prevent a few dishonest students from filling in the answers in blank spaces after the test is returned and then trying to claim that you made a mistake while scoring the test.
Short answer tests can be very useful tools, and they are relatively easy to make up and grade, but the teacher must be careful not to make them test only low level learning and facts--this is a natural tendency because it is somewhat difficult to word an item on higher levels of learning which can be answered with only a few words. The biggest hindrance to using short answer tests for higher level learning is that the wording of the stem may become so complex (in order to elicit the desired response) that it confuses students and becomes ambiguous.
3.3 True-False Tests
True-false items are the most misused of all types of items. They can be used effectively in certain situations, but they are especially prone to flaws resulting in reduced reliability and validity for the whole test. Even a very good test can be totally destroyed by inserting just a few poorly planned true-false items. The reasons for this are partly obvious and partly obscure. First, it should be noted that it is extremely difficult to write anything (even a very short statement) which is totally true or totally false. When you use true-false items, however, this requirement is either formally pointed out in the instructions or implicitly there because students already know the rules of the game--and true-false items frequently are just that, a game. In order to write these types of pure statements, one must absolutely avoid words like "always" and "never" because there is almost nothing which is always or never the case.
4. The sun always rises in the east. (Well, if it rises, then it will do so in the east; at least it has up until now!)
5. You should never use ungrounded power tools. (What about those new ones with the double insulation systems? They aren't even sold with the old three prong plugs.)
6. You must always wear safety glasses in the lab. (This one is probably all right, but who would miss it? There are, very likely, both some obscure cases in which the item would be false, and better ways to test this knowledge as well.)
If, despite these
warnings, you do decide to use true-false items, here are some
pitfalls to avoid. Make the true-false section of the test the lowest
in point value because it will be the least reliable group of items.
Never "lift" items straight from the textbook. This applies whether
you intended to use the statement as it was written or to change a
few words to make it false. Be very careful to avoid double negatives
and complex sentence structures that will confuse students and mask
the meaning of the statements. Instruct students to respond by making
a clear vertical capital (gothic) letter T for true or F for false.
Small letters t and f are often easily confused and will usually
penalize your better students and lead to unfair testing--this is
precisely why true-false items can bring you more problems than they
are worth. Many teachers mistakenly feel that true-false items are
desirable because they are easy to write and easy to grade, but, if
you think they are easy to write, then you are most likely misusing
them and writing very poor ones as well. The advice here is that
there is usually a better type of item for most situations and,
therefore, true-false items should be used very rarely and very
carefully.
3.4 Matching Tests
Another very popular,
but also dangerous, type of item is matching. Like true-false items,
these can be used effectively in some settings, but they are often
misused. Teachers generally believe that matching items are easy to
write--just a matter of putting a column of stems (questions)
opposite a column of responses (answers). The problem is that there
are few things that occur in large enough groups of similar form to
make matching any more beneficial than completion or multiple choice
items. Let's examine a matching section from a typical metals
technology unit test:
 1. used to light torches        A. riser
 2. center which moves           B. use when cutting threads
 3. pour metal in here           C. vernier caliper
 4. abrasive cloth               D. striker
 5. coated with zinc             E. tin snips
 6. air escapes here             F. sprue
 7. measures round stock         G. live center
 8. oil                          H. tools
 9. sheet metal is cut with      I. galvanized
10. cutting parts for lathe      J. emery
There are many problems with this set of items. The first obvious fault is that, since there are equal numbers of cues and response choices, students will be able to figure out at least one answer each (and probably more) by elimination and pairing up whatever is left. Another problem is that we have mixed apples and oranges--or even more realistically, apples with tables--because there are some answers that could only go with certain cues. The only instance in which this is not a problem is when there are a lot of very similar types of information bits which need to be tested, like a list of events and their dates in history. That would be a good place to use a matching test. Accordingly, items 1, 2, 7, 9, and 10 above (types of tools matched to their uses) could be used effectively together in a matching test. But items 4, 5, and 8 definitely do not belong in this sequence. Item 8 has another problem. It is the only item on this side that has a one word cue so it is a "dead giveaway" that the longest answer will go with it. There are two items about the sprue and the riser on which every student will either make two points or zero points. This is because there is nothing else which could be matched with them. Simply adding the choice "vents" would completely change the difficulty and value of the test.
Matching tests should
be constructed with more response choices than there are item cues.
The items should all be similar and the structure should be uniform
enough that any answer could plausibly be placed with any cue by one
who did not know the material well. The lists of cues and/or
responses should never be very long (ten to fifteen is really about
as long as they should be) so that students do not get bogged down
looking for an answer after they know what it is. The instructions
should clearly point out that response choices are to be indicated by
placing the vertical capital (gothic) letter for the chosen response
in the blank provided--either beside the item's cue number or on a
separate answer sheet. As with true-false tests, if you felt that
matching tests were easy ones to write, you have probably written
some pretty poor ones by now! Matching items can be very good
(sometimes they are the very best items to use), but they can also lead
to traps that could ruin a potentially good test.
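Some of the construction rules above are mechanical enough to check before a test is duplicated. The following is only a minimal sketch in Python, not a complete item-analysis tool: the function name `check_matching_set` and the 15-entry limit (taken from the "ten to fifteen" guideline) are assumptions made for illustration. It flags a set with too few response choices, lists that are too long, and a lone one-word cue of the kind that gave away item 8.

```python
def check_matching_set(cues, responses, max_length=15):
    """Return a list of warnings for a draft matching section.

    max_length is an assumed limit based on the "ten to fifteen" guideline.
    """
    problems = []
    # Equal (or fewer) responses let students finish by elimination.
    if len(responses) <= len(cues):
        problems.append("need more response choices than cues")
    # Long lists bog students down while they hunt for a known answer.
    if len(cues) > max_length or len(responses) > max_length:
        problems.append("lists longer than %d entries" % max_length)
    # A single one-word cue among longer cues is a giveaway.
    one_word = [c for c in cues if len(c.split()) == 1]
    if len(one_word) == 1 and len(cues) > 1:
        problems.append("only one cue is a single word: %r" % one_word[0])
    return problems


cues = [
    "used to light torches", "center which moves", "pour metal in here",
    "abrasive cloth", "coated with zinc", "air escapes here",
    "measures round stock", "oil", "sheet metal is cut with",
    "cutting parts for lathe",
]
responses = [
    "riser", "use when cutting threads", "vernier caliper", "striker",
    "tin snips", "sprue", "live center", "tools", "galvanized", "emery",
]

for warning in check_matching_set(cues, responses):
    print(warning)
```

Run on the metals technology example, the sketch reports the equal list lengths and the one-word cue "oil". It cannot judge whether the items are similar enough to belong together; that still takes the teacher's reading.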
3.5 Multiple Choice Tests
Multiple choice items
are the most commonly used ones in large standardized tests. They
have several advantages in that setting. Multiple choice items may be
written to test at almost any level of the taxonomy of the cognitive
domain. Skillfully worded items can challenge even the most capable
students. Since the answer is always available as a possible
selection with other potential choices (called distractors), these
items require recognition rather than recall of information. These
are very difficult items to write, but the ease with which they may
be scored is quite attractive to many teachers. With new technology
making computerized scoring, record keeping, and grade calculation
available to more and more teachers, it is likely that multiple
choice items will become even more common in the very near future.
Any form of testing which the teacher chooses to use will require a
lot of time in one way or another. Essay tests are quick and easy to
write but laborious to grade; multiple choice tests are the exact
opposite in this regard--they take a great amount of advance planning
and care in their construction but they may be scored in one sitting
with little effort. Despite these big advantages, multiple choice
tests are not without problems of their own. Some of these problems
and ways to avoid them will be discussed by examining some typical
multiple choice items:
1. Which of the following plastics is used to replace glass in windows?
A. nylon
B. casein
C. acrylic
D. A. B. S.
E. oak
This item is not too
bad. It is simply stated and would not confuse students. It would
determine if students recognized that acrylics are used in glazing
applications. We would hope that they knew this because they
understood that acrylics have good clarity (transparent qualities),
but they could have merely memorized that acrylics were used to
replace glass. If the general characteristic (clarity) were the
important information bit, then the stem should have asked which
plastic had that quality. Both types of items would be fine--the
choice would depend on how you taught the original information. If
you did not use the example of glazing in your teaching, then the
item, as stated, would be testing a high level of the cognitive
domain (application) but if you did stress this use of acrylics, then
the item would test at the lowest level of recognition. The only real
problem with this item is that the distractor E would not be selected
by anyone and it therefore makes the item easier. Students who do not
know the answer and who guess have a one in four chance to get this
item correct because they will have eliminated E from the beginning.
Distractors should not be sneaky (tricky), just as in any other type
of item, but they should be plausible enough to attract those who do
not know the material. Here is another example:
2. Sandpaper:
A. is used to smooth wood after it has been cut and it comes in many colors and types of grit on a vinyl backing.
B. can be used wet or dry if the backing and the adhesive are both water resistant.
C. is always made with garnet grit.
D. is really not made with sand and should be cut to size on the squaring shear.
E. should be used before wood is planed in the surfacer.
This item has some
real problems. First, you should always make the stem the long part
of the item and the responses the short part (this applies equally to
matching items) to reduce reading fatigue and confusion for students.
By the time students finish reading all the choices in this item,
they will have forgotten what they were asked and what the first
choices said. Another thing to remember is that the rules of multiple
choice items require the selection of the "best" response. There may
be more than one alternative (choice) which correctly completes the
statement, but the student's task is to select the alternative which
best completes the statement. For this reason, you must be very
careful when writing distractors. Many students fear and despise
multiple choice items because this subtle factor can cause them much
anxiety, especially when teachers use many potentially correct
alternatives. To some students, even very skillfully worded multiple
choice items seem sneaky or tricky.
3. Electric current flows:
A. best at night time
B. faster than a Chevrolet
C. from ground to positive
D. 186,000 mps
E. faster than sound waves
Essentially, B, C, D, and E are all correct choices here. If the teacher has just finished a unit which included the speed of current flow and the subject of direction of current flow has not been studied, then alternative C is a fairly harmless distractor for all except the few highly motivated students who have read ahead or studied other materials and therefore have enough sophistication in the subject to know that this statement has entirely different meanings depending on whether you mean "electron flow" current or "conventional" current flow. The answer sought by the item is D, and that is the closest estimate of the speed of electric current which is quoted in most textbooks, but a bright student who is very well informed will know that this figure is only an estimate and not "exactly" correct, so (realizing that the job is to pick the best of competing correct answers) this student may be drawn to answer incorrectly due to a higher level of sophistication about the subject. Do not think that this is unusual, it happens all the time. Students have even pointed out such inconsistencies on the big standardized tests like the SAT and won their cases. It is difficult to make up distractors that will be of any value without getting into this sort of trap, so be on guard against these defects.
Do not force yourself
to make up multiple choice items where they are not needed. If you
can only think of two good choices for the alternatives, then you
should probably write a true-false item instead of a multiple choice
item with "yes" and "no" as the only alternatives. Likewise, if there
are several things that are very similar that you wish to have
students pair-up, matching items are far better than a series of ten
multiple choice items, all with the same set of alternatives. If you
really wish for students to have the "right" word on the tip of their
tongues, then a short answer item tests that information better
because they will have to recall that right word, not just pick it
out of a group of others. But, when you wish to test students'
recognition of things in the context of other things, or concepts in
applied situations, multiple choice items are the only good
alternative. The advice here is: if you can test it an easier way,
you should because there are many ways to foul up the job with
multiple choice items; but, if there is no simple way to test it, be
careful to make the multiple choice item as difficult as possible
without it being tricky. Teacher-made multiple choice items are
frequently either ridiculously easy (because none of the distractors
distract at all), or they are absolutely unfair because the
distractors have faults or the wording is ambiguous.
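The effect of a distractor that does not distract, such as choice E ("oak") in item 1, can be put in numbers. This is a small sketch only; the function name `guess_probability` is hypothetical, and it assumes that a guessing student discards every implausible choice and picks evenly among the rest.

```python
from fractions import Fraction


def guess_probability(num_choices, implausible=0):
    """Chance of guessing correctly once implausible distractors are discarded."""
    plausible = num_choices - implausible
    if plausible < 1:
        raise ValueError("at least one plausible choice must remain")
    return Fraction(1, plausible)


print(guess_probability(5))                 # five working choices
print(guess_probability(5, implausible=1))  # item 1, with "oak" eliminated
```

With five working choices the guesser succeeds one time in five; with "oak" eliminated the odds improve to one in four, which is the point made about item 1.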
3.6 Factors Affecting the Total Test
Sometimes, students will experience a fear of failure reaction when they take a test and will not be able to exhibit knowledge and skills that they do possess. This is very likely to happen, even to good students, if the first item on a long test is a very difficult one. Most teachers overcome this by placing a few of the easier items at the beginning of the test.
Another problem which can affect the whole test is the arrangement of the items. Items of the same type should be grouped together. A long test may have, for instance, a set of true-false items, followed by a group of multiple-choice items, and then a couple of short essay questions. Such an arrangement is sound because the true-false items are typically easier for students to react to and they can also be done more quickly than the other items on the test. It is unwise to have items of different types interspersed together because it causes extra confusion for students. The author has seen cases in which a student responded "true" to a short answer item which really required a specific word as the answer.
Sometimes, in the course of trying to sample the same concept or bit of information more than once, a test will give a clue or even provide the answer to an item which comes later in the test. Testing the same bit of information with more than one item has already been established as a good practice, so the test maker must carefully proofread the entire test, while making up the key, to look for this sort of problem. Additionally, misspelled words and poor grammar can change meanings and confuse students.
Mechanically, the
test should be very easy to read; quality of reproduction is as
important as any other characteristic of a good test. There should be
some space separating each question from the ones before and after
it. If students are to respond to the items by writing on the actual
test copy, then there must be adequate space for the responses. The
directions should be clear, concise, and simple to follow--just as
with tricky questions, ambiguous instructions confuse and bewilder
students. Remember, a poorly worded item can cause students who know
the material to wrongfully lose a point or two, but poorly worded
directions can result in students missing large sections of the test
regardless of their level of knowledge and skill. Instructions should
make clear what type of response is expected, what each item's point
value is, and the total number of points on the whole test.
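For teachers who keep their items in a computer file, the arrangement advice in this section can be sketched as a simple sort. This is one possible reading, not a prescribed procedure: the item records, the 1-to-5 `difficulty` estimates, and the names `TYPE_ORDER`, `arrange_items`, and `total_points` are all assumptions made for illustration. Items are grouped by type in the order suggested above (true-false, then multiple choice, then essay), with the easier items first inside each group, and the total point value is computed for the directions.

```python
# Assumed ordering of item types, following the example in this section.
TYPE_ORDER = ["true-false", "multiple-choice", "essay"]


def arrange_items(items, type_order=TYPE_ORDER):
    """Group items by type; within each type, put easier items first."""
    return sorted(
        items,
        key=lambda item: (type_order.index(item["type"]), item["difficulty"]),
    )


def total_points(items):
    """Total point value to state in the test directions."""
    return sum(item["points"] for item in items)


# Hypothetical draft test; difficulty is the teacher's own 1 (easy) to 5 (hard) estimate.
draft = [
    {"id": "a", "type": "essay", "difficulty": 4, "points": 10},
    {"id": "b", "type": "true-false", "difficulty": 3, "points": 1},
    {"id": "c", "type": "multiple-choice", "difficulty": 2, "points": 2},
    {"id": "d", "type": "true-false", "difficulty": 1, "points": 1},
]

arranged = arrange_items(draft)
print([item["id"] for item in arranged])
print("Total points:", total_points(arranged))
```

The sort does nothing about clues that one item gives to another; proofreading the whole test while making up the key is still necessary.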
4.1 Evaluating Student Laboratory Work
There are several
types of student laboratory work, but, for the purpose of simplicity,
we will treat three major types (realizing that there may be
individual, team, or even large group versions of all three types).
The three types discussed here will be: 1) projects, 2)
exercises/experiments, and 3) role-playing activities. The first
important point in considering the evaluation of student lab work is
to determine how the lab work fits into the instructional program and
the degree of importance that it has. If, for example, Joan was
studying about welding and she had to perform several exercises to
learn the fundamentals prior to actually applying her knowledge by
welding a couple of joints on her project, what relative importance
would the exercises have? Should they be evaluated at all? Should
they count more than or less than the project grade? If you did not
evaluate them at all, would the students still complete important
learning exercises or would they race ahead to work on their
projects? What grade should a student receive on welding if that
student does a very sloppy job during the learning phase (on the
exercises) but shows very good competence when welding the project?
How do you judge the quality of the exercise and the project? All of
these issues must be dealt with by each teacher in recognition of the
unique learning setting that exists in that teacher's laboratory.
4.2 Project Evaluation
There are some
crucial differences between projects and exercises. Firstly, projects
usually take considerably more time to complete than exercises.
Secondly, projects usually require some degree of planning by the
student (sometimes they even require designing or redesigning).
Lastly, projects are generally complex activities which combine
several processes and/or materials together in an integrated way,
whereas an exercise could be an isolated event dealing with one
process, like the following examples:
1. Make five soldered connections.
2. Typeset "The quick red fox jumps over the lazy brown dog" ten times.
3. Knurl a 3" section on 1" diameter aluminum.
4. Inject mold a small plastic part.
These are all useful learning activities, but none of them involve multiple techniques and none of them make anything nice to take home to show to Mom. The methods used to evaluate projects must take these important features of projects into account--in other words, the finished product is only part of the grade because the process (including planning, following proper procedure, learning value, and other factors) is equally important.
EVALUATION FORM -- ELECTRONIC SIREN
Scale: 5 = Best in class
       4 = Very Good
       3 = Good
       2 = OK
       1 = Poor
       0 = Fail
Rate each of the following factors using this scale. Then briefly explain why each rating is given.
Tea.  Stu.
1. How does the project look? (e.g., finish, labels, arrangement of switches, craftsmanship on case)
2. How well does the project work? (Do all functions operate properly?)
3. Design (Are the printed circuit board and enclosure well designed for their functions?)
4. Planning (Are drawings and plans neat and complete? Is a plan of procedure included?)
5. Learning Value (Was this a valuable learning activity? Did you learn a number of new techniques, processes, or facts by building this project?)
TOTAL POINT VALUE
FINAL GRADE =
A technique which has been used
successfully for years is to have students evaluate their own
projects first and then the teacher can review the students'
self-evaluations and assign the final grades. The simplest way to
provide for this self evaluation by students is to have the students
complete a project evaluation form. In fact, it is best to let the
students see the form early in the planning and production stages of
the project so that they know what you will be looking for when you
grade the work. An example project evaluation form for an electronics
project appears above. After students have completed the evaluation
form, the teacher rates the project on the same factors and assigns a
grade based on the total number of points. Students will usually rate
themselves fairly--if anything, most students tend to under-rate
their own work. This type of self-evaluation form is especially good
for helping students to understand their project grades and what you
expect of them when they construct projects. Another advantage of
this type of form is that it adds objectivity to the evaluation
system. Without this type of objectivity, a teacher could become so
impressed with a beautiful finish and an attractive design that a
project of little learning value which did not even work could be
graded very highly because only subjective evaluation was used.
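The bookkeeping for such a form is simple enough to sketch. The example below is illustrative only: it assumes the "Tea." and "Stu." columns hold the teacher's and the student's ratings, the short factor names and the sample ratings are invented, and the function names `total_rating` and `rating_gaps` are hypothetical. It totals each set of ratings on the 0 to 5 scale and lists the factors on which the two ratings differ enough to be worth discussing with the student. No conversion from points to a letter grade is shown, because the form leaves that decision to the teacher.

```python
# Short labels for the five factors on the siren form (labels are assumptions).
FACTORS = ["appearance", "function", "design", "planning", "learning value"]


def total_rating(ratings):
    """Sum the five ratings after checking each is on the 0 to 5 scale."""
    for factor in FACTORS:
        if not 0 <= ratings[factor] <= 5:
            raise ValueError("rating for %s must be between 0 and 5" % factor)
    return sum(ratings[factor] for factor in FACTORS)


def rating_gaps(student, teacher, gap=2):
    """Factors where the two ratings differ by `gap` points or more."""
    return [f for f in FACTORS if abs(student[f] - teacher[f]) >= gap]


# Hypothetical ratings for one project.
student = {"appearance": 4, "function": 3, "design": 3,
           "planning": 2, "learning value": 4}
teacher = {"appearance": 5, "function": 3, "design": 3,
           "planning": 4, "learning value": 4}

print("Student total:", total_rating(student), "of", 5 * len(FACTORS))
print("Teacher total:", total_rating(teacher), "of", 5 * len(FACTORS))
print("Discuss with student:", rating_gaps(student, teacher))
```

In this invented case the student has under-rated the planning, which matches the observation that students tend to under-rate their own work.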
4.3 Experiments/Exercises
Experiments and
exercises are grouped together because they may be evaluated in much
the same way and because many learning activities which teachers and
laboratory manuals call "experiments" really are exercises--students
are following a prescribed set of instructions to complete a task
which illustrates a concept or skill. Students are not generally
trying to discover new facts and principles; if they were, then the
process would be worth more than the finished product just as it is
in real science. Exercises are usually completed in a short period of
time. Frequently, very valuable exercises can be completed in a
single lab period. Exercises are simple; that is, they usually
include few steps and processes, and they can often be fairly
evaluated via relatively subjective assessment by the teacher.
Allowing students to self-evaluate exercises is a good idea but the
evaluation form need not be nearly as lengthy as one for projects.
Likewise, if a student has made some slight mistakes on an exercise
but shows improvement in the later stages of the work, he or she may
still deserve a good grade on the exercise; remember that it is
learning we wish to encourage, not just the creation of one
pretty lettering plate! Many teachers successfully use a go-no-go
standard for exercises. Students must complete the exercise to a
minimum level of some designated standard or they do it over until
they do attain that standard. This position has both merits and
problems in practical and philosophical realms. Some activities are
better suited to this type of "all or none" test. It is finally up to
you (the teacher) to determine what will work best in your
situation.
4.4 Role Playing Activities
Role playing activities require some special evaluation techniques which are not very similar to those used for projects or exercises/experiments. For one thing, role playing activities are almost always group endeavors whereas projects and exercises may or may not be. Another real difference is that role plays frequently do not yield a tangible item which can be evaluated after completion of the activity. This means that the most important part of the evaluation must occur DURING the activity rather than after it is over.
As with all other
types of evaluation, the key to successful assessment of role playing
activities is to eliminate as much subjectivity as possible by
breaking the assessment down into small, independent parts and
establishing an objective rating for each one. An evaluation form
should be created before the beginning of the role play. The form may
be very simple; indeed, it will be more useful if it is simple.
Students should be informed of what behaviors are desired (and will
be rated on the form) before they begin the activity. If possible,
try to work the evaluation into the role play so that it makes some
logical sense and students can see that, if they perform their roles
realistically and in the most appropriate ways, they will
receive the highest grades for the activity. An example will help to
illustrate the meaning here. Suppose that you are teaching a unit on
collective bargaining and one of the activities used will be a role
play in which students are divided into management and labor and they
are to write a new contract. The old (fictitious) contract is
provided. Students should be rewarded for tough-minded, diligent
negotiation which includes compromises but also some willingness to
hold out for needed revisions in the contract. Their grade could
easily be made to depend on:
1. Final contract provisions: Both sides benefit if there is good balance, so everyone would get good grades.
2. Forcefulness of negotiations: If one side or the other were too easily pressed to give up important points, then they would lose something in real life. So in our role play we would take a few points away from their grades.
3. Good faith bargaining: Likewise, being too forceful and not bargaining in good faith may lead to forced arbitration (by the teacher of course!); the penalties of downtime and lost revenues will be paid in the grades of the students.
4. Length of strike or lockout: If a plant is crippled for a long time, everybody suffers. So all grades should be reduced if a strike exceeds a predetermined length of time. This way, students will realize that they have a lot to gain or lose if they strike--just as in life!
When translated into real grades, this means that if everyone just plays along half-heartedly then they all get basic B's or C's. If they risk something by bargaining skillfully and hard (possibly even striking), they stand the chance of making big gains in life, and their grades will go up. If the strike lasts too long or the teacher has to arbitrate the situation then the grades suffer to reflect what happens in the real world.
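The four factors above can be turned into a scoring rule that is announced to students before the role play begins. The sketch below is only one way to do it. Every number in it is an assumption chosen for illustration (the base score of 80, the size of each bonus and penalty, and the five-day strike limit), and the function name `role_play_grade` is hypothetical.

```python
def role_play_grade(base=80, balanced_contract=False, gave_up_key_points=False,
                    teacher_arbitrated=False, strike_days=0, strike_limit=5):
    """Score one side of the bargaining role play on a 0 to 100 scale.

    All point values are illustrative assumptions, not fixed standards.
    """
    score = base
    if balanced_contract:             # factor 1: both sides gain from good balance
        score += 10
    if gave_up_key_points:            # factor 2: conceded important points too easily
        score -= 5
    if teacher_arbitrated:            # factor 3: bad faith led to forced arbitration
        score -= 10
    if strike_days > strike_limit:    # factor 4: strike ran past the preset limit
        score -= 10
    return max(0, min(100, score))


print(role_play_grade())                                        # half-hearted play
print(role_play_grade(balanced_contract=True, strike_days=3))   # short strike, good contract
print(role_play_grade(teacher_arbitrated=True, strike_days=8))  # long strike, arbitration
```

Factors 1 and 4 affect both sides equally, so in practice the same arguments would be passed for management and labor; factor 2 is the one most likely to differ between the two sides.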
Some degree of self
assessment and/or peer assessment may be used in role playing
situations but it is not nearly as effective as it is with projects
and exercises because there is no after-the-fact tangible evidence
and, since students do not observe the process as carefully as the
teacher would, student input in this situation is less likely to be
objective or valuable. Many role plays are brief and not as complex
as the example above. Simple observation by the teacher is usually
adequate for such activities, but it should be done in terms of some
preset standard or objective and the assessment should be made while
the play is being conducted--observation notes should be recorded--so
there will be at least some degree of objectivity. Videotaping role
plays could be a good way to facilitate increased objectivity and
student self-evaluation. It could also yield a tangible product, if
desired.
5.1 Final Grades
There are four to six extremely trying days each year for all teachers. They are those tense days in which teachers must decide on grades to report to parents. Teachers worry, study, second guess, and gnash their teeth during those days. How do you assign grades fairly? What role do grades really have?
The purpose of grades is to record student achievement and report it to parents. This is a simple and justifiable procedure which seemingly should not be so tumultuous or value-laden. But there are some problems. Grades are typically reported on a qualitative basis in steps (A, B, C, etc.) and there is little provision, if any, in most school systems for telling parents that the student is on the borderline between two of these lock steps. What is worse is that many school systems have policy statements that require teachers to grade on some predetermined number scale, e.g., 94% to 100% = A, 86% to 93% = B, and so forth. Such scales have the appearance of establishing high standards for the school district, but they are absolutely indefensible in terms of educational measurement. These policies ignore the fact that teachers do not have the time, resources, or even the ability to create and standardize their tests as is done for the SAT and other large scale tests. For a teacher to say that a score of 90% on his or her teacher-made test is going to be a B regardless of who gets it or in what setting, is ridiculous--but this is what the policies of many school districts require. The best defense that a teacher has against such policies forcing the assignment of grossly inaccurate and meaningless grades is to have adequate records of all the work students have done in the class and to base the grade on the total picture rather than on test grades alone. Actually, this is just good practice anyway, regardless of the policies of the school system. Recall that this principle was illustrated in sections 1.1 through 1.3 of this monograph. The student's term or final grade should reflect his or her total growth and skill development in the course and should not be based solely on one type of evaluation instrument.
Balanced evaluation techniques which truly reflect the course and which provide ample opportunities for students to exhibit acquired skills and knowledge in a variety of settings will yield the most valid assessment of student achievement. When all of the students' work has really been considered, at levels which really reflect the relative importance and emphasis given to each type of work in the course, then the teacher should very objectively assign the grades according to the scores without placing too much trust in hunches (or the opinions of others) concerning standards being too low or too high. Only the individual teacher knows the situation and expectations in his or her classroom. Only the individual teacher can determine how grades should be assigned in a particular class. This is precisely why it is so important for the teacher to develop a weighting scale which assigns relative importance to all items that will be evaluated. Students should be aware of the importance to be placed on each item. Item importance should reflect course goals and objectives. Group projects and role plays should usually be of less importance than individual work because they provide too much opportunity for weak and lazy students to benefit from the work of their peers. Some cooperation factor (percentage of grade for class participation, cleanup, and attitude) is justified in these laboratory classes but it should be no more than ten percent of the total grade and it must be justified with some tangible records--teacher memory and informal observation will simply not do!
Here is an example of
a weighting scale for a six-week grade period in a typical
Manufacturing Technology course:
Cooperation & Cleanup                            10%
Experiment (wind tunnel test rockets)             5%
Rocket (individual project)                      20%
Role Play (employment interview)                 10%
Mass Production Project (phase 1 activities)     10%
Test Number 1                                    10%
Test Number 2                                    10%
Test Number 3                                    10%
Quiz Average                                      5%
Technical Report                                 10%
This weighting scale
allots about equal importance to "doing" things and written work. The
big individually completed project is the most important item on the
whole scale. There are three two-week tests rather than one or two
big ones. Quizzes were also given, but they do not count very
much--they were probably given mainly to keep students "on their
toes" with evaluation as a minor secondary motive. There is also some
written work which was not sampled in a tense test situation (the
Technical Report). The teacher using this weighting scale should be
able to safely assume that students were given adequate forums and
formats in which to demonstrate their knowledge and skills.
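The arithmetic behind such a weighting scale is a weighted average, and it is worth working once by machine to check the hand calculation. The sketch below uses the weights from the example scale; the student scores are invented, each item is assumed to be scored on a 0 to 100 basis, and the function name `weighted_grade` is hypothetical.

```python
# Weights (percent of the six-week grade) from the example scale.
WEIGHTS = {
    "Cooperation & Cleanup": 10,
    "Experiment (wind tunnel test rockets)": 5,
    "Rocket (individual project)": 20,
    "Role Play (employment interview)": 10,
    "Mass Production Project (phase 1 activities)": 10,
    "Test Number 1": 10,
    "Test Number 2": 10,
    "Test Number 3": 10,
    "Quiz Average": 5,
    "Technical Report": 10,
}


def weighted_grade(scores, weights=WEIGHTS):
    """Combine 0-100 item scores into one percentage using the weights."""
    if sum(weights.values()) != 100:
        raise ValueError("weights must total 100 percent")
    missing = [name for name in weights if name not in scores]
    if missing:
        raise ValueError("no score recorded for: " + ", ".join(missing))
    return sum(scores[name] * weight for name, weight in weights.items()) / 100


# Hypothetical scores for one student, each on a 0 to 100 basis.
scores = {
    "Cooperation & Cleanup": 90,
    "Experiment (wind tunnel test rockets)": 80,
    "Rocket (individual project)": 85,
    "Role Play (employment interview)": 70,
    "Mass Production Project (phase 1 activities)": 90,
    "Test Number 1": 75,
    "Test Number 2": 80,
    "Test Number 3": 85,
    "Quiz Average": 60,
    "Technical Report": 95,
}

print(weighted_grade(scores))  # 82.5
```

The check that the weights total 100 and that no item is missing a score guards against the two most common slips when a scale is revised in mid-course.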
Why Go To So Much Trouble?--A Brief Summary
If teachers will be
systematic, will inform students adequately of expected behavior,
will evaluate often instead of all at the end, will keep in touch
with parents and counselors when problems arise, will carefully write
their tests, will base evaluation on many different types of student
work, will devise forms which force some degree of objectivity into
even the most subjective aspects of evaluation, will continue to
revise evaluation instruments/techniques, and will provide students
with multiple means of exhibiting learned behaviors, then they will
be doing the best that they can do to fairly evaluate their students.
Will they ever make mistakes? Yes, even with all this attention,
mistakes will be made--but they should be rarer with this systematic
approach. Will they still feel tense and pressured at grade time? If
they really are good teachers who truly care about their students,
they will, but they will be able to honestly say that they are doing
the best that they can do--the best that it is possible to do. Will
the evaluation process become easier? No, it will become more
accurate; it will become less haphazard; it will become more of a
constant undertaking rather than a sporadically occurring major job,
but it will not become less time consuming or easier. Evaluation
which is done properly requires a lot of the teacher's time. It is
just like the television commercial about the automobile oil filter
in which the mechanic says "you can pay me now [a few dollars for
the filter] or you can pay me later [several hundred dollars
for an overhaul]." Evaluation is going to take time. The choice
is whether to plan it carefully and invest some time "up front" to
achieve good evaluations or to wait and try to do the whole job at
the last minute and settle for very rough, unsupported guesses about
student achievement. Your amount of time invested will be substantial
either way. You might just as well get what you pay for.