STUDENT EVALUATION

The Teacher's Most Difficult Job




Monograph 11
of the
Virginia Council on Technology Teacher Education

W. J. Haynie, III
Monograph Author
North Carolina State University


Revised 1993





CONTENTS


RATIONALE

COMPETENCIES

INFORMATION SECTION

1.1 Need for Different Forms of Evaluation

1.2 List Different Forms of Evaluation

1.3 Balanced As Well As Broad

2.1 Characteristics of Written Tests--Validity

2.2 Characteristics of Written Tests--Reliability

2.3 Relationship of Validity and Reliability

2.4 Fairness

3.1 Essay Tests

3.2 Short Answer Tests

3.3 True-False Tests

3.4 Matching Tests

3.5 Multiple Choice Tests

3.6 Factors Affecting the Total Test

4.1 Evaluating Student Laboratory Work

4.2 Project Evaluation

4.3 Experiments/Exercises

4.4 Role Playing Activities

5.1 Final Grades

Why Go To So Much Trouble?--A Brief Summary






RATIONALE

There are three primary components of the education process: designing (goal setting and identifying content), presentation (teaching), and assessment (testing and evaluating). This monograph deals with the third component--assessment. Educational experiences, including courses, projects, assignments, and activities, are all selected/designed on the basis of goals and objectives. The degree to which the goals and objectives are met forms the basis for evaluation and, in many ways, dictates the methods of evaluation. Programs, teachers, facilities, administrators, courses, students, support services, and many other aspects of education must all be evaluated but the topic of this particular publication is assessment of student achievement.

Evaluating students, especially that phase of evaluation which requires assigning grades, is the toughest part of a teacher's job. Fledgling teachers should not feel that the agony they experience the first few times they assign grades will go away as they gain experience. Indeed, as teachers gain experience they generally find evaluation to be a tougher job--requiring their utmost patience and concern. No part of a teacher's job is more important or demanding than evaluation, but the helpful tips and guidelines presented in this monograph will help teachers avoid some serious pitfalls.

In addition to being both important and difficult, evaluation of student achievement is also somewhat value laden. Every educator assumes a personal set of beliefs, hunches, and values which determine, to a large degree, what forms of evaluation will be used, the standards to be set, and how the evaluating criteria are to be applied. The reader should be aware that the author of this monograph has his own biases and "bag of tricks" which have served him well. Every attempt has been made to ensure that a well-balanced array of professionally accepted practices is presented, but there will certainly be those educators (with and without experience) who will disagree with some portions of this publication. Professional difference of opinion is certainly allowed and--as long as evaluation is systematic, frequent, broad in scope, and applied fairly--there is plenty of room for variations in technique. Large amounts of time are required to do evaluation correctly, so it is important to do the best possible job with the time we invest.


COMPETENCIES

TASK 1.0 Explain the need for and importance of balanced and varied evaluation techniques.

PERFORMANCE OBJECTIVES:

1.1 After reading this monograph, you will be able to explain the need for different forms of evaluation in Technology Education courses.

1.2 After studying this package, you will be able to list several forms of evaluation appropriate for use in Technology Education courses.

1.3 After reading this monograph, you will be able to explain the importance of balance in evaluation of students.

TASK 2.0 List and explain the most important characteristics of good tests.

PERFORMANCE OBJECTIVES:

2.1 After studying this publication, you will be able to intelligently discuss face validity as it applies to teacher-made tests.

2.2 After studying this monograph, you will be able to explain reliability in educational measurement.

2.3 After reviewing this publication, you will be able to discuss the relationship between validity and reliability in testing.

2.4 After reading this monograph, you will be able to discuss other factors which contribute to the general fairness of a teacher-made test.

TASK 3.0 Write acceptable test items of several types.

PERFORMANCE OBJECTIVES:

3.1 With the help of this booklet, you will be able to write and correctly score essay test items.

3.2 With the help of this monograph, you will be able to prepare well designed short answer test items for teacher-made tests.

3.3 With the help of this monograph, you will be able to write improved true-false test items.

3.4 Using this monograph as an aid, you will be able to prepare matching test items and cite appropriate uses for them.

3.5 Following the guidelines in this monograph, you will be able to improve the quality of the multiple choice test items which you write for your teacher-made tests.

3.6 After studying this monograph, you will be able to list and explain several factors affecting the quality of the whole test.

TASK 4.0 Systematically evaluate student laboratory work.

PERFORMANCE OBJECTIVES:

4.1 After reading this publication, you will be able to discuss three major types of student lab work which need to be evaluated in Technology Education classes.

4.2 With the help of this monograph, you will be able to develop student self-evaluation forms for use in evaluating projects in your classes.

4.3 After reading this monograph, you will be able to properly evaluate student laboratory experiments and exercises.

4.4 After studying this publication, you will be able to establish improved procedures for evaluating role playing activities in your laboratory.

TASK 5.0 Explain how to fairly and objectively assign grades to students.

PERFORMANCE OBJECTIVES:

5.1 With the help of this monograph, you will be able to fairly and objectively assign term and course grades to students.

5.2 Using this monograph as a guide, you will be able to plan your evaluation system so that it assesses many aspects of student behavior.

INFORMATION SECTION

1.1 Need for Different Forms of Evaluation

One of the most important facts that all teachers learn very early in their pre-service education is that each student is an individual. The importance of individual differences and their impact on the education process is expounded even further in professional, psychology, methods, and measurement courses. This emphasis and overlap was not caused by coincidence, lack of planning, or attempts to "beat a dead horse". The overlap was planned and is necessary because that is probably the only true fact which we have nearly proven in all of our research on education: Each person is a unique individual and individuals learn in different ways and at different paces. This is the reason why assessment of student achievement must be broad in scope. Just as each student perceives and assimilates information differently, so each student is best able to demonstrate the attainment of knowledge in ways unique to himself or herself.

Assume that you have taught a unit on identification and characteristics of woods. Bob made an easy A on your written test about the characteristics of woods while Laura scored a disappointing 40%. However, when they were selecting wood to use in making water skis, Bob chose cherry for appearance's sake. Laura used mahogany and ash laminated together. When you asked Laura about her choice, she pointed out the straight grain patterns and the excellent gluing and water resistant qualities of the woods. Which child has really achieved your instructional goals in this area? Both students had knowledge (perhaps even the same knowledge), but each was able to display the knowledge in a different way. Technology Education is a special learning environment which emphasizes learning in all three domains (affective, cognitive, and psychomotor) more than most subjects and, since we teach in all three domains, we must also test in all three domains!

1.2 List Different Forms of Evaluation

Technology teachers should be able to evaluate the work of students by many methods and in various settings--some traditional and some unique to our field. Some examples of ways to evaluate student achievement and growth are:

Written tests,

Written quizzes,

Papers,

Oral tests and quizzes,

Document application of knowledge in activities,

Portfolio review,

Review questions,

Notebooks,

Performance tests,

Observe student lab work,

Evaluate quality of products,

Evaluate quality of exercises,

Safety performance tests,

Monitor group activities,

Indirect questions to determine change of opinion,

Document evidence of attitudinal (value) changes,

Student-self evaluation of projects, and

Document leadership/followership qualities.

Surely, there are still other methods which can be used to evaluate Technology Education students in specific situations. The list above is not considered to be exhaustive, but it should help to point out the nature of broad evaluation techniques.

1.3 Balanced As Well As Broad

In the example given in section 1.1, if the teacher tested knowledge of the characteristics of woods only on the written test, Laura's knowledge would not have been discovered at all. Likewise, a teacher who did not use any written tests would not have observed Bob's knowledge. Between these two extremes, there are many possible ways to combine test and non-test assessment techniques. This is an area which will require professional judgement on the teacher's part. The simplistically obvious solution would be to make tests and written work count for fifty percent of the grade and let projects and lab work comprise the remaining fifty percent. Some educators disagree with this technique because they feel that it results in an academically "watered down" course and reduced standards. Others would argue that "doing" is far more important than "book learning" and would contend that tests do not show much. The advice given here is to strive for balance and use assessment techniques which truly reflect what is learned and done in your course. An evaluation schema that gives relatively equal emphasis to both academic learning and performance is generally desirable, but this is so only if the techniques used to assess knowledge and practical ability are valid, reliable, and fair. Much of the remainder of this monograph will deal with the development of measurement techniques which meet these criteria.
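The fifty-fifty split mentioned above can be sketched in a few lines of Python. This is only an illustration: the function name, the default weight, and the sample scores are assumptions made for the example, not values prescribed by this monograph.

```python
def composite_grade(written_pct, lab_pct, written_weight=0.5):
    """Combine a written-work percentage and a lab-work percentage.

    written_weight=0.5 is the even split discussed above. It is an
    illustrative default, not a required value.
    """
    if not 0.0 <= written_weight <= 1.0:
        raise ValueError("written_weight must be between 0 and 1")
    lab_weight = 1.0 - written_weight
    return written_pct * written_weight + lab_pct * lab_weight


# Hypothetical students: one stronger on tests, one stronger in the lab.
print(composite_grade(95, 70))  # 82.5
print(composite_grade(40, 95))  # 67.5
```

Changing written_weight shows how strongly the final mark depends on the balance the teacher chooses, which is why that choice calls for professional judgement.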

The time spent in evaluation is an important commodity. If administering, grading, and reviewing a test takes time which could be used for additional lectures or laboratory work, it might be assumed that students lose educational opportunity by taking tests. Some research findings, however, contradict this assumption--it has been demonstrated that students learn more when they are tested and that the act of taking the test is a beneficial learning activity. Merely knowing that a test was coming was only marginally effective, but the actual experience of taking the test was quite effective. This was true for both in-class and take-home tests. So, time spent in testing is neither lost nor wasted--it is a valuable aid to learning and serves an important evaluation function too.

2.1 Characteristics of Written Tests--Validity

There are several types of paper and pencil tests, each with unique characteristics and limitations which make it more or less useful in given situations. There are three characteristics, however, which are essential in all types of tests--whether they are written or not. These characteristics will be discussed under the headings validity, reliability, and fairness.

The first characteristic of a good test is validity. There are several types of validity which can be discussed in relation to educational and psychological measures, but the most important type of validity needed by teacher-made tests is "face validity". Face validity simply means that the test actually does test what it is supposed to test. This may seem silly and redundant, but it is very important. A test covering content about screen process printing which is written in English may be very valid in America, but this same test would cease to be valid if it were administered in China. In the latter case, the test would test the students' command of our language more than it would their knowledge about screen printing. This example is extreme, but it illustrates an important concept. The all-too-common "trick question", which requires students who know the subject matter to out-guess the teacher's attempt to be witty or cute, is the worst enemy of face validity. If you wish to write good tests, make every possible effort to avoid obscuring the point of a question with trickery and riddles.

Consider this item:

Which of the following parts of a car is held in the hand most often?

a. clutch

b. gearshift lever

c. light switch

d. stearing wheel

e. gas cap

Most students would choose d, but that is wrong because it is misspelled! Have you seen this sort of trick? Was the intent of the question to determine how well students could identify the uses of parts of the car or how well students could spell? Such trickery destroys face validity.

2.2 Characteristics of Written Tests--Reliability

The second characteristic of a good test, reliability, is related to the first. Reliability means, simply stated, that the test tests the same thing each time it is administered--that its meaning and measuring ability do not change. We could call validity "truthfulness" and reliability "consistency". A test may be said to be consistent (reliable) if it gives relatively the same results each time it is used with similar groups of people. It is fairly easy to establish the reliability of a large standardized test. A simplified method of determining the reliability of a test would be to administer it to a group of students, wait about two days, and then re-administer the same test to the same group. After scoring both tests and plotting the scores on a graph, you would likely find that the scores of the second testing would be a few points higher than those of the first testing; but, more importantly, you should find that the same general ranking of scores occurred in both testings. That is, the students who scored high on the original testing should also score high on the retest. A test yielding these results would be said to have some degree of reliability--it would be consistent. If, on the other hand, the ranks reversed (with poor students receiving the high scores in one of the two testings), then the test would not be very reliable. The reliability of teacher-made tests depends mostly on two factors: clarity of items and test length.
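The test-retest comparison described above can be sketched numerically. The monograph speaks only of plotting the scores and comparing the general ranking; the Python below substitutes one common statistic for that visual check, the Spearman rank correlation (the ordinary correlation computed on ranks). The function names and the sample scores are assumptions made for illustration.

```python
import math


def average_ranks(scores):
    """Rank scores from 1 (lowest) upward; tied scores share their average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of scores tied with position i.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def rank_agreement(first, second):
    """Spearman rank correlation between two administrations of one test."""
    x, y = average_ranks(first), average_ranks(second)
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("all scores tied; ranking agreement is undefined")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical scores for six students, listed in the same student order.
first_testing = [55, 62, 70, 78, 85, 93]
retest = [60, 66, 72, 83, 88, 95]   # a few points higher, same ranking
print(round(rank_agreement(first_testing, retest), 3))        # 1.0
print(round(rank_agreement(first_testing, retest[::-1]), 3))  # -1.0
```

A value near 1 corresponds to the consistent ranking described above, and a value near -1 corresponds to the reversed ranks of an unreliable test. With a single classroom of students the figure is only a rough indication.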

A test is made up of small parts called items (sometimes the term "questions" is inappropriately substituted). Just as a chain derives its quality from each link, so a test gains reliability (and validity) from each item. The most probable cause of low reliability in teacher-made tests is poorly worded items. Items which are unclear, confusing, and ambiguous can damage the consistency of a test very quickly, and it does not take many weak items to utterly destroy the reliability of the whole test. True-false items are particularly vulnerable to ambiguity and they frequently give the brightest students the most trouble. Avoid using words with vague meanings, double negatives, and complicated sentence structures.

Refer back to the example item about the car parts (section 2.1). If students caught the teacher's attempt to be sneaky, and avoided response d, then they would have to choose another answer based on their interpretation of the ambiguous phrase "held in the hand most often". Does that mean "grasped by the hand most frequently" or does it mean "picked up and enclosed in the hand most often?" Does "in the hand" eliminate the possibility of using both hands? This item is worded very poorly and its lack of clarity would certainly harm the reliability of the test.

The second factor affecting reliability is test length. Tests with two or more items on each topic/concept/fact are more reliable than tests with only one item per bit of information. In fact, until the test becomes so long that it creates serious fatigue problems, the longer the test is, the more reliable it will be. It is easy to see how this works. Imagine that you wished to test five concepts. You could make up a five item test, a ten item test, or perhaps even a forty item test with concepts duplicated by different types of items. Suppose also that, regardless of which length test you use, there is one bad item which discriminates reversely (the most capable students answer incorrectly but the weak students get it right). A student who truly knows the material, and who should normally get a score of 100%, but who (wrongfully) missed this one ambiguous item would score as follows depending on total test length:

5 item test 80%

10 item test 90%

40 item test 97.5%

Which score most nearly estimates this student's true ability? The longer test would have much greater reliability. When possible, try to have at least two items per information bit.
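The arithmetic behind the three scores above can be checked with a short Python sketch. The function name is an assumption made for illustration; the test lengths and the single wrongly missed item come from the example above.

```python
def score_with_missed_items(total_items, missed=1):
    """Percentage score when `missed` of `total_items` equally weighted items are wrong."""
    return 100.0 * (total_items - missed) / total_items


for length in (5, 10, 40):
    print(length, "item test:", score_with_missed_items(length))
# 5 item test: 80.0
# 10 item test: 90.0
# 40 item test: 97.5
```

The same single bad item costs twenty points on the shortest test and only two and a half on the longest, which is the sense in which added length protects the score from one flawed item.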

2.3 Relationship of Validity and Reliability

A test that is not reliable cannot be valid, but the converse of this statement is not true. Alternately stated, a test must be reliable (consistent) in order to be valid (to truthfully test the right thing) but a test could easily test the wrong thing entirely (be quite invalid) and do so very accurately and consistently (repeatedly, reliably). The scores derived in test-retest situations using the printing test in China (see section 2.1) would show remarkable reliability because the rank orders of the scores would change very little, but it would never be valid because it would always test knowledge of English rather than knowledge of screen printing. Thus, a test must be reliable to be valid, but it could be reliable without being valid.

2.4 Fairness

Fairness is the third characteristic of a good test. Fairness includes some aspects of reliability and validity (as they would be discussed in a psychological measurement text) and other factors such as clarity of items, quality of the testing environment, and unbiased scoring. We have all felt, at one time or another, that a test was unfair because it was given after a week off, or when schedules were mixed up and time was cut short, or in a hot room, or with unclear instructions, or ...

Fairness is a difficult characteristic to deal with because it actually is the resulting effect of many small factors. Anything that reduces reliability or validity will certainly detract from the fairness of a test. Additionally, the following factors affect test fairness:

Time proximity to material presentation,

Temperature,

Lighting,

Length of test (fatigue factor),

Time available,

Objectivity of scoring,

Noise and disturbances,

Clarity of instructions,

Appropriateness of types of items used,

Quality of test reproduction, and

Amount of material covered on one test.

Teachers may increase the fairness of their tests the most by applying a variation of the "golden rule" to testing: Test as you would like to be tested yourself. Consideration for students' needs, comfort, and even their feelings combined with efforts to make a valid, reliable instrument and to score it objectively and without bias will usually result in fair testing. Educational psychology and measurement books list other characteristics of tests and several subtle variations of validity and reliability, but the three characteristics illustrated here are sufficient for teachers to apply in practical situations.

3.1 Essay Tests

There are several types of written tests from which the teacher may choose. Each type of test has its own unique capabilities and limitations which make it more or less useful in given situations. Section 3 discusses these different types of tests beginning with essay tests.

Essay or discussion tests are truly free response tests. They are the easiest tests to write, but the most difficult to grade. They are the least objective of all tests but they do allow each student to tell what he or she knows. Their biggest advantage is that there are very few cues to jog students' memories, and they provide opportunities for students to write--thus reinforcing writing skills. They are difficult from the students' perspective because of the magnitude of the task. The main weakness is that, because scoring is difficult and typically subjective, students usually respond with their best actual answer and then "shoot the bull" a bit in an effort to grope for a few more points. The most unfortunate thing is that they generally get those points! Even a student who knows almost nothing about the subject can usually beat around the bush enough to get ten to forty percent of the points. The author has even seen papers on which students had freely admitted that they did not know the answer to the question provided but they did know xxx..., and they were awarded partial credit for this absolutely inappropriate response.

Teachers choosing to use essay items should plan for scoring while writing the items and assign a definite meaningful point value to each item--not just 10 or 20 points for the sake of tradition or to make the numbers total 100. This is done by making an outline of the desired perfect response while writing the item and then counting the number of elements required in the response. Assign points to these elements, usually one point each but sometimes differentially if some elements have far greater importance than others. Include additional points for order if a sequence is important (as in operating a machine). Indicate the point value of each item on the test so that students will know what is required of them.

When scoring the test, mark all students' responses to item one, then shuffle the papers and mark all students' responses to item two, then shuffle again, etc. Likewise, avoid looking at the students' names and try to ignore their handwriting. These techniques will aid objectivity by masking students' identities and by limiting the tendency of the scorer to assume that a student who performed well or poorly on one item will do the same on subsequent ones. To score the item response, simply scan the answer looking hurriedly for the main words/phrases which appear on the key outline. As each element is found, mark a number behind it starting with one and following in order to the end of the answer. The last number you write will be the point value of the response. Do not give credit for any elements that were not on your key, regardless of how plausible they seem. This means that you must be very careful and complete when you write your perfect answer outlines for the key. If you did give credit for something that was not on your key (because one of your best students used it in place of a required element) then you would be ruining the little bit of objectivity that this scoring method has. To do so would require two things: 1) that you would go back and re-score all papers with this new element counted correct and 2) that you could say with clear conscience and truthfulness that the new element is so obviously correct (more correct indeed than the one you originally intended!) that you would have made the same effort to re-score all the papers if the only student who had used the new element had been the weakest student in the class. Since very, very few student-discovered novel answers of this sort will be encountered in one's career--if the teacher is doing a good job in preparing the key, that is--the best advice is not to accept answers that are not on your key.

Here are an example item, the key, and a student's response which has been properly scored:

(5 points) 3. Briefly explain how to score an essay item.

Key For Item 3: Outline key while writing items.

Score item 1 for all, then 2, etc.

Avoid names.

Mark points in sequence.

Do not credit any bull.

Student's Response:

Make an outline of the answer you want when you write the test.(1) The outline should be brief and easy to use but no one but you will see it. Grade everyone's question number one first and then do all question number two(2) without looking at names.(3) Don't give credit for extra garbage.(4) Check spelling and punctuation for each item and take points away for incomplete sentences.

The last element that was marked was number 4 so the student's score is 4 points. If care is taken to overcome the lack of objectivity in scoring, discussion items can measure some things which are very difficult to test otherwise.
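The scoring routine in this section can also be sketched in Python. This is a sketch under stated assumptions: the teacher still reads each answer and decides which elements are present, so the code only records that judgement, counts the elements that appear on the key, and produces the item-by-item, reshuffled marking order. The function names and the short element labels are hypothetical.

```python
import random


def score_essay_response(key_elements, elements_found):
    """One point per key element the scorer found; anything not on the key earns nothing."""
    key = set(key_elements)
    return sum(1 for element in set(elements_found) if element in key)


def scoring_order(paper_ids, item_count, seed=None):
    """Yield (item_number, paper_id): every paper for item 1, reshuffle, then item 2, etc."""
    rng = random.Random(seed)
    papers = list(paper_ids)
    for item_number in range(1, item_count + 1):
        rng.shuffle(papers)
        for paper in papers:
            yield item_number, paper


# Key for the example item above, reduced to short labels (an assumption).
key_item_3 = ["outline key", "item by item", "avoid names",
              "mark in sequence", "no credit for bull"]
# What the scorer judged to be present in the sample response.
found = ["outline key", "item by item", "avoid names",
         "no credit for bull", "check spelling"]  # last label is not on the key
print(score_essay_response(key_item_3, found))  # 4

# Papers are identified by number rather than by student name.
for item_number, paper in scoring_order([101, 102, 103], item_count=2, seed=7):
    pass  # mark this item on this paper before moving on
```

Identifying papers by number instead of name is one way to mask identities while scoring; it is a design choice of this sketch and not a procedure spelled out above.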

3.2 Short Answer Tests

Short answer tests are similar to essay tests in that they demand recall knowledge rather than mere recognition. They are, however, easier to score than discussion items, and they are not as difficult a task for students who write poorly. Actually, short answer items could include anything from completion items, in which the student fills in a blank with one or two words, to actual questions which require mini essays or short paragraphs as answers. In the latter case, the same general guidelines apply as did with essay tests. When completion items are used, it is essential that the item (stem) be worded so carefully that there could be only one possible answer which would correctly fill the blank. For example, consider these items:

1. The instrument used to draw horizontal lines is the __________.

2. The __________ is used to guide the pencil when drawing straight horizontal lines.

Which item is better? Both items would usually elicit the desired response (T-square), but there could be students who would be confused enough by the first item to put "pencil" as their answer and, even though it was not the intended answer, it would be very difficult to draw those horizontal lines without a pencil! Another thing to avoid in writing completion items is giving students unintended cues to the correct answer.

3. An __________ __________ is used to check the rough diameter of stock on the lathe.

If students were confused about whether the proper answer should be "micrometer" or "outside caliper", they have two very good cues to the answer which are totally unrelated to the true subject matter being tested: First, the fact that there are two blanks is a giveaway; and second, "an" is not the proper article to use with "micrometer". These foolish errors in test construction are easy to avoid if you will look for them and carefully proofread your tests.

Scoring completion items should be very straightforward. As with any other items, the point value of each item should be indicated for students on the test and a key should be developed very carefully as the test is written. If both of these steps are properly attended to, then scoring completion items may be done by merely reading each student's paper and marking through any answers that are incorrect with a simple horizontal line. After all items are marked, the teacher has only to count the number of lines marked on the paper to see how many points should be taken away from the completion section of the test. While marking the answers, it is a good idea to write the code "NT" in any blanks which the student skipped or items which he or she did not attempt to answer. The code will clearly indicate that the students did "Not Try" to answer those items. This measure is needed to prevent a few dishonest students from filling in the answers in blank spaces after the test is returned and then trying to claim that you made a mistake while scoring the test.
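The marking routine above can be sketched in Python as follows. The function name and the mark codes "ok" and "X" are hypothetical, and the exact-match comparison is a simplifying assumption: in practice the teacher judges each answer against the key. Only the "NT" code comes from the text above.

```python
def mark_completion_section(key, answers, points_each=1):
    """Mark each blank as "ok", "X" (struck through), or "NT" (not tried).

    Assumes `key` and `answers` list the blanks in the same order and
    that a correct answer matches the key apart from case and spacing.
    """
    marks = []
    for expected, given in zip(key, answers):
        if given is None or not given.strip():
            marks.append("NT")
        elif given.strip().lower() == expected.strip().lower():
            marks.append("ok")
        else:
            marks.append("X")
    points_lost = sum(points_each for mark in marks if mark != "ok")
    return marks, len(key) * points_each - points_lost


key = ["T-square", "outside caliper"]
marks, score = mark_completion_section(key, ["t-square", ""])
print(marks, score)  # ['ok', 'NT'] 1
```

Recording "NT" separately from an ordinary wrong answer keeps a record that the blank was empty when the paper was scored.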

Short answer tests can be very useful tools, and they are relatively easy to make up and grade, but the teacher must be careful not to make them test only low level learning and facts--this is a natural tendency because it is somewhat difficult to word an item on higher levels of learning which can be answered with only a few words. The biggest hindrance to using short answer tests for higher level learning is that the wording of the stem may become so complex (in order to elicit the desired response) that it confuses students and becomes ambiguous.

3.3 True-False Tests

True-false items are the most misused of all types of items. They can be used effectively in certain situations, but they are especially prone to flaws resulting in reduced reliability and validity for the whole test. Even a very good test can be totally destroyed by inserting just a few poorly planned true-false items. The reasons for this are partly obvious and partly obscure. First, it should be noted that it is extremely difficult to write anything (even a very short statement) which is totally true or totally false. When you use true-false items, however, this requirement of total truth or falsity is either formally pointed out in the instructions or implicitly present because students already know the rules of the game--and true-false items frequently are just that, a game. In order to write these types of pure statements, one must absolutely avoid words like "always" and "never" because there is almost nothing which is always or never true.

4. The sun always rises in the East. (Well, if it rises, then it will do so in the east, at least it has up until now!)

5. You should never use ungrounded power tools. (What about those new ones with the double insulation systems? They aren't even sold with the old three prong plugs.)

6. You must always wear safety glasses in the lab. (This one is probably all right, but who would miss it? There are, very likely, both some obscure cases in which the item would be false, and better ways to test this knowledge as well.)

If, despite these warnings, you do decide to use true-false items, here are some pitfalls to avoid. Make the true-false section of the test the lowest in point value because it will be the least reliable group of items. Never "lift" items straight from the textbook. This applies whether you intended to use the statement as it was written or to change a few words to make it false. Be very careful to avoid double negatives and complex sentence structures that will confuse students and mask the meaning of the statements. Instruct students to respond by making a clear vertical capital (gothic) letter T for true or F for false. Small letters t and f are often easily confused and will usually penalize your better students and lead to unfair testing--this is precisely why true-false items can bring you more problems than they are worth. Many teachers mistakenly feel that true-false items are desirable because they are easy to write and easy to grade, but, if you think they are easy to write, then you are most likely misusing them and writing very poor ones as well. The advice here is that there is usually a better type of item for most situations and, therefore, true-false items should be used very rarely and very carefully.

3.4 Matching Tests

Another very popular, but also dangerous, type of item is matching. Like true-false items, these can be used effectively in some settings, but they are often misused. Teachers generally believe that matching items are easy to write--just a matter of putting a column of stems (questions) opposite a column of responses (answers). The problem is that there are few things that occur in large enough groups of similar form to make matching any more beneficial than completion or multiple choice items. Let's examine a matching section from a typical metals technology unit test:

 1. used to light torches          A. riser

 2. center which moves             B. use when cutting threads

 3. pour metal in here             C. vernier caliper

 4. abrasive cloth                 D. striker

 5. coated with zinc               E. tin snips

 6. air escapes here               F. sprue

 7. measures round stock           G. live center

 8. oil                            H. tools

 9. sheet metal is cut with        I. galvanized

10. cutting parts for lathe        J. emery

There are many problems with this set of items. The first obvious fault is that, since there are equal numbers of cues and response choices, students will be able to figure out at least one answer each (and probably more) by elimination and pairing up whatever is left. Another problem is that we have mixed apples and oranges--or even more realistically, apples with tables--because there are some answers that could only go with certain cues. The only instance in which this is not a problem is when there are a lot of very similar types of information bits which need to be tested, like a list of events and their dates in history. That would be a good place to use a matching test. Accordingly, items 1, 2, 7, 9, and 10 above (types of tools matched to their uses) could be used effectively together in a matching test. But items 4, 5, and 8 definitely do not belong in this sequence. Item 8 has another problem. It is the only item on this side that has a one word cue so it is a "dead giveaway" that the longest answer will go with it. There are two items about the sprue and the riser on which every student will either make two points or zero points. This is because there is nothing else which could be matched with them. Simply adding the choice "vents" would completely change the difficulty and value of the test.

Matching tests should be constructed with more response choices than there are item cues. The items should all be similar and the structure should be uniform enough that any answer could plausibly be placed with any cue by one who did not know the material well. The lists of cues and/or responses should never be very long (ten to fifteen is really about as long as they should be) so that students do not get bogged down looking for an answer after they know what it is. The instructions should clearly point out that response choices are to be indicated by placing the vertical capital (gothic) letter for the chosen response in the blank provided--either beside the item's cue number or on a separate answer sheet. As with true-false tests, if you felt that matching tests were easy ones to write, you have probably written some pretty poor ones by now! Matching items can be very good, sometimes they are the very best items to use, but they can also lead to traps that could ruin a potentially good test.
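The two countable rules above (more responses than cues, and lists of no more than about fifteen entries) can be checked mechanically on a draft. The short Python sketch below is only an illustration: the function name and the default limit of 15 are assumptions chosen for this example, and it cannot judge whether the items are similar enough to belong together, which remains the teacher's call.

```python
def check_matching_section(cues, responses, max_length=15):
    """Return a list of warnings for a draft matching section.

    max_length=15 is an assumed reading of the "ten to fifteen" guideline.
    """
    warnings = []
    # With equal (or fewer) responses, the last answers can be found by elimination.
    if len(responses) <= len(cues):
        warnings.append("Provide more response choices than cues.")
    # Long lists make students hunt for an answer they already know.
    if len(cues) > max_length or len(responses) > max_length:
        warnings.append(f"Keep each list to about {max_length} entries or fewer.")
    return warnings


# The metals technology example has ten cues and ten responses.
cues = [f"cue {n}" for n in range(1, 11)]
responses = list("ABCDEFGHIJ")
for warning in check_matching_section(cues, responses):
    print(warning)
```

Adding even one extra response (such as the "vents" choice suggested earlier) clears the first warning.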

3.5 Multiple Choice Tests

Multiple choice items are the most commonly used ones in large standardized tests. They have several advantages in that setting. Multiple choice items may be written to test at almost any level of the taxonomy of the cognitive domain. Skillfully worded items can challenge even the most capable students. Since the answer is always available as a possible selection with other potential choices (called distractors), these items require recognition rather than recall of information. These are very difficult items to write, but the ease with which they may be scored is quite attractive to many teachers. With new technology making computerized scoring, record keeping, and grade calculation available to more and more teachers, it is likely that multiple choice items will become even more common in the very near future. Any form of testing which the teacher chooses to use will require a lot of time in one way or another. Essay tests are quick and easy to write but laborious to grade; multiple choice tests are the exact opposite in this regard--they take a great amount of advance planning and care in their construction but they may be scored in one sitting with little effort. Despite these big advantages, multiple choice tests are not without problems of their own. Some of these problems and ways to avoid them will be discussed by examining some typical multiple choice items:

1. Which of the following plastics is used to replace glass in windows?

A. nylon

B. casein

C. acrylic

D. A. B. S.

E. oak

This item is not too bad. It is simply stated and would not confuse students. It would determine if students recognized that acrylics are used in glazing applications. We would hope that they knew this because they understood that acrylics have good clarity (transparent qualities), but they could have merely memorized that acrylics were used to replace glass. If the general characteristic (clarity) were the important information bit, then the stem should have asked which plastic had that quality. Both types of items would be fine--the choice would depend on how you taught the original information. If you did not use the example of glazing in your teaching, then the item, as stated, would be testing a high level of the cognitive domain (application) but if you did stress this use of acrylics, then the item would test at the lowest level of recognition. The only real problem with this item is that the distractor E would not be selected by anyone and it therefore makes the item easier. Students who do not know the answer and who guess have a one in four chance to get this item correct because they will have eliminated E from the beginning. Distractors should not be sneaky (tricky), just as in any other type of item, but they should be plausible enough to attract those who do not know the material. Here is another example:
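The effect of an implausible distractor on guessing can be worked out directly. The following Python sketch is a simple illustration that assumes a student who does not know the answer guesses evenly among the choices that still look plausible; the function name is made up for this example.

```python
from fractions import Fraction


def guess_chance(num_choices, implausible=0):
    """Chance that a blind guess is correct once implausible choices are discarded.

    Assumes the guesser picks evenly among the remaining choices.
    """
    plausible = num_choices - implausible
    if plausible < 1:
        raise ValueError("at least one choice must remain")
    return Fraction(1, plausible)


print(guess_chance(5))      # five plausible choices
print(guess_chance(5, 1))   # "oak" is discarded, leaving four
```

With all five choices plausible the chance is one in five; once "oak" is ruled out it rises to one in four, which is the point made about item 1.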

2. Sandpaper:

A. is used to smooth wood after it has been cut and it comes in many colors and types of grit on a vinyl backing.

B. can be used wet or dry if the backing and the adhesive are both water resistant.

C. is always made with garnet grit.

D. is really not made with sand and should be cut to size on the squaring shear.

E. should be used before wood is planed in the surfacer.

This item has some real problems. First, you should always make the stem the long part of the item and the responses the short part (this applies equally to matching items) to reduce reading fatigue and confusion for students. By the time students finish reading all the choices in this item, they will have forgotten what they were asked and what the first choices said. Another thing to remember is that the rules of multiple choice items require the selection of the "best" response. There may be more than one alternative (choice) which correctly completes the statement, but the student's task is to select the alternative which best completes the statement. For this reason, you must be very careful when writing distractors. Many students fear and despise multiple choice items because this subtle factor can cause them much anxiety, especially when teachers use many potentially correct alternatives. To some students, even very skillfully worded multiple choice items seem sneaky or tricky.

3. Electric current flows:

A. best at night time

B. faster than a Chevrolet

C. from ground to positive

D. 186,000 mps

E. faster than sound waves

Essentially, B, C, D, and E are all correct choices here. If the teacher has just finished a unit which included the speed of current flow and the subject of direction of current flow has not been studied, then alternative C is a fairly harmless distractor for all except the few highly motivated students who have read ahead or studied other materials and therefore have enough sophistication in the subject to know that this statement has entirely different meanings depending on whether you mean "electron flow" current or "conventional" current flow. The answer sought by the item is D, and that is the closest estimate of the speed of electric current which is quoted in most textbooks. A bright student who is very well informed, however, will know that this figure is only an estimate and not "exactly" correct, so (realizing that the job is to pick the best of competing correct answers) this student may be drawn to answer incorrectly due to a higher level of sophistication about the subject. Do not think that this is unusual; it happens all the time. Students have even pointed out such inconsistencies on the big standardized tests like the SAT and won their cases. It is difficult to make up distractors that will be of any value without getting into this sort of trap, so be on guard against these defects.

Do not force yourself to make up multiple choice items where they are not needed. If you can only think of two good choices for the alternatives, then you should probably write a true-false item instead of a multiple choice item with "yes" and "no" as the only alternatives. Likewise, if there are several things that are very similar that you wish to have students pair-up, matching items are far better than a series of ten multiple choice items, all with the same set of alternatives. If you really wish for students to have the "right" word on the tip of their tongues, then a short answer item tests that information better because they will have to recall that right word, not just pick it out of a group of others. But, when you wish to test students' recognition of things in the context of other things, or concepts in applied situations, multiple choice items are the only good alternative. The advice here is: if you can test it an easier way, you should because there are many ways to foul up the job with multiple choice items; but, if there is no simple way to test it, be careful to make the multiple choice item as difficult as possible without it being tricky. Teacher-made multiple choice items are frequently either ridiculously easy (because none of the distractors distract at all), or they are absolutely unfair because the distractors have faults or the wording is ambiguous.

3.6 Factors Affecting the Total Test

Sometimes, students will experience a fear of failure reaction when they take a test and will not be able to exhibit knowledge and skills that they do possess. This is very likely to happen, even to good students, if the first item on a long test is a very difficult one. Most teachers overcome this by placing a few of the easier items at the beginning of the test.

Another problem which can affect the whole test is the arrangement of the items. Items of the same type should be grouped together. A long test may have, for instance, a set of true-false items, followed by a group of multiple-choice items, and then a couple of short essay questions. Such an arrangement is sound because the true-false items are typically easier for students to react to and they can also be done more quickly than the other items on the test. It is unwise to have items of different types interspersed together because it causes extra confusion for students. The author has seen cases in which a student responded "true" to a short answer item which really required a specific word as the answer.

Sometimes, in the course of trying to sample the same concept or bit of information more than once, a test will give a clue or even provide the answer to an item which comes later in the test. Testing the same bit of information with more than one item has already been established as a good practice, so the test maker must carefully proofread the entire test, while making up the key, to look for this sort of problem. Additionally, misspelled words and poor grammar can change meanings and confuse students.

Mechanically, the test should be very easy to read; quality of reproduction is as important as any other characteristic of a good test. There should be some space separating each question from the ones before and after it. If students are to respond to the items by writing on the actual test copy, then there must be adequate space for the responses. The directions should be clear, concise, and simple to follow--just as with tricky questions, ambiguous instructions confuse and bewilder students. Remember, a poorly worded item can cause students who know the material to wrongfully lose a point or two, but poorly worded directions can result in students missing large sections of the test regardless of their level of knowledge and skill. Instructions should make clear what type of response is expected, what each item's point value is, and the total number of points on the whole test.

4.1 Evaluating Student Laboratory Work

There are several types of student laboratory work, but, for the purpose of simplicity, we will treat three major types (realizing that there may be individual, team, or even large group versions of all three types). The three types discussed here will be: 1) projects, 2) exercises/experiments, and 3) role-playing activities. The first important point in considering the evaluation of student lab work is to determine how the lab work fits into the instructional program and the degree of importance that it has. If, for example, Joan was studying about welding and she had to perform several exercises to learn the fundamentals prior to actually applying her knowledge by welding a couple of joints on her project, what relative importance would the exercises have? Should they be evaluated at all? Should they count more than or less than the project grade? If you did not evaluate them at all, would the students still complete important learning exercises or would they race ahead to work on their projects? What grade should a student receive on welding if that student does a very sloppy job during the learning phase (on the exercises) but shows very good competence when welding the project? How do you judge the quality of the exercise and the project? All of these issues must be dealt with by each teacher in recognition of the unique learning setting that exists in that teacher's laboratory.

4.2 Project Evaluation

There are some crucial differences between projects and exercises. Firstly, projects usually take considerably more time to complete than exercises. Secondly, projects usually require some degree of planning by the student (sometimes they even require designing or redesigning). Lastly, projects are generally complex activities which combine several processes and/or materials together in an integrated way, whereas an exercise could be an isolated event dealing with one process, like the following examples:

1. Make five soldered connections.

2. Typeset "The quick red fox jumps over the lazy brown dog" ten times.

3. Knurl a 3" section on 1" diameter aluminum.

4. Injection mold a small plastic part.

These are all useful learning activities, but none of them involve multiple techniques and none of them make anything nice to take home to show to Mom. The methods used to evaluate projects must take these important features of projects into account--in other words, the finished product is only part of the grade because the process (including planning, following proper procedure, learning value, and other factors) is equally important.


A technique which has been used successfully for years is to have students evaluate their own projects first and then the teacher can review the students' self-evaluations and assign the final grades. The simplest way to provide for this self evaluation by students is to have the students complete a project evaluation form. In fact, it is best to let the students see the form early in the planning and production stages of the project so that they know what you will be looking for when you grade the work. An example project evaluation form for an electronics project appears below. After students have completed the evaluation form, the teacher rates the project on the same factors and assigns a grade based on the total number of points. Students will usually rate themselves fairly--if anything, most students tend to under-rate their own work. This type of self-evaluation form is especially good for helping students to understand their project grades and what you expect of them when they construct projects. Another advantage of this type of form is that it adds objectivity to the evaluation system. Without this type of objectivity, a teacher could become so impressed with a beautiful finish and an attractive design that a project of little learning value which did not even work could be graded very highly because only subjective evaluation was used.

EVALUATION FORM -- ELECTRONIC SIREN

Scale: 5 = Best in class     Rate each of the
       4 = Very Good         following factors
       3 = Good              using this scale.
       2 = OK                Then briefly
       1 = Poor              explain why each
       0 = Fail              rating is given.

Tea.  Stu.

____  ____  1. How does the project look? (e.g., finish, labels, arrangement of switches, craftsmanship on case)

____  ____  2. How well does the project work? (Do all functions operate properly?)

____  ____  3. Design (Are the printed circuit board and enclosure well designed for their functions?)

____  ____  4. Planning (Are drawings and plans neat and complete? Is a plan of procedure included?)

____  ____  5. Learning Value (Was this a valuable learning activity? Did you learn a number of new techniques, processes, or facts by building this project?)

TOTAL POINT VALUE ____          FINAL GRADE = ____
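The tallying step for a form like the electronic siren example can be sketched in a few lines of Python. This is only an illustration: the short factor labels are shorthand for the form's five questions, the ratings are invented, and no conversion from points to a letter grade is shown because the monograph leaves that to the teacher.

```python
# Shorthand labels for the five questions on the form (labels are assumptions).
FACTORS = ["Appearance", "Function", "Design", "Planning", "Learning value"]
MAX_RATING = 5  # the form's scale runs from 0 (Fail) to 5 (Best in class)


def total_points(ratings):
    """Add up one rater's 0-5 ratings, one per factor."""
    if len(ratings) != len(FACTORS):
        raise ValueError("one rating is needed for each factor")
    for rating in ratings:
        if not 0 <= rating <= MAX_RATING:
            raise ValueError("ratings must be between 0 and 5")
    return sum(ratings)


def differences(student, teacher):
    """List the factors where the student's and teacher's ratings disagree."""
    return [(f, s, t) for f, s, t in zip(FACTORS, student, teacher) if s != t]


# Hypothetical ratings for one project.
student = [4, 5, 3, 3, 4]
teacher = [5, 5, 3, 4, 4]

print("Student total:", total_points(student), "of", MAX_RATING * len(FACTORS))
print("Teacher total:", total_points(teacher), "of", MAX_RATING * len(FACTORS))
for factor, s, t in differences(student, teacher):
    print(f"{factor}: student rated {s}, teacher rated {t}")
```

Listing the factors where the two ratings differ gives the teacher a ready starting point for explaining the final grade to the student.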

4.3 Experiments/Exercises

Experiments and exercises are grouped together because they may be evaluated in much the same way and because many learning activities which teachers and laboratory manuals call "experiments" really are exercises--students are following a prescribed set of instructions to complete a task which illustrates a concept or skill. Students are not generally trying to discover new facts and principles; if they were, then the process would be worth more than the finished product just as it is in real science. Exercises are usually completed in a short period of time. Frequently, very valuable exercises can be completed in a single lab period. Exercises are simple; that is, they usually include few steps and processes, and they can often be fairly evaluated via relatively subjective assessment by the teacher. Allowing students to self-evaluate exercises is a good idea but the evaluation form need not be nearly as lengthy as one for projects. Likewise, if a student has made some slight mistakes on an exercise but shows improvement in the later stages of the work, he or she may still deserve a good grade on the exercise; remember that it is learning we wish to encourage, not just the creation of one pretty lettering plate! Many teachers successfully use a go-no-go standard for exercises. Students must complete the exercise to a minimum level of some designated standard or they do it over until they do attain that standard. This position has both merits and problems in practical and philosophical realms. Some activities are better suited to this type of "all or none" test. It is finally up to you (the teacher) to determine what will work best in your situation.

4.4 Role Playing Activities

Role playing activities require some special evaluation techniques which are not very similar to those used for projects or exercises/experiments. For one thing, role playing activities are almost always group endeavors whereas projects and exercises may or may not be. Another real difference is that role plays frequently do not yield a tangible item which can be evaluated after completion of the activity. This means that the most important part of the evaluation must occur DURING the activity rather than after it is over.

As with all other types of evaluation, the key to successful assessment of role playing activities is to eliminate as much subjectivity as possible by breaking the assessment down into small, independent parts and establishing an objective rating for each one. An evaluation form should be created before the beginning of the role play. The form may be very simple; indeed, it will be more useful if it is simple. Students should be informed of what behaviors are desired (and will be rated on the form) before they begin the activity. If possible, try to work the evaluation into the role play so that it makes some logical sense and students can see that, if they perform their roles realistically and in the most appropriate ways, they will receive the highest grades for the activity. An example will help to illustrate the meaning here. Suppose that you are teaching a unit on collective bargaining and one of the activities used will be a role play in which students are divided into management and labor and they are to write a new contract. The old (fictitious) contract is provided. Students should be rewarded for tough-minded, diligent negotiation which includes compromises but also some willingness to hold out for needed revisions in the contract. Their grade could easily be made to depend on:

1. Final contract provisions: Both sides benefit if there is good balance, so everyone would get good grades.

2. Forcefulness of negotiations: If one side or the other were too easily pressed to give up important points, then they would lose something in real life. So in our role play we would take a few points away from their grades.

3. Good faith bargaining: Likewise, being too forceful and not bargaining in good faith may lead to forced arbitration (by the teacher of course!); the penalties of downtime and lost revenues will be paid in the grades of the students.

4. Length of strike or lockout: If a plant is crippled for a long time, everybody suffers. So all grades should be reduced if a strike exceeds a predetermined length of time. This way, students will realize that they have a lot to gain or lose if they strike--just as in life!

When translated into real grades, if everyone just plays along half-heartedly then they all get basic B's or C's. If they risk something by bargaining skillfully and hard (possibly even striking), they stand the chance of making big gains in life, and their grades will go up. If the strike lasts too long or the teacher has to arbitrate the situation then the grades suffer to reflect what happens in the real world.

Some degree of self assessment and/or peer assessment may be used in role playing situations but it is not nearly as effective as it is with projects and exercises because there is no after-the-fact tangible evidence and, since students do not observe the process as carefully as the teacher would, student input in this situation is less likely to be objective or valuable. Many role plays are brief and not as complex as the example above. Simple observation by the teacher is usually adequate for such activities, but it should be done in terms of some preset standard or objective and the assessment should be made while the play is being conducted--observation notes should be recorded--so there will be at least some degree of objectivity. Videotaping role plays could be a good way to facilitate increased objectivity and student self-evaluation. It could also yield a tangible product, if desired.

5.1 Final Grades

There are four to six extremely trying days each year for all teachers. They are those tense days in which teachers must decide on grades to report to parents. Teachers worry, study, second guess, and gnash their teeth during those days. How do you assign grades fairly? What role do grades really have?

The purpose of grades is to record student achievement and report it to parents. This is a simple and justifiable procedure which seemingly should not be so tumultuous or value-laden. But there are some problems. Grades are typically reported on a qualitative basis in steps (A, B, C, etc.) and there is little provision, if any, in most school systems for telling parents that the student is on the borderline between two of these lock steps. What is worse is that many school systems have policy statements that require teachers to grade on some predetermined number scale, e.g., 94% to 100% = A, 86% to 93% = B, and so forth. Such scales have the appearance of establishing high standards for the school district, but they are absolutely indefensible in terms of educational measurement. These policies ignore the fact that teachers do not have the time, resources, or even the ability to create and standardize their tests like the SAT and other large scale tests. For a teacher to say that a score of 90% on his or her teacher-made test is going to be a B regardless of who gets it or in what setting, is ridiculous--but this is what the policies of many school districts require. The best defense that a teacher has against such policies forcing the assignment of grossly inaccurate and meaningless grades is to have adequate records of all the work students have done in the class and to base the grade on the total picture rather than on test grades alone. Actually, this is just good practice anyway, regardless of the policies of the school system. Recall that this principle was illustrated in sections 1.1 through 1.3 of this monograph. The student's term or final grade should reflect his or her total growth and skill development in the course and should not be based solely on one type of evaluation instrument.

Balanced evaluation techniques which truly reflect the course and which provide ample opportunities for students to exhibit acquired skills and knowledge in a variety of settings will yield the most valid assessment of student achievement. When all of the students' work has really been considered, at levels which really reflect the relative importance and emphasis given to each type of work in the course, then the teacher should very objectively assign the grades according to the scores without placing too much trust in hunches (or the opinions of others) concerning standards being too low or too high. Only the individual teacher knows the situation and expectations in his or her classroom. Only the individual teacher can determine how grades should be assigned in a particular class. This is precisely why it is so important for the teacher to develop a weighting scale which assigns relative importance to all items that will be evaluated. Students should be aware of the importance to be placed on each item. Item importance should reflect course goals and objectives. Group projects and role plays should usually be of less importance than individual work because they provide too much opportunity for weak and lazy students to benefit from the work of their peers. Some cooperation factor (percentage of grade for class participation, cleanup, and attitude) is justified in these laboratory classes but it should be less than ten percent of the total grade and it must be justified with some tangible records--teacher memory and informal observation will simply not do!

Here is an example of a weighting scale for a six-week grade period in a typical Manufacturing Technology course:

Cooperation & Cleanup                              10%
Experiment (wind tunnel test rockets)               5%
Rocket (individual project)                        20%
Role Play (employment interview)                   10%
Mass Production Project (phase 1 activities)       10%
Test Number 1                                      10%
Test Number 2                                      10%
Test Number 3                                      10%
Quiz Average                                        5%
Technical Report                                   10%

This weighting scale allots about equal importance to "doing" things and written work. The big individually completed project is the most important item on the whole scale. There are three two-week tests rather than one or two big ones. Quizzes were also given, but they do not count very much--they were probably given mainly to keep students "on their toes" with evaluation as a minor secondary motive. There is also some written work which was not sampled in a tense test situation (the Technical Report). The teacher using this weighting scale should be able to safely assume that students were given adequate forums and formats in which to demonstrate their knowledge and skills.
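To show how such a weighting scale turns individual scores into a single period grade, here is a minimal Python sketch. The weights are the ones listed in the example scale; the student's scores (each on a 0 to 100 basis) are invented for illustration, and the function name is an assumption.

```python
# Weights, in percent, taken from the example weighting scale.
WEIGHTS = {
    "Cooperation & Cleanup": 10,
    "Experiment": 5,
    "Rocket": 20,
    "Role Play": 10,
    "Mass Production Project": 10,
    "Test Number 1": 10,
    "Test Number 2": 10,
    "Test Number 3": 10,
    "Quiz Average": 5,
    "Technical Report": 10,
}
assert sum(WEIGHTS.values()) == 100  # the scale must account for the whole grade


def weighted_grade(scores):
    """Combine 0-100 scores into one 0-100 grade using WEIGHTS."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"no score recorded for: {sorted(missing)}")
    return sum(WEIGHTS[item] * scores[item] for item in WEIGHTS) / 100


# Hypothetical scores for one student.
scores = {
    "Cooperation & Cleanup": 90,
    "Experiment": 80,
    "Rocket": 95,
    "Role Play": 85,
    "Mass Production Project": 88,
    "Test Number 1": 78,
    "Test Number 2": 84,
    "Test Number 3": 90,
    "Quiz Average": 70,
    "Technical Report": 92,
}
print(f"Weighted grade: {weighted_grade(scores):.1f}")
```

For these invented scores the weighted grade works out to 87.2. Because the quiz average carries only 5 percent, the low quiz score costs this student little, which matches the intent described above.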

Why Go To So Much Trouble?--A Brief Summary

If teachers will be systematic, will inform students adequately of expected behavior, will evaluate often instead of all at the end, will keep in touch with parents and counselors when problems arise, will carefully write their tests, will base evaluation on many different types of student work, will devise forms which force some degree of objectivity into even the most subjective aspects of evaluation, will continue to revise evaluation instruments/techniques, and will provide students with multiple means of exhibiting learned behaviors, then they will be doing the best that they can do to fairly evaluate their students. Will they ever make mistakes? Yes, even with all this attention, mistakes will be made--but they should be rarer with this systematic approach. Will they still feel tense and pressured at grade time? If they really are good teachers who truly care about their students, they will, but they will be able to honestly say that they are doing the best that they can do--the best that it is possible to do. Will the evaluation process become easier? No, it will become more accurate; it will become less haphazard; it will become more of a constant undertaking rather than a sporadically occurring major job, but it will not become less time consuming or easier. Evaluation which is done properly requires a lot of the teacher's time. It is just like the television commercial about the automobile oil filter in which the mechanic says "you can pay me now [a few dollars for the filter] or you can pay me later [several hundred dollars for an overhaul]." Evaluation is going to take time. The choice is whether to plan it carefully and invest some time "up front" to achieve good evaluations or to wait and try to do the whole job at the last minute and settle for very rough, unsupported guesses about student achievement. Your amount of time invested will be substantial either way. You might just as well get what you pay for.