Test Item Analysis Using Microsoft Excel Spreadsheet Program
by Chris Elvin

Introduction

This article is written for teachers and researchers whose budgets are limited and who do not have access to purpose-built item analysis software such as Iteman (2003). It describes how to organize a computer spreadsheet such as Microsoft Excel in order to obtain statistical information about a test and the students who took it.

Using a fictitious example for clarity, as well as a real example of a personally written university placement test, I will show how the information in a spreadsheet can be used to refine test items and make judicious placement decisions. The web addresses for accessing the sample Excel files for the class of fictitious students are included in the references (Elvin, n.d.).

Background

I had been teaching in high schools in Japan for many years, and upon receiving my first university appointment, I was eager to make a good impression. My first task was to prepare a norm-referenced placement test to separate approximately one hundred first-year medical students into ten relative levels of proficiency and place them into appropriate classes. This would allow teachers to determine appropriate curricular goals and adjust their teaching methodology more closely to students' personal needs. It was also hoped that a more congenial classroom atmosphere, with less frustration or boredom, would enhance motivation and engender a true learning environment.

The course was called Oral English Communication, so the placement test needed to measure this construct.
However, since time restricted us to no more than half an hour for administering the test, a spoken component was ruled out. The test had to be listening only, and to ensure reliability, it needed as many questions as possible. I decided I could only achieve this by including as many rapid-fire questions as the time constraint allowed. For the test to be valid, I focused on points that one might expect to cover in an oral English course for 'clever' first-year university students. It was not possible to meet the students beforehand, so I estimated their level based on my experience of teaching advanced-level senior high school students.

Organizing the Spreadsheet: Part A

To show briefly how I compiled my students' test scores, I have provided here the results of a fabricated ten-item test taken by nine fictitious students (see Table 1; to download a copy of this file, see Elvin, n.d.). The purpose of this section of the spreadsheet is primarily to determine what proportion of the students answered each item, how many answered correctly, and how efficient the distractors were. It also helps the instructor prepare for item discrimination analysis in a separate part of the spreadsheet.
Table 1. Fabricated 10-Item Test, Part A: Actual Letter Choices
(Spreadsheet rows 1-16; column A holds the student ID, column B the name, and columns C-L items 1 to 10. A dash marks an unanswered item.)

Row  A (ID)   B (Name)    C     D     E     F     G     H     I     J     K     L
1             Item:       1     2     3     4     5     6     7     8     9     10
2    200201   Arisa       D     A     A     B     C     D     A     A     B     D
3    200202   Kana        A     C     D     A     D     C     B     C     A     A
4    200203   Saki        D     B     D     B     C     D     D     A     B     A
5    200204   Tomomi      A     B     B     A     C     C     C     A     D     D
6    200205   Natsumi     C     B     A     B     C     D     B     C     C     D
7    200206   Haruka      C     B     A     B     C     D     A     A     B     C
8    200207   Momo        -     C     B     D     D     B     A     A     C     B
9    200208   Yuka        B     D     B     C     C     D     D     B     D     B
10   200209   Rie         C     B     A     B     C     B     A     A     B     C
11   CORRECT ANSWER       C     B     A     B     C     D     A     A     B     C
12   A                    0.22  0.11  0.44  0.22  0.00  0.00  0.44  0.67  0.11  0.22
13   B                    0.11  0.56  0.33  0.56  0.00  0.22  0.22  0.11  0.44  0.11
14   C                    0.33  0.22  0.00  0.11  0.78  0.22  0.11  0.22  0.22  0.22
15   D                    0.22  0.11  0.22  0.11  0.22  0.56  0.22  0.00  0.22  0.44
16   TOTAL                0.89  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00

What proportion of students answered the question?

It may be expected that for a multiple-choice test, all of the students would answer all of the questions. In the real world, this is rarely true. The quality of a test item may be poor, or there may be psychological, environmental, or administrative factors to take into consideration.
To try to identify these potential sources of measurement error, I calculate the ratio of students answering each question to students taking the test. In cell C16 of the Excel spreadsheet, the formula bar reads =SUM(C12:C15), which adds the proportions of students answering A, B, C, and D respectively. One student didn't answer question 1 (cell C8), so the total for this item is eight out of nine, which is 0.89. Perhaps she was feeling a little nervous, or she couldn't hear well because of low speaker volume or noise in the classroom.
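For readers who prefer a script to a spreadsheet, the same response-rate check takes only a few lines of code. The following Python sketch is not part of the original article; it transcribes the Table 1 data (a space marks Momo's unanswered item 1) and reproduces the TOTAL row, flagging item 1 at 0.89.

    # Response-rate check: the share of test takers who chose any option
    # for each item (the spreadsheet's =SUM(C12:C15) TOTAL row).
    # Data transcribed from Table 1; a space marks an unanswered item.
    answers = [
        "DAABCDAABD",  # Arisa
        "ACDADCBCAA",  # Kana
        "DBDBCDDABA",  # Saki
        "ABBACCCADD",  # Tomomi
        "CBABCDBCCD",  # Natsumi
        "CBABCDAABC",  # Haruka
        " CBDDBAACB",  # Momo (item 1 left blank)
        "BDBCCDDBDB",  # Yuka
        "CBABCBAABC",  # Rie
    ]
    n = len(answers)
    for item in range(10):
        answered = sum(a[item] in "ABCD" for a in answers)
        print(f"Item {item + 1}: {answered / n:.2f} answered")
    # Item 1 prints 0.89 (8 of 9 students); all other items print 1.00.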
The point is, if it is possible to determine what was responsible for a student or students not answering, then it may also be possible to rectify it. In some cases, a breakdown of questions on a spreadsheet can contribute to the discovery of such problems.

What proportion of students answered the question correctly?

For question 1, the correct answer is C, as shown in cell C11. The proportion of students who chose C is shown in cell C14. To calculate this value, we use the COUNTIF function.
In cell C14, the formula bar reads =COUNTIF(C2:C10,"C")/9, which counts every cell from C2 to C10 containing the answer C and divides that count by the number of test takers, which is nine for this test. This value is also the item facility for the question, which will be discussed in more detail later in this paper.

How efficient were the distractors?

We use the same function, COUNTIF, to find the proportion of students who chose each incorrect option. For item 5, cell G12 reads =COUNTIF(G2:G10,"A")/9 in the formula bar. The two other distractors are B, shown in cell G13 (=COUNTIF(G2:G10,"B")/9), and D, shown in cell G15 (=COUNTIF(G2:G10,"D")/9). For this question, seven students answered correctly (answer C), and two students answered incorrectly by choosing D. If this test were real and had many more test takers, I would want to find out why A and B were not chosen, and I would consider rewriting this question to make all three distractors equally attractive.
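The same counts are easy to script. This Python sketch, again not from the original article, mirrors the COUNTIF formulas: for each item it computes the facility (the proportion choosing the keyed answer) and the share choosing each distractor, reproducing rows 12-15 of Table 1.

    # Item facility and distractor efficiency: the script equivalent of
    # =COUNTIF(range,"X")/9. Data and key transcribed from Table 1.
    key = "CBABCDAABC"
    answers = [
        "DAABCDAABD", "ACDADCBCAA", "DBDBCDDABA", "ABBACCCADD",
        "CBABCDBCCD", "CBABCDAABC", " CBDDBAACB", "BDBCCDDBDB",
        "CBABCBAABC",
    ]
    n = len(answers)
    for item, correct in enumerate(key):
        choices = [a[item] for a in answers]
        facility = choices.count(correct) / n
        distractors = {o: round(choices.count(o) / n, 2)
                       for o in "ABCD" if o != correct}
        print(f"Item {item + 1}: IF = {facility:.2f}, "
              f"distractors = {distractors}")
    # Item 5 shows IF = 0.78 with D at 0.22 and A, B at 0.00:
    # two distractors going unused, as discussed above.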
Preparing for item discrimination analysis in a separate part of the spreadsheet

Part A of the spreadsheet shows the letter choices students made in answering each question. In Part B of the spreadsheet, I score and rank the students, and analyze the test and test items numerically.

Organizing the Spreadsheet: Part B

The area C22:L30 in part B of the spreadsheet (see Table 2; also Elvin, n.d.) refers, by absolute cell reference, to the values of C2:L10 in part A of the spreadsheet in Table 1. This means that even after sorting the students by total score in part B, the new positions of the ranked students will still refer to their actual letter choices in part A. Absolute cell references, unlike relative cell references, do not adjust when copied and pasted, so each one has to be typed in manually.
It is therefore much quicker to make a linked worksheet with copied and pasted relative cell references. The linked worksheet can then be sorted without fear of automatic recalculation, which would happen if working within the same worksheet using relative references. For the actual test, I used a linked worksheet. If you would like to see a linked file, a copy of one is available for download from my website (see Elvin, n.d.).

The purposes of part B of the spreadsheet are to score and rank the students, to calculate the facility and discrimination of each item, and to estimate the test's reliability, average, standard deviation, and standard error of measurement.

Table 2. Fabricated 10-Item Test, Part B: Scoring and Ranking of Students
(Spreadsheet rows 21-34; columns C-L mirror the letter choices of part A by absolute reference, column M holds the total score, and columns M-N of rows 31-34 hold the whole-test statistics.)

Row  A (ID)   B (Name)    1     2     3     4     5     6     7     8     9     10    TOTAL
22   200206   Haruka      C     B     A     B     C     D     A     A     B     C     10
23   200209   Rie         C     B     A     B     C     B     A     A     B     C     9
24   200201   Arisa       D     A     A     B     C     D     A     A     B     D     7
25   200203   Saki        D     B     D     B     C     D     D     A     B     A     6
26   200205   Natsumi     C     B     A     B     C     D     B     C     C     D     6
27   200204   Tomomi      A     B     B     A     C     C     C     A     D     D     3
28   200207   Momo        -     C     B     D     D     B     A     A     C     B     2
29   200208   Yuka        B     D     B     C     C     D     D     B     D     B     2
30   200202   Kana        A     C     D     A     D     C     B     C     A     A     0
31   IF total             0.33  0.56  0.44  0.56  0.78  0.56  0.44  0.67  0.44  0.22  Reliability  0.87
32   IF upper             0.67  0.67  1.00  1.00  1.00  0.67  1.00  1.00  1.00  0.67  Average      5.00
33   IF lower             0.00  0.00  0.00  0.00  0.33  0.33  0.33  0.33  0.00  0.00  SD           3.43
34   ID                   0.67  0.67  1.00  1.00  0.67  0.33  0.67  0.67  1.00  0.67  SEM          1.21
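The IF upper, IF lower, and ID rows of Table 2 can also be reproduced with a short script. In this Python sketch, which is my illustration rather than the article's method, the upper and lower groups are the top and bottom thirds of the ranked students (three of nine each, which matches the values in rows 32-34), and item discrimination is IF(upper) minus IF(lower).

    # Item discrimination: facility in the top third minus facility in the
    # bottom third of students ranked by total score (rows 32-34 of Table 2).
    key = "CBABCDAABC"
    answers = [
        "DAABCDAABD", "ACDADCBCAA", "DBDBCDDABA", "ABBACCCADD",
        "CBABCDBCCD", "CBABCDAABC", " CBDDBAACB", "BDBCCDDBDB",
        "CBABCBAABC",
    ]

    def score(a):
        return sum(a[i] == c for i, c in enumerate(key))

    ranked = sorted(answers, key=score, reverse=True)
    third = len(answers) // 3                      # 3 of 9 students
    upper, lower = ranked[:third], ranked[-third:]
    for item, correct in enumerate(key):
        if_upper = sum(a[item] == correct for a in upper) / third
        if_lower = sum(a[item] == correct for a in lower) / third
        print(f"Item {item + 1}: IF upper = {if_upper:.2f}, "
              f"IF lower = {if_lower:.2f}, ID = {if_upper - if_lower:.2f}")
    # Items with low ID (e.g., item 6 at 0.33) separate strong and weak
    # students poorly and are candidates for revision.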

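The surviving text does not show which reliability formula the spreadsheet uses, but Kuder-Richardson 20 computed from the item facilities and the population variance of the total scores reproduces the 0.87 in Table 2, and the standard error of measurement then follows as SD * sqrt(1 - reliability), the statistic discussed by Dudek (1979). The Python sketch below is my own illustration under those assumptions.

    import math

    # Reliability (KR-20) and SEM for the Table 2 data. KR-20 is an
    # assumption here: the article does not name its reliability formula,
    # but KR-20 with the population score variance yields the 0.87 shown.
    key = "CBABCDAABC"
    answers = [
        "DAABCDAABD", "ACDADCBCAA", "DBDBCDDABA", "ABBACCCADD",
        "CBABCDBCCD", "CBABCDAABC", " CBDDBAACB", "BDBCCDDBDB",
        "CBABCBAABC",
    ]
    n, k = len(answers), len(key)
    scores = [sum(a[i] == c for i, c in enumerate(key)) for a in answers]
    mean = sum(scores) / n
    var_pop = sum((s - mean) ** 2 for s in scores) / n        # population variance
    sd_sample = math.sqrt(sum((s - mean) ** 2 for s in scores) / (n - 1))

    pq = 0.0                        # sum of p*(1-p), p = each item's facility
    for i, c in enumerate(key):
        p = sum(a[i] == c for a in answers) / n
        pq += p * (1 - p)
    kr20 = (k / (k - 1)) * (1 - pq / var_pop)
    sem = sd_sample * math.sqrt(1 - kr20)
    print(f"average = {mean:.2f}, SD = {sd_sample:.2f}, "
          f"reliability = {kr20:.2f}, SEM = {sem:.2f}")
    # Prints average 5.00, SD 3.43, reliability 0.87; SEM comes out near
    # 1.2 (Table 2 shows 1.21; the difference is a rounding choice).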
A statistical analysis of the 2003 50-item test showed a great improvement over the previous year (see Table 3), with just ten items now falling outside the criteria guidelines. The average score of the 2003 test was very close to the ideal, but the reliability was still not as good as it should have been. Despite this, we were still able to identify and make provision for the highest- and lowest-scoring students, and feedback from all classes has thus far been generally very positive.

Conclusion

I plan to extend my database of acceptable test items to employ in developing the test for 2004.
The reliability should improve once the bad items are replaced with acceptable ones, and distractor efficiency analysis may help to pinpoint which acceptable items can be modified further. My main concern, however, is the very small standard deviation. If it remains stubbornly small, we may have to conclude that our students are simply too homogeneous to be streamed effectively, and that may ultimately force us to reconsider establishing mixed-ability classes.

References

Brown, J. D. (1996). Testing in language programs. Upper Saddle River, NJ: Prentice Hall.
Dudek, F. J. (1979). The continuing misinterpretation of the standard error of measurement. Psychological Bulletin, 86, 335-337.
Ebel, R. L. (1979). Essentials of educational measurement (3rd ed.). Englewood Cliffs, NJ: Prentice Hall.
Elvin, C. (n.d.). Elvinsdata.xls [Online]. Retrieved September 26, 2003.
Elvin, C. (n.d.). Elvinsoutput.xls [Online]. Retrieved September 26, 2003.
Hatch, E., & Lazaraton, A. (1991). The research manual: Design and statistics for applied linguists. Boston, MA: Heinle & Heinle.
Iteman (Version 3.6) [Computer software]. (2003). St. Paul, MN: Assessment Systems Corporation.
Spearman, C. (1904). General intelligence, objectively determined and measured. American Journal of Psychology, 15, 201-293.

Chris Elvin has a master's degree in education from Temple University, Japan.
He is the current programs chair of the JALT Materials Writers special interest group, and a former editor of The School House, the JALT junior and senior high school SIG newsletter. He is the author of an oral communication coursebook published by EFL Press, and the owner and webmaster of an English language learning website dedicated to young learners.
He currently teaches at Tokyo Women's Medical University, Soka University, Caritas Gakuen, and St. Dominic's Institute. His research interests include materials writing, classroom language acquisition, and learner autonomy.
Sources

There exist many free tools developed by researchers and educators. Important websites for free psychometric software include:

- CASMA at the University of Iowa, USA
- REMP at the University of Massachusetts, USA
- Software from Harold Doran
- Software from Brad Hanson
- Software from John Uebersax
- Software from J. Patrick Meyer
- Software directory at the Institute for Objective Measurement
- Software from Lihua Yao
- Software from Larry Nelson
- CRAN Task View: Psychometric Models and Methods

Classical Test Theory

Classical test theory is an approach to psychometric analysis that has weaker assumptions than item response theory and is more applicable to smaller sample sizes.

CITAS

CITAS (Classical Item and Test Analysis Spreadsheet) is a free Excel workbook designed to provide scoring and statistical analysis of classroom tests.
Item responses (ABCD) and keys are typed or pasted into the workbook, and the output populates automatically; unlike some other programs, CITAS does not require any "running" or prior experience in psychometric analysis, making it accessible to school teachers and professors.

jMetrik

jMetrik is free and open-source software for conducting a comprehensive psychometric analysis. It was developed by J. Patrick Meyer at the University of Virginia. Current methods include classical item analysis, differential item functioning (DIF) analysis, confirmatory factor analysis, item response theory, IRT equating, and nonparametric item response theory. The item analysis includes proportion, point biserial, and biserial statistics for all response options.
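As an illustration of the point-biserial statistic that such item analyses report (this is the standard textbook formula, not code taken from jMetrik), the correlation between a dichotomous item score and the total score can be computed directly. All data below are hypothetical.

    import math

    # Point-biserial correlation between one item (scored 0/1) and totals:
    # r = (M1 - M0) / SD * sqrt(p * (1 - p)), where M1 and M0 are the mean
    # totals of those who got the item right and wrong, p is the item
    # facility, and SD is the population standard deviation of the totals.
    def point_biserial(item_scores, totals):
        n = len(totals)
        right = sum(item_scores)
        p = right / n
        m1 = sum(t for x, t in zip(item_scores, totals) if x == 1) / right
        m0 = sum(t for x, t in zip(item_scores, totals) if x == 0) / (n - right)
        mean = sum(totals) / n
        sd = math.sqrt(sum((t - mean) ** 2 for t in totals) / n)
        return (m1 - m0) / sd * math.sqrt(p * (1 - p))

    item = [1, 0, 1, 1, 0, 1, 0, 0, 1]      # hypothetical 0/1 item scores
    totals = [9, 3, 7, 8, 2, 9, 4, 3, 8]    # hypothetical total scores
    print(f"point-biserial = {point_biserial(item, totals):.2f}")
    # A high value indicates an item that discriminates strongly between
    # high- and low-scoring examinees.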
Reliability coefficients include Cronbach's alpha, Guttman's lambda, the Feldt-Gilmer coefficient, the Feldt-Brennan coefficient, decision consistency indices, the conditional standard error of measurement, and reliability if item deleted. The DIF analysis is based on nonparametric item characteristic curves and the Mantel-Haenszel procedure. DIF effect sizes and ETS DIF classifications are included in the output.
Confirmatory factor analysis is limited to the common factor model for congeneric, tau-equivalent, and parallel measures. Fit statistics are reported along with factor loadings and error variances.
IRT methods include the Rasch, partial credit, and rating scale models. IRT equating methods include mean/mean, mean/sigma, Haebara, and Stocking-Lord procedures. jMetrik also includes basic descriptive statistics and a graphics facility that produces bar charts, pie charts, histograms, kernel density estimates, and line plots. jMetrik is a pure Java application that runs on 32-bit and 64-bit versions of Windows, Mac, and Linux operating systems; it requires Java 1.6 on the host computer.

Iteman

Iteman is a commercial program specifically designed for classical test analysis, producing rich text format (RTF) reports with graphics, narratives, and embedded tables. It calculates the proportion and point biserial of each item, as well as high/low subgroup proportions, and detailed graphics of item performance.
It also calculates typical descriptive statistics, including the mean, standard deviation, reliability, and standard error of measurement, for each domain and for the overall test. It is available only from Assessment Systems Corporation.

Lertap

Lertap5 (the fifth version of the Laboratory of Educational Research Test Analysis Program) is a comprehensive software package for classical test analysis developed for use on Windows and Macintosh computers with Microsoft Excel. It includes test, item, and option statistics; classification consistency and mastery test analysis; procedures for cheating detection; and extensive graphics (e.g., trace lines for item options, conditional standard errors of measurement, boxplots of group differences, histograms, and scatterplots). DIF (differential item functioning) analysis is supported in the Excel 2010, Excel 2013, Excel 2016, and Excel 2019 versions of Lertap5.
References

Rost, J. (1990). Rasch models in latent classes: An integration of two approaches to item analysis. Applied Psychological Measurement, 14, 271-282.
von Davier, M., & Rost, J. (1995). Polytomous mixed Rasch models. In G. H. Fischer & I. W. Molenaar (Eds.), Rasch models: Foundations, recent developments, and applications. New York: Springer.
Rasch dichotomous model vs. one-parameter logistic model. (2005). Rasch Measurement Transactions, 19(3), 1032.
Yao, L. (2012). Multidimensional CAT item selection methods for domain scores and composite scores: Theory and applications. Psychometrika, 77(3), 495-523.
Loyd & Hoover (1980); Marco (1977); Linn, Levine, Hastings, & Wardrop (1981).
Haebara, T. (1980). Equating logistic ability scales by a weighted least squares method. Japanese Psychological Research, 22, 144-149.
Stocking, M. L., & Lord, F. M. (1983). Developing a common metric in item response theory. Applied Psychological Measurement, 7, 201-210.
Lord, F. M. (1980). Applications of item response theory to practical testing problems. Mahwah, NJ: Lawrence Erlbaum Associates.
Han, K. (2007). WinGen: Windows software that generates IRT parameters and item responses. Applied Psychological Measurement, 31, 457-459.
CRAN Task View: Psychometric Models and Methods.
Stata's IRT manual (online).