Developing authentic assessment instrument based on multiple representations to measure students' critical thinking skills

The purpose of this study was to produce a test instrument for measuring students' critical thinking skills in natural science learning. This research used the 4-D development model (define, design, develop, and disseminate), involving 118 students at the develop stage and 60 students at the disseminate stage. The instrument developed was an essay test based on multiple representations. Content validity was established using the CVI, and reliability was estimated using Item Response Theory. The results showed that the instrument had a very good validity value. This is reflected in the Aiken V scores on the aspects of substance, construct, language, and appearance, which were about 1.00, 1.00, 1.00, and 1.00, respectively. According to the Rasch analysis, the 14-item instrument met the unidimensionality assumption, the local independence assumption, and the parameter invariance assumption. Based on the OUTFIT MNSQ values, the items fit the 1-PL Partial Credit Model (PCM), which indicates that they function normally in measurement. The estimated reliability shows a very high measurement consistency for items (0.97) and a high consistency for persons (0.86). The measurement of students' CTS showed an average score of 65.50, with a distribution of high, medium, and low abilities of about 16.67%, 63.33%, and 20.00%, respectively. Thus, according to these results, an authentic assessment based on multiple representations is suitable for measuring students' critical thinking skills.


Introduction
The globalization era has developed in several fields such as economics, politics, technology, and education. In the field of education in particular, schools are struggling to improve the quality of learning. The quality of learning is determined by, among other things, the quality of the assessment carried out by the teacher in the learning process. Assessment is one of the most important activities in the learning process; Maba (2017) stated that assessment and learning are two things that cannot be separated. This is in accordance with the Permendikbud, which calls for tests in which students communicate representations and use them to model and interpret phenomena. Studies by Abdurrahman et al. (2019), Susilaningsih et al. (2019), and Yanti et al. (2019) showed that multiple representations can improve the implementation of the learning process, from planning through student activities during learning, elicit very good student interest, and provide positive results for students' social skills. Other studies show that multiple representations can improve students' problem-solving ability (Prahani et al., 2016), increase creativity in analyzing tests (Mutia & Prasetyo, 2018), and help students stimulate the development of thinking skills with various perspectives and approaches (Fonna & Mursalin, 2018).
Multiple representation tests to measure critical thinking skills need to be developed for each material. However, only a few materials in science subjects can accommodate the needs of such a test. One material that can support this type of assessment is environmental change. Observations were made at one of the junior high schools in Sleman district, namely MTs Negeri 2 Sleman. Based on observations of science learning, it was found that formative assessment of the cognitive aspects rarely used higher-order thinking instruments. Additional information obtained was that no research had previously been carried out on the development of an authentic assessment instrument based on multiple representations in science learning to measure students' critical thinking skills. Therefore, this study aims to produce a test instrument for measuring students' critical thinking skills in natural science learning.

Method
This study focuses on measuring the utility of a multiple representation-based critical thinking skills test instrument. The instrument consists of 14 questions, with each indicator represented by 2 questions developed from the modified critical thinking skills indicators. This research used the 4-D development model (define, design, develop, and disseminate), involving 118 students at the develop stage and 60 students at the disseminate stage. The instrument developed was an essay test based on multiple representations. The content validity of the test instrument was analyzed using the Content Validity Index, and the empirical validity and reliability of the test instrument were estimated using Item Response Theory with the Winsteps program.

Instruments and Procedures
The data obtained were as follows: (1) qualitative data from expert validators in the form of comments and suggestions on the authentic assessment instrument for critical thinking skills; and (2) quantitative data, namely (a) the assessment scores given by the expert validators of the authentic assessment instrument of critical thinking skills, and (b) the scores of student learning outcomes on environmental change material using the authentic assessment instrument for critical thinking skills. Data were collected from July to August 2020. The research was conducted at MTs Negeri 2 Sleman, which is located at Jalan Magelang km 17, Margorejo, Tempel, Sleman, Daerah Istimewa Yogyakarta. Data collection was carried out during the preparation of the assessment instrument as well as during the learning assessment process in the classroom, including: (1) testing the appropriateness of the developed authentic assessment instrument for critical thinking skills through validation by expert validators; (2) collecting data on students' cognitive learning outcomes using the authentic assessment instrument for critical thinking skills after learning the environmental change material; and (3) observing the achievement of students' critical thinking skills in the cognitive aspects after using the developed authentic assessment instrument.

Data Analysis
The data collected from the instruments were analyzed as follows:

Analysis of Validity
Content validity is validity that is estimated by examining the test content through rational analysis or professional judgment (Gardner & Dunkin, 2018). The data from the assessments by expert validators on the validation sheet of the assessment instrument were analyzed to determine the content validity of the developed authentic assessment instrument. In this study, the content validity of the critical thinking skills assessment instrument was analyzed using the Content Validity Ratio (CVR) and the Content Validity Index (CVI). According to Lawshe (1975), the CVR is a content validity approach for determining the suitability of items to the measured domain based on expert judgment. The CVR was obtained from a number of experts (a panel) who were asked to examine each component of the measurement instrument. The analysis technique is as follows: (1) converting the scores given by the validators into assessment index values using a conversion table; (2) calculating the CVR value; (3) calculating the Content Validity Index (CVI) value; and (4) categorizing the CVR and CVI results, which lie in the range -1 to +1 (Lawshe, 1975).
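The CVR and CVI computations described above can be sketched as follows. This is an illustrative implementation of Lawshe's formulas only, not the authors' actual analysis script:

```python
# Lawshe's Content Validity Ratio (CVR) and Content Validity Index (CVI).
# n_essential = number of panel experts rating an item "essential";
# n_total = panel size. The CVI is the mean CVR across all items.

def cvr(n_essential: int, n_total: int) -> float:
    """Lawshe (1975): CVR = (n_e - N/2) / (N/2); ranges from -1 to +1."""
    half = n_total / 2
    return (n_essential - half) / half

def cvi(essential_counts, n_total: int) -> float:
    """CVI: the mean of the per-item CVR values."""
    ratios = [cvr(n, n_total) for n in essential_counts]
    return sum(ratios) / len(ratios)

# With a 7-expert panel, an item passes the 0.99 critical value only
# when every expert marks it essential, giving CVR = 1.00.
print(cvr(7, 7))         # 1.0
print(cvi([7] * 14, 7))  # 1.0
```

With seven experts, as in this study, the formula makes the 0.99 criterion equivalent to unanimous agreement, which is why every retained item reports a CVR of 1.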

Empirical validity Analysis
According to Kvale (1989), empirical validity is validity obtained from experience by means of testing. Empirical validity is obtained through the results of test trials with respondents. In this study, the empirical validity of the critical thinking skills instrument was analyzed using the Winsteps 3.37 program with Rasch modeling, Georg Rasch's development of one-logistic-parameter (1-PL) item response theory. Item fit with the Rasch model can explain whether an instrument item functions normally in making measurements or not. Item fit analysis provides a quality-control technique for assessing the validity of test items and person responses (Wright & Stone, 1988). Boone et al. (2014) added that the criteria used to check the suitability of instrument items to the model are the values of the OUTFIT Mean Square (MNSQ), the OUTFIT Z-standard (ZSTD), and the Point Measure Correlation (Pt Mean Corr).
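The actual fit analysis in this study was run in Winsteps with a partial credit model; purely as an illustration, the sketch below shows how the OUTFIT MNSQ statistic is formed in the simpler dichotomous Rasch (1-PL) case. The ability (`theta`) and difficulty (`b`) values are hypothetical:

```python
import math

def rasch_p(theta: float, b: float) -> float:
    """Dichotomous Rasch model: probability of a correct response
    for a person of ability theta on an item of difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def outfit_mnsq(responses, thetas, b: float) -> float:
    """Outfit mean-square for one item: the unweighted mean of the
    squared standardized residuals across persons. Values near 1.0
    indicate fit; 0.5-1.5 is the acceptance range cited in the text."""
    z_squared = []
    for x, theta in zip(responses, thetas):
        p = rasch_p(theta, b)
        z_squared.append((x - p) ** 2 / (p * (1 - p)))
    return sum(z_squared) / len(z_squared)

# Hypothetical data: three persons answering one item of difficulty 0.
fit = outfit_mnsq(responses=[1, 0, 1], thetas=[1.0, -1.0, 0.0], b=0.0)
```

Winsteps reports this statistic per item; an item whose responses are more erratic than the model predicts drifts above 1.5, which is the condition flagged for revision later in the paper.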

Analysis of reliability
According to Lester et al. (2014), reliability means the extent to which the results of a measurement can be trusted. A measurement result can be trusted if the results obtained are relatively the same across several administrations. The reliability analysis of the test instrument was carried out with the help of the Winsteps 3.37 program. The Winsteps program provides instrument reliability information, namely the person separation index and the item separation index, as well as the Cronbach's Alpha value, which describes the interaction between persons and items (Fariña et al., 2019). Yanto (2019) states that the higher the item reliability, the more precisely the items as a whole are analyzed according to the model used. The instrument can be said to be reliable if it has a Cronbach's Alpha value > 0.7.
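As a minimal sketch of how the Cronbach's Alpha criterion above is computed (illustrative only; the study's values come from Winsteps):

```python
def cronbach_alpha(scores):
    """Cronbach's alpha for a persons-by-items score matrix:
    alpha = k/(k-1) * (1 - sum(item variances) / variance of totals),
    using population (biased) variances."""
    k = len(scores[0])  # number of items

    def var(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    item_vars = [var([row[i] for row in scores]) for i in range(k)]
    total_var = var([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

# Hypothetical data: four persons, two perfectly consistent items.
alpha = cronbach_alpha([[1, 1], [0, 0], [1, 1], [0, 0]])
```

Against the > 0.7 threshold stated above, an instrument with alpha of, say, 0.71 or 0.81 (as reported later in the paper) counts as reliable.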

Analysis of level difficulty
The difficulty level of the instrument items can be obtained from the analysis using the Winsteps program. Hambleton and Swaminathan (1985, p. 36) state that an item is considered good if its difficulty lies between -2.0 and +2.0 (-2.0 < difficulty < +2.0).

Analysis of student's achievement
The level of students' ability in answering the instrument can be seen with the help of the Winsteps program with Rasch modeling. Fariña et al. (2019) state that the ability level of the students is indicated by the logit value on the person measure.

Analysis of the learning achievement of Critical Thinking Skills
The scores obtained by students on the critical thinking instrument, in the form of numbers, are then converted into three categories (Permatasari et al., 2019).
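The three-category conversion can be sketched as below. The mean ± 1 SD cut-offs are an assumed convention used here for illustration, not necessarily the exact thresholds of Permatasari et al. (2019):

```python
def categorize(score: float, mean: float, sd: float) -> str:
    """Three-way grouping of a numeric score into high / medium / low.
    The mean +/- 1 SD boundaries are an assumed convention."""
    if score >= mean + sd:
        return "high"
    if score > mean - sd:
        return "medium"
    return "low"

# Hypothetical illustration using the paper's reported mean of 65.50
# and an assumed standard deviation of 15.
label = categorize(72.0, mean=65.5, sd=15.0)  # "medium"
```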

Results and Discussion
Feasibility of authentic instruments of critical thinking skills

Expert judgment trials aim to produce instruments that are valid in terms of content. A qualitative review was carried out prior to the instrument trials and measurements, involving experts. The aspects assessed at this stage include substance, construction, language, and appearance. The validation results from the experts were then analyzed using the Aiken V equation to determine the value of each item. The Aiken V scores on the aspects of substance, construct, language, and appearance are about 0.90, 0.85, 0.80, and 0.92, respectively. Aiken stated that the statistical significance of Aiken V can be determined from the rating scale used and the number of experts. This study involved seven experts and a five-category scale with a significance level of 0.05, so the Aiken V limit for each item was 0.75. Thus, it can be stated that, at the 0.05 significance level, all items fall into the content-valid category.
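The Aiken V computation described above can be sketched as follows; this is an illustrative implementation of Aiken's formula V = Σ(r − lo) / (n(c − 1)), not the authors' analysis script:

```python
def aiken_v(ratings, lo: int = 1, c: int = 5) -> float:
    """Aiken's V for one item: V = sum(r - lo) / (n * (c - 1)),
    where n raters score the item on a c-point scale whose lowest
    category is lo. V ranges from 0 to 1."""
    n = len(ratings)
    return sum(r - lo for r in ratings) / (n * (c - 1))

# Seven experts on a five-point scale, as in this study. An item
# rated 4 by all seven experts lands exactly on the 0.75 limit.
print(aiken_v([5, 5, 5, 5, 5, 5, 5]))  # 1.0
print(aiken_v([4, 4, 4, 4, 4, 4, 4]))  # 0.75
```

This makes the 0.75 criterion concrete: with seven raters and a five-point scale, an item is content-valid only if the panel averages a rating of 4 or higher.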
The content validation results for the written test assessment tool were analyzed using Lawshe's content validity, where the CVR validity standard depends on the number of SMEs. With 7 SMEs, the CVR value must reach 0.99 for an item to be declared valid (Lawshe, 1975). The CVR value obtained for each item is 1 and is presented in full in the attachment. The CVI value, obtained as the average CVR, is 1. Since the CVR values exceed 0.99, all items are declared valid and fit for use in further research.
Based on the analysis of the instrument items using the Winsteps 3.37 program, the quality of the assessment instrument items can be seen. First, a test of the model assumptions, namely unidimensionality, was carried out. Unidimensionality means that each test item measures only one ability (Fu & Feng, 2018). The Winsteps analysis of the authentic assessment instrument of critical thinking skills yielded a raw variance explained by measures of 48.3%, with an unexplained variance in the 1st contrast of 7.0% and an unexplained variance in the 2nd contrast of 6.5%.
According to Chan et al. (2014), the minimum requirement for unidimensionality is 20% raw variance explained; this requirement has therefore been met. Students' CTS was measured by giving 7 questions, each question representing 1 indicator, and the results showed that the students have a low level of CTS. The Rasch analysis included person reliability, item reliability, and item fit measures. The person reliability, a score that shows how consistently the students answer correctly, is 0.87 with a separation index of 2.57; in other words, the students' answers fall into the "good" category (Boone et al., 2011). The separation index, a score that shows how well the items separate the sample of persons, of 2.14 indicated that the students' CTS scores have a good enough distribution (Boone & Noltemeyer, 2017). The separation index of 3.76 is good: it indicates that the respondents can be divided into four large groups, namely groups with very high, high, low, and very low critical thinking skills scores. This classification indicates that the instrument is able to distinguish students' abilities quite accurately. Figure 1 shows a person measure of 0.02, which means that the students' average ability is almost the same as the difficulty level of the questions (set by default to 0.00). The OUTFIT-MNSQ column shows a score of 0.99, which indicates that the questions given are in the acceptable category, or excellent for measuring. Furthermore, the OUTFIT-ZSTD score of -0.1 indicates a high level of reliability. Referring to Boone et al. (2014) and Linacre (2012), the acceptable range for OUTFIT-MNSQ is from 0.5 to 1.5 and for OUTFIT-ZSTD from -2 to +2. Finally, the Cronbach's alpha value shows the students' internal consistency in answering the questions.
Therefore, this score is not strictly part of the statistical analysis, but shows the reliability between the students (who answer) and the questions (which are asked). In this study, the Cronbach's alpha value was 0.71, which means "acceptable", as shown in Figure 1. Figure 2 shows the results of the item measure test, in which the item reliability value reaches 0.97. This means that the items are considered very good at measuring the students' ability, with a separation index of 5.92, which is in the very good category. This result is supported by the Fit Item Order (Figure 3) and the variable map between item difficulties and student ability (Figure 4). Figure 3 shows the item analysis based on the item fit order. An item fits the model if the OUTFIT MNSQ is between 0.5 and 1.5, the ZSTD value is between -2.0 and +2.0, and the Pt Mean Corr value is between 0.4 and 0.8 (Fariña et al., 2019). Based on these criteria, it can be concluded that the 14 instrument items (7 indicators) of the authentic assessment of critical thinking skills fit the Rasch model. Figure 4 shows how the distribution of responses is associated with the question items answered. The question with the highest difficulty level is at the top, and the item with the lowest difficulty level is at the bottom. The labels X and Y indicate the question type (X = Type 1, tested first; Y = Type 2, tested after the first), and the numbers 1-7 indicate the critical thinking skills indicators. Based on the data in Figure 3 and Figure 4, it can be seen that all the questions meet good criteria for measuring students' critical thinking skills.

Descriptive analysis
In the implementation stage, the data were collected in two stages, namely the pre-test and the post-test. The implementation results are shown in Figure 5.
Based on Figure 5, there are two students with very extreme scores in both the pre-test and the post-test. Based on the distribution of the data, the pre-test scores are more symmetrically spread than the post-test scores, which tend to the right; the normality tests give sig. 0.017 (< 0.05) for the pre-test, indicating a deviation from normality, and sig. 0.200 (> 0.05) for the post-test, indicating normality. More details are shown in Figure 6. Based on Figure 6, it can be seen that, on average, students' scores on type 2 questions are higher than on type 1 questions. This means that, for the same indicator, type 2 questions are easier than type 1 questions. Furthermore, Figure 6 also describes the range of the data for the two question types: for type 1, the students' answers tend to be similar, with the data more centralized, whereas for type 2 the data are spread over a larger cluster. The students can then be grouped according to their abilities as in Table 1.

Instrument's analysis
The instrument testing at the implementation stage was administered to sixty students spread across two classes. The results indicate that the instrument had fairly good reliability, as shown in Figure 7. Figure 7 shows the total average values for the three aspects, most of which are in the good category because they fall between -2 and +2. Figure 8 shows that the questions have good quality, as indicated by the reliability value of 0.95. Furthermore, the separation index of 3.89 (rounded to 4) shows that the instrument is able to differentiate students' abilities into four groups (very high, high, low, very low). Finally, the Cronbach's alpha value of 0.81 indicates that the resulting instrument is acceptable for measuring critical thinking skills.
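The step from a separation index G to a number of statistically distinct levels is often made with the Rasch strata convention H = (4G + 1) / 3. A minimal sketch, offered only as an illustration of how a separation value relates to distinguishable ability groups:

```python
def strata(separation: float) -> float:
    """Rasch strata convention: H = (4G + 1) / 3, an estimate of the
    number of statistically distinct ability (or difficulty) levels
    that a given separation index G can support."""
    return (4 * separation + 1) / 3

# Illustration with the separation index reported at this stage.
h = strata(3.89)  # about 5.5 distinguishable levels
```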
Referring to the variable map, the data in Figure 6 show that the post-test scores are better than the pre-test scores, as evidenced by Figure 9, which indicates that the post-test (denoted by Y) has a lower level of difficulty than the pre-test. Viewed from the model fit (Figure 10), the Y-type (post-test) questions have a pattern more similar to the Rasch model than the X-type (pre-test) questions.
The results of the study showed that the authentic assessment instrument has good quality and is suitable for measuring students' CTS. Based on the results of the analysis, the developed CTS authentic assessment instrument is able to measure students' ability from the highest to the lowest. Based on the experts' assessment, all the questions are of good quality and feasible to be tested on students. The feasibility test with 118 students showed that the developed instrument was able to measure students' abilities well. This can be seen in the grouping of the students into six groups, from the highest to the lowest ability, indicated by a separation index of 5.92 (rounded to 6). At the dissemination stage, the separation index decreased slightly to 5.52 (rounded to 6). Based on the analysis of each item and the indicators tested, the items generally meet the OUTFIT MNSQ standard with a score range of 0.5 to 1.5. The clarity aspect (items 1 and 8) has OUTFIT MNSQ values of 0.80 (type A) and 0.72 (type B), the interpretation aspect 0.70 (type A) and 0.93 (type B), the analysis aspect 0.74 (type A) and 1.13 (type B), the evaluation aspect 1.26 (type A) and 1.63 (type B), the reason aspect 0.73 (type A) and 1.36 (type B), and the self-regulation aspect 0.96 (type A) and 0.80 (type B). Based on the OUTFIT MNSQ values, only 1 question needs to be revised, namely the type B evaluation item (item number 12); after revision, it can be used in the dissemination test. Furthermore, referring to the variable map, the analysis shows that there are some students with very high critical thinking skills, but also students with very low critical thinking skills.
This, of course, is influenced by several factors, one of which is class characteristics. The class used as the test subject had an intake of students with various abilities. This means that the students in the class did not have the same abilities from the start; there were students with very high abilities and students with merely sufficient abilities. Nevertheless, the critical thinking skills instrument developed was able to measure students' ability from the highest to the lowest. These results are consistent with research conducted by Burhanuddin (2015), which states that authentic assessment by written tests is suitable for assessing the cognitive aspects of students.

Conclusion
Based on the results of the analysis and discussion, the following conclusions were obtained: this study produced an authentic assessment instrument that is suitable for measuring students' critical thinking skills on environmental change material. The feasibility of this authentic assessment instrument is based on the following results: the instrument met the content validity requirements through expert judgment in the very good category; the 14 critical thinking skills items obtained empirical evidence of fit with the Rasch model based on three parameters, namely OUTFIT MNSQ, ZSTD, and Pt Mean Corr; based on the item separation index, the developed authentic assessment instrument of critical thinking skills was classified as reliable; and the developed instrument can measure students' critical thinking skills at the high level (16.67%), middle level (63.33%), and low level (20.00%), with 10 students at the high level, 38 students at the middle level, and 12 students at the low level.