Evaluation of a Flipped Classroom Teachers Training Course Assessment Through Latent Trait Theory Analysis

Abstract—Assessment of an educational program or course, based on quantitative data, is attempted in this study by using the final deliverables of the trainees and assessing them against a predefined set of items connected to the desired Learning Outcomes, with a predefined scale for each item. The statistical analysis of the items' grades, first using factor analysis and then an Item Response Theory model, gives an indication of the degree to which the Learning Outcomes were achieved and consequently guides the training designers in modifying training strategies for a potential next cycle of the program or course. For this study, the concept was tested on a teacher training course on flipped classroom methodology. Our Item Response Theory analysis revealed which Learning Outcomes were partially achieved or not achieved at all, showing very good agreement with the trainers' intuitive observations. In the future, such a quantitative assessment could involve Structural Equation Modelling (SEM) tools to assess the relations among learning outcomes, prior knowledge and teaching practices, as well as temporal analysis during course execution using not only final data but also data from intermediate phases.




Index Terms—Educational Assessment, Item Response Theory, Flipped Classroom.

I. INTRODUCTION
It is common knowledge that assessment of a training program always comes at the program's last step, to help the organizers understand what went well, what went wrong, and whether the aims of the program have been achieved; it can be performed in various ways. It is also common practice to have the trainees assess the program they have just completed through a questionnaire. On the other hand, the trainers assess the achievement of the set educational outcomes for the trainees, e.g. through exams or final projects. It is then in the trainers'/teachers' hands to intuitively combine the different aspects of the program's assessment and, using their experience, to figure out what went well and needs to be retained and what went wrong and needs attention during the next cycle.
In this study, we attempt to provide a series of methodological steps to help trainers/teachers analyze the final outcomes of a training program of any kind, by looking into the results of their trainees' learning outcomes evaluation using statistical tools, here Exploratory Factor Analysis and Item Response Theory.
Published on September 23, 2019. I. Katsenos is with the University of Patras/Department of Business Administration, Patras, Greece (e-mail: ikatsenos@gmail.com). G. S. Androulakis is with the University of Patras/Department of Business Administration, Patras, Greece (e-mail: gandroul@upatras.gr).

The training program under analysis in this study was a teacher training program on the flipped classroom teaching methodology, held via blended learning across different cities of western Greece.

A. Latent trait models -Dichotomous Item Response theory models
In their simplest form, unidimensional dichotomous latent trait models utilize a set of binary responses $x_{i1}, x_{i2}, \dots, x_{ik}$ to a set of $k$ items (e.g. questions in an educational assessment test), for $i$ examinees, to calculate the probability of $x_{ij}$ being 0 or 1 given the true ability level $\theta_i$ of the examinee, $P(\theta_i) = P(x_{ij} = 1 \mid \theta_i)$. To estimate $P(\theta_i)$, a function monotonically increasing in $\theta_i$ over $(-\infty, +\infty)$ is used, most commonly the logistic function. Depending on the number of parameters included in the model, 1-, 2-, 3- and 4-parameter models are derived. For instance, the 2-parameter logistic model [1]

$$P(x_{ij} = 1 \mid \theta_i, \delta_j, a_j) = \frac{1}{1 + e^{-a_j(\theta_i - \delta_j)}} \quad (1)$$

for dichotomously answered items calculates the probability of correctly answering item $j$ by someone with ability level $\theta_i$, given the item's difficulty $\delta_j$ (otherwise called the location parameter) and the item's discrimination parameter $a_j$.
The location parameter $\delta_j$ is the value of the ability $\theta$ at which there is a 50% probability of the item being correctly endorsed. Items with lower $\delta_j$ values are 'easier' and are expected to be endorsed at lower trait levels. Item discrimination, denoted $a_j$ for item $j$, describes how well an item can differentiate examinees at different trait levels. In the simplest binary case, it is defined as the slope of the logistic function at $\delta_j$: the steeper the curve, the better the item discriminates between entities with different levels of the trait.
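As a concrete illustration, equation (1) can be evaluated directly; the following is a minimal sketch, with parameter values that are illustrative rather than taken from the study:

```python
import math

def icc_2pl(theta, a, delta):
    """P(correct | theta) under the 2PL model of equation (1)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - delta)))

# At theta = delta the probability is exactly 0.5, which is how the
# location parameter is defined; a steeper slope (larger a) means
# sharper discrimination around that point.
p_at_location = icc_2pl(0.7, a=1.5, delta=0.7)   # 0.5
```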
Graphical representation of (1) results in the Item Characteristic Curve (ICC). Unidimensional IRT models assume that: 1) the probability of an entity endorsing an item increases as the latent trait level increases (monotonicity); 2) a single latent trait underlies the responses to all items (unidimensionality); 3) an entity's trait level does not depend on which items are administered nor on the particular sample of entities (invariance); 4) responses are independent given an entity's ability level (local independence); and 5) the same item response function applies to all members of the population of entities.

Ioannis Katsenos, Spyros Papadakis, and George S. Androulakis

Having determined the IRT parameters for all items, the ability level (IRT score) of each examinee can be calculated through different methods. For instance, according to the maximum likelihood estimation procedure for the 2-Parameter Logistic (2PL) model, starting from an initial arbitrary value, the ability estimate is iteratively updated according to equation (2) [2]:

$$\hat{\theta}_{s+1} = \hat{\theta}_s + \frac{\sum_i a_i \left[ u_i - P_i(\hat{\theta}_s) \right]}{\sum_i a_i^2\, P_i(\hat{\theta}_s)\, Q_i(\hat{\theta}_s)} \quad (2)$$
where $\hat{\theta}_s$ is the ability estimate at the $s$-th iteration, $a_i$ is the discrimination parameter of the $i$-th item, $u_i$ is the response of the examined entity to item $i$ (either 1 or 0), $P_i(\hat{\theta})$ is the probability of a correct response to item $i$ under the given model at ability $\hat{\theta}$ within iteration $s$, and $Q_i(\hat{\theta}) = 1 - P_i(\hat{\theta})$ is the corresponding probability of a wrong response.
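This iterative estimation can be sketched as a short Newton-Raphson loop using the terms just defined; the item parameters and the response pattern below are purely illustrative (note that all-correct or all-wrong patterns have no finite maximum likelihood estimate):

```python
import math

def p2pl(theta, a, delta):
    """2PL probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - delta)))

def estimate_theta(responses, a, delta, theta0=0.0, iters=25):
    """Newton-Raphson ML ability estimate for the 2PL model.

    responses: list of 0/1 answers u_i; a, delta: item parameter lists.
    Diverges for all-0 or all-1 response patterns (no finite MLE).
    """
    theta = theta0
    for _ in range(iters):
        p = [p2pl(theta, ai, di) for ai, di in zip(a, delta)]
        num = sum(ai * (ui - pi) for ai, ui, pi in zip(a, responses, p))
        den = sum(ai * ai * pi * (1.0 - pi) for ai, pi in zip(a, p))
        theta += num / den
    return theta
```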
The corresponding standard error of measurement can be calculated by equation (3):

$$SE(\hat{\theta}) = \frac{1}{\sqrt{I(\hat{\theta})}} \quad (3)$$

The reciprocal of the squared $SE(\hat{\theta})$ defines the information function $I(\hat{\theta})$: where $SE(\hat{\theta})$ is small, the estimate of $\theta$ is accurate and we obtain higher information than in areas where $SE(\hat{\theta})$ is larger. A test is a set of items, thus the Test Information Function (TIF) is the sum of all items' information at each ability level, or simply [2]:

$$I(\theta) = \sum_i I_i(\theta)$$

where, for instance, for the 2PL model $I_i(\theta) = a_i^2 P_i(\theta) Q_i(\theta)$, $a_i$ is the discrimination parameter of item $i$, $P_i(\theta) = \frac{1}{1 + e^{-a_i(\theta - \delta_i)}}$, $Q_i(\theta) = 1 - P_i(\theta)$ and $\theta$ is the ability level. For each item, the information function has a peak near (or, for the 1PL and 2PL models, exactly at) the difficulty $\delta_i$ and is symmetrical around it, while for the whole test the Test Information Function may be flat over a range of $\theta$s. Graphical representation of $I_i(\theta)$ results in the Item Information Curve (IIC), while the respective representation for the whole test results in the Test Information Curve (TIC). The IICs and the TIC show graphically the ability ranges where information is highest for each item and for the test; the IIC may thus be used to identify the $\theta$ position of maximum accuracy given by the item.

B. Polytomous Item Response theory models -The Generalized Partial Credit Model (GPCM)
When the responses to items are polytomous (i.e. more than two categories, rather than success-failure), e.g. 3 or more, the IRT models need to be extended. Masters [3] proposed the Partial Credit Model (PCM), in which the ordered polytomous data are decomposed into a series of ordered pairs of adjacent categories or category scores, and a dichotomous model is then successively "applied" to each pair. The Partial Credit Model specifies the conditional probability that an examinee with latent location $\theta$ obtains a category score of $x_j$ as

$$P(X_j = x_j \mid \theta) = \frac{\exp \sum_{h=0}^{x_j} (\theta - \delta_{jh})}{\sum_{r=0}^{m_j} \exp \sum_{h=0}^{r} (\theta - \delta_{jh})}$$

(with the convention that the $h=0$ term is zero), where $\delta_{jh}$ is the transition location parameter, which in effect reflects the relative difficulty of endorsing category $h$ over category $h-1$. The subscript on $m$ (i.e., $m_j$) reflects that the number of category scores may vary across items. Therefore, the Partial Credit Model may be applied to items that are polytomously scored with a varying number of category scores, to items that are dichotomously scored, or to a mix of both [4].
The probability of obtaining a particular category score as a function of θ may be graphically represented in an option response function (ORF); ORFs are sometimes referred to as category probability curves, category response functions, operating characteristic curves, or option characteristic curves.
Muraki in 1992 generalized the PCM [5] by relaxing Masters' assumption of equal discrimination parameters among items (in other words, by extending the 2PL dichotomous model), proposing that the probability of endorsing the $k$-th category of item $j$ is given by

$$P(X_j = k \mid \theta) = \frac{\exp \sum_{h=1}^{k} a_j (\theta - \delta_{jh})}{\sum_{r=1}^{m_j} \exp \sum_{h=1}^{r} a_j (\theta - \delta_{jh})}$$

where $\theta$ is the latent trait, $a_j$ is the item's discrimination, $\delta_{jh}$ is the transition location parameter between the $h$-th and the $(h-1)$-th category (i.e. the intersection point of adjacent ORFs), $m_j$ is the number of categories of item $j$ and $k = 1, \dots, m_j$. Muraki arbitrarily defined the first boundary location as zero ($\delta_{j1} = 0$), thus there are $m_j - 1$ transition location parameters for each item with $m_j$ categories.

C. Learning outcomes' formulation and taxonomies
Assessment serves to diagnose, predict, place, evaluate, select, grade and guide students or teachers. That is, at all education levels, assessment results are used to make decisions about students (e.g., student advancement) and about teaching and learning (e.g., curriculum decisions), and assessments are increasingly linked with certification of competence and the validation of performance on job-related tasks [6]. While assessment can be seen as the Check step of a Plan-Do-Check-Act cycle in a continuous learning process [7]-[10], the formulation of the learning outcome objectives is the Plan step. Learning outcome objectives, as well as the teaching methods used, should be aligned with the learning activities assumed in the intended outcomes [11].
Several taxonomies have been proposed for formulating learning outcome objectives, focusing on different aspects of learning processes. For instance, the SOLO taxonomy proposed by Biggs [12] describes the increase in the trainee's ability to associate principles with new ideas, while the more recent Fink's taxonomy [13] is not hierarchical, but describes the intersection of six dimensions significant to learning.
However, the most popular taxonomy for formulating learning outcomes remains the revised Bloom's taxonomy [14]-[16], especially if only objectives at the cognitive level are to be pursued.

D. Flipped classroom
Although there is no single definition, the flipped classroom is generally characterized by a course structure comprising in-class and out-of-class activities. It uses classroom time for students to actively engage in interactive learning activities, while traditional lectures are delivered out of formal class time through videos, audio, content-rich websites, games and simulations [17]. Such a learning design intends to use classroom time to engage students in active learning [18] and to make the teacher a "guide on the side" instead of a "sage on the stage" [19]. Students are encouraged to explore and solve problems either independently or collaboratively in groups in order to achieve their learning outcomes.
A review of the literature on the flipped classroom has shown that there is still a pressing need to study "how" to support teachers in designing and implementing the flipped classroom and, moreover, to connect this pedagogical design with evidence of advantages related to various aspects of student learning.

A. Educational environment and assumptions for this study
Teacher training in Greece is performed either through programs centrally designed and implemented by the Ministry of Education, or by decentralized regional training centers (PEK), which have been under restructuring since summer 2018. The Regional Training Center of Patras, in its last year of operation, organized a "Flipped Classroom methodology" training course spread over the Region of Western Greece.
The course involved 376 trainees (volunteer class teachers) and 40 trainers, in four groups (in seven different cities of western Greece), and lasted 36 training hours (16 on site, 20 by distance) between February and June 2018. The methodology chosen and followed was blended learning, with sixteen hours of physical presence in the classroom and twenty hours of asynchronous work assisted by the LAMS platform [20]. The basic concept applied was to deliver a course on the Flipped Classroom methodology using that same methodology as the course's training method. The supporting technology was the Learning Activity Management System (LAMS, https://www.lamsfoundation.org/), the most widespread and popular platform implementing the ideas of learning design: a free, open-source online system that supports the design, authoring, management and supervision of the execution of courses in the form of sequences of learning activities [21], [22].
The core competence-examining instrument for the trainees was the composition of a learning scenario (LS) using the Flipped Classroom methodology. During the second week of the seminar, all 376 participants were asked to design a learning scenario on a subject of their choice and implement it in their class using the FC model. The trainees began working on it in the fourth week and delivered it at the end of the course.
Each learning scenario was evaluated a) by the respective trainer and b) by other trainees of the same group, according to a pre-specified measurement scale (see Appendix A). Three ordered category levels were defined for each formulated learning outcome: the lower category level meant that the expected learning outcome was poorly or not at all achieved; the middle level signified moderate achievement; and the higher level meant that the expected learning outcome was fully achieved.

B. Proposed methodology steps
The methodology steps followed in this study can be regarded as the steps for assessing any training course, or even a training program, at any level.

1) Learning outcomes' formulation
As is common practice, the learning outcomes are formulated during course design, when the learning activities are designed. Learning outcomes formulation can follow any taxonomy.

2) Learning outcomes' grading
Grading is done by the trainers/teachers at the end of the training course/program, against a predefined set of items assumed to assess the learning outcomes and using a preselected scale.

3) Identification of underlying relations on the answers
Although a model for the evaluation items might exist at the time of their formulation, the items might have been perceived differently by the answering groups. Therefore, factor analysis (exploratory and/or confirmatory) is needed to unveil the underlying dimensions and facilitate polytomous unidimensional IRT analysis in the next step.

4) IRT analysis
The data sets containing the evaluation data are analyzed using an appropriately selected IRT model.
In IRT analysis studies, the examined entities (trainees) occupy the rows and the examining items the columns of a table, and this table is fitted to an IRT model using an appropriate environment. Although ways have been proposed to identify item locations for polytomous items [23], we expect that, for the level of detail sought in applications like ours, simply the mean of the location indices is enough.
Another means of identifying an item's difficulty is to check the $\theta$ value where the information function $I(\theta)$ is maximized; this can be calculated, or simply estimated from the relevant IIF graphs. Thus, each item can be characterized as, e.g., low, medium or high difficulty if the $\theta$ scale is divided into three regions.
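This rule of thumb can be sketched as follows, assuming for simplicity a 2PL item (where the information peak lies exactly at $\delta$) and illustrative cut points at $\theta = \pm 1$, which the text does not prescribe:

```python
import math

def p2pl(theta, a, delta):
    """2PL probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - delta)))

def iif_peak(a, delta, grid=None):
    """theta at which the 2PL item information a^2 * P * Q is maximal,
    found by a coarse grid search over [-4, 4]."""
    grid = grid or [x / 100.0 for x in range(-400, 401)]
    return max(grid,
               key=lambda t: a * a * p2pl(t, a, delta) * (1.0 - p2pl(t, a, delta)))

def difficulty_band(theta_peak, low=-1.0, high=1.0):
    """Bucket an item into three regions of the theta scale.
    The cut points -1 and +1 are illustrative, not prescribed."""
    if theta_peak < low:
        return "low"
    if theta_peak > high:
        return "high"
    return "medium"
```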

5) Learning outcomes evaluation
This is done by combining the information gathered in the previous steps:
- The median of the grades for each item, which is associated with a learning outcome, serves as an indication for deciding upon the achievement of that learning outcome. If the median of an item's grades falls, e.g., into the lower answering category, one can safely conclude that the learning outcome associated with this item has not been achieved; a median falling in the upper answering category implies the opposite.
- Dividing the ability scale $\theta$ of each dimension into, e.g., 3 categories (low, medium, high) and identifying the location of each item, one can conclude on the relative difficulty of each of them. Thus the items (and consequently the associated learning outcomes) can be characterized as, e.g., low, medium or high difficulty.
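The median rule of the first point can be sketched directly on the three-level scale used in this study; the function name and cut logic are illustrative:

```python
from statistics import median

def outcome_achievement(grades):
    """Classify a learning outcome from the median of its item's grades
    on a three-level scale (1 = lower, 2 = middle, 3 = upper category)."""
    m = median(grades)
    if m <= 1:
        return "not achieved"
    if m >= 3:
        return "fully achieved"
    return "partially achieved"
```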

A. Learning outcomes' formulation
This was done during the course design phase. The table presented in Appendix A was produced and provided to the evaluators when learning scenario evaluations were requested. The revised Bloom's taxonomy was used in this study; however, the taxonomy used is not expected to influence the subsequent analysis.

B. Learning objectives grading
In this study, each learning scenario was evaluated twice, by the trainers and by peers, so two datasets were assembled for validity checking of the results, as described in the previous section. Double evaluation should not, however, be necessary for normal application of this methodology; final evaluation of the trainees' outcomes by their trainers alone should be enough.

C. Identification of underlying relations on the answers
Internal consistency was examined for the trainers' dataset and for the trainees' dataset. Both datasets appear adequately internally consistent, giving Cronbach's alpha values of 0.73 and 0.79 respectively.
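For reference, Cronbach's alpha can be computed from the raw score table with standard-library tools alone; this is a sketch, and the input layout (rows = examinees, columns = items) is an assumption about how the grades are tabulated:

```python
from statistics import pvariance

def cronbach_alpha(scores):
    """Cronbach's alpha: (k / (k - 1)) * (1 - sum(item variances) / total variance),
    for a table of scores with rows = examinees and columns = items."""
    k = len(scores[0])
    item_vars = [pvariance([row[j] for row in scores]) for j in range(k)]
    totals = [sum(row) for row in scores]
    return k / (k - 1) * (1.0 - sum(item_vars) / pvariance(totals))
```

Perfectly parallel items (every examinee scoring identically across items) yield alpha = 1, which serves as a quick sanity check of the implementation.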
Exploratory factor analysis revealed four factors for the trainers' dataset and three underlying factors for the trainees' dataset, as shown in Table I.
In both cases there exist two evaluation items (LGF1 and CAD5 for the trainers' dataset, TOO2 and VD1 for the trainees' dataset) which do not seem sufficiently related to the overall data structure; they were therefore excluded from further analysis.
Confirmatory factor analysis of the models identified above gave good results and confirmed the model selection, as shown in Table II. Factor analysis reveals three dimensions for the trainees' dataset and four for the trainers' dataset, corresponding to the different latent abilities identified. One can observe that:
- Dimension 1 corresponds to factor F1, with item CE5 mainly contributing in both datasets.
- Dimension 2 is mainly formed by items VC4, VA5 and CL3 in both datasets (factor F3 in both datasets).
- Dimension 3 is mainly formed by items VO6, VI2 and PK2 (factor F2 in the trainees' dataset and factor F4 in the trainers' dataset).
- Dimension 4 exists only in the trainers' dataset, consisting of items TOO2, CAT3 and COO3.

D. IRT analysis
In this study, the results of the evaluations were gathered electronically and analyzed using the polytomous IRT GPCM model.
The R environment [24] and the mirt package [25] were preferred for the IRT analysis of the data; however, similar results are expected whatever IRT analysis tools are used.
The IRT analysis was performed per identified dimension, and the parameters calculated were:
1. the discrimination parameter ($a_i$);
2. the location parameters (thresholds $\delta_{1i}$ and $\delta_{2i}$);
3. the total information area under the item information curves (as a single measure of the total accuracy of each item's measurement).
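The "total information area" of step 3 can be approximated numerically; the sketch below makes the simplifying assumption of a 2PL item (the study itself fits a GPCM), in which case the area also has the closed form $a\,[P(\text{hi}) - P(\text{lo})]$, usable as a self-check:

```python
import math

def p2pl(theta, a, delta):
    """2PL probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - delta)))

def iif_area(a, delta, lo=-4.0, hi=4.0, n=1000):
    """Area under a 2PL item information curve on [lo, hi],
    approximated by the trapezoidal rule."""
    h = (hi - lo) / n
    def f(t):
        p = p2pl(t, a, delta)
        return a * a * p * (1.0 - p)
    return h * (0.5 * f(lo) + sum(f(lo + i * h) for i in range(1, n)) + 0.5 * f(hi))
```

Because the 2PL information is $a\,dP/d\theta$, integrating it over the whole $\theta$ line gives exactly $a$, so items with higher discrimination carry more total information.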

1) Dimension 1
As can be seen in Table III and Fig. 1:
- Item VD1 provides no information, as almost all answers fall in the third answering category (i.e. the video duration is short).
- Items SE5 and CE5 have similar characteristics, with thresholds in the middle of the $\theta$ scale and similar amounts of information provided.
- Item VS4, being important only in the trainers' dataset, discriminates better among examinees with higher ability levels, while items CAD5 and CAT3, being important only in the trainees' dataset, discriminate better among examinees with lower $\theta$ levels.

2) Dimension 2
As can be seen in Table IV and Fig. 2, in dimension 2 the items providing information differ between the two datasets: in the trainers' dataset item CL3 has the highest discrimination and therefore provides the maximum information, while in the trainees' dataset the highest discrimination comes from item VC4. In both datasets, however, the dimension 2 items measure ability at $\theta$ values below zero.

3) Dimension 3
As can be seen in Table V and Fig. 3:
- Item PK2 in the trainers' dataset contains no answers in the highest-level category, so no upper threshold $\delta_{2i}$ or information function can be computed. In the trainees' dataset, this item indicates that the corresponding learning objective proved difficult for the examinees: the reversed threshold ordering ($\delta_{2i} < \delta_{1i}$) indicates that the middle category is never the most probable response, so the evaluators tended to choose between the first and the third answer categories.
- Item VO6 is dominant in both datasets, providing most of the information and examining higher $\theta$s. Item VS4 in the trainees' dataset also functions well at high $\theta$s. Consequently, this dimension seems to examine higher $\theta$s.

4) Dimension 4
Dimension 4 is present only in the trainers' dataset and provides its most accurate results in the mid-to-lower range of the $\theta$ ability. As seen in Table VI and Fig. 4, item CAT3 has very high discrimination and provides high information between $\delta_1 = -1.123$ and $\delta_2 = 0.386$, i.e. the mid-to-low range of $\theta$s.

E. Learning outcomes evaluation
The majority of evaluations, as indicated by the median of each dataset, can be used to draw conclusions on the level of achievement of each learning objective.
In this study, learning outcomes were evaluated twice, so in some cases a temporal evolution of the learning scenario development may be observed (this could be investigated further in a future study), since the examinees had the opportunity to continue working on their deliverables and evolve them further. The IRT analysis results are summarized in Table VII, which provides a picture of the Learning Outcomes examined by each item and their degree of achievement.
Furthermore, Fig. 5 may be constructed to visualize the level of achievement of the different learning outcomes for each particular cycle of the training course, or across different training courses, based on the information presented in Table VII.

V. DISCUSSION AND FURTHER INVESTIGATION
Factor analysis showed that the initial classification of learning outcomes according to the revised Bloom's taxonomy was not accurate: it identified fewer than six dimensions in the data, which implies that the evaluators perceived a different interrelation among the learning outcomes and that some of the levels are merged.
The trainers were able to identify one more dimension, even though they evaluated at an earlier stage and the examinees continued to develop the learning scenarios afterwards.
For some of the evaluation items (e.g. LGF1, SE3) there was further evolution between the first evaluation by the trainers and the second one by the trainees themselves. Nine out of fifteen learning objectives seem to have finally been achieved (LGF1, VD1, CL3, CAT3, VC4, VA5, CAD5, SE5, CE4); however, only two of them (CE5, SE5) seem to correspond to middle-level ability $\theta$.
Three of the fifteen learning objectives seem to have been partially achieved, two of which examined middle-level abilities. Another three of the fifteen were not achieved, all of them appearing difficult for the examinees.
The above results indicate that the trainees only partially applied the instructed theory in formulating their goals, in adding interactivity to their videos and in including group activities in their flipped classroom educational scenarios, and also that they failed completely to use original videos or videos with a high degree of modification, did not add adequate interactivity to their videos, and failed to correctly specify the prior knowledge in their scenarios.
These observations, coming solely from the quantitative analysis of the learning scenario deliverables, agreed to a high degree with the trainers' observations and constitute the basic issues to be addressed in a potential next cycle of this teacher training course by designing the appropriate training strategies.
Regarding the research questions of this study, it is evident that quantitative assessment of the deliverables of a training program/course can lead to conclusions regarding its effectiveness and guide which areas should be treated with special care in a possible next cycle of the program/course.
Further investigation could increase the depth of the quantitative analysis by assessing learning outcomes achievement at both intermediate and final phases. Also, the use of Structural Equation Modelling (SEM) tools to assess the relations among learning outcomes, prior knowledge and teaching practices could lead to further deepening of the quantitative analysis, thus enabling a deeper view of any training program.

Fig. 1. Item Response Curves and Item Information Curves for Dimension 1 (Table III)
Fig. 3. Item Response Curves and Item Information Curves for Dimension 3 (Table V: IRT parameters for Dimension 3)
Table VI: IRT parameters for Dimension 4 (factor F2 for trainers; no such dimension in the trainees' dataset)
Table VII: Learning outcomes evaluation
Fig. 5: Learning outcomes achievement profile

APPENDIX A (measurement scale; fragment as recoverable from the extraction, with three ordered category levels 1-3 per item)

Video content: 1) the video content is poor (poor graphics and sound quality); 2) the video content is moderately appealing (acceptable graphics quality and good sound); 3) the video content is highly appealing (high-quality graphics and crystal-clear sound).

Video supervision (VS4): 1) the teacher has no supervision of the video sent to his/her students; 2) the learning scenario mentions some supervision of the video by the teacher; 3) the learning scenario contains a specific field for the teacher to note his/her remarks from the video supervision, completed before the course.

Evaluate:
Video adequacy (VA4): 1) the selected video serves too few of the intended learning outcomes; 2) the selected video serves many but not all of the intended learning outcomes; 3) the selected video serves all the intended learning outcomes.

Classroom Activities description (CAD5): 1) the description of the classroom activities contains only their titles; 2) the description of the classroom activities is short; 3) there is a complete description of the classroom activities, with Activity Sheets also provided where needed.

Student Evaluation (SE5): 1) the learning scenario does not contain students' evaluation activities; 2) some evaluating activities are designed; 3) the learning scenario contains a full description of evaluating activities for all intended learning outcomes.

Course Assessment (CE5): 1) the video was found on the WWW and used by the trainee without any processing; 2) the video was found on the WWW and adapted by the trainee (e.g. narration, added comments, added questions); 3) the video was created from scratch and processed by the trainee.

Course evaluation: 1) the learning scenario does not foresee any course evaluation; 2) the learning scenario foresees partial course evaluation; 3) the learning scenario foresees full evaluation of the learning outcomes achievement, the video and the classroom.