Research


Journal Articles

Treischl, E., & Wolbring, T. (2017). The Causal Effect of Survey Mode on Students’ Evaluations of Teaching: Empirical Evidence from Three Field Experiments. Research in Higher Education, 58(8), 904–921.

In recent years, many universities have switched from paper-based to online student evaluation of teaching (SET) without knowing the consequences for data quality. This paper examines the effects of survey mode on SET based on a series of three consecutive field experiments: a split-half design, twin courses, and pre–post measurements. All three studies reveal marked differences in non-response between online and paper-based SET and systematic but small differences in the overall course ratings. The boxplot below depicts the main results of the split-half design for five example courses (for overall results, see the paper).

In the paper, we also show that online SET reveal a slightly less optimistic picture of teaching quality in students’ perception. We highlight the importance of taking selection and class absenteeism into account when studying survey mode effects, and we show that it is both necessary and informative to survey the subgroup of no-shows when evaluating teaching. Furthermore, we empirically demonstrate the need to account for the contextual setting of the survey (in class vs. after class) and the specific type of online survey mode (TAN vs. email). Previous research either confounded contextual setting with variation in survey mode or generalized results for a specific online mode to web surveys in general. Our findings suggest that higher response rates in email surveys can be achieved if students are given the opportunity and time to evaluate directly in class.


Wolbring, T., & Treischl, E. (2016). Selection Bias in Students’ Evaluation of Teaching: Causes of Student Absenteeism and Its Consequences for Course Ratings and Rankings. Research in Higher Education, 57, 51–71.

Systematic sampling error due to self-selection is a common topic in methodological research and a key challenge for every empirical study. Since selection bias is often not sufficiently considered as a potential flaw in research on and evaluations in higher education, the aim of this paper is to raise awareness of the topic using the case of students’ evaluations of teaching (SET). First, we describe students’ selection decisions at different points of their studies and elaborate on the potential biases these decisions might cause for SET. Then we empirically illustrate the problem and report findings from a design with two measurement points, showing that approximately one third of the students do not attend class at the second measurement point, when the regular SET takes place. Furthermore, the results indicate that the probability of absenteeism is influenced by course quality, students’ motivation, course topic, climate among course participants, course load and workload, and timing of the course.

Absenteeism is significantly associated with overall course quality and with single dimensions of perceived course quality. The line graph below illustrates this point: the higher a student’s agreement with a single quality indicator (e.g. instructor’s enthusiasm), the higher the chance of attending at the second measurement point.


Although data are missing not at random, average ratings do not strongly change after adjusting for selection bias. However, we find substantial changes in rankings based on SET. We conclude from this that, at least as regards selection bias, SET are a reliable instrument to assess quality of teaching at the individual level but are not suited for the comparison of courses.
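
The distinction between stable averages and unstable rankings can be illustrated with a small sketch. The course labels and mean ratings below are hypothetical and are not taken from the paper; the sketch only shows that adjustments of less than a tenth of a scale point can still reorder courses when their means lie close together.

```python
# Hypothetical course means on a 1-5 scale; NOT data from the paper.
observed = {"A": 4.10, "B": 4.05, "C": 3.95, "D": 3.90}
# Hypothetical means after adjusting for selection bias (assumed values).
adjusted = {"A": 4.02, "B": 4.06, "C": 3.88, "D": 3.93}


def ranking(means):
    """Return course labels ordered from highest to lowest mean rating."""
    return sorted(means, key=means.get, reverse=True)


# Largest absolute change in any course mean caused by the adjustment.
max_shift = max(abs(observed[c] - adjusted[c]) for c in observed)

print("Ranking before adjustment:", ranking(observed))
print("Ranking after adjustment: ", ranking(adjusted))
print("Largest change in a course mean:", round(max_shift, 2))
```

In this toy example every course mean moves by less than 0.1 points, yet the two rankings differ, which is the pattern the abstract describes.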


Edited Volumes

Treischl, E., & Wolbring, T. (2017). Studentische Lehrveranstaltungsevaluation: Grundlagen, Befunde, methodische Fallstricke. In Handbuch Qualität in Studium und Lehre.

Student evaluation of teaching (in German: Lehrveranstaltungsevaluation, LVE) is the most widely used instrument for capturing teaching quality from the students’ perspective. Given the instrument’s widespread use alongside persistently strong and widespread criticism of it, this chapter provides a methodologically informed overview of its possibilities and limits. To this end, we discuss central measurement requirements and link them to empirical findings on reliability and validity. We then illustrate the limits of measurability using two different types of validity threats. First, we take up the discussion of potentially biasing influences and illustrate it with the examples of grading and student interest. Second, we draw attention to the so far largely neglected question of potential selection effects, i.e. the consequences of incomplete coverage of the course audience in the survey, and relate it to different data collection procedures. Finally, we address practical aspects of implementing student evaluations of teaching.


Work in Progress

Treischl, E. (2019): It’s a Bias, isn’t it? The Causal Influence of Course Interest, Workload and Difficulty on Students’ Evaluation of Teaching.

For decades, empirical research on higher education has addressed the question of whether student evaluation of teaching (SET) is distorted by bias variables. The current state of research is inconclusive for many bias variables, and in some instances it is not clear whether a variable should be regarded as a bias or is caused by other aspects of teaching and learning. Against this background, I stress that previous research relies almost exclusively on observational data, which are not well suited to estimating the causal impact of a bias variable because of several methodological pitfalls (e.g. student self-selection). In this paper, I examine the causal effect of three bias variables: course interest, workload, and difficulty. I conducted a factorial survey experiment (FSE) to demonstrate its methodological advantages for estimating the causal effect of bias variables, especially compared to observational SET data. The main findings of the FSE indicate that course interest and workload have a significant but moderate causal impact on students’ perception of teaching quality, while difficulty has a significant but low impact. Furthermore, the FSE includes the perceived learning outcome, which allows me to examine the mechanism behind the bias variables and to estimate their effects independently of the perceived learning outcome. The analysis shows that perceived learning shapes students’ perception of the bias variables: if they learned a great deal, students are willing to accept a higher workload but put a higher emphasis on course interest. Overall, these findings indicate that course interest and workload are bias variables that distort SET results; however, an effective instructor may reduce or induce the effect.


Treischl, E., Wolbring, T. (2019): Past, Present and Future of Factorial Survey Experiments in Social Sciences. A Literature Review for the Social Sciences.


Lisa Wallander published a highly cited review article on factorial survey experiments (FSE) in 2009, and over the last decade FSE have been increasingly used in the social sciences. Given this success and the growing number of empirical applications since the first review article, we think it is time for an updated review. We conducted a literature review and created a dataset that contains all articles from Wallander’s review and extends it with FSE articles published up to 2018. This allows us to focus on developments since the first review paper. In particular, we discuss which standards for designing and analyzing an FSE have been established and which state-of-the-art techniques (e.g. statistical analysis, vignette sampling) have entered applied research in the last decades. Our literature review shows that most applications make use of state-of-the-art techniques, especially in terms of statistical analysis, even though there is still room for improvement with regard to vignette sampling procedures. Moreover, this review aims to provide guidance for future methodological research by highlighting unresolved questions in the continuing but small methodological literature on the FSE technique. Based on the review, we discuss how previous research deals with concerns about the realism and complexity of vignettes, highlight that previous research often addresses sensitive topics, and show that many applications focus on hypothetical intentions. On all three aspects, little empirical knowledge exists, which is why we call for more research to deepen our methodological knowledge about the chances and challenges of FSE.

Hönig, A.-L., Treischl, E. (2019): How to Find a Needle in a Haystack? Identifying Key Dimensions to Increase Teaching Quality.

Increasing teaching quality is an important aim for instructors, institutions, and academia as a whole. Teaching quality is a multidimensional construct, which is why the key challenge in (higher) education is to identify individual dimensions of teaching quality that are suitable leverage points for improvement. Student evaluation of teaching (SET) is the standard tool to measure teaching quality, but SET based on surveys is not well equipped to identify the impact of single dimensions on students’ overall course assessment. Moreover, survey research does not take into account how students weigh different dimensions against each other, which is why we use factorial survey experiments to measure the relative importance of individual dimensions for students’ course assessment. We randomly assign students to different course scenarios and ask them to weigh individual dimensions of teaching quality simultaneously. This allows us to calculate the causal effect of each dimension on students’ course assessment. Our results help instructors and institutions evaluate the relevance of individual dimensions of teaching quality in their context and thereby help instructors improve their teaching.
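
The logic of random assignment in a factorial survey can be sketched with simulated data. The following is a minimal illustration under stated assumptions, not the authors’ analysis: the dimension names, effect sizes, baseline rating, and sample size are all hypothetical, and real factorial survey data are often analyzed with regression models that account for several vignettes being rated by the same respondent.

```python
import random
from statistics import mean

# Hypothetical dimensions and effect sizes, chosen only for illustration.
TRUE_EFFECTS = {"structure": 0.8, "enthusiasm": 0.5, "feedback": 0.2}


def simulate_vignettes(n, seed=0):
    """Simulate n rated vignettes with randomly assigned dimension levels."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        # Each dimension is independently set to low (0) or high (1).
        levels = {dim: rng.randint(0, 1) for dim in TRUE_EFFECTS}
        # Assumed rating model: baseline + additive effects + random noise.
        rating = 3.0 + sum(TRUE_EFFECTS[d] * levels[d] for d in levels)
        rating += rng.gauss(0, 1)
        rows.append((levels, rating))
    return rows


def estimate_effects(rows):
    """Difference in mean rating between high and low level per dimension."""
    effects = {}
    for dim in TRUE_EFFECTS:
        high = [rating for levels, rating in rows if levels[dim] == 1]
        low = [rating for levels, rating in rows if levels[dim] == 0]
        effects[dim] = mean(high) - mean(low)
    return effects


rows = simulate_vignettes(4000)
estimates = estimate_effects(rows)
for dim, est in estimates.items():
    print(f"{dim}: estimated {est:.2f}, simulated {TRUE_EFFECTS[dim]:.2f}")
```

Because the levels are assigned at random and independently of each other, a simple difference in means recovers each simulated effect approximately, without having to adjust for the other dimensions.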


Treischl, E., Wolbring, T. (2019): Wirkungsevaluation. Grundlagen, Standards, Beispiele. Book Series: Standards standardisierter und nicht-standardisierter Sozialforschung (edited by Nicole Burzan, Paul Eisewicht, Ronald Hitzler). Basel/Weinheim: Beltz Juventa. Expected for 10/2019. 150 pages.

Evaluations focus on the causal effects of an intervention. This textbook introduces the theoretical and methodological foundations of impact evaluation and illustrates them with selected evaluations. For example, we take up the question of whether food labeling influences dietary behavior, given that a large share of the population in Germany and in many other countries is overweight or obese. The worldwide distribution of overweight adults (in percent) can be shown on the basis of BMI, using data for 2017 from the Global Health Observatory of the World Health Organization (WHO). Against this background, the question arises which measures could be taken to reduce overweight and obesity, and whether food labeling contributes to a healthier lifestyle.

The aim of the book is to establish a practical connection to the course of an evaluation and its individual decision steps, while also pointing out possible pitfalls of different evaluation approaches. It is therefore worthwhile reading for evaluation researchers as well as for users, commissioners, and those affected by an evaluation.


Treischl, E. (2019): Course Selection and Selective Courses? Students’ Course Selection and its Consequences for Students’ Evaluation of Teaching.

Students select courses each term to fulfil their study requirements. Previous research explains students’ course selection (CS) as a process of simultaneously considering expected benefits and costs, but so far it has not observed this process and has not taken into account that students often face trade-offs between different course and instructor characteristics. Against this background, this paper highlights that factorial survey experiments make it possible to observe students’ consideration process directly. The paper shows that course interest and course quality have a strong causal impact, while several cost aspects (e.g. course timing) have a substantial but lower impact on students’ CS. These findings do not imply that students select courses solely on the basis of anticipated benefits. Considering students’ CS in trade-off situations emphasizes that students weigh several costs and benefits simultaneously. The findings indicate that an excellent teacher can partly compensate for lower interest, while cost aspects reduce students’ CS regardless of anticipated benefits. Moreover, students’ CS may have consequences for students’ evaluation of teaching (SET) and may distort the measurement of teaching quality if CS is related to that measurement (e.g. via course interest as a bias variable). So far, research on SET has neglected possible consequences of students’ CS; knowledge about the causal influences on students’ CS is therefore also a desideratum for SET research. The reported findings underline that students’ CS lays the foundation for selective course samples, especially if a course combines several cost or benefit aspects.