Peer-review for selection of oral presentations for conferences: Are we reliable?
Introduction
Peer-review is a widely used method to select presentations for conferences, review journal papers and assess grant applications. The aim of peer-review is twofold: to improve the quality of research and to help editors and conference planning committees decide whether to accept or reject submissions.
It is assumed that this practice raises the quality of the end-product, especially if combined with providing feedback and suggestions for change. Moreover, it is assumed that peer-review provides a mechanism for rational, fair and objective decision-making. Peer-review has been called ‘a cornerstone of science and of quality assurance’, and even ‘the gold standard for evaluating scientific merit’ [1].
It is therefore surprising that research on this topic remains scarce. Research that has investigated peer-review reveals several issues and criticisms concerning bias, poor quality of review, unreliability and inefficiency [2], [3]. The number of peer-reviewers needed to overcome these problems remains unclear [4]. Moreover, during the last decade the exponential growth of scientific publications has strained the approach, as the capacity of potential reviewers is limited [5].
The most important weakness of the peer review process described in research is the inconsistency between reviewers, leading to inadequate inter-rater reliability (IRR): most research on this topic concludes that ‘there is a low and even insufficient level of IRR’. Scores between 0.07 and 0.20 have been demonstrated [6]. However, research concerning IRR has itself been criticized as imprecise, as much of it has been performed qualitatively. Bornmann et al. conducted a meta-analysis on the reliability of journal peer-reviews. Their conclusion was simple: the IRR of peer assessment is limited and needs improvement [7].
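IRR figures such as those quoted above are typically intraclass correlations. As an illustration of how such a figure is obtained, the sketch below computes a one-way random-effects ICC(1,1) for a small invented rating matrix; the scores are hypothetical, not the study's data.

```python
# One-way random-effects ICC(1,1) for a rating matrix
# (rows = submissions, columns = reviewers).
# Illustrative scores only, not the study's data.

def icc1(scores):
    n = len(scores)            # number of submissions
    k = len(scores[0])         # raters per submission
    grand = sum(sum(row) for row in scores) / (n * k)
    # between-submission mean square
    msb = k * sum((sum(row) / k - grand) ** 2 for row in scores) / (n - 1)
    # within-submission mean square
    msw = sum((x - sum(row) / k) ** 2
              for row in scores for x in row) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)

ratings = [
    [3, 4, 3],
    [2, 2, 3],
    [4, 5, 4],
    [3, 3, 2],
]
print(round(icc1(ratings), 2))  # 0.67 for this invented matrix
```

Values near 0.07-0.20, as reported in the literature, indicate that most of the variance in scores comes from the raters rather than from the submissions themselves.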
Blind peer-review has been suggested as one solution. Although widely believed to be an effective way to minimize at least potential reviewer bias, blind peer-review does not solve a major component of the reliability problem, namely inconsistencies between two or more reviewers [5], [6].
For the selection of oral presentations for conferences, the literature is even scarcer. Blackburn and Hakel [7] describe specific biases in selecting presentations for conferences. Apart from the reliability issue mentioned above, there appear to be individual biases. Peer-reviewers who themselves have a submission for the conference seem to give lower ratings on average than reviewers without personal submissions. On the other hand, submissions with at least one reviewer as an author received significantly higher ratings than submissions with no reviewer as co-author. Another bias mentioned is the experience and professional role of the reviewer. However, it remains unclear how to explain these findings: are peer-reviewers biased, or do their submissions obtain higher scores simply because their authors are more experienced?
Personal style has been seen as influential as well, with reviewers being consistently high or low raters, “hawks and doves”. Normally these tendencies are taken into account by converting the scores into Z-scores and performing a post-review adjustment. Nevertheless, even this practice does not completely solve the problem of low IRR [7].
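The post-review adjustment mentioned above can be sketched in a few lines. The reviewer names and scores below are illustrative, and the sketch assumes each reviewer rated the same set of submissions and did not give identical scores throughout (which would make the standard deviation zero).

```python
# Sketch of the "hawks and doves" adjustment: each reviewer's raw
# scores are converted to Z-scores relative to that reviewer's own
# mean and standard deviation. Illustrative data only.
from statistics import mean, pstdev

raw = {
    "reviewer_a": [4, 5, 4, 5],   # a consistent "dove" (high rater)
    "reviewer_b": [1, 2, 1, 2],   # a consistent "hawk" (low rater)
}

def z_adjust(scores):
    m, sd = mean(scores), pstdev(scores)
    return [(s - m) / sd for s in scores]

adjusted = {name: z_adjust(s) for name, s in raw.items()}
# After adjustment the two reviewers' score patterns coincide, so
# their differing baselines no longer dominate the ranking.
print(adjusted["reviewer_a"] == adjusted["reviewer_b"])  # True
```

The adjustment removes each reviewer's personal baseline, but, as noted above, it cannot repair disagreement about which submissions are actually better.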
Several solutions have been mentioned. Training of reviewers might help, although the results of research on this topic remain ambiguous [2]. Moreover, for very large conferences with a considerable number of reviewers this might not be practical or even feasible. Although training can be performed with large groups, the risk of not reaching all reviewers, or of losing some during the training, increases with group size [8].
There have been some experiments with self-rating of submissions, and open peer review has also been suggested. However, none of these potentially promising solutions seems to solve the reliability problem [5], [9].
Appreciating this problem while having to select papers for an international conference, the authors of this paper conducted a small reliability study. In 2016, during the International Conference on Communication in Healthcare, organized by EACH: International Association for Communication in Healthcare (at that time known as the European Association), a calibration exercise was proposed and feedback was reported back to the participants of the exercise.
The conference planning committee had considerable experience with calibration and training of peer-reviewers for the workshop component of the conference. However, a calibration exercise for oral and poster presentations had not been previously performed, due to the large number of submissions and peer-reviewers required.
The aim of this paper is to report the reliability of ratings for a large international conference and to suggest possible solutions to overcome the problem.
Methods
The first 8 submissions for the conference were used for this exercise. As about 700 submissions for oral presentations/posters were expected, with triple marking for each submission, 125 peer-reviewers were approached to take part in the peer-review process. Seventy-five agreed to serve as peer-reviewers and were eligible for this calibration exercise.
All the reviewers received an invitation to take part in the calibration exercise. They were asked to rate 4 submissions out of the first 8
Participants
Forty-nine peer-reviewers (49/75; 65%) agreed to perform the calibration exercise. As a result, every submission was rated by 22 to 26 peer-reviewers.
Mean scores of the reviewers
The mean score of all reviewers was 3: 79.5% (39/49) had a mean rating between 2.5 and 3.5, 6.1% (3/49) rated higher than 3.5, and 14.2% (7/49) rated lower than 2.5 (Table 2).
Mean scores of the submissions
Six out of 8 submissions were rated between 2.5 and 3.5. Abstract 3 was rated highest, while abstract 5 was rated lowest. Abstract 3 received high scores (3, 4 or 5) but as well 4
Discussion
The main conclusion of this small calibration exercise is consistent with the existing literature and research on this topic: peer-review is unreliable.
Most abstracts, as well as most peer-reviewers, receive and give scores around the median. Contrary to the general assumption that there are high and low scorers, in this group only 3 peer-reviewers could be identified with a high mean score, while 7 had a low mean score. Only 2 reviewers gave exclusively high ratings (4 and 5). Of the eight
Practice implications and possible solutions
The commonest way of improving inter-rater reliability would be to increase the number of raters per abstract. To raise the IRR to 0.80, as many as 18 reviewers might be needed. This could pose a considerable problem for organisers of conferences with a high number of submissions. This topic needs further research, but the literature suggests five reviewers per abstract [4], [10].
Training could also work. Training would enable reviewers to more fully understand their role, understand the potential biases that
Conclusion
Peer-review of submissions for conferences is, in accordance with the literature, unreliable. New and creative methods will be needed to give the participants of a conference what they really deserve: a more reliable selection of the best abstracts.
References (11)
‘Peer review’ for scientific manuscripts: emerging issues, potential threats, and possible remedies. Med. J. Arm. Forces India (2016).
Panel discussion does not improve reliability of peer review for medical research grant proposals. J. Clin. Epidemiol. (2012).
Editorial peer reviewers’ recommendations at a general medical journal: are they reliable and do editors care? PLoS One (2010).
Effects of editorial peer review: a systematic review. JAMA (2002).
Peer review of grant applications: criteria used and qualitative study of reviewer practices. PLoS One (2012).