Peer-review for selection of oral presentations for conferences: Are we reliable?

https://doi.org/10.1016/j.pec.2017.06.007

Abstract

Introduction

Although peer-review for journal submissions, grant applications and conference submissions has been called ‘a corner-stone of science’, and even ‘the gold standard for evaluating scientific merit’, publications on this topic remain scarce.

Research that has investigated peer-review reveals several issues and criticisms concerning bias, poor-quality review, unreliability and inefficiency. The most important weakness of the peer-review process is the inconsistency between reviewers, leading to inadequate inter-rater reliability.

Aim of the paper

To report the reliability of ratings for a large international conference and to suggest possible solutions to overcome the problem.

Methods

In 2016, during the International Conference on Communication in Healthcare, organized by EACH: International Association for Communication in Healthcare, a calibration exercise was proposed and feedback was reported back to the participants.

Results

Most abstracts, as well as most peer-reviewers, receive and give scores around the median. Contrary to the general assumption that there are high and low scorers, in this group only 3 peer-reviewers could be identified with a high mean score, while 7 had a low mean score. Only 2 reviewers gave exclusively high ratings (4 and 5). Of the eight abstracts included in this exercise, only one abstract received a high mean score and one a low mean score. Nevertheless, both of these abstracts received both low and high scores; all other abstracts received all possible scores.

Discussion

Peer-review of submissions for conferences is, in accordance with the literature, unreliable. New and creative methods will be needed to give the participants of a conference what they really deserve: a more reliable selection of the best abstracts.

Practice implications

Using more raters per abstract improves inter-rater reliability; training of reviewers could be helpful; providing feedback to reviewers can lead to less inter-rater disagreement; and fostering negative peer-review (rejecting inappropriate submissions) rather than positive peer-review (accepting the best) could be fruitful for selecting abstracts for conferences.

Introduction

Peer-review is a widely used method to select presentations for conferences, review journal papers and assess grant applications. The aim of peer-review is twofold: to improve the quality of research and to help editors and conference planning committees with the decision-making process of accepting versus rejecting submissions.

It is assumed that this practice raises the quality of the end-product, especially if combined with providing feedback and suggestions for change. Moreover, it is assumed that peer-review provides a mechanism for rational, fair and objective decision-making. Peer-review has been called ‘a corner-stone of science and of quality assurance’, and even ‘the gold standard for evaluating scientific merit’ [1].

It is therefore surprising that research on this topic remains scarce. Research that has investigated peer-review reveals several issues and criticisms concerning bias, poor-quality review, unreliability and inefficiency [2], [3]. The number of peer-reviewers needed to overcome the problems mentioned above remains unclear [4]. Moreover, the exponential growth of scientific publications during the last decade could make the approach difficult due to a lack of capacity among potential reviewers [5].

The most important weakness of the peer-review process described in research is the inconsistency between reviewers, leading to inadequate inter-rater reliability (IRR): ‘there is a low and even insufficient level of IRR’ is the conclusion of most research on this topic. Scores between 0.07 and 0.20 have been demonstrated [6]. However, research concerning IRR has itself been criticized as being imprecise, as much research on IRR has been performed qualitatively. Bornmann et al. conducted a meta-analysis on the reliability of journal peer-review. Their conclusion was simple: the IRR of peer assessment is limited and needs improvement [7].
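As an illustration of how such IRR figures are typically computed, the sketch below calculates a Shrout-Fleiss ICC(2,1) (two-way random effects, absolute agreement, single rater) for a hypothetical, fully crossed design in which every reviewer rates every abstract on a 1-5 scale. The rating matrix is invented purely for illustration; it is not data from any of the cited studies or from this conference.

```python
# Minimal sketch: ICC(2,1) for a fully crossed design of n abstracts x k reviewers.
# The rating matrix below is invented for illustration only.
import numpy as np

def icc_2_1(x: np.ndarray) -> float:
    """Shrout & Fleiss ICC(2,1) for an n (targets) x k (raters) matrix."""
    n, k = x.shape
    grand = x.mean()
    ms_rows = k * np.sum((x.mean(axis=1) - grand) ** 2) / (n - 1)   # between abstracts
    ms_cols = n * np.sum((x.mean(axis=0) - grand) ** 2) / (k - 1)   # between reviewers
    ss_err = np.sum((x - grand) ** 2) - (n - 1) * ms_rows - (k - 1) * ms_cols
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )

# Hypothetical 1-5 ratings: 8 abstracts (rows) x 5 reviewers (columns).
ratings = np.array([
    [3, 4, 2, 3, 3],
    [4, 5, 3, 4, 4],
    [2, 3, 2, 1, 3],
    [3, 3, 4, 3, 2],
    [5, 4, 3, 4, 5],
    [2, 2, 3, 2, 1],
    [3, 4, 3, 3, 4],
    [4, 3, 2, 3, 3],
])
print(f"ICC(2,1) = {icc_2_1(ratings):.2f}")
```

With real conference data the design is usually incomplete (each reviewer rates only a subset of the abstracts), so mixed-model or generalizability-theory estimates would be needed instead of this simple ANOVA decomposition.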

Blind peer-review has been suggested as one solution. Although widely believed to be an effective way to at least minimize potential reviewer bias, blind peer-review does not solve a major component of the reliability problem, namely inconsistencies between two or more reviewers [5], [6].

The literature on the selection of oral presentations for conferences is even scarcer. Blackburn and Hakel [7] describe specific biases in selecting presentations for conferences. Apart from the reliability issue mentioned above, there seem to be individual biases. Peer-reviewers who themselves have a submission for the conference seem to rate, on average, lower than reviewers without personal submissions. On the other hand, submissions with at least one reviewer as an author of a presentation for the conference received significantly higher ratings than submissions with no reviewer as co-author. Another bias mentioned is the experience and professional role of the reviewer. However, it remains unclear how to explain these findings: are peer-reviewers biased, or do they obtain higher scores simply because they are more experienced?

Personal style has been seen as influential as well, with reviewers being consistently either high or low raters, the so-called “hawks and doves”. Normally these tendencies are taken into account by converting the scores into Z-scores and performing a post-review adjustment. Nevertheless, even this practice does not completely solve the problem of low IRR [7].
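A minimal sketch of such a post-review adjustment, assuming hypothetical data and column names, is given below: each reviewer's raw scores are standardized against that reviewer's own mean and spread, so a “hawk” and a “dove” contribute on a comparable scale before per-abstract averages are formed. This illustrates the general idea only, not the procedure of any particular conference.

```python
# Minimal sketch of a hawk/dove post-review adjustment:
# standardise each reviewer's scores before averaging per abstract.
# Data and column names are hypothetical.
import pandas as pd

reviews = pd.DataFrame({
    "abstract": [1, 2, 3, 1, 2, 3, 1, 2, 3],
    "reviewer": ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
    "score":    [4, 5, 4, 2, 3, 2, 3, 4, 2],   # reviewer A is a lenient 'dove', B a strict 'hawk'
})

# Z-score within each reviewer: (score - reviewer mean) / reviewer spread.
reviews["z"] = reviews.groupby("reviewer")["score"].transform(
    lambda s: (s - s.mean()) / s.std(ddof=0)
)

# Rank abstracts on the mean adjusted score rather than the raw mean.
ranking = reviews.groupby("abstract")[["score", "z"]].mean().sort_values("z", ascending=False)
print(ranking)
```

A reviewer who gives the same score to every assigned abstract has zero variance and would need special handling; and, as noted above, even such adjustments do not fully resolve the low IRR [7].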

Several solutions have been mentioned. Training of reviewers might be a solution, although the results of research on this topic remain ambiguous [2]. Moreover, for very large conferences with a considerable number of reviewers this might not be practical or even possible. Although training can be performed with large groups, the risk of not reaching all reviewers, or of losing some during the training, is much greater with large groups [8].

There have been some experiments with self-rating of submissions, and open peer review has been mentioned. However, none of these potentially promising solutions seems to solve the reliability problem [5], [9].

Appreciating this problem and having to select papers for an international conference, the authors of this paper conducted a small reliability study. In 2016, during the International Conference on Communication in Healthcare, organized by EACH: International Association for Communication in Healthcare (at that time known as the European Association), a calibration exercise was proposed and feedback was reported back to the participants.

The conference planning committee had considerable experience with calibration and training of peer-reviewers for the workshop component of the conference. However, a calibration exercise for oral and poster presentations had not been previously performed, due to the large number of submissions and peer-reviewers required.

Aim of the paper: to report the reliability of ratings for a large international conference and to suggest possible solutions to overcome the problem.

Section snippets

Methods

The first 8 submissions for the conference were used for this exercise. As about 700 submissions for oral presentations/posters were expected, with triple marking for each submission, 125 peer-reviewers were approached to take part in the peer-review process. Seventy-five agreed to serve as peer-reviewers and were eligible for this calibration exercise.
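For context, the reviewer workload implied by these numbers can be worked out with simple arithmetic; the sketch below is illustrative only and is not a calculation reported by the authors.

```python
# Back-of-the-envelope reviewer workload implied by the figures above
# (700 expected submissions, triple marking, 125 invited / 75 accepting reviewers).
# Purely illustrative arithmetic, not taken from the paper.
expected_submissions = 700
reviews_per_submission = 3
invited, accepted = 125, 75

total_reviews = expected_submissions * reviews_per_submission
print(total_reviews)                       # 2100 reviews in total
print(round(total_reviews / invited, 1))   # ~16.8 per reviewer if all 125 take part
print(round(total_reviews / accepted, 1))  # 28.0 per reviewer with the 75 who agreed
```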

All the reviewers received an invitation to take part in the calibration exercise. They were asked to rate 4 submissions out of the first 8

Participants

Forty-nine peer-reviewers (49/75; 65%) agreed to perform the calibration exercise. As a result, every submission was rated by 22 to 26 peer-reviewers.

Mean scores of the reviewers

The mean score of all reviewers was 3: 79.5% (39/49) gave a mean rating between 2.5 and 3.5 (inclusive), 6.1% (3/49) rated higher than 3.5, and 14.2% (7/49) rated lower than 2.5 (Table 2).

Mean scores of the submissions

Six out of 8 submissions were rated between 2.5 and 3.5. Abstract 3 was rated highest, while abstract 5 was rated lowest. Abstract 3 received high scores (3, 4 or 5) but as well 4

Discussion

The main conclusion of this small calibration exercise is consistent with the existing literature and research on this topic: peer-review is unreliable.

Most abstracts, as well as most peer-reviewers, receive and give scores around the median. Contrary to the general assumption that there are high and low scorers, in this group only 3 peer-reviewers could be identified with a high mean score, while 7 had a low mean score. Only 2 reviewers gave exclusively high ratings (4 and 5). Of the eight

Practice implications and possible solutions

The most common way of improving inter-rater reliability would be to have more raters per abstract. To increase the IRR to 0.80, 18 reviewers might be needed. This might cause a considerable problem for organisers of conferences with a high number of submissions. This topic needs further research, but the literature suggests five reviewers per abstract [4], [10].
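The figure of 18 reviewers is consistent with the Spearman-Brown prophecy formula, which predicts the reliability of the mean of m raters from the single-rater reliability r as R_m = m·r / (1 + (m - 1)·r). The sketch below solves this for m; the single-rater reliability of 0.19 used here is an assumption chosen from the 0.07-0.20 range cited in the Introduction, not a value reported for this conference.

```python
# Minimal sketch: Spearman-Brown prophecy formula, solved for the number of
# raters m needed to reach a target reliability from a single-rater reliability r.
# r = 0.19 is an assumed value within the 0.07-0.20 range cited earlier.
import math

def raters_needed(single_rater_r: float, target: float) -> int:
    m = target * (1 - single_rater_r) / (single_rater_r * (1 - target))
    return math.ceil(m)

print(raters_needed(0.19, 0.80))  # -> 18; with r = 0.20 the formula gives 16
```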

Training could also work. Training would enable reviewers to more fully understand their role, understand the potential biases that

Conclusion

Peer-review of submissions for conferences is, in accordance with the literature, unreliable. New and creative methods will be needed to give the participants of a conference what they really deserve: a more reliable selection of the best abstracts.
