Proposed Solution
Referee is designed to enhance the accuracy and reliability of academic research by introducing a universal reliability score: a single number indicating how reliable a paper is in terms of correctness and soundness. Many elements beyond the methodology and research questions affect that score, including pre-registration, the availability and quality of research data, the reliability of the papers cited, contentiousness/divisiveness, and readability (grade level and perhaps style, since knowledge sharing should be about the numbers and graphs), among others. By default, a new paper receives the median score, adjusted for initial conditions (e.g. available data/code, pre-registration). The score is then adjusted by every bounty claim, depending on whether the claim was successful and on the weight of the identified weakness in the reliability score algorithm. This is the 'voting' mechanism for paper reliability, and a log of every claim will be readable. For revised papers, authors must first submit evidence of changes that address the weaknesses identified by claimed bounties; the new version's score is then adjusted and bounties are reissued for it.
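The score lifecycle described above could be sketched as follows. All names, weights, and the 0.5 median baseline are illustrative assumptions, not the actual Referee algorithm:

```python
MEDIAN_SCORE = 0.5  # assumed baseline for a newly submitted paper

# Illustrative bonuses for favourable initial conditions.
INITIAL_CONDITION_BONUS = {
    "preregistered": 0.05,
    "data_available": 0.05,
    "code_available": 0.05,
}

def initial_score(conditions):
    """Start at the median and adjust for initial conditions."""
    score = MEDIAN_SCORE
    for cond in conditions:
        score += INITIAL_CONDITION_BONUS.get(cond, 0.0)
    return min(score, 1.0)

def apply_claim(score, weakness_weight, successful):
    """Adjust the score for one bounty claim.

    A successful claim lowers the score in proportion to the weakness's
    weight; here a failed claim nudges it up slightly (an assumption).
    """
    if successful:
        return max(score - weakness_weight, 0.0)
    return min(score + 0.1 * weakness_weight, 1.0)

# A pre-registered paper with open data, then one successful claim.
score = initial_score(["preregistered", "data_available"])
score = apply_claim(score, weakness_weight=0.08, successful=True)
```

Every such adjustment would also be appended to the paper's public claim log, so the score's history stays auditable.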
Two components are necessary to enable the score. The first is a common research weakness enumeration (CRWE) that lists, at a granular level, all the ways research may be unreliable. The second is a bug bounty system built on top of the CRWE to incentivize researchers to identify flaws in papers. Together, these mechanisms ensure that the reliability score is both robust and dynamic, continually refining the quality of academic outputs.
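A minimal sketch of how a CRWE and its bounties might be represented, by analogy with the software CWE catalogue. The IDs, titles, weights, and reward formula are invented for illustration:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Weakness:
    crwe_id: str   # stable identifier, analogous to a CWE entry
    title: str
    weight: float  # impact on the reliability score if confirmed

# Hypothetical catalogue entries.
CRWE = {
    "CRWE-001": Weakness("CRWE-001", "Underpowered sample size", 0.10),
    "CRWE-002": Weakness("CRWE-002", "No pre-registration", 0.05),
    "CRWE-003": Weakness("CRWE-003", "Inappropriate statistical test", 0.12),
    "CRWE-004": Weakness("CRWE-004", "Data unavailable for reanalysis", 0.08),
}

@dataclass
class Bounty:
    paper_doi: str
    crwe_id: str
    reward: float  # payout for a successful claim

def open_bounty(paper_doi, crwe_id, base_reward=10.0):
    """Issue a bounty whose reward scales with the weakness's weight."""
    weight = CRWE[crwe_id].weight
    return Bounty(paper_doi, crwe_id, base_reward * (1 + weight))

bounty = open_bounty("10.1000/example", "CRWE-003")
```

Keeping weights in the catalogue itself means the scoring algorithm and the bounty rewards stay consistent with one another.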
Several recent advancements make the timing of Referee particularly apt, including the following:
Open pre-print repositories. Several open-source archives exist for researchers to upload their pre-prints and published articles. These include PsyArXiv for psychology, bioRxiv for biology and related fields, arXiv for physics, mathematics, computer science, and related fields, SocArXiv for sociology and social sciences, and medRxiv for health sciences. arXiv alone contains over 2.2 million papers. These repositories essentially provide free raw material to validate the Referee system.
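Pulling raw material from these archives is straightforward. As one example, the arXiv API (`http://export.arxiv.org/api/query`) serves Atom XML; the query fields below follow its documented parameters, while the category choice and result size are arbitrary:

```python
from urllib.parse import urlencode

ARXIV_API = "http://export.arxiv.org/api/query"

def arxiv_query_url(category, start=0, max_results=100):
    """Build a query URL for the newest submissions in one category."""
    params = {
        "search_query": f"cat:{category}",  # e.g. cat:stat.ME
        "start": start,
        "max_results": max_results,
        "sortBy": "submittedDate",
        "sortOrder": "descending",
    }
    return f"{ARXIV_API}?{urlencode(params)}"

# Ten newest statistics-methodology preprints.
url = arxiv_query_url("stat.ME", max_results=10)
```

A crawler would fetch this URL on a schedule and feed each new entry's DOI or arXiv ID into the scoring pipeline.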
Generative AI agents. These models have progressed rapidly over the past few years and are nearing human-like reasoning skills in many domains. Even before the advent of ChatGPT, academic publishers such as Elsevier used advanced machine learning and artificial intelligence models to improve productivity and outcomes. Referee intends to initially create one or more robot scanners that use fine-tuned versions of existing models (e.g. GPT-4) or customised ones to conduct a preliminary review of papers. These may operate as crawlers that continually review the preprint archives, both to capture new papers and to re-test previously reviewed papers with better models. Community members are encouraged to develop their own scanning bots and would be rewarded for producing effective ones. A scanning bot might be specialised to look for specific flaws (e.g. the strength of statistical tests, or whether trials were truly randomised) or general, providing an overall score. Specific bots can check the similarity of papers to detect those produced by paper mills, and plagiarised papers more generally. Some researchers have already developed such bots, but their findings remain dispersed and hard to aggregate. The existence of several reliability scores would be a feature, not a bug, just as various readability score methodologies provide multiple insights into the ease of reading a text. Finally, bots can provide quality translations into most languages, lowering the barrier to participation in the process.
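For community bots to be aggregable rather than dispersed, they would need a shared reporting interface. A hypothetical plug-in contract might look like this; the `Finding` shape, the CRWE ID, and the sample bot's threshold are all invented:

```python
from dataclasses import dataclass
from typing import Protocol

@dataclass
class Finding:
    crwe_id: str       # weakness identifier from the shared catalogue
    confidence: float  # 0..1, the bot's own certainty
    note: str

class ScannerBot(Protocol):
    """Contract every community scanning bot would implement."""
    name: str
    def scan(self, paper: dict) -> list[Finding]: ...

class SampleSizeBot:
    """Specialised bot: flags studies with very small samples."""
    name = "sample-size-bot"

    def scan(self, paper):
        n = paper.get("sample_size")
        if n is not None and n < 30:  # illustrative cut-off
            return [Finding("CRWE-001", 0.7, f"n={n} may be underpowered")]
        return []

findings = SampleSizeBot().scan({"doi": "10.1000/example", "sample_size": 12})
```

Because every bot emits the same `Finding` records, specialised and general scanners can be pooled into one review, and a bot's track record of confirmed findings can determine its reward.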
Scholarly research metadata. In the past two decades, significant advancements in tracking systems like Digital Object Identifiers (DOIs) for papers, Open Researcher and Contributor IDs (ORCID) for researchers, Research Organization Registry (ROR) for institutions, and DataCite for research data have greatly enhanced our ability to precisely target specific papers, researchers, or organizations with bounties. These identifiers also allow papers to be connected into citation graphs, so that reliability scores can be propagated to papers that cite previously scored research.
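One way such propagation could work: a paper's effective score blends its own score with the scores of the papers it cites. The DOIs, scores, and the 20% damping factor below are made up for illustration:

```python
def effective_score(doi, own_scores, citations, damping=0.2):
    """Blend a paper's own score with the mean score of what it cites."""
    own = own_scores[doi]
    cited = citations.get(doi, [])
    if not cited:
        return own
    cited_mean = sum(own_scores[c] for c in cited) / len(cited)
    return (1 - damping) * own + damping * cited_mean

# Toy citation graph keyed by DOI: paper c cites papers a and b.
own_scores = {"10.1/a": 0.9, "10.1/b": 0.4, "10.1/c": 0.7}
citations = {"10.1/c": ["10.1/a", "10.1/b"]}

score_c = effective_score("10.1/c", own_scores, citations)
```

Here a paper that leans on an unreliable source (b, at 0.4) sees its effective score pulled below its own 0.7, which is the porting behaviour described above.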
Privacy tools. Privacy can apply to people, organisations, and/or data. Reviewer privacy would likely increase the willingness of academics to participate, as it reduces the risk of retribution and reputational harm. In reviewing papers, the focus should always be on the content of comments rather than on individuals, reducing bias and discrimination. Anonymising paper authors (and even paper citations) could also be considered; this would reduce status biases affecting reviews. On the data side, homomorphic encryption, which allows computations to be performed on encrypted data without first having to decrypt it, would permit sensitive data to be shared for testing purposes. Although homomorphic encryption is computationally expensive, possibly limiting the complexity of the statistical tests that can be performed, it remains an active area of research.
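The core property can be demonstrated with textbook Paillier encryption, an additively homomorphic scheme: multiplying two ciphertexts yields an encryption of the sum of their plaintexts, so a verifier could total encrypted measurements without seeing them. The tiny primes below are for demonstration only; a real deployment would use a vetted library and far larger parameters:

```python
import math
import random

p, q = 61, 53                 # demo primes; far too small for security
n = p * q
n2 = n * n
g = n + 1                     # standard simple choice of generator
lam = math.lcm(p - 1, q - 1)
mu = pow(lam, -1, n)          # valid shortcut because g = n + 1

def encrypt(m):
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c):
    u = pow(c, lam, n2)
    return ((u - 1) // n * mu) % n

def add_encrypted(c1, c2):
    """Homomorphic addition: multiplying ciphertexts adds plaintexts."""
    return (c1 * c2) % n2

# Sum two sensitive values without ever decrypting the inputs.
c = add_encrypted(encrypt(17), encrypt(25))
total = decrypt(c)
```

Even this toy version shows the cost trade-off noted above: each operation involves modular exponentiation over numbers the square of the modulus, which is why complex statistical tests remain expensive under homomorphic encryption.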