Author
Listed:
- David N Nicholson
- Vincent Rubinetti
- Dongbo Hu
- Marvin Thielk
- Lawrence E Hunter
- Casey S Greene
Abstract
Preprints allow researchers to make their findings available to the scientific community before they have undergone peer review. Studies on preprints within bioRxiv have been largely focused on article metadata and how often these preprints are downloaded, cited, published, and discussed online. A missing element that has yet to be examined is the language contained within the bioRxiv preprint repository. We sought to compare and contrast linguistic features within bioRxiv preprints to published biomedical text as a whole as this is an excellent opportunity to examine how peer review changes these documents. The most prevalent features that changed appear to be associated with typesetting and mentions of supporting information sections or additional files. In addition to text comparison, we created document embeddings derived from a preprint-trained word2vec model. We found that these embeddings are able to parse out different scientific approaches and concepts, link unannotated preprint–peer-reviewed article pairs, and identify journals that publish linguistically similar papers to a given preprint. We also used these embeddings to examine factors associated with the time elapsed between the posting of a first preprint and the appearance of a peer-reviewed publication. We found that preprints with more versions posted and more textual changes took longer to publish. 
Lastly, we constructed a web application (https://greenelab.github.io/preprint-similarity-search/) that allows users to identify which journals and articles are most linguistically similar to a bioRxiv or medRxiv preprint, as well as to observe where the preprint would be positioned within a published article landscape.
This study analyzes the full text content of the bioRxiv preprint repository, identifying field-specific patterns and changes that occur during publication, and providing a search tool that can identify the published papers that are most similar to a given bioRxiv or medRxiv preprint.
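To illustrate how document embeddings can be used to find linguistically similar journals, here is a minimal sketch. It is not the authors' implementation: it assumes a document embedding is the mean of its word vectors (the abstract does not specify how embeddings are built), and the toy word vectors, journal names, and function names are all hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

# Toy 3-dimensional word vectors (hypothetical values); a real system
# would load vectors from a trained word2vec model instead.
WORD_VECTORS: Dict[str, List[float]] = {
    "gene": [0.9, 0.1, 0.0],
    "expression": [0.8, 0.2, 0.1],
    "neuron": [0.1, 0.9, 0.0],
    "spike": [0.0, 0.8, 0.2],
    "model": [0.3, 0.3, 0.4],
}


def document_embedding(
    tokens: Sequence[str], vectors: Dict[str, List[float]]
) -> List[float]:
    """Average the vectors of all tokens found in the vocabulary (assumed scheme)."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        raise ValueError("document contains no in-vocabulary tokens")
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(
    query: Sequence[float], candidates: Dict[str, List[float]]
) -> List[Tuple[str, float]]:
    """Return (name, similarity) pairs sorted from most to least similar."""
    scores = {name: cosine_similarity(query, emb) for name, emb in candidates.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical journal "centroids": each journal is represented by the
# embedding of a few words typical of its (made-up) articles.
JOURNAL_CENTROIDS: Dict[str, List[float]] = {
    "Hypothetical Genomics Journal": document_embedding(
        ["gene", "expression", "model"], WORD_VECTORS
    ),
    "Hypothetical Neuroscience Journal": document_embedding(
        ["neuron", "spike", "model"], WORD_VECTORS
    ),
}

# Embed a toy preprint and rank the journals by cosine similarity.
preprint = document_embedding("gene expression in a neuron".split(), WORD_VECTORS)
ranking = rank_by_similarity(preprint, JOURNAL_CENTROIDS)
for journal, score in ranking:
    print(f"{journal}: {score:.3f}")
```

Out-of-vocabulary tokens are skipped here rather than mapped to a zero vector, so they do not pull the average toward the origin; other weighting schemes are equally plausible.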
Suggested Citation
David N Nicholson & Vincent Rubinetti & Dongbo Hu & Marvin Thielk & Lawrence E Hunter & Casey S Greene, 2022.
"Examining linguistic shifts between preprints and publications,"
PLOS Biology, Public Library of Science, vol. 20(2), pages 1-22, February.
Handle:
RePEc:plo:pbio00:3001470
DOI: 10.1371/journal.pbio.3001470