C. Brandon Ogbunu at Undark revisits a debate about so-called “research parasites” — scientists who use and reanalyze other people’s data. The large language models behind generative AI tools such as ChatGPT have made this debate about data sharing relevant again.
Ogbunu writes: A 2016 editorial published in the New England Journal of Medicine lamented the existence of “research parasites,” those who pick over the data of others rather than generating new data themselves. The article touched on the ethics and appropriateness of this practice. The most charitable interpretation of the argument centered on the hard work and effort that goes into generating new data, which costs millions of research dollars and takes countless person-hours. Whatever the merits of that argument, the editorial and its associated arguments were widely criticized.
Given recent advances in AI, revisiting the research parasite debate offers a new perspective on the ethics of sharing and data democracy. It is ironic that the critics of research parasites might have made a sound argument — but for the wrong setting, aimed at the wrong target, at the wrong time. Specifically, the large language models, or LLMs, that underlie generative AI tools such as OpenAI’s ChatGPT pose an ethical challenge in how they parasitize freely available data. These discussions open new conversations about data security that may undermine, or at least complicate, efforts at openness and data democratization.
The backlash to that 2016 editorial was swift and fierce. Many arguments centered on the anti-science spirit of the message. For example, meta-analysis, which reanalyzes data from a selection of studies, is a critical practice that should be encouraged. Many groundbreaking discoveries about the natural world and human health have come from this practice, including new pictures of the molecular causes of depression and schizophrenia. Further, the central criticisms of research parasitism undermine the ethical goals of data sharing and the ambitions of open science, in which scientists and citizen-scientists alike can benefit from access to data. This differs from the status quo in 2016, when data published in many of the world’s top journals were locked behind a paywall, illegible, poorly labeled, or difficult to use. This remains largely true in 2024.
The “research-parasites-are-bad” movement didn’t go very far. The importance of data democratization has been argued for many years and has led to meaningful changes in the practice of science. Licensing options through Creative Commons have become standard for published research in many subfields, giving authors a way to state how they want their work to be used. This system includes options that lean toward data democracy, such as the CC BY licenses. Notably, several of these licenses allow content to be used commercially.
More here.