The Challenge of Preserving Good Data in the Age of AI


By Peter Hall at Undark: Growing up, people of my generation were told to be careful of what we posted online, because “the internet is forever.” But in reality, people lose family photos shared to social media accounts they’ve long since been locked out of. Streaming services pull access to beloved shows, content that was never even possible to own. Journalists, animators, and developers lose years of work when web companies and technology platforms die.

At the same time, artificial intelligence-driven tools such as ChatGPT and the image creator Midjourney have grown in popularity, and some believe they will one day replace work that humans have traditionally done, like writing copy or filming video B-roll. Regardless of their actual ability to perform these tasks, though, one thing is certain: The internet is about to become deluged with a mass of low-effort, AI-generated content, potentially drowning out human work. This oncoming wave poses a problem to computer scientists like me who think about data privacy, fidelity, and dissemination daily. But everyone should be paying attention. Without clear preservation plans in place, we’ll lose a lot of good data and information.

Ultimately, data preservation is a question of resources: Who will be responsible for storing and maintaining information, and who will pay for these tasks to be done? Further, who decides what is worth keeping? Companies developing so-called foundation AI models are some of the key players wanting to catalog online data, but their interests are not necessarily aligned with those of the average person.

The costs of electricity and server space needed to keep data indefinitely add up over time. Data infrastructure must be maintained, in the same way bridges and roads are. Especially for small-scale content publishers, these costs can be onerous. Even if we could just download and back up the entirety of the internet periodically, though, that would not be enough. Just as a library is useless without some sort of organizational structure, any preserved data must be archived mindfully. Compatibility is also an issue. If someday we move on from saving our documents as PDFs, for example, we will need to keep older computers (with compatible software) around.

More here.