These Mini AI Models Match OpenAI With 1,000 Times Less Data

by Jason Dorrier at Singularity Hub: The artificial intelligence industry is obsessed with size. Bigger algorithms. More data. Sprawling data centers that could, in a few years, consume enough electricity to power whole cities.

This insatiable appetite is why OpenAI—which is on track to make $3.7 billion in revenue but lose $5 billion this year—just announced it’s raised $6.6 billion more in funding and opened a line of credit for another $4 billion.

Eye-popping numbers like these make it easy to forget size isn’t everything.

Some researchers, particularly those with fewer resources, are aiming to do more with less. AI scaling will continue, but those algorithms will also get far more efficient as they grow.

Last week, researchers at the Allen Institute for Artificial Intelligence (Ai2) released a new family of open-source multimodal models competitive with state-of-the-art models like OpenAI’s GPT-4o—but an order of magnitude smaller. Called Molmo, the models range from 1 billion to 72 billion parameters. GPT-4o, by comparison, is estimated to top a trillion parameters.

It’s All in the Data

Ai2 said it accomplished this feat by focusing on data quality over quantity.

Algorithms fed billions of examples, like GPT-4o, are impressively capable. But they also ingest a ton of low-quality information. All this noise consumes precious computing power.

To build their new multimodal models, Ai2 assembled a backbone of existing large language models and vision encoders. They then compiled a more focused, higher-quality dataset of around 700,000 images and 1.3 million captions to train new models with visual capabilities. That may sound like a lot, but it's on the order of 1,000 times less data than is used to train proprietary multimodal models.
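The article doesn't include code, but the recipe it describes—pairing an existing vision encoder with an existing language model through a small learned connector—is a common pattern in open multimodal work. Here is a minimal sketch of that pattern in PyTorch; the class name, dimensions, and the Hugging Face-style `inputs_embeds` interface are illustrative assumptions, not Ai2's actual implementation.

```python
# Minimal sketch of a connector-style multimodal backbone: a pretrained vision
# encoder produces image patch features, a small projection maps them into the
# language model's embedding space, and the two token streams are concatenated
# before being fed to the language model. Names and dimensions are illustrative.
import torch
import torch.nn as nn

class MultimodalBackbone(nn.Module):
    def __init__(self, vision_encoder, language_model, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.vision_encoder = vision_encoder   # e.g., a pretrained ViT-style image encoder
        self.language_model = language_model   # e.g., a pretrained decoder-only LLM
        # Learned connector that turns image patch features into "visual tokens."
        self.connector = nn.Linear(vision_dim, llm_dim)

    def forward(self, image, text_embeddings):
        patch_features = self.vision_encoder(image)        # (batch, patches, vision_dim)
        image_tokens = self.connector(patch_features)       # (batch, patches, llm_dim)
        inputs = torch.cat([image_tokens, text_embeddings], dim=1)
        # Assumes a Hugging Face-style model that accepts precomputed embeddings.
        return self.language_model(inputs_embeds=inputs)
```

In setups like this, most of the parameters come from the reused encoder and language model; only the connector (and optionally parts of the backbone) needs to be trained on the new image-caption data.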

Instead of writing captions, the team asked annotators to record 60- to 90-second verbal descriptions answering a list of questions about each image. They then transcribed the descriptions—which often stretched across several pages—and used other large language models to clean up, crunch down, and standardize them. They found that this simple switch, from written to verbal annotation, yielded far more detail with little extra effort.
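That annotation workflow—record speech, transcribe it, then have a language model clean and standardize the result—is easy to picture as a small pipeline. The sketch below is a rough illustration; the `transcribe` and `clean_with_llm` helpers are placeholders, since the article doesn't name the specific tools Ai2 used.

```python
# Rough sketch of the spoken-annotation pipeline the article describes:
# annotators record 60-90 second verbal descriptions of each image, the audio
# is transcribed, and a language model condenses the transcript into a clean,
# standardized caption. The helper callables are placeholders, not Ai2's tools.
def build_caption(audio_path, image_id, transcribe, clean_with_llm):
    transcript = transcribe(audio_path)  # speech-to-text on the spoken description
    prompt = (
        "Rewrite the following spoken image description as a clean, detailed "
        "caption. Keep every visual detail and remove filler words:\n\n"
        + transcript
    )
    caption = clean_with_llm(prompt)  # an LLM cleans up and standardizes the text
    return {"image_id": image_id, "caption": caption}
```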

Tiny Models, Top Dogs

The results are impressive.

According to a technical paper describing the work, the team’s largest model, Molmo 72B, roughly matches or outperforms state-of-the-art closed models—including OpenAI’s GPT-4o, Anthropic’s Claude 3.5 Sonnet, and Google’s Gemini 1.5 Pro—across 11 academic benchmarks as well as in user preference rankings. Even the smaller Molmo models, which are a tenth the size of the largest, compare favorably to state-of-the-art models.

Molmo can also point to the things it identifies in images. This kind of skill might help developers build AI agents that identify buttons or fields on a webpage to handle tasks like making a reservation at a restaurant. Or it could help robots better identify and interact with objects in the real world.
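To make that concrete, an agent built on a pointing-capable model might ask where a UI element is, parse the returned coordinates, and click there. The snippet below is a hypothetical sketch: the point-tag format and the `click` callback are assumptions for illustration, not Molmo's documented output format.

```python
# Illustrative sketch of using a pointing-capable model's answer to drive a UI
# action: extract (x, y) coordinates from the model's text and pass them to a
# click handler (e.g., browser automation or a robot controller). The tag
# format parsed here is assumed, not quoted from Molmo's documentation.
import re

def find_and_click(model_answer, click):
    # Expect something like: <point x="43.5" y="12.0">Reserve button</point>
    match = re.search(r'<point x="([\d.]+)" y="([\d.]+)"', model_answer)
    if match is None:
        return False
    x, y = float(match.group(1)), float(match.group(2))
    click(x, y)  # caller-supplied callback that performs the actual click
    return True
```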

Ai2 CEO Ali Farhadi acknowledged it’s debatable how much benchmarks can tell us. But we can use them to make a rough model-to-model comparison.

“There are a dozen different benchmarks that people evaluate on. I don’t like this game, scientifically… but I had to show people a number,” Farhadi said at a Seattle release event. “Our biggest model is a small model, 72B, it’s outperforming GPTs and Claudes and Geminis on those benchmarks. Again, take it with a grain of salt; does this mean that this is really better than them or not? I don’t know. But at least to us, it means that this is playing the same game.”

More here.