Here’s a big money quote from DYNOMIGHT, from an assessment of what it costs to scale AI toward better intelligence (reduced loss):
…everyone reports that filtering the raw internet makes models better. They also report that including small but high-quality sources makes things better. But how much better? And why? As far as I can tell, there is no general “theory” for this. We might discover that counting tokens only takes you so far, and 10 years from now there is an enormous infrastructure for curating and cleaning data from hundreds of sources and people look back on our current fixation on the number of tokens with amusement.
10 years from now? There’s already a standard for infrastructure that curates and cleans data from hundreds of millions of sources: the W3C’s solidproject.org.
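Since Solid is just HTTP plus Linked Data, pulling a curated document out of someone’s Pod is a one-request affair. Here’s a minimal sketch in Python; the Pod URL is hypothetical (alice.example.org is made up), and authenticated Solid-OIDC access is left out entirely:

```python
# A minimal sketch of reading from a Solid Pod over plain HTTP.
# The Pod URL below is hypothetical; real Pods are just web servers,
# so a GET with an RDF Accept header returns the resource as Turtle.
import requests

POD_RESOURCE = "https://alice.example.org/public/profile/card"  # hypothetical

def fetch_pod_resource(url: str) -> str:
    """Fetch a public Solid Pod resource as Turtle (RDF)."""
    response = requests.get(url, headers={"Accept": "text/turtle"}, timeout=10)
    response.raise_for_status()
    return response.text

if __name__ == "__main__":
    print(fetch_pod_resource(POD_RESOURCE))
```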
Everyone says filtering data is better for intelligence? That’s nice to hear.
Nobody should believe in absolute freedom of speech once they factor in the high cost of predictable errors caused by data negligence.
How high? VERY high.
…the best current models have a total error of around 0.24 and cost around $2.5 million. To drop that to a total error of 0.12 would “only” cost around $230 million. …to scale a LLM to maximum performance would cost much more—with current technology, more than the GDP of the entire planet.
That roughly 100X cost to halve the error seems prohibitive, although I’m sure someone is thinking $230 million is NBD, like just one oligarch yacht.
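Here’s the back-of-the-envelope version (my extrapolation, not DYNOMIGHT’s exact curve: it assumes each further halving of the error costs roughly the same ~92X multiplier implied by the $2.5 million to $230 million jump):

```python
# Back-of-the-envelope extrapolation: if halving the total error multiplies
# cost by a roughly constant factor, a few more halvings blow past world GDP.
START_ERROR = 0.24                   # DYNOMIGHT's figure for the best current models
START_COST = 2.5e6                   # ~$2.5 million
COST_PER_HALVING = 230e6 / 2.5e6     # ~92X, the quoted jump for 0.24 -> 0.12

error, cost = START_ERROR, START_COST
for _ in range(4):
    error /= 2
    cost *= COST_PER_HALVING
    print(f"error ~{error:.3f}   cost ~${cost:,.0f}")
# Four halvings later the bill is on the order of $180 trillion,
# i.e. "more than the GDP of the entire planet."
```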
DYNOMIGHT seems to be warning us that current LLM architectures are about to run into a serious scaling limit: they already train on essentially all practically available data, and the compute costs are already far too high.
Add integrity, gain intelligence at far lower cost, and the game completely changes. Serious food for thought, considering Microsoft’s oligarchical ChatGPT has just been beaten by a pedestrian Web browser.
A month ago I asked “Could you train a ChatGPT-beating model for $85,000 and run it in a browser?” $85,000 was a hypothetical training cost for LLaMA 7B plus Stanford Alpaca. “Run it in a browser” was based on the fact that Web Stable Diffusion runs a 1.9GB Stable Diffusion model in a browser, so maybe it’s not such a big leap to run a small Large Language Model there as well.
That second part has now happened.