Meta’s surprise Llama 4 drop exposes the gap between AI ambition and reality

Meta constructed the Llama 4 models using a mixture-of-experts (MoE) architecture, one way around the limitations of running huge AI models. Think of MoE as a large team of specialized workers: instead of everyone working on every task, only the relevant specialists activate for a specific job.

For example, Llama 4 Maverick has 400 billion total parameters, but only 17 billion of them are active at once, routed through one of 128 experts. Likewise, Scout has 109 billion total parameters, with only 17 billion active at once across one of 16 experts. This design can reduce the computation needed to run the model, since only a fraction of the network’s weights are in use at any given moment.
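To make the routing idea concrete, here is a minimal, illustrative sketch of top-1 expert routing in PyTorch. It is not Meta’s implementation; Llama 4’s actual router, expert sizes, and shared-expert details differ, and the ToyMoELayer class and its dimensions are invented for illustration.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ToyMoELayer(nn.Module):
        """Minimal mixture-of-experts layer: a router sends each token to one expert."""

        def __init__(self, d_model, d_hidden, n_experts):
            super().__init__()
            self.router = nn.Linear(d_model, n_experts)  # scores every expert per token
            self.experts = nn.ModuleList([
                nn.Sequential(
                    nn.Linear(d_model, d_hidden),
                    nn.GELU(),
                    nn.Linear(d_hidden, d_model),
                )
                for _ in range(n_experts)
            ])

        def forward(self, x):
            # x: (n_tokens, d_model). Pick the single best expert for each token.
            weights = F.softmax(self.router(x), dim=-1)   # (n_tokens, n_experts)
            top_weight, top_expert = weights.max(dim=-1)  # winning expert per token
            out = torch.zeros_like(x)
            for i, expert in enumerate(self.experts):
                mask = top_expert == i
                if mask.any():
                    # Only this subset of tokens touches expert i's weights.
                    out[mask] = top_weight[mask].unsqueeze(-1) * expert(x[mask])
            return out

    layer = ToyMoELayer(d_model=64, d_hidden=256, n_experts=16)
    print(layer(torch.randn(8, 64)).shape)  # torch.Size([8, 64])

The key property is visible in the loop: each token’s forward pass only multiplies through one expert’s weights, which is how a 400 billion parameter model can run with just 17 billion parameters active per token.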

Llama’s reality check arrives quickly

Current AI models have a relatively limited short-term memory: the context window determines how much information a model can process at once. AI language models like Llama measure that memory in chunks of data called tokens, which can be whole words or fragments of longer words. Larger context windows allow AI models to process longer documents, larger code bases, and longer conversations.
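As a rough illustration of how text maps to tokens, the sketch below uses the Hugging Face transformers library; the model ID is an assumption for illustration, and any compatible tokenizer would make the same point.

    from transformers import AutoTokenizer

    # Model ID assumed for illustration; downloading it requires accepting Meta's license.
    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-4-Scout-17B-16E-Instruct")

    text = "Large context windows let models read long documents in one pass."
    token_ids = tokenizer.encode(text)
    print(len(token_ids), "tokens")
    print(tokenizer.convert_ids_to_tokens(token_ids))  # whole words and word fragments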

Despite Meta’s promotion of Llama 4 Scout’s 10 million token context window, developers have so far found that using even a fraction of that amount is challenging due to memory limitations. Willison reported on his blog that third-party services providing access, such as Groq and Fireworks, limited Scout’s context to just 128,000 tokens. Another provider, Together AI, offered 328,000 tokens.

Evidence suggests accessing larger contexts requires immense resources. Willison pointed to Meta’s own example notebook (“build_with_llama_4”), which states that running a 1.4 million token context needs eight high-end Nvidia H100 GPUs.
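A rough way to see why is to estimate the key/value cache that attention layers must keep for every token in the window. The numbers in the sketch below (layer count, head count, and so on) are illustrative assumptions, not Scout’s published configuration; they only show the scale involved.

    # Back-of-envelope KV-cache sizing. All architecture numbers are
    # illustrative assumptions, not Scout's actual configuration.
    n_layers = 48        # assumed transformer depth
    n_kv_heads = 8       # assumed key/value heads (grouped-query attention)
    head_dim = 128       # assumed per-head dimension
    bytes_per_value = 2  # 16-bit floats

    def kv_cache_gib(n_tokens):
        # 2x for keys and values, stored per layer for every token
        total_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * n_tokens
        return total_bytes / 1024**3

    for n in (128_000, 1_400_000, 10_000_000):
        print(f"{n:>10,} tokens -> ~{kv_cache_gib(n):,.0f} GiB of KV cache")

Under these assumptions, the cache alone grows from roughly 23 GiB at 128,000 tokens to hundreds of gibibytes at 1.4 million, on top of the model weights themselves, which is consistent with needing a multi-GPU node.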

Willison documented his own testing troubles. When he asked Llama 4 Scout via the OpenRouter service to summarize a long online discussion (around 20,000 tokens), the result wasn’t useful. He described it as “complete junk output” that devolved into repetitive loops.
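For reference, a request like Willison’s can be reproduced through OpenRouter’s OpenAI-compatible endpoint. This is a sketch, not his actual script, and the model slug is an assumption.

    from openai import OpenAI

    client = OpenAI(
        base_url="https://openrouter.ai/api/v1",
        api_key="YOUR_OPENROUTER_KEY",
    )

    # Roughly 20,000 tokens of discussion text, loaded from a local file
    with open("discussion.txt") as f:
        discussion = f.read()

    response = client.chat.completions.create(
        model="meta-llama/llama-4-scout",  # assumed model slug
        messages=[{"role": "user", "content": "Summarize this discussion:\n\n" + discussion}],
    )
    print(response.choices[0].message.content)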
