3 Comments
Daniel Popescu / Pluralisk:

Thanks for writing this, it really clarifies a lot. This architectural shift toward distributed AI compute makes on-device inference much more practical.

Daniel:

Running a 7B-parameter model is already possible on an M1 with 32 GB.

Can this new architecture and/or the strategies you suggest (quantization and memory-mapped models) allow you to run a 30B model on an M5 with 32 GB? Or is the only way to add more RAM?
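
Rough back-of-envelope on weight memory alone, assuming ~4-bit quantization (an illustrative sketch; the real footprint also depends on the quantization scheme, context length, and KV cache):

```python
# Back-of-envelope weight-memory estimate for local inference.
# Assumes weights dominate the footprint; KV cache and runtime overhead
# are ignored, so treat these numbers as lower bounds.

def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB for a model of the given size."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

for params, label in [(7, "7B"), (30, "30B")]:
    fp16 = weight_gb(params, 16)   # unquantized half precision
    q4 = weight_gb(params, 4.5)    # ~4-bit quantization (scales/zero-points add a bit of overhead)
    print(f"{label}: fp16 ~ {fp16:.0f} GB, ~4-bit ~ {q4:.1f} GB")

# 7B:  fp16 ~ 14 GB, ~4-bit ~ 3.9 GB
# 30B: fp16 ~ 60 GB, ~4-bit ~ 16.9 GB -> fits in 32 GB of unified memory with
# headroom for the KV cache and the OS; memory-mapping the weight file
# (as llama.cpp does via mmap) lets the OS page weights in on demand rather
# than loading everything up front.
```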

I'm asking because I think there's an elephant in the room: when you're working with an agent and a remote model, you are more fragile (no internet, no agent) and less secure (one has to assume that some pretty sensitive data will end up out there).

That's why I think running a local model would be something of a holy grail, not just for costs but also for adoption in more sensitive parts of the business.

Running a 30B model locally is, IMO, a must for any reasonable local code agent. If there's a way to squeeze large models onto laptops, I think you'd see much more demand for MacBooks than you'd otherwise expect.

Codebra:

AI on small personal devices and computers will always be extremely limited compared with the frontier models everyone is used to. The only people who actually need concentrated AI-specific compute locally are ML researchers and hobbyists. The vast majority of people want an experience like the one they get from Gemini or OpenAI, ideally integrated into their device in a secure manner.
