Discussion about this post

Daniel Popescu / ⧉ Pluralisk

Thanks for writing this; it really clarifies a lot. This architectural shift for distributed AI compute makes on-device inference much more practical.

Daniel

Running a 7B-parameter model can happen on an M1 with 32 GB too.

Can this new architecture and/or the strategies you suggest (quantization and memory-mapped models) let you run a 30B model on an M5 with 32 GB? Or is the only way to add more RAM?
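A rough back-of-the-envelope sketch of the weight footprint at different quantization widths (a minimal illustration, not from the post: it assumes the weights dominate memory and ignores KV cache, activations, and OS overhead, which add several GiB on top; the helper name is purely illustrative):

```python
# Back-of-the-envelope weight-memory estimate for a dense model at a given
# quantization width. Assumption: total memory is dominated by the weights;
# KV cache, activations, and runtime overhead are ignored, so real usage
# will be a few GiB higher.

def weight_memory_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in GiB (1 GiB = 2**30 bytes)."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30

for bits in (16, 8, 4):
    print(f"30B @ {bits:>2}-bit: ~{weight_memory_gib(30, bits):.1f} GiB")

# 30B @ 16-bit: ~55.9 GiB  -> does not fit in 32 GB
# 30B @  8-bit: ~27.9 GiB  -> technically loads, but leaves little headroom
# 30B @  4-bit: ~14.0 GiB  -> fits with room for KV cache, OS, and apps
```

By this rough math, a 4-bit quantized 30B model should fit in 32 GB with headroom, 8-bit is borderline, and 16-bit does not fit at all.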

I'm asking because I think there's an elephant in the room: when you're working with an agent and a remote model, you are more fragile (no internet, no agent) and less secure (one must assume that some pretty sensitive data will end up out there).

This is why I think running a local model would be something of a holy grail, not just for costs, but also for adoption in more sensitive parts of the business.

Running a 30B model locally is, IMO, a must for any kind of reasonable local code agent. If there's a way to squeeze large models into laptops, I think you'd see much more demand for MacBooks than you'd otherwise expect.
