I've been working on a question lately: can an AI run on a small local device without depending on the cloud?
I dug through a lot of material, and then one number stopped me cold.
A 7B-parameter model stored at 16-bit precision (2 bytes per parameter) has to move roughly 14GB of weight data from memory to the compute units every time it generates a single token. Memory bandwidth on a high-end GPU is around 2TB/s. Do the math: 2TB/s divided by 14GB per token is a theoretical ceiling of only about 140 tokens per second for a single sequence, and in practice even less.
I sat with that for a moment.
It's not that the compute isn't fast enough. It's that the carrying is too slow.
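Here is that back-of-the-envelope math as a small Python sketch. It assumes every weight is read from memory exactly once per token and ignores everything else (KV cache, activations, batching), so it gives a ceiling, not a prediction. The function name and the 100GB/s "small device" figure are my own illustrative assumptions, not measurements of any real hardware.

```python
def max_tokens_per_second(params_billion, bytes_per_param, bandwidth_gb_per_s):
    """Bandwidth-bound ceiling on decode speed for a single sequence.

    Assumes all weights are streamed from memory once per generated token.
    """
    # 1e9 parameters * bytes per parameter = size in GB
    weight_gb = params_billion * bytes_per_param
    return bandwidth_gb_per_s / weight_gb


# The numbers from the text: 7B parameters, 2 bytes each, 2TB/s (2000GB/s).
gpu_ceiling = max_tokens_per_second(7, 2, 2000)

# Hypothetical small local device with 100GB/s of memory bandwidth
# (an assumed figure, purely for illustration).
small_device_ceiling = max_tokens_per_second(7, 2, 100)

print(f"GPU ceiling:          {gpu_ceiling:.1f} tokens/s")
print(f"Small-device ceiling: {small_device_ceiling:.1f} tokens/s")
```

Notice that compute speed never appears in the formula. Under these assumptions, the only levers are how many bytes you have to carry and how fast you can carry them.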










