Chinese artificial intelligence lab DeepSeek claims that its latest V4 model significantly reduces the computing and memory resources required for token inference, according to its release notes. DeepSeek says V4 needs just 27% of the single-token inference FLOPs and 10% of the key-value (KV) cache of its predecessor, the DeepSeek V3.2 model. The smaller KV cache directly lowers memory consumption, and the memory it frees increases the context available to model builders when creating their models.

DeepSeek V4 Touts Advancements Across Cache Use and Operations Required To Run A Single Token

In its release notes […]
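To put the claimed 90% KV-cache cut in perspective, here is a rough sizing sketch using the standard per-token KV-cache formula (two tensors, K and V, per layer per token). The layer count, head count, head dimension, and fp16 cache precision below are hypothetical placeholders, not DeepSeek's published architecture; only the 10% ratio comes from the article's claim.

```python
# Back-of-envelope KV-cache sizing. The architecture numbers are
# hypothetical placeholders, NOT DeepSeek's actual configuration;
# only the 10% cache ratio reflects the article's claim.

def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Standard KV-cache footprint: 2 tensors (K and V) per layer per token."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * seq_len

# Hypothetical baseline model with an fp16 cache at a 1M-token context.
baseline = kv_cache_bytes(seq_len=1_000_000, n_layers=60,
                          n_kv_heads=8, head_dim=128)

# The article's claim: V4 needs roughly 10% of its predecessor's KV cache.
v4 = 0.10 * baseline

print(f"baseline KV cache @ 1M tokens: {baseline / 2**30:.1f} GiB")
print(f"claimed V4 KV cache @ 1M tokens: {v4 / 2**30:.1f} GiB")
```

Under these placeholder numbers the baseline cache already exceeds 200 GiB at a 1M-token context, which illustrates why a 90% reduction matters for long-context serving, whatever the real architecture turns out to be.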
Read full article at https://wccftech.com/deepseek-v4-cuts-kv-cache-by-90-at-1m-tokens-but-aggressive-compression-could-risk-needle-in-a-haystack-failures/
