DeepSeek-V3.2-Exp

A brand new model from DeepSeek. It was trained on top of the fresh V3.1-Terminus, but with a slightly modified attention mechanism: DeepSeek Sparse Attention. In short, each token now attends to 2048 other tokens instead of all preceding ones, selected via a slightly differently computed product of Q and K. Swapping in the new mechanism does not require training from scratch: V3.2 is the same V3.1, further trained on roughly a trillion tokens.

This significantly reduces the cost of maintaining a long context, which matters a lot in the era of reasoning models; I think the main motivation for moving in this direction is longer reasoning chains for tasks requiring hundreds of tool calls. A million generated tokens with the new model will cost $0.42 (instead of $1.68 on V3.1). Benchmarks show that quality does not suffer.

An article with the technical details of how the new attention works is here: https://github.com/deepseek-ai/DeepSeek-V3.2-Exp/blob/mai
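To make the idea concrete, here is a minimal NumPy sketch of the selection step: score all previous tokens cheaply, keep only the top-k, and run softmax attention over just those. This is an illustration under assumptions, not DeepSeek's implementation; the function name and `k_top` are made up (the real model uses k = 2048 and a separate learned indexer rather than the raw Q·K product used here as a stand-in).

```python
import numpy as np

def sparse_attention(q, K, V, k_top=4):
    # Stand-in indexer score: here just the dot product of the query with
    # each key (the real model computes a cheaper, separately learned score).
    scores = K @ q                        # (seq_len,)
    # Keep only the k_top best-scoring tokens instead of all of them.
    idx = np.argsort(scores)[-k_top:]
    sel_K, sel_V = K[idx], V[idx]
    # Ordinary scaled-dot-product attention, but over the selected subset.
    logits = sel_K @ q / np.sqrt(q.shape[0])
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w @ sel_V                      # weighted mixture of selected values
```

With `k_top` equal to the sequence length this reduces to dense attention; the savings come from the fact that for long contexts the expensive softmax step touches only 2048 tokens, so per-token cost stops growing with context length.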
