It is hard not to write about deepseek-R1 this week. I ran the 32-billion-parameter (the number of weights in the neural network) version locally using Ollama. Mainly, I liked the availability of a 32-billion-parameter version (Q4 quantised) because that is where the sweet spot of my GPU box lies. The box, with its 16-GB GPU and 64 GB of CPU RAM, runs 8-billion-parameter LLMs smoothly while barely handling a 70-billion-parameter model. Hence, it was love at first sight to notice the 32-billion model. The model promises to be robust, with a context length of 131K (the maximum number of tokens sent to and received from the LLM in one interaction) and an embedding length of 5120 (the number of floating-point values in each vector). The longer the vector, the finer the relationships among chunks of a piece of text that it can capture.
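
If you want to verify those numbers on your own machine, here is a minimal sketch (assuming Ollama is serving on its default local port and that the model tag is deepseek-r1:32b, both assumptions on my part) that asks Ollama's /api/show endpoint for the model metadata. The model_info keys are prefixed with the architecture name, so the exact key names may differ for other model families.

```python
# Minimal sketch: inspect a locally pulled Ollama model's metadata.
# Assumes Ollama is running on localhost:11434 and that
# `ollama pull deepseek-r1:32b` has already been done.
import requests

resp = requests.post(
    "http://localhost:11434/api/show",
    json={"model": "deepseek-r1:32b"},
    timeout=30,
)
info = resp.json()

print(info["details"]["parameter_size"])      # parameter count as reported by Ollama
print(info["details"]["quantization_level"])  # e.g. a Q4 variant

# model_info keys are architecture-prefixed (qwen2.* here), so scan by suffix.
for key, value in info.get("model_info", {}).items():
    if key.endswith(("context_length", "embedding_length")):
        print(key, value)  # expect roughly 131072 and 5120 for this model
```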

I gave this prompt: “A farmer needs to transport a fox, a chicken, and a bag of grain across a river using a boat that can only carry the farmer and one item at a time. If left alone, the fox will eat the chicken, and the chicken will eat the grain. How can the farmer safely transport all three items across the river?”
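
For reference, this is roughly how the prompt reached the model: a minimal sketch against Ollama's /api/generate endpoint with streaming turned off (the host, model tag, and timeout are my assumptions, not anything special the author did). With a reasoning model like deepseek-R1, the chain-of-thought typically arrives wrapped in <think> tags inside the response text.

```python
# Minimal sketch: send the river-crossing prompt to a local Ollama model.
# stream=False returns a single JSON object with the full response text
# plus timing statistics.
import requests

PROMPT = (
    "A farmer needs to transport a fox, a chicken, and a bag of grain "
    "across a river using a boat that can only carry the farmer and one "
    "item at a time. If left alone, the fox will eat the chicken, and the "
    "chicken will eat the grain. How can the farmer safely transport all "
    "three items across the river?"
)

def ask(model: str) -> dict:
    """Send the puzzle to a local Ollama model and return the raw JSON reply."""
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": PROMPT, "stream": False},
        timeout=3600,  # the 32b reasoning model can take many minutes
    )
    resp.raise_for_status()
    return resp.json()

r1 = ask("deepseek-r1:32b")
print(r1["response"])  # the reasoning usually appears inside <think>...</think>
```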

It went into a loop of reasoning and verification, finally taking 682 lines of reasoning to arrive at the conclusion. The answer was correct. However, taking 682 lines of reasoning and 20 minutes to solve this puzzle is nowhere near impressive, especially since it reached the correct answer twice along the way without realising it should have stopped there. Love-at-first-sight transformed into like-at-second-sight. The steps were: 1) Take the chicken across first. 2) Return alone. 3) Take the fox across and bring the chicken back; this ensures the fox does not eat the chicken while the grain is being retrieved. 4) Take the grain across and leave it with the fox. 5) Return alone to get the chicken. 6) Take the chicken across. The final answer was neat. The intermediate steps of the model “thinking and reasoning” were human-like (assuming the human would have missed the correct answer twice, presented right in front of them).
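
Out of curiosity, I also checked the answer mechanically. The puzzle is tiny enough to brute-force: a short breadth-first search over who is on which bank confirms that the seven-crossing plan above is the shortest safe one. This is purely an illustrative sketch of mine, not something the model produced.

```python
# Brute-force check of the river-crossing puzzle via breadth-first search.
# A state records which bank each of (farmer, fox, chicken, grain) is on
# (0 = start bank, 1 = far bank). A state is unsafe if fox+chicken or
# chicken+grain share a bank without the farmer.
from collections import deque

ITEMS = ("farmer", "fox", "chicken", "grain")
START, GOAL = (0, 0, 0, 0), (1, 1, 1, 1)

def safe(state):
    farmer, fox, chicken, grain = state
    if fox == chicken and farmer != chicken:
        return False  # fox eats chicken
    if chicken == grain and farmer != grain:
        return False  # chicken eats grain
    return True

def moves(state):
    farmer = state[0]
    # The farmer crosses alone (i == 0) or with one item on his own bank.
    for i in range(4):
        if i == 0 or state[i] == farmer:
            nxt = list(state)
            nxt[0] = 1 - farmer
            if i != 0:
                nxt[i] = 1 - state[i]
            nxt = tuple(nxt)
            if safe(nxt):
                yield ("alone" if i == 0 else ITEMS[i]), nxt

def solve():
    queue = deque([(START, [])])
    seen = {START}
    while queue:
        state, path = queue.popleft()
        if state == GOAL:
            return path
        for label, nxt in moves(state):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [label]))

print(solve())
# ['chicken', 'alone', 'fox', 'chicken', 'grain', 'alone', 'chicken']
# Seven crossings, matching the model's final answer.
```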

Even though I was happy with how the model concluded, I was disappointed that it took 682 lines to solve a simple (by human standards) puzzle. Here is the interesting thing: I had more work for the GPU. This time, I ran llama3.1:8b (notice the difference in the number of parameters: 32b vs. 8b). It was a Q4-quantised version with a context length of 131K and an embedding length of 4096. Powerful enough. The answering process was not only unimpressive, but the final answer was also incorrect. It took hardly 5 seconds to “think and reason”, meaning it did not think or reason. The model misunderstood the question. Seeing things like “the grain is with the chicken (and thus safe)” shatters our confidence!

A few more comparative statistics follow. The deepseek-R1 model architecture is based on qwen2, whereas llama3.1 is based on the llama architecture. Both models had almost the same load duration (about 15 s). The prompt-evaluation speed was 43 tokens per second for deepseek-R1 and 340 tokens per second for llama3.1. The inference (decoding) speed of deepseek-R1 was 5.65 tokens per second; for llama3.1, it was 111 tokens per second (remember the difference in size between the two models).
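
These figures come straight from the timing fields Ollama reports with each non-streamed response (all durations are in nanoseconds). Here is a rough sketch of how one could reproduce the comparison for both models; the prompt, host, and model tags are the same assumptions as in the earlier snippet.

```python
# Rough throughput comparison using Ollama's per-response timing fields.
# Assumes a local Ollama server on the default port and the same puzzle prompt.
import requests

PROMPT = "A farmer needs to transport a fox, a chicken, and a bag of grain ..."  # same puzzle prompt as above, shortened here

def report(model: str) -> None:
    r = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": PROMPT, "stream": False},
        timeout=3600,
    ).json()
    ns = 1e9  # duration fields are reported in nanoseconds
    print(model)
    print(f"  load duration    : {r['load_duration'] / ns:.1f} s")
    print(f"  prompt eval rate : {r['prompt_eval_count'] / (r['prompt_eval_duration'] / ns):.1f} tokens/s")
    print(f"  eval rate        : {r['eval_count'] / (r['eval_duration'] / ns):.2f} tokens/s")

for model in ("deepseek-r1:32b", "llama3.1:8b"):
    report(model)
```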

It might be a while before others catch up with DeepSeek. I am not surprised this model’s release is causing a tsunami in Silicon Valley. The best part for a country like India is that the release is prompting a strategic shift in how we approach creating foundational LLMs from scratch. Will we see India moving (can I say elevating?) from an LLM-powered application creator (the arena of AI Engineering) to a foundational model creator?

Disclaimer

Views expressed above are the author's own.
