Thinking with Visual Primitives: The Architectural Shift That Could Define the Next Generation of Vision Agents
The race to build more capable multimodal AI systems has largely focused on improving perception. Researchers have invested heavily in higher-resolution image processing, dynamic cropping strategies, visual zoom mechanisms, and advanced Chain-of-Thought techniques in an effort to help models "see" more effectively.
However, a new line of research suggests that perception may not be the primary limitation.
The paper "Thinking with Visual Primitives" introduces a different perspective. Instead of focusing on the Perception Gap, the authors identify what they call the Reference Gap: a fundamental challenge that arises when language attempts to reason about complex visual scenes.
Understanding the Reference Gap
In dense visual environments, language alone often struggles to maintain consistent references.
Consider:
- A team photo containing dozens of people
- A maze with multiple intersections
- Complex overlapping charts and diagrams
- Crowded manufacturing inspection images
Even when an AI model successfully perceives the visual elements, it may lose track of which object it is referring to during later reasoning steps.
According to the authors, this is not a perception problem; it is a referencing problem.
Natural language was never designed to function as a precise pointer inside a continuous two-dimensional space.
The Core Innovation: Visual Primitives as Units of Thought
The central idea behind the research is remarkably simple yet powerful.
Instead of using bounding boxes and points only as verification tools after reasoning is completed, the model incorporates them directly into its reasoning process.
The reasoning chain alternates between:
- Natural language tokens
- Bounding boxes
- Spatial points
Example representations include:
<|box|>[[x1,y1,x2,y2]]<|/box|>
<|point|>[[x,y]]<|/point|>
These coordinates exist within a normalized visual space, allowing the model to maintain precise references throughout the reasoning process.
This approach closely resembles how humans use pointing gestures when:
- Counting objects
- Following a path
- Tracing diagrams
- Solving visual puzzles
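To make the markup concrete, here is a minimal sketch of how the box and point references shown earlier in this section could be pulled out of a reasoning trace. Only the tag syntax comes from the examples above. The function names, the `Primitive` type, and the normalization grid of 1000 (the `scale` parameter) are assumptions made for illustration, since the article does not state the coordinate range.

```python
import re
from typing import NamedTuple

# Tag syntax copied from the article's examples; the rest is illustrative.
PRIMITIVE_RE = re.compile(r"<\|(box|point)\|>\[\[(.*?)\]\]<\|/\1\|>")


class Primitive(NamedTuple):
    kind: str      # "box" or "point"
    coords: tuple  # normalized coordinates exactly as written in the trace


def extract_primitives(trace: str) -> list:
    """Return every box/point reference found in a reasoning trace, in order."""
    found = []
    for match in PRIMITIVE_RE.finditer(trace):
        kind, body = match.group(1), match.group(2)
        coords = tuple(float(v) for v in body.split(","))
        expected = 4 if kind == "box" else 2
        if len(coords) != expected:
            raise ValueError(f"{kind} expects {expected} values, got {len(coords)}")
        found.append(Primitive(kind, coords))
    return found


def to_pixels(p: Primitive, width: int, height: int, scale: float = 1000.0) -> tuple:
    """Map normalized coordinates to pixels. `scale` is an ASSUMED grid size."""
    # Even positions are x values (scaled by width), odd positions are y values.
    return tuple(
        round(v / scale * (width if i % 2 == 0 else height))
        for i, v in enumerate(p.coords)
    )


trace = (
    "The first player <|box|>[[100,200,300,400]]<|/box|> stands left of "
    "the ball <|point|>[[500,750]]<|/point|>."
)
for prim in extract_primitives(trace):
    print(prim.kind, to_pixels(prim, width=800, height=800))
```

Because each reference is an explicit coordinate rather than a phrase such as "the third person from the left", a later reasoning step can reuse it without ambiguity.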
Architecture Highlights
The proposed system is built on DeepSeek-V4-Flash, a large Mixture-of-Experts model featuring:
- 284B total parameters
- Approximately 13B active parameters
- Compressed Sparse Attention architecture
One particularly interesting aspect is its visual token compression strategy.
For an 800×800 image, the model reportedly retains roughly:
Model | Approximate KV entries
--- | ---
DeepSeek-V4-Flash | ~90
Qwen3-VL-235B-A22B | ~660
GPT-5.4 | ~740
Gemini-3-Flash | ~1100
The paper reports an impressive 7,056× pixel-to-KV compression ratio, highlighting significant efficiency gains for large-scale deployment.
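The reported ratio and the roughly 90 entries quoted for an 800×800 image are consistent with each other, which a few lines of arithmetic confirm. The sketch below uses only the figures quoted in this article; nothing in it is measured.

```python
# Figures quoted in the article; nothing here is measured.
width, height = 800, 800
compression_ratio = 7056  # reported pixel-to-KV compression ratio

pixels = width * height
kv_entries = pixels / compression_ratio
print(f"{pixels:,} pixels / {compression_ratio:,} = {kv_entries:.1f} KV entries")

# Relative footprint of the other models, using their quoted entry counts.
reported = {"Qwen3-VL-235B-A22B": 660, "GPT-5.4": 740, "Gemini-3-Flash": 1100}
for name, entries in reported.items():
    print(f"{name}: about {entries / kv_entries:.1f}x as many entries")
```

The division gives about 90.7 entries, matching the ~90 figure quoted for DeepSeek-V4-Flash.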
Training Strategy
The post-training pipeline uses a specialist-first approach.
Separate experts are trained for:
Grounding Tasks
Responsible for generating and reasoning with bounding boxes.
Pointing Tasks
Focused on point-based visual references.
These experts are later merged through:
- Unified Reinforcement Fine-Tuning (RFT)
- On-Policy Distillation
- Reverse-KL Logit Distillation
This allows the final model to inherit strengths from both specialized systems.
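For readers unfamiliar with the term, reverse-KL distillation minimizes KL(student || teacher), taking the expectation under the student's own distribution. The sketch below computes that quantity for a single token position from raw logits. It is a generic illustration of the objective, not the paper's training code, and the function names are hypothetical. How samples are routed to the grounding or pointing teacher is not covered here.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) for one token position.

    The expectation is taken under the student, so the student is penalized
    for placing mass where the teacher assigns low probability.
    """
    log_s = log_softmax(student_logits)
    log_t = log_softmax(teacher_logits)
    return sum(math.exp(ls) * (ls - lt) for ls, lt in zip(log_s, log_t))


student = [2.0, 0.0, 0.0]
teacher = [0.0, 0.0, 0.0]
print("reverse KL:", reverse_kl(student, teacher))
print("forward KL:", reverse_kl(teacher, student))  # arguments swapped
```

The two printed values differ because KL divergence is asymmetric, so choosing the reverse direction is a real design decision. The reverse form is generally described as mode-seeking, and it pairs naturally with on-policy training because the samples come from the student itself.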
Advanced Reward Modeling
One of the most technically impressive aspects of the paper is its reward design.
Beyond standard formatting and correctness rewards, the researchers introduce task-specific reward mechanisms.
Examples include:
Counting Tasks
A smooth reward that decays exponentially with the relative counting error, so larger miscounts earn progressively less credit instead of an all-or-nothing score.
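One plausible form of such a reward is sketched below. The exact formula is not given in this article, so the expression exp(-alpha × relative error), the `alpha` sharpness parameter, and the function name are all assumptions.

```python
import math


def counting_reward(predicted: int, target: int, alpha: float = 5.0) -> float:
    """Assumed form: exp(-alpha * relative_error).

    `alpha` is a hypothetical sharpness knob; a larger value punishes
    miscounts more steeply.
    """
    if predicted < 0 or target < 0:
        raise ValueError("counts must be non-negative")
    # max(target, 1) avoids division by zero when the true count is 0.
    relative_error = abs(predicted - target) / max(target, 1)
    return math.exp(-alpha * relative_error)


for guess in (12, 11, 9, 0):
    print(guess, round(counting_reward(guess, target=12), 3))
```

Unlike an exact-match reward, this gives a near miss more credit than a wild guess, which provides a usable learning signal before the model counts perfectly.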
Maze Navigation
Rewards consider:
- Exploration progress
- Coverage completeness
- Wall violation penalties
- Path validity
- Final answer correctness
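A composite reward of this kind can be sketched on a toy grid maze. The example below covers only a subset of the listed terms (path validity, wall violations, a crude progress measure, and final-answer correctness). The weights, the Manhattan-distance progress proxy, and the function name are hypothetical and are not taken from the paper.

```python
def maze_reward(path, walls, start, goal, answer_correct,
                w_progress=0.3, w_valid=0.3, w_answer=0.4, wall_penalty=0.1):
    """Toy composite reward. Cells are (row, col) tuples; weights are assumed."""
    if not path or path[0] != start:
        return 0.0

    # Wall violations: every visited cell that is a wall costs a fixed penalty.
    wall_hits = sum(1 for cell in path if cell in walls)

    # Path validity: fraction of steps that move exactly one cell up/down/left/right.
    steps = list(zip(path, path[1:]))
    valid = sum(1 for (r1, c1), (r2, c2) in steps
                if abs(r1 - r2) + abs(c1 - c2) == 1)
    validity = valid / len(steps) if steps else 1.0

    # Progress: share of the start-to-goal Manhattan distance that was closed.
    def dist(a, b):
        return abs(a[0] - b[0]) + abs(a[1] - b[1])

    total = dist(start, goal)
    progress = 1.0 if total == 0 else max(0.0, 1.0 - dist(path[-1], goal) / total)

    score = w_progress * progress + w_valid * validity + w_answer * float(answer_correct)
    return max(0.0, score - wall_penalty * wall_hits)


clean = [(0, 0), (0, 1), (0, 2)]
print(maze_reward(clean, walls=set(), start=(0, 0), goal=(0, 2), answer_correct=True))
print(maze_reward(clean, walls={(0, 1)}, start=(0, 0), goal=(0, 2), answer_correct=True))
```

Scoring the intermediate path as well as the final answer makes it harder for a model to earn reward by guessing the answer while tracing an impossible route.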
Path Tracing
A bidirectional trajectory reward evaluates:
- Path accuracy
- Coverage completeness
- Deviation penalties
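One way to read "bidirectional" is that the predicted path is checked against the reference and the reference against the prediction. The sketch below implements that reading with a distance tolerance and combines the two directions with a harmonic mean. The tolerance `tol`, the harmonic-mean combination, and the function names are assumptions, not the paper's definition.

```python
import math


def _nearest(p, pts):
    """Distance from point p to the closest point in pts."""
    return min(math.dist(p, q) for q in pts)


def bidirectional_path_reward(pred, ref, tol=0.05):
    """pred, ref: lists of (x, y) points in the same normalized space.

    accuracy: share of predicted points lying within `tol` of the reference
              (points that deviate lower it).
    coverage: share of reference points lying within `tol` of the prediction
              (skipped segments lower it).
    """
    if not pred or not ref:
        return 0.0
    accuracy = sum(_nearest(p, ref) <= tol for p in pred) / len(pred)
    coverage = sum(_nearest(r, pred) <= tol for r in ref) / len(ref)
    if accuracy + coverage == 0:
        return 0.0
    # Harmonic mean (an assumed choice): both directions must be good to score well.
    return 2 * accuracy * coverage / (accuracy + coverage)


ref = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0), (0.3, 0.0)]
print(bidirectional_path_reward(ref, ref))      # full trace
print(bidirectional_path_reward(ref[:2], ref))  # stops halfway
```

A one-directional check would let a model score well by tracing only the easy first segment, or by scattering points everywhere. Checking both directions closes both loopholes.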
This level of reward engineering is often what separates robust systems from easily exploitable ones.
Benchmark Performance
The model was evaluated on two newly introduced reasoning benchmarks:
DS_Maze_Navigation
Reported performance:
- Frontier models: ~48–51%
- Thinking with Visual Primitives: 66.9%
DS_Path_Tracing
Reported performance:
- Existing models: 24.5–46.5%
- Thinking with Visual Primitives: 56.7%
These results suggest meaningful improvements in tasks requiring persistent visual referencing and topological reasoning.
Important Caveats
The authors are transparent about several limitations.
In-House Benchmarks
The primary benchmarks were developed internally and still require broader industry validation and independent replication.
Limited Scope
The reported scores focus specifically on visual reasoning capabilities and should not be interpreted as indicators of overall model intelligence.
This transparency reflects strong evaluation discipline and improves confidence in the reported findings.
Why This Matters for Enterprise AI
For teams building AI-powered systems, the implications are significant.
Potential applications include:
- Document Intelligence
- Visual Inspection Systems
- Robotics Perception
- Industrial Quality Control
- Dashboard Analysis
- Autonomous Agent Workflows
The research suggests that improving reference tracking may deliver greater performance gains than simply increasing image resolution or extending reasoning chains.
Additionally, the efficiency benefits from visual token compression could reduce deployment costs while maintaining strong performance.
Current Limitations and Future Research
The authors acknowledge several open challenges:
- Fine-grained visual resolution limitations
- Dependence on explicit trigger mechanisms
- Limited cross-domain generalization
- Point-based reasoning constraints in unfamiliar environments
These areas are likely to become major research directions for future multimodal systems.
Final Thoughts
"Thinking with Visual Primitives" presents a compelling shift in how we think about multimodal reasoning.
Rather than asking how AI models can see more, the research asks a more fundamental question:
How can AI models maintain accurate references while thinking about what they see?
If future research validates these findings across broader benchmarks and real-world environments, visual primitives may become a foundational building block for the next generation of intelligent vision agents.
References:
Lu, R., Ma, Y., Chen, X., Luo, L., Wu, Z., Pan, Z., Liu, X., Lin, Y., Li, H., Liu, W., Hao, Z., Gao, X., Nie, S., Wei, Y., Xie, Z., Chen, T., & Zeng, G. (2026). Thinking with Visual Primitives. DeepSeek-AI, Peking University, and Tsinghua University. Technical Report.
Available via Hugging Face:
https://huggingface.co/datasets/NodeLinker/deepseek-ai-Thinking-with-Visual-Primitives-deleted-repo
Author Note:
All technical claims, benchmark results, and architectural descriptions discussed in this article are summarized from the cited research paper. Analysis and interpretation reflect the author's perspective.