Blog - AgentsArchitects

Thinking with Visual Primitives: The Architectural Shift That Could Define the Next Generation of Vision Agents

The race to build more capable multimodal AI systems has largely focused on improving perception. Researchers have invested heavily in higher-resolution image processing, dynamic cropping strategies, visual zoom mechanisms, and advanced Chain-of-Thought techniques in an effort to help models "see" more effectively.

However, a new line of research suggests that perception may not be the primary limitation.

The paper "Thinking with Visual Primitives" takes a different perspective. Instead of focusing on the Perception Gap, the authors identify what they call the Reference Gap: a fundamental challenge that arises when language is used to reason about complex visual scenes.

Understanding the Reference Gap
In dense visual environments, language alone often struggles to maintain consistent references.

Consider:

A team photo containing dozens of people
A maze with multiple intersections
Complex overlapping charts and diagrams
Crowded manufacturing inspection images

Even when an AI model successfully perceives the visual elements, it may lose track of which object it is referring to during later reasoning steps.

According to the authors, this is not a perception problem; it is a referencing problem.

Natural language was never designed to function as a precise pointer inside a continuous two-dimensional space.

The Core Innovation: Visual Primitives as Units of Thought

The central idea behind the research is remarkably simple yet powerful.

Instead of using bounding boxes and points only as verification tools after reasoning is completed, the model incorporates them directly into its reasoning process.

The reasoning chain alternates between:

Natural language tokens
Bounding boxes
Spatial points

Example representations include:

<|box|>[[x1,y1,x2,y2]]<|/box|>
<|point|>[[x,y]]<|/point|>

These coordinates exist within a normalized visual space, allowing the model to maintain precise references throughout the reasoning process.
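To make the interleaved format concrete, here is a minimal sketch of pulling boxes and points back out of a reasoning chain. Only the tag syntax comes from the examples above. The function names, the sample sentence, and in particular the `scale` of the normalized grid are assumptions made for illustration, since the article does not state the coordinate range.

```python
import re
from typing import List, Tuple

# Tag syntax follows the article's examples; everything else is illustrative.
PRIMITIVE_RE = re.compile(r"<\|(box|point)\|>\[\[(.*?)\]\]<\|/\1\|>")


def extract_primitives(chain: str) -> List[Tuple[str, Tuple[float, ...]]]:
    """Return (kind, coords) pairs in the order they appear in the chain."""
    found = []
    for match in PRIMITIVE_RE.finditer(chain):
        kind = match.group(1)
        coords = tuple(float(v) for v in match.group(2).split(","))
        expected = 4 if kind == "box" else 2
        if len(coords) != expected:
            raise ValueError(f"{kind} needs {expected} values, got {len(coords)}")
        found.append((kind, coords))
    return found


def to_pixels(coords, width, height, scale=1000.0):
    """Map normalized coords to pixels. `scale` is an ASSUMED grid size."""
    sizes = (width, height) * (len(coords) // 2)  # x, y, x, y, ...
    return tuple(c / scale * s for c, s in zip(coords, sizes))


# Hypothetical reasoning chain mixing language with visual primitives.
chain = (
    "The first player <|box|>[[100,200,300,400]]<|/box|> stands left of "
    "the ball <|point|>[[500,250]]<|/point|>."
)
primitives = extract_primitives(chain)
```

Because each primitive keeps its position in the chain, a later reasoning step can refer back to "the first box" unambiguously, which is the property the Reference Gap argument depends on.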

This approach closely resembles how humans use pointing gestures when:

Counting objects
Following a path
Tracing diagrams
Solving visual puzzles

Architecture Highlights

The proposed system is built on DeepSeek-V4-Flash, a large Mixture-of-Experts model featuring:

284B total parameters
Approximately 13B active parameters
Compressed Sparse Attention architecture

One particularly interesting aspect is its visual token compression strategy.

For an 800×800 image, each model reportedly retains roughly the following number of KV entries:

Model                  Approximate KV Entries
DeepSeek-V4-Flash      ~90
Qwen3-VL-235B-A22B     ~660
GPT-5.4                ~740
Gemini-3-Flash         ~1100

The paper reports an impressive 7,056× pixel-to-KV compression ratio, highlighting significant efficiency gains for large-scale deployment.
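A quick arithmetic check shows how the compression ratio and the KV entry count relate for the 800×800 example. The sketch below uses only the figures quoted in this article; the variable names are illustrative.

```python
# Figures quoted in the article.
width, height = 800, 800
compression_ratio = 7056

pixels = width * height                  # 640,000 pixels
kv_entries = pixels / compression_ratio  # entries left after compression

# Approximate KV entries per model, as listed in the comparison.
reported = {
    "DeepSeek-V4-Flash": 90,
    "Qwen3-VL-235B-A22B": 660,
    "GPT-5.4": 740,
    "Gemini-3-Flash": 1100,
}
baseline = reported["DeepSeek-V4-Flash"]
relative = {name: count / baseline for name, count in reported.items()}

print(f"{pixels} pixels / {compression_ratio} = {kv_entries:.1f} KV entries")
for name, factor in relative.items():
    print(f"{name}: {factor:.1f}x the entries of DeepSeek-V4-Flash")
```

Dividing 640,000 pixels by 7,056 gives about 90.7, which is consistent with the roughly 90 entries listed for DeepSeek-V4-Flash.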

Training Strategy
The post-training pipeline uses a specialist-first approach.

Separate experts are trained for:

Grounding Tasks
Responsible for generating and reasoning with bounding boxes.

Pointing Tasks
Focused on point-based visual references.

These experts are later merged through:

Unified Reinforcement Fine-Tuning (RFT)
On-Policy Distillation
Reverse-KL Logit Distillation

This allows the final model to inherit strengths from both specialized systems.
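The article does not give the exact loss used, so the following is only a sketch of the standard reverse-KL quantity for a single token position: the divergence is taken from the student's distribution to the teacher's, so the expectation is under the student. The function names and toy logits are hypothetical.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one vocabulary slice."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def reverse_kl(student_logits: Sequence[float],
               teacher_logits: Sequence[float]) -> float:
    """KL(student || teacher): expectation is taken under the student."""
    log_p = log_softmax(student_logits)  # student log-probabilities
    log_q = log_softmax(teacher_logits)  # teacher log-probabilities
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


# Toy logits over a three-token vocabulary (made-up values).
student = [2.0, 0.5, -1.0]
teacher = [1.5, 1.0, -0.5]
loss = reverse_kl(student, teacher)
```

Reverse KL is asymmetric: it penalizes the student most where it puts probability mass that the teacher considers unlikely, which is one common reason to prefer it when merging specialist behaviors.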

Advanced Reward Modeling
One of the most technically impressive aspects of the paper is its reward design.

Beyond standard formatting and correctness rewards, the researchers introduce task-specific reward mechanisms.

Examples include:

Counting Tasks
A smooth exponential-decay reward on relative error: the reward is highest for an exact count and falls off continuously as the predicted count drifts further from the true one.
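One plausible shape for such a counting reward is sketched below. The article only describes it as exponential decay over relative error; the exact formula, the `decay` constant, and the function name are assumptions.

```python
import math


def counting_reward(predicted: int, target: int, decay: float = 3.0) -> float:
    """Return exp(-decay * relative_error); `decay` is a hypothetical constant."""
    if predicted < 0 or target < 0:
        raise ValueError("counts must be non-negative")
    # Guard against division by zero when the true count is 0.
    relative_error = abs(predicted - target) / max(target, 1)
    return math.exp(-decay * relative_error)


exact = counting_reward(10, 10)     # exact count earns the maximum reward
near_miss = counting_reward(9, 10)  # off by one out of ten
far_miss = counting_reward(5, 10)   # off by half
```

Unlike a binary correct/incorrect signal, this gives partial credit to near misses, so a policy that improves from "off by five" to "off by one" still sees a higher reward.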

Maze Navigation

Rewards consider:

Exploration progress
Coverage completeness
Wall violation penalties
Path validity
Final answer correctness
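As a rough illustration of how some of these terms might combine, the toy scorer below covers only three of them: path validity, wall violations, and final answer correctness. The grid representation, the penalty weight, and the function name are all hypothetical; the paper's actual reward, including its exploration and coverage terms, is not reproduced here.

```python
from typing import List, Set, Tuple

Cell = Tuple[int, int]  # (row, column) on an assumed grid maze


def maze_reward(path: List[Cell], walls: Set[Cell], start: Cell, goal: Cell,
                wall_penalty: float = 0.5) -> float:
    """Toy score in [0, 1]; weights and maze encoding are hypothetical."""
    if not path or path[0] != start:
        return 0.0
    # Path validity: every step must move to a 4-connected neighbour.
    for (r0, c0), (r1, c1) in zip(path, path[1:]):
        if abs(r0 - r1) + abs(c0 - c1) != 1:
            return 0.0
    # Wall violations: each visited wall cell subtracts a fixed penalty.
    violations = sum(1 for cell in path if cell in walls)
    # Final answer correctness: the path has to end on the goal cell.
    reached = 1.0 if path[-1] == goal else 0.0
    return max(0.0, reached - wall_penalty * violations)


walls = {(0, 1)}
clean_path = [(0, 0), (1, 0), (1, 1), (1, 2), (0, 2)]
through_wall = [(0, 0), (0, 1), (0, 2)]
```

Scoring violations separately from the final answer is one way to stop a policy from earning full reward by cutting through walls to reach the goal.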

Path Tracing

A bidirectional trajectory reward evaluates:

Path accuracy
Coverage completeness
Deviation penalties
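One way to read "bidirectional" is as a two-way nearest-point comparison: predicted points are checked against the reference path (accuracy and deviation), and reference points are checked against the prediction (coverage). The sketch below implements that reading only; the distance measure, the `scale` constant, and the function names are assumptions rather than the paper's definition.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def _mean_nearest(src: Sequence[Point], dst: Sequence[Point]) -> float:
    """Average distance from each point in src to its nearest point in dst."""
    return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)


def bidirectional_path_reward(pred: Sequence[Point], truth: Sequence[Point],
                              scale: float = 0.1) -> float:
    """Toy reward in (0, 1]; `scale` is a hypothetical tolerance."""
    if not pred or not truth:
        return 0.0
    forward = _mean_nearest(pred, truth)   # how far predicted points stray
    backward = _mean_nearest(truth, pred)  # how much of the reference is missed
    return math.exp(-(forward + backward) / (2 * scale))


truth = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0)]
partial = [(0.0, 0.0), (0.5, 0.0)]               # accurate but incomplete
shifted = [(0.0, 1.0), (0.5, 1.0), (1.0, 1.0)]   # complete but far away
```

Checking both directions matters: a one-way score would give full marks to a prediction that traces only the first half of the path perfectly.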

This level of reward engineering is often what separates robust systems from easily exploitable ones.

Benchmark Performance
The model was evaluated on two newly introduced reasoning benchmarks:

DS_Maze_Navigation

Reported performance:

Frontier models: ~48–51%
Thinking with Visual Primitives: 66.9%

DS_Path_Tracing

Reported performance:

Existing models: 24.5–46.5%
Thinking with Visual Primitives: 56.7%

These results suggest meaningful improvements in tasks requiring persistent visual referencing and topological reasoning.

Important Caveats
The authors are transparent about several limitations.

In-House Benchmarks
The primary benchmarks were developed internally and still require broader industry validation and independent replication.

Limited Scope
The reported scores focus specifically on visual reasoning capabilities and should not be interpreted as indicators of overall model intelligence.

This transparency reflects strong evaluation discipline and improves confidence in the reported findings.

Why This Matters for Enterprise AI
For teams building AI-powered systems, the implications are significant.

Potential applications include:

Document Intelligence
Visual Inspection Systems
Robotics Perception
Industrial Quality Control
Dashboard Analysis
Autonomous Agent Workflows

The research suggests that improving reference tracking may deliver greater performance gains than simply increasing image resolution or extending reasoning chains.

Additionally, the efficiency benefits from visual token compression could reduce deployment costs while maintaining strong performance.

Current Limitations and Future Research

The authors acknowledge several open challenges:

Fine-grained visual resolution limitations
Dependence on explicit trigger mechanisms
Limited cross-domain generalization
Point-based reasoning constraints in unfamiliar environments

These areas are likely to become major research directions for future multimodal systems.

Final Thoughts

"Thinking with Visual Primitives" presents a compelling shift in how we think about multimodal reasoning.

Rather than asking how AI models can see more, the research asks a more fundamental question:

How can AI models maintain accurate references while thinking about what they see?

If future research validates these findings across broader benchmarks and real-world environments, visual primitives may become a foundational building block for the next generation of intelligent vision agents.

References:

Lu, R., Ma, Y., Chen, X., Luo, L., Wu, Z., Pan, Z., Liu, X., Lin, Y., Li, H., Liu, W., Hao, Z., Gao, X., Nie, S., Wei, Y., Xie, Z., Chen, T., & Zeng, G. (2026). Thinking with Visual Primitives. DeepSeek-AI, Peking University, and Tsinghua University. Technical Report.

Available via Hugging Face:
https://huggingface.co/datasets/NodeLinker/deepseek-ai-Thinking-with-Visual-Primitives-deleted-repo

Author Note:
All technical claims, benchmark results, and architectural descriptions discussed in this article are summarized from the cited research paper. Analysis and interpretation reflect the author's perspective.
