Is it agentic enough? Benchmarking open models on your own tooling

Thursday, June 18, 2026

Testing software for agentic-use
Not all successes are equal
How do we run evaluations?
Which models to benchmark against?
Large open models: hold the model, vary the revision
Small models: hold the revision, vary the model
Tweaking the tool: markers and results
What's a marker?
Is the CLI + Skill commit helping?
Trying it yourself
Closing
Acknowledgements

Benchmarking transformers revisions across different metrics

This is a human-made, agent-focused blogpost.

Coding agents increasingly work with our software instead of us: describe a task, and the agent picks the library,


Related reading

The PR you would have opened yourself

Mastering Agentic Techniques: AI Agent Evaluation | NVIDIA Technical Blog

Together Evaluations now supports comparing top commercial APIs vs. open source…

FOD#68: AI Benchmarks vs Vibe Checks: Measuring AI Progress

OpenAI's New Open gpt-oss Models vs o4-mini: A Real-World Comparison

We Got Claude to Build CUDA Kernels and teach open models!
