AI BENCHMARKS

Search, execute, and defend benchmarks

249 held-out tasks, 14 unseen connectors · 11 models

[Bar chart: models ranked by Hit@1; values as in the table below]
| # | Model | Hit@1 | Hit@3 | Hit@10 | MRR | p50 (ms) | QPS |
|---|---|---|---|---|---|---|---|
| 1 | Fine-Tuned MiniLM | 92.8% | 98.8% | 100.0% | 0.957 | 3.0 | 360 |
| 2 | Fine-Tuned bge | 90.0% | 98.8% | 100.0% | 0.943 | 10.9 | 89 |
| 3 | Semantic Router | 62.7% | 82.3% | 96.8% | 0.736 | 2.7 | 365 |
| 4 | StackOne Action Search | 59.0% | 80.7% | 96.8% | 0.712 | 3.1 | 261 |
| 5 | Toolshed RAG | 57.4% | 79.5% | 96.8% | 0.704 | 53.6 | 18 |
| 6 | Baseline Embedding | 55.0% | 79.1% | 97.2% | 0.689 | 2.7 | 361 |
| 7 | Tool2Vec | 53.8% | 76.3% | 96.0% | 0.674 | 2.7 | 364 |
| 8 | bge-base-en-v1.5 | 52.2% | 85.5% | 98.4% | 0.691 | 10.5 | 91 |
| 9 | BM-25 | 36.5% | 63.5% | 91.6% | 0.537 | 0.5 | 1,880 |
| 10 | Baseline (LLM Direct) | — | — | — | — | 2500.0 | 0 |
| 11 | LangGraph BigTool | — | — | — | — | 2500.0 | 0 |
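Hit@k and MRR in these tables can be computed directly from ranked retrieval output. A minimal sketch (function names are ours for illustration, not the benchmark harness):

```python
def hit_at_k(ranked_ids, gold_id, k):
    """1.0 if the gold tool appears in the top-k results, else 0.0."""
    return 1.0 if gold_id in ranked_ids[:k] else 0.0

def mrr(all_ranked, all_gold):
    """Mean reciprocal rank of the gold tool over all queries.

    Queries where the gold tool is never retrieved contribute 0.
    """
    total = 0.0
    for ranked_ids, gold_id in zip(all_ranked, all_gold):
        if gold_id in ranked_ids:
            total += 1.0 / (ranked_ids.index(gold_id) + 1)
    return total / len(all_gold)
```

Hit@1 equals exact top-1 accuracy, which is why it is the headline metric here.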
Legend: StackOne · Open source · Beats all our models · We outperform

All 1,843 tasks, per-connector search · 9 models

[Bar chart: models ranked by Hit@1; values as in the table below]
| # | Model | Hit@1 | Hit@3 | Hit@10 | MRR |
|---|---|---|---|---|---|
| 1 | Fine-Tuned MiniLM | 91.6% | 99.1% | 100.0% | 0.951 |
| 2 | Fine-Tuned bge | 89.9% | 99.0% | 100.0% | 0.942 |
| 3 | Semantic Router | 59.7% | 81.0% | 96.5% | 0.718 |
| 4 | Tool2Vec | 54.4% | 76.5% | 96.3% | 0.679 |
| 5 | Toolshed RAG | 53.2% | 77.8% | 96.3% | 0.674 |
| 6 | StackOne Action Search | 52.8% | 76.9% | 96.6% | 0.671 |
| 7 | Baseline Embedding | 52.0% | 77.9% | 96.9% | 0.668 |
| 8 | bge-base-en-v1.5 | 51.2% | 81.9% | 98.4% | 0.678 |
| 9 | BM-25 | 33.1% | 60.6% | 91.6% | 0.506 |

All 998 tools, 1,843 tasks, cross-connector · 9 models

[Bar chart: models ranked by Hit@1; values as in the table below]
| # | Model | Hit@1 | Hit@3 | Hit@10 | MRR | p50 (ms) | QPS |
|---|---|---|---|---|---|---|---|
| 1 | Fine-Tuned MiniLM | 57.3% | 71.5% | 82.3% | 0.661 | — | — |
| 2 | Fine-Tuned bge | 56.9% | 70.9% | 81.9% | 0.652 | 13.0 | 73 |
| 3 | Semantic Router | 35.4% | 52.2% | 67.3% | 0.457 | 3.4 | 278 |
| 4 | Tool2Vec | 29.4% | 45.2% | 60.8% | 0.393 | 3.3 | 297 |
| 5 | Baseline Embedding | 28.3% | 44.3% | 58.9% | 0.388 | — | — |
| 6 | StackOne Action Search | 27.6% | 44.7% | 58.7% | 0.377 | 3.5 | 266 |
| 7 | Toolshed RAG | 26.8% | 44.3% | 59.0% | 0.371 | 124.7 | 8 |
| 8 | bge-base-en-v1.5 | 23.3% | 41.1% | 59.9% | 0.355 | — | — |
| 9 | BM-25 | — | — | — | — | — | — |

200 tools, 2,000 queries — single-tool retrieval · 7 models

[Bar chart: models ranked by nDCG@5; values as in the table below]
| # | Model | Hit@1 | Hit@3 | Hit@5 | Hit@10 | MRR | nDCG@5 | nDCG@10 | p50 (ms) | p95 (ms) | QPS |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Fine-Tuned Bi-Encoder (v2) | 88.8% | 98.3% | 99.0% | 99.7% | 0.936 | 0.949 | 0.952 | 1.7 | 2.6 | 578 |
| 2 | gte-large | 66.8% | 88.2% | 92.8% | 96.9% | 0.784 | 0.816 | 0.830 | 18.7 | 32.8 | 50 |
| 3 | MiniLM-L6-v2 (base) | 62.8% | 82.5% | 87.9% | 93.3% | 0.736 | 0.767 | 0.784 | 1.5 | 2.4 | 651 |
| 4 | bge-large-en-v1.5 | 60.4% | 82.7% | 89.0% | 93.5% | 0.724 | 0.761 | 0.776 | 17.7 | 30.2 | 54 |
| 5 | bge-base-en-v1.5 | 60.7% | 81.9% | 87.5% | 93.2% | 0.723 | 0.756 | 0.774 | 5.7 | 9.2 | 168 |
| 6 | Fine-Tuned Reranker | 60.7% | 81.9% | 87.5% | 93.2% | 0.723 | 0.756 | 0.774 | 50.2 | 69.8 | 19 |
| 7 | e5-large-v2 | 52.8% | 72.5% | 80.2% | 87.8% | 0.644 | 0.676 | 0.700 | 19.5 | 32.7 | 48 |
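The nDCG@k values used here and in the sections below can be sketched as follows; this handles graded relevance, with the binary single-gold task as the special case where exactly one tool has relevance 1. The function is illustrative, not the harness implementation:

```python
import math

def ndcg_at_k(ranked_ids, relevance, k):
    """nDCG@k: DCG of the returned ranking divided by the DCG of the
    ideal ranking. `relevance` maps tool id -> graded relevance
    (1/0 for the single-gold case); missing ids count as 0.
    """
    dcg = sum(relevance.get(tid, 0) / math.log2(i + 2)
              for i, tid in enumerate(ranked_ids[:k]))
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0
```

With a single gold tool at rank r, this reduces to 1/log2(r + 1), so a miss at rank 1 is penalized much more heavily than a slip from rank 5 to 6.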

44,453 tools, 7,961 queries — graded multi-tool relevance · 9 models

[Bar chart: models ranked by nDCG@10; values as in the table below]
| # | Model | nDCG@5 | nDCG@10 | nDCG@20 | Recall@5 | Recall@10 | Recall@20 | Comp@5 | Comp@10 | Comp@20 |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Qwen3-Embedding-8B (paper) | — | 0.462 | — | — | 0.575 | — | — | 0.475 | — |
| 2 | Qwen3-Embedding-0.6B (paper) | — | 0.431 | — | — | 0.528 | — | — | 0.430 | — |
| 3 | NV-Embed-v1 (paper) | — | 0.427 | — | — | 0.534 | — | — | 0.434 | — |
| 4 | gte-Qwen2-1.5B-instruct (paper) | — | 0.413 | — | — | 0.516 | — | — | 0.406 | — |
| 5 | GritLM-7B (paper) | — | 0.411 | — | — | 0.513 | — | — | 0.404 | — |
| 6 | e5-mistral-7b-instruct (paper) | — | 0.400 | — | — | 0.501 | — | — | 0.406 | — |
| 7 | BM25s (paper) | — | 0.364 | — | — | 0.464 | — | — | 0.390 | — |
| 8 | Fine-Tuned Bi-Encoder (v2) | 0.282 | 0.303 | 0.318 | 0.328 | 0.384 | 0.434 | 0.272 | 0.321 | 0.367 |
| 9 | Fine-Tuned Bi-Encoder (v1) | 0.193 | 0.211 | 0.225 | 0.234 | 0.283 | 0.335 | 0.193 | 0.235 | 0.276 |

10,439 tools, 451 queries — NemoSheng subset · 5 models

[Bar chart: models ranked by nDCG@5; values as in the table below]
| # | Model | nDCG@5 | nDCG@10 | nDCG@20 | Recall@5 | Recall@10 | Recall@20 | Hit@1 | Hit@3 | Hit@5 | Hit@10 | MRR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Ada-002 (EmbSim) (paper) | 0.387 | — | — | — | — | — | — | — | — | — | — |
| 2 | Fine-Tuned Bi-Encoder (v2) | 0.263 | 0.310 | 0.337 | 0.390 | 0.534 | 0.641 | 12.6% | 29.5% | 39.0% | 53.4% | 0.253 |
| 3 | Fine-Tuned Bi-Encoder (v1) | 0.254 | 0.289 | 0.314 | 0.375 | 0.481 | 0.581 | 12.2% | 29.5% | 37.5% | 48.1% | 0.242 |
| 4 | MiniLM-L6-v2 (base) | 0.254 | 0.295 | 0.320 | 0.390 | 0.514 | 0.612 | 11.1% | 28.8% | 39.0% | 51.4% | 0.238 |
| 5 | BM25 (paper) | 0.200 | — | — | — | — | — | — | — | — | — | — |

200 tasks, context-limit focused · 5 compressors · claude-sonnet-4-6

[Bar chart: compressors ranked by Pass Rate; values as in the table below]
| Compressor | Pass Rate | Coverage | Ctx Reduction | Total Tokens | Avg Iterations | Avg Latency |
|---|---|---|---|---|---|---|
| Discode Dynamic | 79.5% | 84.6% | 37.3% | 8.1M | 5.7 | 40.7s |
| Observation Masking | 78.0% | 84.1% | 0.0% | 12.2M | 6.3 | 33.8s |
| Selective Context | 73.5% | 80.3% | 18.9% | 19.5M | 5.2 | 33.2s |
| No Compression | 71.0% | 73.9% | 0.0% | 11.1M | 3.8 | 29.8s |
| LLMLingua-2 | 67.5% | 78.3% | 22.4% | 26.1M | 6.3 | 38.9s |
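A sketch of how a figure like 37.3% can be read, assuming Ctx Reduction is the fraction of context tokens removed by the compressor (our reading of the column, not stated in the table):

```python
def context_reduction(original_tokens: int, compressed_tokens: int) -> float:
    """Fraction of the original context removed by compression."""
    return 1.0 - compressed_tokens / original_tokens

# Under this reading, a 37.3% reduction means roughly 62.7% of the
# original context tokens survive compression.
```

Note that Total Tokens counts all tokens consumed across the run, so a higher reduction ratio does not by itself imply a lower total: more iterations can outweigh the per-step savings.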

Synthetic function-calling — Simple, Multiple, Parallel, Parallel Multiple · 4 models

[Bar chart: models ranked by Average Accuracy; values as in the table below]
| # | Model | Average | Simple | Multiple | Parallel | Parallel Multiple |
|---|---|---|---|---|---|---|
| 1 | Qwen2.5-7B LoRA (BFCL SFT) | 95.4% | 100.0% | 94.1% | 95.0% | 92.6% |
| 2 | Gemini 3 Pro (Prompt) (paper) | 90.6% | 79.6% | 96.0% | 95.0% | 92.0% |
| 3 | Claude Sonnet 4.5 (FC) (paper) | 88.6% | 72.6% | 95.5% | 94.5% | 92.0% |
| 4 | Claude Opus 4.5 (FC) (paper) | 88.6% | 76.8% | 95.5% | 93.5% | 88.5% |
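The Average column is the unweighted mean of the four category accuracies. A quick arithmetic check against the Qwen2.5-7B LoRA row:

```python
# Category accuracies from the Qwen2.5-7B LoRA (BFCL SFT) row.
simple, multiple, parallel, parallel_multiple = 100.0, 94.1, 95.0, 92.6

# Unweighted mean over the four categories.
average = (simple + multiple + parallel + parallel_multiple) / 4
# average is 95.425, which the table reports rounded as 95.4%
```

The same check works for any row in this table or the real-world table below.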

Real-world function-calling — Simple, Multiple, Parallel, Parallel Multiple · 4 models

[Bar chart: models ranked by Average Accuracy; values as in the table below]
| # | Model | Average | Simple | Multiple | Parallel | Parallel Multiple |
|---|---|---|---|---|---|---|
| 1 | Gemini 3 Pro (Prompt) (paper) | 87.7% | 87.6% | 81.8% | 93.8% | 87.5% |
| 2 | Qwen2.5-7B LoRA (BFCL SFT) | 85.2% | 73.3% | 87.5% | 100.0% | 80.0% |
| 3 | Claude Sonnet 4.5 (FC) (paper) | 84.8% | 89.5% | 78.9% | 87.5% | 83.3% |
| 4 | Claude Opus 4.5 (FC) (paper) | 81.8% | 86.4% | 78.2% | 87.5% | 75.0% |

AgentShield corpus leaderboard · 7 providers · 537 tests

[Bar chart: providers ranked by Final Score; values as in the table below]
| Provider | Final Score | Prompt Injection | Jailbreak | Data Exfil | Tool Abuse | Over-Refusal | Multi-Agent | Provenance | p50 (ms) |
|---|---|---|---|---|---|---|---|---|---|
| AgentGuard | 98.4 | 98.5 | 97.8 | 100 | 100 | 100 | 100 | 85 | 1 |
| Deepset DeBERTa | 87.6 | 99.5 | 97.8 | 95.4 | 98.8 | 63.1 | 100 | 100 | 19 |
| StackOne Defender | 79.8 | 92.7 | 68.9 | 92.0 | 83.8 | 72.3 | 88.6 | 80 | 10.3 |
| Lakera Guard | 79.4 | 97.6 | 95.6 | 96.6 | 86.3 | 58.5 | 94.3 | 95 | 133 |
| ProtectAI DeBERTa v2 | 51.4 | 77.1 | 86.7 | 43.7 | 12.5 | 95.4 | 74.3 | 65 | 19 |
| ClawGuard | 38.9 | 62.9 | 22.2 | 40.2 | 17.5 | 100 | 40 | 25 | 0 |
| LLM Guard | 38.7 | 77.1 | — | 30.8 | 8.9 | — | — | — | 111 |

Large-scale diverse · 65,000 samples · HuggingFace dataset · 5 models

[Bar chart: models ranked by F1 on the Jayavibhav dataset; values as in the table below]
| Model | F1 | Size (MB) | Latency (ms) | Hardware |
|---|---|---|---|---|
| StackOne | 97.44% | 22 | 4.31 | CPU |
| DistilBERT | 75.20% | 1,789 | 6.99 | GPU |
| ProtectAI DeBERTa-v3 | 73.55% | 704 | 43 | T4 GPU |
| Meta PG v2 | 62.50% | 1,064 | 43 | T4 GPU |
| Meta PG v1 | 54.74% | 1,064 | 43 | T4 GPU |
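The F1 scores in these guard tables are the standard harmonic mean of precision and recall on injection-vs-benign classification. A minimal sketch (not the evaluation harness itself):

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 for a binary injection-vs-benign classifier.

    tp: injections correctly flagged; fp: benign prompts flagged;
    fn: injections missed.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Because F1 balances false positives against false negatives, a guard that flags everything (perfect recall, poor precision) scores far below one with balanced errors.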
Legend: StackOne · Open source baselines

Curated enterprise prompts · 1,518 samples · HuggingFace dataset · 5 models

[Bar chart: models ranked by F1 on the Qualifire dataset; values as in the table below]
| Model | F1 | Size (MB) | Latency (ms) | Hardware |
|---|---|---|---|---|
| DistilBERT | 89.07% | 1,789 | 6.99 | GPU |
| StackOne | 86.41% | 22 | 4.31 | CPU |
| ProtectAI DeBERTa-v3 | 64.04% | 704 | 43 | T4 GPU |
| Meta PG v2 | 60.27% | 1,064 | 43 | T4 GPU |
| Meta PG v1 | 55.40% | 1,064 | 43 | T4 GPU |

5 attack types · 22,500 samples · HuggingFace dataset · 5 models

[Bar chart: models ranked by F1 on the xxz224 dataset; values as in the table below]
| Model | F1 | Size (MB) | Latency (ms) | Hardware |
|---|---|---|---|---|
| DistilBERT | 93.59% | 1,789 | 6.99 | GPU |
| Meta PG v1 | 92.41% | 1,064 | 43 | T4 GPU |
| StackOne | 82.28% | 22 | 4.31 | CPU |
| Meta PG v2 | 66.49% | 1,064 | 43 | T4 GPU |
| ProtectAI DeBERTa-v3 | 33.06% | 704 | 43 | T4 GPU |