249 held-out tasks, 14 unseen connectors · 11 models

Featured (Hit@1): StackOne Action Search 59.0% · Baseline (LLM Direct) —

| # | Model | Hit@1 | Hit@3 | Hit@10 | MRR | p50 (ms) | QPS |
|---|-------|-------|-------|--------|-----|----------|-----|
| 1 | Fine-Tuned MiniLM | 92.8% | 98.8% | 100.0% | 0.957 | 3.0 | 360 |
| 2 | Fine-Tuned bge | 90.0% | 98.8% | 100.0% | 0.943 | 10.9 | 89 |
| 3 | Semantic Router | 62.7% | 82.3% | 96.8% | 0.736 | 2.7 | 365 |
| 4 | StackOne Action Search | 59.0% | 80.7% | 96.8% | 0.712 | 3.1 | 261 |
| 5 | Toolshed RAG | 57.4% | 79.5% | 96.8% | 0.704 | 53.6 | 18 |
| 6 | Baseline Embedding | 55.0% | 79.1% | 97.2% | 0.689 | 2.7 | 361 |
| 7 | Tool2Vec | 53.8% | 76.3% | 96.0% | 0.674 | 2.7 | 364 |
| 8 | bge-base-en-v1.5 | 52.2% | 85.5% | 98.4% | 0.691 | 10.5 | 91 |
| 9 | BM-25 | 36.5% | 63.5% | 91.6% | 0.537 | 0.5 | 1,880 |
| 10 | Baseline (LLM Direct) | — | — | — | — | 2500.0 | 0 |
| 11 | LangGraph BigTool | — | — | — | — | 2500.0 | 0 |
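For reference, the Hit@k and MRR columns in these tables can be reproduced from each query's ranked tool list. A minimal sketch (function and variable names are illustrative, not from the benchmark harness):

```python
def hit_at_k(ranked_ids, gold_id, k):
    """1 if the gold tool appears in the top-k results, else 0."""
    return int(gold_id in ranked_ids[:k])

def reciprocal_rank(ranked_ids, gold_id):
    """1/rank of the gold tool, or 0 if it is never retrieved."""
    for rank, tool_id in enumerate(ranked_ids, start=1):
        if tool_id == gold_id:
            return 1.0 / rank
    return 0.0

def evaluate(runs, k_values=(1, 3, 10)):
    """Aggregate Hit@k and MRR over (ranked_ids, gold_id) pairs."""
    n = len(runs)
    hits = {k: sum(hit_at_k(r, g, k) for r, g in runs) / n for k in k_values}
    mrr = sum(reciprocal_rank(r, g) for r, g in runs) / n
    return hits, mrr

# Toy example: two queries, gold tool at rank 1 and rank 3.
runs = [(["a", "b", "c"], "a"), (["x", "y", "z"], "z")]
hits, mrr = evaluate(runs)
# hits[1] = 0.5, hits[3] = 1.0, mrr = (1 + 1/3) / 2 ≈ 0.667
```

Hit@k only asks whether the gold tool lands in the top k; MRR additionally rewards ranking it higher, which is why the two columns can reorder models.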
Legend: StackOne · Open source · Beats all our models · We outperform
All 1,843 tasks, per-connector search · 9 models

Featured (Hit@1): StackOne Action Search 52.8%

| # | Model | Hit@1 | Hit@3 | Hit@10 | MRR |
|---|-------|-------|-------|--------|-----|
| 1 | Fine-Tuned MiniLM | 91.6% | 99.1% | 100.0% | 0.951 |
| 2 | Fine-Tuned bge | 89.9% | 99.0% | 100.0% | 0.942 |
| 3 | Semantic Router | 59.7% | 81.0% | 96.5% | 0.718 |
| 4 | Tool2Vec | 54.4% | 76.5% | 96.3% | 0.679 |
| 5 | Toolshed RAG | 53.2% | 77.8% | 96.3% | 0.674 |
| 6 | StackOne Action Search | 52.8% | 76.9% | 96.6% | 0.671 |
| 7 | Baseline Embedding | 52.0% | 77.9% | 96.9% | 0.668 |
| 8 | bge-base-en-v1.5 | 51.2% | 81.9% | 98.4% | 0.678 |
| 9 | BM-25 | 33.1% | 60.6% | 91.6% | 0.506 |
All 998 tools, 1,843 tasks, cross-connector · 9 models

Featured (Hit@1): StackOne Action Search 27.6%

| # | Model | Hit@1 | Hit@3 | Hit@10 | MRR | p50 (ms) | QPS |
|---|-------|-------|-------|--------|-----|----------|-----|
| 1 | Fine-Tuned MiniLM | 57.3% | 71.5% | 82.3% | 0.661 | — | — |
| 2 | Fine-Tuned bge | 56.9% | 70.9% | 81.9% | 0.652 | 13.0 | 73 |
| 3 | Semantic Router | 35.4% | 52.2% | 67.3% | 0.457 | 3.4 | 278 |
| 4 | Tool2Vec | 29.4% | 45.2% | 60.8% | 0.393 | 3.3 | 297 |
| 5 | Baseline Embedding | 28.3% | 44.3% | 58.9% | 0.388 | — | — |
| 6 | StackOne Action Search | 27.6% | 44.7% | 58.7% | 0.377 | 3.5 | 266 |
| 7 | Toolshed RAG | 26.8% | 44.3% | 59.0% | 0.371 | 124.7 | 8 |
| 8 | bge-base-en-v1.5 | 23.3% | 41.1% | 59.9% | 0.355 | — | — |
| 9 | BM-25 | — | — | — | — | — | — |
200 tools, 2,000 queries — single-tool retrieval · 7 models

Featured (nDCG@5): Fine-Tuned Bi-Encoder (v2) 94.9% · MiniLM-L6-v2 (base) 76.7% · Fine-Tuned Reranker 75.6%

| # | Model | Hit@1 | Hit@3 | Hit@5 | Hit@10 | MRR | nDCG@5 | nDCG@10 | p50 (ms) | p95 (ms) | QPS |
|---|-------|-------|-------|-------|--------|-----|--------|---------|----------|----------|-----|
| 1 | Fine-Tuned Bi-Encoder (v2) | 88.8% | 98.3% | 99.0% | 99.7% | 0.936 | 0.949 | 0.952 | 1.7 | 2.6 | 578 |
| 2 | gte-large | 66.8% | 88.2% | 92.8% | 96.9% | 0.784 | 0.816 | 0.830 | 18.7 | 32.8 | 50 |
| 3 | MiniLM-L6-v2 (base) | 62.8% | 82.5% | 87.9% | 93.3% | 0.736 | 0.767 | 0.784 | 1.5 | 2.4 | 651 |
| 4 | bge-large-en-v1.5 | 60.4% | 82.7% | 89.0% | 93.5% | 0.724 | 0.761 | 0.776 | 17.7 | 30.2 | 54 |
| 5 | bge-base-en-v1.5 | 60.7% | 81.9% | 87.5% | 93.2% | 0.723 | 0.756 | 0.774 | 5.7 | 9.2 | 168 |
| 6 | Fine-Tuned Reranker | 60.7% | 81.9% | 87.5% | 93.2% | 0.723 | 0.756 | 0.774 | 50.2 | 69.8 | 19 |
| 7 | e5-large-v2 | 52.8% | 72.5% | 80.2% | 87.8% | 0.644 | 0.676 | 0.700 | 19.5 | 32.7 | 48 |
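The nDCG@k columns here and in the graded-relevance table below discount each relevant hit by its rank and normalize against the ideal ordering, so a perfect ranking scores 1.0. A minimal sketch of the standard formula (names are illustrative; this is not the benchmark's own code):

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k ranked relevance grades."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_rels, k):
    """DCG of the actual ranking divided by DCG of the ideal ranking."""
    ideal_dcg = dcg_at_k(sorted(ranked_rels, reverse=True), k)
    return dcg_at_k(ranked_rels, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Binary relevance, single relevant tool retrieved at rank 2:
# DCG = 1/log2(3) ≈ 0.631, ideal DCG = 1, so nDCG@5 ≈ 0.631.
```

With binary relevance and one gold tool per query, nDCG behaves much like MRR; with graded, multi-tool relevance (as in the large corpus below) it also rewards retrieving every relevant tool near the top.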
44,453 tools, 7,961 queries — graded multi-tool relevance · 9 models

Featured (nDCG@10): Qwen3-Embedding-0.6B 43.1% · gte-Qwen2-1.5B-instruct 41.3% · e5-mistral-7b-instruct 40.0% · Fine-Tuned Bi-Encoder (v2) 30.3% · Fine-Tuned Bi-Encoder (v1) 21.1%

| # | Model | nDCG@5 | nDCG@10 | nDCG@20 | Recall@5 | Recall@10 | Recall@20 | Comp@5 | Comp@10 | Comp@20 |
|---|-------|--------|---------|---------|----------|-----------|-----------|--------|---------|---------|
| 1 | Qwen3-Embedding-8B (paper) | — | 0.462 | — | — | 0.575 | — | — | 0.475 | — |
| 2 | Qwen3-Embedding-0.6B (paper) | — | 0.431 | — | — | 0.528 | — | — | 0.430 | — |
| 3 | NV-Embed-v1 (paper) | — | 0.427 | — | — | 0.534 | — | — | 0.434 | — |
| 4 | gte-Qwen2-1.5B-instruct (paper) | — | 0.413 | — | — | 0.516 | — | — | 0.406 | — |
| 5 | GritLM-7B (paper) | — | 0.411 | — | — | 0.513 | — | — | 0.404 | — |
| 6 | e5-mistral-7b-instruct (paper) | — | 0.400 | — | — | 0.501 | — | — | 0.406 | — |
| 7 | BM25s (paper) | — | 0.364 | — | — | 0.464 | — | — | 0.390 | — |
| 8 | Fine-Tuned Bi-Encoder (v2) | 0.282 | 0.303 | 0.318 | 0.328 | 0.384 | 0.434 | 0.272 | 0.321 | 0.367 |
| 9 | Fine-Tuned Bi-Encoder (v1) | 0.193 | 0.211 | 0.225 | 0.234 | 0.283 | 0.335 | 0.193 | 0.235 | 0.276 |
10,439 tools, 451 queries — NemoSheng subset · 5 models

Featured (nDCG@5): Fine-Tuned Bi-Encoder (v2) 26.3% · Fine-Tuned Bi-Encoder (v1) 25.4% · MiniLM-L6-v2 (base) 25.4%

| # | Model | nDCG@5 | nDCG@10 | nDCG@20 | Recall@5 | Recall@10 | Recall@20 | Hit@1 | Hit@3 | Hit@5 | Hit@10 | MRR |
|---|-------|--------|---------|---------|----------|-----------|-----------|-------|-------|-------|--------|-----|
| 1 | Ada-002 (EmbSim) (paper) | 0.387 | — | — | — | — | — | — | — | — | — | — |
| 2 | Fine-Tuned Bi-Encoder (v2) | 0.263 | 0.310 | 0.337 | 0.390 | 0.534 | 0.641 | 12.6% | 29.5% | 39.0% | 53.4% | 0.253 |
| 3 | Fine-Tuned Bi-Encoder (v1) | 0.254 | 0.289 | 0.314 | 0.375 | 0.481 | 0.581 | 12.2% | 29.5% | 37.5% | 48.1% | 0.242 |
| 4 | MiniLM-L6-v2 (base) | 0.254 | 0.295 | 0.320 | 0.390 | 0.514 | 0.612 | 11.1% | 28.8% | 39.0% | 51.4% | 0.238 |
| 5 | BM25 (paper) | 0.200 | — | — | — | — | — | — | — | — | — | — |
200 tasks, context-limit focused · 5 compressors · claude-sonnet-4-6

Featured (Pass Rate): Observation Masking 78.0%

| Compressor | Pass Rate | Coverage | Ctx Reduction | Total Tokens | Avg Iterations | Avg Latency |
|------------|-----------|----------|---------------|--------------|----------------|-------------|
| Discode Dynamic | 79.5% | 84.6% | 37.3% | 8.1M | 5.7 | 40.7s |
| Observation Masking | 78.0% | 84.1% | 0% | 12.2M | 6.3 | 33.8s |
| Selective Context | 73.5% | 80.3% | 18.9% | 19.5M | 5.2 | 33.2s |
| No Compression | 71.0% | 73.9% | 0% | 11.1M | 3.8 | 29.8s |
| LLMLingua-2 | 67.5% | 78.3% | 22.4% | 26.1M | 6.3 | 38.9s |
Synthetic function-calling — Simple, Multiple, Parallel, Parallel Multiple · 4 models

Featured (Average Accuracy): Qwen2.5-7B LoRA (BFCL SFT) 95.4% · Gemini 3 Pro (Prompt) 90.6% · Claude Sonnet 4.5 (FC) 88.6% · Claude Opus 4.5 (FC) 88.6%

| # | Model | Average | Simple | Multiple | Parallel | Parallel Multiple |
|---|-------|---------|--------|----------|----------|-------------------|
| 1 | Qwen2.5-7B LoRA (BFCL SFT) | 95.4% | 100.0% | 94.1% | 95.0% | 92.6% |
| 2 | Gemini 3 Pro (Prompt) (paper) | 90.6% | 79.6% | 96.0% | 95.0% | 92.0% |
| 3 | Claude Sonnet 4.5 (FC) (paper) | 88.6% | 72.6% | 95.5% | 94.5% | 92.0% |
| 4 | Claude Opus 4.5 (FC) (paper) | 88.6% | 76.8% | 95.5% | 93.5% | 88.5% |
Real-world function-calling — Simple, Multiple, Parallel, Parallel Multiple · 4 models

Featured (Average Accuracy): Gemini 3 Pro (Prompt) 87.7% · Qwen2.5-7B LoRA (BFCL SFT) 85.2% · Claude Sonnet 4.5 (FC) 84.8% · Claude Opus 4.5 (FC) 81.8%

| # | Model | Average | Simple | Multiple | Parallel | Parallel Multiple |
|---|-------|---------|--------|----------|----------|-------------------|
| 1 | Gemini 3 Pro (Prompt) (paper) | 87.7% | 87.6% | 81.8% | 93.8% | 87.5% |
| 2 | Qwen2.5-7B LoRA (BFCL SFT) | 85.2% | 73.3% | 87.5% | 100.0% | 80.0% |
| 3 | Claude Sonnet 4.5 (FC) (paper) | 84.8% | 89.5% | 78.9% | 87.5% | 83.3% |
| 4 | Claude Opus 4.5 (FC) (paper) | 81.8% | 86.4% | 78.2% | 87.5% | 75.0% |
AgentShield corpus leaderboard · 7 providers · 537 tests

Featured (Final Score): ProtectAI DeBERTa v2 51.4

| Provider | Final Score | PI | Jailbreak | Data Exfil | Tool Abuse | Over-Refusal | Multi-Agent | Provenance | p50 (ms) |
|----------|-------------|----|-----------|------------|------------|--------------|-------------|------------|----------|
| AgentGuard | 98.4 | 98.5 | 97.8 | 100 | 100 | 100 | 100 | 85 | 1 |
| Deepset DeBERTa | 87.6 | 99.5 | 97.8 | 95.4 | 98.8 | 63.1 | 100 | 100 | 19 |
| StackOne Defender | 79.82 | 92.68 | 68.89 | 91.95 | 83.75 | 72.31 | 88.57 | 80 | 10.32 |
| Lakera Guard | 79.4 | 97.6 | 95.6 | 96.6 | 86.3 | 58.5 | 94.3 | 95 | 133 |
| ProtectAI DeBERTa v2 | 51.4 | 77.1 | 86.7 | 43.7 | 12.5 | 95.4 | 74.3 | 65 | 19 |
| ClawGuard | 38.9 | 62.9 | 22.2 | 40.2 | 17.5 | 100 | 40 | 25 | 0 |
| LLM Guard | 38.7 | 77.1 | — | 30.8 | 8.9 | — | — | — | 111 |
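The prompt-injection tables that follow rank detectors by F1 on binary classification, with "injection" as the positive class. As a reminder of the arithmetic (a minimal sketch; the counts are illustrative, not from these datasets):

```python
def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall for the positive class."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# e.g. 90 true positives, 10 false positives, 10 false negatives:
# precision = recall = 0.9, so F1 = 0.9
```

F1 penalizes both missed injections (false negatives) and over-blocking benign prompts (false positives), which is why a detector can score well on one dataset and poorly on another with a different attack mix.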
Large-scale diverse · 65,000 samples · HuggingFace dataset · 5 models

Featured (Jayavibhav F1): ProtectAI DeBERTa-v3 73.6%

| Model | F1 | Size (MB) | Latency (ms) | Hardware |
|-------|----|-----------|--------------|----------|
| StackOne | 97.44% | 22 | 4.31 | CPU |
| DistilBERT | 75.20% | 1,789 | 6.99 | GPU |
| ProtectAI DeBERTa-v3 | 73.55% | 704 | 43 | T4 GPU |
| Meta PG v2 | 62.50% | 1,064 | 43 | T4 GPU |
| Meta PG v1 | 54.74% | 1,064 | 43 | T4 GPU |
Curated enterprise prompts · 1,518 samples · HuggingFace dataset · 5 models

Featured (Qualifire F1): ProtectAI DeBERTa-v3 64.0%

| Model | F1 | Size (MB) | Latency (ms) | Hardware |
|-------|----|-----------|--------------|----------|
| DistilBERT | 89.07% | 1,789 | 6.99 | GPU |
| StackOne | 86.41% | 22 | 4.31 | CPU |
| ProtectAI DeBERTa-v3 | 64.04% | 704 | 43 | T4 GPU |
| Meta PG v2 | 60.27% | 1,064 | 43 | T4 GPU |
| Meta PG v1 | 55.40% | 1,064 | 43 | T4 GPU |
5 attack types · 22,500 samples · HuggingFace dataset · 5 models

Featured (xxz224 F1): ProtectAI DeBERTa-v3 33.1%

| Model | F1 | Size (MB) | Latency (ms) | Hardware |
|-------|----|-----------|--------------|----------|
| DistilBERT | 93.59% | 1,789 | 6.99 | GPU |
| Meta PG v1 | 92.41% | 1,064 | 43 | T4 GPU |
| StackOne | 82.28% | 22 | 4.31 | CPU |
| Meta PG v2 | 66.49% | 1,064 | 43 | T4 GPU |
| ProtectAI DeBERTa-v3 | 33.06% | 704 | 43 | T4 GPU |