Introducing ScribeAgent, setting a new bar for performance on web-navigation benchmarks

By Mouad Hadji
4 min read
Updated December 3, 2024
We're thrilled to introduce ScribeAgent, a new family of proprietary LLMs that achieves state-of-the-art results on multiple web-navigation benchmarks.

Introduction

Key Takeaways

  • By fine-tuning open-source language models on production-scale workflow data, we demonstrate unprecedented capabilities in understanding and executing complex web tasks while significantly reducing computational costs compared to existing solutions.
  • Our models, ScribeAgent-Small and ScribeAgent-Large, achieve state-of-the-art performance on Mind2Web and substantially improve on the WebArena baseline, reaching a 51.3% task success rate and surpassing the previous best text-only results by 14.1 percentage points.
  • In partnership with our co-authors at Carnegie Mellon University, we open-source our complete code, along with a version of our LLM trained on open-source datasets.
  • If you are passionate about cutting-edge AI techniques and turning innovations into impactful features for millions of users, join our team! We are hiring for multiple open ML/AI engineering roles!

Read the full paper here: https://arxiv.org/abs/2411.15004

Key Results

We evaluated our proprietary models, ScribeAgent-Small (fine-tuned from Qwen2 7B) and ScribeAgent-Large (fine-tuned from Qwen2.5 32B), on three benchmarks:

  • One internal benchmark:
    • Our human-annotated proprietary workflow dataset, comprising 1,200 enterprise software workflows collected across 250 real-world domains
  • Two external benchmarks:
    • Mind2Web, a static text-based dataset for assessing the navigation ability of web agents
    • WebArena, a dynamic web environment for end-to-end task completion benchmarking

We do not use any task-specific adaptation for Mind2Web or WebArena, even when extra training data is available. The following sections summarize our results.

Proprietary Workflow Benchmark

On our proprietary test dataset, the ScribeAgent models significantly outperform the proprietary GPT-4o and GPT-4o mini in the next-step prediction setting. This shows the benefit of specialized fine-tuning over general-purpose LLMs. Moreover, while the non-fine-tuned Qwen2 performs poorly, fine-tuning it on our dataset boosts its performance by nearly 6x, highlighting the importance of domain-specific data.

Figure: Proprietary workflow benchmark comparison

We also benchmarked the o1 suite of models on a subset of our test set. While o1-preview performs the best among all general-purpose baselines, ScribeAgent models still outperform it by a wide margin, highlighting the importance of fine-tuning on real-world web navigation data.

Figure: Performance vs. speed on our workflow benchmark

Notably, ScribeAgent models do not require any inference-time scaling, whereas most proprietary baselines are larger and slower at inference time. This makes the ScribeAgent family a better choice in terms of accuracy, latency, and cost.
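
To make the next-step prediction setting mentioned above more concrete, here is a minimal sketch of how such a score could be computed. It assumes an exact-match criterion over the action, the target element, and any typed value; the post does not spell out the exact metric, so the `Step` schema, the function name, and the toy data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One workflow step (hypothetical schema, not the actual dataset format)."""
    action: str      # e.g. "click" or "type"
    target: str      # hypothetical element identifier
    value: str = ""  # text entered, if any


def next_step_accuracy(predictions: list[Step], references: list[Step]) -> float:
    """Fraction of predicted steps that exactly match the annotated next step."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


# Toy data, invented for illustration only.
references = [
    Step("click", "button#new-report"),
    Step("type", "input#title", "Q3 summary"),
    Step("click", "button#save"),
    Step("click", "a#share"),
]
predictions = [
    Step("click", "button#new-report"),
    Step("type", "input#title", "Q3 summary"),
    Step("click", "button#cancel"),  # wrong target element
    Step("click", "a#share"),
]

print(next_step_accuracy(predictions, references))  # 0.75
```

Exact match is the strictest reasonable choice; a real evaluation might score the action and the target element separately.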

Mind2Web

On the Mind2Web benchmark, ScribeAgent-Large achieves state-of-the-art zero-shot performance in the multi-stage setting and is competitive with the best fine-tuned baseline. We attribute our model’s strong performance to the diversity and high quality of the workflows in our dataset.

Figure: Average element accuracy on the Mind2Web benchmark

WebArena

Compared with existing text-only baselines, ScribeAgent-Small augmented with GPT-4o obtains the highest task success rate in all five categories, a 12.25% improvement in total success rate over the previous best result, AgentOccam.

Notably, on Reddit and GitLab tasks, where the domains are more realistic and thus closer to those in our training data, ScribeAgent-Small demonstrates stronger generalization and achieves higher task success rates than in other domains.
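
For readers comparing success rates across categories, the sketch below shows one way per-category and overall task success rates might be aggregated, and how an absolute gain (in percentage points) differs from a relative gain (in percent). The function names and the toy results are hypothetical and are not taken from the WebArena harness or from our evaluation.

```python
from collections import defaultdict


def success_rates(results: list[tuple[str, bool]]) -> dict[str, float]:
    """Per-category and overall task success rate, in percent."""
    totals: dict[str, int] = defaultdict(int)
    wins: dict[str, int] = defaultdict(int)
    for category, succeeded in results:
        totals[category] += 1
        wins[category] += int(succeeded)
    rates = {c: 100.0 * wins[c] / totals[c] for c in totals}
    rates["overall"] = 100.0 * sum(wins.values()) / len(results) if results else 0.0
    return rates


def point_gain(new_rate: float, old_rate: float) -> float:
    """Absolute improvement, in percentage points."""
    return new_rate - old_rate


def relative_gain(new_rate: float, old_rate: float) -> float:
    """Improvement relative to the old rate, in percent."""
    return 100.0 * (new_rate - old_rate) / old_rate


# Toy outcomes, invented for illustration only: (category, task succeeded?)
toy_results = [
    ("reddit", True), ("reddit", True),
    ("gitlab", True), ("gitlab", False),
    ("shopping", True), ("shopping", False),
    ("shopping", False), ("shopping", False),
]

rates = success_rates(toy_results)
print(rates["reddit"], rates["gitlab"], rates["shopping"], rates["overall"])
print(point_gain(50.0, 40.0), relative_gain(50.0, 40.0))
```

The same pair of rates gives different headline numbers depending on which gain is reported, so it is worth checking whether a figure is stated in percentage points or in percent.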

Conclusion

With the ScribeAgent family of models, we showcase the power of domain-specific LLMs and how fine-tuning on high-quality, real-world workflow data can benefit specialized web agents. To experience the power of AI and stay up to date on our upcoming releases, try Scribe today.
