Key Takeaways
- By fine-tuning open-source language models on production-scale workflow data, we demonstrate unprecedented capabilities in understanding and executing complex web tasks while significantly reducing computational costs compared to existing solutions.
- Our models, ScribeAgent-Small and ScribeAgent-Large, achieve state-of-the-art performance on Mind2Web and substantially improve on the WebArena baseline, reaching a 53% task success rate and surpassing the previous best text-only result by 7.3 percentage points.
- In partnership with our co-authors at Carnegie Mellon University, we open-source our complete code, along with a version of our LLM trained on open-source datasets.
- If you are passionate about cutting-edge AI techniques and turning innovations into impactful features for millions of users, join our team! We are hiring for multiple open ML/AI engineering roles!
Read the full paper here: https://arxiv.org/abs/2411.15004
Key Results
We evaluated our proprietary models, ScribeAgent-Small (fine-tuned from Qwen2 7B) and ScribeAgent-Large (fine-tuned from Qwen2.5 32B), on three benchmarks:
- One internal benchmark:
  - Our human-annotated proprietary workflow dataset, comprising 1,200 enterprise software workflows collected across 250 real-world domains
- Two external benchmarks:
  - Mind2Web, a static text-based dataset for assessing the navigation ability of web agents
  - WebArena, a dynamic web environment for end-to-end task completion benchmarking
We use no task-specific adaptation for Mind2Web or WebArena, even when extra training data is available. We summarize our results in the following sections.
Proprietary Workflow Benchmark
On our proprietary test dataset, our ScribeAgent models significantly outperform GPT-4o and GPT-4o mini in the next-step prediction setting. This demonstrates the benefit of specialized fine-tuning over general-purpose LLMs. Moreover, while the non-fine-tuned Qwen2 performs poorly, fine-tuning on our dataset boosts its performance by nearly 6x, highlighting the importance of domain-specific data.
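To make the next-step prediction setting concrete, here is a minimal sketch of how such an evaluation can be scored: given a gold sequence of annotated workflow actions, each model prediction is compared to the corresponding gold action and accuracy is the fraction of matching steps. The function names, the action string format, and exact-match scoring are illustrative assumptions, not ScribeAgent's actual evaluation code.

```python
# Hypothetical next-step prediction scoring. Action strings like
# "click(node=42)" are an assumed format for illustration only.

def exact_match(predicted: str, gold: str) -> bool:
    """Compare a predicted action to the gold action, ignoring
    surrounding whitespace and case."""
    return predicted.strip().lower() == gold.strip().lower()

def next_step_accuracy(predictions: list[str], gold_actions: list[str]) -> float:
    """Fraction of workflow steps where the predicted next action
    matches the human-annotated gold action."""
    if not gold_actions:
        return 0.0
    hits = sum(exact_match(p, g) for p, g in zip(predictions, gold_actions))
    return hits / len(gold_actions)

# Example: 3 of 4 predicted steps match the annotated workflow.
preds = ["click(node=42)", "type(node=7, text='hi')", "click(node=9)", "scroll(down)"]
gold  = ["click(node=42)", "type(node=7, text='hi')", "click(node=10)", "scroll(down)"]
print(next_step_accuracy(preds, gold))  # → 0.75
```

In practice, agent evaluations often relax exact match (e.g., crediting any element within the same interactive region), but step-level accuracy of this form is the core metric being compared across models here.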