Pulse - Scribe's dev blog
Engineering
January 3, 2025
8 min read

ScribeAgent: Towards Specialized Web Agents Using Production-Scale Workflow Data


Key Takeaways

  • By fine-tuning open-source language models on production-scale workflow data, we demonstrate unprecedented capabilities in understanding and executing complex web tasks while significantly reducing computational costs compared to existing solutions.
  • Our models, ScribeAgent-Small and ScribeAgent-Large, achieve state-of-the-art performance on Mind2Web and substantially improve on the WebArena baseline, reaching a 53% task success rate and surpassing the previous best text-only result by 7.3 percentage points.
  • In partnership with our co-authors at Carnegie Mellon University, we open-source our complete code, along with a version of our LLM trained on open-source datasets.
  • If you are passionate about cutting-edge AI techniques and turning innovations into impactful features for millions of users, join our team! We are hiring for multiple open ML/AI engineering roles!

Read the full paper here: https://arxiv.org/abs/2411.15004
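
The first takeaway above hinges on converting recorded workflows into supervised training examples. The following is a minimal, hypothetical sketch of that conversion; the field names, prompt layout, and action syntax are illustrative assumptions, not Scribe's actual data schema:

```python
# Hypothetical sketch: turning one recorded workflow into supervised
# (prompt, target) pairs for next-step fine-tuning. Field names and
# action syntax are illustrative, not Scribe's actual schema.

def workflow_to_pairs(workflow):
    """Each step becomes one training example: predict that step's
    action given the task, the DOM snapshot, and all prior actions."""
    pairs = []
    history = []
    for step in workflow["steps"]:
        prompt = (
            f"Objective: {workflow['task']}\n"
            f"History: {'; '.join(history) or 'none'}\n"
            f"Observation: {step['dom']}\n"
            "Next action:"
        )
        pairs.append({"prompt": prompt, "target": step["action"]})
        history.append(step["action"])
    return pairs

# Toy two-step workflow: each step yields one example, and the second
# example's prompt carries the first action in its history.
demo = {
    "task": "invite a teammate",
    "steps": [
        {"dom": "<button id=3>Members</button>", "action": "click(node=3)"},
        {"dom": "<input id=5 name=email>", "action": "type(node=5, 'a@b.com')"},
    ],
}
pairs = workflow_to_pairs(demo)
```

Formatting every step as an independent (prompt, target) pair is what makes standard supervised fine-tuning applicable to multi-step web tasks.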

Key Results

We evaluated our proprietary models ScribeAgent-Small (fine-tuned on Qwen2 7B) and ScribeAgent-Large (fine-tuned on Qwen2.5 32B) on three benchmarks:

  • One internal benchmark:
    • Our human-annotated proprietary workflow dataset, comprising 1,200 enterprise software workflows collected across 250 real-world domains
  • Two external benchmarks:
    • Mind2Web, a static text-based dataset for assessing the navigation ability of web agents
    • WebArena, a dynamic web environment for end-to-end task completion benchmarking

We do not use any task-specific adaptation for Mind2Web or WebArena, even when extra training data is available. The following sections summarize our results.

Proprietary Workflow Benchmark

On our proprietary test dataset, the ScribeAgent models significantly outperform GPT-4o and GPT-4o mini in the next-step prediction setting, showing the benefit of specialized fine-tuning over general-purpose LLMs. Moreover, while the non-fine-tuned Qwen2 performs poorly, fine-tuning on our dataset boosts its performance nearly 6x, highlighting the importance of domain-specific data.
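
The next-step prediction setting can be scored as step-level exact-match accuracy: the model sees the task, the action history, and the current page, and must emit exactly the ground-truth next action. Here is a hedged sketch of that metric; the prompt format, action syntax, and the stub model are hypothetical, not Scribe's actual evaluation harness:

```python
# Hypothetical sketch of next-step prediction evaluation: exact-match
# accuracy over individual steps. Data format is illustrative only.

def evaluate_next_step(model_predict, examples):
    """Return step-level exact-match accuracy in [0.0, 1.0]."""
    correct = 0
    for ex in examples:
        prompt = (
            f"Objective: {ex['task']}\n"
            f"History: {ex['history']}\n"
            f"Observation: {ex['dom']}\n"
            "Next action:"
        )
        pred = model_predict(prompt).strip()
        if pred == ex["gold_action"]:
            correct += 1
    return correct / len(examples)

# Toy usage with a stub "model" that always emits the same click:
examples = [
    {"task": "open settings", "history": "none",
     "dom": "<button id=7>Settings</button>", "gold_action": "click(node=7)"},
    {"task": "open settings", "history": "click(node=7)",
     "dom": "<a id=9>Profile</a>", "gold_action": "click(node=9)"},
]
stub = lambda prompt: "click(node=7)"
accuracy = evaluate_next_step(stub, examples)  # matches 1 of 2 steps
```

Exact match is a strict criterion: an action that is semantically equivalent but formatted differently counts as wrong, which is one reason step-level accuracy understates end-to-end capability.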