TimeWarp is a benchmark for evaluating the robustness of agents to temporal changes in web UI. TimeWarp consists of three web environments: Wiki, News, and Shop, each with six UI versions across different eras of the internet. The benchmark also includes TimeTraj, a method for scalably collecting trajectories via human-refined plans, and TimeWarp-BC, a variant of Behavior Cloning (BC) to train agents better via knowledge distillation on complex tasks that require memory and planning.
Four categories of tasks: Wiki, News, Shop, and Multi-Environment.
231 unique goals × 6 environments = 1386 tasks.
View and explore the TimeWarp dataset on Hugging Face .
TimeTraj is a scalable method for collecting trajectories from a single human-refined plan per task, which is used to automatically generate trajectories across versions.
TimeWarp-BC extends Behavior Cloning by training the web agent on the teacher's full response, including action, thinking, planning, and memory tokens.
@misc{timewarp2026,
title={TimeWarp: Evaluating Web Agents by Revisiting the Past},
author={Md Farhan Ishmam and Kenneth Marino},
year={2026},
eprint={2603.04949},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2603.04949},
}
We would like to thank Nejd Khadija for helping with the task and plan annotation.