Rubik's Arena

Models race the cube referee.

A visual Rubik's cube benchmark: deterministic scrambles, strict move validation, JSON traces, replay artifacts, and public receipts. First real model race is next.
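A minimal sketch of what a deterministic scramble could look like, assuming scrambles are derived from an integer seed. The function name `scramble`, its parameters, and the rule against repeating a face on consecutive moves are illustrative assumptions, not the harness's actual code.

```python
import random

FACES = ["U", "D", "L", "R", "F", "B"]
SUFFIXES = ["", "'", "2"]  # clockwise, counter-clockwise, half turn


def scramble(seed, depth):
    """Return a reproducible list of moves: same seed and depth, same scramble."""
    rng = random.Random(seed)  # private generator, global random state untouched
    moves = []
    last_face = None
    while len(moves) < depth:
        face = rng.choice(FACES)
        if face == last_face:
            continue  # assumed rule: never turn the same face twice in a row
        moves.append(face + rng.choice(SUFFIXES))
        last_face = face
    return moves
```

Calling `scramble(seed=42, depth=20)` twice returns the same 20 moves, so every model in a race could be handed an identical starting state.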

3x3 MVP · 2D visual net · real model receipt pending

2D net prompts show all six cube faces; a single 3D screenshot shows at most three faces and would hide the rest of the state.
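To illustrate the net layout, here is a text-only sketch of an unfolded cube. The cross-shaped arrangement (U on top, then L, F, R, B in a row, D at the bottom), the `render_net` helper, and the color letters are assumptions for illustration; the benchmark itself renders a PNG.

```python
def render_net(faces):
    """Render six faces as a cross-shaped text net.

    `faces` maps each of U, D, L, R, F, B to a list of 9 sticker
    letters in row-major order (hypothetical state format).
    """
    def row(face, r):
        return "".join(faces[face][3 * r:3 * r + 3])

    pad = " " * 3  # indent U and D so they sit above and below F
    lines = [pad + row("U", r) for r in range(3)]
    lines += ["".join(row(f, r) for f in "LFRB") for r in range(3)]
    lines += [pad + row("D", r) for r in range(3)]
    return "\n".join(lines)


# Solved cube with a common color scheme (assumed): U white, D yellow,
# L orange, R red, F green, B blue.
solved = {face: [color] * 9 for face, color in zip("UDLRFB", "WYORGB")}
print(render_net(solved))
```

All 54 stickers appear in the output, which is the property the 2D net is chosen for.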

Fixture oracle runs are harness checks and are intentionally not public model results.

Real VLM lanes are pending until an image-capable endpoint is registered.

current status

Harness built. Public model lanes pending.


next run · text

Text-state baseline

Run local text models on shallow 3x3 states to separate cube reasoning from visual perception.

main event · visual

2D visual net

Image-capable models get a PNG net showing all six faces, then return strict JSON moves.
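A sketch of what strict validation of the returned JSON might involve. The reply schema (an object with a single `"moves"` key), the `parse_moves` helper, and the `max_moves` cap are assumptions for illustration; the page does not specify the real schema.

```python
import json

# Assumed vocabulary: the 18 face turns in standard notation.
LEGAL_MOVES = {face + suffix for face in "UDLRFB" for suffix in ("", "'", "2")}


def parse_moves(reply, max_moves=50):
    """Parse a model reply strictly; raise ValueError on any deviation."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(data, dict) or set(data) != {"moves"}:
        raise ValueError('reply must be an object with exactly one key, "moves"')
    moves = data["moves"]
    if not isinstance(moves, list) or len(moves) > max_moves:
        raise ValueError(f'"moves" must be a list of at most {max_moves} items')
    for i, move in enumerate(moves):
        if not isinstance(move, str) or move not in LEGAL_MOVES:
            raise ValueError(f"illegal move at index {i}: {move!r}")
    return moves
```

With this sketch, `parse_moves('{"moves": ["R", "U\'", "F2"]}')` returns the three moves unchanged, while a reply containing `"R3"`, extra keys, or prose around the JSON is rejected outright rather than repaired.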

receipt polish · medium

Race replay

Once real lanes exist, replace this placeholder with a clearer animated replay/video.

The fixture oracle stays harness-only; public leaderboard entries should come from actual text-model or VLM runs.