YouTube Puts AI Labels Where You Can’t Miss Them

YouTube is moving its AI disclosure labels out of the footnotes and onto the main stage. Labels on long-form videos now appear directly below the player. On Shorts, they appear as overlays — right on the video itself.
That’s a meaningful UX shift. Burying a label in a description tab is very different from putting it in front of your face before you’ve formed an opinion about what you’re watching.
The Platform Will Label You Even If You Don’t Label Yourself
The more interesting part: YouTube will now automatically apply AI labels when its own systems detect “significant photorealistic AI use” — even if the creator never disclosed anything. Creators can dispute incorrect labels through YouTube Studio, but for videos made with YouTube’s own AI tools, the disclosure sticks permanently.
That’s a quiet but firm policy move. The platform is essentially saying: transparency isn’t optional anymore, and we’ll enforce it ourselves if we have to.
Why This Is Happening Now

The timing isn’t coincidental. Google just unveiled Gemini Omni at I/O 2026 — a multimodal model combining Gemini with Veo, Nano Banana, and Genie. New Shorts features let users restyle videos, insert themselves into clips, and remix other creators’ content entirely with AI.
More powerful creation tools mean more AI-generated content flooding the platform. More AI-generated content means more pressure to tell viewers what they’re actually looking at. The labels are the counterweight to the tools.
YouTube is careful to note that none of this affects recommendations or monetization. The goal, as they put it, is simple: give creators and viewers the right information. Whether that holds as AI video becomes indistinguishable from real footage is a question worth watching.
Huawei’s Benchmark Asks AI Agents to Do Real Work. They Mostly Can’t.

Meanwhile, researchers from Huawei Technologies, Beijing Institute of Technology, Peking University, and the Chinese Academy of Sciences published something that should recalibrate a lot of AI agent hype.
It’s called Claw-Anything, and it’s a benchmark designed to test AI personal assistants on tasks that resemble actual human life — not sanitized, single-step demos.
What “Real Life” Looks Like as a Benchmark

Most existing benchmarks give AI agents a clean desk and a clear task. Claw-Anything drops them into a mess. Each task spans more than three months of simulated user activity, involves an average of 10.1 interdependent backend services, and requires interaction across both CLI Linux and GUI Android environments.
The average context per task is 191,700 words. Most benchmarks sit between 1,700 and 12,000 words, which makes Claw-Anything's tasks roughly 16 to 113 times longer. That’s not a gap; that’s a different problem entirely.
The Numbers Are Humbling
GPT-5.5 — OpenAI’s flagship model, explicitly built with agentic and long-horizon tasks in mind — scored 34.5% on pass@1. That’s the probability of completing a task correctly on the first try, no retries.
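For readers who want the metric pinned down, here is a minimal Python sketch of pass@1, plus the more general pass@k estimator popularized by OpenAI's Codex paper (Chen et al., 2021). The function names and the sample data are illustrative assumptions; nothing here describes how the Claw-Anything team actually computes its scores.

```python
from math import comb


def pass_at_1(first_attempt_results):
    """Fraction of tasks solved on the first attempt.

    `first_attempt_results` holds one boolean per task (hypothetical data).
    """
    return sum(first_attempt_results) / len(first_attempt_results)


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one task (Chen et al., 2021).

    n: attempts sampled for the task
    c: how many of those attempts were correct
    k: attempt budget being scored
    """
    if n - c < k:
        # Fewer than k failures exist, so any k attempts include a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Four hypothetical tasks, two solved on the first try.
    print(pass_at_1([True, False, False, True]))
    # One task, 10 sampled attempts, 3 correct, scored at k=1.
    print(pass_at_k(10, 3, 1))
```

With k set to 1, the estimator reduces to the plain success rate c / n, which is why a pass@1 score reads as "how often does the first attempt work."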
Other models that look impressive on conventional benchmarks dropped even further. The benchmark also tests proactive assistance — cases where the agent spots a need and acts without being asked. Agents scored 25.9% on reactive tasks and just 6.7% on proactive ones.
That second number is worth sitting with. An agent that only acts when explicitly told to isn’t really an assistant. It’s a very fast search bar.
The Benchmark’s Actual Argument
The researchers aren’t just publishing scores — they’re making a pointed case about how the industry measures progress. Current benchmarks treat agents like task solvers handed a clean problem. Claw-Anything treats them like assistants dropped into accumulated noise, conflicting signals, and months of context they have to parse before doing anything useful.
When cross-service tools were removed in ablation tests, success rates fell to nearly zero. Most tasks require agents to retrieve information and act across multiple backends simultaneously. Single-service performance is largely irrelevant to real-world utility.
The team also released an automated data pipeline that generated 2,000 training environments. Fine-tuning an open-weight model on that data improved task success by 23.7% — which suggests the gap isn’t purely architectural. Better training data for messy, long-horizon tasks moves the needle.
