Published 2 months ago

Mass Data Scraping for GPT‑3, Gemini, Llama & More: Amnesty International’s Case Against Generative AI

A major human rights organisation has entered the AI accountability debate — and its findings are difficult to dismiss.

In a detailed briefing titled Unlawful by Design: Exposing the Human Rights Costs of Generative AI, Amnesty International argues that the data pipelines powering today’s most prominent generative AI systems are not incidentally problematic. They are structurally so. The report examines models behind tools including OpenAI’s GPT‑3, Google’s Gemini, Meta’s Llama, DeepSeek, Midjourney, and Stable Diffusion — products used by millions daily.

For founders evaluating AI tools and operators building on top of these models, this briefing raises questions that go well beyond regulatory compliance.

Key Highlights

Amnesty says leading generative AI models are “unlawful by design” due to mass, non-consensual data scraping
Structural bias and potential threats to freedom of thought are baked into current web-scale AI training pipelines
Rising data centre emissions and local resistance link AI training directly to environmental and social harms

The Core Accusation: Unlawful by Design

Amnesty International’s central claim is precise: generative AI companies are conducting mass, automated extraction of publicly available online data — including personal images and social media activity — without the explicit consent of the individuals involved.

This practice, commonly known as web scraping, is not new. What is new is its scale, its opacity, and its direct integration into commercial products generating significant revenue.

Likhita Banerji, Head of Amnesty’s Algorithmic Accountability Lab, frames it plainly: the “extractive data pipeline” and “exploitative supply chains” behind these systems have created a paradigm of technology development that opens the door to mass human rights abuse. The language is deliberate. This is not a critique of edge cases — it is a critique of the architecture itself.

What the Data Pipeline Actually Involves

To follow the argument, it helps to know what Amnesty is actually examining.

The briefing focuses on three interconnected stages: data capture, data processing, and model scaling. At the capture stage, billions of public posts, images, and web pages are harvested automatically. At the processing stage, this raw material is filtered, labelled, and structured. At the scaling stage, larger models ingest larger datasets — and the problems compound.

Each stage introduces risk. Data capture without consent builds privacy violations into the system from the start. Processing without adequate bias filtering amplifies discriminatory content. Scaling without governance accelerates both problems simultaneously.

The result, Amnesty argues, is not a bug in these systems. It is a feature of how they were built.
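For operators who want a concrete picture of the alternative, the capture and processing stages described above can be sketched with a consent gate placed at capture. This is a minimal illustration under stated assumptions, not a description of any vendor's pipeline or of a method proposed in the briefing: the `Record` fields, the `consent_given` flag, and the function names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """A hypothetical scraped item. Field names are illustrative assumptions."""
    source_url: str
    text: str
    contains_personal_data: bool
    consent_given: bool


def capture(candidates):
    """Capture stage: keep a record only if it holds no personal data
    or if consent was recorded for it."""
    return [
        r for r in candidates
        if not r.contains_personal_data or r.consent_given
    ]


def process(records):
    """Processing stage: a stand-in for filtering and labelling.
    Here it only normalises whitespace and drops empty text."""
    cleaned = []
    for r in records:
        text = " ".join(r.text.split())
        if text:
            cleaned.append(
                Record(r.source_url, text, r.contains_personal_data, r.consent_given)
            )
    return cleaned


def build_training_set(candidates):
    """Run capture before processing, so excluded data never moves downstream."""
    return process(capture(candidates))


# Made-up example inputs, not real data.
candidates = [
    Record("https://example.org/a", "Public  documentation page", False, False),
    Record("https://example.org/b", "Personal photo caption", True, False),
    Record("https://example.org/c", "Opt-in forum post", True, True),
]
training_set = build_training_set(candidates)
print([r.source_url for r in training_set])
```

The design point is the ordering: because the gate sits at capture, non-consented personal data is excluded before any processing or scaling happens, rather than being filtered out later.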

Bias as a Structural Output, Not an Anomaly

One of the briefing’s more technically grounded observations concerns algorithmic bias.

Because training datasets are drawn predominantly from the open web, they inherit the biases present there — racial stereotypes, gender prejudices, cultural blind spots. As model size increases and training data expands, these biases are not diluted. They are reinforced and amplified.

Amnesty identifies racial and gendered bias as “consistent features” of generative AI systems, not occasional failures. This matters for any organisation deploying these tools in hiring, content moderation, customer service, or healthcare contexts. The bias is not something a prompt can reliably neutralise.

There is also a subtler concern raised in the briefing: the risk to freedom of thought. Large language models capable of shaping predictive suggestions and personalised outputs may, over time, influence users’ beliefs and reasoning patterns. This is speculative territory, but it is not unreasonable territory.

The Environmental Ledger

The human rights framing extends beyond privacy and bias into environmental harm — and here the briefing cites hard numbers.

Google’s own 2024 sustainability report recorded a 48 per cent increase in greenhouse gas emissions since 2019, attributed directly to data centre operations and supply chain demands driven by AI workloads. Microsoft reported a 29 per cent increase in emissions between 2020 and 2024 for comparable reasons.

These are not projections. They are self-reported figures from the companies themselves.

The infrastructure required to train and serve large generative AI models — energy-intensive chips, vast data centres, substantial water cooling systems — does not exist in a vacuum. Communities in Chile’s Cerrillos region, Querétaro in Mexico, and parts of Arizona have actively resisted data centre construction in areas already facing drought conditions and electricity shortages. The environmental cost of AI development is not evenly distributed.

Who Responded — and What That Signals

Amnesty International contacted Google, OpenAI, Meta, Stability AI, Midjourney, and DeepSeek with its findings prior to publication. It also wrote to Intel, VMware, Microsoft, and Amazon regarding specific concerns around discrimination and environmental harm.

At the time of publication, only Microsoft, Amazon, Intel, OpenAI, and Meta had responded. Google, Stability AI, Midjourney, and DeepSeek did not.

The pattern of non-response is itself informative, though it is not uniform. Most of the companies with established legal and communications infrastructure engaged; Google is the notable exception. The newer or more opaque players (Stability AI, Midjourney, and DeepSeek) did not. For anyone evaluating which AI vendors take accountability seriously, this is a meaningful data point.

What Amnesty Is Calling For

The briefing’s recommendations are direct and, by industry standards, ambitious.

Amnesty calls on governments to prohibit standalone generative AI systems built on unlawful web scraping — defined as bulk, mass collection of training data without consent. It calls on companies to immediately cease non-consensual scraping of personal data for training purposes. And it calls on states to hold companies accountable for human rights abuses linked to their design choices.

Critically, Amnesty defines “standalone generative AI” narrowly: products developed and marketed specifically for their generative capabilities, such as chatbots and image generators. This excludes generative AI as an optional feature within broader software suites — a distinction that matters for how regulation might be scoped.

What This Means for AI Tool Adopters

For the founders, marketers, and operators who make up the core audience of this platform, Amnesty’s briefing is not an abstract policy document. It has practical implications.

On tool selection: The models underpinning the tools you use were trained on data pipelines that may not meet emerging legal standards in the EU, UK, or beyond. Understanding which vendors have published transparent data sourcing policies is increasingly relevant due diligence.

On bias risk: If your use case involves decisions affecting people — recruitment, content ranking, customer segmentation — the structural bias documented here is a liability, not a footnote.

On regulatory trajectory: Amnesty’s intervention adds significant institutional weight to calls for stricter AI data regulation. The EU AI Act is already in motion. Further restrictions on training data practices are a plausible near-term development, not a distant scenario.

A Different Trajectory Is Possible

The most important sentence in Amnesty’s briefing may be the quietest one.

“These choices are not inevitable.”

The argument is not that generative AI is inherently harmful or that its development should stop. It is that the specific design choices made by specific companies — to scrape at scale, without consent, without adequate bias controls — represent a path chosen, not a path required.

That distinction matters. It means the problems are correctable. It also means that companies choosing to operate differently — through licensed data, synthetic datasets, or consent-based collection — are not at a fundamental disadvantage. They are making a different bet on where regulation and public trust are heading.

For those observing the AI tools ecosystem closely, that bet is worth tracking.

Fani López

Published 7 articles across Trend Analysis, Insights, AI Use Cases, News, and Explainer since May 2026.
