Research phase · 2026

India has world-class AI engineers
and almost no world-class
Indian AI training data.
That's the gap.
We're starting to fill it.

The global AI race is a data race. Models are commodified — anyone can fine-tune Llama. What's not commodified is high-quality, domain-specific, geographically-grounded data. India is one of the largest pools of underrepresented data on Earth.

§ Data business · DMA-DAT-001 · Project Y · 2026

§ 01 — The problem

The English internet
has been scraped to exhaustion.

The next wave of AI differentiation will come from high-quality, domain-specific, geographically-grounded data — and India is one of the largest pools of underrepresented data on Earth.

Most global AI labs cannot collect this themselves. They lack on-the-ground operations, language coverage, regulatory familiarity, and trust. We can.

Almost none of this data exists at the quality and scale that global AI labs need. We can fix this. We're starting in 2026.

§ 02 — Where India is undercollected

Six verticals.
All missing.

Automotive & street imagery

Indian roads, vehicles, signage, traffic patterns. The working hypothesis for our first vertical — builds on our computer vision background.

Regional language audio

Tamil, Telugu, Kannada, Malayalam, Bengali — spoken in real-world conditions, not curated studio recordings. At scale, annotated.

Indian legal text

High Court and Supreme Court judgments, citation graphs, legislative text. LegalPro is already a partial proxy: we're sitting on a significant corpus.

Business documents & workflows

Indian forms, invoices, tax filings, regulatory documents. The kind of structured business data global models handle badly for Indian contexts.

Agricultural imagery

Indian crops, soil types, irrigation patterns, pest identification. Agricultural AI is a massive market with almost no India-specific training data.

Medical imaging, Indian populations

Dermatology, radiology, pathology images from Indian populations. Existing medical AI datasets are overwhelmingly Western.

§ 03 — What we're doing in 2026

Research first.
Revenue later.

We're not selling data yet. We're figuring out what to collect, how to collect it legally and ethically, and who will license it. This page is as transparent as it gets.

Q2 2026 · In progress

Research report published: which Indian data verticals are most valuable, most collectible, and most defensible. We look at commercial demand signals (who's already buying in this space), defensibility (how hard it is to replicate), and legal/ethical feasibility.

Q3 2026 · Planned

First vertical chosen. Collection pipeline started. This means deciding who collects, who annotates, who QAs, what the consent and privacy framework looks like, and what format we license in. Automotive/street CV is the working hypothesis, but research decides.

Q4 2026 · Target

First dataset asset ready, or a serious licensing conversation in motion. This may be a research artifact, a small pilot dataset, or a partnership announcement. Revenue is not expected in 2026 — Q4 is the "something real exists" milestone.

§ 04 — Our principles

How we collect.
No exceptions.

✓ No personal data without explicit, informed consent, ever.

✓ No synthetic data sold as real data. What we license is what we collected.

✓ Open-source contributions where they make sense: we take from the OSS community and we give back.

✓ Documented provenance for everything we license. You know where every data point came from.

✓ Transparent licensing terms. No hidden exclusivity clauses, no buried usage restrictions.

§ 05 — Who this is for

If you've been struggling
to find Indian-context data.

AI labs and foundation model companies

Training data diversity matters more as models scale. India-specific data is a gap in every major training corpus.

Autonomous vehicle and ADAS companies

Indian road conditions are structurally different. Western driving datasets produce models that fail in India. We can fix this.

Mapping and geospatial companies

Street-level imagery, signage recognition, lane detection — all need India-specific training data to perform reliably.

Sector-specific AI startups

AgriTech, LegalTech, MedTech — if you're building AI for Indian use cases, your training data should be Indian too.

§ 06 — Talk to us early

We're still choosing what
to collect first.

If you're a potential dataset buyer or research partner, we'd like to hear from you while we still have flexibility on what we collect first. Your use case shapes our research. That benefits both of us.

We're also interested in academic partnerships, annotation partnerships, and co-collection arrangements in specific geographies or domains.

data@demanualai.com · Book a call

Response within 48 hours
