Scraping 24,000 TEDx Talks Across 147 Countries: A Data Engineering Story
Before the data science could begin, someone had to build the plumbing. The story of scraping, normalising, and deduplicating 24,000 talks in 50 languages.
28 February 2026
This project began at Quid, the San Francisco-based intelligence platform where John Jansen was building the scraping and analytics infrastructure. Sean Gourley, Quid's co-founder and CTO, wanted to map the global landscape of TEDx ideas — not just the main TED conference talks that everyone knows, but the entire TEDx ecosystem: 24,000 independently organised talks across 147 countries, delivered in over 50 languages. The goal was ambitious: apply NLP and network graph theory to the full corpus and discover how ideas spread, mutate, and cluster across cultures and geographies.
Sean was working with Eric Berlow, an ecologist turned network scientist, and together they had a compelling analytical framework. What they needed was the data infrastructure to make it possible — and that was squarely in John's wheelhouse.
Why Data Engineering Comes First
There's a pattern that shows up repeatedly in data science projects: brilliant analytical minds with a clear vision, stalled because nobody has done the unglamorous work of actually acquiring, cleaning, and structuring the data. It's not that data scientists can't do this work — many can — but it's a different skill set with different failure modes, and it deserves dedicated attention.
The TEDx project was a textbook case. The analytical plan was sophisticated: use NLP to extract themes from transcripts, build a network graph connecting talks by thematic similarity, apply community detection algorithms to find idea clusters, then visualise the whole thing as an interactive map of the global ideas landscape. Beautiful. But before any of that could happen, someone needed to get 24,000 transcripts, with accurate metadata, in a clean and consistent format. That someone was John.
The Scraping Challenge
If you've never scraped data at scale, you might think the hard part is writing the scraper. It's not. Writing a scraper that works for one page is trivial. Writing a system that reliably extracts data from thousands of pages, across multiple platforms, over days or weeks, handling every edge case and failure mode — that's the actual job.
The TEDx data lived in two primary places: the TED website (ted.com) and YouTube. Each presented its own challenges.
The TED website had talk pages with metadata — speaker name, event name, date, location, tags — and in many cases, transcripts. But the structure wasn't perfectly consistent. Older talks had different page layouts. Some talks had transcripts in multiple languages. Some had no transcript at all. The metadata fields weren't always populated, and when they were, the formats varied.
YouTube was a different beast entirely. TEDx talks on YouTube were uploaded by hundreds of independent organisers, each with their own channel naming conventions, description formats, and metadata habits. A talk from TEDxWellington might be uploaded as "TEDxWellington - Jane Smith - The Future of Education" while one from TEDxTokyo might be "未来の教育 | Jane Smith | TEDxTokyo". Extracting consistent, structured data from this chaos required systems that could handle variation at every level.
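One way to tame that variation is a best-effort title parser that tries the common delimiters and identifies the event segment by its "TEDx" prefix. The sketch below is illustrative of the approach, not the actual system; the heuristics and function name are assumptions.

```python
import re

def parse_youtube_title(raw: str) -> dict:
    """Best-effort parse of a TEDx YouTube title into event/speaker/title.

    Organiser conventions vary wildly, so this splits on the delimiters
    they most often use and identifies the event by its 'TEDx' prefix.
    Anything it cannot place is left as None for manual review.
    """
    # Split on ' - ', ' | ', or ' – ' surrounded by whitespace, so that
    # hyphens inside words or titles survive intact.
    parts = [p.strip() for p in re.split(r"\s+[|\-–]\s+", raw) if p.strip()]
    event = next((p for p in parts if p.startswith("TEDx")), None)
    rest = [p for p in parts if p != event]
    # Heuristic: a two-or-three-word Title Case segment is likely the speaker.
    speaker = next(
        (p for p in rest if 1 < len(p.split()) <= 3 and p.istitle()), None
    )
    title = next((p for p in rest if p != speaker), None)
    return {"event": event, "speaker": speaker, "title": title}
```

A parser like this handles the common patterns; the long tail of organiser creativity still ends up in a review queue.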
Rate Limiting and Resilience
Scraping at this scale means dealing with rate limiting, IP blocks, and service interruptions. John built a system with adaptive rate limiting — slowing requests when encountering resistance, rotating through request patterns, and maintaining state so that interrupted scrapes could resume from where they left off rather than starting over. The system needed to run for days, handling intermittent failures gracefully without corrupting the dataset.
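The shape of such a system, in miniature, looks something like the sketch below: back off sharply on resistance, ease back toward the base rate on success, and checkpoint progress after every page so a restart resumes rather than re-scrapes. This is a minimal illustration of the pattern, not Quid's actual implementation; all names are assumptions.

```python
import json
import time
from pathlib import Path

class AdaptiveScraper:
    """Minimal sketch of adaptive rate limiting with resumable state."""

    def __init__(self, state_file: str, base_delay: float = 1.0):
        self.state_file = Path(state_file)
        self.base = base_delay
        self.delay = base_delay
        self.done = set()
        # Resume: reload the set of URLs already captured, if any.
        if self.state_file.exists():
            self.done = set(json.loads(self.state_file.read_text()))

    def run(self, urls, fetch):
        for url in urls:
            if url in self.done:
                continue  # already captured on a previous run
            try:
                fetch(url)
            except Exception:
                # Back off sharply on resistance (429s, timeouts, blocks).
                # The URL stays out of `done`, so the next run retries it.
                self.delay = min(self.delay * 2, 60.0)
                continue
            # Ease back toward the base rate after each success.
            self.delay = max(self.delay * 0.9, self.base)
            self.done.add(url)
            # Checkpoint after every page so an interruption loses nothing.
            self.state_file.write_text(json.dumps(sorted(self.done)))
            time.sleep(self.delay)
```

The per-page checkpoint is the important design choice: it trades a little I/O for the guarantee that a multi-day scrape never has to start over.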
The Normalisation Problem
Raw scraped data is messy. The normalisation work was arguably harder than the scraping itself.
Speaker names had to be normalised across sources. "Dr. Jane Smith" on TED.com might be "Jane Smith, PhD" on YouTube and "J. Smith" in a playlist title. Event names needed similar treatment — standardising "TEDxWellington 2013," "TEDx Wellington," and "Wellington TEDx" into a single canonical form while preserving the year and location as structured fields.
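The mechanics of that canonicalisation can be sketched as a pair of small functions: strip honourifics and suffixes from speaker names, and pull the year out of event names while normalising spacing. The regexes below are illustrative of the technique, not the production rules.

```python
import re

# Honourifics and suffixes to strip (illustrative, not exhaustive).
TITLES = re.compile(r"\b(dr|prof|mr|mrs|ms)\.?\s+|,\s*(phd|md|mba)\.?$", re.I)
YEAR = re.compile(r"\b(19|20)\d{2}\b")

def canonical_speaker(name: str) -> str:
    """Collapse 'Dr. Jane Smith' and 'Jane Smith, PhD' to one key."""
    return TITLES.sub("", name).strip().lower()

def canonical_event(name: str):
    """Split an event name into (canonical form, year), so that
    'TEDxWellington 2013' and 'TEDx Wellington' share a canonical form
    while the year survives as a structured field."""
    year = YEAR.search(name)
    stripped = YEAR.sub("", name)
    stripped = re.sub(r"TEDx\s+", "TEDx", stripped)  # 'TEDx Wellington' -> 'TEDxWellington'
    return stripped.strip(), int(year.group()) if year else None
```

Abbreviated forms like "J. Smith" cannot be resolved by rules alone; those fall through to fuzzy matching against the canonical keys, with low-confidence matches flagged for review.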
Dates were a particular headache. YouTube upload dates don't correspond to talk dates — a talk given in 2012 might be uploaded in 2014. The TED website usually had the actual event date, but not always. Where both sources were available, cross-referencing was possible. Where only YouTube existed, inference from video descriptions, channel metadata, or the event name itself was necessary.
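That precedence order can be made explicit in code: trust the TED.com event date when it exists, then a date mentioned in the video description, then a year embedded in the event name, and only then fall back to the upload year, which is a known over-estimate. A hedged sketch, with all names and patterns assumed for illustration:

```python
import re
from datetime import date, datetime

MONTHS = ("January February March April May June July August "
          "September October November December").split()
DATE_IN_TEXT = re.compile(r"\b\d{1,2} (" + "|".join(MONTHS) + r") \d{4}\b")

def resolve_talk_date(ted_date, description, event_name, upload_year):
    """Resolve a talk date by source precedence, returning the date and
    which source supplied it so downstream analysis can weigh confidence."""
    if ted_date:
        return ted_date, "ted.com"
    m = DATE_IN_TEXT.search(description or "")
    if m:  # e.g. 'Recorded 14 June 2012 at ...'
        return datetime.strptime(m.group(), "%d %B %Y").date(), "description"
    y = re.search(r"\b(19|20)\d{2}\b", event_name or "")
    if y:  # e.g. 'TEDxWellington 2013' -> year only, pinned to 1 January
        return date(int(y.group()), 1, 1), "event-name"
    return date(upload_year, 1, 1), "upload-year"
```

Returning the source alongside the date matters: a year inferred from an event name is much weaker evidence than a TED.com record, and the analysis layer should know the difference.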
Geographic normalisation required mapping event names to locations. "TEDxAuckland" is easy. "TEDxYouth@BeaconHill" is harder. Some events were named after neighbourhoods, universities, or concepts rather than cities. John built a lookup table for the straightforward cases and flagged ambiguous ones for manual review.
Deduplication was the final normalisation step. The same talk appearing on TED.com and YouTube needed to be merged into a single record. Talks that had been re-uploaded to different YouTube channels needed to be identified and collapsed. This used a combination of transcript similarity matching, metadata comparison, and heuristic rules developed through iterating on the dataset.
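A common way to implement transcript similarity matching of this kind is Jaccard similarity over word shingles, falling back to metadata when one side has no transcript. The sketch below shows that general technique; the threshold and the exact rules used on the real dataset are assumptions.

```python
def shingles(text: str, k: int = 5) -> set:
    """The set of k-word shingles of a transcript, lowercased."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: |intersection| / |union|."""
    return len(a & b) / len(a | b) if a | b else 0.0

def is_duplicate(rec_a: dict, rec_b: dict, threshold: float = 0.8) -> bool:
    """Merge decision sketch: near-identical transcripts are the strongest
    signal; when a transcript is missing, fall back to comparing canonical
    speaker and event metadata. Threshold is illustrative."""
    ta, tb = rec_a.get("transcript"), rec_b.get("transcript")
    if ta and tb:
        return jaccard(shingles(ta), shingles(tb)) >= threshold
    return (rec_a.get("speaker"), rec_a.get("event")) == (
        rec_b.get("speaker"), rec_b.get("event"))
```

Shingling is forgiving of small transcription differences between sources, which is exactly what re-uploads and independently transcribed copies of the same talk produce.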
The Handoff
Once the data pipeline was robust and delivering clean, structured data, John handed off to the data science team who applied NLP to extract themes from every transcript, built a network graph connecting thematically similar talks, and used community detection algorithms to identify clusters of related ideas. The infrastructure continued to power the ongoing data feeds as the analysis evolved — this wasn't a one-shot extraction but a living pipeline.
What Happened Next
The work culminated in a TED Talk by Eric Berlow and Sean Gourley — "Mapping Ideas Worth Spreading" — which is a satisfying piece of recursion: using TEDx data to give a TED talk about TEDx. The visualisations were genuinely beautiful, and the insights were real: you could see how different cultures emphasised different themes, how ideas migrated between regions, and where the intellectual energy was concentrated.
But here's the thing: in the talk, in the press coverage, in the conversations about the project, nobody mentioned the data engineering. Nobody talked about rate limiting strategies, transcript extraction priorities, or geographic normalisation. And that's exactly how it should be.
The Invisible Infrastructure Lesson
Good data engineering is invisible. When it works, everyone talks about the insights, the visualisations, the discoveries. Nobody talks about the plumbing. This is both the curse and the point of the discipline.
Every data science project that fails because of bad data had a data engineering problem, not a data science problem. The models were probably fine. The algorithms were probably correct. But the data was messy, incomplete, inconsistent, or wrong, and no amount of statistical sophistication can fix that.
This is what Dreamware brings to data projects: the understanding that the plumbing matters. That the person who builds the data pipeline is as important as the person who builds the model. That the most valuable thing you can give a data scientist is clean, reliable, well-structured data — and that getting there is a genuine engineering challenge, not a chore to rush through.
What This Means for NZ Businesses
If you're a New Zealand business sitting on data you know is valuable but can't seem to extract value from, the problem is almost certainly in the plumbing. Your data sources are inconsistent. Your pipelines are fragile. Your data scientists are spending 80% of their time cleaning data and 20% doing actual science.
That ratio should be inverted, and it can be — with proper data engineering. The TEDx project proved that even at extreme scale (24,000 documents, 50 languages, 147 countries), clean data engineering makes sophisticated analysis possible. At the scale most NZ businesses operate, the same principles apply but the problems are far more tractable.
The data engineering discipline that scraped 24,000 TEDx talks is the same discipline that can untangle your CRM data, build reliable ETL pipelines for your warehouse, or structure your documents for AI processing. The tools change. The principles don't.
Want to discuss this?
We write about what we're actually working on. If this is relevant to something you're building, we'd love to hear about it.