Automating the Absurdity Index

The Absurdity Index needs 3,440 data points across 8 metrics. Each one has to be found, read, categorized by severity, and linked back to a real person's story. The original estimate for doing this manually was 30-50 hours per collection cycle. Weekly. That's not a methodology, that's a full-time job with no pay.

So we automated it. Mostly.

Sneaky Snake

Here's the part that might surprise people: there's no database.

The dashboard is static. It builds once and deploys to Vercel. No server, no runtime data fetching, no connection strings. A database would've added infrastructure to solve a problem that doesn't exist here.

All the dashboard data lives inside a single TypeScript file called metricDetailData.ts. Scores, crisis ratios, level distributions, sample stories, collection progress, dates. One file. When the automation runs, a Python script opens that TypeScript file and rewrites specific values using regex. About 15 patterns per metric, navigating multi-line object structures, finding the right field inside the right metric block, updating a number, and moving on.
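To make the regex rewrite concrete, here is a minimal sketch of updating one field inside one metric block. The file excerpt, the field names (score, crisisRatio), and the update_field helper are all assumptions for illustration; the real metricDetailData.ts layout and the real patterns aren't shown in this post.

```python
import re

# Hypothetical excerpt. The real metricDetailData.ts structure is an assumption.
TS_SOURCE = """export const metricDetailData = {
  housingDespair: {
    score: 71.2,
    crisisRatio: 0.41,
  },
  wageStagnation: {
    score: 64.0,
    crisisRatio: 0.33,
  },
};
"""


def update_field(source: str, metric: str, field: str, value: float) -> str:
    """Replace the numeric value of `field` inside the `metric` block."""
    # [^}]*? keeps the match inside the metric's own braces, so a field in a
    # later block is never touched. This only holds for flat (non-nested) blocks.
    pattern = re.compile(
        r"(\b" + re.escape(metric) + r":\s*\{[^}]*?\b"
        + re.escape(field) + r":\s*)[-\d.]+"
    )
    # A function replacement avoids ambiguity between the group reference
    # and the digits of the new value.
    updated, count = pattern.subn(lambda m: m.group(1) + str(value), source, count=1)
    if count != 1:
        raise ValueError(f"could not find {metric}.{field}")
    return updated


new_source = update_field(TS_SOURCE, "wageStagnation", "score", 66.5)
```

Raising when nothing matches matters here: a silent no-op would leave stale numbers on the dashboard without failing the build.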

The data lives where the app reads it. It's type-checked at build time, tracked in git history, and imported directly by the frontend. If the regex produces bad output, npm run build fails, the commit never lands, and nothing deploys. The build step is the validation layer.
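The "build as validation" gate can be sketched in a few lines. This assumes the script shells out to npm run build and checks the exit code; the function names are hypothetical, and the actual pipeline may wire this up differently (for example, as separate CI steps).

```python
import subprocess
import sys
from typing import Sequence


def build_passes(command: Sequence[str] = ("npm", "run", "build")) -> bool:
    """Run the build command and report whether it exited cleanly."""
    result = subprocess.run(list(command), capture_output=True, text=True)
    if result.returncode != 0:
        # Surface the build output so a bad rewrite is easy to spot.
        print(result.stderr or result.stdout, file=sys.stderr)
    return result.returncode == 0


def next_step(command: Sequence[str] = ("npm", "run", "build")) -> str:
    """Decide what to do after the rewrite: commit only if the build is green."""
    return "commit" if build_passes(command) else "abort"
```

The command is a parameter so the gate can be exercised without a Node toolchain installed.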

TikTok Fail Compilation

TikTok was the platform I most wanted data from and the hardest one to actually get it from.

Attempt one: ProxiTok, a set of privacy-friendly mirrors that let you browse TikTok without an account. We built a collector that searched by hashtag and enriched results with TikTok's public oEmbed API. It worked until the ProxiTok instances started going down. Regularly.

Attempt two: Playwright with full headless browser automation. The nuclear option. Spin up a headless Chrome, let it execute TikTok's JavaScript, scrape the rendered pages directly. It was reliable but absurdly heavy for what we needed. TikTok's anti-bot measures are also not messing around.

Attempt three (the one that stuck): Search YouTube for TikTok compilation videos. Queries like "tiktok compilation insurance denied" and "viral tiktok hospital bill" return exactly what they sound like. People re-upload TikTok content to YouTube constantly, and we already had the YouTube API wired up.

We turned a data access problem into a search query problem. It captures viral TikTok content that crossed platforms, uses infrastructure we already had, and runs in CI without a headless browser. Sometimes the best solution is the one that already exists.
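For a sense of how small the "search query problem" is, here is a rough sketch that builds search requests for the YouTube Data API v3 search endpoint. The two queries come from the text above; the helper name, the maxResults value, and the placeholder API key are assumptions, and the real collector may use a client library instead of raw URLs.

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = "https://www.googleapis.com/youtube/v3/search"

# Queries quoted in the post. The full query list is not shown there.
QUERIES = [
    "tiktok compilation insurance denied",
    "viral tiktok hospital bill",
]


def build_search_url(query: str, api_key: str, max_results: int = 25) -> str:
    """Build a search.list request URL restricted to videos."""
    params = {
        "part": "snippet",
        "q": query,
        "type": "video",
        "maxResults": max_results,  # the API accepts 0 to 50
        "key": api_key,
    }
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"


# "YOUR_API_KEY" is a placeholder, not a real credential.
urls = [build_search_url(q, "YOUR_API_KEY") for q in QUERIES]
```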

Fxck Reddit

The weekly automation runs on GitHub Actions every Monday at 9 AM UTC. YouTube collectors fire, scores get recalculated, the TypeScript file gets rewritten, the site rebuilds, and it deploys. Fully hands-off.

Except for Reddit. Reddit returns 403 Forbidden to requests from GitHub Actions IP ranges. A platform built on people sharing their worst moments, blocking automated research about people sharing their worst moments. We found this out the fun way.

So the pipeline is hybrid. YouTube runs in CI automatically. Reddit collection runs locally, on my machine, with a --reddit-only flag we built specifically for this split. You automate what you can and design around what you can't.
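A sketch of how that split might look at the command line. Only the --reddit-only flag comes from the actual pipeline; the function names and the assumption that a flagless run collects YouTube only are illustrative.

```python
import argparse


def parse_args(argv=None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Collect Absurdity Index data")
    parser.add_argument(
        "--reddit-only",
        action="store_true",
        help="run only the Reddit collectors (intended for local runs)",
    )
    return parser.parse_args(argv)


def select_platforms(reddit_only: bool) -> list[str]:
    # Assumption: without the flag, only the CI-friendly YouTube collectors run.
    return ["reddit"] if reddit_only else ["youtube"]


args = parse_args(["--reddit-only"])
platforms = select_platforms(args.reddit_only)
```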

The Reddit collectors are straightforward, too. No API key. No official Reddit API client. They hit Reddit's public .json endpoints with a respectful user-agent and 2-3 second delays between requests. The original plan was to go through the formal API application process. The public endpoints just worked.
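Roughly, a collector along those lines could look like the sketch below. The user-agent string, the choice of the "new" listing, and the limit are assumptions; the 2-3 second delay and the .json endpoints are the parts taken from the text.

```python
import json
import random
import time
from urllib.request import Request, urlopen

# Hypothetical user-agent. Use one that identifies your project and a contact.
USER_AGENT = "absurdity-index-research/0.1 (contact: you@example.com)"


def build_request(subreddit: str, listing: str = "new", limit: int = 25) -> Request:
    """Build a request for a subreddit's public JSON listing."""
    url = f"https://www.reddit.com/r/{subreddit}/{listing}.json?limit={limit}"
    return Request(url, headers={"User-Agent": USER_AGENT})


def next_delay() -> float:
    """Pick a pause of 2 to 3 seconds between requests."""
    return random.uniform(2.0, 3.0)


def fetch_posts(subreddits: list[str]) -> list[dict]:
    """Fetch posts from each subreddit, pausing between requests."""
    posts = []
    for index, name in enumerate(subreddits):
        if index:
            time.sleep(next_delay())
        with urlopen(build_request(name), timeout=30) as response:
            payload = json.load(response)
        # Listings wrap each post as {"kind": ..., "data": {...}}.
        posts.extend(child["data"] for child in payload["data"]["children"])
    return posts
```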

Green Eggs and Spam

When you search YouTube for "can't afford healthcare," a significant chunk of results are financial guru content. "How I FIXED my healthcare costs in 30 days!" "The one trick insurance companies don't want you to know!"

That's not spam in the traditional sense. It's content that describes resolved problems. And if it gets categorized alongside people sharing genuine crisis stories, it corrupts the severity data. A video about conquering medical debt isn't the same as a video about drowning in it.

So we built a content filter with about 50 patterns that catch clickbait, self-promotion, and solution-selling content. The filter is conservative; it only excludes on positive matches. Missing some noise is acceptable. Accidentally filtering out a real story is not.
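The shape of that filter is simple enough to sketch. The four patterns below are made-up stand-ins, not the real list of about 50; the point is the conservative rule that a title is dropped only when a pattern actually matches.

```python
import re

# Illustrative patterns only. The real filter's patterns are not shown here.
EXCLUDE_PATTERNS = [
    re.compile(pattern, re.IGNORECASE)
    for pattern in (
        r"\bhow i (fixed|paid off|eliminated)\b",
        r"\bone (simple )?trick\b",
        r"\bdon'?t want you to know\b",
        r"\blink in (the )?(bio|description)\b",
    )
]


def is_solution_selling(title: str) -> bool:
    """Return True only on a positive pattern match; everything else is kept."""
    return any(pattern.search(title) for pattern in EXCLUDE_PATTERNS)


titles = [
    "How I FIXED my healthcare costs in 30 days!",
    "The one trick insurance companies don't want you to know!",
    "I can't afford my insulin this month",
]
kept = [t for t in titles if not is_solution_selling(t)]
```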

There's also the overlap problem. A post on r/povertyfinance titled "can't make rent on minimum wage" gets collected by both Housing Despair and Wage Stagnation. Same story, two metrics. So the pipeline scores each post against every metric's keyword list (title matches weighted 3x over body) and assigns it to the single best fit. One post, one home.
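Here is a small sketch of that "one post, one home" assignment. The keyword lists are invented for the example and the whole-word matching is an assumption; only the 3x title weight and the single-best-fit rule come from the description above.

```python
import re
from typing import Optional

# Hypothetical keyword lists. The real per-metric lists are not shown here.
METRIC_KEYWORDS = {
    "Housing Despair": ["rent", "eviction", "landlord"],
    "Wage Stagnation": ["minimum wage", "paycheck", "raise"],
}
TITLE_WEIGHT = 3
BODY_WEIGHT = 1


def count_hits(text: str, keyword: str) -> int:
    """Count whole-word, case-insensitive occurrences of a keyword."""
    return len(re.findall(r"\b" + re.escape(keyword) + r"\b", text, re.IGNORECASE))


def score_post(title: str, body: str, keywords: list[str]) -> int:
    return sum(
        TITLE_WEIGHT * count_hits(title, kw) + BODY_WEIGHT * count_hits(body, kw)
        for kw in keywords
    )


def assign_metric(title: str, body: str) -> Optional[str]:
    """Return the single best-fitting metric, or None if nothing matches."""
    scores = {m: score_post(title, body, kws) for m, kws in METRIC_KEYWORDS.items()}
    best = max(scores, key=scores.get)  # ties go to the first metric listed
    return best if scores[best] > 0 else None


home = assign_metric(
    "can't make rent on minimum wage",
    "My landlord raised rent again and eviction is next.",
)
```

In this toy case the title ties at one match per metric, so the body breaks the tie. How the real pipeline resolves ties isn't covered here.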

This is where automation became a methodology decision. What you choose to exclude, and where you choose to place what's left, shapes the data just as much as what you include.

Main Character Energy

The most interesting thing to come out of the automation wasn't efficiency. It was a finding we wouldn't have seen without multi-platform collection at scale.

Reddit is consistently less sensational than YouTube. Across 7 of 8 metrics, Reddit showed lower crisis ratios. YouTube creators optimize for engagement, which inflates severity in titles and thumbnails. Reddit's anonymous users describe their experiences more plainly. The one exception was Layoff Watch, where subreddits like r/jobs and r/careerguidance function as genuine crisis forums.

This validated the whole multi-platform approach. If we'd only collected from YouTube, the scores would skew high. Only Reddit, they'd skew low. The blend is closer to something real.

Panorama

The pipeline runs every Monday. Sixteen collectors, a deduplicator, a score calculator, a regex-based TypeScript rewriter, and a CI workflow. The scripts are open-source, the methodology is documented, and anyone can verify how the scores land.

We're not trying to build the perfect system. We're trying to keep up. The absurdity compounds weekly, and so does the data.

View the dashboard: absurdity-index.vercel.app
Read the methodology: Full documentation
See the code: GitHub repository