Reddit Archive Initiative

During my research, I came across a lost and forgotten scrape of the entire banned pizzagate subreddit and by some miracle managed to get my hands on it. Along the way, some links led me to a Reddit conspiracy archive and its corresponding Reddit archiver tool.

Naturally, I put two and two together. Since the Pizzagate research moved to Voat after the subreddit was banned, I decided to add an archive for that too. Voat shut down in 2020. Luckily, the site has an online backup that could be scraped (over a long time), so the data could be liberated. I ran a statistical analysis on all the Voat subs and determined that only pizzagate and conspiracy were worth the trouble.

Hack Liberty Archives

Future Archive Initiative

I found a fork of libertysoft3/reddit-html-archiver called red-arch.

The goal of this project is to provide a framework for archiving websites and social media - with a particular focus on subreddits - and creating compilations of information in ways that are very easy for non-tech-savvy people to consume, copy, and distribute.

reddit-html-archiver was chosen as the base for this project for a number of reasons:

  • It generates a static website. This matters because a static website is the best fit for compiling data the way this project needs.
  • It’s styled nicely.
  • It’s written in Python, which will make integration with other web scrapers or data dumps very simple.
  • It takes minimal changes to accept data from popular Reddit data dumps such as Pushshift.

At the moment this project is limited to creating static sites from the “Subreddit comments/submissions 2005-06 to 2023-12” torrent on Academic Torrents. The user responsible for those uploads provides a repo with some tools for parsing the files contained in the torrent. This repo (red-arch) provides a modified version of their ‘single_file.py’ as ‘watchful.py’ (named after its creator), which can be used to convert the subreddit dumps into valid Python dictionaries that are then used to create a website with reddit-html-archiver.
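
To give a rough idea of what that conversion step involves, here is a minimal sketch. It is not the actual ‘watchful.py’ or ‘single_file.py’. It assumes each dump is zstd-compressed, newline-delimited JSON (one comment or submission per line), and the function names `iter_records` and `open_zst_lines` are made up for illustration. Decompression relies on the third-party `zstandard` package, and the large `max_window_size` is an assumption about how the dumps were compressed.

```python
import io
import json
from typing import Iterable, Iterator


def iter_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per JSON line, skipping blank or unparseable lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            # A damaged line should not abort a multi-gigabyte run.
            continue
        if isinstance(record, dict):
            yield record


def open_zst_lines(path: str) -> Iterator[str]:
    """Stream decoded text lines from a .zst file (hypothetical helper)."""
    # Imported lazily so iter_records() works without the package installed.
    import zstandard

    with open(path, "rb") as fh:
        # Assumption: the dumps need a large decompression window.
        dctx = zstandard.ZstdDecompressor(max_window_size=2**31)
        with dctx.stream_reader(fh) as reader:
            yield from io.TextIOWrapper(reader, encoding="utf-8", errors="replace")


if __name__ == "__main__":
    # Tiny in-memory sample standing in for a real dump file.
    sample = ['{"author": "a", "body": "hi"}\n', "\n", "not json\n"]
    for record in iter_records(sample):
        print(record["author"], record["body"])
```

With a real dump, something like `iter_records(open_zst_lines("Monero_comments.zst"))` would stream dictionaries one at a time, so even the multi-gigabyte files never have to fit in memory.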

As an archivist, data hoarder, and cypherpunk, I find this VERY interesting!

Size: 3.28TB (3,275,329,715,321 bytes)
Added: 2025-02-16 00:54:51
Num files: 79925
Mirrors: 6 complete, 21 downloading = 27 mirror(s) total

Subreddits to Archive

This is the hard part, since Reddit isn’t that great. I am interested in any suggestions for subreddits that YOU are aware of that might be worth archiving. Take a look at the file list (big link, your browser might not like it).

Here is the list I have put together so far:

File Size
conspiracytheories_comments.zst 108.46MB
conspiracytheories_submissions.zst 21.33MB
conspiracy_comments.zst 4.94GB
conspiracy_submissions.zst 652.93MB
911truth_comments.zst 18.83MB
911truth_submissions.zst 5.90MB
Anarcho_Capitalism_comments.zst 712.59MB
Anarcho_Capitalism_submissions.zst 98.87MB
AskNetsec_comments.zst 47.98MB
AskNetsec_submissions.zst 11.86MB
netsec_comments.zst 77.73MB
netsec_submissions.zst 18.59MB
PrivacyGuides_comments.zst 17.71MB
PrivacyGuides_submissions.zst 3.48MB
privacy_comments.zst 274.00MB
privacy_submissions.zst 59.73MB
Libertarian_comments.zst 1.90GB
Libertarian_submissions.zst 182.45MB
DarkNetMarkets_comments.zst 241.68MB
DarkNetMarkets_submissions.zst 48.93MB
Monero_comments.zst 128.13MB
Monero_submissions.zst 27.40MB

and maybe more...