During my research, I came across a lost and forgotten scrape of the entire banned pizzagate subreddit and, by some miracle, managed to get my hands on it. Along the way, some links led me to a reddit conspiracy archive and its corresponding reddit archiver tool.
Naturally, I put two and two together. Since the Pizzagate research moved to Voat after the subreddit was banned, I decided to add an archive for that too. Voat shut down in 2020. Luckily, the site has an online backup that could be scraped (over a long time), so the data could be liberated. I ran a statistical analysis on all the Voat subs and determined that only pizzagate and conspiracy were worth the trouble.
Hack Liberty Archives
Future Archive Initiative
I found a fork of libertysoft3/reddit-html-archiver called red-arch.
The goal of this project is to provide a framework for archiving websites and social media - with a particular focus on subreddits - and creating compilations of information in ways that are very easy for non-tech-savvy people to consume, copy, and distribute.
reddit-html-archiver was chosen as the base for this project for a number of reasons:
- It generates a static website. This is very important because a static website is the best option for compiling data according to the needs of this project.
- It's styled nicely.
- It's written in Python, which will make integration with other web scrapers or data dumps very simple.
- It takes minimal changes to accept data from popular reddit data dumps such as pushshift.
At the moment, this project is limited to creating static sites from the "Subreddit comments/submissions 2005-06 to 2023-12" torrent on Academic Torrents. The user responsible for those uploads provides a repo here with some tools for parsing the files contained in the torrent. This repo (red-arch) provides a modified version of their ‘single_file.py’ as ‘watchful.py’ (named after its creator), which can be used to convert the subreddit dumps into valid Python dictionaries that are then used to create a website with reddit-html-archiver.
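To make the "dump to Python dictionaries" step concrete, here is a minimal sketch. It is not the actual ‘watchful.py’; it assumes the dumps are zstd-compressed, newline-delimited JSON with one object per line, and that the third-party `zstandard` package is installed for the decompression helper. The function names `iter_records` and `open_zst_dump` and the sample records are made up for illustration.

```python
import io
import json


def iter_records(text_stream):
    """Yield one dict per non-empty line of newline-delimited JSON."""
    for line in text_stream:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            # Skip damaged lines instead of aborting a multi-gigabyte run.
            continue
        if isinstance(record, dict):
            yield record


def open_zst_dump(path):
    """Open a .zst dump as a text stream (hypothetical helper).

    Assumption: the file is zstd-compressed with a large window, so the
    decoder's window limit is raised to 2**31.
    """
    import zstandard  # third-party; imported lazily so iter_records works without it

    raw = open(path, "rb")
    decompressor = zstandard.ZstdDecompressor(max_window_size=2**31)
    reader = decompressor.stream_reader(raw)
    return io.TextIOWrapper(reader, encoding="utf-8", errors="replace")


# Small in-memory demonstration with made-up records.
sample = io.StringIO(
    '{"id": "abc", "author": "alice", "body": "first"}\n'
    "\n"
    "not json\n"
    '{"id": "def", "author": "bob", "body": "second"}\n'
)
records = list(iter_records(sample))
print([r["id"] for r in records])  # prints ['abc', 'def']
```

On a real dump, something like `for record in iter_records(open_zst_dump("Monero_submissions.zst")):` would stream the file one record at a time, which matters because the larger files will not fit comfortably in memory once decompressed.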
As an archivist, data hoarder, and cypherpunk, I find this VERY interesting!
Property | Value |
---|---|
Size | 3.28TB (3,275,329,715,321 bytes) |
Added | 2025-02-16 00:54:51 |
Num files | 79925 files |
Mirrors | 6 complete, 21 downloading = 27 mirror(s) total |
Subreddits to Archive
This is the hard part; reddit isn’t that great. I am interested in any suggestions for subreddits that YOU are aware of that might be worth archiving. Take a look at the file list. (big link, browser might not like it)
Here is the list I have put together so far:
File | Size |
---|---|
conspiracytheories_comments.zst | 108.46MB |
conspiracytheories_submissions.zst | 21.33MB |
conspiracy_comments.zst | 4.94GB |
conspiracy_submissions.zst | 652.93MB |
911truth_comments.zst | 18.83MB |
911truth_submissions.zst | 5.90MB |
Anarcho_Capitalism_comments.zst | 712.59MB |
Anarcho_Capitalism_submissions.zst | 98.87MB |
AskNetsec_comments.zst | 47.98MB |
AskNetsec_submissions.zst | 11.86MB |
netsec_comments.zst | 77.73MB |
netsec_submissions.zst | 18.59MB |
PrivacyGuides_comments.zst | 17.71MB |
PrivacyGuides_submissions.zst | 3.48MB |
privacy_comments.zst | 274.00MB |
privacy_submissions.zst | 59.73MB |
Libertarian_comments.zst | 1.90GB |
Libertarian_submissions.zst | 182.45MB |
DarkNetMarkets_comments.zst | 241.68MB |
DarkNetMarkets_submissions.zst | 48.93MB |
Monero_comments.zst | 128.13MB |
Monero_submissions.zst | 27.40MB |
and maybe more...
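For planning disk space, the sizes in a shortlist like this can be added up with a few lines of Python. This is a rough sketch: it assumes the listed sizes use decimal units (the torrent's own total of 3.28TB for 3,275,329,715,321 bytes suggests so), the names `parse_size` and `total_bytes` are made up, and the figure is the compressed download size only, not the size after decompression.

```python
import re
from decimal import Decimal

# Assumption: decimal (power-of-ten) units, as the torrent total suggests.
UNITS = {"KB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12}


def parse_size(text):
    """Convert a string such as '108.46MB' into a number of bytes."""
    match = re.fullmatch(r"\s*([0-9]+(?:\.[0-9]+)?)\s*(KB|MB|GB|TB)\s*", text)
    if match is None:
        raise ValueError(f"unrecognised size: {text!r}")
    number, unit = match.groups()
    # Decimal avoids float rounding surprises on values like 4.94.
    return int(Decimal(number) * UNITS[unit])


def total_bytes(sizes):
    """Sum an iterable of size strings and return the total in bytes."""
    return sum(parse_size(size) for size in sizes)


# Two rows copied from the table above.
shortlist = {
    "conspiracy_comments.zst": "4.94GB",
    "conspiracy_submissions.zst": "652.93MB",
}
print(total_bytes(shortlist.values()))  # prints 5592930000
```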