Another day, another scraping project. This one was actually really short and took me two hours from start to finish. This time we are scraping Reddit. r/gonewildaudio is a subreddit for some real desperate men and women of the world. I wanted to know which demographic this sub serves the most. Each post title carries a gender tag of the form x4y, where x is the poster's gender (m or f) and y is the intended audience (m, f, or a for anyone), e.g. [F4M].
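So a title like "[F4M] ..." counts toward f4m. For illustration only (the script below just does plain substring checks), a one-liner regex can pull the tag out of a made-up example title:

    import re

    # Hypothetical example title; the actual script substring-matches instead.
    title = "[F4M] An example title"
    print(re.findall(r'\b[mf]4[mfa]\b', title.lower()))  # ['f4m']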
As a first step I tried my hand at the official Reddit API, using its Python wrapper, PRAW, but I soon ran into its limitations. The API returns a maximum of about 1000 posts per listing, which would be fine, except it has no timestamp feature to extract all the posts from a particular time period (it was there previously but was deprecated by the Reddit devs, IDK why).
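For context, the PRAW attempt boiled down to something like this (a rough sketch from memory, not my exact code; the credentials are placeholders you would get by registering an app on Reddit):

    import praw

    # Placeholder credentials -- register an app at reddit.com/prefs/apps
    reddit = praw.Reddit(client_id="YOUR_ID",
                         client_secret="YOUR_SECRET",
                         user_agent="gwa-scraper by u/yourname")

    # Even with limit=None, the listing stops at roughly 1000 posts
    count = 0
    for submission in reddit.subreddit("gonewildaudio").new(limit=None):
        count += 1
    print(count)  # tops out around 1000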
So I resorted to using pushshift.io, a service designed for data-science applications: it actively archives all of Reddit's data in its own database and provides an easy-to-use web API to access it.
Pushshift also returns a maximum of 1000 posts per request, but this drawback can be countered by walking backwards through custom time windows, as in the sketch below.
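Before the full script, here is what a single windowed request looks like (a minimal sketch; the timestamps are made-up example values):

    import requests

    # Any 5-day window expressed as Unix epochs (example values)
    after, before = 1582384669, 1582816669
    url = ('https://api.pushshift.io/reddit/submission/search/'
           '?subreddit=gonewildaudio&size=1000'
           '&after=' + str(after) + '&before=' + str(before))
    titles = [post['title'] for post in requests.get(url).json()['data']]
    print(len(titles))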
Here's the really, really dirty code that I wrote:
    import requests
    import time
    import sys

    dump = open("dump.txt", "a+", encoding="utf-8")
    current_timestamp = 1582816669
    # 60 seconds * 60 minutes * 24 hours * 5 days = 432000 seconds
    fiveDaysTimestamp = 432000
    totalPosts = 0
    mf = 0
    fm = 0
    ma = 0
    fa = 0
    ff = 0
    mm = 0
    i = 1

    def more(after, before):
        global totalPosts, mf, fm, ma, fa, ff, mm
        # pushshift caps size at 1000, so asking for more is pointless
        url = ('https://api.pushshift.io/reddit/submission/search/'
               '?subreddit=gonewildaudio&sort=desc&filter=title&size=1000'
               '&after=' + str(after) + '&before=' + str(before))
        print(url)
        r = requests.get(url)
        data = r.json()
        for post in data['data']:
            submission = post['title']
            dump.write('\n' + submission + '\n')
            totalPosts += 1
            lowered = submission.lower()
            if 'm4f' in lowered:
                mf += 1
            elif 'f4m' in lowered:
                fm += 1
            elif 'm4a' in lowered:
                ma += 1
            elif 'f4a' in lowered:
                fa += 1
            elif 'f4f' in lowered:
                ff += 1
            elif 'm4m' in lowered:
                mm += 1
        print('totalPosts = ' + str(totalPosts))
        print('M4F = ' + str(mf))
        print('F4M = ' + str(fm))
        print('M4A = ' + str(ma))
        print('F4A = ' + str(fa))
        print('F4F = ' + str(ff))
        print('M4M = ' + str(mm))
        time.sleep(1)  # be gentle with the API between windows
        increment()

    def increment():
        global i
        after = current_timestamp - (i * fiveDaysTimestamp)
        before = current_timestamp - ((i - 1) * fiveDaysTimestamp)
        if after <= 1337522586:  # the subreddit's creation date (epoch)
            print("done")
            sys.exit()
        i += 1
        more(after, before)

    increment()
It is really dirty, stitched-together code with a ton of unnecessary recursion and an ugly exit, but it works (somehow). Here's basically what it does:
- increment() builds two timestamps five days apart, creating a new five-day window every time it's called. It also checks whether the window has reached back to the subreddit's creation date; if so, it exits the script. Otherwise it goes ahead and calls more().
- more() takes the after and before values and builds a pushshift.io URL. The URL contains the title filter, the max size, the subreddit name, and before/after as parameters.
- The requests library then sends a GET request to pushshift, and JSON data comes back.
- The JSON data is parsed for the title of each post, and the in operator is used to detect the gender-tag keywords.
- When a keyword is detected, the counter for that specific tag is incremented (a tidier way to do this counting is sketched right after this list).
- It prints the tallies for each five-day window to the terminal. It also stores all the scraped titles in a separate text file.
- It then recursively calls increment() again until the exit condition (the subreddit creation date) is met.
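Here's the tidier counting mentioned above: a small sketch (not what the script does; it uses six separate counters and an elif chain) using collections.Counter:

    from collections import Counter

    TAGS = ('m4f', 'f4m', 'm4a', 'f4a', 'f4f', 'm4m')

    def count_tags(titles):
        # Tally the first matching gender tag in each title.
        counts = Counter()
        for title in titles:
            lowered = title.lower()
            for tag in TAGS:
                if tag in lowered:
                    counts[tag] += 1
                    break  # mirror the original elif chain: one tag per post
        return counts

    print(count_tags(["[F4M] example one", "[M4F] example two"]))
    # Counter({'f4m': 1, 'm4f': 1})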

Areas of improvement:
- Declare local and global variables more elegantly.
- Use less recursion, or use it less stupidly, to avoid blowing the stack.
- Compute the time epochs from variables or helper functions instead of hard-coding them, to make the script more versatile.
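Putting those three points together, a cleaner rewrite might look something like this (an untested sketch, not the code I actually ran):

    import time
    import requests

    SUBREDDIT_CREATED = 1337522586     # ~May 2012, epoch seconds
    START = 1582816669                 # when I ran the scrape
    WINDOW = 5 * 24 * 60 * 60          # 5 days in seconds

    def fetch_window(after, before):
        url = ('https://api.pushshift.io/reddit/submission/search/'
               '?subreddit=gonewildaudio&sort=desc&filter=title&size=1000'
               '&after=' + str(after) + '&before=' + str(before))
        return [post['title'] for post in requests.get(url).json()['data']]

    counts = {tag: 0 for tag in ('m4f', 'f4m', 'm4a', 'f4a', 'f4f', 'm4m')}
    before = START
    while before > SUBREDDIT_CREATED:  # plain loop, no recursion
        after = before - WINDOW
        for title in fetch_window(after, before):
            lowered = title.lower()
            for tag in counts:
                if tag in lowered:
                    counts[tag] += 1
                    break
        before = after
        time.sleep(1)                  # stay polite to the API

    print(counts)

A plain while loop removes any risk of hitting Python's recursion limit, and keeping the epochs as named constants at the top makes it trivial to point the script at a different subreddit or time range.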