Scraping and processing all posts in a subreddit (timestamp method)

Another day, another scraping project. This one was actually really short and took me 2 hours from start to finish. This time we are scraping reddit. r/gonewildaudio is a subreddit for some real desperate men and women of the world. I wanted to know which demographic of this sub gets served the most. Each post carries a gender tag of the form x4y, where x is the poster's gender (m or f) and y is the intended audience (m, f, or a for anyone).
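For illustration, a tag like that can be pulled out of a title with a quick regex – a minimal sketch (the pattern and function name are mine, not part of the scraper below):

import re

# matches tags like [M4F], (f4a) or m4m anywhere in a title
TAG_RE = re.compile(r'\b([mfa])4([mfa])\b', re.IGNORECASE)

def extract_tag(title):
    match = TAG_RE.search(title)
    return match.group(0).lower() if match else None

print(extract_tag("[F4M] some example title"))  # -> f4m
print(extract_tag("no tag in this one"))        # -> None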

As a first step I tried my hand at the official reddit API through its Python wrapper, PRAW, but I soon ran into its limitations. The API returns a maximum of about 1000 posts per listing, which is fine by itself, but it has no timestamp feature to extract all the posts from a particular time period (there used to be one, but it was deprecated by the reddit devs, IDK why).
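For reference, this is roughly how far PRAW gets you – a hedged sketch with placeholder credentials (you get real ones by registering an app in reddit's preferences):

import praw

reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",          # placeholder credentials
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="subreddit-tag-counter by u/yourname",
)

# Even with limit=None, a reddit listing stops at roughly 1000 items,
# and .new() takes no before/after timestamps to page further back.
titles = [post.title for post in reddit.subreddit("gonewildaudio").new(limit=None)]
print(len(titles))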

So I resorted to using pushshift.io – a service designed for data-science applications that actively archives all the reddit data in its database and provides an easy-to-use web API to access it.

Pushshift also caps each request at a maximum of 1000 posts, but this drawback can be countered by requesting custom time frames and retrieving the posts window by window.
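Before the full script, here's what a single windowed request looks like – a minimal sketch using the same endpoint and parameters the script below passes, with two arbitrary epoch timestamps as the window:

import requests

# one windowed request: every title posted between two epoch timestamps
params = {
    "subreddit": "gonewildaudio",
    "sort": "desc",
    "filter": "title",
    "size": 1000,
    "after": 1582384669,   # window start (arbitrary epoch seconds)
    "before": 1582816669,  # window end
}
r = requests.get("https://api.pushshift.io/reddit/submission/search/", params=params)
for post in r.json()["data"]:
    print(post["title"])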

Here's some really, really dirty code that I wrote –

import sys

import requests

# NOTE: more() and increment() call each other recursively. It works for a
# few thousand 5-day windows but will eventually hit Python's recursion limit.

dump = open("dump.txt", "a+", encoding="utf-8")
current_timestamp = 1582816669  # starting point: a recent epoch timestamp
# 60 seconds * 60 minutes * 24 hours * 5 days = 432000 seconds
fiveDaysTimestamp = 432000
totalPosts = 0
mf = 0
fm = 0
ma = 0
fa = 0
ff = 0
mm = 0
i = 1

def more(after, before):
    global totalPosts, mf, fm, ma, fa, ff, mm
    # size=5000 asks for more than pushshift will hand back; the API caps
    # each response at its own per-request maximum anyway
    url = ('https://api.pushshift.io/reddit/submission/search/'
           '?subreddit=gonewildaudio&sort=desc&filter=title&size=5000'
           '&after=' + str(after) + '&before=' + str(before))
    print(url)
    r = requests.get(url)
    data = r.json()
    for post in data['data']:
        submission = post['title']
        dump.write('\n' + submission + '\n')  # file is opened as UTF-8
        totalPosts += 1
        title = submission.lower()
        if 'm4f' in title:
            mf += 1
        elif 'f4m' in title:
            fm += 1
        elif 'm4a' in title:
            ma += 1
        elif 'f4a' in title:
            fa += 1
        elif 'f4f' in title:
            ff += 1
        elif 'm4m' in title:
            mm += 1
    print('totalPosts = ' + str(totalPosts))
    print('M4F = ' + str(mf))
    print('F4M = ' + str(fm))
    print('M4A = ' + str(ma))
    print('F4A = ' + str(fa))
    print('F4F = ' + str(ff))
    print('M4M = ' + str(mm))

    increment()

def increment():
    global i
    # slide a 5-day window further back in time on every call
    after = current_timestamp - (i * fiveDaysTimestamp)
    before = current_timestamp - ((i - 1) * fiveDaysTimestamp)
    if after <= 1337522586:  # the subreddit's creation date -- nothing is older
        print("done")
        sys.exit()
    i += 1
    more(after, before)

increment()

It is really, really dirty, stitched-together code with a tonne of unnecessary recursion and bad exit statements, but it works (somehow). Here's basically what it does:

  • increment() builds two timestamps 5 days apart, creating a new 5-day window every time it's called. It also checks whether the timestamps have reached the subreddit's creation date: if yes, it exits the script; otherwise it goes ahead and calls the more() function.
  • more() takes the before and after parameters and builds a URL for pushshift.io. The URL carries the title filter, the max size, the subreddit name, and before/after as parameters.
  • The requests library is then used to send a GET request to pushshift, which returns JSON data.
  • The JSON data is parsed for the title of each post, and a substring check is used to detect the gender-tag keywords in them.
  • On a match, the counter for that specific keyword is incremented.
  • It outputs the result of each five-day iteration in the terminal, and also stores all the scraped titles in a separate text file.
  • It then recursively calls increment() until the exit condition (the subreddit's creation date) is met.
These were the results – charted in Google Sheets.

Areas of improvement:

  • Declare local and global variables more elegantly.
  • Use less recursion, or use it less stupidly, to prevent a stack overflow (see the sketch below).
  • Use variables or separate functions to generate the time epochs instead of hard-coding them, to make the script more versatile.
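As a rough sketch of what that cleaner version might look like – the same walk-back written as a plain loop with a Counter instead of six globals (the function and parameter names here are mine, not from the script above):

from collections import Counter

import requests

API = "https://api.pushshift.io/reddit/submission/search/"
TAGS = ("m4f", "f4m", "m4a", "f4a", "f4f", "m4m")

def count_tags(subreddit, newest, oldest, window=432000):
    """Walk back from `newest` to `oldest` in fixed-size windows, no recursion."""
    counts = Counter()
    before = newest
    while before > oldest:
        after = max(before - window, oldest)
        params = {"subreddit": subreddit, "sort": "desc", "filter": "title",
                  "size": 1000, "after": after, "before": before}
        for post in requests.get(API, params=params).json()["data"]:
            title = post["title"].lower()
            for tag in TAGS:
                if tag in title:
                    counts[tag] += 1
                    break  # count each post once, like the original elif chain
        before = after  # slide the window back with a plain loop
    return counts

print(count_tags("gonewildaudio", newest=1582816669, oldest=1337522586))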