Scraping and processing all posts in a subreddit (timestamp method)
Another day, another scraping project. This one was actually really short and took me two hours from start to finish. This time we're scraping Reddit: r/gonewildaudio is a subreddit for some real desperate men and women of the world, and I wanted to know which demographic this sub serves the most. Each post carries a gender tag of the format x4y, where x is the poster's gender (m or f) and y is the intended audience (m, f, or a for anyone); for example, M4F is a man posting for women.
As a first step I tried my hand at the official Reddit API through its Python wrapper, PRAW, but soon ran into its limitations. The API returns a maximum of 1000 posts per request, which is fine btw, but it has no timestamp feature for extracting all the posts from a particular time period (there used to be one, but it was deprecated by the Reddit devs, IDK why).
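For context, here's roughly what the PRAW attempt looked like (a minimal sketch; the credentials are placeholders you'd get by registering an app on Reddit). No matter what limit you pass, the listing quietly stops around the 1000-post mark:

import praw

# Placeholder credentials - register a script app at reddit.com/prefs/apps
reddit = praw.Reddit(client_id="CLIENT_ID",
                     client_secret="CLIENT_SECRET",
                     user_agent="gwa-scraper by u/yourname")

count = 0
# .new() pages through the newest submissions, but Reddit listings
# are capped server-side, so this never reaches the full history
for submission in reddit.subreddit("gonewildaudio").new(limit=None):
    count += 1
print(count)  # tops out around 1000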
So I resorted to using pushshift.io - a service designed for data science applications that actively archives all of Reddit's data in its own database and provides an easy-to-use web API to access it.
Pushshift also returns a maximum of 1000 posts per request, but this drawback can be countered by querying custom time windows and stitching the results together.
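To illustrate, a single windowed request looks something like this (the epoch values are just example numbers five days apart; the endpoint and parameters are the same ones used in the full script below):

import requests

# Fetch titles from one 5-day window (the epochs here are only examples)
params = {
    "subreddit": "gonewildaudio",
    "filter": "title",
    "size": 1000,
    "after": 1582384669,   # window start (epoch seconds)
    "before": 1582816669,  # window end (epoch seconds)
}
r = requests.get("https://api.pushshift.io/reddit/submission/search/", params=params)
titles = [post["title"] for post in r.json()["data"]]
print(len(titles))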
Here's some really, really dirty code that I wrote -
import requests
import sys

# Scraped titles are appended to a text file as plain UTF-8 text
dump = open("dump.txt", "a+", encoding="utf-8")

current_timestamp = 1582816669  # starting point (epoch seconds)
# 60 seconds * 60 minutes * 24 hours * 5 days = 432000 seconds
fiveDaysTimestamp = 432000

totalPosts = 0
mf = 0
fm = 0
ma = 0
fa = 0
ff = 0
mm = 0
i = 1

def more(after, before):
    global totalPosts
    global mf
    global fm
    global ma
    global fa
    global ff
    global mm
    global i
    # pushshift caps the results per request regardless of the size parameter
    url = 'https://api.pushshift.io/reddit/submission/search/?subreddit=gonewildaudio&sort=desc&filter=title&size=5000&after=' + str(after) + '&before=' + str(before)
    print(url)
    r = requests.get(url)
    data = r.json()
    for title in data['data']:
        submission = title['title']
        dump.write('\n' + submission + '\n')
        totalPosts += 1
        if 'm4f' in submission.lower():
            mf += 1
        elif 'f4m' in submission.lower():
            fm += 1
        elif 'm4a' in submission.lower():
            ma += 1
        elif 'f4a' in submission.lower():
            fa += 1
        elif 'f4f' in submission.lower():
            ff += 1
        elif 'm4m' in submission.lower():
            mm += 1
    print('totalPosts = ' + str(totalPosts))
    print('M4F = ' + str(mf))
    print('F4M = ' + str(fm))
    print('M4A = ' + str(ma))
    print('F4A = ' + str(fa))
    print('F4F = ' + str(ff))
    print('M4M = ' + str(mm))
    increment()

def increment():
    global i
    global current_timestamp
    global fiveDaysTimestamp
    # Two timestamps bounding the next 5-day window, walking backwards in time
    after = current_timestamp - (i * fiveDaysTimestamp)
    before = current_timestamp - ((i - 1) * fiveDaysTimestamp)
    # 1337522586 is the subreddit's creation date - stop once we reach it
    if after <= 1337522586:
        print("done")
        sys.exit()
    i += 1
    more(after, before)

increment()
It's really, really dirty, stitched-together code with a tonne of unnecessary recursion and bad exit statements, but it works (somehow). Here's basically what it does:
increment() builds two timestamps 5 days apart, creating a new 5-day window every time it's called. It also checks whether the window has reached the subreddit's creation date: if yes, it exits the script; otherwise it calls more().
more() takes the before and after parameters and builds a URL for pushshift.io. The URL contains the title filter, the max limit, the subreddit name, and before/after as parameters.
The requests library then sends a GET request to pushshift, which returns JSON data.
The JSON data is parsed for the title of each post, and a substring check is used to detect the gender tags.
When a tag is detected, the counter for that tag is incremented.
It prints the running totals for each five-day iteration in the terminal and also stores all the scraped titles in a separate text file.
It then recursively calls increment() until the exit condition (the subreddit creation date) is met.
Areas of improvement:
Declare local and global variables more elegantly.
Use less recursion, or use it less stupidly, to prevent a stack overflow (see the sketch after this list).
Use more variables or separate functions for generating the time epochs instead of hard-coding them, to make the script more versatile.
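Something like this loop-based version (an untested sketch; the scrape() helper name is made up, but the endpoint and parameters are the same as above) would address the last two points by dropping the recursion and sys.exit() and taking the timestamps as arguments:

import requests

def scrape(start_ts, end_ts, window=432000):
    # Count each gender tag in all titles between end_ts and start_ts
    counts = {tag: 0 for tag in ("m4f", "f4m", "m4a", "f4a", "f4f", "m4m")}
    before = start_ts
    # Walk backwards in fixed windows until we pass the creation date
    while before > end_ts:
        after = before - window
        r = requests.get(
            "https://api.pushshift.io/reddit/submission/search/",
            params={"subreddit": "gonewildaudio", "filter": "title",
                    "size": 1000, "after": after, "before": before},
        )
        for post in r.json()["data"]:
            title = post["title"].lower()
            for tag in counts:
                if tag in title:
                    counts[tag] += 1
                    break
        before = after
    return counts

# e.g. from the starting timestamp back to the subreddit's creation date
print(scrape(1582816669, 1337522586))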