Scraping images from dynamic web-pages using Selenium (also: how to bypass DDoS protection)

I <3 scraping projects. There, I said it! Collecting hundreds of gigabytes of data from a random source, borderline illegally, gives me a pleasure no other coding project can match. Then parsing the data to make sense out of it… mmmhh… love it.

I’ve been scraping static web-pages for a long time using Scrapy, wget, and urllib. It got boring. So I decided to give dynamic websites a try using Selenium and webdriver and….. holy shit it’s fun!

So Selenium is basically a tool which emulates an actual user opening and interacting with a web-page. It is mostly used for automated testing, but it also has some amazing scraping capabilities up its sleeve, which I’ll show you below.


Problem 1 >

I wanted to scrape all the images from this site: Australia’s top independent escort booking service. (No, I’m not perverted. I wanted to build an image classifier to find the perfect bra size for a given breast.) It is the perfect source for pics of semi-nude women listed with their bust size (maybe PornHub would’ve done the job, but then, it’s all exaggerated anyways).

But this is a dynamic website (where JavaScript renders the page), so the legendary wget won’t work here: it only fetches the HTML and site structure without executing the JavaScript. (We’ll still use wget later though.)

So here comes Selenium to the rescue:

from selenium import webdriver

options = webdriver.ChromeOptions()
driver = webdriver.Chrome(options=options)
driver.get("https://google.com")
# do some stuff with the page
driver.close()

We also need the Chrome (or Firefox) web-driver for Selenium to talk to. Download the driver > unzip it > put it into /usr/local/bin/
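If the driver isn’t on your PATH, you can also hand Selenium its location explicitly. A minimal sketch, assuming chromedriver sits at /usr/local/bin/chromedriver and a Selenium 3.x install (Selenium 4 prefers a Service object instead of executable_path):

from selenium import webdriver

# Point Selenium at the chromedriver binary explicitly,
# assuming it was dropped into /usr/local/bin/ as described above.
driver = webdriver.Chrome(executable_path='/usr/local/bin/chromedriver')
driver.get("https://google.com")
print(driver.title)
driver.close()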


Problem 2 >

The website has Cloudflare’s anti-DDoS protection enabled. It prevents a user (or a swarm of users) from flooding the site and overloading the server, thus preventing a crash. It presents a captcha before starting a user session and stores the approved session keys in cookies for about 30 minutes before asking for the captcha again. (All in all, a real pain in the a$$.)

To counter this, this is what we’ll do >

  • Solve the captcha manually.
  • Save the session cookies.
  • Pass on the session cookies to webdriver each time it tries to parse the website.
  • Solve a captcha every 30 minutes.

Simple enough? Here’s the implementation:

driver.get("https://scarletblue.com.au/sitemap.xml")
sleep(100)
print('sleeping 100s')
#do manual captcha shit
allCookies = (driver.get_cookie('__cfduid'),driver.get_cookie('cf_clearance'))

driver.close()
print(allCookies)

url = https://scarletblue.com.au/sitemap.xml

window = webdriver.Chrome()
    window.get(url)
    for cookie in allCookies:
        window.add_cookie(cookie)
    window.refresh()
    title = window.title
    if title == 'Attention Required! | Cloudflare':
        sys.exit("Verification Required at  >>  "+ url)

These are the two cookies we need for authentication > __cfduid ; cf_clearance.
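For reference, get_cookie() hands each of these back as a plain dict (or None if the cookie isn’t set), and add_cookie() accepts the same shape, which is why the tuple above can be fed straight back in. The field values below are only illustrative:

# Roughly what driver.get_cookie('cf_clearance') returns (values are illustrative):
example_cookie = {
    'name': 'cf_clearance',
    'value': '01fa21195e663b6c4c55d3019e1aad4f9eace5f6-1555612920-1800-250',
    'domain': '.scarletblue.com.au',
    'path': '/',
    'secure': False,
    'httpOnly': True,
    'expiry': 1555614720,   # roughly 30 minutes after the captcha was solved
}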


Problem 3 >

Find a list of all the pages on this site containing images of escorts, to feed to Selenium.

Solution: most websites have a sitemap.xml page. It is mostly used by search engines to index and organise the pages. And sure enough, this website had one.

If the website doesn’t have a sitemap, you may have to create one yourself > using a Node package called sitemap-generator-cli.

Use these commands to create a sitemap for the site:

sudo npm install -g sitemap-generator-cli
sitemap-generator -v https://website.com

After a quick session of editing/sorting/removing male escorts, I had this:

1200+ pages of 1200+ escorts
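(If you’d rather script that cleanup than do it by hand, here’s a rough sketch that pulls every <loc> URL out of a saved sitemap.xml and dumps them into the sitemap.txt file the crawler below expects. The filenames are my assumption; filtering out the male escorts is still on you.)

import re

# Extract every <loc> entry from a saved sitemap.xml and
# write one page URL per line into sitemap.txt.
with open("sitemap.xml", "r") as f:
    xml = f.read()

pageUrls = re.findall(r"<loc>(.*?)</loc>", xml)

with open("sitemap.txt", "w") as out:
    for pageUrl in pageUrls:
        out.write(pageUrl.strip() + "\n")

print(str(len(pageUrls)) + " page urls written to sitemap.txt")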

Problem 4 >

Let’s parse this beautiful stream of data and extract all the image URLs from it!

Here’s the full code for the crawler:

  • Parse the sitemap text file line-by-line and store it in an array.
  • Use webdriver to open each link one-by-one.
  • Extract the ‘src’ attribute of all the ‘<img>’ tags to get the URLs of all the images.
  • Write the name of the escort, their bust size, and all their image URLs to a separate text file.
  • Stop the program if the auth token times out and spit out the last parsed url.

from selenium import webdriver
from time import sleep
import sys

options = webdriver.ChromeOptions()
driver = webdriver.Chrome(options=options)

# solve the Cloudflare captcha manually during the pause, then grab the cookies
driver.get("https://scarletblue.com.au/sitemap.xml")
print('sleeping 100s')
sleep(100)
allCookies = (driver.get_cookie('__cfduid'), driver.get_cookie('cf_clearance'))
driver.close()

print(allCookies)

dataFile = open("data.txt", "a")

# read the list of page urls out of the sitemap text file
sitemap = open("sitemap.txt", "r")
urls = []
for line in sitemap:
    if line[0] == 'h':
        urls.append(line)

for url in urls:
    window = webdriver.Chrome()
    window.get(url)
    for cookie in allCookies:
        window.add_cookie(cookie)
    window.refresh()
    title = window.title
    if title == 'Attention Required! | Cloudflare':
        sys.exit("Verification Required at  >>  " + url)
    dataFile.write('\n#' + title)
    images = window.find_elements_by_tag_name('img')
    for image in images:
        imgUrl = image.get_attribute('src')
        if imgUrl:                      # some <img> tags have no src attribute
            print(imgUrl)
            dataFile.write('\n' + imgUrl)
    window.close()
    dataFile.write('\n \n')
    print(title)

dataFile.close()

To scale this up, you can create multiple spiders (like Scrapy does) and run them simultaneously, sharing the cookie tokens between them. (Cloudflare doesn’t care how many devices/instances use a single cookie simultaneously; it only cares about the time spent using the tokens.)
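Here’s a rough sketch of what that could look like with Python’s multiprocessing module; the chunking and the worker function are my own additions, not part of the crawler above, so treat it as a starting point:

from selenium import webdriver
from multiprocessing import Pool

def scrape_chunk(args):
    # each spider gets its own Chrome instance but reuses the shared cookies
    chunk, cookies = args
    results = []
    window = webdriver.Chrome()
    for url in chunk:
        window.get(url)
        for cookie in cookies:
            window.add_cookie(cookie)
        window.refresh()
        if window.title == 'Attention Required! | Cloudflare':
            break   # the token expired; stop this spider
        srcs = [img.get_attribute('src')
                for img in window.find_elements_by_tag_name('img')]
        results.append((window.title, srcs))
    window.close()
    return results

def run_spiders(urls, cookies, workers=4):
    # split the url list into `workers` roughly equal chunks and run them in parallel
    chunks = [urls[i::workers] for i in range(workers)]
    with Pool(workers) as pool:
        return pool.map(scrape_chunk, [(chunk, cookies) for chunk in chunks])

Calling run_spiders(urls, allCookies) from under an if __name__ == '__main__': guard kicks off four Chrome windows chewing through the list in parallel.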

Problem 5 >

After scraping the image URLs, it’s time to download the actual images. I could use Selenium for this too, but I found it slow and overkill for such a simple task.

So here I used my real MVP again: the almighty wget, plus a Python script to manage everything else.

Here’s how it works:

command = "wget -U 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36' --header='Accept: text/html' --header='Cookie: __cfduid="+cfduid+"; cf_clearance="+cf_clearance+"' -P "+ currDir +" "+ line
  • -U specifies the user-agent; it helps wget pretend it is Chrome.
  • --header specifies extra headers, including the cookies to use. The __cfduid and cf_clearance values are the same ones the Selenium scraper collected. Without them the command returns f*you! (a 403 Forbidden message).
  • -P specifies the output directory.

Here’s the full code for the image scraper:

  • Set up the current and default directories.
  • Set the __cfduid / cf_clearance cookie values and the output directory (currDir).
  • Get the image URLs from the data file.
  • If the first character of a line is ‘#’, make a folder named after the line.
  • Use Python’s os library to execute the Linux commands from the script.
  • Put the subsequent images (downloaded using wget) into that folder.
  • Live happily ever after (probably not).

import os

directory = 'misc'
currDir = '../output/misc'
# the two Cloudflare cookie values grabbed by the Selenium scraper
cfduid = 'dfa8ee2bfd6d988421a1399b6294b0f9d1555557385'
cf_clearance = '01fa21195e663b6c4c55d3019e1aad4f9eace5f6-1555612920-1800-250'

data = open("data.txt", "r")
urls = []   # ends up holding the folder names we create, one per escort
for line in data:
    line = line.rstrip('\n')
    if line and line[0] == 'h':
        # image url: hand it to wget, pretending to be Chrome and carrying the cookies
        command = "wget -U 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36' --header='Accept: text/html' --header='Cookie: __cfduid="+cfduid+"; cf_clearance="+cf_clearance+"' -P "+ currDir +" "+ line
        os.system(command)
    elif line and line[0] == '#':
        # title line: build a folder name from the first three words and create it
        directory = line.split(' ')[0] + '_' + line.split(' ')[1] + '_' + line.split(' ')[2]
        urls.append(directory)
        currDir = '../output/' + directory
        os.system("mkdir " + currDir)

print(urls)

Method 2: download the images using Selenium or urllib (in case wget doesn’t work, or the URL has some variable parameters like website.com/image.jpg?hello=12).

  • Open the data file and parse the URLs into an array.
  • Option 1: open each URL using webdriver > take a screenshot of the page > save the screenshot. (not recommended)
  • Option 2: use urllib.request.urlretrieve to download the pictures. (recommended)

Here’s the code :

from selenium import webdriver
import os
import urllib.request

dataFile = open("data.txt", "r")
counter = 0
currDir = 'scrapeResult/misc'

urls = []   # ends up holding the folder names we create, one per escort
for line in dataFile:
    line = line.rstrip('\n')
    if line and line[0] == 'h':
        window = webdriver.Chrome()
        window.get(line)
        # option_2 (recommended): download the image directly
        urllib.request.urlretrieve(line, currDir + "/image" + str(counter) + ".jpg")
        # option_1 (not recommended): screenshot the page instead
        # window.save_screenshot(currDir + "/screenshot" + str(counter) + ".png")
        counter = counter + 1
        window.close()
    elif line and line[0] == '#':
        directory = line.split(' ')[1] + '_' + line.split(' ')[2] + '_' + line.split(' ')[3] + '_' + line.split(' ')[4]
        urls.append(directory)
        currDir = 'scrapeResult/' + directory
        os.system("mkdir " + currDir)

Conclusion >

So the scraping part of the project was successful and took 9 hours to complete. (oh yeah!).

The sad part is that the data was much, much larger than I thought it would be: 60K+ images making up 65+ GB of data. So I have to abandon the project for now. Maybe I’ll pick it up again in the future (probably not).

On the flip-side, though, I am planning to make a script to solve image captchas using Google’s Vision API (I think it already exists… but who cares, I’m gonna build it anyways).

Also, building a nice general-purpose CLI for the script would be a sweet idea too!
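For what it’s worth, a bare-bones argparse wrapper would get that started; the flag names below are just my guess at what the crawler would need:

import argparse

# A bare-bones sketch of a CLI for the crawler. The flag names are my own
# guesses, not an existing interface.
parser = argparse.ArgumentParser(description="Scrape images from a dynamic website")
parser.add_argument("--sitemap", default="sitemap.txt",
                    help="text file with one page url per line")
parser.add_argument("--out", default="../output",
                    help="directory to dump the images into")
parser.add_argument("--captcha-wait", type=int, default=100,
                    help="seconds to wait while you solve the captcha")
args = parser.parse_args()
print(args.sitemap, args.out, args.captcha_wait)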

GLORIOUS

The full source code will be available on my GitHub profile.