Back in November 2014, around the time I was leaving college, I wanted to download all the episodes of Bleach, Naruto and One Piece. Sadly, I couldn't find any downloader or torrent that let me grab every episode to date. So now I have tried to solve that same problem. This script was built with the help of my colleague and college senior Mr. Kumar Harsh, who works on crawlers at Naukri.com.

The idea was simple: use Selenium to automate browser clicks and navigate to the download page of a particular video, then extract the download links from there. The site for downloading anime was Chia-Anime, mostly because it's the site I prefer for watching and downloading anime.

I installed Selenium with pip install selenium and that was all there was to it. The next step was to extract video links from the root page of an anime. The code was pretty straightforward: after inspecting elements in Chrome, we were able to extract an XPath for the video links. Cool! But extracting the XPath was not enough. I noticed that clicking the download link for a video spawned a lot of popups, which made it difficult to keep the Selenium window focused on the correct download page from code alone. We needed a solution for handling popups, and Adblock Plus came to our rescue.
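The link-extraction step doesn't need a browser to reason about: given the innerHTML of a post element, a regex pulls out the first URL. A minimal sketch (the markup and helper name here are illustrative, not taken from the actual site):

```python
import re

def extract_episode_url(inner_html):
    """Pull the first http(s) URL out of a post element's innerHTML."""
    match = re.search(r"(?P<url>https?://[^\s\"']+)", inner_html)
    return match.group("url") if match else None

# innerHTML modeled loosely on the site's post divs (illustrative markup)
sample = '<a href="http://www.chia-anime.tv/episode/rainbow-episode-1/">Episode 1</a>'
print(extract_episode_url(sample))
```

Excluding quote characters from the character class avoids capturing the trailing quote, which the full script below works around with `URL[:-1]` instead.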

Handling PopUps and Multiple Windows

Adblock Plus is a popular Firefox extension for blocking popups and ads. The tricky part was getting Adblock to run inside our Selenium-automated browser, and Firefox profiles were the solution. All I did was create a new Firefox profile just for scraping and install the Adblock extension on it. Now my code can load that profile and use Adblock natively.
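Creating such a profile from the command line might look like this (the profile name is illustrative; the extension itself still has to be installed once through the browser UI):

```shell
# Create a dedicated profile for scraping (stored under ~/.mozilla/firefox)
firefox -CreateProfile Scraping
# Launch Firefox with that profile and install Adblock Plus via the UI
firefox -P Scraping
```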

To my surprise, I noticed that a Firefox profile created on one machine can be moved to another machine and used without any changes. I used the same scraping profile, created on my local machine, to scrape videos on my VPS.
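Moving the profile is just copying its directory. Something like the following (the host is assumed; the directory name is the one Firefox generated locally, random prefix plus profile name):

```shell
# Copy the local scraping profile to the VPS, preserving permissions
rsync -a ~/.mozilla/firefox/bh2u5hdn.Scraping/ \
      user@vps:~/.mozilla/firefox/bh2u5hdn.Scraping/
```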

After dealing with the popups, we extracted the XPath of the download URL and that was it. The script looks like this:

    import time
    import re
    from selenium import webdriver
    from selenium.webdriver.common.keys import Keys
    import wget


    profile = webdriver.FirefoxProfile("/home/bunty/.mozilla/firefox/bh2u5hdn.Scraping")
    driver = webdriver.Firefox(profile)

    home_url = "http://www.chia-anime.tv/show/rainbow-nisha-rokubo-no-shichinin/"
    keywords = ["rainbow", "nisha"]
    driver.get(home_url)

    episode_urls = []
    download_urls = []
    elements = driver.find_elements_by_xpath("//div[contains(@class,'post')]")

    for element in elements:
        HTML = element.get_attribute("innerHTML")
        # the regex also captures the trailing quote, hence URL[:-1] below
        URL = re.search(r"(?P<url>https?://[^\s]+)", HTML).group("url")
        allFound = True

        for key in keywords:
            if not re.search(key, URL):
                allFound = False

        if allFound:
            episode_urls.append(URL[:-1])

    print "....Fetched Episode URLs of given page...."

    for url in episode_urls:
        driver.get(url)
        print "Trying to fetch from...", url

        try:
            elem = driver.find_element_by_id("download")
            elem.click()

            all_windows = driver.window_handles

            download_window = None

            for window in all_windows:
                if window != driver.current_window_handle:
                    download_window = window

            driver.close()
            if download_window:
                driver.switch_to_window(download_window)

            # Get the Download Links
            elem = driver.find_element_by_xpath("//*[@id=\"wrap\"]/table/tbody/tr[1]/th/table/tbody/tr[1]/td[2]/a[2]")
            download_url = elem.get_attribute("href")
            print "Downloading From....", download_url
            download_urls.append(download_url)
            wget.download(download_url)

            time.sleep(100) # Be a little generous

        except Exception:
            print "Failed to fetch...", url

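The window juggling in the middle of the try block boils down to: collect all handles, drop the one we're currently on, and switch to whatever remains. Isolated from Selenium, with handles as plain strings (the function name is mine):

```python
def find_download_window(all_windows, current):
    """Return the first window handle that isn't the current one, or None."""
    for window in all_windows:
        if window != current:
            return window
    return None

# The download click spawns a second window; we want its handle
print(find_download_window(["main", "popup"], "main"))
```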
Going Headless

What if we try to run this script on a server? There is no X server there (not preinstalled, at least). You can forward X using X11 forwarding, or maybe use VNC, but X11 forwarding is too slow and setting up VNC was too much pain just for this task.

We used Xvfb. As Wikipedia states,

Xvfb or X virtual framebuffer is a display server implementing the X11 display server protocol. In contrast to other display servers, Xvfb performs all graphical operations in memory without showing any screen output. From the point of view of the client, it acts exactly like any other X display server, serving requests and sending events and errors as appropriate. However, no output is shown. This virtual server does not require the computer it is running on to have a screen or any input device. Only a network layer is necessary.

I ran Xvfb inside a screen session and then pointed the DISPLAY environment variable at the display Xvfb was serving:

$ screen -R scraping_screen
$ sudo Xvfb :10 -ac
# detach from the screen session with Ctrl-a d
$ export DISPLAY=:10
$ echo $DISPLAY     # cross-check
:10
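If you'd rather not depend on the shell environment, the script can set DISPLAY itself before the driver starts (the display number must match the one given to Xvfb):

```python
import os

# Point every X client spawned by this process at the Xvfb display
os.environ["DISPLAY"] = ":10"
print(os.environ["DISPLAY"])
```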

Cool, now I can run the scraper on my VPS and download my favorite anime.

This was indeed a great way to kill 2 hours at the office. If you need to download an anime, just drop me a mail; I can download it for you on my server and share it with you over Google Drive/FTP. Cheers. Keep hacking!