This tutorial explores how to scrape sites where the content is loaded dynamically via JavaScript. The tools we will use are Python, PhantomJS and Selenium. You can clone the GitHub repository from here.
Problem
Nowadays a lot of sites use JavaScript to load their content dynamically. For example, let’s take a look at the ATP Singles USA tennis results for 2015. You can see that all the matches are listed below:
but if we disable JavaScript, the page keeps loading indefinitely:
What if we want to get all the games from that site using Python? The classic approach of simply sending a request to the site and parsing the HTML won’t work here, because the JavaScript is never executed and the body of the page therefore stays incomplete.
PhantomJS and Selenium to the Rescue
First we’ll set up our environment:
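The original setup commands aren’t reproduced here, but a minimal environment only needs Selenium installed into your Python environment. Note that newer Selenium releases dropped PhantomJS support, so pinning an older version (the exact version below is just an example) may be necessary:

```
pip install "selenium==2.53.6"   # an older release that still ships the PhantomJS driver
```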
Let’s make a Python file named scraper.py
and fill it up with our script:
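The script itself isn’t reproduced in this copy of the post, so here is a minimal sketch of what it might look like. The URL and the 'match' class name are placeholders, since the real page’s address and markup aren’t shown here; inspect the page to find the right values:

```python
# scraper.py
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Placeholder URL for the 2015 ATP Singles USA results page
URL = 'http://www.example.com/atp-singles-usa-2015/'

# Tell Selenium where the PhantomJS binary lives (the file we copy next to this script)
driver = webdriver.PhantomJS(executable_path='./phantomjs')
driver.get(URL)

# Wait until the JavaScript has rendered at least one match row.
# 'match' is a placeholder class name.
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, 'match'))
)

# Print the rendered HTML of the first match
first_match = driver.find_element_by_class_name('match')
print(first_match.get_attribute('innerHTML'))

driver.quit()
```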
NB! In order for this script to work, we must specify where to find PhantomJS. The PhantomJS binary can be downloaded from here. Just download the correct version for your operating system and unpack it; the file named ‘phantomjs’ can be found under the bin folder. Copy that file into the same folder as the scraper.py script.
Now let’s run our script:
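With the PhantomJS binary sitting next to the script, running it is just:

```
python scraper.py
```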
and we get the HTML for the first match, which was played between Djokovic and Federer:
Summary
As we saw, it’s relatively easy to get dynamically loaded content, although it’s a bit slower than the traditional method of just firing a request at the server and parsing the response. You can clone the GitHub repository from here.