In the last post, I outlined the steps I took to build a distributed web crawler on AWS and the challenges faced along the way. It took me almost 3–4 days to get it up and running smoothly.
In this post we are going to reuse some of the modules built earlier to create an Amazon product title+description crawler. The purpose of this crawler is to collect product metadata, which will subsequently be used for Machine Learning tasks such as:
- Real-time named-entity annotation and feedback for retraining NER models.
- Building a semantic product graph.
Some open e-commerce datasets are available for free, but they are very limited, and I did not want to spend money buying data, at least at this stage.
Thus I came up with my own crawler to obtain data and run NER annotation and tagging on it. The core crawling engine is robust enough to be reused for other websites.
The main pain point of crawling Amazon is avoiding getting your IPs blocked. Our crawler should therefore be careful not to overwhelm Amazon's servers with requests:
- Use browser user-agent strings so as not to be classified as a bot.
- Use rotating proxy IPs obtained from free proxy websites.
- Use timeouts whenever possible, with a minimum of 1 second between requests.
- Requests will often fail. Sleep for 2–3 seconds, then try again.
- Keep the number of threads under 50.
- Use multiple Amazon domains, e.g. amazon.in, amazon.co.uk, amazon.sg, etc. Each domain is served from servers near its region, so you will not be hitting the same server again and again.
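The precautions above can be sketched in a small politeness helper. This is a minimal illustration rather than the actual crawler code; the user-agent strings and delay constants are placeholders you should tune:

```python
import random
import time

# Illustrative user-agent strings; the real list should be larger
# and kept up to date with current browser releases.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]

MIN_DELAY_SECONDS = 1.0      # minimum gap between consecutive requests
RETRY_SLEEP_RANGE = (2, 3)   # sleep 2-3 seconds before retrying a failure


def random_headers():
    """Headers for the next request, with a randomly chosen user-agent."""
    return {"User-Agent": random.choice(USER_AGENTS)}


def polite_sleep(failed=False):
    """Sleep between requests; back off a little longer after a failure."""
    if failed:
        time.sleep(random.uniform(*RETRY_SLEEP_RANGE))
    else:
        time.sleep(MIN_DELAY_SECONDS)
```
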
Following are the steps and the code for crawling Amazon.
Define the search queries. Following are a few search queries I am interested in:
Define the domains and the weight of each domain. We do not want to give uniform weight to every domain, because data from some domains, e.g. amazon.com and amazon.co.uk, is more important to us.
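As a sketch, weighted domain selection can be done with `random.choices`; the weights below are hypothetical, not the values used in the original code:

```python
import random

# Hypothetical weights: amazon.com and amazon.co.uk matter most to us.
DOMAIN_WEIGHTS = {
    "amazon.com": 5,
    "amazon.co.uk": 4,
    "amazon.in": 2,
    "amazon.sg": 1,
}


def pick_domain(rng=random):
    """Pick a domain at random, biased towards higher-weight domains."""
    domains = list(DOMAIN_WEIGHTS)
    weights = list(DOMAIN_WEIGHTS.values())
    return rng.choices(domains, weights=weights, k=1)[0]
```
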
Obtain a free proxy list and assign weights to the proxy IPs. IPs at the head of the list are given higher weight.
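One simple way to favour IPs at the head of the list is a linearly decaying weight by rank. This is a sketch of the idea, not necessarily the exact scheme used; fetching the list itself (scraping a free-proxy site) is omitted:

```python
def rank_weights(proxies):
    """Assign linearly decaying weights by position: the IP at the head
    of the list gets weight len(proxies), the tail gets weight 1."""
    n = len(proxies)
    return {proxy: n - i for i, proxy in enumerate(proxies)}
```
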
Since the results for each search query are paginated, we also need to pass a page number along with the search query. The page numbers are randomized as well, so as not to create a pattern of similar URLs that Amazon could detect.
Thus each URL is generated from a randomly selected domain, a random search query, and a random page number for the selected query.
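Putting the pieces together, a random crawl URL might be generated like this. The `s?k=<query>&page=<n>` layout mirrors Amazon's search URLs but should be treated as an assumption to verify against the live site:

```python
import random
import urllib.parse


def random_search_url(domains, domain_weights, queries, max_page=10, rng=random):
    """Build one crawl URL from a random domain, a random search query,
    and a random page number (all hedged sketches of the real logic)."""
    domain = rng.choices(domains, weights=domain_weights, k=1)[0]
    query = rng.choice(queries)
    page = rng.randint(1, max_page)
    # Assumed Amazon search URL layout: /s?k=<query>&page=<n>
    return (f"https://www.{domain}/s?"
            + urllib.parse.urlencode({"k": query, "page": page}))
```
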
Set up the Redis (ElastiCache) and Cassandra (Amazon Keyspaces) connections.
Add the URLs corresponding to each search query to the Redis FIFO queue if the URL has not already been crawled and its metadata downloaded.
Note that unlike the Wikipedia crawler from the last post, our Amazon crawler will not encounter duplicate URLs, because of how we construct and crawl the URLs. But often a URL is crawled and added to the queue, yet its content fails to download; such URLs need to be crawled again.
To handle this we maintain a Redis SET of URL hashes. After adding a URL to the FIFO queue and before crawling it, we add its hash to the Redis set. If the content of the URL fails to download, we remove its hash from the set.
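A minimal sketch of this bookkeeping, assuming a `redis.Redis` client and hypothetical key names (`crawled:urls`, `queue:urls` are placeholders, not the original ones):

```python
import hashlib

URL_SET_KEY = "crawled:urls"   # hypothetical Redis key names
URL_QUEUE_KEY = "queue:urls"


def url_hash(url):
    """Stable hash used as the set member for a URL."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()


def enqueue_if_new(r, url):
    """Add url to the FIFO queue only if its hash is not already in the set.

    `r` is a redis.Redis client. SADD returns 1 only when the member was
    newly added, which doubles as our 'seen before?' check.
    """
    if r.sadd(URL_SET_KEY, url_hash(url)):
        r.rpush(URL_QUEUE_KEY, url)


def mark_failed(r, url):
    """Content failed to download: allow the URL to be queued again."""
    r.srem(URL_SET_KEY, url_hash(url))
```
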
Create and fire up threads to crawl the selected URLs. In my case I am starting with 20 threads.
Implement the ‘add_to_url_queue’ method. All threads call this method in parallel: each thread pops a URL from the Redis FIFO queue, crawls its contents, discovers new URLs, and adds them back to the FIFO queue.
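The thread loop can be sketched generically. In the real crawler the pop/push callables would wrap the Redis list operations; they are injected here so the loop itself stays self-contained. Names and signatures are illustrative, not the original ones:

```python
import threading


def worker(pop_url, crawl, push_url):
    """One crawler thread: pop a URL, crawl it, enqueue discovered URLs.

    pop_url()  -> next URL, or None when the queue is drained,
    crawl(url) -> list of newly discovered URLs,
    push_url(url) adds a URL back to the FIFO queue.
    """
    while True:
        url = pop_url()
        if url is None:
            break
        for new_url in crawl(url):
            push_url(new_url)


def start_threads(n, pop_url, crawl, push_url):
    """Fire up n worker threads (the post uses 20 to begin with)."""
    threads = [threading.Thread(target=worker, args=(pop_url, crawl, push_url))
               for _ in range(n)]
    for t in threads:
        t.start()
    return threads
```
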
For Amazon, the 1st level of crawling is with search queries. For a given search query and result page number, obtain the list of search results.
Each search result contains the product title along with the product details page URL. We can also obtain the rating and price of the product if they are listed. Note that we cannot obtain the product description from the search results; for the description we need to follow the product details page URLs from the search results.
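A heavily simplified parsing sketch using only the standard library (the original code most likely uses BeautifulSoup with Amazon-specific selectors). It assumes product links contain `/dp/` and that the link text is the title; both are assumptions to verify against real pages:

```python
from html.parser import HTMLParser


class SearchResultParser(HTMLParser):
    """Collect (title, product-details URL) pairs from a results page."""

    def __init__(self):
        super().__init__()
        self.results = []
        self._href = None
        self._title = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href", "")
            if "/dp/" in href:  # product-details links contain /dp/<ASIN>
                self._href = href
                self._title = []

    def handle_data(self, data):
        if self._href is not None:
            self._title.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.results.append(("".join(self._title).strip(), self._href))
            self._href = None


def parse_search_results(html):
    parser = SearchResultParser()
    parser.feed(html)
    return parser.results
```
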
Finally we come to the core parsing engine for the Amazon pages. As mentioned above, for the search query pages we fetch the product title, rating, and price, and add the product details page URLs as the next level of URLs to be crawled.
For the product details pages, we again obtain the product title, rating, and price and, more importantly, the product description, which was missing from the search results pages.
All required metadata such as the title and description is sanitized: special characters are removed, text is converted to lowercase, HTML tags are stripped, etc.
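A sanitization helper along these lines would do; the exact character whitelist is my choice here, not taken from the original code:

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")             # strip HTML tags
SPECIAL_RE = re.compile(r"[^a-z0-9\s.,-]")  # keep alphanumerics and light punctuation


def sanitize(text):
    """Lowercase, strip tags/entities and special characters, squeeze spaces."""
    text = TAG_RE.sub(" ", text)
    text = html.unescape(text)
    text = text.lower()
    text = SPECIAL_RE.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()
```
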
For persisting the crawled metadata from Amazon, we use a Cassandra database. To store the data we create 2 tables (after creating a keyspace).
The first table is for inserting the metadata from the search results (the 1st-level contents).
The primary key is a composite key of (URL hash, search result index).
This is because the primary key must be unique for all entries in a table. When data is written to the Cassandra memtable with an existing PRIMARY KEY value, the previous data associated with that key is overwritten, since the memtable is essentially a hash map keyed on the primary key (although the write is still duplicated in the commit log).
Since each search query URL yields 15–20 search results to insert into the database, we cannot use the hash of the URL alone; we also need a unique identifier for each search result, which is simply the index position of the result on the page.
The metadata, i.e. product title, rating, price, etc., is added as a JSON string into Cassandra.
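The first table's DDL might look like this; the keyspace, table, and column names are hypothetical, not copied from the original schema:

```sql
-- Hypothetical names; the composite primary key
-- (url_hash, result_index) keeps one row per search result.
CREATE TABLE IF NOT EXISTS amzn.search_results (
    url_hash     TEXT,
    result_index INT,
    metadata     TEXT,   -- JSON string: title, rating, price, ...
    PRIMARY KEY ((url_hash), result_index)
);
```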
The 2nd table is for inserting the product title and description:
The insert statements for both tables are prepared statements, which allows some optimization in Cassandra, as it does not have to parse the query every time we insert a new row.
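With the DataStax Python driver, the pattern looks roughly like this; the table and column names are hypothetical, and `session` is assumed to be an already-connected driver Session:

```python
import json

# Hypothetical table/column names for the search-results table.
INSERT_SEARCH_CQL = (
    "INSERT INTO amzn.search_results (url_hash, result_index, metadata) "
    "VALUES (?, ?, ?)"
)


def make_inserter(session):
    """Prepare the statement once, then reuse it for every row.

    session.prepare() parses the CQL a single time on the server,
    so subsequent executes skip re-parsing.
    """
    prepared = session.prepare(INSERT_SEARCH_CQL)

    def insert(url_hash, index, meta_dict):
        # Metadata dict is serialized to a JSON string, as described above.
        session.execute(prepared, (url_hash, index, json.dumps(meta_dict)))

    return insert
```
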
To run the crawler on AWS, I created 2 EC2 instances (t2.micro), one ElastiCache Redis cluster with 3 shards and a replication factor of 2 on cache.t3.small instances, and Amazon Keyspaces for Cassandra.
The crawlers are run in parallel from the local machine using the Fabric library.
Put this code on your local system in a file named fabfile.py and run the command ‘fab run_amzn_crawler’. That’s it!
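A minimal fabfile sketch (Fabric 2 API); the host addresses, key path, and script name are placeholders for your own setup, not the author's actual file:

```python
# fabfile.py -- deployment sketch; <instance-*-ip>, the key path, and
# amzn_crawler.py are placeholders for your own EC2 setup.
from fabric import task, Connection

HOSTS = ["ec2-user@<instance-1-ip>", "ec2-user@<instance-2-ip>"]


@task
def run_amzn_crawler(c):
    for host in HOSTS:
        conn = Connection(host, connect_kwargs={"key_filename": "~/.ssh/crawler.pem"})
        # Launch the crawler detached so it keeps running after we disconnect.
        conn.run("nohup python3 amzn_crawler.py > crawler.log 2>&1 &", pty=False)
```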
Sample results from Cassandra Keyspaces.
Do keep tabs on the data inserted into Amazon Keyspaces for errors or other data-sanity issues. Also check the log files on your instances to track any errors or warnings you did not anticipate.
The code is hosted here.