Building a Distributed Web Crawler on AWS

  1. Crawling Wikipedia: this is how standard crawlers such as Google's work, following links from one page to another in a breadth-first search (BFS).
  2. Crawling Amazon: using a seed set of search queries, I am going to crawl search results for product titles, then follow the product links and crawl product metadata from the details pages. This is a two-level BFS.
BFS search for URLs
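A minimal sketch of a BFS over URLs, assuming a caller-supplied `get_links` function that returns the outgoing links of a page. The function name, the `max_depth` parameter, and the toy link graph are illustrative assumptions, not taken from the original crawler.

```python
from collections import deque
from typing import Callable, Iterable, List


def bfs_crawl(seed: str,
              get_links: Callable[[str], Iterable[str]],
              max_depth: int = 2) -> List[str]:
    """Visit URLs level by level, starting from `seed`."""
    visited = {seed}
    order = []
    queue = deque([(seed, 0)])          # FIFO queue of (url, depth)
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth == max_depth:
            continue                    # do not expand beyond the depth limit
        for link in get_links(url):
            if link not in visited:     # skip URLs already queued or crawled
                visited.add(link)
                queue.append((link, depth + 1))
    return order


# Toy link graph standing in for real pages (hypothetical page names).
GRAPH = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d", "a"],
    "d": ["e"],
}

if __name__ == "__main__":
    print(bfs_crawl("a", lambda u: GRAPH.get(u, []), max_depth=2))
```

In a real crawler `get_links` would download the page and parse its anchors; a depth limit of 2 corresponds to the two-level search over search results and details pages.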
Parse URL using BeautifulSoup
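To keep the example free of third-party dependencies, the sketch below swaps BeautifulSoup for the standard library's `html.parser`; the idea is the same (collect the `href` of every anchor and resolve it against the page URL). The class and function names are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collects absolute link targets from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Resolve relative links and drop the #fragment part.
            absolute, _fragment = urldefrag(urljoin(self.base_url, href))
            self.links.append(absolute)


def extract_links(html, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links
```

Dropping the fragment means `page#History` and `page` are treated as the same URL, which avoids crawling a page twice.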
Get list of free proxy IPs
Class for sampling from a weighted probability distribution.
Get random proxy from weighted distribution.
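One way to sample from a weighted distribution is to precompute cumulative weights and binary-search a uniform random number into them. The sketch below assumes each proxy carries a non-negative weight (for example a success score); the class name and the weights are hypothetical.

```python
import bisect
import itertools
import random


class WeightedSampler:
    """Draws items with probability proportional to their weights."""

    def __init__(self, items, weights, rng=None):
        if not items or len(items) != len(weights):
            raise ValueError("items and weights must be non-empty and equal in length")
        if any(w < 0 for w in weights) or sum(weights) <= 0:
            raise ValueError("weights must be non-negative with a positive sum")
        self.items = list(items)
        self.cumulative = list(itertools.accumulate(weights))
        self.rng = rng or random.Random()

    def sample(self):
        # r is uniform in [0, total); find the first cumulative weight above it.
        r = self.rng.random() * self.cumulative[-1]
        return self.items[bisect.bisect_right(self.cumulative, r)]


if __name__ == "__main__":
    proxies = ["10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080"]  # placeholder addresses
    sampler = WeightedSampler(proxies, [1, 3, 0])
    print(sampler.sample())
```

A proxy with weight 0 is never returned, so a failing proxy can be retired by zeroing its weight and rebuilding the sampler.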
Custom sleep class method that works on a per-domain basis.
  1. Crawl a URL only if it has not been crawled before: check the set of crawled URLs, and if the current URL is not in it, crawl the URL and add it to the set; otherwise skip it.
  2. Domain-based timeouts: check when the domain was last accessed. If more than the timeout interval has elapsed, crawl the URL and update the domain's last-accessed time to the current time; otherwise do not crawl URLs from that domain yet.
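The two rules above can be sketched with an in-memory set and dictionary. This is a single-process illustration; in the distributed setup the same state would live in a shared store. The class name `CrawlGate` and the injectable `clock` parameter are assumptions made for the example.

```python
import time
from urllib.parse import urlparse


class CrawlGate:
    """Decides whether a URL may be crawled right now."""

    def __init__(self, timeout_seconds=1.0, clock=time.monotonic):
        self.timeout = timeout_seconds
        self.clock = clock
        self.crawled = set()        # rule 1: URLs already crawled
        self.last_access = {}       # rule 2: domain -> last access time

    def should_crawl(self, url):
        if url in self.crawled:
            return False            # already crawled
        domain = urlparse(url).netloc
        now = self.clock()
        last = self.last_access.get(domain)
        if last is not None and now - last <= self.timeout:
            return False            # domain was hit too recently; retry later
        self.crawled.add(url)
        self.last_access[domain] = now
        return True
```

A URL rejected by the domain timeout is not added to the crawled set, so it can be re-queued and tried again once the interval has passed.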
Readers Writers Lock
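A minimal reader-preferring readers-writer lock built on `threading.Lock`, shown as a sketch rather than the original implementation. Any number of readers may hold the lock together, while a writer needs exclusive access. The class and method names are hypothetical.

```python
import threading
from contextlib import contextmanager


class ReadWriteLock:
    """Many concurrent readers, or one exclusive writer."""

    def __init__(self):
        self._readers = 0
        self._readers_mutex = threading.Lock()   # protects the reader count
        self._resource = threading.Lock()        # held by a writer or by the reader group

    @contextmanager
    def read(self):
        with self._readers_mutex:
            self._readers += 1
            if self._readers == 1:
                self._resource.acquire()         # first reader locks out writers
        try:
            yield
        finally:
            with self._readers_mutex:
                self._readers -= 1
                if self._readers == 0:
                    self._resource.release()     # last reader lets writers in

    @contextmanager
    def write(self):
        with self._resource:                     # exclusive access
            yield
```

Because new readers can keep joining while others are still reading, a writer may wait indefinitely under a steady read load, and the lock only exists inside one process.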
Redis Bloom Filter Implementation
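The sketch below shows the Bloom filter logic with a local `bytearray` standing in for a Redis bitmap; against Redis, each bit write and read would map to the `SETBIT` and `GETBIT` commands on a single key. The sizes, the SHA-256 double-hashing scheme, and the class name are assumptions for illustration.

```python
import hashlib


class BloomFilter:
    """Probabilistic set: no false negatives, rare false positives."""

    def __init__(self, num_bits=1 << 20, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)   # stand-in for a Redis bitmap

    def _positions(self, item):
        # Double hashing: derive k bit positions from two halves of one digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```

For a crawler this trades a small chance of skipping an uncrawled URL (a false positive) for a fixed memory footprint, instead of storing every crawled URL in a set.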
Cassandra write process.
fabfile for remote execution.
Redis Cluster key distribution.
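Redis Cluster maps every key to one of 16384 hash slots using CRC16 of the key modulo 16384, and hashes only the part between the first `{` and the following `}` when that hash tag is non-empty. The sketch below reproduces that rule locally; it relies on `binascii.crc_hqx` with an initial value of 0, which computes the same CRC16 variant (XMODEM). The function name is hypothetical.

```python
import binascii

NUM_SLOTS = 16384


def hash_slot(key: str) -> int:
    """Return the Redis Cluster hash slot a key would map to."""
    data = key.encode("utf-8")
    start = data.find(b"{")
    if start != -1:
        end = data.find(b"}", start + 1)
        if end != -1 and end != start + 1:      # non-empty hash tag
            data = data[start + 1:end]
    return binascii.crc_hqx(data, 0) % NUM_SLOTS


if __name__ == "__main__":
    for key in ("crawled_urls", "{queue}.pending", "{queue}.done"):
        print(key, hash_slot(key))
```

Keys that share a hash tag, such as `{queue}.pending` and `{queue}.done`, land in the same slot and therefore on the same node, which is what multi-key operations in cluster mode require.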
Sample code for reading and inserting into FIFO Queue
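As a single-process illustration of the read and insert pattern, the sketch below uses Python's thread-safe `queue.Queue` in place of a Redis list (where inserts and reads would typically be `RPUSH` and `LPOP`). The function name, the worker counts, and the placeholder URLs are assumptions.

```python
import queue
import threading


def run_workers(num_writers=4, num_readers=4, urls_per_writer=25):
    fifo = queue.Queue()            # thread-safe FIFO queue
    consumed = []
    consumed_lock = threading.Lock()

    def writer(worker_id):
        for i in range(urls_per_writer):
            fifo.put(f"https://example.org/{worker_id}/{i}")

    def reader():
        while True:
            url = fifo.get()        # blocks until an item is available
            if url is None:         # sentinel: no more work
                return
            with consumed_lock:
                consumed.append(url)

    readers = [threading.Thread(target=reader) for _ in range(num_readers)]
    writers = [threading.Thread(target=writer, args=(w,)) for w in range(num_writers)]
    for t in readers + writers:
        t.start()
    for t in writers:
        t.join()
    for _ in readers:
        fifo.put(None)              # one sentinel per reader
    for t in readers:
        t.join()
    return consumed


if __name__ == "__main__":
    print(len(run_workers()))
```

The sentinels are enqueued only after all writers have finished, so every real URL is consumed before the readers shut down.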
Multiple (100) threads are being used to read and write to the FIFO queue.
  1. Write threads can starve while waiting for read threads to release the lock on a resource.
  2. It is not distributed, i.e. a lock taken on a shared resource by a thread on one machine is not visible to a thread on a different machine.
  1. Assign a unique key name to the shared resource, e.g. the FIFO queue.
  2. Generate a unique lock name (e.g. uuid.uuid1()) to use as the value.
  3. Whenever a thread wants to obtain the lock, it calls the Redis SET command as follows.
  4. SET resource_name lock_name NX PX 1000
  5. This sets the key to the value. NX means the key is set only if it does not already exist, and PX 1000 means the key expires after 1000 milliseconds (1 second).
  6. If another thread tries to SET the resource_name key within the PX interval, the command does nothing and returns nil because the key is already set.
  7. A SET on the resource_name key succeeds only once the key has expired or been deleted, and no other thread has acquired it in the meantime.
  8. If a thread is not able to SET the key, it waits for some time (sleep) and then tries again, until it either acquires the lock or times out.
  9. To release the lock, the thread currently holding it DELs the key. It should do so only if the stored value is still its own lock name, so that it cannot delete a lock that expired and was then acquired by another thread.
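The steps above can be sketched as a small lock class. It expects a client object with `set(key, value, nx=..., px=...)`, `get(key)` and `delete(key)` methods, which matches how redis-py exposes these commands when the client is created with `decode_responses=True`. So that the example runs without a server, it includes `InMemoryClient`, a hypothetical single-process stand-in; all class names and default timings are assumptions.

```python
import time
import uuid


class RedisLock:
    def __init__(self, client, resource_name, ttl_ms=1000,
                 retry_delay=0.05, timeout=2.0):
        self.client = client
        self.resource_name = resource_name
        self.lock_name = str(uuid.uuid1())       # unique value identifying this holder
        self.ttl_ms = ttl_ms
        self.retry_delay = retry_delay
        self.timeout = timeout

    def acquire(self):
        deadline = time.monotonic() + self.timeout
        while True:
            # Equivalent to: SET resource_name lock_name NX PX ttl_ms
            if self.client.set(self.resource_name, self.lock_name,
                               nx=True, px=self.ttl_ms):
                return True
            if time.monotonic() >= deadline:
                return False
            time.sleep(self.retry_delay)

    def release(self):
        # Delete only if we still own the key. GET followed by DEL is not
        # atomic; a server-side script would be needed to close that gap.
        if self.client.get(self.resource_name) == self.lock_name:
            self.client.delete(self.resource_name)
            return True
        return False


class InMemoryClient:
    """Hypothetical stand-in exposing only the three calls used above."""

    def __init__(self):
        self._data = {}                          # key -> (value, expiry or None)

    def _live(self, key):
        entry = self._data.get(key)
        if entry and entry[1] is not None and time.monotonic() >= entry[1]:
            del self._data[key]                  # key has expired
            entry = None
        return entry

    def set(self, key, value, nx=False, px=None):
        if nx and self._live(key):
            return None                          # key exists: NX refuses to overwrite
        expiry = time.monotonic() + px / 1000 if px is not None else None
        self._data[key] = (value, expiry)
        return True

    def get(self, key):
        entry = self._live(key)
        return entry[0] if entry else None

    def delete(self, key):
        return 1 if self._data.pop(key, None) else 0
```

The expiry acts as a safety net: if the holder crashes before releasing, the key disappears after `ttl_ms` and other threads can proceed. The critical section therefore has to finish within that interval.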
Class implementing a custom locking mechanism on top of a distributed Redis cluster.
Redis pipeline
LIFO Stack traversal using DFS
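A minimal sketch of the DFS variant: the only change from a BFS is that URLs are taken from the end of a list used as a LIFO stack rather than from the front of a queue. The function name, the `get_links` callback, and the toy graph are illustrative assumptions.

```python
def dfs_crawl(seed, get_links, max_depth=2):
    """Visit URLs depth-first using an explicit LIFO stack."""
    visited = set()
    order = []
    stack = [(seed, 0)]
    while stack:
        url, depth = stack.pop()            # LIFO: most recently pushed first
        if url in visited:
            continue
        visited.add(url)
        order.append(url)
        if depth < max_depth:
            # Push in reverse so the first link on the page is explored first.
            for link in reversed(list(get_links(url))):
                if link not in visited:
                    stack.append((link, depth + 1))
    return order


# Toy link graph standing in for real pages (hypothetical page names).
LINKS = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d", "a"],
    "d": ["e"],
}

if __name__ == "__main__":
    print(dfs_crawl("a", lambda u: LINKS.get(u, []), max_depth=3))
```

A stack keeps the frontier small on deep link chains, at the cost of visiting pages far from the seed before nearby ones.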
Check size of queue before adding
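A small sketch of the size check, using a local `deque` as a stand-in for the shared queue (with a Redis list the length would come from `LLEN`). The cap value and the function name are hypothetical.

```python
from collections import deque

MAX_QUEUE_SIZE = 3   # hypothetical cap on pending URLs


def add_if_room(pending, url, max_size=MAX_QUEUE_SIZE):
    """Append `url` only while the queue is below its size cap."""
    if len(pending) >= max_size:
        return False             # queue is full: drop or retry later
    pending.append(url)
    return True


if __name__ == "__main__":
    pending = deque()
    for n in range(5):
        print(n, add_if_room(pending, f"https://example.org/{n}"))
```

Checking the size and then appending are two separate steps, so with several threads or machines the pair needs to run under a lock to keep the cap exact.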
Wikimedia pageviews API
Redis set with timeout

Abhijit Mondal
