Web Scraping Amazon



Web scraping means extracting information from a website. It is sometimes also called web harvesting or web data extraction. Copying text from a website and pasting it into a file on your own machine is web scraping too, just done by hand.

In its simplest form, web scraping is about making requests and extracting data from the response. For a small web scraping project, your code can be simple. You just need to find a few patterns in the URLs and in the HTML response and you're in business.

But everything changes when you're trying to pull over 1,000,000 products from the largest ecommerce website on the planet.

When crawling a sufficiently large website, the actual web scraping (making requests and parsing HTML) becomes a very minor part of your program. Instead, you spend a lot of time figuring out how to keep the entire crawl running smoothly and efficiently.

This was my first time doing a scrape of this magnitude. I made some mistakes along the way, and learned a lot in the process. It took several days (and quite a few false starts) to finally crawl the millionth product. If I had to do it again, knowing what I now know, it would take just a few hours.

In this article, I'll walk you through the high-level challenges of pulling off a crawl like this, and then run through all of the lessons I learned. At the end, I'll show you the code I used to successfully pull 1MM+ items from amazon.com.

I've broken it up as follows:

High-Level Challenges I Ran Into

There were a few challenges I ran into that you'll see on any large-scale crawl of more than a few hundred pages. These apply to crawling any site or running a sufficiently large crawling operation across multiple sites.

High-Performance is a Must

In a simple web scraping program, you make requests in a loop – one after the other. If a site takes 2-3 seconds to respond, then you're looking at making 20-30 requests a minute. At this rate, your crawler would have to run non-stop for about a month before it made its millionth request.

Not only is this very slow, it's also wasteful. The crawling machine is sitting there idly for those 2-3 seconds, waiting for the network to return before it can really do anything or start processing the next request. That's a lot of dead time and wasted resources.

When thinking about crawling anything more than a few hundred pages, you really have to think about putting the pedal to the metal and pushing your program until it hits the bottleneck of some resources – most likely network or disk IO.

I didn't need to do this for my purposes (more later), but you can also think about ways to scale a single crawl across multiple machines, so that you can start to push past single-machine limits.

Avoiding Bot Detection

Any site that has a vested interest in protecting its data will usually have some basic anti-scraping measures in place. Amazon.com is certainly no exception.

You have to have a few strategies up your sleeve to make sure that individual HTTP requests – as well as the larger pattern of requests in general – don't appear to be coming from one centralized bot.

For this crawl, I made sure to:

  1. Spoof headers to make requests seem to be coming from a browser, not a script
  2. Rotate IPs using a list of over 500 proxy servers I had access to
  3. Strip 'tracking' query params from the URLs to remove identifiers linking requests together

More on all of these in a bit.

The Crawler Needed to be Resilient

The crawler needs to be able to operate smoothly, even when faced with common issues like network errors or unexpected responses.

You also need to be able to pause and continue the crawl, updating code along the way, without going back to 'square one'. This allows you to update parsing or crawling logic to fix small bugs, without needing to rescrape everything you did in the past few hours.

I didn't have this functionality initially and I regretted it, wasting tons of hours hitting the same URLs again and again whenever I needed to make updates to fix small bugs affecting only a few pages.

Crawling At Scale Lessons Learned

From the simple beginnings to the hundreds of lines of python I ended up with, I learned a lot in the process of running this project. All of these mistakes cost me time in some fashion, and learning the lessons I present here will make your amazon.com crawl much faster from start to finish.

1. Do the Back of the Napkin Math

When I did a sample crawl to test my parsing logic, I used a simple loop and made requests one at a time. After 30 minutes, I had pulled down about 1000 items.

Initially, I was pretty stoked. 'Yay, my crawler works!' But when I turned it loose on the full data set, I quickly realized it wasn't feasible to run the crawl like this at full scale.

Doing the back of the napkin math, I realized I needed to be doing dozens of requests every second for the crawl to complete in a reasonable time (my goal was 4 hours).
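As a rough worked version of that estimate, here is the arithmetic in a few lines of Python. The figures come from the text above; the one-request-per-item assumption matches the simple sample crawl, and the variable names are only illustrative.

```python
# Back-of-the-napkin estimate: how many requests per second does a
# 1,000,000-item crawl need in order to finish in 4 hours?
TOTAL_ITEMS = 1_000_000        # target size of the crawl
GOAL_HOURS = 4                 # target duration
ITEMS_PER_REQUEST = 1          # assumption: one request per item

# Observed sample crawl: about 1000 items in 30 minutes.
sample_rate = 1000 / (30 * 60)                       # items per second
hours_at_sample_rate = TOTAL_ITEMS / sample_rate / 3600

# Rate needed to hit the goal.
required_rate = TOTAL_ITEMS / ITEMS_PER_REQUEST / (GOAL_HOURS * 3600)

print(f"Sequential crawl would take about {hours_at_sample_rate:.0f} hours")
print(f"Need about {required_rate:.0f} requests per second")
```

At the sample rate the full crawl would take roughly 500 hours (about three weeks), while the 4-hour goal works out to roughly 69 requests per second.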

This required me to go back to the drawing board.

2. Performance is Key, Need to be Multi-Threaded

In order to speed things up and not wait for each request, you'll need to make your crawler multi-threaded. This allows the CPU to stay busy working on one response or another, even when each request is taking several seconds to complete.

You can't rely on single-threaded, network blocking operations if you're trying to do things quickly. I was able to get 200 threads running concurrently on my crawling machine, giving me a 200x speed improvement without hitting any resource bottlenecks.
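To make the idea concrete, here is a minimal sketch of a thread pool fetching many URLs at once, using only the standard library. The fetch function is a stand-in that sleeps instead of making a real HTTP request, and the names (fetch, crawl, max_threads) are illustrative rather than taken from the actual crawler.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch(url):
    """Stand-in for a real HTTP request: sleeps to simulate network latency."""
    time.sleep(0.05)
    return url, 200


def crawl(urls, max_threads=200):
    """Fetch all URLs concurrently with a bounded pool of worker threads."""
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        # pool.map returns results in the same order as the input URLs.
        return list(pool.map(fetch, urls))


if __name__ == "__main__":
    urls = [f"https://example.com/page/{i}" for i in range(200)]
    start = time.perf_counter()
    results = crawl(urls)
    print(f"{len(results)} responses in {time.perf_counter() - start:.2f}s")
```

Because the threads spend nearly all of their time waiting on the network, many of them can overlap even on a single machine.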

3. Know Your Bottlenecks

You need to keep an eye on the four key resources of your crawling machine (CPU, memory, disk IO and network IO) and make sure you know which one you're bumping up against.

What is keeping your program from making 1MM requests all at once?

The most likely resource you'll use up is your network IO – the machine simply won't be capable of writing to the network (making HTTP requests) or reading from the network (getting responses) fast enough, and this is what your program will be limited by.

Note that it'll likely take hundreds of simultaneous requests before you get to this point. You should look at performance metrics before you assume your program is being limited by the network.

Depending on the size of your average response and how complex your parsing logic is, you could also run into CPU, memory or disk IO as a bottleneck.

You also might find bottlenecks before you hit any resource limits, like if your crawler gets blocked or throttled for making requests too quickly.

This can be avoided by properly disguising your request patterns, as I discuss below.

4. Use the Cloud

I used a single beefy EC2 cloud server from Amazon to run the crawl. This allowed me to spin up a very high-performance machine that I could use for a few hours at a time, without spending a ton of money.

It also meant that the crawl wasn't running from my computer, burning my laptop's resources and my local ISP's network pipes.

5. Don't Forget About Your Instances

The day after I completed the crawl, I woke up and realized I had left an m4.10xlarge running idly overnight.

I probably wasted an extra $50 in EC2 fees for no reason. Make sure you stop your instances when you're done with them!

6. Use a Proxy Service

This one is a bit of a no-brainer, since 1MM requests all coming from the same IP will definitely look suspicious to a site like Amazon that can track crawlers.

I've found that it's much easier (and cheaper) to let someone else orchestrate all of the proxy server setup and maintenance for hundreds of machines, instead of doing it yourself.

This allowed me to use one high-performance EC2 server for orchestration, and then rent bandwidth on hundreds of other machines for proxying out the requests.

I used ProxyBonanza and found it to be quick and simple to get access to hundreds of machines.
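Here is a minimal sketch of per-request proxy rotation with the standard library. It is an illustration rather than the code from the crawl: the proxy addresses are made up (a proxy service would supply the real list), and pick_proxy and opener_for are hypothetical helper names.

```python
import random
import urllib.request

# Hypothetical proxy addresses; a proxy service would supply the real ones.
PROXIES = [
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
    "http://proxy3.example.com:8080",
]


def pick_proxy(proxies=PROXIES):
    """Choose a proxy at random so consecutive requests use different IPs."""
    return random.choice(proxies)


def opener_for(proxy_url):
    """Build a URL opener that routes HTTP and HTTPS traffic through one proxy."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


# Usage (commented out because it would make a real network request):
# response = opener_for(pick_proxy()).open("https://www.example.com/", timeout=10)
```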

7. Don't Keep Much in Runtime Memory

If you keep big lists or dictionaries in memory, you're asking for trouble. What happens when you accidentally hit Ctrl-C when 3 hours into the scrape (as I did at one point)? Back to the beginning for you!

Make sure that the important progress information is stored somewhere more permanent.

8. Use a Database for Storing Product Information

Store each product that you crawl as a row in a database table. Definitely don't keep them floating in memory or try to write them to a file yourself.

Databases will let you perform basic querying, exporting and deduping, and they also have lots of other great features. Just get in a good habit of using them for storing your crawl's data.
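As one possible shape for this, here is a sketch using SQLite from the Python standard library. The table layout, column names and file name are assumptions for illustration. The UNIQUE constraint on the URL is only a first-pass dedupe, since the same product can appear under more than one URL.

```python
import sqlite3


def open_db(path="products.db"):
    """Open (or create) the database and make sure the products table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS products (
               id    INTEGER PRIMARY KEY,
               url   TEXT UNIQUE,
               title TEXT,
               price TEXT
           )"""
    )
    return conn


def save_product(conn, url, title, price):
    """Insert one product row; return False if that URL was already stored."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO products (url, title, price) VALUES (?, ?, ?)",
        (url, title, price),
    )
    conn.commit()
    return cur.rowcount == 1
```

Committing after each insert keeps the example simple; a busier crawler would probably batch several inserts per commit.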

9. Use Redis for Storing a Queue of URLs to Scrape

Store the 'frontier' of URLs that you're waiting to crawl in an in-memory store like Redis. Because the queue lives outside your crawler process, you can pause and continue your crawl without losing your place.

If the cache is accessible over the network, it also allows you to spin up multiple crawling machines and have them all pulling from the same backlog of URLs to crawl.
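A minimal sketch of such a queue, assuming the third-party redis-py package and a Redis server on localhost. The key name and helper names are hypothetical. The helpers accept any client object with rpush and lpop methods, which makes them easy to try out without a server.

```python
QUEUE_KEY = "crawl:frontier"  # hypothetical key name for the URL queue


def connect():
    """Connect to a local Redis server (needs the redis-py package installed)."""
    import redis  # imported here so the helpers below work with any compatible client

    return redis.Redis(host="localhost", port=6379, decode_responses=True)


def enqueue_url(client, url):
    """Add a URL to the tail of the queue."""
    client.rpush(QUEUE_KEY, url)


def dequeue_url(client):
    """Pop the next URL from the head of the queue, or None if it is empty."""
    return client.lpop(QUEUE_KEY)
```

Since the list is stored in Redis rather than in the crawler's own memory, killing and restarting the crawler leaves the remaining URLs in place.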

10. Log to a File, Not stdout

While it's temptingly easy to simply print all of your output to the console via stdout, it's much better to send everything to a log file. You can still watch the log lines arrive in real time by running tail -f on the log file.

Having the logs stored in a file makes it much easier to go back and look for issues. You can log things like network errors, missing data or other exceptional conditions.

I also found it helpful to log the current URL that was being crawled, so I could easily hop in, grab the current URL that was being crawled and see how deep it was in any category. I could also watch the logs fly by to get a sense of how fast requests were being made.
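For illustration, here is one way to set this up with Python's standard logging module. The logger name, file name and format are assumptions. Opening the file as UTF-8 also helps with product titles that contain non-ASCII characters.

```python
import logging


def make_logger(path="crawler.log"):
    """Return a logger that appends timestamped lines to a UTF-8 log file."""
    logger = logging.getLogger("crawler")
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid adding a second handler on repeated calls
        handler = logging.FileHandler(path, encoding="utf-8")
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger
```

Call make_logger() once at startup, then write lines such as logger.info("crawling %s", url) from anywhere in the crawler and follow them with tail -f crawler.log.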

11. Use screen to Manage the Crawl Process instead of your SSH Client

If you SSH into the server and start your crawler with python crawler.py, what happens if the SSH connection closes? Maybe you close your laptop or the wifi connection drops. You don't want that process to get orphaned and potentially die.

Using the built-in Unix screen command allows you to disconnect from your crawling process without worrying that it'll go away. You can close your laptop and simply SSH back in later, reconnect to the screen, and you'll see your crawling process still humming along.

12. Handle Exceptions Gracefully

You don't want to start your crawler, go work on other stuff for 3 hours and then come back, only to find that it crashed 5 minutes after you started it.

Any time you run into an exceptional condition, simply log that it happened and continue. It makes sense to add exception handling around any code that interacts with the network or the HTML response.

Be especially aware of non-ascii characters breaking your logging.
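The log-and-continue pattern might look something like the sketch below. The fetch and parse callables are placeholders for whatever does the network request and the HTML parsing; process_all is a hypothetical name.

```python
import logging

logger = logging.getLogger("crawler")


def process_all(urls, fetch, parse):
    """Run fetch and parse for each URL; log failures and keep going."""
    results, failed = [], []
    for url in urls:
        try:
            results.append(parse(fetch(url)))
        except Exception:
            # logger.exception records the traceback along with the message.
            logger.exception("Failed on %s", url)
            failed.append(url)
    return results, failed
```

Returning the failed URLs separately makes it possible to retry just those later instead of restarting the whole crawl.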

Site-Specific Lessons I Learned About Amazon.com

Every site presents its own web scraping challenges. Part of any project is getting to know which patterns you can leverage, and which ones to avoid.

Here's what I found.

13. Spoof Headers

Besides using proxies, the other classic obfuscation technique in web scraping is to spoof the headers of each request. For this crawl, I just grabbed the User Agent that my browser was sending as I visited the site.

If you don't spoof the User Agent, you'll get a generic anti-crawling response for every request to Amazon.

In my experience, there was no need to spoof other headers or keep track of session cookies. Just make a GET request to the right URL – through a proxy server – and spoof the User Agent and that's it – you're past their defenses.
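As a small illustration, this is how a browser-like User-Agent can be attached to a GET request with the standard library. The User-Agent string below is just an example value; in practice you would copy the one your own browser sends. Nothing here touches the network.

```python
import urllib.request

# Example value only: copy the string from your own browser's request headers.
BROWSER_USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)


def build_request(url, user_agent=BROWSER_USER_AGENT):
    """Create a GET request that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


req = build_request("https://www.example.com/some-listing-page")
# urllib stores header names capitalized, hence "User-agent" here.
print(req.get_header("User-agent"))
```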

14. Strip Unnecessary Query Parameters from the URL

One thing I did out of an abundance of caution was to strip out unnecessary tracking parameters from the URL. I noticed that clicking around the site seemed to append random IDs to the URL that weren't necessary to load the product page.

I was a bit worried that they could be used to tie requests to each other, even if they were coming from different machines, so I made sure my program stripped down URLs to only their core parts before making the request.
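A sketch of that stripping step with urllib.parse. The allowlist of parameters to keep (here just a hypothetical page parameter) and the example URL are assumptions. Which parameters a given page really needs has to be checked by hand.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

KEEP_PARAMS = {"page"}  # hypothetical allowlist of parameters the page needs


def strip_tracking(url, keep=KEEP_PARAMS):
    """Drop the fragment and every query parameter not on the allowlist."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k in keep]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


print(strip_tracking("https://www.example.com/dp/B000000000?ref_=abc&qid=123&page=2#reviews"))
```

An allowlist is a cautious design choice: any new tracking parameter the site introduces is dropped by default instead of having to be discovered and blocked.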

15. Amazon's Pagination Doesn't Go Very Deep

While some categories of products claim to contain tens of thousands of items, Amazon will only let you page through about 400 pages per category.

This is a common limit on many big sites, including Google search results. Humans don't usually click past the first few pages of results, so the sites don't bother to support that much pagination. It also means that going too deep into results can start to look a bit fishy.

If you want to pull in more than a few thousand products per category, you need to start with a list of lots of smaller subcategories and paginate through each of those. But keep in mind that many products are listed in multiple subcategories, so there may be a lot of duplication to watch out for.

16. Products Don't Have Unique URLs

The same product can live at many different URLs, even after you strip off tracking URL query params. To dedupe products, you'll have to use something more specific than the product URL.

How to dedupe depends on your application. It's entirely possible for the exact same product to be sold by multiple sellers. You might look for ISBN or SKU for some kinds of products, or something like the primary product image URL or a hash of the primary image.
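One way to turn those options into a single dedupe key is sketched below. The field names (isbn, sku, image_url) are hypothetical, and for simplicity the sketch hashes the image URL rather than the image bytes.

```python
import hashlib


def dedupe_key(product):
    """Return a stable key for a product dict, or None if nothing usable exists."""
    # Prefer a real identifier when the product has one.
    for field in ("isbn", "sku"):
        value = product.get(field)
        if value:
            return f"{field}:{value}"
    # Fall back to a hash of the primary image URL.
    image_url = product.get("image_url")
    if image_url:
        digest = hashlib.sha256(image_url.encode("utf-8")).hexdigest()
        return f"image:{digest}"
    return None
```

Storing this key in a column with a uniqueness constraint lets the database reject duplicates as they arrive.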

17. Avoid Loading Detail Pages

This realization helped me make the crawler 10-12x faster, and much simpler. I realized that I could grab all of the product information I needed from the subcategory listing view, and didn't need to load each product's detail page.

I was able to grab 10-12 products with one request, including each of their titles, URLs, prices, ratings, categories and images – instead of needing to make a request to load each product's detail page separately.

Whether you need to load the detail page to find more information like the description or related products will depend on your application. But if you can get by without it, you'll get a pretty nice performance improvement.

18. Cloudfront has no Rate Limiting for Amazon.com Product Images

While I was using a list of 500 proxy servers to request the product listing URLs, I wanted to avoid downloading the product images through the proxies since it would chew up all my bandwidth allocation.

Fortunately, the product images are served using Amazon's CloudFront CDN, which doesn't appear to have any rate limiting. I was able to download over 100,000 images with no problems – until my EC2 instance ran out of disk space.

Then I broke out the image downloading into its own little python script and simply had the crawler store the URL to the product's primary image, for later retrieval.

19. Store Placeholder Values

There are lots of different types of product pages on Amazon. Even within one category, there can be several different styles of HTML markup on individual product pages, and it might take you a while to discover them all.

If you're not able to find a piece of information in the page with the extractors you built, store a placeholder value like '' in your database.

This allows you to periodically query for products with missing data, visit their product URLs in your browser and find the new patterns. Then you can pause your crawler, update the code and then start it back up again, recognizing the new pattern that you had initially missed.
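The placeholder workflow could be sketched like this, with an in-memory SQLite table standing in for the real database. The placeholder string, the table layout and the extractor functions are all made up for the example.

```python
import sqlite3

MISSING = "MISSING"  # hypothetical placeholder value


def first_match(extractors, html):
    """Try each extractor in turn; fall back to the placeholder if none match."""
    for extract in extractors:
        value = extract(html)
        if value:
            return value
    return MISSING


# Two toy extractors: one that knows a pattern, one that never matches.
title_extractors = [
    lambda html: "Widget" if "<h1>Widget</h1>" in html else None,
    lambda html: None,
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (url TEXT, title TEXT)")
pages = {
    "https://example.com/a": "<h1>Widget</h1>",
    "https://example.com/b": "<div class='new-layout'>Gadget</div>",
}
for url, html in pages.items():
    conn.execute(
        "INSERT INTO products VALUES (?, ?)",
        (url, first_match(title_extractors, html)),
    )

# Periodic check: which product URLs still need a new extractor pattern?
missing_urls = [
    row[0]
    for row in conn.execute(
        "SELECT url FROM products WHERE title = ? ORDER BY url", (MISSING,)
    )
]
print(missing_urls)
```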

How My Finished, Final Code Works

TL;DR: Here's a link to my code on GitHub. It has a readme for getting you set up and started on your own amazon.com crawler.

Once you get the code downloaded, the libraries installed and the connection information stored in the settings file, you're ready to start running the crawler!

If you run it with the 'start' command, it looks at the list of category URLs you're interested in, and then goes through each of those to find all of the subcategory URLs that are listed on those pages, since paginating through each category is limited (see lesson #15, above).

It puts all of those subcategory URLs into a redis queue, and then spins up a number of threads (based on settings.max_threads) to process the subcategory URLs. Each thread pops a subcategory URL off the queue, visits it, pulls in the information about the 10-12 products on the page, and then puts the 'next page' URL back into the queue.

The process continues until the queue is empty or settings.max_requests has been reached.
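The loop described above can be sketched roughly as follows. This is a simplified stand-in, not the code from the repository: it uses an in-process queue.Queue where the real crawler uses a Redis queue, and process_page is a hypothetical callable that handles one listing page and returns its 'next page' URL (or None).

```python
import queue
import threading


def run_crawl(start_urls, process_page, max_threads=4, max_requests=100):
    """Process URLs until the queue is empty or max_requests is reached."""
    frontier = queue.Queue()
    for url in start_urls:
        frontier.put(url)

    lock = threading.Lock()
    state = {"requests": 0}

    def worker():
        while True:
            with lock:  # claim a URL and count the request atomically
                if state["requests"] >= max_requests:
                    return
                try:
                    url = frontier.get_nowait()
                except queue.Empty:
                    return
                state["requests"] += 1
            next_url = process_page(url)
            if next_url:
                frontier.put(next_url)  # put the 'next page' back on the queue

    threads = [threading.Thread(target=worker) for _ in range(max_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["requests"]
```

One simplification to be aware of: a worker that finds the queue empty exits for good, so parallelism can drop near the end of a crawl. Every queued URL is still processed, because the worker that puts a 'next page' URL back keeps looping.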

Note that the crawler does not currently visit each individual product page since I didn't need anything that wasn't visible on the subcategory listing pages, but you could easily add another queue for those URLs and a new function for processing those pages.

Hope that helps you get a better sense of how you can conduct a large scrape of amazon.com or a similar ecommerce website.

If you're interested in learning more about web scraping, I have an online course that covers the basics and teaches you how to get your own web scrapers running in 15 minutes.




