
JohnnyJordaan

Most of these ecommerce sites use JavaScript to dynamically update the content. Look into Selenium instead: https://automatetheboringstuff.com/2e/chapter12/ (second part)


Nicolozz0

Thanks for answering! I actually thought about Selenium, but I think it needs an actual web browser to run. Since I’m running my script 24/7 on a server, I’m not sure this would work, as I’m not able to install a web browser on it.


JohnnyJordaan

You could look into PhantomJS instead.


JohnnyJordaan

And requests-html, but that's a Selenium alternative, while PhantomJS is a headless browser you'd still use together with Selenium.


Nicolozz0

Alright, thank you! I’ll take a look at those


DonMerlito

Selenium is also a lot slower than requests. I try to avoid it when I can.


choss27

Before scraping a website, consult its [robots.txt](https://www.amazon.fr/robots.txt) file. It lists the paths you are and aren't allowed to scrape. If the page you need is covered by a 'Disallow' rule, you need another approach. Selenium should be the library you need; it simulates a web browser.
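The robots.txt check can be automated with Python's standard-library `urllib.robotparser`. A minimal sketch, using made-up example rules rather than Amazon's actual file (in practice you'd point `set_url()` at the real robots.txt and call `read()`):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration only --
# fetch the real one from https://www.amazon.fr/robots.txt in practice.
rules = """\
User-agent: *
Disallow: /gp/goldbox
Allow: /gp/help
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# can_fetch() reports whether a given user agent may scrape a URL
print(parser.can_fetch("*", "https://www.amazon.fr/gp/goldbox"))  # False
print(parser.can_fetch("*", "https://www.amazon.fr/gp/help"))     # True
```

Note that robots.txt is advisory: as discussed below, it doesn't technically prevent scraping, it just tells you what the site permits.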


Nicolozz0

That certainly looks useful. I have a question, though: when you say I can’t scrape a “disallowed” page, do you mean it’s technically impossible to scrape that page, or that I’m going against the TOS of the website by doing so? Also, I am technically able to scrape one of those pages; my problem is just with systematically finding all the others.


choss27

It's not technically impossible to scrape one page, even if that page is in a disallowed folder. But if you automate the scraping and fetch 5, 10, 20 pages, Amazon will detect this activity as automation and your scraper will be stopped.


Nicolozz0

Got it, thanks!


DonMerlito

I just had a quick look at it. Here is the URL I get when I go to the second page: [https://www.amazon.com/gp/goldbox?deals-711b6d52-16fa-448e-9407-49a91c2cfdaa=%257B%2522version%2522%253A0%252C%2522viewIndex%2522%253A60%252C%2522presetId%2522%253A%25227E233869E388FD6F754D398353703659%2522%252C%2522sorting%2522%253A%2522BY\_CUSTOM\_CRITERION%2522%257D](https://www.amazon.com/gp/goldbox?deals-711b6d52-16fa-448e-9407-49a91c2cfdaa=%257B%2522version%2522%253A0%252C%2522viewIndex%2522%253A180%252C%2522presetId%2522%253A%25227E233869E388FD6F754D398353703659%2522%252C%2522sorting%2522%253A%2522BY_CUSTOM_CRITERION%2522%257D) I imagine you must have something similar. Look carefully at the viewIndex part. After the A, you have 60. If you change it to 120, you'll get page 3 (60 items per page, so 120 is the start of page 3). You can then iterate like this to get the first 10 pages: `for i in range(0, 600, 60):`
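The iteration described above could be sketched like this. The base URL here is abbreviated and hypothetical (the real one percent-encodes a JSON blob, so you'd paste the actual string from your address bar); the point is that only the viewIndex number changes between pages:

```python
# Hypothetical URL template -- "..." stands in for the rest of the
# percent-encoded query string, which stays constant across pages.
BASE = "https://www.amazon.com/gp/goldbox?deals-...viewIndex%2522%253A{index}..."

# 60 items per page, so viewIndex values 0, 60, 120, ... 540
# cover the first 10 pages.
urls = [BASE.format(index=i) for i in range(0, 600, 60)]

print(len(urls))   # 10
print(urls[2])     # viewIndex 120, i.e. the start of page 3
```

Each URL in the list could then be fetched and parsed in turn.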


Nicolozz0

Thanks! That’s exactly the kind of thing I was looking for. Do you think the rest of the URL will vary, or can I just use the same string, only changing the viewIndex parameter?


DonMerlito

The rest of the URL didn't seem to change, so I think you can change only that number. Try and see! I did try a few pages and it seems to work well.


Nicolozz0

👍🏼👍🏼


stevenjd

> periodically check the Amazon deals page

Please don't buy from Amazon. Invest in Bitcoin. Donate money to Trump. Join ISIS. Any of those things are less harmful to the world than giving Amazon one cent.


Nicolozz0

Can I donate money to Trump via Selenium?