A customer has found a site with Cloudflare's ‘I’m Under Attack’ mode enabled.
We should still be able to scrape these pages, except when a captcha is requested. Even then, it would be useful to display the captcha to the admin and ask them to solve it, so that scraping from their server is re-enabled.
The example site here is:
https://www.size.co.uk/product/white-puma-x-ralph-sampson-low-og/140715/
The library to use to perform the bypass would be this:
https://github.com/Anorov/cloudflare-scrape
The bypass will have to be enabled through our own PhantomJS scraping service.
An additional check should be added to the start of any scraping session to detect Cloudflare protection and, if it is present, run the library.
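As a rough sketch of what that up-front check could look like, the function below classifies a response as a normal page, a JavaScript challenge, or a captcha. The function name and the return values are made up for illustration. The markers it looks for (a 503 with `jschl_vc` / `jschl_answer` form fields, or a 403 containing `/cdn-cgi/l/chk_captcha`, both with a `Server` header starting with `cloudflare`) are assumptions about what the challenge pages contain; they can change at any time and should be verified against the example site.

```python
from typing import Mapping


def classify_cloudflare_response(status: int, headers: Mapping[str, str], body: str) -> str:
    """Return "js_challenge", "captcha" or "none" for a fetched page.

    Heuristic only: the status codes and body markers are assumptions
    about Cloudflare's challenge pages, not a documented contract.
    """
    # Header names are case-insensitive, so look the Server header up manually.
    server = ""
    for name, value in headers.items():
        if name.lower() == "server":
            server = value.lower()

    if not server.startswith("cloudflare"):
        return "none"

    # 'I'm Under Attack' interstitial: solvable without a human.
    if status == 503 and "jschl_vc" in body and "jschl_answer" in body:
        return "js_challenge"

    # Captcha page: cannot be bypassed automatically, needs the admin.
    if status == 403 and "/cdn-cgi/l/chk_captcha" in body:
        return "captcha"

    return "none"
```

A "js_challenge" result would trigger the bypass library, while "captcha" would be the case where the captcha is shown to the admin instead.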
The process is pretty much:
1. Check if Cloudflare protection exists
2. Use the cloudflare-scrape library
3. Store the cookies for this target site for this source IP
4. When scraping after that, append the cookies to the cURL request (along with any other cookies specified by the user in the normal way) so that all requests bypass the protection page.
5. If the cookies fail, re-run the cloudflare-scrape library to get a fresh set of cookies for this site + server IP
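The steps above could be sketched roughly as follows. Everything named here (`ClearanceStore`, `cookie_header`, `curl_args`, `scrape`) is hypothetical, and the solver, the fetcher and the challenge check are passed in as callables so the sketch does not depend on how our scraping service is wired up. The solver is assumed to return a `(cookies, user_agent)` pair, which is the shape the cloudflare-scrape README shows for `cfscrape.get_tokens(url)`. The user agent is kept with the cookies because the README states that later requests must use the same one that obtained them.

```python
from typing import Callable, Dict, List, Optional, Tuple
from urllib.parse import urlsplit

Clearance = Tuple[Dict[str, str], str]                # (cookies, user_agent)
Solver = Callable[[str], Clearance]                   # e.g. cfscrape.get_tokens
Fetcher = Callable[[str, str, str], Tuple[int, str]]  # (url, cookies, ua) -> (status, body)


class ClearanceStore:
    """Caches clearance cookies per (target host, source IP) pair (step 3)."""

    def __init__(self, solve: Solver) -> None:
        self._solve = solve
        self._cache: Dict[Tuple[str, str], Clearance] = {}

    def get(self, url: str, source_ip: str, refresh: bool = False) -> Clearance:
        key = (urlsplit(url).hostname or "", source_ip)
        if refresh or key not in self._cache:
            # Steps 2 and 5: run the bypass library for a fresh set of cookies.
            self._cache[key] = self._solve(url)
        return self._cache[key]


def cookie_header(clearance: Dict[str, str], user: Optional[Dict[str, str]] = None) -> str:
    """Step 4: merge user-specified cookies with the clearance cookies."""
    merged = dict(user or {})
    merged.update(clearance)  # clearance cookies win on a name clash
    return "; ".join(f"{name}={value}" for name, value in merged.items())


def curl_args(url: str, cookies: str, user_agent: str) -> List[str]:
    """Argument list for a cURL call that carries the cookies and user agent."""
    return ["curl", "--silent", "--user-agent", user_agent, "--cookie", cookies, url]


def scrape(url: str, source_ip: str, store: ClearanceStore, fetch: Fetcher,
           is_challenge: Callable[[int, str], bool],
           user_cookies: Optional[Dict[str, str]] = None) -> str:
    """Fetch a page, refreshing the cookies once if the first attempt is blocked."""
    for refresh in (False, True):
        cookies, user_agent = store.get(url, source_ip, refresh=refresh)
        status, body = fetch(url, cookie_header(cookies, user_cookies), user_agent)
        if not is_challenge(status, body):
            return body
    raise RuntimeError(f"still blocked after refreshing cookies for {url}")
```

Keying the cache on host plus source IP mirrors steps 3 and 5. A single refresh before giving up is an arbitrary choice for the sketch; a captcha response would need to be routed to the admin rather than retried.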