
When a crawler meets dynamic loading, how does a hands-on scraper survive?
You have probably run into this: you fetch a page with requests and the data is nowhere in the page source, yet the browser shows it plainly. That is dynamic loading at work, and it is the moment to bring out the heavy hitter: Selenium. Knowing how to open a browser is not enough, though. The site may block your IP, and that is when a proxy IP saves your skin.
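
    from selenium import webdriver

    # Placeholder credentials and port; substitute your own.
    # Note: Chrome typically ignores credentials embedded in --proxy-server, so if
    # authentication fails, use another route (e.g. IP whitelisting, if your provider offers it).
    proxy = "http://username:password@gateway.ipipgo.com:9021"
    options = webdriver.ChromeOptions()
    options.add_argument(f'--proxy-server={proxy}')
    # Remember to put chromedriver in the same directory as the script
    driver = webdriver.Chrome(options=options)
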
One pitfall to watch out for: don't hard-code your username and password. Keep them in a configuration file instead. Also, with ipipgo's proxies the gateway domain gateway.ipipgo.com is followed by a port number that differs for each user, so don't copy my code verbatim.
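As a rough illustration of the configuration-file advice, here is a minimal sketch that builds the proxy URL from an INI file. The file name `proxy.ini`, the `[proxy]` section, and the key names are all hypothetical choices for this example, not anything ipipgo prescribes.

```python
import configparser
from urllib.parse import quote


def load_proxy_url(path, section="proxy"):
    """Build a proxy URL from an INI file instead of hard-coding credentials.

    Assumed (hypothetical) layout:
        [proxy]
        host = gateway.ipipgo.com
        port = 9021
        username = ...
        password = ...
    """
    # interpolation=None so that a '%' inside a password is read literally
    config = configparser.ConfigParser(interpolation=None)
    if not config.read(path, encoding="utf-8"):
        raise FileNotFoundError(path)
    settings = config[section]

    credentials = ""
    if settings.get("username"):
        # Percent-encode so characters like '@' or ':' cannot break the URL
        user = quote(settings["username"], safe="")
        password = quote(settings.get("password", ""), safe="")
        credentials = f"{user}:{password}@"

    return f"http://{credentials}{settings['host']}:{settings.getint('port')}"
```

A call such as `proxy = load_proxy_url("proxy.ini")` would then replace the hard-coded string, and the file itself can be kept out of version control.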
Three Survival Rules for Proxy IPs
Scraping dynamic content is like playing Minesweeper: with a poor proxy IP you step on a mine within minutes. From the pits I have fallen into, here are three life-saving lessons:
① Rotate IPs instead of sticking to one
Don't grab one IP and use it to death; changing the IP every 5 pages is a sensible starting point. ipipgo's API can extract IPs in bulk, and a queue makes them easy to manage.
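The queue idea might look something like the sketch below. `ProxyRotator` and its method names are made up for illustration, and the proxy list is assumed to have been fetched already (the call to ipipgo's extraction API is not shown because its format is not described here).

```python
from collections import deque


class ProxyRotator:
    """Hand out proxies round-robin, switching after a fixed number of pages."""

    def __init__(self, proxies, pages_per_proxy=5):
        if not proxies:
            raise ValueError("need at least one proxy")
        self._queue = deque(proxies)
        self._pages_per_proxy = pages_per_proxy
        self._pages_served = 0

    def next_page(self):
        """Return the proxy to use for the next page, rotating once the quota is spent."""
        if self._pages_served >= self._pages_per_proxy:
            self._queue.rotate(-1)  # send the used proxy to the back of the queue
            self._pages_served = 0
        self._pages_served += 1
        return self._queue[0]

    def discard_current(self):
        """Drop a proxy that appears to be blocked and move on to the next one."""
        self._queue.popleft()
        self._pages_served = 0
        if not self._queue:
            raise RuntimeError("proxy pool exhausted")


if __name__ == "__main__":
    rotator = ProxyRotator(["http://10.0.0.1:9021", "http://10.0.0.2:9021"])
    for page in range(7):
        print(page, rotator.next_page())
```

Because `--proxy-server` is a startup argument, switching proxies in practice means quitting the current driver and creating a new one with the new value.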
② Choose the right level of anonymity
| Proxy type | Applicable scenarios |
|---|---|
| Transparent proxy | Basically useless |
| Anonymous proxy | Routine collection |
| High-anonymity proxy | Sites with strict anti-scraping measures |
In my testing, ipipgo's high-anonymity proxies got past roughly 90% of anti-scraping measures, which is especially handy for cross-border e-commerce data collection.
③ Setting timeouts is an art
Don't wait around forever! I recommend a page-load timeout of 15 seconds and a proxy connection timeout of 20 seconds. On ipipgo's premium line, 10 seconds is enough; their response time really is fast.
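One way to apply these numbers is sketched below. `set_page_load_timeout` is a real WebDriver method; the proxy check is a plain TCP connection test standing in for a "proxy connection timeout", which is an assumption on my part about how to enforce that limit. The helper names are hypothetical.

```python
import socket

PAGE_LOAD_TIMEOUT = 15      # seconds, from the recommendation above
PROXY_CONNECT_TIMEOUT = 20  # seconds, from the recommendation above


def proxy_reachable(host, port, timeout=PROXY_CONNECT_TIMEOUT):
    """Return True if a TCP connection to the proxy opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections, DNS failures and timeouts
        return False


def apply_timeouts(driver, page_load=PAGE_LOAD_TIMEOUT):
    """Make driver.get() give up after `page_load` seconds instead of hanging."""
    driver.set_page_load_timeout(page_load)
    return driver
```

A typical flow would be to call `proxy_reachable("gateway.ipipgo.com", 9021)` before launching the browser, then `apply_timeouts(driver)` right after creating it, and treat a timeout on `driver.get()` as a signal to switch proxies.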
Slick Tricks from the Real World
I recently helped a friend set up price monitoring for a travel website and came away with two go-to moves:
Invisibility package: stack headless mode and a proxy IP for a double buff. Remember to add the startup argument --headless=new. Paired with ipipgo's dynamic residential IPs, the success rate climbed sharply in my runs.

    options.add_argument("--headless=new")
    options.add_argument("--disable-blink-features=AutomationControlled")

Fingerprint obfuscation: changing browser fingerprint parameters requires loading an extension. With ipipgo's mobile IP pool, though, you often don't need to go to that trouble; naturally varied exit IPs are the best disguise.
Q&A: Common Ways It Goes Wrong
Q: Why can't I open the web page once the proxy is on?
A: Nine times out of ten it is a certificate problem. Try adding `options.add_argument('--ignore-certificate-errors')` to the options.
Q: What should I do if pages load very slowly through the proxy?
A: First switch to a different ipipgo data-center node, ideally one close to the target site. For example, to scrape a Japanese website, use their Osaka data-center line.
Q: What should I do if I run into human verification?
A: Combine residential proxy IPs with simulated mouse movement. Better still, keep your collection frequency under control and don't antagonize the site.
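Controlling the collection frequency can be as simple as enforcing a randomized gap between page loads. The sketch below is one possible approach; the `Throttle` class and the 2 to 5 second range are my own assumptions, not figures from any particular site.

```python
import random
import time


class Throttle:
    """Enforce a randomized minimum gap between consecutive page loads."""

    def __init__(self, min_delay=2.0, max_delay=5.0,
                 clock=time.monotonic, sleep=time.sleep):
        self._min = min_delay
        self._max = max_delay
        self._clock = clock   # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last = None     # time of the previous page load, if any

    def wait(self):
        """Sleep just long enough that a random gap has passed since the last call."""
        gap = random.uniform(self._min, self._max)
        if self._last is not None:
            remaining = gap - (self._clock() - self._last)
            if remaining > 0:
                self._sleep(remaining)
        self._last = self._clock()


if __name__ == "__main__":
    throttle = Throttle(min_delay=0.1, max_delay=0.3)
    for page in range(3):
        throttle.wait()  # call this right before each driver.get(...)
        print("fetching page", page)
```

Time already spent parsing a page counts toward the gap, so the scraper only sleeps when it would otherwise hit the site too quickly.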
A Few Words from the Heart
After so many years of data collection, the biggest lesson I've learned is this: have the right tools and the right resources at hand. Selenium is powerful, but running it without reliable proxy IPs is like going into battle bare-chested. I've used plenty of proxy services, and I settled on ipipgo for the long term for two reasons: first, their IP pool refreshes quickly; second, their technical support responds promptly. I once filed a ticket at three in the morning and someone actually replied.
One final note for newcomers: don't just stare at the code. The quality of your proxy IPs directly affects your success rate. To start, I recommend ipipgo's volume-based package: get 500 IPs to practice with, figure out the target site's anti-scraping patterns, and only then scale up. After all, the time you save can be worth far more than the proxy fees.

