
Hands-On: Scraping Dynamic Web Pages with Selenium and Proxy IPs
Anyone who does web crawling knows that more and more sites now load their content dynamically with JavaScript. Two days ago I was helping a friend scrape price data from an e-commerce platform, and the plain requests library simply could not get the complete data. That is when I remembered to bring out the big gun: Selenium.
Why are dynamic web pages difficult to work with?
Many sites today are like Russian nesting dolls: the initial request returns only an empty shell of a page, and the real content shows up only after the JavaScript has finished running. For example, the price on some product detail pages is fetched through an API about 3 seconds after the page loads, and a traditional crawler that only reads the initial HTML never sees it.
This is when you need a tool that drives a real browser and simulates a human user, such as Selenium. The catch is that many sites are very sensitive to automated access, and frequent requests can get your IP blocked within minutes. Last week I had 5 IPs blocked in a row during testing and nearly smashed my keyboard.
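The usual way to deal with late-loading content is to poll until it appears instead of sleeping for a fixed 3 seconds. Selenium ships this as `WebDriverWait`; the sketch below shows the same idea in plain Python so it can run without a browser. `poll_until` and its parameters are names I made up for illustration, and the `.price` selector in the trailing comment is a placeholder.

```python
import time


def poll_until(condition, timeout=10.0, interval=0.5,
               clock=time.monotonic, sleep=time.sleep):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    `clock` and `sleep` are injectable only to make the helper easy to test.
    """
    deadline = clock() + timeout
    while True:
        result = condition()
        if result:
            return result
        if clock() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        sleep(interval)


# With a real driver, Selenium's built-in equivalent looks roughly like this:
#
#   from selenium.webdriver.common.by import By
#   from selenium.webdriver.support.ui import WebDriverWait
#   from selenium.webdriver.support import expected_conditions as EC
#
#   price = WebDriverWait(driver, 10).until(
#       EC.presence_of_element_located((By.CSS_SELECTOR, ".price"))
#   ).text
```

Polling returns as soon as the price is there, so fast pages are not slowed down and slow pages do not silently return empty data.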
How do proxy IPs save the day?
This is where proxy IPs come in. The principle is simple: send requests through different IP addresses so that the site sees them as coming from different users. Pay attention to the proxy type you choose, though:
| Proxy Type | Anonymity Level | Applicable Scenarios |
|---|---|---|
| Transparent proxy | Lowest | Basically useless |
| Anonymous proxy | Moderate | Basic protection against blocking |
| High-anonymity proxy | Highest | Recommended choice |
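The three levels in the table usually come down to which headers the proxy adds to the forwarded request. As a rule of thumb, a transparent proxy passes your real IP along (typically in `X-Forwarded-For`), an anonymous proxy hides the IP but still announces itself (for example with `Via`), and a high-anonymity proxy adds neither. Here is a rough sketch of that rule; the function name and the header list are my own simplification, and it assumes you have some header-echo endpoint that shows what the target server received.

```python
import re

# Headers that commonly reveal a proxy is in use (not an exhaustive list).
PROXY_HINT_HEADERS = ("via", "x-forwarded-for", "forwarded", "proxy-connection")


def classify_anonymity(seen_headers, real_ip):
    """Classify a proxy from the headers the target server received.

    seen_headers: dict of header name -> value, as echoed back by an
    endpoint you control. real_ip: your own public IP as a string.
    """
    headers = {k.lower(): str(v) for k, v in seen_headers.items()}
    for value in headers.values():
        # Split on separators so "203.0.113.70" does not match "203.0.113.7".
        if real_ip in re.split(r"[,\s;=]+", value):
            return "transparent"
    if any(name in headers for name in PROXY_HINT_HEADERS):
        return "anonymous"
    return "high-anonymity"
```

Real sites check more signals than this, so treat the result as a quick sanity check on a proxy rather than a guarantee.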
I strongly recommend ipipgo's dedicated high-anonymity proxies here. Their IP pool is refreshed quickly, and in my own test I collected data for 24 hours straight without triggering a ban. Their dynamic authentication feature is also far more convenient than the traditional username-and-password method.
Configuring a Proxy in Selenium, Hands-on
Taking Chrome as an example, the key code looks like this (remember to install chromedriver first):
from selenium import webdriver

# Proxy address provided by ipipgo. Chrome ignores a user:pass@ prefix in
# --proxy-server, so authenticate another way (for example an IP allowlist,
# if your plan offers one) and pass only scheme://host:port here.
proxy = "http://gateway.ipipgo.com:9020"

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument(f'--proxy-server={proxy}')

# Reduce the chance of being recognized as an automation tool
chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"])
chrome_options.add_argument("--disable-blink-features=AutomationControlled")

driver = webdriver.Chrome(options=chrome_options)
driver.get("https://example.com")  # replace with the target site
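One thing to know about the `--proxy-server` flag: as far as I know, Chrome does not use a `user:pass@` part embedded in it, so a proxy URL copied straight from a provider dashboard may not authenticate. The helper below is a small sketch (the name `proxy_server_arg` is mine) that strips credentials and rejects a missing or invalid port before the browser even starts.

```python
from urllib.parse import urlsplit


def proxy_server_arg(proxy_url):
    """Build a Chrome --proxy-server argument from a proxy URL.

    Any user:pass part is dropped, and a missing or out-of-range port
    is rejected early with ValueError.
    """
    parts = urlsplit(proxy_url)
    if parts.scheme not in ("http", "https", "socks4", "socks5"):
        raise ValueError(f"unsupported proxy scheme: {parts.scheme!r}")
    if not parts.hostname:
        raise ValueError("proxy URL has no host")
    port = parts.port  # raises ValueError if the port is not a valid number
    if port is None:
        raise ValueError("proxy URL has no port")
    return f"--proxy-server={parts.scheme}://{parts.hostname}:{port}"
```

It plugs in as `chrome_options.add_argument(proxy_server_arg(proxy))`. Authentication then has to happen some other way, such as an IP allowlist on the provider side.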
Watch out for a few pitfalls:
1. Get the port number in the proxy address right; different plans may use different ports.
2. Adding the options that disable automation features is recommended.
3. Randomize the time between operations so the traffic does not look like a robot.
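For the third point, a minimal sketch of a randomized pause might look like this. The name `human_pause` and the 1 to 4 second default range are my own choices, not anything a site requires; tune them to the target.

```python
import random
import time


def human_pause(min_s=1.0, max_s=4.0, sleep=time.sleep):
    """Sleep for a random duration in [min_s, max_s] and return it.

    `sleep` is injectable so the helper can be tested without waiting.
    """
    if not 0 <= min_s <= max_s:
        raise ValueError("need 0 <= min_s <= max_s")
    delay = random.uniform(min_s, max_s)
    sleep(delay)
    return delay
```

Call `human_pause()` between `driver.get(...)` and each click or scroll so the intervals are not perfectly regular.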
Frequently Asked Questions
Q: I am clearly using a proxy IP, so why am I still getting blocked?
A: Check whether you are using high-anonymity proxies; transparent proxies leak your real IP. I recommend switching to ipipgo's enterprise-level proxies, which rotate IPs automatically.
Q: What should I do if Selenium is especially slow to start?
A: Try headless mode by adding these two lines:
chrome_options.add_argument("--headless")
chrome_options.add_argument("--disable-gpu")
Q: What do I do when a website asks for a CAPTCHA?
A: Keep the collection frequency reasonable. With good-quality IPs such as ipipgo's, you will rarely trigger a CAPTCHA at all. If you do run into one, you can hook up a CAPTCHA-solving service, but the cost goes up.
Maintenance Tips
If you are running a long-term collection job, I recommend setting up an IP health check. My crude method is to request https://ip.ipipgo.com/checkip through the proxy every half hour and raise an alert immediately if it returns anything other than the proxy's IP.
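That health check can be sketched with the standard library alone. I am assuming here that the check endpoint answers with just the IP as plain text; if it actually returns JSON or HTML, the parsing in `fetch_exit_ip` needs adjusting. Both function names are mine.

```python
import ipaddress
import urllib.request

CHECK_URL = "https://ip.ipipgo.com/checkip"  # endpoint mentioned above


def fetch_exit_ip(proxy_url, timeout=10):
    """Ask CHECK_URL, through the proxy, which IP it sees.

    Assumes the response body is the bare IP as plain text.
    """
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    opener = urllib.request.build_opener(handler)
    with opener.open(CHECK_URL, timeout=timeout) as resp:
        return resp.read().decode("utf-8", "replace").strip()


def proxy_is_healthy(proxy_url, real_ip, fetch=fetch_exit_ip):
    """True only if the check succeeds and does not report our own IP."""
    try:
        seen = fetch(proxy_url)
        ipaddress.ip_address(seen)  # raises ValueError if not an IP
    except (OSError, ValueError):
        # Network errors, timeouts, and non-IP responses all count as unhealthy.
        return False
    return seen != real_ip
```

Run `proxy_is_healthy(...)` from a scheduler every 30 minutes and send the alert whenever it returns `False`.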
Lastly, don't hard-code the proxy address! It is better to fetch proxies dynamically through an API. ipipgo's API supports fetching the latest proxies on demand in real time, so that even if a particular IP dies you can switch automatically.
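One way to structure this is a small wrapper that hands out the current proxy and fetches a fresh one when the old one is used up or reported dead. This is only a sketch: `ProxySource` is a hypothetical class of mine, and the actual API call is left to a `fetch_proxy` callable you supply, because the endpoint and response format depend on the provider.

```python
class ProxySource:
    """Hand out a proxy address, fetching a fresh one when needed.

    fetch_proxy is any zero-argument callable returning a proxy URL string,
    for example a function that calls your provider's API (not shown here).
    """

    def __init__(self, fetch_proxy, max_uses=50):
        self._fetch = fetch_proxy
        self._max_uses = max_uses
        self._current = None
        self._uses = 0

    def get(self):
        """Return the current proxy, rotating after max_uses handouts."""
        if self._current is None or self._uses >= self._max_uses:
            self._current = self._fetch()
            self._uses = 0
        self._uses += 1
        return self._current

    def mark_bad(self):
        """Drop the current proxy so the next get() fetches a new one."""
        self._current = None
```

Call `source.get()` each time you build a new driver, and `source.mark_bad()` when a request fails or a health check reports a problem.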
I recently found that some websites also check browser fingerprints. Randomizing the User-Agent at each startup, combined with ipipgo's mobile proxy IPs, makes the disguise far more convincing. Well, that's all the practical tips for today; if you have specific questions, feel free to ask!

