
Hands-On: Scraping Data with Selenium and Proxy IPs
Anyone who works on crawlers knows that anti-scraping measures keep getting stricter. Recently a friend in e-commerce told me that their Selenium script for tracking competitors' prices kept getting its IP banned, which was driving them up the wall. In this post we'll walk through how to solve this pain point with Selenium, Python regular expressions, and proxy IPs.
Why do you need a proxy IP at all?
Here's a real example: one e-commerce platform blacklists any IP that makes 20 requests in a row. If you instead use ipipgo's dynamic residential proxies and switch to an IP in a different region for each request, the site can no longer tell whether it's dealing with a real person or a machine.
| Metric | Without proxy | With ipipgo proxy |
|---|---|---|
| Requests per hour | Banned after ~50 | 1000+ sustained |
| Data integrity | Frequent interruptions | Complete collection |
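To make the "different IP for each request" idea concrete, here is a minimal round-robin sketch. The proxy addresses and the get_next_proxy helper are invented for illustration; in practice you would fill the pool from your provider's API.

```python
from itertools import cycle

# Hypothetical proxy pool; in practice, fetch these from your provider's API
PROXY_POOL = [
    "vipuser:123456@45.76.89.12:8080",
    "vipuser:123456@45.76.89.13:8080",
    "vipuser:123456@45.76.89.14:8080",
]

_rotation = cycle(PROXY_POOL)

def get_next_proxy() -> str:
    """Return the next proxy in round-robin order, one per request."""
    return next(_rotation)

# Consecutive requests leave the target site from different IPs
first = get_next_proxy()
second = get_next_proxy()
print(first == second)  # False
```

Round-robin is the simplest policy; random choice or weighting by measured latency works just as well with the same pool.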
Here's what the actual code looks like
First, understand the core trio: Selenium drives the browser, regular expressions extract the data, and proxy IPs keep you safe from bans. Let's focus on the proxy configuration:
```python
from selenium import webdriver

# ipipgo proxy format: account:password@ip:port
proxy = "vipuser:123456@45.76.89.12:8080"

options = webdriver.ChromeOptions()
options.add_argument(f'--proxy-server=http://{proxy}')
# Note: Chrome ignores credentials embedded in --proxy-server; for an
# authenticated proxy, use IP whitelisting or a tool such as selenium-wire.

# Remember to add exception handling! Sometimes the proxy times out
try:
    driver = webdriver.Chrome(options=options)
    driver.get("https://目标网站.com")
except Exception as e:
    print("Proxy connection failed:", e)
```
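Since Chrome won't read the username and password out of --proxy-server, you often need the pieces of the account:password@ip:port string separately (for an IP whitelist, or to hand to selenium-wire). A small helper for splitting it; the field names in the returned dict are my own choice:

```python
from urllib.parse import urlsplit

def parse_proxy(proxy: str) -> dict:
    """Split an 'account:password@ip:port' proxy string into its parts."""
    parts = urlsplit(f"http://{proxy}")
    return {
        "user": parts.username,
        "password": parts.password,
        "host": parts.hostname,
        "port": parts.port,
    }

creds = parse_proxy("vipuser:123456@45.76.89.12:8080")
print(creds["host"], creds["port"])  # 45.76.89.12 8080
```

Using urlsplit avoids hand-rolled string slicing and handles the edge cases (missing port, @ in the password would need URL-encoding) the same way the rest of the HTTP stack does.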
Pitfall alert: many tutorials point people at free proxies, and the result is IPs that are either dead or slow as molasses. I recommend going straight to a paid ipipgo package; the response time of their dedicated IP pool can stay under 200 ms.
Here's how the regular expressions work
Once you have the page source, extract the price data with a regex like this:
```python
import re

# Match prices in the format ¥12.34
price_pattern = r'¥(\d+\.\d{2})'
prices = re.findall(price_pattern, page_source)

# For prices with thousands separators like ¥1,234.56, write it this way
advanced_pattern = r'¥((?:\d{1,3},)*\d+\.\d{2})'
```
Don't underestimate that decimal-point match: some sites deliberately insert invisible characters into the price. That's when you reach for \s* to skip the whitespace: r'¥\s*(\d+)\s*\.\s*(\d{2})'
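A quick self-contained check of all three patterns; the sample page_source string here is invented for the demonstration:

```python
import re

page_source = 'Item A: ¥12.34  Item B: ¥1,234.56  Item C: ¥ 99 . 00'

simple = re.findall(r'¥(\d+\.\d{2})', page_source)            # plain prices only
comma = re.findall(r'¥((?:\d{1,3},)*\d+\.\d{2})', page_source)  # also thousands separators
spaced = re.findall(r'¥\s*(\d+)\s*\.\s*(\d{2})', page_source)   # tolerates stray whitespace

print(simple)  # ['12.34']
print(comma)   # ['12.34', '1,234.56']
print(spaced)  # [('12', '34'), ('99', '00')]
```

Note that each pattern covers a different quirk: no single regex above catches both the comma case and the whitespace case, so match against real source samples before settling on one.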
Answers to frequently asked questions
Q: Why use Selenium instead of requests?
A: A lot of website data these days is loaded dynamically by JavaScript. requests alone can't retrieve the complete data, so you need a browser to render the page first.
Q: How do I choose an ipipgo package?
A: For small-scale testing, go with pay-as-you-go; for long-term projects, pick an enterprise custom package. Their tech support can help with tuning.
Q: What should I do when my regex won't match?
A: First print(page_source) and look at the actual content. Don't trust what the rendered page shows; the source may contain hidden tags.
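To illustrate the hidden-tags point: the source may bury junk inside a price using elements that never show on screen. A sketch, with an invented HTML snippet, of stripping tags before matching:

```python
import re

# Invented example: the rendered page shows "¥12.34", but the source
# hides extra digits inside an invisible span, splitting the number
page_source = '<b>¥12</b><span style="display:none">00</span>.34'

# Matching the raw source fails because tags sit inside the number
print(re.findall(r'¥(\d+\.\d{2})', page_source))  # []

# Drop the hidden elements first, then strip the remaining tags
cleaned = re.sub(r'<span style="display:none">.*?</span>', '', page_source)
cleaned = re.sub(r'<[^>]+>', '', cleaned)
print(re.findall(r'¥(\d+\.\d{2})', cleaned))  # ['12.34']
```

For anything beyond a quick hack, an HTML parser such as BeautifulSoup is sturdier than regex-based tag stripping, but the diagnosis step is the same: look at the source, not the screen.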
A few words from the heart
Last year, while helping a friend with a data-collection job, I nearly wrecked the project by using a free proxy. After switching to ipipgo's mixed dial-up proxies, collection throughput tripled thanks to their IP-rotation API. For work with strict real-time requirements like price monitoring, a stable proxy is the lifeline.
One last piece of advice: don't skimp on proxies! The damage from one banned account is enough to pay for six months of service. Right now the promo code SELENIUM666 gets you 10% off on the ipipgo website, and new users can claim a free 3-day trial, so take the freebies you're entitled to.

