
Why Do Crawlers Always Get Caught? Try the Proxy IP + Class Selector Combo
Anyone who does data scraping knows the pain: a site changes one class name and your script is dead in minutes. I recently found a neat trick: pairing proxy IPs with XPath class selectors gives your crawler a kind of smart disguise. For example, with a fuzzy match like //div[contains(@class,'item')], even if the site renames the class from "item-1" to "item_new", you can still catch the data.
import requests
from lxml import html

proxies = {
    'http': 'http://user:pass@ipipgo-proxy:9020',
    'https': 'http://user:pass@ipipgo-proxy:9020'
}
response = requests.get('https://target.com', proxies=proxies)
tree = html.fromstring(response.text)
# Fuzzy-match any div whose class contains 'item'
items = tree.xpath("//div[contains(@class,'item')]/text()")
The key here is ipipgo's dynamic residential proxies: its pool holds over 2 million real home network IP addresses. Last time I combined them with a class selector, the scraper ran for a week straight without triggering any anti-crawl defenses, far more stable than data-center IPs.
II. Three killer techniques for class selectors
Don't naively hard-code the full class name; these three techniques will keep your selectors from breaking:
| Technique | Example | Use case |
|---|---|---|
| Fuzzy matching | contains(@class,'part') | Class names with dynamic suffixes |
| Multiple filters | [contains(@class,'a') and contains(@class,'b')] | Elements with composite classes |
| Hierarchical positioning | //div[@class='wrap']//li[contains(@class,'item')] | Nested structures |
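As a quick sanity check, all three techniques can be exercised with lxml. The HTML snippet below is invented purely for illustration; the class names and suffixes stand in for whatever the real site uses:

```python
from lxml import html

# Sample markup standing in for a page whose class names carry dynamic suffixes
doc = html.fromstring("""
<div class="wrap">
  <li class="item-20240101">first</li>
  <li class="item-20240102 featured">second</li>
  <div class="card promo">ad</div>
</div>
""")

# 1. Fuzzy matching: survives the dynamic '-20240101' suffix
fuzzy = doc.xpath("//li[contains(@class,'item')]/text()")

# 2. Multiple filters: the element must carry both classes
both = doc.xpath("//li[contains(@class,'item') and contains(@class,'featured')]/text()")

# 3. Hierarchical positioning: anchor on a stable wrapper, then match loosely inside it
nested = doc.xpath("//div[@class='wrap']//li[contains(@class,'item')]/text()")

print(fuzzy)   # ['first', 'second']
print(both)    # ['second']
print(nested)  # ['first', 'second']
```

Note that contains() does plain substring matching, so 'item' also matches classes like 'item-hidden'; filter those out explicitly if they matter (see the pitfalls section).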
Note that class names can hide traps; for example, a certain e-commerce giant's product classes carry timestamps. In that case, use ipipgo's rotating proxies, which switch IPs automatically every 5 minutes; combined with fuzzy matching, the crawler is rock solid.
III. The right way to use proxy IPs
I've seen too many people waste perfectly good proxies. Remember these three things:
- Don't use free proxies: 8 out of 10 are honeypots, and the other 2 are slower than snails.
- Match the protocol: don't route HTTPS pages through an HTTP-only proxy, or your real IP can leak.
- Set a timeout: 3-5 seconds is recommended; once it's exceeded, switch to another IP directly.
Take ipipgo's proxies as an example: they support both SOCKS5 and HTTPS protocols. A recommended configuration:
PROXY_POOL = [
    "socks5://user:pass@us1.ipipgo.io:1080",
    "https://user:pass@eu1.ipipgo.io:8443"
]
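The "3-5 second timeout, then switch IPs" rule can be sketched as a small rotation helper. This is only an illustrative sketch: the pool entries, credentials, and hostnames are placeholders, and the `get` parameter exists so the fetch function can be swapped out for testing:

```python
import itertools
import requests

# Placeholder pool; socks5:// URLs additionally need the optional
# requests[socks] extra installed when actually connecting.
PROXY_POOL = [
    "socks5://user:pass@us1.ipipgo.io:1080",
    "https://user:pass@eu1.ipipgo.io:8443",
]

def fetch_with_rotation(url, pool=PROXY_POOL, timeout=5, max_tries=4,
                        get=requests.get):
    """Try each proxy in turn; on any request error, rotate to the next IP."""
    rotation = itertools.cycle(pool)
    last_err = None
    for _ in range(max_tries):
        proxy = next(rotation)
        try:
            return get(url, proxies={"http": proxy, "https": proxy},
                       timeout=timeout)
        except requests.RequestException as err:
            last_err = err  # this IP timed out or errored; fall through to the next
    raise RuntimeError(f"all proxies failed: {last_err}")
```

With an auto-switching proxy package, the provider handles the rotation server-side and this loop reduces to a single retry.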
IV. A practical guide to avoiding pitfalls
I recently helped a friend scrape a job-listing site and hit a strange problem: the class selector was in place, yet the results still came out wrong. It turned out the site had planted decoy content inside <div class="item item-hidden"> elements. The fix is simple:
# Match 'item' while excluding the hidden 'item-hidden' decoys
items = tree.xpath("//div[contains(@class,'item') and not(contains(@class,'hidden'))]")
At this stage, an ordinary proxy gets blocked after frequent retries. Switching to ipipgo's long-lived static residential IPs, where a single IP lasts 6 hours, combined with this precise selector, pushed the success rate close to 100%.
V. Frequently asked questions
Q: What should I do if the class selector always fails to match?
A: First check whether the element lives inside an iframe, then copy the XPath from the browser's developer tools to verify it. Using ipipgo's high-anonymity proxies is also recommended, to avoid getting blocked while you debug.
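When the element does sit inside an iframe, the outer page only contains the frame tag, so the selector finds nothing; you have to fetch the iframe's own URL. A minimal sketch, where the markup and `/embed/list.html` path are invented for illustration:

```python
from lxml import html
from urllib.parse import urljoin

# Stand-in for the outer page: the data lives behind the iframe, not in this HTML
page = html.fromstring('<div><iframe src="/embed/list.html"></iframe></div>')

# Pull the iframe's source and resolve it against the page URL
iframe_src = page.xpath("//iframe/@src")[0]
iframe_url = urljoin("https://target.com/jobs", iframe_src)
print(iframe_url)  # https://target.com/embed/list.html

# Then request iframe_url (through your proxy) and run the class selector on that response.
```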
Q: What should I do if my proxy IP suddenly fails?
A: Add an exception-retry mechanism to your request code. ipipgo's automatic-switching package is recommended: on failure it cuts over to the next IP automatically.
Q: What if I run into a large number of CAPTCHAs?
A: Reduce your request frequency and use a proxy paired with browser-fingerprint control. ipipgo's premium proxies support customized User-Agent strings, which effectively lowers the CAPTCHA trigger rate.
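Rotating the User-Agent alongside the proxy is straightforward with requests. In this sketch the UA strings, credentials, and proxy address are illustrative placeholders, not real values:

```python
import random
import requests

# Illustrative UA strings; in practice use current, realistic browser strings
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
]

def build_request_kwargs(proxy="http://user:pass@ipipgo-proxy:9020"):
    """Pair a randomly chosen User-Agent with the proxy settings and a timeout."""
    return {
        "headers": {"User-Agent": random.choice(USER_AGENTS)},
        "proxies": {"http": proxy, "https": proxy},
        "timeout": 5,
    }

kwargs = build_request_kwargs()
# response = requests.get("https://target.com", **kwargs)
```

Combined with a lower request rate, each request now presents a different browser identity from a different IP.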
VI. The ultimate solution
Packaging proxy IPs and smart parsing into a single service is the way forward. For example, with ipipgo's API gateway service, you pass an XPath expression directly and get back cleaned data. Both proxy management and HTML parsing are handled for you, which makes it a good fit when you need results fast.
import requests

api_url = "https://gateway.ipipgo.com/v1/extract"
params = {
    "url": "https://target.com",
    "xpath": "//div[contains(@class,'price')]",
    "api_key": "your_ipgo_key"
}
response = requests.get(api_url, params=params)
print(response.json()['data'])
This approach shifts the complexity onto the service provider so you can focus on business logic. It's especially suited to multi-region data collection, such as fetching price information from several regions at the same time.

