
Why Do Crawlers Always Get Caught? Try the Proxy IP + Class Selector Combo
Anyone who does data scraping knows the pain: a site changes one class name and your script is dead in minutes. I recently found a neat trick: pairing proxy IPs with XPath class selectors gives your crawler a kind of smart disguise. For example, with a fuzzy match like //div[contains(@class,'item')], even if the site renames the class from "item-1" to "item_new", you can still catch the data.
import requests
from lxml import html

proxies = {
    'http': 'http://user:pass@ipipgo-proxy:9020',
    'https': 'http://user:pass@ipipgo-proxy:9020'
}
response = requests.get('https://target.com', proxies=proxies)
tree = html.fromstring(response.text)
# Fuzzy-match any div whose class contains 'item'
items = tree.xpath("//div[contains(@class,'item')]/text()")
The key here is ipipgo's dynamic residential proxies: its pool holds over 2 million real home network IP addresses. Last time I combined them with a class selector, the scraper ran for a week straight without triggering any anti-crawl defenses, far more stable than data-center IPs.
II. Three killer techniques for class selectors
Don't naively hard-code the full class name; these three techniques will keep your selectors from breaking:
| Technique | Example | Use case |
|---|---|---|
| Fuzzy matching | contains(@class,'part') | Class names with dynamic suffixes |
| Multiple filters | [contains(@class,'a') and contains(@class,'b')] | Elements with composite classes |
| Hierarchical positioning | //div[@class='wrap']//li[contains(@class,'item')] | Nested structures |
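As a quick sanity check, all three techniques can be exercised with lxml. The HTML snippet below is invented purely for illustration; the class names and suffixes stand in for whatever the real site uses:

```python
from lxml import html

# Sample markup standing in for a page whose class names carry dynamic suffixes
doc = html.fromstring("""
<div class="wrap">
  <li class="item-20240101">first</li>
  <li class="item-20240102 featured">second</li>
  <div class="card promo">ad</div>
</div>
""")

# 1. Fuzzy matching: survives the dynamic '-20240101' suffix
fuzzy = doc.xpath("//li[contains(@class,'item')]/text()")

# 2. Multiple filters: the element must carry both classes
both = doc.xpath("//li[contains(@class,'item') and contains(@class,'featured')]/text()")

# 3. Hierarchical positioning: anchor on a stable wrapper, then match loosely inside it
nested = doc.xpath("//div[@class='wrap']//li[contains(@class,'item')]/text()")

print(fuzzy)   # ['first', 'second']
print(both)    # ['second']
print(nested)  # ['first', 'second']
```

Note that contains() does plain substring matching, so 'item' also matches classes like 'item-hidden'; filter those out explicitly if they matter (see the pitfalls section).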
Note that class names can hide traps; for example, a certain e-commerce giant's product classes carry timestamps. In that case, use ipipgo's rotating proxies, which switch IPs automatically every 5 minutes; combined with fuzzy matching, the crawler is rock solid.
III. The right way to use proxy IPs
I've seen too many people waste perfectly good proxies. Remember these three things:
- Don't use free proxies: 8 out of 10 are honeypots, and the other 2 are slower than snails.
- Match the protocol: don't route HTTPS pages through an HTTP-only proxy, or your real IP can leak.
- Set a timeout: 3-5 seconds is recommended; once it's exceeded, switch to another IP directly.
Take ipipgo's proxies as an example: they support both SOCKS5 and HTTPS protocols. A recommended configuration:
PROXY_POOL = [
    "socks5://user:pass@us1.ipipgo.io:1080",
    "https://user:pass@eu1.ipipgo.io:8443"
]
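The "3-5 second timeout, then switch IPs" rule can be sketched as a small rotation helper. This is only an illustrative sketch: the pool entries, credentials, and hostnames are placeholders, and the `get` parameter exists so the fetch function can be swapped out for testing:

```python
import itertools
import requests

# Placeholder pool; socks5:// URLs additionally need the optional
# requests[socks] extra installed when actually connecting.
PROXY_POOL = [
    "socks5://user:pass@us1.ipipgo.io:1080",
    "https://user:pass@eu1.ipipgo.io:8443",
]

def fetch_with_rotation(url, pool=PROXY_POOL, timeout=5, max_tries=4,
                        get=requests.get):
    """Try each proxy in turn; on any request error, rotate to the next IP."""
    rotation = itertools.cycle(pool)
    last_err = None
    for _ in range(max_tries):
        proxy = next(rotation)
        try:
            return get(url, proxies={"http": proxy, "https": proxy},
                       timeout=timeout)
        except requests.RequestException as err:
            last_err = err  # this IP timed out or errored; fall through to the next
    raise RuntimeError(f"all proxies failed: {last_err}")
```

With an auto-switching proxy package, the provider handles the rotation server-side and this loop reduces to a single retry.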
IV. A practical guide to avoiding pitfalls
I recently helped a friend scrape a job-listing site and hit a strange problem: the class selector was in place, yet the results still came out wrong. It turned out the site had planted decoy content inside <div class="item item-hidden"> elements. The fix is simple:
# Match 'item' while excluding the hidden 'item-hidden' decoys
items = tree.xpath("//div[contains(@class,'item') and not(contains(@class,'hidden'))]")
At this stage, an ordinary proxy gets blocked after frequent retries. Switching to ipipgo's long-lived static residential IPs, where a single IP lasts 6 hours, combined with this precise selector, pushed the success rate close to 100%.
V. Frequently asked questions
Q: What should I do if the class selector always fails to match?
A: First check whether the element lives inside an iframe, then copy the XPath from the browser's developer tools to verify it. Using ipipgo's high-anonymity proxies is also recommended, to avoid getting blocked while you debug.
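When the element does sit inside an iframe, the outer page only contains the frame tag, so the selector finds nothing; you have to fetch the iframe's own URL. A minimal sketch, where the markup and `/embed/list.html` path are invented for illustration:

```python
from lxml import html
from urllib.parse import urljoin

# Stand-in for the outer page: the data lives behind the iframe, not in this HTML
page = html.fromstring('<div><iframe src="/embed/list.html"></iframe></div>')

# Pull the iframe's source and resolve it against the page URL
iframe_src = page.xpath("//iframe/@src")[0]
iframe_url = urljoin("https://target.com/jobs", iframe_src)
print(iframe_url)  # https://target.com/embed/list.html

# Then request iframe_url (through your proxy) and run the class selector on that response.
```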
Q: What should I do if my proxy IP suddenly fails?
A: Add an exception-retry mechanism to your request code. ipipgo's automatic-switching package is recommended: on failure it cuts over to the next IP automatically.
Q: What if I run into a large number of CAPTCHAs?
A: Reduce your request frequency and use a proxy paired with browser-fingerprint control. ipipgo's premium proxies support customized User-Agent strings, which effectively lowers the CAPTCHA trigger rate.
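Rotating the User-Agent alongside the proxy is straightforward with requests. In this sketch the UA strings, credentials, and proxy address are illustrative placeholders, not real values:

```python
import random
import requests

# Illustrative UA strings; in practice use current, realistic browser strings
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
]

def build_request_kwargs(proxy="http://user:pass@ipipgo-proxy:9020"):
    """Pair a randomly chosen User-Agent with the proxy settings and a timeout."""
    return {
        "headers": {"User-Agent": random.choice(USER_AGENTS)},
        "proxies": {"http": proxy, "https": proxy},
        "timeout": 5,
    }

kwargs = build_request_kwargs()
# response = requests.get("https://target.com", **kwargs)
```

Combined with a lower request rate, each request now presents a different browser identity from a different IP.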
VI. The ultimate solution
Packaging proxy IPs and smart parsing into a single service is the way forward. For example, with ipipgo's API gateway service, you pass an XPath expression directly and get back cleaned data. Both proxy management and HTML parsing are handled for you, which makes it a good fit when you need results fast.
import requests

api_url = "https://gateway.ipipgo.com/v1/extract"
params = {
    "url": "https://target.com",
    "xpath": "//div[contains(@class,'price')]",
    "api_key": "your_ipgo_key"
}
response = requests.get(api_url, params=params)
print(response.json()['data'])
This approach shifts the complexity onto the service provider so you can focus on business logic. It's especially suited to multi-region data collection, such as fetching price information from several regions at the same time.

