
What does XPath with class names really do?
Anyone who scrapes data regularly knows that the elements on a web page behave like chameleons, especially now that the web is full of
Here is a live example:
//div[contains(@class,'product-item')]
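To see what that expression actually selects, here is a minimal sketch that runs it with lxml against a small page. The markup is invented for illustration and does not come from any real site.

```python
from lxml import html

# Hypothetical markup, made up for this example.
SAMPLE = """
<html><body>
  <div class="product-item featured">Widget</div>
  <div class="product-item">Gadget</div>
  <div class="sidebar">Ads</div>
</body></html>
"""

tree = html.fromstring(SAMPLE)

# contains() does a substring test on the whole class attribute, so it matches
# both "product-item featured" and "product-item", but not "sidebar".
items = tree.xpath("//div[contains(@class,'product-item')]")
names = [el.text_content().strip() for el in items]
print(names)
```

The point of `contains` here is that an extra class such as `featured` does not break the match, which an exact `@class='product-item'` comparison would.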
How do proxy IPs and XPath work together?
Pairing ipipgo's proxy service with XPath for data scraping is like giving your crawler an invisibility cloak. Say you want to scrape prices from an e-commerce site: once its anti-scraping mechanism notices your frequent visits, it blocks your IP outright. With ipipgo's Dynamic Residential Proxies, each request goes out through a different exit IP, and combined with precise XPath targeting the success rate doubles.
Here is a real case: a customer scraping with a fixed IP was blocked within three days. After switching to ipipgo's rotating proxy, the crawler ran for two weeks straight with no anomalies, and crawl accuracy jumped from 48% to 92%.
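On the client side, a rotating gateway mostly means one thing: when a request looks blocked, send it again and let the gateway hand out a different exit IP. The sketch below shows that retry loop under several assumptions: the gateway host and port are the ones used in this article's tutorial, the proxy is a plain HTTP proxy, 403 and 429 are treated as "blocked", and `send` is a hypothetical callable you supply.

```python
import time
from urllib.parse import quote

# Host and port as used in this article's tutorial; treat them as placeholders.
GATEWAY_HOST = "gateway.ipipgo.com"
GATEWAY_PORT = 9020
# Assumption: these status codes mean the target site is blocking us.
BLOCK_STATUSES = {403, 429}


def build_proxies(username, password, host=GATEWAY_HOST, port=GATEWAY_PORT):
    """Return a requests-style proxies dict with URL-escaped credentials."""
    auth = f"{quote(username, safe='')}:{quote(password, safe='')}"
    proxy_url = f"http://{auth}@{host}:{port}"
    return {"http": proxy_url, "https": proxy_url}


def fetch_with_retries(url, send, proxies, max_attempts=3, pause=1.0,
                       sleep=time.sleep):
    """Call send(url, proxies) until the response is not a block status.

    `send` is a hypothetical callable returning (status_code, body).
    Each retry goes through the same gateway, which is assumed to pick
    a new exit IP per request.
    """
    for attempt in range(1, max_attempts + 1):
        status, body = send(url, proxies)
        if status not in BLOCK_STATUSES:
            return status, body
        if attempt < max_attempts:
            sleep(pause * attempt)  # wait a little longer after each block
    raise RuntimeError(f"still blocked after {max_attempts} attempts: {url}")
```

With requests, `send` could be a small function that calls `requests.get(url, proxies=proxies, timeout=10)` and returns `(resp.status_code, resp.content)`. Keeping the HTTP call outside the loop makes the retry logic easy to test without touching the network.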
Three class-name pitfalls to avoid
1. Beware of class names with spaces: e.g.
2. Dynamically generated class names: for something like class="ui-component-12345", match on the fixed part, e.g. //*[contains(@class,'ui-component-')]
3. Multiple matches: validate the expression in your browser's developer tools first, and make sure the XPath does not match more than one element when you expect exactly one
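The three pitfalls can be checked in a few lines of lxml. This is a sketch on invented markup: the class names besides `ui-component-12345` and the helper `exactly_one` are hypothetical, and the token-match expression is a common XPath 1.0 idiom rather than anything specific to this article.

```python
from lxml import html

# Hypothetical markup, made up for this example.
SAMPLE = """
<html><body>
  <div class="card product-item">A</div>
  <div class="product-item-old">B</div>
  <div class="ui-component-12345">C</div>
  <div class="ui-component-67890">D</div>
</body></html>
"""
tree = html.fromstring(SAMPLE)

# Pitfall 1: a class attribute is a space-separated list. A plain substring
# test also catches "product-item-old", which is a different class.
loose = tree.xpath("//div[contains(@class, 'product-item')]")

# Padding the attribute with spaces turns the substring test into a
# whole-token test, so only the real "product-item" class matches.
strict = tree.xpath(
    "//div[contains(concat(' ', normalize-space(@class), ' '), ' product-item ')]"
)

# Pitfall 2: match the stable prefix of a generated class name.
dynamic = tree.xpath("//*[contains(@class, 'ui-component-')]")


def exactly_one(nodes, label):
    """Pitfall 3: fail loudly unless the XPath matched exactly one element."""
    if len(nodes) != 1:
        raise ValueError(f"{label}: expected 1 match, got {len(nodes)}")
    return nodes[0]


print(len(loose), len(strict), len(dynamic))
```

Calling `exactly_one(strict, "product")` returns the single element, while `exactly_one(loose, "product")` raises, which is the signal to tighten the expression before running it at scale.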
A hands-on configuration walkthrough
Take Python with an ipipgo proxy as an example:
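import requests
from lxml import html

# Replace username and password with your own ipipgo credentials
proxies = {
    'http': 'http://username:password@gateway.ipipgo.com:9020',
    'https': 'https://username:password@gateway.ipipgo.com:9020'
}
# Replace 'target url' with the page you want to fetch
resp = requests.get('target url', proxies=proxies, timeout=10)
resp.raise_for_status()
tree = html.fromstring(resp.content)
# Here's the key ↓↓
# The price is the text node that follows the currency-symbol span
matches = tree.xpath('//span[contains(@class, "price-symbol")]/following-sibling::text()')
price = matches[0].strip() if matches else None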
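Before spending proxy traffic, it can help to check the price XPath offline. The sketch below wraps the same expression in a hypothetical helper, `extract_price`, and runs it on invented markup. It assumes the price is a bare text node sitting right after the `price-symbol` span, as the expression implies.

```python
from lxml import html

PRICE_XPATH = '//span[contains(@class, "price-symbol")]/following-sibling::text()'


def extract_price(page_bytes):
    """Return the first non-blank text after the price-symbol span, or None."""
    tree = html.fromstring(page_bytes)
    for text in tree.xpath(PRICE_XPATH):
        if text.strip():
            return text.strip()
    return None


# Hypothetical markup, made up for this example.
SAMPLE = (
    b'<html><body><div class="price">'
    b'<span class="price-symbol">$</span>19.99'
    b'</div></body></html>'
)

print(extract_price(SAMPLE))
```

Returning `None` instead of indexing with `[0]` means a page without a price, such as a block page served by the site, does not crash the crawler with an `IndexError`.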
Five Questions You're Sure to Ask
Q: What should I do if the class name changes every day?
A: Look for the pattern in how it changes. If that fails, use ipipgo's JS Rendering Proxy Service, which can handle dynamically loaded content.
Q: What if my XPath matches more than one element?
A: Add layers of positioning: first locate an outer div with a stable feature, then work inward from it.
Q: Why are ipipgo's proxies not easily blocked?
A: It uses a pool of real residential IPs, each carrying real user behavior characteristics, which makes them considerably more reliable than datacenter IPs.
Q: What if XPath is inefficient?
A: Combine it with CSS selectors and reserve the contains function for the key positions. ipipgo's Exclusive High Speed Proxy also gives a speed boost.
Q: What should I do if I encounter a CAPTCHA?
A: ipipgo's proxy IPs come with a cookie management function. Combined with request header randomization, this can significantly reduce how often CAPTCHAs are triggered.
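For the question above about an XPath matching more than one element, layered positioning can be sketched as follows. The markup and the `main-product` id are invented for illustration. The idea is to anchor on a stable outer container first, then run a relative XPath inside it.

```python
from lxml import html

# Hypothetical markup, made up for this example.
SAMPLE = """
<html><body>
  <div id="recommend"><span class="price">9.99</span></div>
  <div id="main-product">
    <div class="info"><span class="price">19.99</span></div>
  </div>
</body></html>
"""
tree = html.fromstring(SAMPLE)

# A page-wide search is ambiguous: it finds both price spans.
all_prices = tree.xpath("//span[contains(@class, 'price')]")

# Layer 1: locate the outer container by a stable feature.
containers = tree.xpath("//div[@id='main-product']")

# Layer 2: search only inside that container. The leading "." keeps the
# expression relative to the container instead of the whole document.
main_price = None
if containers:
    hits = containers[0].xpath(".//span[contains(@class, 'price')]/text()")
    main_price = hits[0].strip() if hits else None

print(len(all_prices), main_price)
```

Forgetting the leading `.` is an easy mistake: `//span[...]` called on the container still searches the whole document and brings the ambiguity back.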
Why do you recommend ipipgo?
Let the test data speak: we compared three proxy providers, using the same XPath script to scrape data from one platform.
| Service provider | Success rate | Block rate |
|---|---|---|
| ipipgo | 95% | 2% |
| Company A | 78% | 15% |
| Company B | 82% | 22% |
Special mention goes to its class name whitelisting feature, which can preset common class name rules to adapt automatically to different website structures. This is unique among similar products.

