
How does XPath's contains() really work?
If you do data collection, you know that locating an element on a web page can feel like finding a needle in a haystack. XPath's contains() function is your magnet, especially when the element's characteristics aren't obvious. For example, to find every div whose text includes the word "price", writing //div[contains(text(),'price')] is far more flexible than matching against the full text.
//*[contains(@class,'btn_submit')]      # find elements whose class contains the submit-button style
//a[contains(@href,'product_detail')]   # grab product detail page links
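Plugged into lxml, the second expression above might look like the following sketch (the HTML snippet is invented for illustration):

```python
from lxml import etree

# a made-up fragment standing in for a real product listing page
html = etree.HTML(
    '<div><a href="/product_detail/123">phone</a>'
    '<a href="/help">help</a></div>'
)

# contains(@href, ...) does a substring match on the attribute value
links = html.xpath("//a[contains(@href,'product_detail')]/@href")
```

Only the first link matches, because only its href contains the substring 'product_detail'.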
How do proxy IPs and XPath work together?
Many websites have remarkably sharp anti-scraping mechanisms: hit them frequently from the same IP and you'll be blacklisted in no time. That's where ipipgo's dynamic residential proxies come in; their IP pool is refreshed with 8,000+ nodes per day. Say you want to collect price data from an e-commerce site:
```python
import requests
from lxml import etree

# route both http and https traffic through the ipipgo gateway
proxies = {
    'http': 'http://user:pass@gateway.ipipgo.com:9021',
    'https': 'http://user:pass@gateway.ipipgo.com:9021',
}

resp = requests.get('https://xxx.com', proxies=proxies, timeout=10)
html = etree.HTML(resp.text)
# returns element nodes; append /text() to get the strings instead
prices = html.xpath('//span[contains(@class, "price")]')
```
A practical guide to avoiding pitfalls
I've hit this pitfall myself: a site hid the real price in a data-price attribute, while the visible text just showed "¥??". Relying on text() to locate it would be a bust; you have to write it this way:
//div[@id='goods']/@data-price   # extract the attribute value directly
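In lxml, an attribute lookup like the one above returns the attribute values as plain strings. A minimal sketch, with an invented HTML fragment standing in for the real page:

```python
from lxml import etree

# the visible text is a placeholder; the real price hides in data-price
html = etree.HTML('<div id="goods" data-price="199.00">¥??</div>')

# /@data-price selects the attribute itself, not the element
prices = html.xpath("//div[@id='goods']/@data-price")
```

Here prices is a list of strings, so remember to convert to float before doing arithmetic.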
With ipipgo's intelligent rotation strategy set to switch IPs automatically every 5 minutes, my collection success rate jumped from 50% straight to 95%. You can also monitor the status of each IP in their dashboard, which is genuinely worry-free.
Questions you probably want to ask
Q: Is contains() case sensitive?
A: Yes, it is! To find "PRICE" you have to write 'PRICE'. We suggest using the translate() function to convert to lowercase first.
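XPath 1.0 has no lowercase() function, so translate() with the two alphabets spelled out is the standard workaround. A sketch with an invented fragment:

```python
from lxml import etree

html = etree.HTML('<div><span>PRICE: 42</span><span>name</span></div>')

# translate() maps each uppercase letter to its lowercase counterpart,
# making the contains() check effectively case-insensitive
LOWER = ("translate(text(),"
         "'ABCDEFGHIJKLMNOPQRSTUVWXYZ',"
         "'abcdefghijklmnopqrstuvwxyz')")
hits = html.xpath(f"//span[contains({LOWER}, 'price')]/text()")
```

The span containing "PRICE: 42" matches even though the query string is lowercase.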
Q: How do I handle dynamically loaded content?
A: First use ipipgo's high-anonymity proxies to get past the anti-scraping layer, then pair them with a tool like Selenium and wait for the element to finish loading before grabbing it.
Q: Do ipipgo's IPs last long enough?
A: In my tests a single IP stays usable for 10-30 minutes, which is plenty for routine collection. For long-running tasks, it's best to use their API to pull fresh IPs automatically.
Why ipipgo?
After comparing several proxy providers, I found ipipgo has three hardcore advantages:
| Feature | Typical proxy | ipipgo |
|---|---|---|
| IP type | Mostly datacenter IPs | Real residential IPs |
| Concurrency | 50-thread cap | Unlimited |
| Geolocation | Fixed cities | Pick the base-station location on demand |
Last week I was helping a client with a price-comparison crawl and used their Shanghai local IPs to visit the target site; it was actually 3x faster than an ordinary proxy. I later learned they have direct channels with the three major carriers, which is genuinely professional.
The Ultimate Combo
Finally, I'll share a private configuration plan:
- Create a persistent-session proxy in the ipipgo console
- Write the XPath as //*[contains(@id,'result_')] to match dynamic IDs
- Set up 3 retries on failure plus automatic IP switching
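The retry-plus-switch step could be sketched like this. The gateway URLs and the idea of cycling endpoints to force an IP change are assumptions for illustration; check how your ipipgo plan actually exposes sessions:

```python
import itertools

# Hypothetical gateway endpoints; cycling them is one simple way
# to get a different exit IP after a failure (illustrative only).
GATEWAYS = itertools.cycle([
    'http://user:pass@gateway.ipipgo.com:9021',
    'http://user:pass@gateway.ipipgo.com:9022',
])

def fetch_with_retry(url, retries=3, get=None):
    """Fetch `url`, retrying up to `retries` times and switching
    to the next proxy gateway after every failed attempt."""
    if get is None:                 # default to requests.get
        import requests
        get = requests.get
    last_err = None
    for _ in range(retries):
        proxy = next(GATEWAYS)
        try:
            resp = get(url, proxies={'http': proxy, 'https': proxy},
                       timeout=10)
            resp.raise_for_status()  # treat HTTP errors as failures too
            return resp
        except Exception as err:
            last_err = err           # rotate proxy and try again
    raise last_err
```

Passing the HTTP function in as a parameter also makes the retry logic easy to test without touching the network.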
This combo has been tested collecting an average of 100,000 records a day without a hitch. For cross-border e-commerce sellers especially, pairing their overseas native IPs with XPath positioning makes grabbing competitor data a sure thing.

