
What does the XPath contains() function do for you?
Anyone who has been crawling data for a while knows that web page elements keep changing around like a naughty child. This is when the contains() function earns its keep. It works like a fuzzy matcher for XPath: to find a div whose class attribute includes the word "price", you can simply write `//div[contains(@class,'price')]`, and it doesn't matter whether the class ends in "-new" or "-discount".
As an example, suppose the price tag of an e-commerce site uses the id `product-price` today and `item-price` tomorrow. With exact-match positioning, you would have to change the code every time. If you write `//span[contains(@id,'price')]` instead, the script can go for months without changes, provided the site does not also drop the word "price".
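The scenario above can be tried out in a few lines. This is a minimal sketch that assumes the third-party `lxml` package is installed; the two HTML snippets are made-up stand-ins for the same page on two different days.

```python
from lxml import html

# Two hypothetical snapshots of the same page: the id changes, "price" stays.
PAGE_TODAY = '<div><span id="product-price">19.99</span></div>'
PAGE_TOMORROW = '<div><span id="item-price">18.49</span></div>'

PRICE_XPATH = "//span[contains(@id,'price')]"


def extract_prices(page_source: str) -> list[str]:
    """Return the text of every span whose id contains 'price'."""
    tree = html.fromstring(page_source)
    return [node.text_content().strip() for node in tree.xpath(PRICE_XPATH)]


prices_today = extract_prices(PAGE_TODAY)
prices_tomorrow = extract_prices(PAGE_TOMORROW)
print(prices_today, prices_tomorrow)
```

The same expression matches both snapshots, which is the whole point: the selector depends only on the part of the id that stays stable.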
Proxy IPs and XPath work well together
Many newcomers don't realize that sending frequent requests while scraping data with XPath will get their IP blocked. This is where ipipgo's dynamic proxy pool comes in handy: rotating through their residential proxy IPs, combined with random request intervals, reduces the probability of being blocked.
| Scenario | Recommended proxy type |
|---|---|
| Daily data collection | Long-lived static proxies |
| High-frequency data scraping | Dynamic rotating proxies |
| Simulating real users | High-anonymity residential proxies |
Special note: when scraping with contains(), it is best to pair it with ipipgo's high-anonymity proxies. The XPath expression itself is evaluated locally on the downloaded HTML, so what a site can actually detect is the crawler's request pattern and the proxy it comes through. Last time, a customer scraping through an ordinary proxy had their requests blocked; switching to ipipgo's customized proxy solved the problem.
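The idea of rotating proxies with random request intervals can be sketched with the standard library alone. The proxy addresses, URLs, and the `plan_requests` helper below are all hypothetical placeholders, not ipipgo's API; the sketch only builds a schedule and does not send any requests.

```python
import itertools
import random

# Placeholder proxy endpoints; real ones would come from your proxy provider.
PROXY_POOL = [
    "http://proxy-a.example.com:8000",
    "http://proxy-b.example.com:8000",
    "http://proxy-c.example.com:8000",
]


def plan_requests(urls, proxies, min_delay=1.0, max_delay=4.0, rng=random):
    """Pair each URL with the next proxy in rotation and a random pause in seconds."""
    rotation = itertools.cycle(proxies)
    return [
        (url, next(rotation), rng.uniform(min_delay, max_delay))
        for url in urls
    ]


urls = [f"https://shop.example.com/item/{i}" for i in range(5)]
schedule = plan_requests(urls, PROXY_POOL)
for url, proxy, delay in schedule:
    # A real crawler would time.sleep(delay) here, then fetch `url` through `proxy`.
    print(f"{url} via {proxy} after {delay:.2f}s")
```

Keeping the schedule separate from the fetching makes it easy to tune the delay range without touching the download code.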
A practical guide to avoiding pitfalls
1. Don't treat contains() as a panacea, but do use it where it fits. For an element like `<div class="price-box special">`, where the class value has a space in the middle, write `contains(@class,'price')` instead of matching the whole string.
2. XPath matching is case-sensitive, as anyone who has fallen into this pit knows. It is safer to convert the text to lowercase with the translate() function first, for example:
`//*[contains(translate(text(),'ABCDEFGHIJKLMNOPQRSTUVWXYZ','abcdefghijklmnopqrstuvwxyz'),'iphone')]`
3. Remember to set up a whitelist for your proxy IPs. With ipipgo's enterprise proxy in particular, you have to bind your server IP in the dashboard before you can use it. One fellow forgot this and spent half a day debugging, convinced he had written the wrong XPath.
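The first two pitfalls are easy to reproduce. The snippet below is a small sketch, again assuming `lxml` is installed; the HTML is invented for the demonstration.

```python
from lxml import html

PAGE = (
    '<div>'
    '<div class="price-box special">99.00</div>'
    '<p>IPHONE 15 Pro</p>'
    '<p>iPhone 14</p>'
    '<p>Galaxy S24</p>'
    '</div>'
)
tree = html.fromstring(PAGE)

# Pitfall 1: an exact comparison needs the whole attribute value, spaces included.
exact = tree.xpath("//div[@class='price-box']")
fuzzy = tree.xpath("//div[contains(@class,'price')]")

# Pitfall 2: XPath 1.0 has no lower-case(), so translate() maps A-Z to a-z first.
UPPER = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
LOWER = "abcdefghijklmnopqrstuvwxyz"
case_sensitive = tree.xpath("//p[contains(text(),'iphone')]")
case_insensitive = tree.xpath(
    f"//p[contains(translate(text(),'{UPPER}','{LOWER}'),'iphone')]"
)

print(len(exact), len(fuzzy))
print(len(case_sensitive), [p.text for p in case_insensitive])
```

The exact match and the case-sensitive search both come back empty, while the contains() and translate() versions find the elements.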
Frequently Asked Questions
Q: The XPath is written correctly but it captures no data. What's wrong?
A: Eight times out of ten, the anti-crawling mechanism has been triggered. Suggestions:
1. Check that the request headers are complete
2. Reduce the crawl frequency
3. Switch to ipipgo's dynamic residential proxy
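For the first suggestion, a quick header check might look like the sketch below. The list of "expected" headers is an assumption chosen for illustration, since each site checks different things, and `missing_headers` is a hypothetical helper. The request is built with the standard library but never sent.

```python
import urllib.request

# Headers a typical browser sends; this particular set is an assumption.
EXPECTED_HEADERS = ("User-Agent", "Accept", "Accept-Language", "Referer")


def missing_headers(headers: dict[str, str]) -> list[str]:
    """Return the expected header names that are absent or empty (case-insensitive)."""
    present = {name.lower() for name, value in headers.items() if value}
    return [name for name in EXPECTED_HEADERS if name.lower() not in present]


headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Accept": "text/html,application/xhtml+xml",
}
gaps = missing_headers(headers)
print("Missing headers:", gaps)

# Fill the gaps, then build (but do not send) the request.
headers.update({
    "Accept-Language": "en-US,en;q=0.9",
    "Referer": "https://shop.example.com/",
})
request = urllib.request.Request("https://shop.example.com/item/1", headers=headers)
print("Still missing:", missing_headers(headers))
```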
Q: Does the contains function affect crawl speed?
A: It is indeed somewhat slower than an exact match, but ipipgo's dedicated proxies can make up for it. In actual tests with their 10M-bandwidth proxy, processing 100,000 records was about 30% faster.
Q: How do I optimize when using multiple contains() at the same time?
A: Try writing it like this: `//div[contains(@class,'box') and contains(@id,'item')]`. Pair it with ipipgo's intelligent routing feature, which automatically selects the node with the lowest latency.
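Here is a short sketch of the combined condition, assuming `lxml` is installed and using made-up HTML. It also shows the chained-predicate spelling, which selects the same nodes when no positional predicates are involved.

```python
from lxml import html

PAGE = (
    '<div>'
    '<div class="box large" id="item-1">A</div>'
    '<div class="box" id="banner">B</div>'
    '<div class="card" id="item-2">C</div>'
    '</div>'
)
tree = html.fromstring(PAGE)

# Both conditions must hold on the same element.
both = tree.xpath("//div[contains(@class,'box') and contains(@id,'item')]")

# Chaining two predicates filters step by step and gives the same result here.
chained = tree.xpath("//div[contains(@class,'box')][contains(@id,'item')]")

both_ids = [div.get("id") for div in both]
chained_ids = [div.get("id") for div in chained]
print(both_ids, chained_ids)
```

Only the first div satisfies both conditions; the banner fails the id test and the card fails the class test.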
One final rant: many sites have now added AI-based protection, and technical tricks alone are no longer enough. ipipgo's recently launched fingerprint browser proxy package can simulate a real browser environment and makes XPath crawling more stable. For those doing e-commerce price comparison in particular, this plan can save you a lot of hair.

