
Hands-On Fuzzy Matching with XPath's contains
Anyone who does data scraping for a living knows the scene: some page elements are as slippery as loaches, and the contains function is the bamboo basket that catches them. Today we walk through real cases to show how to use this tool together with proxy IPs.
I. Basic usage of XPath contains
The contains function is, quite frankly, a keyword detector. The format looks like this: //div[contains(text(),'keyword')]. For example, say you want to grab the price of an item, but the page shows it in several variants:
| Web page source code | Corresponding XPath |
|---|---|
| Current price: ¥199 | //span[contains(text(),'Current price')] |
| Special price ¥168 | //em[contains(text(),'price')] |
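
To make the table concrete, here is a minimal sketch in Python using lxml. The library choice and the HTML fragment are assumptions for illustration, not something the setup above prescribes. It runs contains expressions of the same shape against two price variants.

```python
from lxml import html

# Hypothetical page fragment; the tags and text are made up for illustration.
PAGE = """
<div>
  <span>Current price: ¥199</span>
  <em>Special price ¥168</em>
  <span>Shipping: free</span>
</div>
"""

tree = html.fromstring(PAGE)

# contains(text(), 'kw') tests the element's first text node for the keyword.
span_hits = tree.xpath("//span[contains(text(), 'Current price')]")
em_hits = tree.xpath("//em[contains(text(), 'price')]")

span_texts = [el.text for el in span_hits]
em_texts = [el.text for el in em_hits]
print(span_texts, em_texts)
```

Note that the match is case-sensitive: 'price' would not match a node that only says "PRICE".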
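Be careful not to let special symbols slip through. A currency symbol such as ¥ needs no escaping in XPath itself, but the page must be decoded correctly and the keyword must use the exact character the page uses (half-width ¥ versus full-width ￥). Quotation marks are the real trap: XPath 1.0 string literals have no escape character, so a keyword that contains a quote has to be wrapped in the other quote type or assembled with concat(). If you are unsure which variant a page serves, fetch it a few more times through ipipgo's dynamic IPs to see the different versions, and your odds of a match can double.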
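
One way to keep symbols and quotes from breaking an expression is to build the string literal programmatically. The sketch below is a hedged illustration: the helper names xpath_literal and contains_text are hypothetical, and it relies only on the fact that XPath 1.0 literals have no escape character, so a value containing both quote types has to be assembled with concat().

```python
def xpath_literal(value: str) -> str:
    """Return value as an XPath 1.0 string literal (hypothetical helper)."""
    if "'" not in value:
        return f"'{value}'"
    if '"' not in value:
        return f'"{value}"'
    # Both quote types present: split on ' and glue the pieces with concat().
    pieces = []
    for i, part in enumerate(value.split("'")):
        if i > 0:
            pieces.append('"\'"')  # a literal single quote, wrapped in double quotes
        if part:
            pieces.append(f"'{part}'")
    return "concat(" + ", ".join(pieces) + ")"


def contains_text(tag: str, keyword: str) -> str:
    """Build //tag[contains(text(), <keyword>)] with safe quoting."""
    return f"//{tag}[contains(text(), {xpath_literal(keyword)})]"


print(contains_text("span", "¥"))
print(contains_text("span", "today's price"))
```

Symbols such as ¥ pass through unchanged; only quotes need this treatment.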
II. The golden combination: proxy IP techniques
What is the biggest fear in batch crawling? Getting your IP blocked. This is where our ipipgo dynamic IP pool comes in. Use it like this:
- Change the exit IP at random for each request
- Switch lines automatically when a CAPTCHA appears
- Use static residential IPs for early-morning data capture
Focusing on the third point: many sites are especially sensitive to data center IPs. With ipipgo's residential proxies your traffic looks like a real user's, and combined with contains for fuzzy matching, the success rate can reach 90% or more.
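
The first two points in the list can be sketched with nothing but Python's standard library. This is a rough illustration under stated assumptions: the proxy URLs are placeholders (real endpoints and credentials would come from your provider), no particular provider API is assumed, and any failed request stands in for "blocked", since CAPTCHA detection is site-specific.

```python
import random
import urllib.request

# Placeholder endpoints; replace with addresses issued by your proxy provider.
PROXY_POOL = [
    "http://user:pass@proxy1.example.com:8000",
    "http://user:pass@proxy2.example.com:8000",
    "http://user:pass@proxy3.example.com:8000",
]


def opener_with_random_proxy(pool=PROXY_POOL):
    """Pick one proxy at random and build an opener that routes through it."""
    proxy = random.choice(pool)
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    return proxy, urllib.request.build_opener(handler)


def fetch(url, retries=3, timeout=10):
    """Fetch url, drawing a fresh random proxy for every attempt."""
    last_error = None
    for _ in range(retries):
        _proxy, opener = opener_with_random_proxy()
        try:
            with opener.open(url, timeout=timeout) as resp:
                return resp.read()
        except OSError as exc:  # URLError and socket timeouts subclass OSError
            last_error = exc
    raise RuntimeError(f"all {retries} attempts failed") from last_error
```

Because a new proxy is drawn on every attempt, a blocked line is naturally swapped out on retry. A site-specific check for CAPTCHA markers in the response body could be added inside the loop.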
III. Slick tricks from real-world practice
Recently I hit a pitfall while helping a client capture e-commerce data: product titles were mixed with decorative "Martian script" symbols, for example [Bestseller ★ Hot], and an ordinary exact-match XPath simply gave up. I ended up with a double-insurance expression, contains(text(),'Bestseller') and contains(text(),'Hot'), paired with ipipgo's Hong Kong data center IPs, and that solved it.
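
As an illustration of the double-insurance idea, the sketch below combines two contains tests with and, and shows the looser or variant for comparison. It assumes Python with lxml, and the product list is invented.

```python
from lxml import html

# Hypothetical product list with symbol-decorated titles.
PAGE = """
<ul>
  <li class="title">[Bestseller ★ Hot] Wireless earbuds</li>
  <li class="title">[Hot] Phone case</li>
  <li class="title">Plain USB cable</li>
</ul>
"""

tree = html.fromstring(PAGE)

# 'and' requires both keywords; 'or' accepts either one.
both = tree.xpath("//li[contains(text(), 'Bestseller') and contains(text(), 'Hot')]")
either = tree.xpath("//li[contains(text(), 'Bestseller') or contains(text(), 'Hot')]")

both_titles = [li.text.strip() for li in both]
either_titles = [li.text.strip() for li in either]
print(both_titles)
print(either_titles)
```

Neither expression needs to know about the ★ or the brackets, which is the point of matching on keywords rather than on the whole title.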
In an even more extreme case, a website splits the price ¥199 into three separate parts when displaying it. This is where contains plus node chaining comes in: //div[contains(@class,'price')]/span[contains(text(),'9')]
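
Here is a hedged sketch of the split-price case in Python with lxml. The markup (three sibling span elements inside a div whose class contains "price") is an assumption, since the site's real structure is not shown. It runs the chained expression and then uses string() to stitch the fragments back together.

```python
from lxml import html

# Assumed markup: the price is split across three <span> nodes.
PAGE = """
<div class="item">
  <div class="price-box"><span>¥</span><span>1</span><span>99</span></div>
</div>
"""

tree = html.fromstring(PAGE)

# The chained expression selects only the fragment(s) containing '9'.
fragments = tree.xpath(
    "//div[contains(@class, 'price')]/span[contains(text(), '9')]/text()"
)

# string() concatenates all descendant text of the first matching div.
full_price = tree.xpath("string(//div[contains(@class, 'price')])")

print(fragments)
print(full_price)
```

The chained predicate locates the right container; reading the container's full string value is what recovers the complete price.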
IV. Guide to avoiding pitfalls
Common pitfalls for newcomers:
- Case sensitivity (normalize with the translate function)
- Stray or irregular whitespace (handle with normalize-space)
- Dynamically loaded content (use together with ipipgo's API, which refreshes IPs in real time)
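
The first two pitfalls can be handled in a single predicate. The sketch below assumes Python with lxml, and the helper name contains_ci is hypothetical. It folds ASCII case with translate and collapses whitespace with normalize-space before testing contains.

```python
from lxml import html

UPPER = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
LOWER = "abcdefghijklmnopqrstuvwxyz"


def contains_ci(keyword: str) -> str:
    """Case-insensitive, whitespace-normalized contains predicate.

    Assumes the keyword has no single quote; translate() here folds
    ASCII letters only.
    """
    return (
        f"contains(translate(normalize-space(.), '{UPPER}', '{LOWER}'), "
        f"'{keyword.lower()}')"
    )


# Hypothetical fragment with odd casing and whitespace.
PAGE = """
<div>
  <span>  SPECIAL
     Price   ¥168 </span>
  <span>Shipping</span>
</div>
"""

tree = html.fromstring(PAGE)
hits = tree.xpath(f"//span[{contains_ci('special price')}]")
texts = [" ".join(el.text_content().split()) for el in hits]
print(texts)
```

XPath 1.0 has no lower-case function, which is why translate with explicit alphabets is the usual workaround.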
Last week a colleague could not get his data to match no matter what he tried, and it turned out the site was using font-based anti-scraping. I showed him how to pair ipipgo's mobile 4G proxy with a fuzzy contains(text(),'urges') expression to get past the detection.
Frequently Asked Questions
Q: How to choose between dynamic IP and static IP?
A: During testing, dynamic IPs are fine for quick setups. For production runs we suggest ipipgo's long-lasting static IPs, whose stability beats that of its peers.
Q: What should I do if I can't match XPath?
A: First check whether the IP has been banned, then try again with ipipgo's high-anonymity proxies. If that doesn't work, add redundancy with an expression such as contains(text(),'price') or contains(text(),'$').
Q: What can I do about proxy IPs affecting crawling speed?
A: Credit here goes to ipipgo's BGP line optimization. The key is to set a sensible IP rotation policy, so that no single IP gets used to death.
One last word: data scraping is like guerrilla warfare, where XPath is the gun and the proxy IP is the bulletproof vest. With ipipgo as your secret weapon, you are set to win battle after battle on the data battlefield. If you run into any strange problems in practice, feel free to ask our technical staff.

