XPath text() contains fuzzy matching tutorial

Hands-on fuzzy matching with XPath's contains

Anyone who does data capture knows the feeling: some web page elements are as slippery as a loach, and the contains function is the bamboo basket that catches them. Today we'll walk through real cases to show how to use this tool together with proxy IPs.

I. XPath contains basic operations

The contains function is, quite frankly, a keyword detector. The format looks like this: //div[contains(text(),'keyword')]. For example, say you want to grab the price of an item, but the page hides it behind several different wordings:

Web page source code      Corresponding XPath
Current price: ¥199       //span[contains(text(),'Current price')]
Special price ¥168        //em[contains(text(),'price')]

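Be careful not to drop special symbols. XPath string literals have no escape sequences, so a currency symbol like ¥ must be written literally and match the page's character exactly; any escaping happens in the host language's string syntax. If you're really unsure which wording a page serves, fetch it a few times through ipipgo's dynamic IPs to see more versions of the page, which improves your odds of a match.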
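To make the basic operation concrete, here is a minimal sketch in Python. It assumes the third-party lxml library, and the HTML snippet is a made-up sample modeled on the two price wordings, not a real page.

```python
from lxml import html

# Hypothetical markup modeled on the two price wordings (an assumption).
PAGE = """
<div>
  <span>Current price: ¥199</span>
  <em>Special price ¥168</em>
</div>
"""

tree = html.fromstring(PAGE)

# contains(text(), 'x') is true when the element's first text node includes 'x'.
current = tree.xpath("//span[contains(text(), 'Current price')]/text()")
special = tree.xpath("//em[contains(text(), 'price')]/text()")

# The currency symbol is written literally inside the XPath string.
with_yen = [el.tag for el in tree.xpath("//*[contains(text(), '¥')]")]

print(current)   # ['Current price: ¥199']
print(special)   # ['Special price ¥168']
print(with_yen)  # ['span', 'em']
```

Note that the match is case sensitive: 'current price' in lowercase would not match the first span.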

II. The golden combination with proxy IP technology

What's the biggest fear in batch crawling? Getting your IP blocked. This is the time to use our ipipgo dynamic IP pool. Here is exactly how to play it:

  1. Randomly change the exit IP for each request
  2. Switch lines automatically when a CAPTCHA appears
  3. Use static residential IPs for early-morning data capture

Focusing on the third point: many sites are especially sensitive to data center IPs. Using ipipgo's residential proxies to look like real user traffic, combined with contains for fuzzy matching, the success rate can reach 90% or more.
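The rotation steps above can be sketched with the Python standard library alone. Everything provider-specific here is an assumption: the proxy URLs are placeholders, pick_proxy, build_opener_for, and looks_like_captcha are hypothetical helper names, and the CAPTCHA check is a rough heuristic rather than anything a real site guarantees.

```python
import random
import urllib.request

# Placeholder endpoints (assumption); real ones come from your proxy provider.
PROXY_POOL = [
    "http://user:pass@proxy-a.example.com:8000",
    "http://user:pass@proxy-b.example.com:8000",
    "http://user:pass@proxy-c.example.com:8000",
]

def pick_proxy(pool, exclude=None):
    """Return a random proxy, avoiding `exclude` when another choice exists."""
    candidates = [p for p in pool if p != exclude] or list(pool)
    return random.choice(candidates)

def build_opener_for(proxy_url):
    """Build a urllib opener that routes HTTP and HTTPS through proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)

def looks_like_captcha(body):
    """Rough heuristic (assumption): a page mentioning 'captcha' is a block page."""
    return "captcha" in body.lower()

# Step 1: a randomly chosen exit IP for the request.
first = pick_proxy(PROXY_POOL)
opener = build_opener_for(first)
# body = opener.open("https://example.com/item/1", timeout=10).read().decode()

# Step 2: if the response looks like a CAPTCHA page, switch to another line.
second = pick_proxy(PROXY_POOL, exclude=first)
```

The network call is left commented out so the sketch runs offline; in a real crawler you would call looks_like_captcha on the response body and retry through the second opener.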

III. Slick tricks from real-world practice

Recently I hit a pitfall while helping a client capture e-commerce data: the product titles were mixed with odd decorative symbols. With a title like [explosive ★ hot], an exact-match XPath simply gives up. In the end, a double-insurance expression combining contains(text(),'explosive') and contains(text(),'hot'), together with ipipgo's Hong Kong data center IPs, solved the problem.
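A minimal sketch of the double-insurance expression, assuming lxml and a made-up product list (the client's real markup is not shown in this article):

```python
from lxml import html

# Hypothetical product titles (assumption), one decorated with a symbol.
PAGE = """
<ul>
  <li class="title">[explosive ★ hot] Wireless earbuds</li>
  <li class="title">[hot] Phone case</li>
  <li class="title">Plain USB cable</li>
</ul>
"""

tree = html.fromstring(PAGE)

# An exact comparison fails because of the brackets and the star.
exact = tree.xpath("//li[text()='explosive hot']/text()")

# Requiring both keywords with `and` tolerates whatever sits between them.
both = tree.xpath(
    "//li[contains(text(), 'explosive') and contains(text(), 'hot')]/text()"
)

print(exact)  # []
print(both)   # ['[explosive ★ hot] Wireless earbuds']
```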

In an even more extreme case, a website splits the price ¥199 into three separate parts for display. This is the time to use contains plus node splicing: //div[contains(@class,'price')]/span[contains(text(),'9')]
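Here is a sketch of that splicing idea with lxml. The markup is an assumption about how such a split might look (three span elements inside a price container); the actual site's HTML is not reproduced in this article.

```python
from lxml import html

# Assumed markup: the price ¥199 split across three <span> nodes.
PAGE = """
<div class="goods-price">
  <span>¥</span><span>1</span><span>99</span>
</div>
"""

tree = html.fromstring(PAGE)

# The expression from the text picks out only the fragment containing '9'.
fragment = tree.xpath(
    "//div[contains(@class,'price')]/span[contains(text(),'9')]/text()"
)

# Splicing: collect every span's text under the price container and join it.
parts = tree.xpath("//div[contains(@class,'price')]/span/text()")
price = "".join(parts)

print(fragment)  # ['99']
print(price)     # ¥199
```

Matching on contains(@class,'price') keeps the expression working even if the class name carries a prefix or suffix, as in the goods-price example here.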

IV. Pitfall avoidance guide

Common pitfalls for newbies:

  • Case sensitivity (normalize case with the translate function)
  • Messy whitespace (handle it with normalize-space)
  • Dynamically loaded content (combine with ipipgo's API for real-time IP updates)
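The first two pitfalls can be illustrated with a short sketch, assuming lxml and an invented snippet with mixed case and irregular whitespace. XPath 1.0 has no lower-case function, so translate is used to map uppercase letters to lowercase.

```python
from lxml import html

# Hypothetical markup (assumption) with mixed case and irregular spacing.
PAGE = """
<div>
  <span>  SPECIAL   Price:
     ¥168 </span>
  <span>special price: ¥199</span>
</div>
"""

tree = html.fromstring(PAGE)

UPPER = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
LOWER = "abcdefghijklmnopqrstuvwxyz"

# Case-insensitive match: lowercase the text before calling contains.
# lxml passes keyword arguments into the expression as $variables.
case_insensitive = tree.xpath(
    "//span[contains(translate(text(), $upper, $lower), 'price')]",
    upper=UPPER, lower=LOWER,
)

# Whitespace: the raw text has three spaces, so a one-space keyword misses.
raw = tree.xpath("//span[contains(text(), 'SPECIAL Price')]")
cleaned = tree.xpath("//span[contains(normalize-space(text()), 'SPECIAL Price')]")

# normalize-space also tidies the extracted value itself.
value = tree.xpath("normalize-space(//span[1])")

print(len(case_insensitive))  # 2
print(len(raw), len(cleaned)) # 0 1
print(value)                  # SPECIAL Price: ¥168
```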

Last week a buddy just couldn't get his data to match, and it turned out the site was using font-based anti-scraping. We showed him how to use ipipgo's mobile 4G proxies plus a fuzzy contains(text(),'urges') expression to get around the detection.

Frequently Asked Questions (Q&A)

Q: How to choose between dynamic IP and static IP?
A: In the testing phase, dynamic IPs are fine for tinkering. For production runs we suggest ipipgo's long-lasting static IPs, which are more stable.

Q: What should I do if I can't match XPath?
A: First check whether the IP has been banned, then switch to ipipgo's high-anonymity proxies and try again. If that doesn't work, use multiple insurance such as contains(text(),'price') or contains(text(),'$').
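A small sketch of the multiple-insurance idea with `or`, assuming lxml and an invented snippet. A dollar sign inside a quoted XPath string is just a literal character, not a variable reference.

```python
from lxml import html

# Hypothetical markup (assumption): two different price wordings plus noise.
PAGE = """
<div>
  <p>Sale price 168</p>
  <p>Only $19.99 today</p>
  <p>Free shipping</p>
</div>
"""

tree = html.fromstring(PAGE)

# Either keyword is enough for a paragraph to match.
hits = tree.xpath(
    "//p[contains(text(), 'price') or contains(text(), '$')]/text()"
)

print(hits)  # ['Sale price 168', 'Only $19.99 today']
```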

Q: What can I do about proxy IPs affecting crawling speed?
A: Credit here goes to ipipgo's BGP line optimization. The key is to set up a good IP rotation policy so that you don't run a single IP into the ground.

One last nag: data capture is like guerrilla warfare, where XPath is the gun and the proxy IP is the bulletproof vest. With ipipgo in your kit, you are well equipped for the data battlefield. If you run into any strange problems in practice, feel free to bug our tech support guys.

This article was originally published or compiled by ipipgo: https://www.ipipgo.com/en-us/ipdaili/31224.html
