
Hands-on with XPath: grabbing the data hiding next door
Anyone who has spent time writing crawlers has hit this scenario: the page structure looks perfectly clear, yet the moment you try to locate an element you feel like you're wandering a maze. This is especially true for table data and product lists, where identical sibling elements pile up. XPath's sibling-axis positioning is the axe that splits that knot.
For example, an e-commerce site hides the price in a span with `class="price"`, but right next to it sits a decoy with `class="fake-price"`. This is where the `following-sibling` axis lets you pinpoint the real price, much like picking a watermelon at the market: you have to thump it and listen.
```
//div[@class='product']/span[@class='title']/following-sibling::span[1]
```
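To see the axis in action, here is a minimal, self-contained sketch using lxml (a common choice; the product snippet below is invented to mirror the example above). Filtering the sibling by class, rather than blindly taking the first sibling, is what lets you skip the decoy:

```python
# A minimal sketch with lxml; the HTML fragment and class names are made up.
from lxml import html

doc = html.fromstring("""
<div class="product">
  <span class="title">Watermelon</span>
  <span class="fake-price">$0.99</span>
  <span class="price">$4.99</span>
</div>
""")

# following-sibling walks forward from the title span only; filtering by
# class skips the fake-price decoy that sits between title and real price.
real = doc.xpath(
    "//div[@class='product']/span[@class='title']"
    "/following-sibling::span[@class='price']/text()"
)
print(real)  # ['$4.99']
```

Note that `following-sibling::span[1]` would grab whichever span comes first, decoy included, so the class predicate is the safer variant when a fake element sits in between.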
Proxy IPs keep your crawler steady as an old dog
XPath alone is not enough, though; many sites guard against crawlers tighter than a vault. A couple of days ago a price-comparison friend of mine got his IP banned after just 20 consecutive requests and was tearing his hair out. This is where ipipgo's dynamic residential proxies come in: the IP pool is bigger than a Wanda Plaza, every request goes out wearing a different disguise, and the site simply cannot tell crawler from human.
Live configuration is dead simple (remember to replace username and password with your own credentials):

```python
import requests

proxies = {
    'http': 'http://username:password@gateway.ipipgo.com:9021',
    'https': 'http://username:password@gateway.ipipgo.com:9021',
}
# Replace with the site you are actually scraping.
resp = requests.get('https://your-target-site.com', proxies=proxies)
```
A golden-pair case study
Say we want to grab show listings from a ticketing site, where the page structure looks like this:

| Element | How it's marked |
|---|---|
| Show name | `h3` tag with `class="event-title"` |
| Show time | The first `p` tag immediately after the name |
| Ticket price | A `span` inside the second `p` tag |
With the XPath sibling axes, the fields can be grabbed like this:

```python
events = response.xpath('//div[@class="events-list"]/div')
for event in events:
    name = event.xpath('.//h3/text()').get()
    time = event.xpath('.//h3/following-sibling::p[1]/text()').get()
    price = event.xpath('.//p[2]/span/text()').get()
```
Pair this with ipipgo's pay-as-you-go plan and a 5-second request interval, and the scraper can run all night without a hiccup; you'll also sidestep 80% of the pitfalls that come free with free proxies.
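A minimal sketch of that pacing, reusing the hypothetical gateway credentials from earlier (the `get` parameter is injectable purely so the loop can be exercised without hitting the network):

```python
import time
import requests

# Same illustrative gateway as above; swap in your own credentials.
PROXIES = {
    'http': 'http://username:password@gateway.ipipgo.com:9021',
    'https': 'http://username:password@gateway.ipipgo.com:9021',
}

def fetch_all(urls, delay=5.0, get=requests.get):
    """Fetch each URL through the proxy, pausing `delay` seconds between requests."""
    pages = []
    for i, url in enumerate(urls):
        if i:  # no need to sleep before the very first request
            time.sleep(delay)
        resp = get(url, proxies=PROXIES, timeout=10)
        pages.append(resp.text)
    return pages
```

The fixed interval keeps the request rate well under most sites' rate-limit thresholds; randomizing the delay slightly is a common further refinement.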
Q&A for common failure scenarios
Q: What should I do when my XPath locator keeps returning an empty list?
A: First check whether the element has actually loaded (it may arrive via JavaScript), and reproduce the locator in your browser's developer tools. If the site runs anti-crawler checks, remember to add Referer and User-Agent to the request headers; ipipgo's proxies ship with built-in request-header disguising.
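As a sketch, those headers can be attached once to a requests session so every call carries them (the UA string and Referer below are placeholders, not values provided by ipipgo):

```python
import requests

# Hypothetical disguise headers; use values appropriate to your target site.
DISGUISE_HEADERS = {
    "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/120.0 Safari/537.36"),
    "Referer": "https://www.example.com/",
}

def build_session():
    """Return a requests.Session that sends the disguise headers on every request."""
    s = requests.Session()
    s.headers.update(DISGUISE_HEADERS)
    return s
```

Every `s.get(...)` made through the session then carries both headers automatically, so you can't forget them on an individual request.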
Q: What if a proxy IP suddenly refuses to connect?
A: Add a retry mechanism to your code; ipipgo's API supports automatically swapping out failed IPs. If you get disconnected frequently, consider switching to their long-lived static residential IPs, whose stability rivals a home broadband line.
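A bare-bones retry sketch along those lines, rotating to the next proxy whenever a request errors out (the proxy pool and the injectable `get` are illustrative, not ipipgo's actual API):

```python
import requests

def fetch_with_retry(url, proxy_pool, max_tries=3, get=requests.get):
    """Try the request up to max_tries times, moving to the next proxy on failure.

    proxy_pool is a list of requests-style proxies dicts; `get` is injectable
    so the rotation logic can be tested without a live network.
    """
    last_err = None
    for attempt in range(max_tries):
        proxies = proxy_pool[attempt % len(proxy_pool)]
        try:
            return get(url, proxies=proxies, timeout=10)
        except requests.RequestException as err:
            last_err = err  # dead IP: fall through and try the next one
    raise last_err
```

In production you would typically also refresh the pool from your provider's API between attempts rather than cycling a fixed list.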
Q: How do I crack dynamically rendered pages?
A: Bring in Selenium or Playwright to drive a real browser, and remember to give each browser instance its own proxy. ipipgo supports creating multiple proxy sessions at once, which neatly solves IP conflicts across multiple windows.
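One way to keep the proxy-per-instance bookkeeping straight is a small helper that deals proxies out to sessions, cycling when sessions outnumber proxies (the names here are illustrative; each URL would then go into your browser launcher's proxy option):

```python
from itertools import cycle

def assign_proxies(num_sessions, proxy_urls):
    """Map each browser session to its own proxy URL, cycling through the pool.

    Returns e.g. {"session-0": "http://p1", "session-1": "http://p2", ...}.
    """
    pool = cycle(proxy_urls)
    return {f"session-{i}": next(pool) for i in range(num_sessions)}
```

With Playwright, for instance, each assigned URL would be passed when launching that session's browser, so no two windows share an exit IP unless the pool runs short.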
One last thing: crawling is three parts skill and seven parts proxy. Having tried seven or eight proxy services, I can say ipipgo genuinely stands out in responsiveness and failure-retry handling; in particular, their IP liveness-detection API screens out dud IPs in advance, so the program doesn't stall halfway through a run.

