
Hands-on: scraping data with rvest without getting your IP blocked
Recently, readers keep asking me the same question: when I scrape with rvest, the site keeps blocking my IP — what do I do? It's like getting thrown out of the market every time you go grocery shopping, for being a nuisance. Today let's talk about using proxy IPs as an "invisibility cloak" to solve this, with a focus on the ipipgo service I've been using.
Why does your crawler always get caught?
Webmasters aren't pushovers; they have three main weapons: access frequency detection, IP anomaly identification, and request fingerprinting. For example, the same IP firing 50 requests per minute is nothing like a human's browsing speed — if they don't block you, who would they block?
A typical offending example
library(rvest)
for (i in 1:100) {
  html <- read_html(paste0("https://example.com/data?page=", i))
}
Writing code like this is the equivalent of holding up a megaphone and shouting "I'm a crawler!" A proxy IP is like a mask for the crawler, so the site can't tell who you really are.
ipipgo proxy configuration in practice
Take ipipgo's dynamic residential proxies as an example (their most stable product); setup takes three steps:
library(httr)
library(rvest)

# Replace with your own authentication information
proxy <- "http://username:password@gateway.ipipgo.com:9021"

# Request through the proxy
response <- GET("https://target-site.com",
                use_proxy(proxy),
                user_agent("Mozilla/5.0..."))

# Hand the result to rvest
html <- read_html(content(response, "text"))
text <- html %>% html_text()
One thing to pay attention to: rotate your proxy IPs regularly. ipipgo's API can rotate them automatically, which saves the hassle of switching by hand. They claim an IP availability rate of up to 99% — far more reliable than free proxies.
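Rotation can be scripted. A minimal sketch, assuming a hypothetical rotation endpoint that returns a fresh `host:port` as plain text per call — the URL, response format, and credentials below are placeholders; check ipipgo's API docs for the real ones:

```r
library(httr)

# Hypothetical rotation endpoint -- consult ipipgo's docs for the real URL and format
fetch_fresh_proxy <- function() {
  res <- GET("https://api.ipipgo.example/get?format=text")  # assumed endpoint
  host_port <- trimws(content(res, "text"))                 # e.g. "1.2.3.4:9021"
  paste0("http://username:password@", host_port)            # your credentials here
}

# Usage sketch: grab a fresh exit IP every 20 pages
# for (i in 1:100) {
#   if (i %% 20 == 1) proxy <- fetch_fresh_proxy()
#   resp <- GET(paste0("https://target-site.com/data?page=", i), use_proxy(proxy))
# }
```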
Common pitfalls for beginners
I fell into every one of these pits when I started out:
| Symptom | Fix |
|---|---|
| Suddenly returns a 403 error | Pause immediately and switch IPs |
| Incomplete data capture | Check for IP geolocation restrictions |
| Connection timeouts | Raise the timeout to 30 seconds |
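The fixes in the table can be rolled into one defensive wrapper. A sketch under stated assumptions: `proxies` is a character vector of proxy URLs you supply, and the 10-second cool-down and 3 tries are arbitrary choices, not ipipgo recommendations:

```r
library(httr)

# Defensive fetch: 30 s timeout, pause and switch proxies on a 403, retry a few times
safe_get <- function(url, proxies, max_tries = 3) {
  for (attempt in seq_len(max_tries)) {
    # Cycle through the proxy list, one per attempt
    p <- proxies[((attempt - 1) %% length(proxies)) + 1]
    resp <- tryCatch(
      GET(url, use_proxy(p), timeout(30)),
      error = function(e) NULL   # covers connection timeouts
    )
    if (!is.null(resp) && status_code(resp) == 200) return(resp)
    if (!is.null(resp) && status_code(resp) == 403) Sys.sleep(10)  # cool off before switching
  }
  stop("All attempts failed for: ", url)
}
```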
Soul-searching Q&A
Q: Is it legal to use a proxy IP?
A: As long as you stay away from personal information and trade secrets, collecting public data is fine. ipipgo's IPs all come from legitimate carrier resources, so you can use them with peace of mind.
Q: Do free proxies work?
A: Think it over: in a free IP pool, a hundred people may be sharing the same IP at the same time — it would be strange if the site *didn't* block it! ipipgo's dedicated proxies cost more, but the success rate is several times higher.
Q: How can I tell if a proxy is in effect?
A: Add a test step in the code:
test_ip <- GET("https://api.ipify.org", use_proxy(proxy))
cat(content(test_ip, "text"))  # should print the proxy IP
Leveling up your scraping strategy
A proxy alone isn't enough; you also need tactics:
1. Sleep a random 0.5-3 seconds between requests to mimic a human
2. Mix desktop and mobile User-Agents
3. Spread requests across ipipgo's global nodes
4. Enable automatic retries for important jobs
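Points 1 and 2 above can be sketched in a few lines. The `ua_pool` entries are truncated examples, not real User-Agent strings — fill in complete ones from a current browser:

```r
library(httr)

# Small pool mixing desktop and mobile User-Agents (strings truncated; use real ones)
ua_pool <- c(
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...",
  "Mozilla/5.0 (iPhone; CPU iPhone OS 17_0 like Mac OS X) ..."
)

# One polite request: random 0.5-3 s pause, random UA, routed through the proxy
polite_get <- function(url, proxy) {
  Sys.sleep(runif(1, min = 0.5, max = 3))
  GET(url, use_proxy(proxy), user_agent(sample(ua_pool, 1)))
}
```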
Finally, my biggest takeaway from two years of using ipipgo is that their support is very responsive. I once hit a technical problem at 3 a.m. and my ticket got a reply within 10 minutes — genuinely reliable. New users, remember to register for the 2 GB trial traffic; that's enough to scrape close to a million pages.

