
Teach You How to Scrape Data Through a Proxy in R
Anyone who's run a web crawler has hit the nasty problem of getting their IP blocked; at times like that, a proxy IP is your lifeline. Today let's chat about how to configure the ipipgo proxy service in R so your crawler runs as steady as an old dog.
What's the deal with proxy IPs?
In a nutshell: a middleman fetches the data for you. Say you want to scrape some website; hitting it directly from your own IP makes it easy to get flagged as a crawler. With an ipipgo proxy IP, the site only sees the proxy server's address, and even if that IP gets blocked, you can switch to another and keep working.
For example, a normal request looks like this:

```r
response <- httr::GET("http://目标网站.com")
```
After adding the proxy, it looks like this:

```r
proxy_host <- "123.45.67.89"  # example proxy address
response <- httr::GET("http://目标网站.com",
                      httr::use_proxy(proxy_host, port = 8000))
```
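To confirm the proxy is actually taking effect, you can hit a public IP-echo service and see which address comes back. A minimal sanity check, using the placeholder proxy above and the public httpbin.org service:

```r
library(httr)

# The proxy address is a placeholder; substitute your real one
chk <- GET("https://httpbin.org/ip",
           use_proxy("123.45.67.89", port = 8000),
           timeout(10))
content(chk, "parsed")  # the "origin" field should show the proxy's IP, not yours
```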
A Practical R Configuration Guide
I recommend the golden pair of **httr** and **rvest**; setup takes three steps:
Step 1: Load the necessary libraries

```r
library(httr)
library(rvest)
```
Step 2: Set the proxy parameters

```r
# Fill in your own ipipgo account details here
ipipgo_proxy <- use_proxy("gateway.ipipgo.com", port = 9020,
                          username = "username", password = "password")
```
Step 3: Send the request through the proxy

```r
resp <- GET("https://目标站点",
            ipipgo_proxy,
            timeout(30))
```
Parse the data:

```r
doc <- content(resp, "parsed")
```
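Since rvest is already loaded, pulling fields out of the parsed page takes only a couple more lines. A minimal sketch, assuming a hypothetical .price selector on the target page:

```r
# doc comes from content(resp, "parsed") above; rvest works on it directly
prices <- doc %>%
  html_elements(".price") %>%  # hypothetical selector; adjust for your site
  html_text2()
```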
Here's a guide to avoiding the pitfalls
Three common mistakes newbies make (a defensive-request sketch follows the table):

| Pitfall | Symptom | Fix |
|---|---|---|
| Wrong credentials | The request returns a 407 error | Double-check the username/password passed to use_proxy() |
| No timeout set | The request hangs forever | Keep the timeout parameter at 30 seconds or less |
| Reusing the same IP | You get blocked again | Rotate IPs dynamically with ipipgo |
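Here's the defensive-request sketch mentioned above. It's a minimal wrapper, assuming httr is loaded and proxy_cfg is a use_proxy() object like the one from Step 2; it retries on timeouts and flags 407 authentication failures:

```r
safe_get <- function(url, proxy_cfg, tries = 3) {
  for (i in seq_len(tries)) {
    res <- tryCatch(
      GET(url, proxy_cfg, timeout(30)),
      error = function(e) NULL  # timeouts and connection errors land here
    )
    if (!is.null(res)) {
      if (status_code(res) == 407)
        stop("Proxy auth failed: double-check your username and password")
      return(res)
    }
  }
  NULL  # all retries failed
}
```

A call like safe_get("https://目标站点", ipipgo_proxy) then retries up to three times before giving up.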
A Real-World Case Walkthrough

Recently a friend in e-commerce was scraping price data; after switching to ipipgo's residential proxies, his success rate soared from 45% to 92%. The key code looks like this:
```r
# Set up the proxy pool: ipipgo_get_proxies() calls ipipgo's API for fresh IPs
proxies <- ipipgo_get_proxies(type = "residential")

for (page in 1:100) {
  proxy <- sample(proxies, 1)  # rotate: pick a random IP for each request
  res <- GET(paste0("https://电商网站/page=", page),
             use_proxy(proxy),
             user_agent("Mozilla/5.0"))
  # ... parse and store the data ...
}
```
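One practical tweak (the delay range here is arbitrary): add a small random pause inside the loop so the IP rotation doesn't look like a flood of requests:

```r
Sys.sleep(runif(1, min = 1, max = 3))  # wait 1-3 seconds between pages
```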
Frequently Asked Questions (Q&A)
Q: What can I do about slow proxy IPs?
A: Choose ipipgo's static enterprise proxies; latency can be kept under 200 ms.
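If you want to verify the latency you're actually getting, timing a request yourself is easy. A minimal sketch, with placeholder proxy details:

```r
t <- system.time(
  GET("https://httpbin.org/ip",
      use_proxy("gateway.ipipgo.com", port = 9020))
)
t[["elapsed"]]  # round-trip time in seconds
```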
Q: What if I keep running into CAPTCHAs?
A: Use ipipgo's intelligent routing feature, which automatically assigns IP ranges with a low probability of triggering CAPTCHAs.
Q: Do free proxies work?
A: Don't even think about it! Nine out of ten free proxies are traps; for commercial use, stick with a professional provider like ipipgo.
Why I Recommend ipipgo

My honest take after using it myself for over two years:
1. An exclusive **IP health detection** feature that automatically filters out dead proxies
2. Lines in 300+ cities nationwide, so even data that needs geotargeting can be captured accurately
3. A dedicated **R language SDK** that lets you wire up the proxy service in about three lines of code (a hypothetical sketch follows this list)
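Since I haven't reproduced the SDK here, treat this as a hypothetical sketch of what those "three lines" might look like, reusing the ipipgo_get_proxies() helper from the case study above; the SDK's real function names may differ:

```r
library(httr)
proxies <- ipipgo_get_proxies(type = "residential")  # hypothetical SDK helper
resp <- GET("https://目标站点", use_proxy(sample(proxies, 1)))
```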
One last nag: when you crawl through a proxy, respect the site's robots.txt and don't hammer any single site to death. Use your tools responsibly, and the data will keep flowing for the long haul, right?

