
When E-commerce Data Meets the R Language
Lately a lot of my e-commerce friends have been grumbling that pulling data with Excel is like eating steak with chopsticks: laborious! Today let's talk about how to get real work done with R's rvest package. We'll focus on the anti-scraping mechanisms websites use, and on our savior, the proxy IP, and how to use it without crashing.
The Anti-Scraping Trifecta and How Proxy IPs Survive It
E-commerce sites are clever these days, and they have come up with some nasty tricks:
① IP rate limiting: like supermarket samples, only three tastes per person;
② CAPTCHA bombardment: more persistent than a girlfriend checking up on you;
③ Behavior tracking: move the mouse twice and you're already being watched.
This is where ipipgo's proxy IP service comes in. Setting it up is easier than cooking instant noodles:
| Configuration item | Example value |
|---|---|
| Proxy protocol | http/https |
| IP address | Dynamically generated by ipipgo |
| Port | Randomly assigned |
| Authentication method | Username + password |
Hands-on: Putting Body Armor on rvest
Here's the key part! Configure the proxy for rvest like this:
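library(httr)
library(rvest)
# The key code is here. The original snippet was cut off after `proxy_settings`,
# so the host, port, credentials, URL, and selector below are placeholders.
proxy_settings <- use_proxy("PROXY_HOST", 8080, "USERNAME", "PASSWORD")
page <- GET("https://example.com/product", proxy_settings)
read_html(content(page, as = "text", encoding = "UTF-8")) %>%
  html_elements(".price") %>%
  html_text()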
Note: ipipgo's residential proxies rotate IPs automatically, which makes them much more stable than free proxies. In my last test the scraper ran for 8 hours straight without being banned, and the data came back intact.
A Practical Guide to Avoiding Pitfalls
Have you run into any of these headaches?
- The page gets stuck halfway through loading
- The returned data is garbled gibberish
- A CAPTCHA (human verification) pops up
Use ipipgo's Intelligent Routing feature, which automatically selects the fastest node. Combine it with a random User-Agent, and the site will think you're just an ordinary user browsing around.
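The random User-Agent idea can be sketched roughly as follows. This is a minimal illustration, not ipipgo's or rvest's own mechanism: the `ua_pool` strings and the helper names `random_ua` and `fetch_with_random_ua` are made up for this example, and `proxy_settings` is assumed to be an `httr::use_proxy()` config you have already built.

```r
# Hypothetical pool of User-Agent strings; swap in whichever browsers you want to mimic.
ua_pool <- c(
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
  "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0"
)

# Pick one User-Agent string at random.
random_ua <- function(pool = ua_pool) {
  sample(pool, 1)
}

# Fetch a page through the proxy, presenting a different User-Agent on each call.
# Requires the httr package; `proxy_settings` comes from httr::use_proxy().
fetch_with_random_ua <- function(url, proxy_settings) {
  httr::GET(
    url,
    proxy_settings,
    httr::user_agent(random_ua()),
    httr::timeout(30)  # give up after 30 seconds instead of hanging on a stuck page
  )
}
```

The timeout also covers the "page gets stuck halfway" case: a stalled request fails fast instead of blocking the whole run.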
Beginner Q&A Time
Q: What can I do about slow proxy IPs?
A: Try switching protocols in the ipipgo dashboard; going from http to socks5 sometimes works wonders. Remember to pick a low-latency node, and don't go for free proxies just to save money!
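On the R side, switching protocols might look something like the sketch below. It relies on the fact that httr hands the proxy URL to libcurl, which reads a scheme prefix such as `socks5://` to decide which proxy protocol to speak. The helper names and the host `proxy.example.com` are placeholders invented for this example; the real host, port, and credentials come from your provider.

```r
# Build the proxy URL with an explicit scheme prefix.
make_proxy_url <- function(host, protocol = c("http", "socks5")) {
  protocol <- match.arg(protocol)  # rejects anything other than "http" or "socks5"
  paste0(protocol, "://", host)
}

# Wrap it into an httr proxy config (requires the httr package).
make_proxy <- function(host, port, username, password, protocol = "http") {
  httr::use_proxy(make_proxy_url(host, protocol), port, username, password)
}

# Usage (placeholder values):
# proxy_settings <- make_proxy("proxy.example.com", 1080, "USERNAME", "PASSWORD", "socks5")
# httr::GET("https://example.com", proxy_settings)
```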
Q: The code throws a 403 error when it runs?
A: Most likely the IP has been flagged. Wrap the request in tryCatch so that a failure automatically retries through a fresh ipipgo IP. I'd also recommend a 3-second delay between requests; don't hammer the site like a hungry wolf.
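A rough sketch of that tryCatch-plus-delay pattern is below. It assumes the proxy rotates the exit IP on its own between attempts, so the code only retries and waits; `fetch_with_retry` and its parameters are names made up for this example, and `fetch` is any zero-argument function that returns an httr response.

```r
# Retry a request when it errors out or comes back as 403, pausing between attempts.
# `status_of` extracts the HTTP status; it defaults to httr::status_code.
fetch_with_retry <- function(fetch, max_tries = 3, delay = 3,
                             status_of = httr::status_code) {
  for (attempt in seq_len(max_tries)) {
    # Network errors (timeouts, refused connections) become NULL instead of aborting.
    resp <- tryCatch(fetch(), error = function(e) NULL)
    if (!is.null(resp) && status_of(resp) != 403) {
      return(resp)
    }
    # Wait before the next attempt so the retries don't look like a flood.
    if (attempt < max_tries) Sys.sleep(delay)
  }
  stop("Request still failing after ", max_tries, " attempts")
}

# Usage (placeholder URL, `proxy_settings` built with httr::use_proxy()):
# resp <- fetch_with_retry(function() httr::GET("https://example.com", proxy_settings))
```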
Q: Why is the scraped data incomplete?
A: Check that your CSS selector is right, using the browser's developer tools to confirm it. You can also turn on ipipgo's data pivot feature to see the request details.
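Besides the browser's developer tools, a quick sanity check from R is to count how many nodes each selector matches on the page you actually downloaded. The sketch below assumes the rvest package is installed; `count_matches` and the selectors `.price` and `.title` are invented for illustration.

```r
# Count how many elements each CSS selector matches on a parsed page.
# A count of 0 usually means the selector is wrong or the content is loaded by JavaScript.
count_matches <- function(page, selectors) {
  vapply(
    selectors,
    function(sel) length(rvest::html_elements(page, sel)),
    integer(1)
  )
}

# Usage with an inline HTML snippet standing in for a downloaded page:
page <- rvest::read_html(
  "<html><body><span class='price'>9.99</span><span class='price'>19.99</span></body></html>"
)
count_matches(page, c(".price", ".title"))
```

If a selector that works in the browser returns zero here, the data is probably rendered client-side and never appears in the raw HTML that rvest receives.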
The Dark Art of Choosing a Proxy IP
There are three types of proxies on the market:
- Transparent proxies: no different from running around naked
- Ordinary anonymous proxies: like wearing a face mask
- High-anonymity (elite) proxies: the kind ipipgo offers, a full disguise
The last time I used another provider's proxy, it was detected right after startup. After switching to ipipgo's high-anonymity proxies, I collected data for 3 days straight without a hiccup. Their IP survival rate really delivers, which makes them a must for e-commerce price monitoring.
One final reminder: data collection is not a race, so keep your request frequency under control. With ipipgo's Intelligent Speed Control feature, set a random interval of 20-30 seconds and the site administrator won't be able to tell what you're up to. If anything is unclear, go to their official website and read the documentation, which is more detailed than a recipe.
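If you would rather pace requests in your own script, a random 20-30 second interval can be sketched in plain R as below. This is a client-side stand-in, not ipipgo's speed control feature; `random_delay` and `crawl_politely` are hypothetical names, and `fetch` is whatever function you use to download a single URL.

```r
# Draw a random pause, in seconds, between min_s and max_s.
random_delay <- function(min_s = 20, max_s = 30) {
  stats::runif(1, min = min_s, max = max_s)
}

# Visit URLs one by one, sleeping a random interval between requests.
crawl_politely <- function(urls, fetch, min_s = 20, max_s = 30) {
  results <- vector("list", length(urls))
  for (i in seq_along(urls)) {
    results[[i]] <- fetch(urls[[i]])
    # No need to wait after the last URL.
    if (i < length(urls)) Sys.sleep(random_delay(min_s, max_s))
  }
  results
}

# Usage (placeholder URLs, `proxy_settings` built with httr::use_proxy()):
# pages <- crawl_politely(
#   c("https://example.com/p/1", "https://example.com/p/2"),
#   function(u) httr::GET(u, proxy_settings)
# )
```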

