
I. Why do crawlers need proxy IPs?
Anyone who's done any scraping knows that a target site's anti-bot mechanism is stricter than an apartment complex's access control. Swipe the entrance card a dozen times in a row and the security guard will absolutely check your ID. A proxy IP is like walking into the complex in a different outfit every day, so the anti-bot system takes you for a normal visitor.
A real example: we were monitoring product prices on an e-commerce platform. Without a proxy, the IP got blocked in under half an hour. After switching to the ipipgo proxy pool, the crawler ran for three straight days without triggering risk control, and the collection success rate jumped from 40% to 95%.
II. How to work Colly's concurrency throttle
Colly's built-in concurrency control is like driving a manual: out of the box it runs requests one at a time, so you have to shift gears yourself:
```go
c := colly.NewCollector(
	colly.Async(true), // turn on the async switch
)
if err := c.Limit(&colly.LimitRule{
	DomainGlob:  "*",             // required: without a domain pattern the rule is rejected
	Parallelism: 10,              // run 10 requests concurrently
	RandomDelay: 2 * time.Second, // random pause of up to 2s between requests
}); err != nil {
	log.Fatal(err)
}
```
Note two potholes here:
1. If you don't set a delay, high concurrency will trip the anti-bot system outright.
2. Every site has a different tolerance; you have to probe your way to the optimal concurrency slowly, as in the sketch below.
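A conservative starting point for that probing might look like this; the 1-second base delay and starting parallelism of 2 are illustrative values, not Colly defaults:

```go
// Start low, then raise Parallelism step by step while watching
// for 403s and captchas. All numbers here are illustrative.
_ = c.Limit(&colly.LimitRule{
	DomainGlob:  "*",
	Parallelism: 2,               // begin small; raise toward 10+ if the site tolerates it
	Delay:       1 * time.Second, // fixed floor between requests
	RandomDelay: 2 * time.Second, // up to 2s of jitter on top of the floor
})
```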
III. A practical guide to wiring in proxy IPs
Straight to the practical part, with an ipipgo API integration example:
```go
func getProxy() string {
	// errors ignored for brevity; see the hardened version below
	resp, _ := http.Get("https://api.ipipgo.com/proxy?format=text")
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	return "http://" + strings.TrimSpace(string(body))
}

c.SetProxyFunc(func(r *http.Request) (*url.URL, error) {
	return url.Parse(getProxy()) // fetch a fresh proxy for every request
})
```
Key reminders:
- Fetch a fresh IP before every request
- Handle proxy failures instead of ignoring them
- Set timeouts so a dead proxy can't hang the crawler
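Here's a hardened sketch covering those three reminders, reusing the same ipipgo text endpoint; the 5-second timeout and 3 attempts are illustrative choices, not service requirements:

```go
import (
	"errors"
	"io"
	"net/http"
	"net/url"
	"strings"
	"time"
)

// proxyClient bounds how long we wait on the proxy API itself.
var proxyClient = &http.Client{Timeout: 5 * time.Second}

// getProxySafe fetches a fresh proxy and retries on failure instead of
// silently returning an empty string.
func getProxySafe() (string, error) {
	var lastErr error
	for attempt := 0; attempt < 3; attempt++ {
		resp, err := proxyClient.Get("https://api.ipipgo.com/proxy?format=text")
		if err != nil {
			lastErr = err
			continue
		}
		body, err := io.ReadAll(resp.Body)
		resp.Body.Close()
		if err != nil {
			lastErr = err
			continue
		}
		if addr := strings.TrimSpace(string(body)); addr != "" {
			return "http://" + addr, nil
		}
		lastErr = errors.New("proxy API returned an empty body")
	}
	return "", lastErr
}

// Wired into the same collector as before:
c.SetProxyFunc(func(r *http.Request) (*url.URL, error) {
	p, err := getProxySafe()
	if err != nil {
		return nil, err // fail the request rather than reuse a dead proxy
	}
	return url.Parse(p)
})
```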
IV. A pit-avoidance manual from real-world collection
I stepped on a few mines recently while building a price-comparison system for a client:
1. One provider's IP pool had a high duplication rate: 3 out of every 10 rotations handed back the same IP.
2. No request-header randomization, so the target site fingerprinted the crawler (see the sketch after this list).
3. Forgotten timeout controls, which let stuck requests pile up into a memory leak.
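For mine 2, a minimal header-randomization sketch; the User-Agent strings are sample values you'd replace with a larger, current pool:

```go
import "math/rand"

var userAgents = []string{
	"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
	"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
	"Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
}

// Rotate headers on every outgoing request so consecutive hits
// don't share an identical fingerprint.
c.OnRequest(func(r *colly.Request) {
	r.Headers.Set("User-Agent", userAgents[rand.Intn(len(userAgents))])
	r.Headers.Set("Accept-Language", "en-US,en;q=0.9")
})
```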
After switching to ipipgo's dedicated IP pool, the duplication rate dropped to 0.3%. The following configuration gave even better results:
| Parameter | Recommended value |
|---|---|
| Timeout | 15 seconds |
| Retries | 3 |
| Concurrency | 5-20 |
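Translated into Colly, that table might look like the sketch below; the retry counter lives in the request context, and `Request.Retry` re-issues the failed request:

```go
c := colly.NewCollector(colly.Async(true))
c.SetRequestTimeout(15 * time.Second) // table: timeout 15s

_ = c.Limit(&colly.LimitRule{
	DomainGlob:  "*",
	Parallelism: 10, // table: stay within 5-20
	RandomDelay: 2 * time.Second,
})

// table: at most 3 retries per request
c.OnError(func(r *colly.Response, err error) {
	retries, _ := r.Ctx.GetAny("retries").(int)
	if retries < 3 {
		r.Ctx.Put("retries", retries+1)
		_ = r.Request.Retry()
	}
})
```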
V. Frequently asked questions
Q: What should I do when proxy IPs frequently fail to connect?
A: Check three things: 1. IP liveness monitoring (a minimal probe is sketched below) 2. Switching the port or protocol 3. Contacting ipipgo customer service to change lines
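For point 1, one simple liveness probe; the 3-second dial timeout is an arbitrary choice, and a successful dial only proves the port is open, not that the proxy forwards traffic:

```go
import (
	"net"
	"time"
)

// isProxyAlive reports whether a TCP connection to the proxy
// (given as "host:port") can be opened within the timeout.
func isProxyAlive(addr string) bool {
	conn, err := net.DialTimeout("tcp", addr, 3*time.Second)
	if err != nil {
		return false
	}
	conn.Close()
	return true
}
```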
Q: What should I do when collection speed won't go up?
A: First confirm whether the proxy IP is dragging its feet. Measure response times with ipipgo's speed-test interface; a quality proxy should stay under 800ms of latency.
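One way to run that check yourself: time a request through the proxy and compare it against the 800ms guideline. The proxy address and test URL below are placeholders, not a documented ipipgo endpoint:

```go
import (
	"net/http"
	"net/url"
	"time"
)

// measureProxyLatency times a single GET through the given proxy.
// testURL should be a small, stable page you trust.
func measureProxyLatency(proxyAddr, testURL string) (time.Duration, error) {
	proxyURL, err := url.Parse(proxyAddr)
	if err != nil {
		return 0, err
	}
	client := &http.Client{
		Transport: &http.Transport{Proxy: http.ProxyURL(proxyURL)},
		Timeout:   5 * time.Second,
	}
	start := time.Now()
	resp, err := client.Get(testURL)
	if err != nil {
		return 0, err
	}
	resp.Body.Close()
	return time.Since(start), nil
}

// Usage: flag anything slower than the 800ms guideline.
// d, _ := measureProxyLatency("http://1.2.3.4:8000", "https://example.com")
// slow := d > 800*time.Millisecond
```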
Q: If my IP gets blocked, does that put my server at risk too?
A: The biggest advantage of a proxy IP is risk isolation: even if the proxy IP gets banned, your local machine is unaffected. Just be careful never to fire requests directly from your server, and keep the networks properly isolated.
One last piece of advice: don't pinch pennies on free proxies. A guy I know was scraping data through one and leaked his company's internal API key, got targeted as a result, and the losses cost far more than any proxy fee ever would. Regular providers like ipipgo keep request audit logs, so when something goes wrong you can still trace the real problem.

