Go Crawler: Colly Framework Concurrent Collection

I. Why do crawlers need proxy IPs?

Anyone who has written a crawler knows that a target site's anti-scraping mechanism can be stricter than the access gate of a residential compound. If you swipe through the gate a dozen times in a row, the security guard is going to check your ID. A proxy IP is like walking in wearing different clothes each time, so the anti-scraping system treats the traffic as ordinary user visits.

Here is a real example: when monitoring product prices on an e-commerce platform without a proxy, the IP was blocked in less than half an hour. After switching to the ipipgo proxy pool, the crawler ran for three consecutive days without triggering risk control, and the collection success rate rose from 40% to 95%.

II. How to step on the concurrency throttle in the Colly framework

Colly's built-in concurrency control is like a manual gearbox: by default the collector is synchronous and handles one request at a time. We have to shift gears ourselves:

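// Needs imports: "log", "time", and the colly package.
c := colly.NewCollector(
    colly.Async(true), // turn on asynchronous mode
)
// Limit returns an error if the rule has no DomainGlob or DomainRegexp.
if err := c.Limit(&colly.LimitRule{
    DomainGlob:  "*",             // apply the rule to every domain
    Parallelism: 10,              // at most 10 requests in flight
    RandomDelay: 2 * time.Second, // random pause of up to 2s before each request
}); err != nil {
    log.Fatal(err)
}
// In async mode, call c.Wait() after queuing visits so pending requests finish.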

Two pitfalls to watch for here:
1. Without a delay, high concurrency will trigger the anti-scraping measures right away.
2. Tolerance varies from site to site, so find the best concurrency level gradually by experiment.
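
To make the effect of a parallelism cap plus a random delay concrete, here is a minimal standard-library sketch. It is not how Colly implements its limit rule; it only illustrates the same idea with a buffered channel used as a semaphore. The function name runAll and its parameters are hypothetical.

```go
package main

import (
	"fmt"
	"math/rand"
	"sync"
	"time"
)

// runAll starts n tasks but lets at most `parallelism` of them run at once.
// While holding its slot, each task sleeps for a random time below
// maxDelayMillis, standing in for the random delay and the request itself.
// It returns how many tasks finished and the highest concurrency observed.
func runAll(n, parallelism, maxDelayMillis int) (done, peak int) {
	slots := make(chan struct{}, parallelism) // counting semaphore
	var (
		wg     sync.WaitGroup
		mu     sync.Mutex
		active int
	)
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			slots <- struct{}{}        // wait for a free slot
			defer func() { <-slots }() // release it when the task ends

			mu.Lock()
			active++
			if active > peak {
				peak = active
			}
			mu.Unlock()

			if maxDelayMillis > 0 {
				pause := time.Duration(rand.Intn(maxDelayMillis)) * time.Millisecond
				time.Sleep(pause)
			}

			mu.Lock()
			active--
			done++
			mu.Unlock()
		}()
	}
	wg.Wait()
	return done, peak
}

func main() {
	done, peak := runAll(30, 10, 20)
	fmt.Printf("finished %d tasks, peak concurrency %d\n", done, peak)
}
```

Raising the cap speeds things up only until the target site starts pushing back, which is why the second pitfall recommends tuning it per site.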

III. Practical guide to integrating proxy IPs

Straight to the practical part: an example of integrating with the ipipgo API:

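// Needs imports: "io", "net/http", "net/url", "strings".
// getProxy fetches one proxy address from the API as plain text.
func getProxy() (string, error) {
    resp, err := http.Get("https://api.ipipgo.com/proxy?format=text")
    if err != nil {
        return "", err
    }
    defer resp.Body.Close()
    body, err := io.ReadAll(resp.Body)
    return "http://" + strings.TrimSpace(string(body)), err
}

// Called for every request, so each one goes out through a fresh proxy.
c.SetProxyFunc(func(_ *http.Request) (*url.URL, error) {
    proxy, err := getProxy()
    if err != nil {
        return nil, err
    }
    return url.Parse(proxy)
})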

Key reminders:
- Switch to a new IP before each request
- Handle proxy failures
- Remember to set a timeout so requests cannot hang
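
The three reminders can be sketched with the standard library alone. The example below is an assumption-laden illustration, not part of the ipipgo API or of Colly: proxyPool, pick, reportFailure and newClient are hypothetical names, and the addresses are placeholders. It rotates proxies round-robin, skips any proxy that has failed too often, and builds a client with a hard timeout.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"net/url"
	"sync"
	"time"
)

// proxyPool hands out proxies in round-robin order and skips the ones
// that have reached maxFailures reported failures.
type proxyPool struct {
	mu          sync.Mutex
	proxies     []string
	failures    map[string]int
	next        int
	maxFailures int
}

func newProxyPool(maxFailures int, proxies ...string) *proxyPool {
	return &proxyPool{
		proxies:     proxies,
		failures:    make(map[string]int),
		maxFailures: maxFailures,
	}
}

// pick returns the next proxy that is still considered healthy.
func (p *proxyPool) pick() (string, error) {
	p.mu.Lock()
	defer p.mu.Unlock()
	for i := 0; i < len(p.proxies); i++ {
		candidate := p.proxies[p.next%len(p.proxies)]
		p.next++
		if p.failures[candidate] < p.maxFailures {
			return candidate, nil
		}
	}
	return "", errors.New("no healthy proxy left")
}

// reportFailure records one failed request for the given proxy.
func (p *proxyPool) reportFailure(proxy string) {
	p.mu.Lock()
	p.failures[proxy]++
	p.mu.Unlock()
}

// newClient builds an HTTP client that sends requests through proxyAddr
// and gives up after timeout instead of hanging.
func newClient(proxyAddr string, timeout time.Duration) (*http.Client, error) {
	proxyURL, err := url.Parse(proxyAddr)
	if err != nil {
		return nil, err
	}
	return &http.Client{
		Timeout:   timeout,
		Transport: &http.Transport{Proxy: http.ProxyURL(proxyURL)},
	}, nil
}

func main() {
	// Placeholder addresses from the documentation range 192.0.2.0/24.
	pool := newProxyPool(2, "http://192.0.2.1:8080", "http://192.0.2.2:8080")
	proxy, err := pool.pick()
	if err != nil {
		fmt.Println("pick failed:", err)
		return
	}
	client, err := newClient(proxy, 15*time.Second)
	if err != nil {
		fmt.Println("bad proxy address:", err)
		return
	}
	fmt.Println("using", proxy, "with timeout", client.Timeout)
}
```

A caller would invoke reportFailure whenever a request through a proxy errors out, then call pick again to retry through a different address.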

IV. Pitfalls to avoid in real-world collection

I recently ran into several problems while helping a client build a price comparison system:
1. One provider's IP pool had a high repetition rate: 3 out of 10 IP changes returned the same address.
2. Request headers were not randomized, so the target site recognized the crawler's fingerprint.
3. Forgetting to set a timeout led to memory leaks.
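
For the second problem, request header randomization can be sketched as follows. This is a minimal illustration using only the standard library; the header lists are short examples chosen for this sketch, and randomHeaders and applyHeaders are hypothetical helper names.

```go
package main

import (
	"fmt"
	"math/rand"
	"net/http"
)

// userAgents is a small illustrative list; a real crawler would keep a
// larger, regularly updated one.
var userAgents = []string{
	"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
	"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
	"Mozilla/5.0 (X11; Linux x86_64; rv:120.0) Gecko/20100101 Firefox/120.0",
}

// languages holds example Accept-Language values.
var languages = []string{
	"en-US,en;q=0.9",
	"en-GB,en;q=0.8",
	"zh-CN,zh;q=0.9,en;q=0.6",
}

// randomHeaders picks one value per header at random.
func randomHeaders() map[string]string {
	return map[string]string{
		"User-Agent":      userAgents[rand.Intn(len(userAgents))],
		"Accept-Language": languages[rand.Intn(len(languages))],
	}
}

// applyHeaders overwrites the matching headers on an outgoing request.
func applyHeaders(req *http.Request) {
	for name, value := range randomHeaders() {
		req.Header.Set(name, value)
	}
}

// contains reports whether list includes s.
func contains(list []string, s string) bool {
	for _, item := range list {
		if item == s {
			return true
		}
	}
	return false
}

func main() {
	req, err := http.NewRequest(http.MethodGet, "https://example.com/", nil)
	if err != nil {
		fmt.Println("building request failed:", err)
		return
	}
	applyHeaders(req)
	fmt.Println("User-Agent:", req.Header.Get("User-Agent"))
	fmt.Println("known value:", contains(userAgents, req.Header.Get("User-Agent")))
}
```

With Colly, the same values could be set inside an OnRequest callback through r.Headers.Set, so that every queued request gets a fresh combination.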

After switching to ipipgo's exclusive IP pool, the IP duplication rate dropped to 0.3%. The following configuration gives better results with it:

Parameter     Recommended value
Timeout       15 seconds
Retries       3
Concurrency   5-20
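
As a rough sketch of how the values in the table above might be wired together, the example below puts them in a config struct and adds a small retry helper. Two assumptions are made here: "3 retries" is read as up to three extra attempts after the first one, and a parallelism of 10 is picked arbitrarily from the 5-20 range. The names crawlConfig, defaultConfig, withRetries and errTemporary are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"time"
)

// crawlConfig mirrors the recommended values from the table.
type crawlConfig struct {
	Timeout     time.Duration
	Retries     int
	Parallelism int
}

// defaultConfig returns 15s timeout, 3 retries and a parallelism of 10
// (an arbitrary pick inside the recommended 5-20 range).
func defaultConfig() crawlConfig {
	return crawlConfig{Timeout: 15 * time.Second, Retries: 3, Parallelism: 10}
}

// errTemporary stands in for a failed request in this sketch.
var errTemporary = errors.New("temporary failure")

// withRetries calls fn once and then up to `retries` more times until it
// succeeds. It returns the number of calls made and the last error.
func withRetries(retries int, fn func() error) (calls int, err error) {
	for calls = 1; ; calls++ {
		if err = fn(); err == nil || calls > retries {
			return calls, err
		}
	}
}

func main() {
	cfg := defaultConfig()
	client := &http.Client{Timeout: cfg.Timeout}

	attempt := 0
	calls, err := withRetries(cfg.Retries, func() error {
		attempt++
		if attempt < 3 {
			return errTemporary // simulate two failed requests
		}
		return nil
	})
	fmt.Println("timeout:", client.Timeout, "calls:", calls, "error:", err)
}
```

In a real crawler the function passed to withRetries would perform the request, ideally through a different proxy on each attempt.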

V. Frequently Asked Questions

Q: What should I do if the proxy IP often fails to connect?
A: Check three things: 1. Monitor whether the IPs are still alive. 2. Try a different port or protocol. 3. Contact ipipgo customer service to switch lines.

Q: What should I do if the collection speed won't go up?
A: First confirm whether the proxy IP is the bottleneck. Use ipipgo's speed test interface to check the response time; a good-quality proxy should have a latency under 800ms.

Q: If an IP gets blocked, will my server be affected?
A: The biggest advantage of a proxy IP is risk isolation: even if the IP is blocked, the local machine is not affected. However, take care not to send requests directly from the server, and keep the network properly isolated.

Finally, a piece of advice: don't use free proxies just to save money. One developer leaked the company's internal API key while scraping through a free proxy and ended up being targeted; the loss was far greater than any proxy fee. A reputable provider such as ipipgo keeps request audit logs, so when something goes wrong you can still trace the real cause.

This article was originally published or compiled by ipipgo: https://www.ipipgo.com/en-us/ipdaili/35083.html
