
Hands-on with Go: extracting proxy IPs from web pages
Anyone who has done data collection for a while knows that working without proxy IPs is like driving without a steering wheel. Today I'll share something practical: a proxy IP parser written in Go, focusing on how to extract proxy addresses from a web page.
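import (
	"fmt"
	"regexp"
)

// As an example: parse IPs from a web table
func parseIPTable(html string) []string {
	// Match an IP cell followed by a port cell; \s* tolerates whitespace
	// or a line break between the two <td> elements.
	re := regexp.MustCompile(`<td>(\d+\.\d+\.\d+\.\d+)</td>\s*<td>(\d+)</td>`)
	matches := re.FindAllStringSubmatch(html, -1)
	var proxies []string
	for _, match := range matches {
		// match[1] is the IP, match[2] is the port
		proxies = append(proxies, fmt.Sprintf("%s:%s", match[1], match[2]))
	}
	return proxies
}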
This regular expression looks simple, but there are several pitfalls to watch for: page structure changes often, some sites deliberately plant fake IPs, and the table may be mixed with advertising content. This is where a ready-made proxy pool such as ipipgo's saves a lot of trouble compared with scraping pages yourself.
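To guard against the fake entries and stray ad content mentioned above, one option is a cheap syntax check before any network test. The following is a minimal sketch, not part of the original parser: the names validProxy and filterProxies are made up for illustration, and treating private or loopback addresses as junk is an assumption that suits public proxy lists.

```go
package main

import (
	"net"
	"strconv"
)

// validProxy reports whether addr looks like a plausible public "ip:port".
// It checks syntax and ranges only; it does not test whether the proxy works.
func validProxy(addr string) bool {
	host, port, err := net.SplitHostPort(addr)
	if err != nil {
		return false
	}
	ip := net.ParseIP(host)
	// Assumption: loopback, private and unspecified addresses are never
	// real entries in a public proxy list.
	if ip == nil || ip.IsLoopback() || ip.IsPrivate() || ip.IsUnspecified() {
		return false
	}
	n, err := strconv.Atoi(port)
	return err == nil && n >= 1 && n <= 65535
}

// filterProxies keeps the valid entries in order and drops duplicates.
func filterProxies(candidates []string) []string {
	seen := make(map[string]bool)
	var out []string
	for _, c := range candidates {
		if validProxy(c) && !seen[c] {
			seen[c] = true
			out = append(out, c)
		}
	}
	return out
}
```

Passing the output of parseIPTable through filterProxies throws away octets above 255, out-of-range ports and repeated rows before you spend time connecting to anything.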
Proxy IP Verification
You went to the trouble of scraping the IPs, and eight out of ten don't work. What now? Here's a trick:
| Verification step | Time taken | Success rate |
|---|---|---|
| TCP connection alone | 2 seconds | 40% |
| Test with target site | 5 seconds | 80% |
| Multi-node concurrency detection | 3 seconds | 95% |
If that's too much trouble, just use ipipgo's pre-verified IP pool, which has already been through three rounds of screening. The IPs their API returns are basically ready to use, which saves you the verification work.
Practical case: collecting data from an enterprise information website
Recently a friend asked me for help: his company needed to collect enterprise data, but the site's anti-scraping measures were too strict. Here's how we got it done:
func main() {
	// Get 10 proxies from ipipgo
	// (this loop assumes GetProxies returns a slice of *url.URL)
	proxies := ipipgo.GetProxies(10, "http")
	for _, proxy := range proxies {
		client := &http.Client{
			Transport: &http.Transport{Proxy: http.ProxyURL(proxy)},
			Timeout:   8 * time.Second,
		}
		// Handle errors: a dead proxy should not stop the whole run
		resp, err := client.Get("target site")
		if err != nil {
			continue // try the next proxy
		}
		// Parsing the data...
		resp.Body.Close()
	}
}
With this approach we got past the anti-scraping mechanism. The key point is to use a different proxy for each request; ipipgo's IP pool is big enough for us to rotate through.
Veteran Q&A
Q: Why doesn't the proxy IP I obtained work?
A: There are two common causes: either the proxy has expired (self-scraped IPs tend to be short-lived), or the target site has blocked the proxy's IP range. A professional provider such as ipipgo is recommended: their IPs are refreshed quickly and come with a 24-hour survival guarantee.
Q: How can I increase the collection speed?
A: Three tips: 1. send requests concurrently using a goroutine pool; 2. set a reasonable timeout; 3. don't hammer a single site, use proxy IPs to spread the requests out.
Q: What should I look for when choosing a proxy provider?
A: Focus on these points: IP pool size (ipipgo's pool of millions is recommended), protocol support (HTTP/HTTPS/SOCKS5), response speed (ipipgo measured about 200 ms on average), and whether a trial is available (they have a 3 yuan trial package).

