
What scares veteran crawler operators most? Getting their IP blocked is the biggest headache!
Lately a lot of friends who do data collection have been complaining to me: gospider really is fast at crawling data, but the target site blocks their IP at the drop of a hat. Last week a guy doing e-commerce price comparison ran it for just half an hour and had more than 20 IPs blocked. He was so angry he nearly smashed his keyboard.
Here's a tip for you: a proxy IP is your stealth suit. It's like wearing level-3 armor in a battle royale game, which lets you take a couple more shots; with a proxy IP, your crawler can hop back and forth right under the site's nose. In my tests, ipipgo, a proxy service based in China, has handled high-concurrency requests stably.
Hands-on: putting a disguise on gospider
gospider -s "https://target.com" -a -c 10 -d 3 \
  --proxy "http://user:pass@proxy.ipipgo.com:31028"
In this command, the --proxy parameter is the key. Fill in the proxy address that ipipgo provides and gospider instantly becomes a "crawler of a thousand faces". Be careful to get the format right, especially the username, password, and port number; this is where beginners most often trip up.
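Since a malformed proxy address is the classic beginner mistake, a quick format check before launching the crawl can save some head-scratching. The sketch below is only an illustration: `valid_proxy_url` is a hypothetical helper (not part of gospider or ipipgo), and it assumes the address has the shape scheme://[user:pass@]host:port with an http, https, or socks5 scheme.

```shell
# Hypothetical helper: succeeds only when the argument looks like
# scheme://[user:pass@]host:port (scheme = http, https or socks5).
valid_proxy_url() {
  printf '%s\n' "$1" | grep -Eq \
    '^(http|https|socks5)://([^:@/[:space:]]+:[^@/[:space:]]+@)?[A-Za-z0-9.-]+:[0-9]{1,5}$'
}

# Example: check the address from the command above before using it.
proxy="http://user:pass@proxy.ipipgo.com:31028"
if valid_proxy_url "$proxy"; then
  echo "proxy format OK"
else
  echo "proxy format looks wrong: $proxy" >&2
fi
```

The check is deliberately strict: a missing port or a missing scheme is rejected, which covers the mistakes mentioned above. It only inspects the shape of the string and says nothing about whether the proxy actually works.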
| Parameter | Meaning | Recommended value |
|---|---|---|
| -c | concurrency | 10-30 (depending on your proxy plan) |
| --proxy | proxy protocol | http/socks5 |
A practical guide to avoiding pitfalls
Last time I helped a customer scrape prices from a travel site using ipipgo's residential proxy pool, and it ran for three days straight without getting blocked. Here is a little trick: rotate the proxy IP on a timer. Their API supports changing IPs by the minute, which pairs nicely with running gospider on a schedule.
Automatic IP rotation script
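url="https://target.com"  # site to crawl
while true; do
  # Fetch a fresh proxy address from the ipipgo API
  new_ip=$(curl -s https://api.ipipgo.com/get_proxy)
  gospider -s "$url" --proxy "$new_ip"
  sleep 300  # wait 5 minutes, then run again with a new IP
done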
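One weak spot of a loop like this is that a failed API call leaves the proxy empty, and the crawl then goes out without any disguise. Here is a minimal sketch of a guard against that. The function name `crawl_with_fresh_proxy` and the `PROXY_API` variable are made up for illustration, and it assumes the API returns a single proxy URL as plain text, which the script above implies but which I have not verified.

```shell
# Assumption: this endpoint returns one proxy URL as plain text.
PROXY_API="https://api.ipipgo.com/get_proxy"

# Hypothetical wrapper: fetch a proxy, refuse to crawl without one.
crawl_with_fresh_proxy() {
  target=$1
  # -f makes curl fail on HTTP errors; --max-time bounds the wait.
  proxy=$(curl -fsS --max-time 10 "$PROXY_API") || return 1
  [ -n "$proxy" ] || return 1
  gospider -s "$target" --proxy "$proxy"
}

# Usage inside the rotation loop (commented out so nothing runs on load):
# crawl_with_fresh_proxy "https://target.com" || echo "no proxy, skipping" >&2
```

Returning early means a flaky API costs you one skipped round instead of a crawl from your real IP.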
Common beginner pitfalls: Q&A
Q: What should I do if my proxy IP always times out?
A: First check that the proxy format is correct, then try switching to a different ipipgo data center node. Their tech support responds quickly: last time I opened a ticket at 2:00 in the morning and someone actually replied...
Q: Is the proxy to blame when the crawler slows down?
A: Not necessarily! Use curl -x to test the proxy on its own. If the latency is over 200 ms, consider switching to ipipgo's static high-speed plan, which is optimized specifically for crawlers.
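To put a number on that curl -x test, something along these lines should work. It is a rough sketch: `proxy_latency_ms` and `is_slow` are hypothetical helper names, the 200 ms threshold comes from the answer above, and curl's `time_total` covers the whole request through the proxy (connection plus transfer), so it is an upper bound rather than the proxy's own latency.

```shell
# Hypothetical helper: total time in ms for one request through a proxy.
# $1 = proxy URL, $2 = URL to fetch through it
proxy_latency_ms() {
  t=$(curl -x "$1" -o /dev/null -s --max-time 10 -w '%{time_total}' "$2") || return 1
  # time_total is in seconds; convert to whole milliseconds (rounded).
  awk -v t="$t" 'BEGIN { printf "%d\n", t * 1000 + 0.5 }'
}

# Succeeds when the measured latency exceeds the 200 ms threshold.
is_slow() {
  [ "$1" -gt 200 ]
}

# Usage (commented out so nothing hits the network on load):
# ms=$(proxy_latency_ms "http://user:pass@proxy.ipipgo.com:31028" "https://target.com")
# is_slow "$ms" && echo "proxy is slow: ${ms} ms"
```

Run it a few times before drawing conclusions, since a single request can be slow for reasons that have nothing to do with the proxy.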
Q: How many proxy IPs do I need to use at the same time?
A: It depends on how strict the target site's anti-bot controls are. As a rule of thumb, keep 3-5 times as many IPs as your concurrency. For example, if you are running 20 concurrent requests, it is best to have 60-100 IPs on hand to rotate through, and ipipgo's plans are flexible enough for this kind of setup.
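The rule of thumb is simple arithmetic, sketched here as a tiny helper (the name `pool_size_range` is made up for illustration):

```shell
# Prints the suggested minimum and maximum pool size
# (3x and 5x the concurrency given as $1).
pool_size_range() {
  echo "$(( $1 * 3 )) $(( $1 * 5 ))"
}

pool_size_range 20   # prints "60 100"
```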
Why ipipgo?
I've used seven or eight proxy services and finally settled on ipipgo. Three solid advantages:
- Self-built data centers in China, with latency kept within 50 ms
- 15% of the IP pool is refreshed every hour, so you practically never run out
- Supports pay-per-traffic billing, which is especially friendly to small projects
Lastly, a word of advice: don't go for free proxies just because they're cheap! A buddy of mine used one to save himself the trouble, and the data he crawled ended up injected with malicious code that wiped his database. For professional work, a reliable provider like ipipgo is the safer choice.

