
Hands-on: using proxy IPs to avoid getting blocked
Anyone who does scraping knows the biggest headache is getting your IP banned by the target site. Two days ago I wrote a data-collection script for a platform, and it ran for less than half an hour before the site started throwing "abnormal access" prompts; I nearly smashed the keyboard on the spot. It turned out that proxy IPs are the proper way out, so let me ramble through my hands-on experience here.
Take the requests library as an example: scraping without a proxy is like running naked on the internet. The site admin sees the same IP hammering requests and blacklists you within minutes. What you need is to give each request a different disguise, in other words, rotate through different proxy IPs.
```python
import requests
from bs4 import BeautifulSoup

# Route both HTTP and HTTPS traffic through the authenticated proxy gateway
proxies = {
    'http': 'http://username:password@gateway.ipipgo.com:9020',
    'https': 'http://username:password@gateway.ipipgo.com:9020'
}

response = requests.get('https://target-site.com', proxies=proxies)
soup = BeautifulSoup(response.text, 'html.parser')
# ...parsing logic goes here...
```
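The snippet above pins every request to a single gateway. To actually rotate IPs, keep a small pool of proxy endpoints and pick one at random for each request. Here's a minimal sketch, assuming you have several gateway URLs or sticky-session credentials from your provider (the endpoints below are placeholders):

```python
import random
import requests

# Placeholder pool; substitute your own gateway endpoints or session credentials
PROXY_POOL = [
    'http://user:pass@gateway.ipipgo.com:9020',
    'http://user:pass@gateway.ipipgo.com:9021',
    'http://user:pass@gateway.ipipgo.com:9022',
]

def get_random_proxy():
    """Pick a random endpoint and return it in the dict format requests expects."""
    proxy_url = random.choice(PROXY_POOL)
    return {'http': proxy_url, 'https': proxy_url}

for url in ['https://target-site.com/page/1', 'https://target-site.com/page/2']:
    resp = requests.get(url, proxies=get_random_proxy(), timeout=10)
    print(url, resp.status_code)
```

The retry example further down reuses this same get_random_proxy helper.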
How to choose a reliable proxy IP?
There are all kinds of proxies on the market. I compared seven or eight providers and eventually settled on ipipgo's dynamic residential IPs. Why? Three words: stable, fast, affordable. Their IP pool comes from real home broadband, which is much harder to flag than datacenter IPs, and the price is still roughly 20% cheaper than their peers.
Here's a comparison table to make it clearer:
| Type | Typical scenario | Price |
|---|---|---|
| Dynamic residential (standard) | Routine data collection | 7.67 yuan/GB |
| Dynamic residential (enterprise) | High-frequency access | 9.47 yuan/GB |
| Static residential | Long-term fixed IP | 35 yuan/month |
Three real-world pitfalls to avoid
Pitfall 1: not handling proxy failures. Use a retry decorator to retry automatically; I usually set 3 retries and switch to a random proxy on each attempt:
```python
import requests
from tenacity import retry, stop_after_attempt

@retry(stop=stop_after_attempt(3))
def crawl_page(url):
    # Get a fresh proxy for each retry
    current_proxy = get_random_proxy()
    # Timeout keeps a dead proxy from hanging the whole crawl
    return requests.get(url, proxies=current_proxy, timeout=10)
```
Pitfall 2: request headers giving you away. Generate a random User-Agent for every request so the site never sees a fixed pattern. I've put together a UA library; message me if you need it.
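A minimal sketch of the idea, with just a tiny sample pool of UA strings (in practice keep a much larger list, or use a library such as fake-useragent):

```python
import random
import requests

# Small sample pool of desktop User-Agent strings; expand for real use
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/122.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15',
    'Mozilla/5.0 (X11; Linux x86_64; rv:124.0) Gecko/20100101 Firefox/124.0',
]

# Pick a different User-Agent for each request
headers = {'User-Agent': random.choice(USER_AGENTS)}
response = requests.get('https://target-site.com', headers=headers)
```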
Pitfall 3: not verifying proxy quality. Run a quick test script before the crawler starts; I usually hit httpbin.org/ip to confirm the proxy is actually working.
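Here's a minimal check, reusing the get_random_proxy helper sketched earlier; httpbin.org/ip echoes back the IP it sees, so you can confirm the exit IP and time the round trip:

```python
import time
import requests

def check_proxy(proxies, timeout=5):
    """Return (ok, exit_ip, elapsed_seconds) for a given proxies dict."""
    start = time.time()
    try:
        resp = requests.get('https://httpbin.org/ip', proxies=proxies, timeout=timeout)
        resp.raise_for_status()
        return True, resp.json().get('origin'), time.time() - start
    except requests.RequestException:
        return False, None, time.time() - start

ok, exit_ip, elapsed = check_proxy(get_random_proxy())
print(f'usable={ok} exit_ip={exit_ip} elapsed={elapsed:.2f}s')
```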
Frequently Asked Questions
Q: What should I do if my proxy is slow?
A: Prefer nodes close to the target's network, e.g. use ipipgo's East China nodes when scraping domestic sites. Also make sure the protocols match: don't push HTTPS requests through an entry configured only for HTTP.
Q: How do I manage a large number of proxy IPs?
A: Store the IP pool in redis and record how many times each IP has been used along with its response time. Something like this structure works well:
```json
{
  "ip": "112.95.23.61:8080",
  "used_count": 3,
  "last_speed": 0.78,
  "last_check": "2024-03-15 14:30"
}
```
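As a rough sketch of how that record might be kept with redis-py (the `proxy:` key prefix and the field names are just my own choices, not a fixed schema):

```python
import time
import redis

r = redis.Redis(host='localhost', port=6379, decode_responses=True)

def record_proxy_result(ip, elapsed_seconds):
    """Update the usage stats for one proxy IP, stored as a redis hash."""
    key = f'proxy:{ip}'
    r.hincrby(key, 'used_count', 1)
    r.hset(key, mapping={
        'ip': ip,
        'last_speed': round(elapsed_seconds, 2),
        'last_check': time.strftime('%Y-%m-%d %H:%M'),
    })

record_proxy_result('112.95.23.61:8080', 0.78)
print(r.hgetall('proxy:112.95.23.61:8080'))
```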
Q: How do I handle CAPTCHAs when I run into them?
A: That's a separate topic in itself. In short, you can pair the crawler with ipipgo's TK dedicated proxy (a feature unique to them) to handle the common CAPTCHA types automatically.
One last reminder: judge a proxy service by its long-term stability. I once used a cheap 9.9 yuan/month service and the IPs survived for less than 5 minutes on average. With ipipgo's enterprise package, a single IP stays usable for more than 2 hours, so the overall cost actually works out lower. New users can start with the dynamic standard plan to test the water; at a bit over 7 yuan per GB of traffic, that's enough to run a small project.

