
How to Set Up Proxy IPs for Your Crawler
When you're scraping data, the biggest headache is having your IP blocked by the target site. That's when your crawler needs a "vest" - that is, a proxy IP. Today we'll take the most common case, a Python crawler, as an example and show you how to armor up your program.
Step 1: Get a reliable proxy IP
I recommend ipipgo's dynamic residential IPs; at just over $7 per GB of traffic they're quite cost-effective. Their IP pool is large, with carrier resources from more than 200 countries worldwide, so the odds of getting blocked are much lower. Let's focus on how to fetch an IP:
```python
import requests

# Fetch proxies from ipipgo's API
api_url = "https://api.ipipgo.com/getproxy"
params = {
    "type": "dynamic",      # dynamic residential IPs
    "count": 5,             # grab 5 at a time
    "protocol": "http",
    "key": "your_api_key"   # replace with your own key (parameter name may differ; check their docs)
}
response = requests.get(api_url, params=params)
proxies = response.json()['data']
```
This code pulls 5 dynamic residential IPs in one call; note that when you actually run it, you must substitute your own API key. Their client can also export the proxy list directly, which is friendlier for beginners.
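Before you hand a freshly fetched proxy to your crawler, it's worth a quick health check. Here's a minimal sketch (the test URL `https://httpbin.org/ip` is my choice of echo endpoint, not anything specific to ipipgo):

```python
import requests

def check_proxy(proxy, test_url="https://httpbin.org/ip", timeout=5):
    """Return True if the proxy can complete a simple GET within the timeout."""
    try:
        resp = requests.get(test_url,
                            proxies={"http": proxy, "https": proxy},
                            timeout=timeout)
        return resp.status_code == 200
    except requests.RequestException:
        return False

# Keep only the proxies that pass the check:
# live_proxies = [p for p in proxies if check_proxy(p)]
```

Filtering dead proxies up front saves you from burning requests (and pages) on IPs that were never going to work.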
Step 2: Hook the proxy into the requests library
Assuming you've got a proxy IP, the most common way to configure it looks like this:
```python
import requests

session = requests.Session()
proxy = "http://username:password@ip:port"  # fill in your credentials and proxy address
try:
    response = session.get("https://example.com",  # your target URL
                           proxies={"http": proxy, "https": proxy},
                           timeout=10)
    print(response.text)
except Exception as e:
    print(f"This IP isn't working, switch to the next one: {e}")
```
Note that you have to fill in the username and password here (you can generate them in the ipipgo dashboard); don't use a bare IP directly. If you hit a timeout or a 403 error, switch to the next IP right away instead of hammering the same one.
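The "switch on failure" advice above can be wrapped into a small helper. This is a sketch under the assumption that you already have a list of proxy URLs; the function name is mine:

```python
import requests

def fetch_with_failover(url, proxy_list, timeout=10):
    """Try each proxy in turn; return the first successful response, else None."""
    for proxy in proxy_list:
        try:
            resp = requests.get(url,
                                proxies={"http": proxy, "https": proxy},
                                timeout=timeout)
            if resp.status_code == 200:
                return resp
            # 403 or any other bad status: fall through to the next proxy
        except requests.RequestException as e:
            print(f"Proxy failed, trying the next one: {e}")
    return None
```

If it returns `None`, every proxy in the list failed and it's time to fetch a fresh batch from the API.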
Proxy IP Rotation Tips
Using a single IP gets you spotted quickly; you have to learn to fight guerrilla-style. Here's a simple rotation scheme:
```python
import requests
from itertools import cycle

proxy_pool = cycle(proxies)  # `proxies` is the list you fetched earlier

for page in range(1, 100):
    current_proxy = next(proxy_pool)
    try:
        res = requests.get(url,  # `url` is your target, defined elsewhere
                           proxies={"http": current_proxy, "https": current_proxy},
                           timeout=10)
        # process the data here...
    except requests.RequestException:
        print(f"Skipping failed proxy: {current_proxy}")
```
This cycles through the IPs in the proxy pool automatically. I recommend proactively changing your IP every 3-5 successful requests, rather than waiting until you're blocked.
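That "rotate every few successes" idea can be expressed as a small generator. A minimal sketch (the function name and the default of 4 requests per IP are my own choices):

```python
from itertools import cycle

def rotating_proxies(proxy_list, every=4):
    """Yield a proxy per request, advancing to the next IP every `every` requests."""
    pool = cycle(proxy_list)
    current = next(pool)
    count = 0
    while True:
        yield current
        count += 1
        if count >= every:   # used this IP enough times; rotate proactively
            current = next(pool)
            count = 0

# Usage: call next(gen) before each request
# gen = rotating_proxies(proxies)
# proxy = next(gen)
```

Unlike a plain `cycle`, this lets one IP serve a short burst of requests before rotating, which looks more like normal browsing than switching on every single request.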
Common Failure Scenarios: Q&A
Q: Why is it still blocked even after hanging the proxy?
A: Two possibilities: 1. the target site detected anomalous HTTP headers; 2. the proxy IP quality is poor. It's recommended to add a random User-Agent in your code, and to switch to ipipgo's static residential IPs (more expensive but more stable).
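Adding a random User-Agent is a one-liner once you have a pool of strings. A minimal sketch (the sample UA strings below are just illustrative; use a larger, up-to-date list in practice):

```python
import random

# A few common desktop User-Agent strings (sample list; extend as needed)
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]

def random_headers():
    """Build request headers with a randomly chosen User-Agent."""
    return {"User-Agent": random.choice(USER_AGENTS)}

# Combine with the proxy config from earlier:
# requests.get(url, headers=random_headers(), proxies={"http": proxy, "https": proxy})
```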
Q: Proxy IP shows success but can't receive data?
A: 80% of the time the proxy server's whitelist isn't configured. Go to the ipipgo dashboard and add your local IP to the whitelist, or use their client mode, which is the least hassle.
Q: Do I need to change different agents for different sites?
A: For domestic sites, use local carrier IPs; for overseas sites, ipipgo's cross-border line is recommended. If you're crawling Google, remember to pick their TK dedicated package.
Package Selection Guide
Choose a package according to your business needs (prices may change; the official website is authoritative):
| Business Type | Recommended Package | Approx. Daily Cost |
|---|---|---|
| Data acquisition | Dynamic residential (standard) | About $0.25/GB |
| Account registration | Static residential | About $1.16/IP |
| Overseas crawlers | Cross-border line | Contact customer service for a quote |
Lastly, even when using proxy IPs, comply with the site's robots.txt rules. If you run into a complex anti-crawling strategy, you can go straight to ipipgo technical support for a custom plan; they can combine different IP types for your specific business, which beats fumbling around on your own.
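Checking robots.txt doesn't have to be manual; Python's standard library ships a parser. A small sketch using `urllib.robotparser` (the example rules and URLs are made up for illustration):

```python
from urllib.robotparser import RobotFileParser

def allowed_by_robots(robots_txt, url, user_agent="*"):
    """Check a URL against the rules in a robots.txt string."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

rules = "User-agent: *\nDisallow: /private/\n"
print(allowed_by_robots(rules, "https://example.com/public/page"))   # True
print(allowed_by_robots(rules, "https://example.com/private/data"))  # False
```

In a real crawler you'd fetch `https://<site>/robots.txt` once, cache the parsed rules, and consult them before each request.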

