
Hands-On Python Crawler to Avoid Bans
Anyone who has spent time writing crawlers has hit this wall: the target site suddenly bans your IP. Last week I helped a friend scrape some e-commerce data; after half an hour of running, all he harvested was a pile of 502 errors, and he nearly smashed his keyboard. Time to bring out the lifesaver: proxy IP rotation.
How does a proxy IP act as a crawler's bodyguard?
Simply put, it makes the website believe each request comes from a different machine. It's like playing a shooter with a voice changer: your opponent can never pin down your real location. One key point up front: don't use free proxies. I tested a free proxy pool last year; only 3 of 20 IPs worked, and the latency was long enough to cook a bowl of instant noodles.
| Proxy Type | Availability | Latency | Stability |
|---|---|---|---|
| Free proxies | <15% | 3000ms+ | May drop you at any time |
| ipipgo commercial proxies | >99% | Under 200ms | Stable 7×24 |
Practical code: an invisibility cloak for your crawler
Here's a demo using the requests library, focusing on the proxy setup. Remember to replace your_api_key with the real key from the ipipgo dashboard:
```python
import requests
from random import choice

# Proxy pool from ipipgo
def get_proxies():
    api_url = "https://api.ipipgo.com/fetch?key=your_api_key"
    resp = requests.get(api_url).json()
    return [f"http://{ip}:{port}" for ip, port in resp['data']]

proxies_pool = get_proxies()

# Request helper with automatic IP rotation
def smart_request(url, retries=3):
    try:
        proxy = {'http': choice(proxies_pool)}
        resp = requests.get(url, proxies=proxy, timeout=10)
        return resp.text
    except Exception as e:
        print(f"Request failed: {e}, switching IP and retrying")
        if retries > 0:
            return smart_request(url, retries - 1)  # auto-retry with a fresh IP
        raise

# Example: crawl a product page
data = smart_request("https://target-site.com/product/123")
```
There are three key points in this routine:
- Random IP per request - guerrilla warfare; the site never knows where the next hit comes from.
- Automatic retry on failure - when an IP dies, swap in a fresh one immediately.
- A timeout setting - don't waste time waiting on a laggy proxy.
Pitfall guide: the mines 90% of newbies step on
1. Rotating IPs at the wrong frequency: Don't twitch through IPs like you have Parkinson's, but don't ride one IP into the ground either. Tune the pace to the site's anti-bot strength; rotating every 5-10 minutes is a reasonable default.
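As a sketch of that rhythm (the class name and the time-window approach are my own choices, not anything from ipipgo's API), a small helper can hold one IP for a fixed window and only rotate once the window expires:

```python
import time
from random import choice

class TimedRotator:
    """Hold one proxy for a fixed window, then rotate to a random new one."""
    def __init__(self, pool, window_seconds=300):  # ~5 minutes per IP
        self.pool = pool
        self.window = window_seconds
        self._current = choice(pool)
        self._since = time.monotonic()

    def current(self):
        # Rotate only when the current IP has been in use longer than the window
        if time.monotonic() - self._since > self.window:
            self._current = choice(self.pool)
            self._since = time.monotonic()
        return self._current
```

Each request then calls `rotator.current()` instead of picking randomly every time, which keeps the rotation pace steady.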
2. Headers not disguised: Changing your IP alone isn't enough; remember to send a random User-Agent too. Otherwise it's like changing your coat but not your shoes: you'll still be recognized.
```python
headers_pool = [
    {"User-Agent": "Mozilla/5.0 (Windows NT 10.0) AppleWebKit/537.36"},
    {"User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 16_6 like Mac OS X)"}
]
```
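Continuing the sketch (`disguised_kwargs` is a hypothetical helper of my own, not part of requests or ipipgo), you can pair a random proxy with a random User-Agent so both change together on every call:

```python
from random import choice

def disguised_kwargs(proxies_pool, headers_pool):
    """Pick one random proxy and one random User-Agent for a single
    request, so IP and browser fingerprint rotate in lockstep."""
    return {
        "proxies": {"http": choice(proxies_pool)},
        "headers": choice(headers_pool),
        "timeout": 10,
    }

# Usage sketch (network call, not run here):
# resp = requests.get(url, **disguised_kwargs(proxies_pool, headers_pool))
```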
3. Wrong proxy protocol: http and https proxies must be configured separately; mixing them up is like brushing your teeth with face wash. ipipgo's proxies support both protocols, which saves a lot of headaches.
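To make that concrete: requests selects a proxy entry by the scheme of the URL being fetched, so a dict with only an `"http"` key silently skips the proxy for https:// pages. A tiny helper (my own naming) covers both:

```python
def dual_protocol_proxies(proxy_url):
    """Build a proxies dict covering both schemes, since requests
    picks the proxy entry by the target URL's scheme."""
    return {"http": proxy_url, "https": proxy_url}

# Usage sketch:
# requests.get("https://target-site.com", proxies=dual_protocol_proxies(proxy))
```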
Frequently Asked Questions
Q: What should I do if all the proxy IPs suddenly hang up?
A: Check your account balance first, then confirm the API address is correct. ipipgo also provides a backup endpoint at https://backup.ipipgo.com, which can save your life in an emergency.
Q: How can I tell if a proxy is actually working?
A: Add a verification step in your code. For example, request http://ip.ipipgo.com/checkip through the proxy; it returns the IP the request arrived from, so you can confirm the traffic really went through the proxy.
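One way to sketch that check (the comparison helper is my own; the live request is commented out because it needs the network):

```python
from urllib.parse import urlparse

def proxy_matches(proxy_url, reported_ip):
    """True when the IP echoed back by the check endpoint equals the
    proxy's own host, i.e. the request really went through the proxy."""
    return urlparse(proxy_url).hostname == reported_ip.strip()

# Live check, sketch only (requires network):
# resp = requests.get("http://ip.ipipgo.com/checkip",
#                     proxies={"http": proxy_url}, timeout=5)
# print(proxy_matches(proxy_url, resp.text))
```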
Q: How should a multi-threaded crawler manage its proxies?
A: Use a queue: each thread takes an IP from the queue and puts it back when done. ipipgo's API supports batch fetching, and pulling 200 IPs at once is plenty for 20 worker threads.
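A minimal sketch of that borrow-and-return pattern using the standard library (the pool contents and URLs are placeholders, and the actual requests.get call is commented out):

```python
import queue
import threading

def worker(proxy_q, url_q, results):
    while True:
        try:
            url = url_q.get_nowait()
        except queue.Empty:
            return                      # no URLs left, thread exits
        proxy = proxy_q.get()           # borrow an IP from the shared pool
        try:
            # resp = requests.get(url, proxies={"http": proxy}, timeout=10)
            results.append((url, proxy))  # placeholder for resp.text
        finally:
            proxy_q.put(proxy)          # return the IP for the next thread
        url_q.task_done()

proxy_q = queue.Queue()
for p in ["http://1.1.1.1:80", "http://2.2.2.2:80"]:
    proxy_q.put(p)

url_q = queue.Queue()
for i in range(5):
    url_q.put(f"https://target-site.com/product/{i}")

results = []
threads = [threading.Thread(target=worker, args=(proxy_q, url_q, results))
           for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because a borrowed IP is only returned after the request finishes, no two threads ever hammer the target through the same proxy at the same moment.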
Why ipipgo?
It has three killer features that won me over:
- Truly dedicated IP pool - unlike some vendors who claim "dedicated" but resell second-hand IPs
- City-level targeting - when you need regional data, you can get IPs from a specific city
- No wasted traffic - pay for what you use, not a monthly plan you can never use up
One last word about crawling responsibly: use a legitimate proxy service like ipipgo, set a reasonable request rate, and don't bring the target site to its knees. Technology is a double-edged sword; only used properly does it last.

