
A guide to building a proxy pool for crawlers.
Anyone who crawls for a living knows that a target site's anti-crawling defenses are like a game of whack-a-mole. Today we'll show you how to arm your crawler with ipipgo's proxy IP pool; personally tested, it cut the ban rate by roughly 80%. We'll split this into two camps: Scrapy veterans and Requests newcomers.
A Scrapy retrofit for veteran drivers
You only need to tinker with middlewares.py; here's a ready-to-use configuration template:
```python
import random
import time

import requests

class ProxyMiddleware:
    def __init__(self):
        self.proxy_api = "http://ipipgo.com/api/get?type=dynamic&count=10"

    def process_request(self, request, spider):
        # Refresh the IP pool every 5 minutes
        if not hasattr(spider, 'proxy_pool') or time.time() - spider.proxy_time > 300:
            spider.proxy_pool = requests.get(self.proxy_api).json()['data']
            spider.proxy_time = time.time()
        # Randomly pick a lucky IP
        proxy = random.choice(spider.proxy_pool)
        request.meta['proxy'] = f"http://{proxy['ip']}:{proxy['port']}"
```
Remember to enable this middleware in settings!
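A minimal sketch of that settings entry, assuming your project module is called yourproject (both the dotted path and the priority value 350 are placeholders; adjust them to your project):

```python
# settings.py -- module path and priority are placeholders for your own project
DOWNLOADER_MIDDLEWARES = {
    "yourproject.middlewares.ProxyMiddleware": 350,
}
```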
Here comes the key point: set the IP validity period to 3-5 minutes. ipipgo's dynamic residential package supports a customizable TTL, which matches this need exactly. Testing also showed that the city-level targeting feature effectively reduces risk-control triggers from logins in unexpected locations.
Fancy moves for the Requests crowd
Single-threaded players, this one's for you: a lazy rotation method.
```python
from itertools import cycle

import requests

def get_proxies():
    # Generate the API link in the ipipgo backend and paste it below
    data = requests.get('ipipgo backend link').json()
    return [f"{item['ip']}:{item['port']}" for item in data]

proxy_pool = cycle(get_proxies())

url = "https://example.com"  # your target page
while True:
    current_proxy = next(proxy_pool)
    try:
        res = requests.get(url, proxies={
            "http": f"http://{current_proxy}",
            "https": f"http://{current_proxy}",
        }, timeout=10)
        break  # success, stop rotating
    except requests.RequestException:
        print(f"{current_proxy} flopped, moving on to the next one!")
```
Remember to add a proper retry mechanism in the exception handling. ipipgo's static residential IPs suit scenarios that need long sessions, such as scraping data behind a login.
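As a rough sketch of that retry mechanism (fetch_with_retries and the cap of 5 attempts are my own inventions, not anything from ipipgo):

```python
from itertools import cycle

import requests

MAX_RETRIES = 5  # arbitrary cap, tune to your tolerance

def fetch_with_retries(url, proxy_pool):
    # Try up to MAX_RETRIES proxies before giving up
    for attempt in range(1, MAX_RETRIES + 1):
        current_proxy = next(proxy_pool)
        try:
            return requests.get(url, proxies={
                "http": f"http://{current_proxy}",
                "https": f"http://{current_proxy}",
            }, timeout=10)
        except requests.RequestException:
            print(f"attempt {attempt}: {current_proxy} failed, rotating")
    raise RuntimeError("all retries exhausted, refresh the pool")
```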
Pitfall guide (Q&A)
Q: What should I do if my proxy IP is not working?
A: First check your package type; dynamic residential IPs default to a 1-minute TTL. It's worth adding a liveness check in your code that automatically switches proxies after 30 seconds with no response. ipipgo's enterprise package can extend the TTL to 30 minutes!
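A liveness check can be as simple as this sketch (the httpbin probe URL is my stand-in; any stable endpoint you trust works):

```python
import requests

def is_alive(proxy, timeout=30):
    # Probe through the proxy; no response within `timeout` seconds = dead
    try:
        requests.get("https://httpbin.org/ip", proxies={
            "http": f"http://{proxy}",
            "https": f"http://{proxy}",
        }, timeout=timeout)
        return True
    except requests.RequestException:
        return False
```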
Q: Will multiple crawlers running at the same time fight over IPs?
A: Use account-level isolation. The ipipgo backend lets you create sub-accounts and assign each crawler its own key, so they won't crowd each other out.
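I haven't verified how ipipgo encodes sub-account keys in its API links, but the idea looks roughly like this (the key query parameter and all key values below are pure assumptions; check your dashboard for the real format):

```python
# Hypothetical sub-account keys created in the ipipgo backend
CRAWLER_KEYS = {
    "news_spider": "key-for-news",
    "price_spider": "key-for-prices",
}

def pool_api_for(spider_name):
    # The 'key' query parameter is an assumed API format
    return ("http://ipipgo.com/api/get?type=dynamic&count=10"
            f"&key={CRAWLER_KEYS[spider_name]}")
```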
Q: What should I do if I am bombarded with CAPTCHAs?
A: Two options: 1) switch to static residential IPs; 2) add device fingerprints to your request headers. ipipgo's TikTok solution includes a device-emulation module you can use as a reference.
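For option 2, a bare-bones version is just sending a consistent set of browser-like headers per session; the values below are illustrative, not taken from ipipgo's module:

```python
import requests

# Illustrative fingerprint headers; keep one consistent set per session
headers = {
    "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/120.0.0.0 Safari/537.36"),
    "Accept-Language": "en-US,en;q=0.9",
    "Sec-Ch-Ua-Platform": '"Windows"',
}
res = requests.get("https://example.com", headers=headers, timeout=10)
```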
Which package should I choose?
Match the package to your business scenario:
| Use case | Recommended package | Advantage |
|---|---|---|
| Routine data collection | Dynamic residential (standard) | $0.5/GB, automatic rotation |
| Long-term monitoring tasks | Static residential | Fixed IP, valid for 7 days |
| Enterprise crawlers | Dynamic residential (business) | Dedicated IP pool + custom protocols |
One trick I discovered recently: enable protocol splitting in the ipipgo backend settings to route HTTP and HTTPS requests through separate IP pools. It improved collection speed by about 20%, personally tested and especially effective for e-commerce price monitoring!
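On the client side, the effect is that HTTP and HTTPS traffic draw from two different pools; a sketch of what that could look like (both pool addresses below are placeholders):

```python
from itertools import cycle

import requests

# Placeholders: in practice, generate one API link per protocol
# in the ipipgo backend after enabling protocol splitting
http_pool = cycle(["1.2.3.4:8000"])
https_pool = cycle(["5.6.7.8:8000"])

res = requests.get("https://example.com", proxies={
    "http": f"http://{next(http_pool)}",
    "https": f"http://{next(https_pool)}",
}, timeout=10)
```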
Lastly, don't waste your time on free proxies. I've tested cheap proxies bought from somebay before: 8 out of 10 were blacklisted IPs. You might as well use ipipgo's newbie trial pack; the first 2GB is free anyway.

