
Hands-On with Python: Processing Proxy IP Data
Anyone who writes crawlers knows that good proxy IPs save a lot of trouble. Today we'll walk through how to handle proxy IP data in Python, focusing on the pitfalls that are easiest to fall into.
The three-step data cleaning routine
Don't rush to use raw proxy IP data; fill in these three pits first:
```python
import re

def clean_proxy(proxy_str):
    # Remove surrounding whitespace
    proxy = proxy_str.strip()
    # Validate the ip:port format
    if not re.match(r'^\d+\.\d+\.\d+\.\d+:\d+$', proxy):
        return None
    # Split and check the port range
    ip, port = proxy.split(':')
    if not (0 <= int(port) <= 65535):
        return None
    return f"{ip}:{port}"
```
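A quick sketch of running the cleaner over a raw list (clean_proxy is re-declared here so the snippet runs on its own; the sample addresses are made up):

```python
import re

def clean_proxy(proxy_str):
    # Same logic as above: strip, validate ip:port, range-check the port
    proxy = proxy_str.strip()
    if not re.match(r'^\d+\.\d+\.\d+\.\d+:\d+$', proxy):
        return None
    ip, port = proxy.split(':')
    if not (0 <= int(port) <= 65535):
        return None
    return f"{ip}:{port}"

raw = ['  1.2.3.4:8080 ', 'not-a-proxy', '5.6.7.8:99999']
# Drop anything the cleaner rejects
cleaned = [p for p in (clean_proxy(r) for r in raw) if p]
print(cleaned)  # → ['1.2.3.4:8080']
```

Bad formats and out-of-range ports are silently dropped, which is usually what you want for bulk lists.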
Note that no actual connectivity test happens here: batch checking should be done asynchronously, which is covered in the next section.
Testing survival rates in practice
aiohttp is recommended for asynchronous checking; it is more than 10x faster than synchronous requests:
```python
import aiohttp
import asyncio

async def check_proxy(proxy):
    try:
        async with aiohttp.ClientSession(
            connector=aiohttp.TCPConnector(ssl=False),
            timeout=aiohttp.ClientTimeout(total=5)
        ) as session:
            async with session.get(
                'http://httpbin.org/ip',
                proxy=f'http://{proxy}'
            ) as response:
                return proxy if response.status == 200 else None
    except Exception:
        return None
```
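For batch detection, the checker above can be fanned out with asyncio.gather plus a semaphore to cap concurrency. A minimal sketch — check_batch and the stub checker are illustrative names; in practice you would pass check_proxy as the checker:

```python
import asyncio

async def check_batch(proxies, checker, limit=100):
    # Cap concurrent checks so we don't exhaust sockets
    sem = asyncio.Semaphore(limit)

    async def guarded(p):
        async with sem:
            return await checker(p)

    results = await asyncio.gather(*(guarded(p) for p in proxies))
    # Keep only the proxies that passed the check
    return [p for p in results if p]

# Demo with a stub checker (no network needed); swap in check_proxy for real use
async def stub(proxy):
    await asyncio.sleep(0)
    return proxy if proxy.endswith(':8080') else None

alive = asyncio.run(check_batch(['1.1.1.1:8080', '2.2.2.2:80'], stub))
print(alive)  # → ['1.1.1.1:8080']
```

The semaphore limit is the main tuning knob: too high and you hit local socket limits, too low and the speed advantage over synchronous checking disappears.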
It's better to point the test URL at something related to your business; verifying against ipipgo's API, for example, gives more accurate results.
Proxy Pool Maintenance Tips
Redis is recommended for storage; it's much more reliable than flat files:
```python
import time
import redis

class ProxyPool:
    def __init__(self):
        self.conn = redis.Redis(host='localhost', port=6379)

    def add_proxy(self, proxy):
        # Score each proxy by the time it was added
        self.conn.zadd('proxies', {proxy: int(time.time())})

    def get_proxy(self):
        # Return the oldest proxy in the pool
        return self.conn.zrange('proxies', 0, 0)[0].decode()
```
Remember to clean up expired proxies regularly; running a maintenance script once an hour is a good baseline.
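Since the sorted-set scores are add timestamps, the hourly cleanup reduces to one ZREMRANGEBYSCORE call. A minimal sketch — purge_expired is an illustrative name, and the 'proxies' key matches the pool above:

```python
import time

def purge_expired(conn, max_age=3600, now=None):
    # conn is any redis-py-style client; scores are the added-at timestamps
    if now is None:
        now = time.time()
    # Remove every proxy whose score (added-at time) is older than max_age
    return conn.zremrangebyscore('proxies', '-inf', now - max_age)

# In production: purge_expired(redis.Redis(host='localhost', port=6379))
```

Returning the removal count makes it easy to log how fast the pool is churning.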
How to choose an ipipgo package
| Package type | Suitable scenarios | Price |
|---|---|---|
| Dynamic residential (Standard) | General crawling / data collection | 7.67 Yuan/GB |
| Dynamic residential (Business) | High-frequency access | 9.47 Yuan/GB |
| Static residential | Scenarios needing a fixed IP | 35 Yuan/IP |
If you need long-term stable IPs, go straight for the static residential package; veterans running e-commerce operations swear by it.
A quick guide to frequently asked questions
Q: What should I do when a proxy suddenly fails?
A: Use a dual proxy-pool rotation mechanism, and call ipipgo's API to automatically replenish fresh IPs.
Q: How can I increase the proxy success rate?
A: Three key points: 1. set a reasonable timeout (3-5 seconds); 2. rotate the User-Agent; 3. avoid high-frequency visits from a single IP.
Q: How do I get past CAPTCHAs when I hit them?
A: Use ipipgo's TK dedicated proxy together with browser-fingerprint simulation; in measured tests the CAPTCHA trigger rate dropped to 60% of the original.
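The success-rate checklist above (timeout, User-Agent rotation, proxy assignment) can be bundled into one small helper. A minimal sketch assuming the requests library; the USER_AGENTS pool and request_kwargs name are illustrative:

```python
import random

# Hypothetical UA pool — replace with your own rotation list
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)',
]

def request_kwargs(proxy, timeout=5):
    # Build keyword arguments applying the checklist: rotated UA,
    # per-request proxy, and a bounded timeout
    return {
        'headers': {'User-Agent': random.choice(USER_AGENTS)},
        'proxies': {'http': f'http://{proxy}', 'https': f'http://{proxy}'},
        'timeout': timeout,
    }

# Usage: requests.get(url, **request_kwargs('1.2.3.4:8080'))
```

Centralizing these settings in one place also makes it easy to tighten the timeout or swap the UA pool later without touching every call site.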
One last hidden trick: under high concurrency, mix dynamic residential and static residential proxies; that keeps costs under control while ensuring stability. If you need a concrete setup, go straight to ipipgo's technical support for a configuration template; their 1-on-1 customized service is genuinely reliable.

