
Why do you need to build your own IP proxy pool?
If you've done any web crawling or data scraping, you've probably run into the embarrassment of having your IP blocked by the target site. Sometimes there's nothing wrong with your code, but the server on the other side decides your IP is visiting too frequently and throttles you or blacklists you outright. If you have a large number of IPs to rotate through at that point, it feels like playing a game with an infinite-lives cheat: much more reassuring.
Building your own proxy IP pool is, to put it bluntly, setting up an "IP warehouse". The warehouse can automatically fetch fresh IPs from a provider such as ipipgo, check which IPs have "gone bad" (e.g. been banned by the target site or timed out), and replace them promptly. Your program can then take IPs from the pool randomly or sequentially, greatly reducing the risk of any single IP being identified and blocked. It's also far less worrying than hunting down free proxy IPs by hand: the quality of free IPs varies wildly, eight out of ten may be dead, and it wastes time.
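To make "randomly or sequentially" concrete, here is a minimal sketch in plain Python, with a hypothetical hard-coded list standing in for the pool we'll actually build below:

import random
from itertools import cycle

# Hypothetical stand-in for the pool; in the real build these records come from Redis
pool = ['1.2.3.4:8080', '5.6.7.8:8080', '9.10.11.12:8080']

print(random.choice(pool))  # random rotation: pick any member
rotation = cycle(pool)      # sequential rotation: round-robin through the members
print(next(rotation))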
Preparations before building
Before we start writing code, we need to lay in the basic provisions. Two things are essential: a reliable source of proxy IPs, and a place to store them.
1. Find a reliable proxy IP service provider
This is the most critical part. When you build your own proxy pool, the quality of the IP source directly determines the stability of the whole system. Here I recommend ipipgo's dynamic residential proxies. Why choose it? It has an exceptionally large IP pool, reportedly more than 90 million real home-network IPs around the world. Residential IPs like these are generally harder for target websites to recognize as proxies, so they're much stealthier. ipipgo's API is also quite easy to use: you can generate proxy links on demand, and it supports pay-per-traffic billing, so you use as much as you pay for, which is flexible. Just register an account on the official website; new users usually get a little free traffic for testing.
2. Choose a database
Redis is an in-memory database with fast reads and writes, which makes it especially suitable for storing proxy IP records that are accessed frequently and need low latency. We can keep valid IPs, ports, protocol types, last-validated times, and so on in Redis. If your project is small, SQLite or MySQL works too, but Redis is preferred for performance.
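As a hedged illustration, a single record in the pool might look like the dictionary below. The field names are this tutorial's own convention, not a format ipipgo dictates, and 203.0.113.25 is a documentation-only example address:

# One proxy record as we'll store it in Redis (serialized to JSON)
example_record = {
    'ip': '203.0.113.25',        # proxy host
    'port': 8080,                # proxy port
    'protocol': 'http',          # 'http', 'https', or 'socks5'
    'last_checked': 1700000000,  # Unix timestamp of the last successful validation
}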
Preparation checklist:
- Python 3.6 or above
- Redis server (install locally or use a cloud service)
- anipipgoaccount (to get the API key)
- A few essential Python libraries:
requests, redis, schedule (or another task-scheduling library); a typical install is pip install requests redis schedule
Core code step-by-step implementation
Next, let's build the code module by module. Don't worry, the code isn't complicated, and I've tried to write clear comments.
1. Get proxy IP from ipipgo
Let's write a function dedicated to calling the ipipgo API and pulling fresh proxy IPs into our program. You'll need to replace `your_api_key` below with the key found in your ipipgo account dashboard.
import requests

class IPFetcher:
    def __init__(self, api_key):
        self.api_key = api_key
        # Example endpoint for ipipgo's dynamic residential proxy API; check
        # the official documentation for the exact URL. Fetch 20 IPs at a time.
        self.api_url = f"https://api.ipipgo.com/dynamic/residential?key={api_key}&count=20"

    def fetch_ips(self):
        """Fetch a batch of proxy IPs from the ipipgo API."""
        try:
            response = requests.get(self.api_url, timeout=10)
            if response.status_code == 200:
                # Assuming the API returns JSON shaped like:
                # {'data': [{'ip': '1.2.3.4', 'port': 8080, 'protocol': 'http'}, ...]}
                data = response.json()
                ip_list = data.get('data', [])
                print(f"Successfully obtained {len(ip_list)} proxy IPs")
                return ip_list
            print(f"Failed to get IPs, status code: {response.status_code}")
            return []
        except Exception as e:
            print(f"Error getting IPs: {e}")
            return []
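A quick usage sketch, assuming the placeholder key is replaced with your real one:

# 'your_api_key_here' is a placeholder, not a working key
fetcher = IPFetcher('your_api_key_here')
fresh_ips = fetcher.fetch_ips()
print(f"Got {len(fresh_ips)} candidate IPs")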
2. Verify that the IP is valid
Not every IP that comes back is usable, so we need a "quality inspector". We'll write a function that uses the proxy IP to visit a test site (such as Baidu, or your target site's robots.txt); if the request succeeds with status code 200, the IP is currently good.
import requests

class IPValidator:
    # httpbin.org/ip echoes back the IP it sees, which is handy for testing
    def __init__(self, test_url='http://httpbin.org/ip'):
        self.test_url = test_url
        # Keep the timeout short so invalid IPs are weeded out quickly
        self.timeout = 5

    def is_valid(self, proxy_ip_info):
        """Validate whether a single proxy IP currently works."""
        # Build the proxies dict in the form requests expects, e.g.
        # {'http': 'http://1.2.3.4:8080', 'https': 'http://1.2.3.4:8080'}
        proxy_url = f"{proxy_ip_info['protocol']}://{proxy_ip_info['ip']}:{proxy_ip_info['port']}"
        proxy_dict = {'http': proxy_url, 'https': proxy_url}
        try:
            response = requests.get(self.test_url, proxies=proxy_dict, timeout=self.timeout)
            if response.status_code == 200:
                # Optionally inspect response.text to confirm the proxy IP is in use
                print(f"IP {proxy_ip_info['ip']} validated")
                return True
        except (requests.exceptions.ProxyError,
                requests.exceptions.ConnectTimeout,
                requests.exceptions.ReadTimeout):
            # Connection timeouts and proxy errors all count as invalid
            print(f"IP {proxy_ip_info['ip']} failed validation")
        return False
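For example, checking a single candidate (the record below is made up for illustration):

# Hypothetical candidate record for illustration only
candidate = {'ip': '203.0.113.25', 'port': 8080, 'protocol': 'http'}
validator = IPValidator()
print("keep" if validator.is_valid(candidate) else "discard")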
3. Storing and managing IP pools with Redis
Now let's create the "IP repository". We'll use Redis's Set data structure to store valid IPs, because a Set deduplicates automatically and can return random members, which is perfect for rotation.
import json
import redis

class RedisManager:
    def __init__(self, host='localhost', port=6379, db=0, password=None):
        self.redis_client = redis.Redis(host=host, port=port, db=db,
                                        password=password, decode_responses=True)
        # Two Sets: one for valid IPs (stored as JSON strings) and one for
        # invalid ones (kept for easy cleanup and auditing)
        self.valid_ip_pool_key = "proxy_pool:valid"
        self.invalid_ip_pool_key = "proxy_pool:invalid"

    def _serialize(self, ip_info):
        # sort_keys makes the JSON deterministic, so the same record always
        # maps to the same Set member regardless of dict key order
        return json.dumps(ip_info, sort_keys=True)

    def add_valid_ip(self, ip_info):
        """Add a valid IP to the pool."""
        # SADD deduplicates automatically if the member already exists
        self.redis_client.sadd(self.valid_ip_pool_key, self._serialize(ip_info))

    def get_random_ip(self):
        """Return a random IP record from the valid pool."""
        ip_json = self.redis_client.srandmember(self.valid_ip_pool_key)
        if ip_json:
            return json.loads(ip_json)
        return None

    def mark_ip_invalid(self, ip_info):
        """Mark an IP as invalid: move it from the valid pool to the invalid pool."""
        ip_json = self._serialize(ip_info)
        # Use a pipeline so the two operations execute atomically
        pipe = self.redis_client.pipeline()
        pipe.srem(self.valid_ip_pool_key, ip_json)    # remove from the valid set
        pipe.sadd(self.invalid_ip_pool_key, ip_json)  # add to the invalid set
        pipe.execute()

    def get_pool_size(self):
        """Return the number of valid IPs currently in the pool."""
        return self.redis_client.scard(self.valid_ip_pool_key)
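Exercising the manager might look like this, assuming a local Redis on the default port:

# Assumes a local Redis instance listening on the default port 6379
mgr = RedisManager()
mgr.add_valid_ip({'ip': '203.0.113.25', 'port': 8080, 'protocol': 'http'})  # example record
print(mgr.get_pool_size())  # -> 1
print(mgr.get_random_ip())  # -> the record we just added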
4. Linking the above modules
Let's write a scheduler that runs the "fetch IP -> validate IP -> store" process at regular intervals. We also need to periodically re-check the IPs already in the repository for any that have "gone stale" (failed) and clean them up promptly.
import time
import json
import schedule  # requires: pip install schedule

class ProxyPoolScheduler:
    def __init__(self, api_key, redis_host='localhost', redis_port=6379):
        self.fetcher = IPFetcher(api_key)
        self.validator = IPValidator()
        self.redis_mgr = RedisManager(host=redis_host, port=redis_port)

    def refresh_pool_task(self):
        """Scheduled task: fetch new IPs, validate them, and add them to the pool."""
        print("Starting the refresh-IP-pool task...")
        new_ips = self.fetcher.fetch_ips()
        for ip_info in new_ips:
            if self.validator.is_valid(ip_info):
                self.redis_mgr.add_valid_ip(ip_info)
        print(f"Current number of valid IPs in the pool: {self.redis_mgr.get_pool_size()}")

    def validate_existing_ips_task(self):
        """Scheduled task: re-validate the IPs already in the pool."""
        print("Starting to validate the existing IP pool...")
        # Note: iterating over the whole Set gets slow once the pool is large;
        # a production deployment should batch this (e.g. with SSCAN)
        all_valid_ips_json = self.redis_mgr.redis_client.smembers(self.redis_mgr.valid_ip_pool_key)
        for ip_json in all_valid_ips_json:
            ip_info = json.loads(ip_json)
            if not self.validator.is_valid(ip_info):
                print(f"Failed IP found: {ip_info['ip']}, moving it out of the pool")
                self.redis_mgr.mark_ip_invalid(ip_info)
        print(f"Validation complete, current valid IP pool size: {self.redis_mgr.get_pool_size()}")

    def run(self):
        """Start the scheduled tasks."""
        # Refresh the IP pool every 10 minutes
        schedule.every(10).minutes.do(self.refresh_pool_task)
        # Re-check the existing IPs every 5 minutes
        schedule.every(5).minutes.do(self.validate_existing_ips_task)
        print("The proxy IP pool scheduler has started...")
        while True:
            schedule.run_pending()
            time.sleep(1)

# Example usage
if __name__ == '__main__':
    # Replace with your ipipgo API key
    YOUR_IPIPGO_API_KEY = "your_ipipgo_api_key_here"
    scheduler = ProxyPoolScheduler(api_key=YOUR_IPIPGO_API_KEY)
    scheduler.run()
How to call proxy pools in a crawler project
The pool is built; how do you use it? It's simple: in your crawler code, before sending a request, take a random proxy IP from Redis and attach it.
import requests
from redis_manager import RedisManager  # the Redis management class written above

redis_mgr = RedisManager()  # connect to local Redis

def get_proxy_from_pool():
    """Return (proxies_dict, raw_ip_info) so the caller can mark the IP invalid later."""
    ip_info = redis_mgr.get_random_ip()
    if ip_info:
        proxy_url = f"{ip_info['protocol']}://{ip_info['ip']}:{ip_info['port']}"
        return {'http': proxy_url, 'https': proxy_url}, ip_info
    print("Proxy pool is empty!")
    return None, None

# In your crawler request, use it like this:
url = "the target URL you want to crawl"
proxies, ip_info = get_proxy_from_pool()
try:
    response = requests.get(url, proxies=proxies, timeout=10)
    if response.status_code == 200:
        # Process the successful response
        print("Request successful!")
    else:
        # If the request fails, consider marking the IP as invalid
        if ip_info:
            redis_mgr.mark_ip_invalid(ip_info)
        print("Request failed, the IP may be invalid")
except requests.exceptions.RequestException as e:
    print(f"An exception occurred with the request: {e}")
    # Likewise, consider marking the IP as invalid here
    if ip_info:
        redis_mgr.mark_ip_invalid(ip_info)
Frequently Asked Questions (QA)
Q1: Why do I still get blocked by the target site even though the IP passed validation against the test site?
A1: This is normal. Test sites (e.g. httpbin.org) generally have very loose anti-crawling policies, but your target site may use more complex detection mechanisms, such as checking the User-Agent, visit frequency, and behavioral patterns. Changing IPs alone isn't enough; combine it with a randomized UA, reasonable request intervals (sleep), and similar measures.
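A minimal sketch of pairing the proxy with a random User-Agent and a polite delay; the UA strings below are just examples, so swap in a longer, up-to-date list in practice:

import random
import time
import requests

# Example User-Agent strings for illustration
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
]

def polite_get(url, proxies):
    headers = {'User-Agent': random.choice(USER_AGENTS)}
    time.sleep(random.uniform(1, 3))  # randomized pause between requests
    return requests.get(url, headers=headers, proxies=proxies, timeout=10)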
Q2: What should I do if the IPs in the pool keep expiring quickly?
A2: It means the quality of your IP source may be low, or the target website is banning IPs aggressively. Consider upgrading to a higher-quality proxy service, such as ipipgo's static residential proxies, which have a longer IP lifecycle and better stability. You can also shorten the pool's validation cycle so failed IPs are eliminated, and fresh ones brought in, faster.
Q3: What should I do if I can't connect to Redis?
A3: First, check whether the Redis service is running. If it's local, confirm Redis is installed and listening on the default port 6379. If it's a remote server, check that the host address, port, and password are correct, and that the server's firewall allows traffic on the Redis port.
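A quick connectivity check with redis-py (adjust host, port, and password to your setup):

import redis

client = redis.Redis(host='localhost', port=6379, password=None)
try:
    client.ping()  # raises ConnectionError if Redis is unreachable
    print("Redis is reachable")
except redis.exceptions.ConnectionError as e:
    print(f"Cannot connect to Redis: {e}")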
Q4: How much concurrency can this pool support?
A4: This simple version may become a bottleneck when fetching IPs from Redis under high concurrency (e.g. hundreds or thousands of requests per second). For high-concurrency scenarios, consider periodically loading a batch of valid IPs into an in-memory queue inside your program, fetching directly from memory, and updating the state in Redis asynchronously; this can greatly improve performance.
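As a hedged sketch of that idea: preload a batch of proxies from Redis into an in-memory deque and refill it only when the batch runs dry, so the hot path never waits on Redis (LocalProxyCache is our own illustrative name, built on the RedisManager above):

import json
from collections import deque

class LocalProxyCache:
    """Keep a batch of proxies in process memory; hit Redis only on refill."""
    def __init__(self, redis_mgr, batch_size=50):
        self.redis_mgr = redis_mgr
        self.batch_size = batch_size
        self.queue = deque()

    def _refill(self):
        # srandmember with a count returns up to batch_size random members
        batch = self.redis_mgr.redis_client.srandmember(
            self.redis_mgr.valid_ip_pool_key, self.batch_size)
        self.queue.extend(json.loads(item) for item in batch or [])

    def get(self):
        if not self.queue:
            self._refill()
        return self.queue.popleft() if self.queue else None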
To summarize
Building your own proxy IP pool isn't hard; the core is the closed loop of "fetch - validate - store - use". The key is to choose a stable, high-quality IP source like ipipgo, which saves you a great deal of maintenance trouble. The code in this article is a basic framework; you can extend it to fit your project's needs, for example by adding an IP scoring mechanism (prioritizing fast-responding IPs), supporting different proxy protocols, or turning it into a distributed deployment. I hope this tutorial helps you shake off IP-ban troubles and makes your data-scraping work go more smoothly.

