Python IP Proxy Pool Setup Tutorial: Building a Scalable Rotation System from Scratch

Why do you need to build your own IP proxy pool?

If you do web scraping or data collection, you have probably run into the awkward situation of having your IP blocked by the target site. Sometimes the code itself is clearly fine, but the remote server sees your IP hitting it too frequently and throttles you or blacklists you outright. At times like that, having a large supply of IPs to rotate through feels like playing a game with unlimited continues: much more reassuring.

Building your own proxy IP pool is, plainly put, setting up an "IP warehouse". The warehouse automatically fetches fresh IPs from a provider such as ipipgo, detects which IPs have "gone bad" (e.g. banned by the target site, or timing out), and replaces them promptly. Your program can then draw IPs from the pool randomly or in sequence, greatly reducing the risk of any single IP being identified and blocked. That is far less worry than hunting for free proxy IPs by hand: the quality of free IPs varies wildly, maybe eight out of ten are dead, and they waste your time.

Pre-build preparations

Before we start writing code, we need to lay in supplies. Two things are essential: a reliable source of proxy IPs, and a place to store them.

1. Find a reliable proxy IP service provider
This is the most critical step. When you build your own pool, the quality of the IP source directly determines the stability of the whole system. Here I recommend ipipgo's dynamic residential proxies. Why? It has an exceptionally large pool, said to contain more than 90 million real home-network IPs worldwide. Residential IPs like these are generally harder for target websites to recognize as proxies, so they are much stealthier. ipipgo's API is also quite easy to use: you can generate proxy links on demand, and it supports pay-per-traffic billing, so you use as much as you pay for, which is more flexible. Just register an account on the official website; new users usually get a little free traffic for testing.

2. Choose a database
Redis is an in-memory database with fast reads and writes, which makes it especially well suited to storing proxy IP information that is accessed frequently and needs low latency. We can store valid IPs, ports, protocol types, last validation time, and so on in Redis. If your project is small, SQLite or MySQL will do, but Redis is preferred for performance.

Preparation checklist:

  • Python 3.6 or above
  • Redis server (install locally or use a cloud service)
  • an ipipgo account (to get the API key)
  • A few essential Python libraries: requests, redis, schedule (or another scheduled-task library)
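The three libraries above can be installed in one go with pip (package names shown are the PyPI names):

```shell
pip install requests redis schedule
```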

Core code step-by-step implementation

Next, let's build the code module by module. Don't worry: the code is not complicated, and I've tried to keep the comments clear.

1. Get proxy IP from ipipgo

Let's write a class dedicated to calling the ipipgo API to pull fresh proxy IPs into our program. Replace the API key placeholder below with the key found in your ipipgo dashboard.

import requests

class IPFetcher:
    def __init__(self, api_key):
        self.api_key = api_key
        # Example endpoint for ipipgo's dynamic residential proxy API;
        # check the official documentation for the exact URL.
        self.api_url = f"https://api.ipipgo.com/dynamic/residential?key={api_key}&count=20"  # fetch 20 at a time

    def fetch_ips(self):
        """Fetch a batch of proxy IPs from the ipipgo API."""
        try:
            response = requests.get(self.api_url, timeout=10)
            if response.status_code == 200:
                # Assuming the API returns JSON containing a list of IPs, e.g.
                # {'data': [{'ip': '1.2.3.4', 'port': 8080, 'protocol': 'http'}, ...]}
                data = response.json()
                ip_list = data.get('data', [])
                print(f"Successfully obtained {len(ip_list)} proxy IPs")
                return ip_list
            else:
                print(f"Failed to get IPs, status code: {response.status_code}")
                return []
        except Exception as e:
            print(f"Error getting IPs: {e}")
            return []

2. Verify that the IP is valid

Not every IP the API returns is usable, so we need a "quality inspector". We'll write a method that uses the proxy IP to visit a test site (such as Baidu, or your target site's robots.txt); if the visit succeeds with status code 200, the IP is currently good.

import requests

class IPValidator:
    def __init__(self, test_url='http://httpbin.org/ip'):  # a test site that echoes your IP
        self.test_url = test_url
        self.timeout = 5  # keep the timeout short so dead IPs are weeded out quickly

    def is_valid(self, proxy_ip_info):
        """Validate whether a single proxy IP works."""
        # Build the proxies dict, e.g. {'http': 'http://1.2.3.4:8080', 'https': 'http://1.2.3.4:8080'}
        proxy_url = f"{proxy_ip_info['protocol']}://{proxy_ip_info['ip']}:{proxy_ip_info['port']}"
        proxy_dict = {'http': proxy_url, 'https': proxy_url}
        try:
            response = requests.get(self.test_url, proxies=proxy_dict, timeout=self.timeout)
            if response.status_code == 200:
                # You can print the response body to confirm the proxy IP was really used
                print(f"IP {proxy_ip_info['ip']} validated")
                return True
        except (requests.exceptions.ProxyError,
                requests.exceptions.ConnectTimeout,
                requests.exceptions.ReadTimeout):
            # Connection timeouts and proxy errors all count as invalid
            print(f"IP {proxy_ip_info['ip']} failed validation")
        return False

3. Storing and managing IP pools with Redis

Now let's create the "IP warehouse". We'll use a Redis Set data structure to store valid IPs, because a Set de-duplicates automatically and can return random members, which is perfect for rotation.

import redis
import json

class RedisManager:
    def __init__(self, host='localhost', port=6379, db=0, password=None):
        self.redis_client = redis.Redis(host=host, port=port, db=db,
                                        password=password, decode_responses=True)
        # Use two Sets: one for valid IPs (stored as JSON strings) and one
        # for invalid ones (kept around for easy cleanup)
        self.valid_ip_pool_key = "proxy_pool:valid"
        self.invalid_ip_pool_key = "proxy_pool:invalid"

    def add_valid_ip(self, ip_info):
        """Add a valid IP to the pool."""
        # Serialize the IP info dict to a JSON string for storage
        ip_json = json.dumps(ip_info)
        # SADD de-duplicates automatically if the member already exists
        self.redis_client.sadd(self.valid_ip_pool_key, ip_json)

    def get_random_ip(self):
        """Pick a random IP from the valid pool."""
        ip_json = self.redis_client.srandmember(self.valid_ip_pool_key)
        if ip_json:
            return json.loads(ip_json)
        return None

    def mark_ip_invalid(self, ip_info):
        """Mark an IP as invalid: move it from the valid pool to the invalid pool."""
        ip_json = json.dumps(ip_info)
        # Use a pipeline so the two operations execute atomically
        pipe = self.redis_client.pipeline()
        pipe.srem(self.valid_ip_pool_key, ip_json)    # remove from the valid set
        pipe.sadd(self.invalid_ip_pool_key, ip_json)  # add to the invalid set
        pipe.execute()

    def get_pool_size(self):
        """Get the current number of valid IPs in the pool."""
        return self.redis_client.scard(self.valid_ip_pool_key)

4. Linking the above modules

Let's write a scheduler to run the "fetch IPs -> validate IPs -> store" process at regular intervals. We also need to periodically re-check the IPs already in the warehouse for any that have "spoiled" (gone invalid) and clean them out promptly.

import time
import json
import schedule  # requires: pip install schedule

class ProxyPoolScheduler:
    def __init__(self, api_key, redis_host='localhost', redis_port=6379):
        self.fetcher = IPFetcher(api_key)
        self.validator = IPValidator()
        self.redis_mgr = RedisManager(host=redis_host, port=redis_port)

    def refresh_pool_task(self):
        """Scheduled task: fetch new IPs, validate them, and add them to the pool."""
        print("Starting the refresh-pool task...")
        new_ips = self.fetcher.fetch_ips()
        for ip_info in new_ips:
            if self.validator.is_valid(ip_info):
                self.redis_mgr.add_valid_ip(ip_info)
        print(f"Current valid IP pool size: {self.redis_mgr.get_pool_size()}")

    def validate_existing_ips_task(self):
        """Scheduled task: re-validate the IPs already in the pool."""
        print("Starting to validate the existing IP pool...")
        # Note: iterating the whole Set can get slow once the pool grows large;
        # production deployments should optimize this (e.g. scan in batches).
        all_valid_ips_json = self.redis_mgr.redis_client.smembers(self.redis_mgr.valid_ip_pool_key)
        for ip_json in all_valid_ips_json:
            ip_info = json.loads(ip_json)
            if not self.validator.is_valid(ip_info):
                print(f"Stale IP found: {ip_info['ip']}, removing it from the pool")
                self.redis_mgr.mark_ip_invalid(ip_info)
        print(f"Validation complete, current valid IP pool size: {self.redis_mgr.get_pool_size()}")

    def run(self):
        """Start the scheduled tasks."""
        # Refresh the IP pool every 10 minutes
        schedule.every(10).minutes.do(self.refresh_pool_task)
        # Re-check the validity of existing IPs every 5 minutes
        schedule.every(5).minutes.do(self.validate_existing_ips_task)

        print("The proxy IP pool scheduler has started...")
        while True:
            schedule.run_pending()
            time.sleep(1)

# Example usage
if __name__ == '__main__':
    # Replace with your ipipgo API key
    YOUR_IPIPGO_API_KEY = "your_ipipgo_api_key_here"
    scheduler = ProxyPoolScheduler(api_key=YOUR_IPIPGO_API_KEY)
    scheduler.run()

How to call proxy pools in a crawler project

The pool is built; how do we use it? It's simple: in your crawler code, before each request, grab a random proxy IP from Redis and use it.

import requests
from redis_manager import RedisManager  # the Redis manager class written above

def get_proxy_from_pool():
    redis_mgr = RedisManager()  # connect to local Redis
    ip_info = redis_mgr.get_random_ip()
    if ip_info:
        proxy_url = f"{ip_info['protocol']}://{ip_info['ip']}:{ip_info['port']}"
        return {'http': proxy_url, 'https': proxy_url}
    else:
        print("Proxy pool is empty!")
        return None

# Using the pool in a crawler request
url = "the target URL you want to crawl"
proxies = get_proxy_from_pool()

try:
    response = requests.get(url, proxies=proxies, timeout=10)
    if response.status_code == 200:
        # Process the successful response
        print("Request successful!")
    else:
        # If the request fails, consider marking the IP as invalid
        if proxies:
            # Mark the IP according to your own IP info structure; example omitted:
            # redis_mgr.mark_ip_invalid(corresponding_ip_info)
            print("Request failed, IP may be invalid.")
except requests.exceptions.RequestException as e:
    print(f"The request raised an exception: {e}")
    # Likewise, consider marking the IP as invalid here

Frequently Asked Questions

Q1: Why does the target site still block me, even though the IP passed validation against my test site?
A1: This is normal. Test sites (e.g. httpbin.org) generally have very loose anti-scraping policies, but your target site may use more sophisticated detection: User-Agent checks, visit frequency, behavioral patterns, and so on. Changing IPs alone is not enough; combine it with random User-Agents, reasonable request intervals (sleep), and similar measures.
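As a minimal sketch of the "random UA plus sleep" combination just mentioned (the User-Agent strings and interval bounds here are illustrative values, not anything prescribed by ipipgo):

```python
import random
import time

# Illustrative User-Agent strings -- swap in whatever set suits your targets.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]

def build_headers():
    """Pick a fresh random User-Agent for each request."""
    return {"User-Agent": random.choice(USER_AGENTS)}

def polite_sleep(min_s=1.0, max_s=3.0):
    """Wait a random interval between requests so the rhythm isn't mechanical."""
    time.sleep(random.uniform(min_s, max_s))
```

In the crawler loop you would then call something like `requests.get(url, headers=build_headers(), proxies=proxies)` followed by `polite_sleep()` before the next request.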

Q2: What should I do if the IPs in the pool keep expiring quickly?
A2: It means the quality of your IP source may be low, or the target site is blocking very aggressively. Consider upgrading to a higher-quality proxy IP service, such as ipipgo's static residential proxies, which have longer IP lifetimes and better stability. You can also shorten the pool's validation cycle to speed up eliminating dead IPs and replenishing fresh ones.

Q3: What should I do if I can't connect to Redis?
A3: First, check whether the Redis service is running. If it is local, check that Redis is installed and listening on the default port 6379. If it is on a remote server, check that the host address, port, and password are correct, and that the server's firewall allows the Redis port through.
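Before digging into redis-py settings, a quick standard-library check can tell you whether the Redis port is reachable at all (the helper name here is my own, not part of any library):

```python
import socket

def redis_port_open(host="localhost", port=6379, timeout=3):
    """Return True if a TCP connection to the Redis port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

If this returns False, the problem is the service, address, or firewall, not your Python code; if True, re-check the password and db parameters passed to `redis.Redis`.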

Q4: How much concurrency can this pool support?
A4: This simple version of the pool may become a bottleneck under high concurrency (e.g. hundreds or thousands of requests per second), since every request fetches an IP from Redis. For high-concurrency scenarios, consider periodically loading a batch of valid IPs into an in-memory queue inside your program, serving requests directly from memory, and updating Redis state asynchronously; this greatly improves throughput.
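A rough sketch of that in-memory batch idea might look like the following (the class and parameter names are invented for illustration; `refresh_fn` stands in for whatever function pulls a batch of valid IPs out of Redis):

```python
import itertools
import threading

class LocalProxyCache:
    """Keep a batch of proxies in process memory and rotate through them,
    so hot-path requests never hit Redis directly."""

    def __init__(self, refresh_fn, batch_size=50):
        self._refresh_fn = refresh_fn  # callable(batch_size) -> list of proxy dicts
        self._batch_size = batch_size
        self._lock = threading.Lock()
        self._cycle = iter(())
        self.refresh()

    def refresh(self):
        """Reload a fresh batch (call this from a timer, e.g. once a minute)."""
        batch = self._refresh_fn(self._batch_size)
        with self._lock:
            # itertools.cycle rotates through the batch endlessly
            self._cycle = itertools.cycle(batch) if batch else iter(())

    def get(self):
        """Return the next proxy in rotation, or None if the cache is empty."""
        with self._lock:
            return next(self._cycle, None)
```

Failed IPs would still be reported back to Redis (e.g. via `mark_ip_invalid`) asynchronously, so the shared pool stays accurate between refreshes.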

To summarize

Building a proxy IP pool yourself is not hard; the core is the closed loop of "fetch - validate - store - call". The key is choosing a stable, high-quality IP source such as ipipgo, which saves you a great deal of maintenance trouble. The code in this article is a basic framework that you can extend to fit your project: add an IP scoring mechanism (prioritizing fast-responding IPs), support more proxy protocols, or turn it into a distributed deployment. I hope this tutorial helps you get past IP-blocking headaches and makes your data collection work go more smoothly.
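For instance, the IP scoring idea could start as simply as tracking average response time per proxy and preferring the fastest ones; in Redis this maps naturally onto a sorted set (ZADD/ZRANGE). The class below is a hypothetical in-memory sketch, not part of the framework above:

```python
import heapq

class ScoredProxyPool:
    """Toy scoring sketch: lower average latency means higher priority."""

    def __init__(self):
        self._latency = {}  # proxy url -> (total_seconds, sample_count)

    def record(self, proxy, seconds):
        """Record one observed response time for a proxy."""
        total, n = self._latency.get(proxy, (0.0, 0))
        self._latency[proxy] = (total + seconds, n + 1)

    def best(self, k=1):
        """Return the k proxies with the lowest average latency."""
        return heapq.nsmallest(
            k, self._latency, key=lambda p: self._latency[p][0] / self._latency[p][1]
        )
```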

This article was originally published or compiled by ipipgo: https://www.ipipgo.com/en-us/ipdaili/48443.html
