
Hands-on: building a proxy pool that can carry the load
Anyone who crawls for a living knows that working without a reliable proxy pool is like riding a bicycle on the highway - you simply can't keep up. Free proxies on the market are as fickle as June weather: usable today, dead tomorrow. So here's a trick for everyone: build your own proxy pool with Scrapy + Redis, back it with a reliable ipipgo proxy package, and your crawler will cruise along as steadily as a veteran driver.
First, why build your own proxy pool at all?
1. Free proxies are mostly duds: nine out of ten don't work, and the survivors are probably slower than a turtle.
2. Commercial proxies are pricey: small projects can't afford volume-based billing that racks up charges at every turn.
3. Flexibility stays in your own hands: filter however you like, scale up or down whenever you want.
Getting ready to build
| Tool | Purpose |
|---|---|
| Scrapy | Crawl proxy list sites |
| Redis | Proxy storage + task scheduling |
| ipipgo account | Source of quality proxies |
The ipipgo configuration deserves special attention: grab the API endpoint from their dashboard. The Dynamic Residential IP package is the recommended choice, since those IPs are much less likely to be flagged as crawlers. The endpoint looks like this:
http://api.ipipgo.com/get?key=YOUR_KEY&count=50
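To see what wiring that in looks like, here's a minimal sketch that pulls a batch from the endpoint and parks it in Redis. It assumes the API returns plain text with one ip:port per line - check ipipgo's docs for the real response format - and YOUR_KEY is of course a placeholder:

```python
import redis
import requests

API_URL = "http://api.ipipgo.com/get?key=YOUR_KEY&count=50"  # YOUR_KEY is a placeholder
r = redis.Redis(host="localhost", port=6379, db=0)

def fetch_proxies():
    """Pull a batch of proxies from the ipipgo API and stash them in Redis."""
    resp = requests.get(API_URL, timeout=10)
    resp.raise_for_status()
    # Assumption: the API returns plain text, one "ip:port" per line.
    for line in resp.text.splitlines():
        proxy = line.strip()
        if proxy:
            r.sadd("raw_proxies", proxy)

if __name__ == "__main__":
    fetch_proxies()
    print(r.scard("raw_proxies"), "proxies in the raw pool")
```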
Four steps to build the core architecture
Step 1: Proxy acquisition
Write a Scrapy crawler that focuses on these three types of sources (a spider sketch for the first type follows the list):
- Public proxy list sites (watch out for staleness)
- The ipipgo API (the stable source)
- Proxy-sharing threads on industry forums (grab them as you find them)
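For the first category, a bare-bones spider might look like this. The site URL and CSS selectors are placeholders, since every list site lays out its table differently:

```python
import scrapy

class ProxyListSpider(scrapy.Spider):
    """Scrapes ip:port pairs from a public proxy list page."""
    name = "proxy_list"
    # Placeholder URL - swap in a real proxy list site.
    start_urls = ["https://example-proxy-list.com/"]

    def parse(self, response):
        # Assumption: each table row holds IP and port in its first
        # two cells; adjust the selectors for the actual site.
        for row in response.css("table tr"):
            ip = row.css("td:nth-child(1)::text").get()
            port = row.css("td:nth-child(2)::text").get()
            if ip and port:
                yield {"proxy": f"{ip.strip()}:{port.strip()}"}
```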
Step 2: Redis stores the data
Configure the Redis connection in settings.py and split storage across three keys (a pipeline sketch follows the list):
1. raw_proxies: freshly captured, unverified proxies
2. verified_proxies: proxies that have passed validation
3. bad_proxies: the blacklist of dead ones
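A minimal item pipeline that lands scraped proxies in the raw pool, assuming Redis runs locally on the default port:

```python
# pipelines.py - push scraped proxies into the raw_proxies set
import redis

class RedisProxyPipeline:
    def open_spider(self, spider):
        # Assumption: Redis on localhost with the default port.
        self.r = redis.Redis(host="localhost", port=6379, db=0)

    def process_item(self, item, spider):
        self.r.sadd("raw_proxies", item["proxy"])
        return item
```

Enable it via ITEM_PIPELINES in settings.py, using whatever module path your project actually has.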
Step 3: The validation middleware
Write a custom downloader middleware that grabs a random proxy from Redis before each request. One tip: tag your proxies by attribute (for example, store China Mobile and China Unicom IPs under separate keys) so you can pick the right kind for sites that are picky about carriers.
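A minimal sketch of such a middleware, assuming proxies live in the verified_proxies set and get demoted to bad_proxies when a request errors out:

```python
# middlewares.py - attach a random verified proxy to every request
import redis

class RandomProxyMiddleware:
    def __init__(self):
        self.r = redis.Redis(host="localhost", port=6379, db=0)

    def process_request(self, request, spider):
        # Pick one random member of the verified pool.
        proxy = self.r.srandmember("verified_proxies")
        if proxy:
            request.meta["proxy"] = "http://" + proxy.decode()

    def process_exception(self, request, exception, spider):
        # Demote a proxy that caused a connection error.
        proxy = request.meta.get("proxy", "").replace("http://", "", 1)
        if proxy:
            self.r.smove("verified_proxies", "bad_proxies", proxy)
```

Register it under DOWNLOADER_MIDDLEWARES in settings.py so Scrapy actually runs it.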
Step 4: Dynamic maintenance strategy
Set up two scheduled tasks (a validation sketch follows the list):
- Automatically clean out invalid proxies at 6 a.m. every day
- Test proxy quality every 2 hours
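A minimal validation pass that the 2-hour task could run. httpbin.org/ip is just a convenient test target; swap in any fast, stable endpoint:

```python
# validate.py - promote working proxies, blacklist dead ones
import redis
import requests

r = redis.Redis(host="localhost", port=6379, db=0)
TEST_URL = "http://httpbin.org/ip"  # any fast, stable endpoint works

def validate_pool():
    for raw in r.smembers("raw_proxies"):
        proxy = raw.decode()
        proxies = {"http": f"http://{proxy}", "https": f"http://{proxy}"}
        try:
            requests.get(TEST_URL, proxies=proxies, timeout=5)
            r.smove("raw_proxies", "verified_proxies", proxy)
        except requests.RequestException:
            r.smove("raw_proxies", "bad_proxies", proxy)

if __name__ == "__main__":
    validate_pool()
```

Hook it up to cron or whatever scheduler you already run; the 6 a.m. cleanup job can be the same pattern pointed at the verified pool.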
Use the scrapy-redis scheduler to get automatic request de-duplication. This one is particularly important and saves a lot of hassle.
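These are the standard scrapy-redis settings that switch on its scheduler and Redis-backed dupe filter; the Redis URL shown is just the usual local default:

```python
# settings.py - scrapy-redis scheduling with Redis-backed de-duplication
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER_PERSIST = True          # keep the queue across crawler restarts
REDIS_URL = "redis://localhost:6379/0"
```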
Common pitfalls and fixes
Q: What do I do when proxies keep failing out of nowhere?
A: ipipgo has a smart-switch feature: add &auto_switch=1 to the API parameters and it automatically swaps the IP on failure. Personally tested and effective!
Q: What if I get blocked mid-crawl?
A: Switch your ipipgo package to dynamic residential IPs so each request rotates to a random IP, and remember not to set the request interval in your code too aggressively!
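Scrapy's built-in throttling covers the "don't hammer them" part; these are stock Scrapy settings, with values you should tune per site:

```python
# settings.py - pace requests so the target site isn't hammered
DOWNLOAD_DELAY = 2                # base delay of 2 seconds between requests
RANDOMIZE_DOWNLOAD_DELAY = True   # jitter the delay (0.5x-1.5x) to look less robotic
CONCURRENT_REQUESTS_PER_DOMAIN = 4
```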
Q: Why does Redis keep blowing through its memory?
A: Give proxies an expiry so anything older than 6 hours is cleaned up automatically, and cap Redis memory. Run this in redis-cli:
CONFIG SET maxmemory 500mb
CONFIG SET maxmemory-policy allkeys-lru
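Plain Redis sets can't expire individual members, so one way to get the 6-hour cleanup is a companion sorted set that records when each proxy was added. This is a sketch, and "proxy_birth" is a hypothetical key name:

```python
# cleanup.py - drop proxies older than 6 hours from the pool
import time
import redis

r = redis.Redis(host="localhost", port=6379, db=0)
MAX_AGE = 6 * 3600  # six hours, per the tip above

def sweep():
    cutoff = time.time() - MAX_AGE
    # "proxy_birth" is a companion sorted set: member = proxy, score = time added.
    for raw in r.zrangebyscore("proxy_birth", 0, cutoff):
        proxy = raw.decode()
        r.srem("verified_proxies", proxy)
        r.zrem("proxy_birth", proxy)
```

For this to work, record r.zadd("proxy_birth", {proxy: time.time()}) whenever a proxy is promoted, so the sweep has timestamps to go on.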
Maintenance Tips
1. Manually check the ipipgo package balance once a week, so the supply doesn't get cut off mid-crawl.
2. Before a big sale event like Double Eleven, raise the package quota in the ipipgo dashboard ahead of time.
3. For important projects, a dedicated IP pool is worth buying: pricier, but genuinely stable!
Finally, to be honest: a self-built proxy pool takes some effort up front, but once it's running it really pays for itself. Backed by ipipgo's stable proxy source, it can handle about 90% of day-to-day collection needs. And if even that is too much trouble, they offer a ready-made proxy pool product where you fill in a config and use it directly - a good fit for friends whose projects are in a hurry.

