
How to Beat IP Bans in Your Python Crawler
Every crawler veteran has lived this scene: the program is humming along, then suddenly stalls, and the log is a wall of 429 and 503 errors. Don't smash the keyboard just yet: nine times out of ten, the target site has banned your IP. Today we'll walk through how to break out of this jam with the requests library plus proxy IPs.
Putting an Invisibility Cloak on Your Crawler
Using the requests library with proxies is like draping an invisibility cloak over your program; the key is the Session object. A quick example:
```python
import requests
from itertools import cycle

# Proxy pool from ipipgo
proxy_pool = cycle([
    "http://user:pass@gateway.ipipgo.com:8001",
    "http://user:pass@gateway.ipipgo.com:8002",
])

session = requests.Session()
proxy = next(proxy_pool)
session.proxies = {"http": proxy, "https": proxy}

# Send the request as usual
response = session.get("https://target-site.com/data")
```
A nice trick here: itertools.cycle gives you round-robin polling over a proxy pool, which is far more stable than relying on a single proxy. ipipgo proxies use authenticated URLs, so remember to replace user and pass with your own credentials.
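One caveat: setting session.proxies once means every request reuses the same proxy. To actually rotate, pull the next proxy from the pool on each request. A minimal sketch, reusing the placeholder gateway addresses from above:

```python
import requests
from itertools import cycle

proxy_pool = cycle([
    "http://user:pass@gateway.ipipgo.com:8001",
    "http://user:pass@gateway.ipipgo.com:8002",
])

def fetch(session, url, **kwargs):
    # Take the next proxy from the pool for this request only,
    # instead of pinning one proxy on the whole session.
    proxy = next(proxy_pool)
    return session.get(url, proxies={"http": proxy, "https": proxy}, **kwargs)
```

This way successive calls to fetch() go out through alternating gateways automatically.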
A Fallback Mechanism Matters
Even the best proxy can stutter, so you need dual insurance ready:
| Exception type | Response strategy |
|---|---|
| ConnectionError | Switch to another proxy immediately |
| Timeout | Extend the wait and retry |
| HTTPError | Handle according to the status code |
Real-world code example:
```python
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

retry_strategy = Retry(
    total=3,                      # at most 3 retries per request
    backoff_factor=1,             # wait 1s, 2s, 4s between attempts
    status_forcelist=[429, 500, 502, 503, 504],
    allowed_methods=["GET", "POST"],
)
adapter = HTTPAdapter(max_retries=retry_strategy)
session.mount('http://', adapter)
session.mount('https://', adapter)
```
This combo automatically retries failed requests. Paired with ipipgo's highly available proxy cluster, it spares you from handling most exceptions by hand.
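The retry adapter only covers status codes; the ConnectionError and Timeout strategies from the table can also be wired up manually. A rough sketch (the backoff values and attempt count are my own illustrative choices, not ipipgo recommendations):

```python
import time
from itertools import cycle

import requests

proxy_pool = cycle([
    "http://user:pass@gateway.ipipgo.com:8001",
    "http://user:pass@gateway.ipipgo.com:8002",
])

def fetch_with_fallback(url, attempts=3, timeout=5):
    proxy = next(proxy_pool)
    for attempt in range(attempts):
        try:
            resp = requests.get(
                url,
                proxies={"http": proxy, "https": proxy},
                timeout=timeout,
            )
            resp.raise_for_status()          # turn 4xx/5xx into HTTPError
            return resp
        except requests.exceptions.ConnectionError:
            proxy = next(proxy_pool)         # switch proxy immediately
        except requests.exceptions.Timeout:
            timeout *= 2                     # extend the wait next time
        except requests.exceptions.HTTPError as exc:
            if exc.response.status_code == 429:
                time.sleep(2 ** attempt)     # back off on rate limiting
            else:
                raise                        # other status codes: bubble up
    raise RuntimeError(f"all {attempts} attempts failed for {url}")
```

Each exception type gets exactly the response strategy the table prescribes.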
The Balancing Act of Speed and Stability
Some people chase speed by cranking the delay way down, then wonder why errors explode. It's better to tune the parameters to the business scenario:
- Product comparison: set the timeout to 3-5 seconds
- Public opinion monitoring: the timeout can be relaxed to 10 seconds
- Image scraping: best paired with asynchronous requests
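Those recommendations can be captured in a small config table. The scenario keys and the (connect, read) timeout splits below are just illustrative defaults:

```python
import requests

# Suggested timeouts per scenario as (connect, read) pairs, in seconds.
TIMEOUTS = {
    "product_comparison": (3, 5),    # fail fast; stale prices are useless
    "opinion_monitoring": (5, 10),   # tolerant; completeness matters more
}

def fetch(url, scenario="product_comparison"):
    # requests accepts a (connect, read) tuple for the timeout argument.
    return requests.get(url, timeout=TIMEOUTS[scenario])
```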
In my tests with ipipgo's long-lived static proxies, the success rate stays above 98% with a 5-second timeout, far more reliable than bargain-bin proxies.
Beginner's Guide to Avoiding Pitfalls
Q&A time:
Q: What should I do if the proxy speed keeps fluctuating?
A: Check whether you're on a shared proxy pool; switching to ipipgo's dedicated lines fixes this immediately.
Q: What should I do if my connections keep timing out?
A: First test whether the proxy itself is responsive with this command:
```shell
curl -x http://gateway.ipipgo.com:8001 http://httpbin.org/ip
```
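If curl isn't handy, the same health check works from Python. httpbin.org/ip echoes back the IP your request arrived from, so any successful response means the proxy is alive:

```python
import requests

def proxy_alive(proxy_url, timeout=5):
    """Return the exit IP if the proxy works, None otherwise."""
    try:
        resp = requests.get(
            "http://httpbin.org/ip",
            proxies={"http": proxy_url, "https": proxy_url},
            timeout=timeout,
        )
        return resp.json()["origin"]
    except requests.exceptions.RequestException:
        return None
```

Run this over your whole pool periodically and drop any proxy that returns None.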
Q: How do I optimize when handling a large volume of requests?
A: Combine a thread pool with the proxy pool for double insurance, and remember to set a rate limit so you don't bring down their servers.
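Here's one way to sketch that thread pool plus rate limit combination. The RateLimiter class and the 5 requests/second figure are my own illustrative choices:

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from itertools import cycle

import requests

proxy_pool = cycle([
    "http://user:pass@gateway.ipipgo.com:8001",
    "http://user:pass@gateway.ipipgo.com:8002",
])

class RateLimiter:
    """Allow at most `rate` calls per second across all threads."""
    def __init__(self, rate):
        self.interval = 1.0 / rate
        self.lock = threading.Lock()
        self.next_slot = time.monotonic()

    def wait(self):
        # Holding the lock while sleeping serializes callers, which is
        # exactly the point of a global rate limit.
        with self.lock:
            now = time.monotonic()
            if now < self.next_slot:
                time.sleep(self.next_slot - now)
            self.next_slot = max(now, self.next_slot) + self.interval

limiter = RateLimiter(rate=5)  # be polite: 5 requests/second total

def fetch(url):
    limiter.wait()
    proxy = next(proxy_pool)
    return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=5)

def crawl(urls, workers=8):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch, urls))
```

The thread pool gives you concurrency, the proxy pool spreads requests across exits, and the limiter caps the total pressure on the target site.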
One Final Trick
Lastly, a piece of dark magic: use proxy geolocation switching to get around regional restrictions. Some websites, for instance, are more lenient with traffic from particular regions; with ipipgo's city-level targeted proxies, "localized" access is easy to pull off.
```python
# Route through the Shanghai data-center exit
custom_proxy = "http://user:pass@sh.node.ipipgo.com:8800"
```
This technique is especially useful for regional data comparisons; those who've used it know.
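To turn this into a regional comparison, fetch the same URL through two city exits and diff what comes back. Only the sh.node hostname appears above; the Beijing node name here is a guess modeled on the same pattern, so check your provider's console for the real node addresses:

```python
import requests

# Hypothetical city-level exits; bj.node is assumed, modeled on sh.node.
REGION_PROXIES = {
    "shanghai": "http://user:pass@sh.node.ipipgo.com:8800",
    "beijing": "http://user:pass@bj.node.ipipgo.com:8800",
}

def fetch_by_region(url, region):
    proxy = REGION_PROXIES[region]
    return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=5)

# Compare what each region sees for the same page:
# pages = {r: fetch_by_region("https://target-site.com/data", r).text
#          for r in REGION_PROXIES}
```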
In the end, how well the proxy IP game goes depends on whether your provider is reliable. I've used ipipgo for half a year, and their IP liveness detection and automatic replacement mechanism genuinely save effort, far better than the fly-by-night platforms I used before. Especially for long-running crawler projects, don't pinch pennies on proxies: the data lost to one banned IP can cost far more than the proxy fee.

