
When Your Crawler Hits IP Blocking, Try This Jedi Trick
What scares crawler developers the most? Not the anti-scraping mechanisms, not the CAPTCHAs. The worst is the IP blocking alert that suddenly pops up. A friend of mine who does e-commerce price comparison had more than twenty IPs blocked by one platform over three straight days and was tearing his hair out. Then he used one trick, proxy IP rotation, and finally managed to pull the data down.
```python
import requests
from itertools import cycle

url = 'https://example.com/api'  # placeholder target URL

ip_pool = [
    '123.123.123.123:8888',
    '124.124.124.124:9999',
    # ... more proxy IPs provided by ipipgo
]

proxy_cycler = cycle(ip_pool)

for page in range(1, 101):
    current_proxy = next(proxy_cycler)
    proxies = {
        'http': f'http://{current_proxy}',
        'https': f'http://{current_proxy}',
    }
    response = requests.get(url, params={'page': page}, proxies=proxies)
    # ... process the returned JSON data
```
The right way to use proxy IPs
A mistake many newbies make is treating the proxy as a master key. Here's a tip for everyone: IP quality beats quantity. I've used free proxies before: nine out of ten IPs timed out, and the remaining one was already blacklisted by the target site.
I recommend ipipgo's dynamic residential proxies: the IP pool is refreshed daily, and in my tests the success rate reaches 95%. The key is to learn an intelligent switching strategy. Don't naively change IPs on every request; adjust dynamically based on the response status code.
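The switching strategy above can be sketched as a small rotator class. This is a minimal sketch, not ipipgo's actual scheduler: the choice of 403/429 as "blocked" signals and the retry count are my assumptions, and the injectable `get` parameter exists only to make the logic testable without a network.

```python
from itertools import cycle


class ProxyRotator:
    """Keep one proxy while it works; rotate only on block signals.

    Assumption (not from the post): HTTP 403/429 and connection
    errors mean "this IP is burned, switch"; any other response
    keeps the current IP to avoid wasting the pool.
    """

    BLOCK_CODES = {403, 429}

    def __init__(self, ip_pool):
        self._cycler = cycle(ip_pool)
        self.current = next(self._cycler)

    def proxies(self):
        return {'http': f'http://{self.current}',
                'https': f'http://{self.current}'}

    def fetch(self, url, max_retries=3, get=None):
        if get is None:                        # default to requests.get
            import requests
            get = requests.get
        for _ in range(max_retries):
            try:
                resp = get(url, proxies=self.proxies(), timeout=10)
            except OSError:                    # connection failed: rotate
                self.current = next(self._cycler)
                continue
            if resp.status_code in self.BLOCK_CODES:
                self.current = next(self._cycler)  # blocked: rotate
                continue
            return resp                        # success: keep current IP
        raise RuntimeError('all retries exhausted for this request')
```

The point of the design: a healthy IP stays in use as long as the site is happy with it, so you burn through the pool at the rate the site forces you to, not on every request.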
The three hidden traps of JSON data processing
Don't rush to parse the data the moment you get it; check these three places first:
- Whether the Content-Type response header is actually application/json
- Whether the data is gzip-compressed (I once hit the fiasco of a garbled response)
- Whether key fields are dynamically obfuscated (e.g. the price arrives Base64-encoded)
```python
import gzip
import json
from json.decoder import JSONDecodeError

try:
    data = response.json()
except JSONDecodeError:
    # Handle the exception: decode the raw bytes ourselves
    raw = response.content
    if 'gzip' in response.headers.get('Content-Encoding', ''):
        # requests usually decompresses gzip transparently; only
        # decompress manually if the body is still compressed
        try:
            raw = gzip.decompress(raw)
        except OSError:
            pass  # body was already decompressed
    data = json.loads(raw.decode('utf-8'))
```
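The third check, a Base64-obfuscated field, can usually be undone in one line. A minimal sketch with a hypothetical payload (the `price` field name and its encoded value are made up for illustration):

```python
import base64
import json

# Hypothetical payload: the site ships "price" Base64-encoded
raw = '{"name": "hotel", "price": "MTk5Ljk5"}'

data = json.loads(raw)
# Decode the obfuscated field back into a usable number
data['price'] = float(base64.b64decode(data['price']).decode('utf-8'))
# data['price'] is now 199.99
```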
Tricky maneuvers from the real world
Here's a real case: a travel site's anti-crawl system checks the geographic location of the IP. Using ipipgo's city-level targeting proxies to match the request IP with the city ID in the request parameters, the success rate jumped straight from 40% to 90%!
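The matching trick boils down to picking the proxy from the same city the request claims to be about. A sketch under stated assumptions: the `city_id` parameter name, the city IDs, and the proxy addresses below are all hypothetical placeholders.

```python
# Hypothetical mapping from the site's city ID (as sent in the query
# string) to a proxy located in that city; IDs and addresses are made up.
CITY_PROXIES = {
    '1101': '203.0.113.10:8000',   # Beijing
    '3101': '203.0.113.20:8000',   # Shanghai
}

def request_kwargs(city_id):
    """Build kwargs for requests.get so the proxy's geolocation
    matches the city_id the request itself claims to be about."""
    proxy = CITY_PROXIES[city_id]
    return {
        'params': {'city_id': city_id},
        'proxies': {'http': f'http://{proxy}',
                    'https': f'http://{proxy}'},
    }

# Usage: requests.get(url, **request_kwargs('1101'))
```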
| Scenario | Recommended proxy type | Switching frequency |
|---|---|---|
| General data collection | Data center proxies | Every 5 minutes |
| Heavily defended sites | Dynamic residential proxies | Every request |
A minefield guide to common problems
Q: Why do proxy IPs stop working as soon as I use them?
A: 80% of the time it's low-quality proxies. Choose ipipgo's real-time validated proxy pool, which automatically checks each IP's liveness before every request.
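That validation happens on the provider's side, but a cheap client-side liveness probe helps too. A minimal sketch: the probe URL is just an example endpoint, and the injectable `get` parameter exists only so the logic can be tested without a network.

```python
def proxy_is_alive(proxy, get=None,
                   test_url='http://httpbin.org/ip', timeout=5):
    """Return True if a small probe request through the proxy
    succeeds. Any cheap, stable URL works as the probe target."""
    if get is None:                    # default to requests.get
        import requests
        get = requests.get
    proxies = {'http': f'http://{proxy}',
               'https': f'http://{proxy}'}
    try:
        return get(test_url, proxies=proxies,
                   timeout=timeout).status_code == 200
    except OSError:                    # timeout / refused = dead proxy
        return False
```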
Q: Why is the returned data always incomplete?
A: Check the Accept-Encoding request header; some sites return data in different formats depending on it.
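One way to rule compression in or out as the culprit is to pin Accept-Encoding yourself. A small helper sketch (the helper name is my own, not a standard API): `identity` asks the server for an uncompressed body.

```python
def debug_headers(base=None):
    """Headers for reproducing 'incomplete data': pin Accept-Encoding
    to 'identity' so the server sends an uncompressed body."""
    headers = dict(base or {})
    headers['Accept-Encoding'] = 'identity'
    return headers

# Usage (hypothetical): requests.get(url, headers=debug_headers())
```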
Q: The proxy is so slow it makes me question everything?
A: Don't use free proxies! ipipgo's dedicated high-speed lines measure under 200 ms in my tests.
A final word of advice: running a crawler is like fighting a guerrilla war. Don't go head-to-head; outsmart them. With a sensible proxy IP and request strategy, plus ipipgo's intelligent scheduling system, you'll find that many sites that look like impregnable fortresses actually leak like a sieve...

