
What to do if your crawler is blocked? Try this parsing trick
What do people fear most when doing data collection? IP bans, of course. I recently helped a friend whose team was using Python for competitive analysis; over three consecutive days the target site blocked more than 20 of their IPs. This kind of problem can be solved with proxy IPs, but the key is knowing how to handle the JSON data the service provider returns.
A hands-on guide to parsing proxy IP data
Mainstream proxy service providers now return IP information in JSON format. Suppose we get the following data from the ipipgo API:
```json
{
  "proxy_list": [
    {
      "ip": "203.34.56.78",
      "port": 8866,
      "protocol": "socks5",
      "expire_time": "2024-08-01 12:00:00"
    }
    // ... more IP entries
  ]
}
```
Focus on these parameters:
| Field | Meaning |
|---|---|
| ip | proxy server address |
| port | connection port number |
| protocol | proxy protocol type (http/https/socks5) |
| expire_time | IP expiration time |
Real-world code: making proxy IPs really work
Let's use Python's requests library to demonstrate how to dynamically switch proxies. Be careful to handle a possible JSONDecodeError exception:
```python
import json
import requests

def get_proxy():
    try:
        resp = requests.get('https://api.ipipgo.com/get_proxy')
        data = json.loads(resp.text)
        current_proxy = data['proxy_list'][0]
        return f"{current_proxy['protocol']}://{current_proxy['ip']}:{current_proxy['port']}"
    except json.JSONDecodeError:
        print("JSON parsing failed, check the API return format!")
        return None
```
Example of use:

```python
proxy = get_proxy()
try:
    response = requests.get('https://目标网站.com',  # placeholder: your target site
                            proxies={'http': proxy, 'https': proxy},
                            timeout=10)
    print(response.status_code)
except requests.exceptions.ConnectionError:
    print("This IP may be down, try another one?")
```
Avoid three common pitfalls
Places where newcomers tend to stumble:
- Not checking the IP expiration time, so the connection suddenly drops mid-use
- Mismatching the protocol type (e.g. sending https traffic through an http-only proxy)
- Calling the API too frequently and exceeding the extraction quota
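The first pitfall is easy to guard against: compare expire_time with the current time before using an IP. A minimal sketch (the field name and timestamp format follow the sample JSON above; is_expired is a hypothetical helper, not part of any provider's SDK):

```python
from datetime import datetime

def is_expired(proxy: dict, now: datetime = None) -> bool:
    """Return True if the proxy's expire_time has already passed."""
    now = now or datetime.now()
    expires = datetime.strptime(proxy["expire_time"], "%Y-%m-%d %H:%M:%S")
    return expires <= now

# The sample entry from the API response above
sample = {"ip": "203.34.56.78", "port": 8866,
          "protocol": "socks5", "expire_time": "2024-08-01 12:00:00"}
print(is_expired(sample, now=datetime(2024, 8, 2)))   # already expired by then
print(is_expired(sample, now=datetime(2024, 7, 31)))  # still valid
```

Call this on every entry before handing it to requests, and skip (or re-fetch) anything that is about to expire.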
Why recommend ipipgo?
A few solid advantages from a provider I've used myself:
- Supports dynamic residential IPs; at around 7 yuan per GB of traffic, it's enough for a small team
- The client has a built-in automatic IP rotation feature, so there's no need to write your own timer task
- You can switch to the TK line if you keep hitting captchas (this is rare elsewhere)
| Package Type | Applicable Scenarios | Price |
|---|---|---|
| Dynamic residential (standard) | Daily data collection | 7.67 yuan/GB |
| Dynamic residential (business) | High-frequency access | 9.47 yuan/GB |
| Static residential | Long-term fixed operations | 35 yuan/IP |
Troubleshooting Q&A
Q: What should I do if the returned JSON has no port field?
A: Most likely the provider uses its client's direct-connection mode; check the documentation and use the default port.
Q: The proxy IP is valid, but I can't connect through it?
A: Check the protocol type first: an https website must go through an https or socks5 proxy.
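To avoid that mismatch programmatically, you can build the proxies dict from the protocol field of the API response. A sketch using the same field names as the sample JSON above (build_proxies is a hypothetical helper; note that socks5 URLs only work if requests is installed with the socks extra, i.e. pip install requests[socks]):

```python
def build_proxies(entry: dict) -> dict:
    """Build a requests-style proxies dict from one proxy_list entry.

    The same proxy URL is used for both http and https traffic;
    socks5 schemes need `pip install requests[socks]`.
    """
    url = f"{entry['protocol']}://{entry['ip']}:{entry['port']}"
    return {"http": url, "https": url}

entry = {"ip": "203.34.56.78", "port": 8866, "protocol": "socks5"}
print(build_proxies(entry))
# {'http': 'socks5://203.34.56.78:8866', 'https': 'socks5://203.34.56.78:8866'}
```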
Q: How to check IP availability in bulk?
A: Use the concurrent.futures module to open multiple threads and test the connection speed of multiple IPs at the same time
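That answer can be sketched as follows. check_proxy and rank_proxies are hypothetical helpers, and the probe URL is just an example; the checker function is injectable, so the ranking logic can be exercised without touching the network.

```python
import concurrent.futures
import time
import requests

def check_proxy(proxy_url: str, timeout: float = 5.0) -> float:
    """Return round-trip latency in seconds, or inf if the proxy fails."""
    start = time.monotonic()
    try:
        requests.get("https://httpbin.org/ip",  # any lightweight probe URL works
                     proxies={"http": proxy_url, "https": proxy_url},
                     timeout=timeout)
        return time.monotonic() - start
    except requests.RequestException:
        return float("inf")

def rank_proxies(proxy_urls, checker=check_proxy, max_workers=10):
    """Probe proxies concurrently; return the live ones sorted fastest-first."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as pool:
        latencies = list(pool.map(checker, proxy_urls))
    alive = sorted((lat, url) for lat, url in zip(latencies, proxy_urls)
                   if lat != float("inf"))
    return [url for _, url in alive]
```

Feed it the proxy URLs built from the API response and keep only the top few results; dead IPs are dropped automatically because their latency comes back as infinity.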
One final note: don't look at price alone when choosing a proxy service. A provider like ipipgo offers 1-on-1 customized solutions and can respond quickly to special requirements, which is worth paying for. Last time they built an IP rotation scheme for an e-commerce customer that increased collection efficiency more than 3x; that's the value of professional service.

