
Playing with Proxy IP Data in Python: Hands-On JSON Dissection
Every crawler dev knows that dealing with the JSON data returned by a proxy IP service is like opening a blind box: you never know what strange format the server will stuff you with. Today we'll take ipipgo's API response as a case study and walk through a few practical ways to handle JSON data.
Field-Tested Dictionary Tricks
import requests
from json import JSONDecodeError

def grab_proxies():
    try:
        resp = requests.get('https://api.ipipgo.com/proxy', timeout=5)
        data = resp.json().get('data', {})
        return data['ips'] if 'ips' in data else []
    except JSONDecodeError:
        print("The server returned bad data!")
        return []
See that? Two key points are hidden in this basic operation: exception capture and default values. Many newbies grab data['ips'] directly without a second thought, and the program dies on the spot the moment the server acts up and returns empty data.
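To make that concrete, here's a minimal sketch of safe versus unsafe access (the empty response below is a made-up example, not ipipgo's actual payload):

# Hypothetical empty response the server might send on a bad day
empty = {"code": 0, "data": {}}

# Unsafe: raises KeyError as soon as 'ips' is missing
# ips = empty['data']['ips']

# Safe: chained .get() calls with defaults never blow up
ips = empty.get('data', {}).get('ips', [])
print(ips)  # []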
A Thousand Layers of Dictionary Nesting
ipipgo's proxy IP data often comes with multiple layers of nesting, like this:
{
    "node": {
        "east-china": [
            {"ip": "1.1.1.1", "expire": "2024-08-01"},
            {"ip": "2.2.2.2", "expire": "2024-08-02"}
        ]
    }
}
At this point, don't rush to brute-force it with nested for loops; try this slick operation instead:
def extract_ips(raw_data):
    return [
        item['ip']
        for region in raw_data.get('node', {}).values()
        if isinstance(region, list)
        for item in region
    ]
This list comprehension + type check combo is double insurance: no matter how the data shifts, it stays rock solid. ipipgo sometimes stuffs debugging information into node, and without the isinstance filter you'd be staring at an error within minutes.
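For instance (the debug field below is hypothetical, purely to illustrate the filtering):

raw = {
    "node": {
        "east-china": [{"ip": "1.1.1.1", "expire": "2024-08-01"}],
        "debug": "trace-id: abc123"  # hypothetical junk mixed into the payload
    }
}
print(extract_ips(raw))  # ['1.1.1.1'] -- the string field is skipped safely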
Dynamic Proxy Pool Maintenance Tips
Don't use the IP list straight away after fetching it; run a liveness check first. Plenty of people complain that proxy IPs fail the moment they're used, when the real culprit is skipping this preprocessing step:
def check_alive(ip_list):
    working_ips = []
    for ip in ip_list:
        try:
            test_resp = requests.get('http://httpbin.org/ip',
                                     proxies={'http': f'http://{ip}'},
                                     timeout=3)
            if ip in test_resp.text:
                working_ips.append(ip)
        except requests.RequestException:
            continue
    return working_ips
Here's a tip: using the httpbin.org/ip endpoint and verifying that the response actually contains the IP you're routing through is far more reliable than just eyeballing the status code. With ipipgo's short-lived proxies in particular, never skip this test step.
Q&A Time: Defusing Common Pitfalls
Q: What should I do if I always encounter JSON parsing errors?
A: 80% of the time the response body is contaminated. Print resp.text first and check whether an HTML error page has been mixed in. If that's the case, contact ipipgo's technical support; their API stability is among the best in the industry.
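A minimal diagnostic sketch along those lines (the URL is a placeholder for whatever endpoint you're calling):

import requests

resp = requests.get('https://api.ipipgo.com/proxy', timeout=5)
try:
    data = resp.json()
except ValueError:
    # Dump status and raw body -- HTML tags here mean the response is contaminated
    print(resp.status_code, resp.text[:500])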
Q: The IPs I get keep timing out on connection?
A: Check three things: 1. whether proxy authentication is in place, 2. whether the target site has blocked the proxy, 3. whether your local network has restrictions. We recommend ipipgo's pay-as-you-go package: their IP pool refreshes frequently, and the survival rate runs more than 30% higher than the monthly plan.
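On point 1, the most common slip is leaving the credentials out of the proxy URL. The standard user:pass@host form looks like this (host and credentials below are placeholders):

import requests

proxies = {
    'http':  'http://user:pass@proxy.example.com:8080',
    'https': 'http://user:pass@proxy.example.com:8080',
}
resp = requests.get('http://httpbin.org/ip', proxies=proxies, timeout=3)
print(resp.text)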
Q: How do you handle concurrent requests through proxies?
A: Don't just throw raw multithreading at it! Use a connection pool plus an IP rotation strategy instead. ipipgo's enterprise package supports high-concurrency API calls; paired with the aiohttp library for async processing, handling hundreds of requests per second is no problem.
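A minimal asyncio + aiohttp sketch of that pattern (everything here is illustrative, not ipipgo's official client, and the IPs are placeholders):

import asyncio
import aiohttp

async def fetch(session, url, proxy_ip):
    # aiohttp takes the proxy per request; one session = one shared connection pool
    async with session.get(url, proxy=f'http://{proxy_ip}',
                           timeout=aiohttp.ClientTimeout(total=5)) as resp:
        return await resp.text()

async def main(ip_list):
    async with aiohttp.ClientSession() as session:
        # Round-robin the IPs across tasks -- crude polling, fine for a demo
        tasks = [fetch(session, 'http://httpbin.org/ip', ip_list[i % len(ip_list)])
                 for i in range(10)]
        return await asyncio.gather(*tasks, return_exceptions=True)

# results = asyncio.run(main(['1.1.1.1:8080', '2.2.2.2:8080']))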
Practical Tips: IP Intelligent Scheduling
Finally, let me share an advanced play: dynamically switching proxies based on the business scenario.
import time
from random import choice

class ProxyManager:
    def __init__(self):
        self.ips = []
        self.last_update = 0

    def refresh(self):
        if time.time() - self.last_update > 300:  # update every 5 minutes
            self.ips = grab_proxies()
            self.last_update = time.time()

    def get_ip(self):
        self.refresh()
        return choice(self.ips) if self.ips else None
This scheduler delivers the double guarantee of automatic refresh + random selection. Paired with ipipgo's dynamic tunnel proxies, it effectively keeps your IPs from being blocked by the target website. Their intelligent routing assigns the optimal line for each business type automatically, which beats manual switching by a mile.
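Using it takes just a few lines (a sketch, building on the grab_proxies function defined earlier):

manager = ProxyManager()
ip = manager.get_ip()
if ip:
    resp = requests.get('http://httpbin.org/ip',
                        proxies={'http': f'http://{ip}'}, timeout=3)
    print(resp.text)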
At the end of the day, handling proxy IP data is detail work. Put these tricks to use, pair them with a reliable provider like ipipgo, and your crawler's efficiency is guaranteed to take off. Anything unclear? Drop a comment below and let's talk it through!

