
Playing with proxy IP data in Python: the json library is the hidden master
Anyone who has done data collection knows that proxy IPs and JSON are a golden pair. No fluff today, straight to the hands-on part. First of all, why use the json library to handle proxy IP data? For example, the proxy lists returned by the ipipgo platform are in standard JSON format. Without this library, do you plan to pick the data apart by hand?
```python
import json

# Suppose this is the proxy data returned by ipipgo
proxy_data = '''
{
    "code": 200,
    "data": [
        {"ip": "123.123.123.1", "port": 8000},
        {"ip": "123.123.123.2", "port": 8001}
    ]
}
'''

# Parse the string into a dictionary
parsed_data = json.loads(proxy_data)
print(parsed_data['data'][0]['ip'])  # outputs 123.123.123.1
```
Watch the `json.loads()` call carefully: it is the key step that turns a string into a dictionary object. Many beginners stumble on this type conversion. Remember: if the raw data is a string, use `loads`; if it is a file object, use `load`.
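A minimal sketch of the `loads` vs. `load` distinction (the file name here is just for illustration):

```python
import json

# loads: parse a JSON *string* that is already in memory
raw = '{"ip": "123.123.123.1", "port": 8000}'
proxy = json.loads(raw)
print(proxy['port'])  # 8000

# load: parse directly from an open *file object*
with open('proxy.json', 'w') as f:
    f.write(raw)
with open('proxy.json') as f:
    proxy_from_file = json.load(f)
print(proxy_from_file == proxy)  # True
```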
Proxy IP Practical Tips: Don't Skimp on Exception Handling
In practice, the most common source of trouble in proxy IP data processing is network instability. For example, if the connection to ipipgo drops mid-request, a program without exception handling will crash within minutes. Look at this improved version of the code:
```python
import json
import requests

def get_ipipgo_proxies():
    try:
        resp = requests.get('https://api.ipipgo.com/proxy-list')
        return json.loads(resp.text)['data']
    except json.JSONDecodeError:
        print("Failed to parse JSON data, check the interface return format")
    except requests.exceptions.RequestException:
        print("Network connection exception, check your proxy configuration")
    return []
```
The key addition here is double exception handling: network problems and data format problems are dealt with separately. ipipgo's API response format is quite stable, so if parsing fails, the cause is usually a local network problem.
Essential for Proxy IP Rotation: Persistent Storage
During data collection you often need to save the proxy IP pool, and this is where json.dump() comes in handy. Combined with ipipgo's scheduled-update API, you can automate proxy pool maintenance:
```python
import json
from datetime import datetime

def save_proxy_pool(proxies):
    timestamp = datetime.now().strftime("%Y%m%d%H%M")
    with open(f'ipipgo_proxies_{timestamp}.json', 'w') as f:
        json.dump({"update_time": timestamp, "proxies": proxies}, f, indent=2)
```
Saving each file with a timestamp makes it easier to troubleshoot problems later. ipipgo's proxies are generally valid for 6-24 hours, so performing an update every hour is recommended.
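As a counterpart, loading a saved pool back is just as short with json.load. A sketch assuming the file format written above (`load_proxy_pool` is an illustrative name, not part of any library):

```python
import json

def load_proxy_pool(path):
    # Read a pool file in the {"update_time": ..., "proxies": [...]}
    # format and return just the proxy list
    with open(path) as f:
        return json.load(f)['proxies']

# Round-trip demo with a hand-written file in the same format
sample = {"update_time": "202401010000",
          "proxies": [{"ip": "123.123.123.1", "port": 8000}]}
with open('ipipgo_proxies_sample.json', 'w') as f:
    json.dump(sample, f, indent=2)

print(load_proxy_pool('ipipgo_proxies_sample.json'))
# [{'ip': '123.123.123.1', 'port': 8000}]
```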
Frequently Asked Questions
Q: Why does the connection always fail with ipipgo's proxy?
A: First check whether the proxy format is correct; the easiest way is to fetch the latest proxies directly via their API. If that still doesn't work, your local network may be blocking the proxy port.
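For context on that format check: requests expects proxies as a dict keyed by URL scheme. A sketch with a placeholder address (not a real ipipgo proxy), so the actual request is left commented out:

```python
# Build the proxies mapping that requests expects
proxy = {"ip": "123.123.123.1", "port": 8000}
proxies = {
    "http": f"http://{proxy['ip']}:{proxy['port']}",
    "https": f"http://{proxy['ip']}:{proxy['port']}",
}
print(proxies["http"])  # http://123.123.123.1:8000

# import requests
# resp = requests.get("https://api.ipipgo.com/proxy-list",
#                     proxies=proxies, timeout=5)
```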
Q: How to improve the efficiency of proxy IP collection?
A: Try multi-threading combined with ipipgo's high-concurrency package; their dedicated proxy pool supports 500+ simultaneous connections. Remember to set a reasonable timeout (3-5 seconds recommended).
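The multi-threading idea can be sketched with the standard library's ThreadPoolExecutor. The check function below is a dummy stand-in; in real use it would issue a request through each proxy with a 3-5 second timeout and return True on success:

```python
from concurrent.futures import ThreadPoolExecutor

def check_proxy(proxy):
    # Dummy liveness check for illustration only; replace with a real
    # request through the proxy (timeout of a few seconds)
    return proxy["port"] % 2 == 0

proxies = [{"ip": f"123.123.123.{i}", "port": 8000 + i} for i in range(10)]

# Fan the checks out across worker threads
with ThreadPoolExecutor(max_workers=10) as pool:
    results = list(pool.map(check_proxy, proxies))

alive = [p for p, ok in zip(proxies, results) if ok]
print(len(alive))  # 5
```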
| Error type | Remedy |
|---|---|
| JSONDecodeError | Check whether the interface response has been tampered with |
| ConnectionError | Switch to a different ipipgo region node |
Finally, a bit of trivia: when handling Chinese proxy information with json.dumps(), remember to set the ensure_ascii parameter to False, otherwise you will see a pile of unicode escape codes. I fell into this pit myself back in the day; passing it on now could save you three days of debugging.
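To see the difference ensure_ascii makes:

```python
import json

info = {"region": "上海", "ip": "123.123.123.1"}

# Default: Chinese characters are escaped into \uXXXX sequences
print(json.dumps(info))
# {"region": "\u4e0a\u6d77", "ip": "123.123.123.1"}

# ensure_ascii=False keeps the text readable
print(json.dumps(info, ensure_ascii=False))
# {"region": "上海", "ip": "123.123.123.1"}
```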

