
A must-read for anyone getting into data collection: playing with JSON and proxy IPs in Python!
Recently, some old crawler friends have been asking me: the data is sitting right in front of us, so why does the site keep intercepting us? Today I'll teach you a trick: parsing JSON in Python over a proxy IP. This trick is especially suitable for scenarios that need long-term, stable data collection, such as e-commerce price comparison and public opinion monitoring.
First, understand what a JSON file is
JSON is a structured text format that looks a lot like a dictionary in Python. Here's an example:
{
    "ip": "123.45.67.89",
    "port": 8080,
    "expire_time": "2024-03-20"
}
This structure is particularly suitable for storing proxy IP information, and we can read it easily with Python's built-in json library; just remember to open the file with open() first:
import json

with open('proxy_list.json') as f:
    proxies = json.load(f)  # load() parses the file straight into a dict

print(f"Available proxy: {proxies['ip']}:{proxies['port']}")
Proxy IP Practical Tips
Straight to the practical part! Let's say we're using ipipgo's proxy service, and the JSON returned by their API looks like this:
{
    "status": "success",
    "data": [
        {"ip": "112.95.234.76", "port": 8866, "city": "Guangzhou"},
        {"ip": "120.79.12.188", "port": 3128, "city": "Shenzhen"}
    ]
}
In real-world use, the code has to be written like this to stay stable:
import requests
import json

def get_proxy():
    resp = requests.get('https://api.ipipgo.com/getproxy')
    data = json.loads(resp.text)
    if data['status'] == 'success':
        # take the first proxy in the list and format it as ip:port
        return f"{data['data'][0]['ip']}:{data['data'][0]['port']}"
    return None

proxy = get_proxy()
print(f"The proxy currently in use is: {proxy}")
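Once you have the ip:port string, the missing step is actually sending traffic through it. Here's a minimal sketch of how that usually looks with requests, assuming the get_proxy() function above and using httpbin.org purely as a test target (it's not part of ipipgo's API):

import requests

proxy = get_proxy()  # e.g. "112.95.234.76:8866"
if proxy:
    # requests expects full proxy URLs, one per scheme
    proxies = {"http": f"http://{proxy}", "https": f"http://{proxy}"}
    # httpbin echoes back the IP it sees, so you can confirm the proxy is active
    resp = requests.get("https://httpbin.org/ip", proxies=proxies, timeout=10)
    print(resp.json())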
Guide to Common Pitfalls
Beginners are most likely to stumble in these three places:
| Problem | Fix |
|---|---|
| JSON parsing error | Wrap json.loads() in try/except to pinpoint the bad spot (see the sketch below) |
| Proxy won't connect | Switch to ipipgo's high-anonymity package; don't use free proxies |
| Slow requests | Choose a proxy node in the same city to reduce network latency |
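For the first pitfall, the fastest way to locate a parsing error is to wrap json.loads() in a try/except. A minimal sketch; the raw_text value is just a deliberately broken stand-in for whatever your API actually returned:

import json

raw_text = '{"status": "success", "data": [}'  # deliberately invalid JSON

try:
    data = json.loads(raw_text)
except json.JSONDecodeError as e:
    # e.lineno and e.colno point at the exact character that broke parsing
    print(f"Bad JSON at line {e.lineno}, column {e.colno}: {e.msg}")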
A must-read Q&A for beginners
Q: Why do I need a proxy IP to parse JSON?
A: Frequent requests from your own IP will get it blacklisted by the site in no time. With ipipgo's proxy pool, you can rotate different IPs to reduce the risk of being blocked.
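If your plan returns a whole batch of IPs at once, rotation can be as simple as picking a random entry per request. A minimal sketch, assuming proxy_pool is a list of ip:port strings you fetched from the API earlier (the two entries below are just placeholders):

import random
import requests

proxy_pool = ["112.95.234.76:8866", "120.79.12.188:3128"]  # placeholder pool

def fetch(url):
    # a different proxy on every call spreads requests across IPs
    proxy = random.choice(proxy_pool)
    proxies = {"http": f"http://{proxy}", "https": f"http://{proxy}"}
    return requests.get(url, proxies=proxies, timeout=10)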
Q: How do I choose the type of proxy?
A: For data collection, long-lasting static proxies are recommended. ipipgo's business package supports a fixed IP for 3 days, which is especially suitable for long-term tasks!
Q: What should I do if I encounter an SSL certificate error?
A: Add the verify=False parameter to the requests call:
requests.get(url, proxies={"https": proxy}, verify=False)
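Keep in mind that verify=False disables certificate checking, and requests will print an InsecureRequestWarning on every call. If you accept the risk for a target you trust, you can silence the warning like this:

import urllib3

# suppress the InsecureRequestWarning triggered by verify=False
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)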
A Time-Saving Option
If you're too lazy to maintain your own proxy pool, you can just use ipipgo's intelligent routing service. Their SDK automatically selects the optimal node, and the code is as simple as it gets:
from ipipgo import ProxyClient
client = ProxyClient(api_key="your key")
response = client.request("GET", "target url")
print(response.json())  # directly get the parsed JSON data
The biggest advantage of this approach is that you don't have to worry about IP failures; the system switches automatically. In a test run of an e-commerce data-collection script, the success rate went from 50% to more than 92%.
One last note: a lot of sites now add human verification. It's recommended to pair this with ipipgo's browser fingerprinting feature, so the collected traffic is less likely to be flagged. For any specific questions you can reach their customer service directly; the response speed is much faster than some of the big vendors.

