
Hands-on with Python: Handling JSON Data from Proxy IP APIs
Attention, everyone doing data collection! Today we'll use Python to handle the JSON data returned by a proxy IP API. Don't underestimate this step: handle it badly and your crawler will be dead in the water within minutes. Let's take the ipipgo API response as our example and walk through the common pitfalls.
Basic Operations: The Three Essentials of JSON Parsing
Don't panic when you receive an API response: first confirm it's actually valid JSON. Before calling response.json(), remember to wrap it in exception handling:
```python
import requests

try:
    resp = requests.get('https://api.ipipgo.com/getproxy')
    data = resp.json()  # raises ValueError if the body isn't JSON
except ValueError:
    print("Damn! The interface is not returning proper JSON.")
```
Once you have the data, focus on these fields:
– proxy_list: List of IP addresses
– expire_time: Expiration timestamp
– region: IP attribution
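Assuming a payload shaped like the fields above (the sample values here are hypothetical; the real ipipgo response may differ), pulling them out looks like this:

```python
import json
from datetime import datetime

# Hypothetical sample payload; field names follow the list above
sample = '''{"proxy_list": ["203.0.113.10:8080", "203.0.113.11:8080"],
             "expire_time": 1735689600,
             "region": "US"}'''

data = json.loads(sample)
first_proxy = data["proxy_list"][0]                   # first ip:port in the pool
expire = datetime.fromtimestamp(data["expire_time"])  # timestamp -> local datetime
```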
Proxy IP Integration Tips
When plugging ipipgo's proxy IPs into requests, don't just copy-paste the example code: you have to handle the expiration time dynamically:
```python
from datetime import datetime

import requests

def get_proxy():
    # Call the ipipgo API for a fresh proxy
    data = requests.get('https://api.ipipgo.com/getproxy').json()
    new_ip, port = data['proxy_list'][0].split(':')  # assumes "ip:port" entries
    proxies = {
        "http": f"http://{new_ip}:{port}",
        "https": f"http://{new_ip}:{port}",
    }
    expire = datetime.fromtimestamp(data['expire_time'])
    print(f"This IP is good until {expire}, then it has to be swapped out")
    return proxies
```
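The proxies dict that requests expects is just a scheme-to-URL mapping. A minimal sketch (the IP and port below are placeholders, and `build_proxies` is a name made up for illustration):

```python
def build_proxies(ip: str, port: int) -> dict:
    # requests routes both http and https traffic through the same proxy endpoint
    url = f"http://{ip}:{port}"
    return {"http": url, "https": url}

proxies = build_proxies("203.0.113.10", 8080)  # placeholder IP/port
# requests.get(target_url, proxies=proxies)   # pass it per-request
```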
Exception Handling: The Anti-Failure Guide
Here's where proxy IPs most often go haywire:

| Error type | Remedy |
|---|---|
| ConnectionError | Switch to a new IP immediately |
| Timeout | Wait 3 seconds, then retry |
| 403 status code | Check request headers and authentication |
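The table above maps naturally onto a small classifier. A sketch (the remedy labels and the `remedy` function are made up here; wire them to your own pool logic):

```python
import requests

def remedy(exc: Exception) -> str:
    # Map a failure to the table's remedy
    if isinstance(exc, requests.exceptions.ConnectionError):
        return "switch-ip"      # grab a fresh proxy immediately
    if isinstance(exc, requests.exceptions.Timeout):
        return "wait-3s-retry"  # back off 3 seconds, then try again
    return "reraise"            # anything else: let it bubble up

# A 403 doesn't raise on its own: check resp.status_code == 403
# and re-examine your request headers / authentication.
```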
Recommended: let the retrying library do smart retries:

```python
from retrying import retry

@retry(stop_max_attempt_number=3)
def safe_request(url):
    return requests.get(url, proxies=get_proxy(), timeout=5)
```
Practical Q&A First-Aid Kit
Q: What should I do if all the proxy IPs suddenly die?
A: 80% of the time it's over-used concurrency. ipipgo's packages have an automatic pool renewal feature; just turn it on in the console.
Q: How can I tell if a proxy IP is really in effect?
A: Send a request to ipipgo's verification interface; if the returned IP and port don't match, swap the proxy out right away.
Q: What if I need to manage multiple proxy pools at the same time?
A: Try ipipgo's multi-channel isolation feature: different IP pools for different services, so they don't affect each other.
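For the verification question above, the check boils down to "does the echoed exit IP match the proxy I configured?". A sketch (httpbin.org/ip is a generic echo endpoint used as a stand-in here, since ipipgo's actual verification URL isn't shown; both function names are made up):

```python
def exit_ip(proxies: dict) -> str:
    # Hit an IP-echo endpoint through the proxy; httpbin.org/ip stands in
    # for ipipgo's own verification interface
    import requests
    resp = requests.get("https://httpbin.org/ip", proxies=proxies, timeout=5)
    return resp.json()["origin"]

def proxy_matches(echoed_ip: str, expected_ip: str) -> bool:
    # Echo endpoints sometimes return a "client, proxy" chain
    return expected_ip in (p.strip() for p in echoed_ip.split(","))
```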
Performance Optimization: Mind the Details
Many newbies make the mistake of calling the API to fetch a fresh IP on every single request. Use a local cache plus ahead-of-time refresh instead:
```python
import threading
from collections import deque

import requests

class ProxyPool:
    def __init__(self):
        self.lock = threading.Lock()
        self.pool = deque()
        self.refresh()  # load a batch at startup

    def refresh(self):
        with self.lock:
            # Call the ipipgo interface to replenish new IPs
            data = requests.get('https://api.ipipgo.com/getproxy').json()
            self.pool = deque(data['proxy_list'])

    def get_ip(self):
        if len(self.pool) < 5:  # refresh early if stock is running low
            self.refresh()
        return self.pool.pop()
```
One last tip: ipipgo's pay-per-volume packages are especially good for stress testing, so pay for what you use instead of blindly buying a monthly subscription. And next time you hit a JSON parsing problem, check the Content-Type response header first: if it isn't application/json, the interface is probably jerking you around with an HTML error page.
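That Content-Type check is a one-liner worth keeping around (a minimal sketch; `looks_like_json` is a name made up here):

```python
def looks_like_json(content_type: str) -> bool:
    # "application/json; charset=utf-8" still counts as JSON
    return content_type.split(";")[0].strip().lower() == "application/json"

# Before calling resp.json(), check:
#   looks_like_json(resp.headers.get("Content-Type", ""))
# If it's text/html, you're looking at an error page, not data.
```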

