
What to do when your crawler runs into a network 404?
Anyone who has done data crawling knows the most dreaded situation: the code is running fine, and then it suddenly stops working. Eight times out of ten, you have triggered the target site's anti-crawling mechanism and your IP address has been thrown into the "little black room" (blocked). At that point you need a stand-in to do the work for you, which is today's topic: the proxy IP.
For example, suppose you want to fetch remote JSON data with Python's requests library:
import requests
url = 'https://api.example.com/data.json'
response = requests.get(url)
print(response.json())
Run it a few times and you may find that it starts returning a 403 error. This is the time to bring out the proxy IP trick and make the server think a different visitor is accessing it.
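Before reaching for a proxy, it helps to confirm that the failure really looks like a block. The sketch below is a minimal heuristic under stated assumptions: the helper name `looks_blocked` is hypothetical, and treating 403 and 429 (Too Many Requests) as block signals is a common convention rather than something every site follows. In practice you would pass it `response.status_code`.

```python
# Status codes that often mean the server is refusing or throttling this client.
# Treating exactly these two as "blocked" is an assumption; sites vary.
BLOCK_SIGNALS = {403, 429}


def looks_blocked(status_code: int) -> bool:
    """Return True if the status code suggests the IP was blocked or throttled."""
    return status_code in BLOCK_SIGNALS


if __name__ == "__main__":
    for code in (200, 403, 429, 500):
        print(code, looks_blocked(code))
```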
The right way to use a proxy IP
Here's the key point! Using a proxy IP is not just a matter of finding a random address and plugging it in; it takes some strategy. I recommend ipipgo's service here: their IP pool is as big as a seafood market, and they can give you a new disguise with every request.
The modified code looks like this:
import requests

url = 'https://api.example.com/data.json'
# Replace username and password with your own credentials
proxies = {
    'http': 'http://username:password@gateway.ipipgo.com:9020',
    'https': 'http://username:password@gateway.ipipgo.com:9020'
}
try:
    response = requests.get(url, proxies=proxies, timeout=10)
    # Raise an exception for 4xx/5xx responses
    response.raise_for_status()
    data = response.json()
except requests.exceptions.RequestException as e:
    print(f"Request failed: {e}")
Note the username:password authentication format. Many beginners fill in the IP address directly without the authentication information, and then cannot connect. ipipgo's proxy address format is quite simple; just copy it from their documentation.
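Because the credentials sit inside a URL, characters such as "@", ":" or "/" in a username or password need to be percent-encoded, or the address will be parsed incorrectly. Here is a minimal sketch of building the proxies dict safely. The helper name `build_proxies` and the sample credentials are hypothetical; the host and port are the ones used in this article.

```python
from urllib.parse import quote


def build_proxies(username: str, password: str, host: str, port: int) -> dict:
    """Build a requests-style proxies dict with percent-encoded credentials."""
    # safe="" makes quote() encode "/" as well, so "@", ":" and "/" in the
    # credentials cannot be mistaken for URL delimiters.
    user = quote(username, safe="")
    pwd = quote(password, safe="")
    proxy_url = f"http://{user}:{pwd}@{host}:{port}"
    # The same proxy endpoint is used for both http and https targets.
    return {"http": proxy_url, "https": proxy_url}


if __name__ == "__main__":
    # Placeholder credentials for illustration only.
    print(build_proxies("user", "p@ss:word", "gateway.ipipgo.com", 9020))
```

The resulting dict can be passed straight to the proxies argument of requests.get().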
A practical guide to avoiding pitfalls
Here are a few places where it is easy to trip up:
1. IP lifetime: free proxies often die after a couple of uses. ipipgo's dynamic short-lived proxies, which switch IPs automatically on each request, are recommended.
2. Timeout settings: don't forget to add the timeout parameter; 5-10 seconds is recommended.
3. Exception handling: network requests are never 100% reliable, so always wrap them in try-except.
4. JSON parsing: sometimes the response is not valid JSON, so look at the raw data with response.text first.
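The JSON parsing tip can be sketched as a small helper that tries to parse the body and, on failure, hands back the start of the raw text so you can see what the server actually sent. The function name `parse_json_or_preview` and the 200-character preview length are assumptions for illustration; in real use you would pass it `response.text`.

```python
import json


def parse_json_or_preview(text: str, preview_chars: int = 200):
    """Return (data, None) on success, or (None, preview) if text is not valid JSON."""
    try:
        return json.loads(text), None
    except ValueError:
        # json.JSONDecodeError is a subclass of ValueError.
        # Keep only the start of the body so a large HTML error page stays readable.
        return None, text[:preview_chars]


if __name__ == "__main__":
    print(parse_json_or_preview('{"ok": true}'))
    print(parse_json_or_preview("<html>Access denied</html>"))
```

A blocked request often returns an HTML error page instead of JSON, so the preview is usually enough to tell the two cases apart.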
Beginner Q&A first-aid kit
Q: What should I do if my proxy IP always times out?
A: First check the format of the proxy address; in particular, special characters in the username and password must be URL-encoded. If the format checks out, contact ipipgo customer service to check the node status.
Q: Do I need to manually change my IP every time?
A: Not with ipipgo's rotating package. The IP is switched automatically at the gateway level, so you just keep the same proxy address in your code.
Q: What should I do if I encounter an SSL certificate error?
A: You can add the verify=False parameter to requests.get(), but that turns off certificate verification and is not safe. It is better to check the system root certificates, or switch to ipipgo's dedicated HTTPS proxy channel.
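For the SSL question, one way to check the system root certificates is to ask Python's ssl module where it looks for them by default. This is a minimal sketch, and the helper name `describe_ca_paths` is hypothetical. Note that requests normally verifies against the CA bundle shipped with the certifi package rather than the system store; its verify parameter also accepts a path to a CA bundle file, so a path found this way could be passed as verify=path instead of disabling verification.

```python
import os
import ssl


def describe_ca_paths() -> dict:
    """Report where Python's ssl module looks for root certificates by default."""
    paths = ssl.get_default_verify_paths()
    return {
        "cafile": paths.cafile,
        "capath": paths.capath,
        # True only when a CA bundle file is configured and present on disk.
        "cafile_exists": bool(paths.cafile) and os.path.isfile(paths.cafile),
    }


if __name__ == "__main__":
    print(describe_ca_paths())
```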
Why ipipgo?
This is not a hard sell; it is hard-won experience. I have used seven or eight service providers before and finally settled on ipipgo, for the following reasons:
1. Top-notch response speed, basically within 200 ms
2. Lines in 200+ cities across the country, which is very handy when you need an IP in a specific region
3. The management dashboard shows real-time usage, so there is no fear of overruns
4. Technical support is a real person; the last time I opened a ticket at two in the morning, it was answered within seconds
They also recently released an Intelligent Routing feature that automatically selects the fastest line. For scenarios that need to read JSON data reliably, it is practically a cheat code. New users also get 5 GB of traffic on registration, which is enough for testing.
The Ultimate Solution
A complete solution for those who just want the finished code:
import requests
from requests.adapters import HTTPAdapter

session = requests.Session()
# Retry failed connections up to 3 times for both schemes
session.mount('http://', HTTPAdapter(max_retries=3))
session.mount('https://', HTTPAdapter(max_retries=3))

def fetch_json(url):
    # Call ipipgo's API to get the latest proxies.
    # ipipgo.get_proxy() is a placeholder here; check the provider's documentation for the actual call.
    proxies = ipipgo.get_proxy()
    try:
        # timeout=(connect, read) in seconds
        response = session.get(url, proxies=proxies, timeout=(3, 7))
        return response.json()
    except ValueError:
        # response.json() raises a ValueError subclass when the body is not valid JSON
        print("The returned data is not in JSON format.")
        return None
This solution adds three layers of insurance: connection retries, automatic acquisition of a new IP, and exception handling. With ipipgo's API you can fetch the latest available proxy address directly, which is far less work than maintaining your own IP pool.
Finally, to be honest, with proxy IPs you get what you pay for. If the project is important, don't skimp on the budget. After all, the losses from downtime caused by being blocked can be far more expensive than the proxy fee.
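The "new IP on each retry" idea can also be sketched independently of any provider. In the minimal example below, `fetch_with_rotation`, `get_proxy`, and `fetch` are all hypothetical names: `get_proxy` stands in for whatever call returns a fresh proxies dict, and `fetch` for a thin wrapper around session.get() that returns parsed JSON. Both are injected so the sketch runs offline with stand-ins.

```python
from typing import Callable, Optional


def fetch_with_rotation(
    url: str,
    get_proxy: Callable[[], dict],
    fetch: Callable[[str, dict], dict],
    attempts: int = 3,
) -> Optional[dict]:
    """Try up to `attempts` times, asking for a fresh proxy before each try."""
    for attempt in range(1, attempts + 1):
        proxies = get_proxy()  # hypothetical provider call, injected by the caller
        try:
            return fetch(url, proxies)
        except Exception as exc:  # in real code, narrow this to requests.exceptions.RequestException
            print(f"Attempt {attempt} failed: {exc}")
    # Every attempt failed
    return None


if __name__ == "__main__":
    # Stand-ins so the sketch runs without a network: the second proxy "works".
    pool = iter([{"http": "http://proxy-a:9020"}, {"http": "http://proxy-b:9020"}])

    def fake_fetch(url, proxies):
        if "proxy-a" in proxies["http"]:
            raise ConnectionError("proxy-a is down")
        return {"url": url, "via": proxies["http"]}

    print(fetch_with_rotation("https://api.example.com/data.json", lambda: next(pool), fake_fetch))
```

Injecting the two callables keeps the retry logic separate from the network code, which also makes it easy to test.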

