Always getting your IP blocked? Try this trick, it really works!
Anyone who has written web crawlers knows the pain: when you batch-download images, the biggest headache is getting your IP banned. The script runs fine in the morning, and by the afternoon the site is throwing 403 Forbidden at you. That's when you reach for the life preserver: proxy IPs. Today we'll build a shielded image downloader in Python, with ipipgo's proxy service as the escort.
Why do crawlers without a proxy IP get shut down so fast?
Websites mainly watch three signals for anti-crawling: request frequency, IP footprint, and client fingerprint. An ordinary crawler hammering out requests from a fixed IP is like the same person pounding on the door 100 times a minute; if the security guard doesn't block you, who would he block? Using proxy IPs is like knocking on the door in a different disguise every time, so the guard never recognizes you.
The core proxy IP configuration looks like this:
```python
proxies = {
    'http': 'http://username:password@gateway.ipipgo.com:9020',
    'https': 'http://username:password@gateway.ipipgo.com:9020'
}
```
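To see the dict in action, here is a minimal usage sketch (the target URL below is just a placeholder): pass it to `requests` through the `proxies` argument and every request exits through the gateway instead of your own IP.

```python
import requests

proxies = {
    'http': 'http://username:password@gateway.ipipgo.com:9020',
    'https': 'http://username:password@gateway.ipipgo.com:9020'
}

# Any request made with this dict exits via the proxy gateway
resp = requests.get('https://example.com/some-image.jpg',
                    proxies=proxies, timeout=15)
print(resp.status_code)
```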
Setting up the environment
First install these essential libraries (pulling from the Tsinghua mirror is faster):

```bash
pip install requests pillow retrying -i https://pypi.tuna.tsinghua.edu.cn/simple
```
The key part of the ipipgo configuration: grab an API extraction link from their backend. I suggest the long-lasting static IP package; those IPs survive a long time, which suits crawl jobs that need to run continuously.
Writing the code to resist blocking
Straight to the hard stuff. Look at this code with triple protection: a browser User-Agent, a dynamically fetched proxy, and automatic retries:
```python
from retrying import retry
import requests

def download_img(url, save_path):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'}

    # Dynamically fetch a proxy IP from the ipipgo interface
    proxy = requests.get("https://ipipgo.com/fetchproxy?type=json").json()

    @retry(stop_max_attempt_number=3)
    def _download():
        resp = requests.get(url, headers=headers,
                            proxies={"http": proxy['proxy'],
                                     "https": proxy['proxy']},
                            timeout=15)
        resp.raise_for_status()
        with open(save_path, 'wb') as f:
            f.write(resp.content)

    try:
        _download()
    except Exception as e:
        print(f"Download failed: {e}, switching to a new ipipgo IP...")
        return False
    return True
```
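Usage is a one-liner; the image URL and save path below are placeholders:

```python
ok = download_img('https://example.com/gallery/cat_001.jpg', 'cat_001.jpg')
if not ok:
    print('All 3 attempts failed; grab a fresh IP and try again later')
```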
Veteran Q&A time
Q: What should I do if a proxy IP suddenly stops working?
A: ipipgo's residential IP pool has a 5-second auto-switching mechanism; just add a retry loop in your code. If you hit a dead IP, you can also refresh the node manually in their backend.
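A minimal sketch of such a retry loop, assuming the same JSON extraction endpoint and `proxy` field as in the downloader above: pull a fresh proxy on every attempt so a dead IP is simply discarded.

```python
import requests

def fetch_with_rotation(url, max_attempts=3):
    for attempt in range(max_attempts):
        # Pull a fresh proxy each attempt so a dead IP gets replaced
        proxy = requests.get("https://ipipgo.com/fetchproxy?type=json").json()
        try:
            resp = requests.get(url,
                                proxies={"http": proxy['proxy'],
                                         "https": proxy['proxy']},
                                timeout=15)
            resp.raise_for_status()
            return resp
        except requests.RequestException:
            print(f"Attempt {attempt + 1} failed, rotating IP...")
    return None
```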
Q: How do I know whether the proxy is actually in effect?
A: Add detection logic to the code: before downloading, visit http://ip.ipipgo.com/checkip and check whether the returned IP is the proxy IP rather than your own.
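A sketch of that check, assuming the checkip endpoint simply echoes back the caller's exit IP as plain text (the response format is an assumption):

```python
import requests

def proxy_is_active(proxies):
    # Assumed: the endpoint returns the caller's exit IP as plain text
    my_ip = requests.get("http://ip.ipipgo.com/checkip", timeout=10).text.strip()
    proxied_ip = requests.get("http://ip.ipipgo.com/checkip",
                              proxies=proxies, timeout=10).text.strip()
    return my_ip != proxied_ip  # different exit IPs mean the proxy works
```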
Q: What if I want multi-threaded downloads?
A: ipipgo's enterprise package supports up to 500 concurrent IPs. Give each thread its own proxy, and remember to set the timeout above 30 seconds.
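A minimal sketch with `ThreadPoolExecutor`, one freshly fetched proxy per task (the extraction endpoint is the one assumed earlier; URLs and paths are placeholders):

```python
import os
from concurrent.futures import ThreadPoolExecutor
import requests

def worker(url, save_path):
    # Each task fetches its own proxy, so threads never share an exit IP
    proxy = requests.get("https://ipipgo.com/fetchproxy?type=json").json()
    resp = requests.get(url,
                        proxies={"http": proxy['proxy'],
                                 "https": proxy['proxy']},
                        timeout=35)  # above 30 s, as recommended for concurrency
    resp.raise_for_status()
    with open(save_path, 'wb') as f:
        f.write(resp.content)

os.makedirs('downloads', exist_ok=True)
urls = [f'https://example.com/img/{i}.jpg' for i in range(100)]  # placeholders
with ThreadPoolExecutor(max_workers=20) as pool:
    futures = [pool.submit(worker, u, f'downloads/{i}.jpg')
               for i, u in enumerate(urls)]
    for fut in futures:
        try:
            fut.result()  # surface any download error
        except Exception as e:
            print(f"A download failed: {e}")
```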
Pitfall avoidance table
| Pitfall | Fix |
|---|---|
| IP gets blocked too quickly | Increase the IP rotation frequency in the ipipgo backend |
| Images don't load completely | Render the page with selenium first, then download (see the sketch below) |
| Site triggers human verification (CAPTCHA) | Enable datacenter IP filtering in ipipgo |
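For the lazy-loaded image case, here is a sketch of the selenium approach (Chrome, selenium 4 API; the gallery URL is a placeholder, and whitelist-based proxy auth is assumed since Chrome's `--proxy-server` flag doesn't take inline credentials): render the page first so JavaScript fills in the real image URLs, then hand those URLs to the proxied downloader.

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

options = webdriver.ChromeOptions()
options.add_argument('--headless')
# Route the browser itself through the proxy gateway as well
options.add_argument('--proxy-server=http://gateway.ipipgo.com:9020')

driver = webdriver.Chrome(options=options)
driver.get('https://example.com/gallery')  # placeholder page

# Collect the real image URLs after JavaScript has rendered them
img_urls = [img.get_attribute('src')
            for img in driver.find_elements(By.TAG_NAME, 'img')]
driver.quit()

for i, u in enumerate(img_urls):
    download_img(u, f'{i}.jpg')  # the downloader defined earlier in this article
```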
A few honest words
Don't trust those free proxies: besides being slow, they may even carry trojans. I've used ipipgo for more than half a year, and the biggest benefit is that you can choose the IP's region; whatever region you want to grab images from, just pick a node there. They're running a promotion right now: new users get 10G of traffic, and entering the promo code IMG2024 at sign-up adds another 5G, enough to download tens of thousands of images.
One last nag: don't set the timeout too low! Some sites deliberately slow down their responses, and a timeout of 10 seconds or less is easy to trip by mistake. With ipipgo, I recommend setting the timeout to 15-20 seconds; the success rate can go up by 30%.
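In `requests` that looks like the sketch below; the tuple form (an optional refinement, not from the original advice) splits the connect and read timeouts:

```python
import requests

url = 'https://example.com/pic.jpg'  # placeholder
proxies = {'http': 'http://username:password@gateway.ipipgo.com:9020',
           'https': 'http://username:password@gateway.ipipgo.com:9020'}

# Single value: applies to both the connect and read phases
resp = requests.get(url, proxies=proxies, timeout=20)

# Tuple form: 15 s to connect, 20 s to read the response
resp = requests.get(url, proxies=proxies, timeout=(15, 20))
```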