
Hands-on Web Crawling with Python
Recently a few friends asked Lao Zhang: "I want to learn web scraping, but the site keeps blocking my IP. What do I do?" It's like playing a game and getting kicked out of the room every time. Today we'll walk through web scraping with Python in plain language, focusing on how to use a proxy IP as your "cloak of invisibility".
Prepare your toolbox
Let's start by installing a few essentials:
pip install requests beautifulsoup4
Note: don't blindly install the latest version of every library, since some newer releases have compatibility issues. In my experience, for example, requests 2.25.1 has been stable.
A first scraping snippet for beginners
Let's start with a simple example that grabs the price from an e-commerce site:
import requests
from bs4 import BeautifulSoup
url = 'https://example.com/product'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
price = soup.find('span', class_='price').text
print(f"Current price: {price}")
Run that a few times in a row and you'll get blocked, the same way a supermarket security guard starts watching you if you keep flipping through the price tags.
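A blocked or changed page often no longer contains the price element, in which case calling .text on the result of soup.find raises an AttributeError. Here is a minimal sketch of a more defensive extraction, assuming the same span with class "price" as in the snippet above. The helper name extract_price and the sample HTML strings are made up for illustration.

```python
from bs4 import BeautifulSoup


def extract_price(html):
    """Return the text of the first <span class="price">, or None if it is missing."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("span", class_="price")
    # soup.find returns None when nothing matches, so check before reading the text
    return tag.get_text(strip=True) if tag is not None else None


# Illustrative HTML only; a real page would come from response.text
sample_html = '<div><span class="price"> $19.99 </span></div>'
print(extract_price(sample_html))
print(extract_price("<div>Access denied</div>"))
```

Getting None back is a cheap signal that the page is not what you expected, which is worth checking before blaming the parser.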
The right way to use a proxy IP
This is where our "cloak of invisibility" comes out: the ipipgo proxy service. They offer exclusive high-speed lines, which are a lot more stable than public proxies. Usage looks like this:
proxies = {
'http': 'http://username:password@gateway.ipipgo.com:port',
'https': 'https://username:password@gateway.ipipgo.com:port'
}
response = requests.get(url, proxies=proxies, timeout=10)
Be sure to replace the username, password, and port with the credentials from your ipipgo dashboard; don't copy this code verbatim!
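One way to avoid pasting credentials into the script is to read them from environment variables and build the dictionary in code. The sketch below mirrors the shape of the proxies dictionary above. The variable names PROXY_USER, PROXY_PASS, PROXY_HOST, and PROXY_PORT and the fallback port 8080 are assumptions for illustration, not ipipgo settings. Percent-encoding matters when a password contains characters such as @ or :, which would otherwise break the URL.

```python
import os
from urllib.parse import quote


def build_proxies(username, password, host, port):
    """Build a requests-style proxies dict, percent-encoding the credentials."""
    user = quote(username, safe="")
    pwd = quote(password, safe="")
    endpoint = f"{user}:{pwd}@{host}:{port}"
    # Same layout as the hand-written dictionary: one entry per URL scheme
    return {"http": f"http://{endpoint}", "https": f"https://{endpoint}"}


# Hypothetical environment variable names; the defaults are placeholders only
proxies = build_proxies(
    os.environ.get("PROXY_USER", "username"),
    os.environ.get("PROXY_PASS", "password"),
    os.environ.get("PROXY_HOST", "gateway.ipipgo.com"),
    os.environ.get("PROXY_PORT", "8080"),
)
```

The resulting dictionary can be passed as the proxies argument of requests.get, exactly like the literal version.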
Essential tips for scrapers
1. IP rotation strategy: don't keep hammering the site from a single IP; use the ipipgo API to fetch IPs dynamically.
import random

def get_proxy():
    # This is a call to the ipipgo API (pseudocode: `ipipgo` is not imported here)
    proxy_list = ipipgo.get_proxy_list()
    return random.choice(proxy_list)
2. Request header disguise: put some "make-up" on the request.
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0) AppleWebKit/537.36',
'Accept-Language': 'zh-CN,zh;q=0.9'
}
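The two tips above can be combined into one helper that picks proxies at random, sends the disguised headers, and moves on to the next proxy when a request fails. This is only a sketch under stated assumptions: fetch_with_rotation is a made-up name, the proxy pool is a plain list of proxy URLs you have obtained yourself, and fetch stands for any callable with the same keyword arguments as requests.get.

```python
import random

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0) AppleWebKit/537.36",
    "Accept-Language": "zh-CN,zh;q=0.9",
}


def fetch_with_rotation(url, proxy_pool, fetch, max_attempts=3):
    """Try the request through randomly ordered proxies until one succeeds.

    `fetch` is any callable shaped like requests.get, i.e.
    fetch(url, proxies=..., headers=..., timeout=...).
    """
    candidates = list(proxy_pool)
    random.shuffle(candidates)
    candidates = candidates[:max_attempts]
    last_error = None
    for proxy_url in candidates:
        proxies = {"http": proxy_url, "https": proxy_url}
        try:
            return fetch(url, proxies=proxies, headers=HEADERS, timeout=10)
        except Exception as exc:  # with requests, narrow this to requests.RequestException
            last_error = exc
    raise RuntimeError(f"all {len(candidates)} proxies failed") from last_error
```

With requests installed, a call would look like fetch_with_rotation(url, proxy_list, requests.get). Passing the fetch function in also makes the helper easy to try out with a fake function before spending real proxy traffic.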
FAQ first aid kit
Q: What should I do if I always get a connection timeout?
A: Eight times out of ten the proxy is unstable. Try switching to an ipipgo exclusive line, and don't use free proxies!
Q: What if the returned data is garbled?
A: Remember to set response.encoding = 'utf-8', or use the chardet library to auto-detect the encoding.
Q: How can I tell if my IP is blocked?
A: Check whether the returned status code is 403, or whether the page content contains a message such as "too frequent visits".
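The block check from the last answer can be written as a small function. This is a rough sketch: looks_blocked is a made-up name, the marker phrases are examples you would replace with whatever the target site actually shows, and treating 429 (Too Many Requests) as a block signal is my addition on top of the 403 check.

```python
# 403 comes from the answer above; 429 (Too Many Requests) is an added assumption
BLOCK_STATUS_CODES = {403, 429}
# Illustrative phrases only; adjust them to the wording of the target site
BLOCK_MARKERS = ("too frequent visits", "access denied")


def looks_blocked(status_code, body_text, markers=BLOCK_MARKERS):
    """Return True if the status code or the page text suggests the IP is blocked."""
    if status_code in BLOCK_STATUS_CODES:
        return True
    lowered = body_text.lower()
    return any(marker.lower() in lowered for marker in markers)
```

With requests, the call would be looks_blocked(response.status_code, response.text). A True result is a cue to switch proxy and slow down, not proof of a ban.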
Pitfalls to avoid
1. Don't use a fixed interval like time.sleep(1); use a random delay such as time.sleep(random.uniform(1, 3)) instead.
2. Don't fight CAPTCHAs head-on; ipipgo's high-anonymity IP packages reduce the chance of triggering them.
3. Remember to cache important data locally so you don't re-fetch it every time.
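The random delay and the local cache fit naturally into one helper. Below is a minimal sketch, assuming pages are cached as files named by a hash of the URL. The names polite_sleep and cached_fetch, the cache directory, and the fetch callable (any function that takes a URL and returns the page text) are all hypothetical.

```python
import hashlib
import random
import time
from pathlib import Path


def polite_sleep(low=1.0, high=3.0):
    """Sleep for a random interval between low and high seconds; return the delay."""
    delay = random.uniform(low, high)
    time.sleep(delay)
    return delay


def cached_fetch(url, fetch, cache_dir="cache", delay_range=(1.0, 3.0)):
    """Return the page text for url, reusing a local copy if one exists."""
    name = hashlib.sha256(url.encode("utf-8")).hexdigest() + ".html"
    cache_path = Path(cache_dir) / name
    if cache_path.exists():
        # Cache hit: no network request and no delay needed
        return cache_path.read_text(encoding="utf-8")
    polite_sleep(*delay_range)  # random pause before touching the site
    text = fetch(url)
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    cache_path.write_text(text, encoding="utf-8")
    return text
```

With requests, fetch could be something like lambda u: requests.get(u, timeout=10).text. This sketch never expires cached files, so for prices that change you would need to delete or age out old entries.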
Lastly, a few words from the heart: choosing a proxy service is like finding a partner. If you go for free proxies just because they're cheap, sooner or later you'll land in trouble. I've been using ipipgo for half a year and the stability really holds up, especially the pay-as-you-go package, which is particularly friendly to small projects. Newcomers are advised to practice with the trial package first and get familiar with it before moving on to heavy traffic.

