First, why use BeautifulSoup?
Anyone who has ever crawled data has heard of it: BeautifulSoup is, quite frankly, a webpage parser. It turns messy HTML code into a tree structure that is easy to navigate. For example, say you want to grab a product's price from an e-commerce page: fetch the page with requests, and BeautifulSoup can pick out the price figure in just a few lines.
import requests
from bs4 import BeautifulSoup

url = 'https://example.com/product'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
# find() returns None when nothing matches, so guard before reading .text
price_tag = soup.find('span', class_='price')
price = price_tag.text if price_tag else None
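To see what that tree structure buys you without hitting a live site, here is a minimal sketch that parses a hard-coded HTML snippet; the tag names and class names are made up for illustration and will differ on a real page:

```python
from bs4 import BeautifulSoup

# a hard-coded snippet standing in for a downloaded product listing
html = '''
<div class="item"><span class="name">Widget</span><span class="price">19.99</span></div>
<div class="item"><span class="name">Gadget</span><span class="price">4.50</span></div>
'''

soup = BeautifulSoup(html, 'html.parser')
for item in soup.find_all('div', class_='item'):
    name = item.find('span', class_='name').text
    price = float(item.find('span', class_='price').text)
    print(name, price)
```

The same find/find_all calls work identically on HTML fetched over the network, which makes it easy to prototype your parsing logic offline first.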
Second, the installation steps in detail (works on both Windows and Mac)
There are two scenarios here: installing with pip and installing manually. Let's start with the simplest:
Normal installation (remember to install the Python environment first)
pip install beautifulsoup4
Specific version installation (some older projects require a specific version)
pip install beautifulsoup4==4.9.3
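After installing, you can confirm the package is importable and check which version pip actually resolved from a Python shell; note the distribution is named beautifulsoup4 but the module you import is bs4:

```python
# confirm the install; the exact version string depends on what pip resolved
import bs4
from bs4 import BeautifulSoup

print(bs4.__version__)  # e.g. 4.9.3 if you pinned that version
```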
If the installation runs into network roadblocks, such as an SSLError or a Timeout, it's time to bring in ipipgo's IP proxy service. Do this from the command line:
pip install --proxy=http://username:password@proxy.ipipgo.cn:port beautifulsoup4
Third, the golden pairing of proxy IP and BeautifulSoup
What's the biggest fear in data collection? Getting your IP blocked. That's when ipipgo's dynamic proxy pool comes to the rescue. Here's a real-world scenario:
import requests
from bs4 import BeautifulSoup

proxies = {
    'http': 'http://user:pass@proxy.ipipgo.cn:9020',
    'https': 'http://user:pass@proxy.ipipgo.cn:9020'
}

for page in range(1, 10):
    url = f'https://example.com/search?page={page}'  # replace with the target e-commerce site
    response = requests.get(url, proxies=proxies)
    soup = BeautifulSoup(response.text, 'lxml')
    # ...and here's where the parsing logic comes in
With ipipgo's exclusive high-speed proxies, rotating IPs between requests effectively avoids triggering the site's anti-crawling mechanisms. Their IP pool is refreshed with 8 million+ resources daily, and in hands-on tests the crawl success rate reached 98% or more.
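Once the pages are coming back through the proxy, the parsing step itself needs no network at all. Here is a minimal sketch of what the loop body's parsing logic might look like, run against a hard-coded snippet; the results/result class names are invented for illustration, and html.parser is used so no extra install is needed:

```python
from bs4 import BeautifulSoup

# a stand-in for one page of search results; real class names will differ
page_html = '''
<ul class="results">
  <li class="result"><a href="/item/1">Item one</a></li>
  <li class="result"><a href="/item/2">Item two</a></li>
</ul>
'''

soup = BeautifulSoup(page_html, 'html.parser')
# CSS selectors via select() are often terser than chained find() calls
links = [(a.text, a['href']) for a in soup.select('li.result a')]
print(links)
```

In the real loop you would append these tuples to a results list across pages instead of printing them.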
IV. Q&A session (a must-read for newbies)
Q: What should I do if import fails right after installing?
A: Most likely the package names got mixed up: you install with beautifulsoup4, but the import has to be written as from bs4 import BeautifulSoup
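A two-line smoke test makes the naming distinction concrete; if this runs without error, the install is fine:

```python
# installed as beautifulsoup4, imported as bs4 -- this asymmetry trips up many beginners
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>ok</p>', 'html.parser')
print(soup.p.text)
```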
Q: What should I do if my connection always times out?
A: First use ipipgo's proxy connectivity test tool to check whether the proxy is in effect, then check whether the target site has an anti-crawl strategy
Q: How do I speed up slow parsing?
A: Two tricks: ① switch to the lxml parser ② use ipipgo's static long-lived proxies to cut the time spent re-validating connections
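You can measure the parser swap yourself with timeit; lxml is a separate install (pip install lxml), so this sketch falls back gracefully when it's missing, and the absolute numbers depend on your machine:

```python
import timeit
from bs4 import BeautifulSoup

html = '<html><body>' + '<p class="x">hi</p>' * 200 + '</body></html>'

def parse(parser):
    return BeautifulSoup(html, parser)

t_builtin = timeit.timeit(lambda: parse('html.parser'), number=20)
print(f'html.parser: {t_builtin:.3f}s')

# lxml requires its own install; compare it only when available
try:
    t_lxml = timeit.timeit(lambda: parse('lxml'), number=20)
    print(f'lxml: {t_lxml:.3f}s')
except Exception:
    print('lxml not installed, skipping comparison')
```

On larger documents the gap usually widens, which is why lxml is the common recommendation for bulk crawling.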
V. Pitfall-avoidance guide (lessons learned the hard way)
1. Don't use the old Python 2.7 environment; BeautifulSoup4 only shows its full power on Python 3.6+
2. If you encounter an SSL certificate error, add the verify=False parameter to requests.get() (a temporary workaround only)
3. Use ipipgo's IP whitelist authentication feature to make sure the proxy is configured correctly, so proxy issues don't take the blame
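On point 2: verify=False also makes urllib3 emit an InsecureRequestWarning on every request. A minimal sketch of the workaround with the warning silenced; the fetch_insecure helper name is made up for illustration:

```python
import requests
import urllib3

# verify=False skips certificate validation; silence the resulting warning too
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

def fetch_insecure(url):
    # temporary workaround only -- do not ship this to production
    return requests.get(url, verify=False, timeout=10)
```

Prefer fixing the certificate chain (e.g. updating the certifi package) over leaving this in place long-term.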
One final note: don't run your data collection naked. ipipgo's proxy service can save you 80% of the detours. Their official site currently gives new users 1 GB of free traffic, which is plenty for testing. For any technical problems, go straight to their 24/7 online technical support; it's more reliable than hunting for tutorials online.