
A Hands-On Guide to Scraping Web Pages with BeautifulSoup
Recently a friend kept asking me: "My Python crawler keeps getting its IP blocked — what do I do?" Today let's talk about exactly that. First off, anyone doing data collection needs to master two tricks: the HTML parsing + proxy IP combo. It's like shopping at a market: you need to know how to pick the produce (parsing), but you also need to know how to handle the stall owners (avoiding bans).
BeautifulSoup basic operations
First, install the tools:

```shell
pip install beautifulsoup4 requests
```
Here's an example that grabs product prices:
```python
import requests
from bs4 import BeautifulSoup

# Remember to use your ipipgo proxy credentials here (placeholders below)
proxies = {
    'http': 'http://username:password@gateway.ipipgo.com:9020',
    'https': 'http://username:password@gateway.ipipgo.com:9020',
}

resp = requests.get('https://example.com/products', proxies=proxies)
soup = BeautifulSoup(resp.text, 'lxml')

prices = soup.select('.price-tag')
for price in prices:
    print(price.text.strip())
```
Watch out for this pitfall: many sites check the User-Agent, so remember to set it in `headers`, otherwise even the proxy won't save you.
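As a sketch, a realistic browser User-Agent can be sent alongside the proxy settings (the UA string and proxy credentials below are placeholders):

```python
import requests

# Any realistic desktop browser User-Agent works; this one is a placeholder.
headers = {
    'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/120.0.0.0 Safari/537.36'),
}

# Placeholder proxy credentials, same shape as in the snippet above.
proxies = {
    'http': 'http://username:password@gateway.ipipgo.com:9020',
    'https': 'http://username:password@gateway.ipipgo.com:9020',
}

# resp = requests.get('https://example.com/products',
#                     headers=headers, proxies=proxies, timeout=10)
```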
The Right Way to Use Proxy IPs
Why use ipipgo's proxy? Just look at this comparison table:
| Scenario | Generic proxy | ipipgo proxy |
|---|---|---|
| E-commerce sites | Banned within 10 minutes | Stable for 8+ hours |
| Social media | Frequent CAPTCHAs | ~70% fewer CAPTCHAs |
| High-frequency scraping | Frequent disconnections | Intelligent IP rotation |
Now for the key part: IP rotation. Rotate each request through a pool of proxy IPs (such as a pool from ipipgo) so that no single address draws enough traffic to get flagged.
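A minimal rotation sketch using `itertools.cycle`; the pool below is a placeholder, and in practice you would fetch fresh addresses from your provider:

```python
import itertools
import requests

# Hypothetical proxy pool; replace with addresses from your provider.
PROXY_POOL = [
    'http://user:pass@gateway.ipipgo.com:9020',
    'http://user:pass@gateway.ipipgo.com:9021',
    'http://user:pass@gateway.ipipgo.com:9022',
]
rotation = itertools.cycle(PROXY_POOL)

def get_with_rotation(url):
    """Send each request through the next proxy in the pool."""
    proxy = next(rotation)
    return requests.get(url, proxies={'http': proxy, 'https': proxy}, timeout=10)
```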
A Practical Guide to Avoiding Pitfalls
Ever been in one of these situations?
A typical error:

```
ConnectionError: HTTPSConnectionPool...
```
There are three things to check at this point:
1. Whether the proxy address is written correctly (especially the port number)
2. Whether the account password has expired
3. Whether SSL verification is enforced by the target site
Here's a trick: add `verify=False` and `timeout=10` in `requests.get()`; that solves 80% of SSL problems.
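A session-level sketch of those two settings (the proxy gateway below is a placeholder):

```python
import urllib3
import requests

# verify=False skips certificate checks on sites with broken cert chains;
# silence the warning it triggers to keep logs readable.
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

session = requests.Session()
session.verify = False
session.proxies = {'http': 'http://username:password@gateway.ipipgo.com:9020'}  # placeholder

# timeout is passed per request; 10s fails fast instead of hanging on a dead proxy
# resp = session.get('https://example.com/products', timeout=10)
```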
Veteran Tips
A few spots where it's easy to stumble:
- Don't use the default `html.parser`; switching to the `lxml` parser is about twice as fast
- For dynamically loaded data, use Selenium plus ipipgo's mobile proxies
- Clear cookies regularly; every 50 requests is a good rule of thumb
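The cookie-clearing tip can be sketched like this (the 50-request threshold follows the rule of thumb above; the crawl loop is illustrative):

```python
import requests

CLEAR_EVERY = 50  # threshold recommended above

def should_clear_cookies(request_count, every=CLEAR_EVERY):
    """True on every `every`-th request."""
    return request_count > 0 and request_count % every == 0

def crawl(urls):
    """Fetch each URL with one session, resetting cookies periodically."""
    session = requests.Session()
    for i, url in enumerate(urls, start=1):
        resp = session.get(url, timeout=10)
        if should_clear_cookies(i):
            session.cookies.clear()  # look like a fresh visitor again
        yield resp
```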
Frequently Asked Questions
Q: What should I do if I keep getting 403 errors?
A: Check three things: 1) the request headers are missing a User-Agent; 2) the IP has been flagged; 3) the request frequency is too high. ipipgo's residential proxies are recommended here since they are harder to distinguish from real users.
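For the third cause, request frequency, adding a random delay between requests helps; a minimal sketch with illustrative bounds:

```python
import random
import time

def request_delay(min_s=1.0, max_s=3.0):
    """Pick a random pause so requests don't arrive at a machine-like rhythm."""
    return random.uniform(min_s, max_s)

def polite_get(session, url):
    time.sleep(request_delay())
    return session.get(url, timeout=10)
```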
Q: What should I do if the data doesn't load completely?
A: 80% of the time you've hit dynamic rendering; use this combination: Selenium + a headless browser + ipipgo's dynamic IP pool.
Q: How do I get a good deal on ipipgo's proxies?
A: New users can start with the 3-day trial; for batch collection pick the Enterprise package, and remember to apply coupon code BS2023 for 10% off.
Some Parting Thoughts
Data collection is like guerrilla warfare: don't expect one configuration to work everywhere. Different sites call for different strategies, and the key is to keep testing and adjusting. I recently found ipipgo's Intelligent Routing feature quite handy: it automatically matches you to the fastest node, which roughly doubled my collection throughput.
One last reminder: don't put Chinese in headers! Don't use Chinese! Don't use Chinese! (Important things get said three times.) Some sites check for this; percent-encode such values before sending.
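Percent-encoding is one line with the standard library; the Referer value below is an illustrative example:

```python
from urllib.parse import quote

# Raw Chinese bytes in a header can get a request rejected;
# percent-encode the value first.
referer = 'https://example.com/搜索'  # illustrative URL containing Chinese
safe_referer = quote(referer, safe=':/')
print(safe_referer)  # https://example.com/%E6%90%9C%E7%B4%A2
```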

