
Hands-On with Beautiful Soup: Scraping Data Through Proxy IPs
Recently, a lot of readers have asked me: what do you do when your Python scraper keeps running into a site's anti-scraping measures and you're at your wits' end? Today we'll walk through how to pair Beautiful Soup with proxy IPs, so that collecting data becomes easy and reliable.
Why do you need a proxy IP for scraping?
Think of it like playing games all night in an Internet cafe: if the owner sees you've been online too long and unplugs your cable, you can just switch seats and keep going. A proxy IP works the same way: when a site notices you're sending requests too often, switching to a new IP address lets you keep working.
We recommend ipipgo, a proxy service that specializes in solving exactly these problems:
1. Massive IP pool, switchable at any time
2. Request success rate stable at 99%+
3. Support for HTTP/HTTPS/SOCKS5 protocols
For projects that need long-term collection, such as e-commerce price monitoring, using their proxies can save you a lot of headaches.
Configuring the Proxy Environment
Install the essential toolkit first:
pip install beautifulsoup4 requests
One small pitfall to watch out for: many tutorials never teach you to set a timeout, so write it like this instead:
import requests
from bs4 import BeautifulSoup

proxies = {
    'http': 'http://username:password@gateway.ipipgo.com:9020',
    'https': 'http://username:password@gateway.ipipgo.com:9020'
}

try:
    response = requests.get('destination URL', proxies=proxies, timeout=10)
    soup = BeautifulSoup(response.text, 'html.parser')
except requests.exceptions.ProxyError as e:
    print("The proxy is acting up, check the configuration!")
This uses ipipgo's authentication format; remember to substitute your own username and password. A timeout of no less than 8 seconds is recommended, to give the server time to respond.
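For longer scraping runs, it's tidier to set the proxy once on a requests.Session and reuse it for every call. A minimal sketch, using the same placeholder gateway address and credentials as above (they are not real):

```python
import requests

# A Session applies proxy and header settings to every request it makes,
# so you don't have to repeat them on each call.
session = requests.Session()
session.proxies.update({
    'http': 'http://username:password@gateway.ipipgo.com:9020',
    'https': 'http://username:password@gateway.ipipgo.com:9020',
})
session.headers.update({'User-Agent': 'Mozilla/5.0 (Windows NT 10.0)'})

# From here on, session.get(url, timeout=10) goes through the proxy
# with the disguised User-Agent automatically.
```

A Session also reuses the underlying TCP connection, which speeds up repeated requests to the same host.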
HTML Parsing Tips
Don't panic when you run into tricky page content; try these positioning tricks:
# Find divs whose class contains "price" (note *= for a substring match)
soup.select('div[class*="price"]')

# Grab the third row of the second table (indexes are zero-based)
soup.find_all('table')[1].find_all('tr')[2]

# Extract keywords from the meta tag
soup.find('meta', {'name': 'keywords'})['content']
If the page structure changes frequently, it's recommended to pair ipipgo's rotating-IP feature with a retry mechanism:
for _ in range(3):
    try:
        # The IP is switched automatically here
        response = requests.get(url, proxies=proxies, timeout=10)
        break
    except requests.exceptions.RequestException:
        continue
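The loop above can be wrapped into a reusable helper. This is a sketch under the assumption that the proxy gateway hands out a fresh IP on each request; the function name and the backoff values are my own, not from any library:

```python
import random
import time
import requests

def fetch_with_retry(url, proxies, attempts=3, timeout=10):
    """Retry a GET a few times; with a rotating-proxy gateway,
    each attempt typically goes out on a different IP."""
    last_error = None
    for attempt in range(attempts):
        try:
            return requests.get(url, proxies=proxies, timeout=timeout)
        except requests.exceptions.RequestException as e:
            last_error = e
            if attempt < attempts - 1:
                # Small growing pause before the next attempt
                time.sleep(1 + attempt + random.random())
    raise last_error
```

Raising the last error after all attempts fail means the caller still sees what went wrong, instead of silently getting nothing back.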
Frequently Asked Questions
Q: What should I do if my proxy IP suddenly doesn't work?
A: First check whether your account has expired, then run the IP detection tool in the ipipgo dashboard; sometimes the target site temporarily blocks certain IP ranges.
Q: How can I confirm that the proxy is actually in effect?
A: Add a quick test to your code:
print(requests.get('http://httpbin.org/ip', proxies=proxies).json())
If the IP returned is not your local address, the proxy is working.
Q: What can I do if I encounter an SSL certificate error?
A: Add the verify=False parameter to the requests call, or contact ipipgo customer service to switch to their SSL certificate.
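A word of caution on that answer: verify=False disables certificate verification entirely, so use it only when you understand and accept the risk. A minimal sketch that also silences the InsecureRequestWarning that requests emits in this mode:

```python
import requests
import urllib3

# Disabling verification triggers a warning on every request;
# silence it explicitly so the risk stays a deliberate choice.
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

# Then pass verify=False alongside your proxy settings, e.g.:
# response = requests.get(url, proxies=proxies, verify=False, timeout=10)
```

Prefer fixing the certificate chain when you can; verify=False should be a last resort, not a default.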
Upgraded Tricks
If you want the scraper to run reliably around the clock, remember to add these configurations:
# Wait a random 1-3 seconds between requests
import random
import time
time.sleep(random.uniform(1, 3))
# Disguise the browser's identity
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0) AppleWebKit/537.36...'
}
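Putting both tips together, here is a minimal sketch of a polite fetch loop; the function name and the URL list are illustrative, and the User-Agent string is just an example:

```python
import random
import time
import requests

# Example disguised headers, as discussed above
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0) AppleWebKit/537.36'}

def polite_fetch(urls):
    """Fetch a list of pages, pausing a random 1-3 seconds
    between requests so the target server isn't hammered."""
    pages = []
    for url in urls:
        response = requests.get(url, headers=headers, timeout=10)
        pages.append(response.text)
        time.sleep(random.uniform(1, 3))
    return pages
```

The randomized pause makes the traffic pattern look less machine-like than a fixed interval would.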
Pair this with ipipgo's pay-per-volume billing package and distributed crawling can save you a lot of money. Their API can also fetch a list of available IPs in real time, which is especially suitable for high-concurrency scenarios.
Finally: even with proxy IPs, don't scrape a website to death. Control your request frequency and be an ethical crawler engineer, so that our scraping work can last~

