A hands-on guide to installing BeautifulSoup with pip: what to do when the network chokes?
Nine out of ten people who use Python for data scraping end up installing BeautifulSoup, but the biggest headache for newcomers is hitting network congestion: a package that should install in seconds drags on for half the day. Time to call in our savior: the proxy IP!
Install directly with the proxy parameter
pip install beautifulsoup4 --proxy=http://username:password@ipipgo-proxy.com:1234
Or set it permanently in the configuration file (recommended)
Create a new ~/.pip/pip.conf file and write:
[global]
proxy = http://username:password@ipipgo-proxy.com:1234
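To double-check that pip actually picked up the setting, you can print the configuration back out:
pip config list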
What can a proxy IP really do? Why use ipipgo?
Here's an analogy: when your online-shopping parcel is stuck somewhere on the road, a proxy IP is your own dedicated courier who takes a clear route instead. With ipipgo's proxy service, three advantages stand out:
| Pain point | Remedy |
|---|---|
| Downloads crawl along | Acceleration via national backbone nodes |
| Frequent disconnections | Intelligent automatic IP switching |
| Fiddly authentication | One-click proxy acquisition via API |
This is especially handy for automated deployments. pip reads PIP_<OPTION> environment variables (so PIP_PROXY maps to --proxy), which means two lines in the Dockerfile are enough:
ENV PIP_PROXY=http://ipipgo-proxy.com:1234
RUN pip install beautifulsoup4 requests
A guide to defusing common pitfalls
Q: Why do I still get timeouts after setting the proxy?
A: Eight times out of ten the IP has simply expired; go to the ipipgo console and refresh the IP pool. Their liveness-detection feature is quite smart and kicks out stale IPs ahead of time.
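If you also want to weed out dead IPs on your own side, here is a minimal sketch along the same idea. It assumes proxy_pool is whatever list of proxy URLs you pulled from the provider, and uses httpbin.org/ip purely as a convenient echo target:
import requests

def proxy_is_alive(proxy_url, timeout=5):
    """Return True if the proxy can still reach the outside world."""
    try:
        # httpbin.org/ip simply echoes back the IP the request arrived from
        resp = requests.get(
            "https://httpbin.org/ip",
            proxies={"http": proxy_url, "https": proxy_url},
            timeout=timeout,
        )
        return resp.ok
    except requests.RequestException:
        return False

# keep only the proxies that still respond
live_pool = [p for p in proxy_pool if proxy_is_alive(p)]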
Q: What about company intranet restrictions?
A: Try ipipgo's tunnel proxy mode: change the proxy address to http://tunnel.ipipgo.com and your traffic automatically goes through the encrypted channel.
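Concretely, that is the same install command as before with only the hostname swapped (keep your own username:password in the URL if your account requires them):
pip install beautifulsoup4 --proxy=http://username:password@tunnel.ipipgo.com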
Q: Will using a mirror source and a proxy at the same time cause a conflict?
A: No conflict at all! Here is the recommended combination (a must for users in mainland China):
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple beautifulsoup4 --proxy=http://ipipgo-proxy.com:1234
Proxy IPs can be used like this too?
Installing libraries is just the warm-up; the real moves come out in actual crawler work. For example, when using the requests library, hook your session up to ipipgo's proxy pool:
import requests
from itertools import cycle
import ipipgo  # assumed provider SDK that exposes get_proxy_pool()

proxies = cycle(ipipgo.get_proxy_pool())  # cycle through the pool to auto-rotate IPs
session = requests.Session()
session.proxies = {'http': next(proxies)}
# Then just parse the responses with bs4 as normal
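To round that off, here is a minimal sketch of the "parse with bs4" part plus a simple rotate-on-failure loop; the target URL is only a placeholder, and it reuses the session and proxies objects from above:
from bs4 import BeautifulSoup

url = "https://example.com"  # placeholder target
for _ in range(3):  # retry up to three times, rotating IPs on failure
    try:
        resp = session.get(url, timeout=10)
        resp.raise_for_status()
        break
    except requests.RequestException:
        session.proxies = {'http': next(proxies)}  # switch to the next IP
else:
    raise RuntimeError("all proxies failed")

soup = BeautifulSoup(resp.text, "html.parser")
print(soup.title.string if soup.title else "no <title> found")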
One last reminder: stay away from free proxies! I've seen people get malicious code injected and watch a project they slaved over go cold overnight. With ipipgo's enterprise-grade encrypted channel, the data-security side is locked down tight.