
Hands-on: using proxy IPs to collect Amazon review data
Recently, several friends in cross-border e-commerce have asked me how to get Amazon product reviews from different regions. Copying and pasting by hand is not practical, so you need a crawler. But Amazon is no pushover: crawl it directly and your IP gets blocked within minutes. This is where proxy IPs come in as support.
Why do I have to use a proxy IP?
Say you open 10 threads to crawl data. Amazon's servers see the same IP sending a flood of requests, decide something is wrong, and blacklist it. With proxy IPs, it is as if different aliases are doing the work for you: each request goes out from a different IP address, so the traffic is much harder to flag.
Here's the point:
- Anti-blocking: high-frequency access from a single IP gets blocked
- Cross-region: you may want reviews from different marketplaces such as the US, the UK, and Japan
- Stability: reliable proxies keep collection running without interruption
What should you look for when choosing a proxy IP?
There are plenty of proxy providers on the market, and plenty of pitfalls too. Based on my own testing, a provider should meet these conditions:
| Criterion | Recommended value |
|---|---|
| IP type | Residential proxies are the safest |
| Success rate | Above 95% to be considered reliable |
| Geographic coverage | At least 20 countries |
| Concurrency | Supports 50+ threads |
Here is a recommendation: ipipgo. I have been using their residential proxies for half a year. The best part is that you can target a specific city. For example, when I want to crawl reviews from New York users, I specify a US East Coast IP directly, and the success rate can exceed 97%.
Seven steps in practice
1. Register an account on the ipipgo official website; new users get a 5 GB traffic trial
2. Generate an API key in the dashboard and note the endpoint address
3. Install a Python environment; the requests library is required
4. Write the proxy rotation logic. Code example:
import requests
proxies = {
"http": "http://USERNAME:PASSWORD@gateway.ipipgo.com:PORT",
"https": "http://USERNAME:PASSWORD@gateway.ipipgo.com:PORT"
}
response = requests.get("https://AMAZON_PRODUCT_URL", proxies=proxies, timeout=10)
5. Set random request headers; don't reuse the same User-Agent
6. Keep the request rate to no more than 3 per second
7. Remember to de-duplicate the data before storing it in the database
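Steps 4 to 7 can be sketched together as follows. This is only a minimal sketch under stated assumptions, not a definitive implementation: `USER_AGENTS`, `build_request_kwargs`, `RateLimiter`, and `dedupe_reviews` are hypothetical names, the User-Agent strings are samples, and `review_id` is an assumed field name for whatever unique key your parser extracts.

```python
import random
import time

# Assumed sample User-Agent strings; replace with a larger, current list.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]


def build_request_kwargs(proxy_url, timeout=10):
    """Return keyword arguments for requests.get with a random User-Agent."""
    return {
        "headers": {"User-Agent": random.choice(USER_AGENTS)},
        "proxies": {"http": proxy_url, "https": proxy_url},
        "timeout": timeout,
    }


class RateLimiter:
    """Space calls at least 1 / max_per_second seconds apart."""

    def __init__(self, max_per_second=3, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self.clock = clock
        self.sleep = sleep
        self.last_call = None

    def wait(self):
        now = self.clock()
        if self.last_call is not None:
            remaining = self.min_interval - (now - self.last_call)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last_call = now


def dedupe_reviews(reviews, key="review_id"):
    """Drop reviews whose key was already seen, keeping first occurrences in order."""
    seen = set()
    unique = []
    for review in reviews:
        if review[key] not in seen:
            seen.add(review[key])
            unique.append(review)
    return unique
```

In a crawl loop you would call `limiter.wait()` before each `requests.get(url, **build_request_kwargs(proxy_url))`, then pass the parsed reviews through `dedupe_reviews` before writing them to the database.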
Common pitfalls for beginners
Q: I'm using proxy IPs but still getting blocked. Why?
A: Check whether you are using data center IPs. Amazon is particularly sensitive to data center IPs; switching to residential proxies usually solves the problem right away.
Q: The crawl was running and suddenly returns no data?
A: Most likely the IP pool has been used up. Turn on the "automatically replace the IP" feature in the ipipgo dashboard and set it to swap in a new batch of IPs every 5 minutes.
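If you manage your own list of proxy URLs instead of relying on the dashboard setting, the same idea can be approximated on the client side. The sketch below is an illustration only: `TimedProxyPool` is a hypothetical class, not part of ipipgo's API, and the batches are whatever proxy URLs you supply.

```python
import time


class TimedProxyPool:
    """Cycle through batches of proxy URLs, moving to the next batch every `interval` seconds."""

    def __init__(self, batches, interval=300, clock=time.monotonic):
        if not batches:
            raise ValueError("at least one batch of proxies is required")
        self.batches = batches
        self.interval = interval  # 300 seconds = 5 minutes
        self.clock = clock
        self.started = clock()
        self.cursor = 0  # running counter used to round-robin inside a batch

    def current_batch(self):
        """Pick the batch that corresponds to the elapsed time."""
        elapsed = self.clock() - self.started
        index = int(elapsed // self.interval) % len(self.batches)
        return self.batches[index]

    def next_proxy(self):
        """Return the next proxy URL from the current batch, round-robin."""
        batch = self.current_batch()
        proxy = batch[self.cursor % len(batch)]
        self.cursor += 1
        return proxy
```

Each call to `next_proxy()` returns a proxy from the batch that is active for the current 5-minute window, so a long crawl moves on to fresh IPs without any manual switching.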
Q: How do I judge proxy IP quality?
A: Look at response time: discard any IP that takes more than 2 seconds. The ipipgo dashboard has a real-time monitoring panel, and high-latency IPs are filtered out automatically.
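The 2-second rule can also be checked from your own script. Here is a minimal sketch, assuming hypothetical helpers `measure_latency` and `filter_fast_proxies`; the `fetch` argument is any callable you provide that makes one request through the given proxy and raises on failure.

```python
import time


def measure_latency(proxy_url, fetch, clock=time.perf_counter):
    """Return the seconds taken by fetch(proxy_url), or None if it raises."""
    start = clock()
    try:
        fetch(proxy_url)
    except Exception:
        return None
    return clock() - start


def filter_fast_proxies(proxy_urls, fetch, max_seconds=2.0):
    """Keep only proxies that respond successfully within max_seconds."""
    fast = []
    for proxy_url in proxy_urls:
        latency = measure_latency(proxy_url, fetch)
        if latency is not None and latency <= max_seconds:
            fast.append(proxy_url)
    return fast
```

With the requests library, `fetch` could be something like `lambda p: requests.get(test_url, proxies={"http": p, "https": p}, timeout=5).raise_for_status()`, where `test_url` is a page of your choosing.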
To be honest
Don't go for cheap junk proxies. I once used bargain IPs at $0.10 each, and 8 out of 10 turned out to be unusable. I then switched to ipipgo's dedicated proxies, which cost more but run stably all night without dropping. Remember, with proxy IPs you get what you pay for; the money you save you end up losing in time.
Finally, a reminder: when crawling, comply with Amazon's robots.txt, and don't hammer a single product. It is best to collect in scheduled windows, for example half an hour each in the morning, at noon, and at night. That way you are less likely to be blocked and still pick up newly posted reviews.
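Both reminders can be turned into simple guards in the crawler. The following is a sketch under assumptions: the window times in `COLLECTION_WINDOWS` are made-up examples, and `in_collection_window` and `is_allowed` are hypothetical helper names. The robots.txt check uses the standard library's `urllib.robotparser`.

```python
from datetime import datetime, time as dtime
from urllib.robotparser import RobotFileParser

# Assumed half-hour windows for morning, noon, and night; adjust to taste.
COLLECTION_WINDOWS = [
    (dtime(8, 0), dtime(8, 30)),
    (dtime(12, 0), dtime(12, 30)),
    (dtime(20, 0), dtime(20, 30)),
]


def in_collection_window(now, windows=COLLECTION_WINDOWS):
    """Return True if the time of day of `now` falls inside any window."""
    current = now.time()
    return any(start <= current < end for start, end in windows)


def is_allowed(robots_lines, user_agent, url):
    """Check a URL against robots.txt rules supplied as a list of lines."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)
```

In real use you would load the live rules with `parser.set_url(...)` and `parser.read()` instead of passing lines by hand, and skip both the request and the proxy traffic whenever either guard returns False.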

