
Hands-on with Python for API Data Processing
Recently a few friends asked Lao Zhang: after calling an API with Python, how do you convert the returned JSON into CSV? It looks simple, but in practice there are plenty of pitfalls. In particular, when you need to collect a lot of data, the odds of getting your IP blocked roughly double, so the setup matters. Today we'll use our ipipgo proxy service as the example and walk through how to do this properly.
Why do you need a proxy IP?
A real case: Xiao Wang wrote a crawler last week, and it ran for less than two hours before the target site blacklisted his IP. This is extremely common, because many API endpoints enforce access-frequency limits. With ipipgo's proxy IP pool, it's like giving your program countless "doppelgängers": each request goes out from a different IP address, so it's much harder to get flagged.
| Metric | Without proxy | With ipipgo |
|---|---|---|
| Requests per day | 500 | 5,000+ |
| Chance of IP ban | >80% | <5% |
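The "doppelgänger" idea can be sketched as a simple round-robin proxy pool. The gateway addresses below are placeholders, not real ipipgo endpoints; substitute the ones from your own dashboard:

```python
import itertools

# Hypothetical proxy gateways; replace with your real ipipgo credentials
PROXY_POOL = [
    'http://user:pass@gate1.example.com:9021',
    'http://user:pass@gate2.example.com:9021',
    'http://user:pass@gate3.example.com:9021',
]

# Cycle through the pool so each request goes out from a different IP
_rotation = itertools.cycle(PROXY_POOL)

def next_proxies():
    proxy = next(_rotation)
    return {'http': proxy, 'https': proxy}
```

Each call to `next_proxies()` hands back the next gateway in the pool, wrapping around when it reaches the end.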
Getting set up
Start by installing a couple of essential libraries (skip any you already have):

```shell
pip install requests pandas
```
Pay special attention to the proxy settings of the requests library; many newcomers stumble here. A proxy for ipipgo should be written like this:
```python
proxies = {
    'http': 'http://username:password@gateway-address:port',
    'https': 'http://username:password@gateway-address:port'
}
```
Walking through the code
Suppose we want to fetch weather data. The complete process has three steps:
- Call the API through a proxy IP
- Flatten the nested JSON
- Save the result as a CSV file
```python
import requests
import pandas as pd

# Replace these with the real proxy credentials provided by ipipgo
PROXY_USER = "your_username"
PROXY_PASS = "your_password"
GATEWAY = "gateway.ipipgo.com:9021"

def get_data():
    proxies = {
        'http': f'http://{PROXY_USER}:{PROXY_PASS}@{GATEWAY}',
        'https': f'http://{PROXY_USER}:{PROXY_PASS}@{GATEWAY}',
    }
    # Fill in your own API address here
    resp = requests.get('https://api.weather.com/data',
                        proxies=proxies, timeout=10)
    resp.raise_for_status()
    return resp.json()

# Pay attention to nested structures here
def parse_data(raw):
    # Expand the multi-level nested dictionary into a flat table
    df = pd.json_normalize(raw, record_path='hourly',
                           meta=['city', 'update_time'])
    return df

if __name__ == '__main__':
    data = get_data()
    df = parse_data(data)
    df.to_csv('weather.csv', index=False, encoding='utf_8_sig')
```
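To see what json_normalize actually does with the structure parse_data expects, try it on a hand-made payload. The field names here are illustrative, not from any real weather API:

```python
import pandas as pd

# A made-up response shaped like the one parse_data() expects
raw = {
    'city': 'Shanghai',
    'update_time': '2024-05-01 08:00',
    'hourly': [
        {'hour': 8, 'temp': 18.5},
        {'hour': 9, 'temp': 20.1},
    ],
}

# Each 'hourly' record becomes a row; the meta fields 'city' and
# 'update_time' are repeated on every row
df = pd.json_normalize(raw, record_path='hourly',
                       meta=['city', 'update_time'])
print(df)
```

The result is a flat two-row table with columns hour, temp, city, update_time, ready for to_csv.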
Pitfalls to avoid
Three common pitfalls for newbies:
1. Proxy authentication errors: check for special characters in the username or password; an @ sign, for example, must be percent-encoded as %40
2. Missing fields: remember to pass the meta parameter when using json_normalize
3. Encoding issues: save the CSV with encoding='utf_8_sig' so the file opens cleanly in Excel
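For pitfall 1, instead of replacing characters by hand you can let the standard library do the percent-encoding. The credentials below are invented for illustration:

```python
from urllib.parse import quote

user = 'my_account'
password = 'p@ss:word'  # contains @ and : which would break the proxy URL

# quote() with safe='' percent-encodes every character that is unsafe
# inside the userinfo part of a URL
proxy_url = (f'http://{quote(user, safe="")}:{quote(password, safe="")}'
             '@gateway.ipipgo.com:9021')
print(proxy_url)
```

The @ becomes %40 and the : becomes %3A, so requests can parse the proxy URL unambiguously.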
Questions you may have
Q: Why use ipipgo and not others?
A: They have one standout feature, dynamic port binding: the same gateway can serve both HTTP and HTTPS traffic, so you don't have to switch back and forth between configurations.
Q: What should I do if things slow to a crawl when processing large amounts of data?
A: Try paging plus multithreading, and remember to give each thread its own proxy. ipipgo's high-anonymity enterprise package supports 500 concurrent connections; it held up well when we tried it ourselves.
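The paging-plus-threads advice can be sketched like this. fetch_page and the proxy list are placeholders; real code would call the API with requests instead of returning a stub:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical proxy list: one gateway per worker thread
PROXIES = [f'http://user:pass@gate{i}.example.com:9021' for i in range(4)]

def fetch_page(args):
    page, proxy = args
    # In real code: requests.get(API_URL, params={'page': page},
    #                            proxies={'http': proxy, 'https': proxy})
    return {'page': page, 'via': proxy}

def crawl(pages):
    # Pair each page with a proxy, reusing proxies round-robin,
    # so no two concurrent workers share one exit IP
    jobs = [(p, PROXIES[p % len(PROXIES)]) for p in pages]
    with ThreadPoolExecutor(max_workers=len(PROXIES)) as pool:
        return list(pool.map(fetch_page, jobs))
```

pool.map keeps the results in page order, which makes stitching the pages back together trivial.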
Q: What should I do if the data structure returned by the API always changes?
A: Wrap the parsing in a try-except block, and back up the raw response (json.dumps(raw_data)) to a database, so you can still reprocess the data later if parsing breaks.
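A minimal version of that fallback, with the database replaced by a local file purely for illustration:

```python
import json

def safe_parse(raw_data, parser):
    # Try the normal parser; on failure, append the raw JSON to a
    # backup file so nothing is lost and it can be reprocessed later
    try:
        return parser(raw_data)
    except (KeyError, TypeError, ValueError):
        with open('raw_backup.json', 'a', encoding='utf-8') as f:
            f.write(json.dumps(raw_data, ensure_ascii=False) + '\n')
        return None
```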
A few words from the heart
In data collection, proxy IPs are like a car's tires. Run on cheap tires (free proxies) and you'll blow out on the highway within minutes. Our team has tested ipipgo's commercial-grade proxies: three days of continuous collection without a single dropped connection. Their intelligent routing, which automatically switches to the fastest node, saves far more hassle than changing IPs by hand.
One last reminder for newcomers: use the pay-per-use package while testing, and switch to a monthly subscription once everything runs smoothly. JSON to CSV is simple on its own, but paired with a good proxy IP it becomes a real productivity tool.

