Converting Nested JSON data to CSV using python/pandas
I want to share how I converted a json (Javascript Object Notation) data into easily readable csv file. The data was obtained from Stackexchange API. https://api.stackexchange.com/docs/top-answerers-on-tags. I used the tag –‘python’ and period –‘month’. I saved this data.json to my working directory in PyCharm IDE.
{
"items": [
{
"user": {
"reputation": 42660,
"user_id": 4909087,
"user_type": "registered",
"accept_rate": 96,
"profile_image": "https://i.stack.imgur.com/q6xFr.jpg?s=128&g=1",
"display_name": "cᴏʟᴅsᴘᴇᴇᴅ",
"link": "https://stackoverflow.com/users/4909087/c%e1%b4%8f%ca%9f%e1%b4%85s%e1%b4%98%e1%b4%87%e1%b4%87%e1%b4%85"
},
"post_count": 347,
"score": 623
},
{
"user": {
"reputation": 173184,
"user_id": 2901002,
"user_type": "registered",
"accept_rate": 96,
"profile_image": "https://i.stack.imgur.com/hMDvl.jpg?s=128&g=1",
"display_name": "jezrael",
"link": "https://stackoverflow.com/users/2901002/jezrael"
},
"post_count": 396,
"score": 613
},
{
"user": {
"reputation": 95634,
"user_id": 2336654,
"user_type": "registered",
"accept_rate": 96,
"profile_image": "https://i.stack.imgur.com/mQIca.jpg?s=128&g=1",
"display_name": "piRSquared",
"link": "https://stackoverflow.com/users/2336654/pirsquared"
},
"post_count": 163,
"score": 360
},
{
"user": {
"reputation": 13646,
"user_id": 7964527,
"user_type": "registered",
"accept_rate": 100,
"profile_image": "https://i.stack.imgur.com/mD4ZC.jpg?s=128&g=1",
"display_name": "Wen",
"link": "https://stackoverflow.com/users/7964527/wen"
},
"post_count": 213,
"score": 325
},
],
"has_more": false,
"quota_max": 10000,
"quota_remaining": 9985
}
Objective:
- List the top answerers in Stackoverflow with tag –‘Python’.
Method:
Create a python file named convert_JSON_to_CSV.py and import the modules pandas, csv and json. As the JSON data is nested, we need to only select the dictionary keys that we need. I especially need the ‘item’ key, and its subkeys, and furthermore I need the keys ‘score’ and ‘post_count’ for each Stackoverflow user(SO_user).
The js_r function reads the input data.json and converts it to python dictionary. I defined the original dict keys as keys= my_dic_data.keys(). Now, I only need the ‘items’ key which I used in dictionary comprehension to iterate over the dictionary of items in the my_dic_data python dictionary and assign it to a new key- ‘my_items’.
dict_you_want={'my_items':my_dic_data['items']for key in keys}
Then I used pandas to create a DataFrame(df) for dict_you_want as df. However this df has one column my_items each row consisting a dictionary of ‘user’.I want this dictionary to be used as a column, for this I apply an .apply(pd.Series) method on df and assign it df2.
df2=df['my_items'].apply(pd.Series)
Finally, We keep the columns ‘post_count’ and ‘score’ and drop df2’s ‘user’ column to avoid the nested ‘user’ dictionary, since we need to expand it one level. We concatenate (pd.concat) the df2 without ‘user’ column with expanded df2’s ‘user’ column by applying .apply (pd.Series) on df2 as df2[‘user’].apply(pd.Series).
df3=pd.concat([df2.drop(['user'],axis=1),df2['user'].apply(pd.Series)],axis=1)
Code: JSON_to_CSV.py:
import sys
import pandas as pd
from pandas import DataFrame
import json
data=r'C:\Users\Kaleab\Desktop\data.json'
print ("This is json data input", data)
# Reads and converts json to dict.
def js_r(data):
with open(data, encoding='utf-8') as f_in:
return(json.load(f_in))
if __name__ == "__main__":
my_dic_data = js_r(data)
print("This is my dictionary", my_dic_data)
keys= my_dic_data.keys()
print ("The original dict keys",keys)
# You assign a new dictionary key- SO_users, and make dictionary comprehension = { your_key: old_dict[your_key] for your_key in your_keys }
dict_you_want={'my_items':my_dic_data['items']for key in keys}
print ("These are the keys to dict_you_want",dict_you_want.keys())
print ("This is the dictionary of SO_users", dict_you_want)
df=pd.DataFrame(dict_you_want)
print ("df:", df)
#When .apply(pd.Series) method on items column is applied, the dictionaries in items column will be used as column headings
df2=df['my_items'].apply(pd.Series)
print ("df2",df2)
df3=pd.concat([df2.drop(['user'],axis=1),df2['user'].apply(pd.Series)],axis=1)
#df3=df2['user'].apply(pd.Series)
print ("df3",df3)
df3.to_csv('out.csv', sep=',',encoding='utf-8')
The output: out.csv
I hope you find this helpful. I appreciate comments.
Chandrasekaran A Are you using Python 2.7? Try 3.x or search a solution on Stackoverflow.
('This is json data input', '/home/chandru/data.json') Traceback (most recent call last): File "convert.py", line 16, in <module> my_dic_data = js_r(data) File "convert.py", line 12, in js_r with open(data, encoding='utf-8') as f_in: TypeError: 'encoding' is an invalid keyword argument for this function I got the result.. Please help ..how to resolve Kaleab Woldemariam