嵌套json下的pandas操作

在进行json处理的时候,经常会遇到需要提取不同级别的字段内容并合并成表格,见下图所示。本篇就总结下遇到这种情况下如何进行文本内容的提取。在知乎上看到一篇《骚操作!嵌套 JSON 秒变 Dataframe》似乎和我遇到的情况类似,不过在实际处理的时候又有不同。

json-nested

先看我要处理的原始数据:

{
  "code": 0,
  "error": "",
  "message": "",
  "data": {
    "page": 1,
    "page_size": 3,
    "total": 1026,
    "list": [
      [
        {
          "_object_id": "App",
          "instanceId": "5b655ab36b9c3",
          "name": "nginx"
        },
        {
          "_object_id": "VIRTUAL_MACHINE",
          "instanceId": "5b793e7f8f932",
          "name": "CSSCU_VDU-18"
        },
        {
          "_object_id": "PHYSICAL_MACHINE",
          "instanceId": "5b471103effff",
          "name": "IDC-0D12-S-SRV-41"
        },
        {
          "_object_id": "SWITCH",
          "instanceId": "5b4714cb11685",
          "name": "IDC-0D11-M-TOR-01"
        },
        {
          "_object_id": "SWITCH",
          "instanceId": "5b4714ca14f5d",
          "name": "IDC-0G06-M-EOR-02"
        }
      ],
      [
        {
          "_object_id": "App",
          "instanceId": "5b655ab36b9c3",
          "name": "nginx"
        },
        {
          "_object_id": "VIRTUAL_MACHINE",
          "instanceId": "5b793e7f8f932",
          "name": "CSSCU_VDU-18"
        },
        {
          "_object_id": "PHYSICAL_MACHINE",
          "instanceId": "5b471103effff",
          "name": "IDC-0D12-S-SRV-41"
        },
        {
          "_object_id": "SWITCH",
          "instanceId": "5b4714cb11685",
          "name": "IDC-0D11-M-TOR-01"
        },
        {
          "_object_id": "SWITCH",
          "instanceId": "5b4714c98cbd1",
          "name": "IDC-0F06-M-EOR-01"
        }
      ],
      [
        {
          "_object_id": "App",
          "instanceId": "5b655ab36b9c3",
          "name": "nginx"
        },
        {
          "_object_id": "VIRTUAL_MACHINE",
          "instanceId": "5b793e7f8f932",
          "name": "CSSCU_VDU-18"
        },
        {
          "_object_id": "PHYSICAL_MACHINE",
          "instanceId": "5b471103effff",
          "name": "IDC-0D12-S-SRV-41"
        },
        {
          "_object_id": "SWITCH",
          "instanceId": "5b4714ca966ae",
          "name": "IDC-0D12-M-TOR-02"
        },
        {
          "_object_id": "SWITCH",
          "instanceId": "5b4714ca14f5d",
          "name": "IDC-0G06-M-EOR-02"
        }
      ]
    ]
  }
}

上面是一条CMDB路径处理下的三条数据,对应的路径是 APP1(nginx) -- 对应的虚拟机---虚拟机所在的物理机 --- TOR交换 --- EOR交换机。直接通过pandas进行读取截取时,可以获取到如下信息:

import json
import pandas as pd
f = open("vhost.json")
results = json.load(f)
print(results["data"]["list"])
df = pd.DataFrame(results["data"]["list"],columns = ["APP","VM" , "PY", "TOR", "EOR"])

此时处理过以后,虽然可以正常返回对应的五列数据,不过每列的数据里又有json。

pandas-json

参考w3resource 上关于json的处理方法,多次折腾后,发现还是不对。所以就换笨方法,for循环解决。代码如下:

for col in ["APP","VM" , "PY", "TOR", "EOR"]:
    df[col] = df.apply(lambda df: df[col]["name"], axis=1)

处理完可以正常返回每个单元格里的json里的name字段。由于返回的这里是有重复值的,比如APP这里都是nginx,所以还会涉及到汇总去重统计,可以直接调用df.nunique()函数搞定。需要拿到每列的去重具体结果,可以使用DataFrame.drop_duplicates方法搞定。




本站的发展离不开您的资助,金额随意,欢迎来赏!

You can donate through PayPal.
My paypal id: itybku@139.com
Paypal page: https://www.paypal.me/361way

  1. 本文目前尚无任何评论.
  1. 本文目前尚无任何 trackbacks 和 pingbacks.