Nested Documents and Parent-Child Documents in Elasticsearch

2024-10-14 17:04:08 # Elasticsearch # Problem Summary #Elasticsearch

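In a relational database such as MySQL, one table can be related to several other tables through joins.

Elasticsearch, as a NoSQL store, keeps data as independent documents in an index, and documents in different indices generally have no relationship to each other.

Even so, Elasticsearch can model relationships through nested objects, nested documents, and parent-child documents.

The data structure of Kibana's built-in sample data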

"kibana_sample_data_ecommerce" : {
    "mappings" : {
      "properties" : {
        "category" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword"
            }
          }
        },
        "currency" : {
          "type" : "keyword"
        },
        "customer_full_name" : {
          "type" : "text",
          "fields" : {
            "keyword" : {
              "type" : "keyword",
              "ignore_above" : 256
            }
          }
        },
        // part omitted

        "products" : {
          "properties" : {
            "_id" : {
              "type" : "text",
              "fields" : {
                "keyword" : {
                  "type" : "keyword",
                  "ignore_above" : 256
                }
              }
            },
            "base_price" : {
              "type" : "half_float"
            },
            "base_unit_price" : {
              "type" : "half_float"
            },
            "category" : {
              "type" : "text",
              "fields" : {
                "keyword" : {
                  "type" : "keyword"
                }
              }
            },
            "created_on" : {
              "type" : "date"
            },
            "discount_amount" : {
              "type" : "half_float"
            },
            "discount_percentage" : {
              "type" : "half_float"
            },
            "manufacturer" : {
              "type" : "text",
              "fields" : {
                "keyword" : {
                  "type" : "keyword"
                }
              }
            },
            "min_price" : {
              "type" : "half_float"
            },
            "price" : {
              "type" : "half_float"
            },
            "product_id" : {
              "type" : "long"
            },
            "product_name" : {
              "type" : "text",
              "fields" : {
                "keyword" : {
                  "type" : "keyword"
                }
              },
              "analyzer" : "english"
            },
            "quantity" : {
              "type" : "integer"
            },
            "sku" : {
              "type" : "keyword"
            },
            "tax_amount" : {
              "type" : "half_float"
            },
            "taxful_price" : {
              "type" : "half_float"
            },
            "taxless_price" : {
              "type" : "half_float"
            },
            "unit_discount_amount" : {
              "type" : "half_float"
            }
          }
        },
        "sku" : {
          "type" : "keyword"
        },
        "taxful_total_price" : {
          "type" : "half_float"
        },
        // part omitted

Notice that this sample contains a products field. products is an object with many properties:


"products" : {
  "properties" : {
    "_id" : {
      "type" : "text",
      "fields" : {
        "keyword" : {
          "type" : "keyword",
          "ignore_above" : 256
        }
      }
    },
    "base_price" : {
      "type" : "half_float"
    },
    "base_unit_price" : {
      "type" : "half_float"
    },
    "category" : {
      "type" : "text",
      "fields" : {
        "keyword" : {
          "type" : "keyword"
        }
      }
    },
    "created_on" : {
      "type" : "date"
    },
    "discount_amount" : {
      "type" : "half_float"
    },
    "discount_percentage" : {
      "type" : "half_float"
    },
    "manufacturer" : {
      "type" : "text",
      "fields" : {
        "keyword" : {
          "type" : "keyword"
        }
      }
    },
    "min_price" : {
      "type" : "half_float"
    },
    "price" : {
      "type" : "half_float"
    },
    "product_id" : {
      "type" : "long"
    },
    "product_name" : {
      "type" : "text",
      "fields" : {
        "keyword" : {
          "type" : "keyword"
        }
      },
      "analyzer" : "english"
    },
    "quantity" : {
      "type" : "integer"
    },
    "sku" : {
      "type" : "keyword"
    },
    "tax_amount" : {
      "type" : "half_float"
    },
    "taxful_price" : {
      "type" : "half_float"
    },
    "taxless_price" : {
      "type" : "half_float"
    },
    "unit_discount_amount" : {
      "type" : "half_float"
    }
  }
}

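In effect, this means that one order carries the information for multiple products.

Nested objects (plain object type)

Sample data

Let's query this data: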

"hits" : [
      {
        "_index" : "kibana_sample_data_ecommerce",
        "_type" : "_doc",
        "_id" : "VJz1f28BdseAsPClo7bC",
        "_score" : 1.0,
        "_source" : {
          "customer_first_name" : "Eddie",
          "customer_full_name" : "Eddie Underwood",
          "order_date" : "2020-01-27T09:28:48+00:00",
          "order_id" : 584677,
          "products" : [
            {
              "base_price" : 11.99,
              "discount_percentage" : 0,
              "quantity" : 1,
              "sku" : "ZO0549605496",
              "manufacturer" : "Elitelligence",
              "tax_amount" : 0,
              "product_id" : 6283
            },
            {
              "base_price" : 24.99,
              "discount_percentage" : 0,
              "quantity" : 1,
              "sku" : "ZO0299602996",
              "manufacturer" : "Oceanavigations",
              "tax_amount" : 0,
              "product_id" : 19400
            }
          ],
          "taxful_total_price" : 36.98,
          "taxless_total_price" : 36.98,
          "total_quantity" : 2,
          "total_unique_products" : 2,
          "type" : "order",
          "user" : "eddie",
            "region_name" : "Cairo Governorate",
            "continent_name" : "Africa",
            "city_name" : "Cairo"
          }
        }
      }

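Advantages

products is actually a list containing two objects. The benefit is that all the data lives in a single document, so it can be queried directly without joining to other documents.

Disadvantages

The drawback shows up when several conditions must hold for the same object inside the array. Elasticsearch cannot tell which values belonged to which object, so the query can return documents in which the conditions are satisfied by different objects. In the query below, base_price 24.99 belongs to the second product and sku ZO0549605496 belongs to the first, yet the order still matches: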

GET kibana_sample_data_ecommerce/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "products.base_price": 24.99 }},
        { "match": { "products.sku":"ZO0549605496"}},
        {"match": { "order_id": "584677"}}
      ]
    }
  }
}

This is because Elasticsearch stores the document in a flattened structure like this:

{
  "order_id":            [ 584677 ],
  "products.base_price":    [ 11.99, 24.99... ],
  "products.sku": [ ZO0549605496, ZO0299602996 ],
  ...
}

As a result, the association between products.base_price and products.sku is lost, because each field is stored as a flat array of values.
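The loss of association can be reproduced without a cluster. The following is a minimal Python sketch, not Elasticsearch code: the flatten helper is a hypothetical stand-in for how an object field is indexed, and the order is trimmed to the two fields used in the query above.

```python
from collections import defaultdict


def flatten(doc, prefix=""):
    """Collect every leaf value under its dotted path (hypothetical helper)."""
    flat = defaultdict(list)
    for key, value in doc.items():
        path = prefix + key
        items = value if isinstance(value, list) else [value]
        for item in items:
            if isinstance(item, dict):
                # Recurse into inner objects and merge their values by path.
                for sub_path, sub_values in flatten(item, path + ".").items():
                    flat[sub_path].extend(sub_values)
            else:
                flat[path].append(item)
    return dict(flat)


order = {
    "order_id": 584677,
    "products": [
        {"base_price": 11.99, "sku": "ZO0549605496"},
        {"base_price": 24.99, "sku": "ZO0299602996"},
    ],
}

flat = flatten(order)

# Flattened view: both conditions hold, although they hit different products.
flat_match = (
    24.99 in flat["products.base_price"]
    and "ZO0549605496" in flat["products.sku"]
)

# Per-object view: no single product has this price and this SKU.
same_object_match = any(
    p["base_price"] == 24.99 and p["sku"] == "ZO0549605496"
    for p in order["products"]
)

print(flat)
print(flat_match, same_object_match)
```

The flattened check succeeds while the per-object check fails, which is the false positive described above.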

Nested documents

PUT test_index
{
  "mappings": {
    "properties": {
      "user": {
        "type": "nested" 
      }
    }
  }
}

The user field has type nested, which marks it as a nested document.

Here are the Elasticsearch commands to index two documents:

PUT test_index/_doc/1
{
  "group" : "root",
  "user" : [
    {
      "name" : "John",
      "age" :  30
    },
    {
      "name" : "Alice",
      "age" :  28
    }
  ]
}

PUT test_index/_doc/2
{
  "group" : "wheel",
  "user" : [
    {
      "name" : "Tom",
      "age" :  33
    },
    {
      "name" : "Jack",
      "age" :  25
    }
  ]
}

The query looks like this:

GET test_index/_search
{
  "query": {
    "nested": {
      "path": "user",
      "query": {
        "bool": {
          "must": [
            { "match": { "user.name": "Alice" }},
            { "match": { "user.age":  28 }} 
          ]
        }
      }
    }
  }
}

Note that querying a nested field requires the dedicated nested query with a path, which is the name of the nested field.
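As a rough mental model, a nested query evaluates its conditions against each object under the path separately. The Python sketch below imitates that over in-memory copies of the two documents. The nested_match and search helpers are hypothetical, and exact equality stands in for the match query.

```python
docs = {
    "1": {"group": "root", "user": [{"name": "John", "age": 30},
                                    {"name": "Alice", "age": 28}]},
    "2": {"group": "wheel", "user": [{"name": "Tom", "age": 33},
                                     {"name": "Jack", "age": 25}]},
}


def nested_match(doc, path, conditions):
    """True if at least one object under `path` satisfies all conditions."""
    return any(
        all(obj.get(field) == value for field, value in conditions.items())
        for obj in doc.get(path, [])
    )


def search(conditions):
    """Return the ids of documents with a matching nested user object."""
    return [doc_id for doc_id, doc in docs.items()
            if nested_match(doc, "user", conditions)]


alice_28 = search({"name": "Alice", "age": 28})
alice_30 = search({"name": "Alice", "age": 30})
print(alice_28, alice_30)
```

Searching for Alice aged 30 finds nothing, because the name and the age come from different user objects in document 1.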

We can also combine it with conditions on fields outside the nested object:

GET test_index/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "group": "root"
          }
        },
        {
          "nested": {
            "path": "user",
            "query": {
              "bool": {
                "must": [
                  {
                    "match": {
                      "user.name": "Alice"
                    }
                  },
                  {
                    "match": {
                      "user.age": 28
                    }
                  }
                ]
              }
            }
          }
        }
      ]
    }
  }
}

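Advantages

Compared with the products example, this preserves the association between fields within each nested object, so queries on them return correct results.

Disadvantages

Let's look at the index information:

GET _cat/indices?v

The document count is higher than the number of documents we indexed, because each nested object is stored as a separate hidden document, and Elasticsearch joins them internally at query time.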
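The arithmetic behind the higher count can be sketched in Python. This is only an illustration of the counting rule (one entry per top-level document plus one per nested object); lucene_doc_count is a hypothetical helper, and the number a given Elasticsearch version actually reports may differ.

```python
docs = [
    {"group": "root", "user": [{"name": "John", "age": 30},
                               {"name": "Alice", "age": 28}]},
    {"group": "wheel", "user": [{"name": "Tom", "age": 33},
                                {"name": "Jack", "age": 25}]},
]


def lucene_doc_count(docs, nested_path):
    """Count one entry per top-level document plus one per nested object."""
    return sum(1 + len(doc.get(nested_path, [])) for doc in docs)


top_level = len(docs)
with_nested = lucene_doc_count(docs, "user")
print(top_level, with_nested)
```

For the two documents indexed above, two top-level documents with two nested users each give six entries under this rule.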

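Parent-child documents

Both approaches above share a problem: updating a single field outside the nested objects requires reindexing the whole document. The nested objects did not change, yet they are reindexed along with the main document.

Another case is when one table has relationships with several other tables, that is, a child record would need to be associated with more than one document. nested cannot express this, since a nested object exists only inside its one enclosing document.

Usage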

PUT my_index
{
  "mappings": {
    "properties": {
      "my_id": {
        "type": "keyword"
      },
      "my_join_field": { 
        "type": "join",
        "relations": {
          "question": "answer" 
        }
      }
    }
  }
}

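Here my_join_field is the name of the field that holds the parent-child relationship; you can choose any name. The join type marks it as a parent-child relationship, and relations declares that question is the parent and answer is the child.

First, index two parent documents: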

PUT my_index/_doc/1
{
  "my_id": "1",
  "text": "This is a question",
  "my_join_field": {
    "name": "question" 
  }
}


PUT my_index/_doc/2
{
  "my_id": "2",
  "text": "This is another question",
  "my_join_field": {
    "name": "question"
  }
}

Then index two child documents:

PUT my_index/_doc/3?routing=1
{
  "my_id": "3",
  "text": "This is an answer",
  "my_join_field": {
    "name": "answer", 
    "parent": "1" 
  }
}

PUT my_index/_doc/4?routing=1
{
  "my_id": "4",
  "text": "This is another answer",
  "my_join_field": {
    "name": "answer",
    "parent": "1"
  }
}

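First, the document IDs show that child documents are independent documents (unlike nested). Second, the routing parameter is set to 1, the ID of the parent document, which is the same ID given in the parent field below it.

It is worth stressing that routing is mandatory when indexing a child document, because the child must be stored on the same shard as its parent.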
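Why routing puts the child next to its parent can be sketched in a few lines of Python. This is a simplified model, not Elasticsearch's implementation: the shard is chosen by hashing the routing value (which defaults to the document ID) modulo the number of primary shards. NUM_SHARDS and shard_for are hypothetical, and zlib.crc32 stands in for the hash function Elasticsearch actually uses.

```python
import zlib

NUM_SHARDS = 5  # hypothetical number of primary shards


def shard_for(doc_id, routing=None):
    """Pick a shard from the routing value, falling back to the document id."""
    key = routing if routing is not None else doc_id
    # zlib.crc32 is a stand-in; Elasticsearch uses its own hash function.
    return zlib.crc32(key.encode("utf-8")) % NUM_SHARDS


parent_shard = shard_for("1")                   # parent document 1
child_with_routing = shard_for("3", routing="1")  # child 3, routed by parent id
child_without_routing = shard_for("3")          # child 3, routed by its own id

print(parent_shard, child_with_routing, child_without_routing)
```

With routing=1 the child hashes the same key as the parent and therefore lands on the same shard. Without it, the child is placed by its own ID and may end up elsewhere.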

The name field gives the document's relation name; here answer marks it as a child document.

Queries

Unconditional query

GET my_index/_search
{
  "query": {
    "match_all": {}
  },
  "sort": ["my_id"]
}

Here sort orders the results by my_id.

The returned information is:


{
        "_index" : "my_index",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : null,
        "_routing" : "1",
        "_source" : {
          "my_id" : "3",
          "text" : "This is an answer",
          "my_join_field" : {
            "name" : "answer",
            "parent" : "1"
          }
        },

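The hits come back sorted by my_id in ascending order; the excerpt above shows only the entry for the child document with _id 3.

has_child query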

POST my_index/_search
{
  "query": {
    "has_child": {
      "type": "answer",
      "query" : {
                "match": {
                    "text" : "answer"
                }
            }
    }
  }
}

The result is:

"hits" : [
      {
        "_index" : "my_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "my_id" : "1",
          "text" : "This is a question",
          "my_join_field" : {
            "name" : "question"
          }
        }
      }
    ]

What comes back is the parent document whose child documents match the query, not the children themselves.

has_parent query
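The direction of has_child is easy to mix up, so here is a minimal in-memory Python sketch of its semantics over the four documents indexed earlier. The has_child function is a hypothetical imitation, and a whitespace token check stands in for the match query.

```python
docs = [
    {"_id": "1", "text": "This is a question",
     "my_join_field": {"name": "question"}},
    {"_id": "2", "text": "This is another question",
     "my_join_field": {"name": "question"}},
    {"_id": "3", "text": "This is an answer",
     "my_join_field": {"name": "answer", "parent": "1"}},
    {"_id": "4", "text": "This is another answer",
     "my_join_field": {"name": "answer", "parent": "1"}},
]


def has_child(docs, child_type, predicate):
    """Return parents that have at least one matching child of child_type."""
    parent_ids = {
        d["my_join_field"]["parent"]
        for d in docs
        if d["my_join_field"]["name"] == child_type and predicate(d)
    }
    return [d for d in docs if d["_id"] in parent_ids]


hits = has_child(docs, "answer",
                 lambda d: "answer" in d["text"].lower().split())
hit_ids = [d["_id"] for d in hits]
print(hit_ids)
```

Both answers match the text condition, and both point at parent 1, so the only hit is the question with _id 1.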

POST my_index/_search
{
  "query": {
    "has_parent": {
      "parent_type": "question",
      "query" : {
                "match": {
                    "text" : "question"
                }
            }
    }
  }
}

The result is:

"hits" : [
     {
       "_index" : "my_index",
       "_type" : "_doc",
       "_id" : "3",
       "_score" : 1.0,
       "_routing" : "1",
       "_source" : {
         "my_id" : "3",
         "text" : "This is an answer",
         "my_join_field" : {
           "name" : "answer",
           "parent" : "1"
         }
       }
     },
     {
       "_index" : "my_index",
       "_type" : "_doc",
       "_id" : "4",
       "_score" : 1.0,
       "_routing" : "1",
       "_source" : {
         "my_id" : "4",
         "text" : "This is another answer",
         "my_join_field" : {
           "name" : "answer",
           "parent" : "1"
         }
       }
     }
   ]

parent_id query

POST my_index/_search
{
  "query": {
    "parent_id": { 
      "type": "answer",
      "id": "1"
    }
  }
}

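The result is essentially the same as above. The difference is that the parent_id query uses relevance scoring by default, whereas has_parent does not score by default.

Notes

A few points need special attention when using parent-child documents:

Each index can define only one join field.
Parent and child documents must be on the same shard, which means that getting, updating, or deleting a child document requires the routing value.
New relations can be added to an existing join field (worth a closer look at how to use this later).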
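To see why the two queries return the same children here, the following Python sketch imitates both over the same four documents. has_parent and parent_id are hypothetical in-memory imitations that ignore scoring, and a whitespace token check stands in for the match query.

```python
docs = [
    {"_id": "1", "text": "This is a question",
     "my_join_field": {"name": "question"}},
    {"_id": "2", "text": "This is another question",
     "my_join_field": {"name": "question"}},
    {"_id": "3", "text": "This is an answer",
     "my_join_field": {"name": "answer", "parent": "1"}},
    {"_id": "4", "text": "This is another answer",
     "my_join_field": {"name": "answer", "parent": "1"}},
]


def has_parent(docs, parent_type, predicate):
    """Return children whose parent is of parent_type and matches predicate."""
    matching_parents = {
        d["_id"] for d in docs
        if d["my_join_field"]["name"] == parent_type and predicate(d)
    }
    return [d["_id"] for d in docs
            if d["my_join_field"].get("parent") in matching_parents]


def parent_id(docs, child_type, pid):
    """Return children of child_type that belong to the parent with id pid."""
    return [d["_id"] for d in docs
            if d["my_join_field"]["name"] == child_type
            and d["my_join_field"].get("parent") == pid]


by_parent_query = has_parent(
    docs, "question", lambda d: "question" in d["text"].lower().split())
by_parent_id = parent_id(docs, "answer", "1")
children_of_2 = parent_id(docs, "answer", "2")
print(by_parent_query, by_parent_id, children_of_2)
```

has_parent selects children through a condition on the parent's content, while parent_id selects them through a known parent ID. With this data both return documents 3 and 4, and parent 2 has no children.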

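Summary

Plain object fields can model a one-to-many relationship, but the boundaries between the objects are lost, and with them the association between each object's properties.
Nested objects solve that problem, but have two drawbacks: updating the main document reindexes everything, and a child cannot belong to more than one main document.
Parent-child documents address the first drawback, since children are independent documents that can be updated without touching the parent; note that each child still has exactly one parent. They suit write-heavy, read-light workloads.
Nested objects keep the related data inside one document to improve query performance, which suits read-heavy, write-light workloads. Parent-child documents resemble relationships in a relational database and suit write-heavy workloads, since they reduce the scope of each document modification.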
