https://uga-box.hatenablog.com/entry/2020/04/29/000000

While updating data in ES, the status of an index suddenly turned RED.

Checking the reason showed that both the primary and the replica shards were in ALLOCATION_FAILED:

curl -XGET "localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason,node"
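On a cluster with many indices the output above can be long, so it may help to keep only the rows that are still unassigned. The following is a minimal sketch, assuming the column order requested above (state is the fourth column); the function name `filter_unassigned` is made up for illustration.

```shell
# Keep only rows whose 4th column (state) is UNASSIGNED.
# Assumes the column order index,shard,prirep,state,unassigned.reason,node.
filter_unassigned() {
  awk '$4 == "UNASSIGNED"'
}

# Usage against a live cluster (not run here):
#   curl -s -XGET "localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason,node" | filter_unassigned
```

For unassigned shards the node column is empty, so each matching row ends with the unassigned reason.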

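In the past I have had indices that stayed UNASSIGNED after a restore, and deleting the index and restoring it again fixed that (see the earlier post on uga-box.hatenablog.com). This time it happened in the middle of a data update, so deleting the index was not an option.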

Investigating the cause

Use _cluster/allocation/explain to check why the shard is UNASSIGNED:
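The post does not show the request itself. A minimal sketch of one way to call the API for a specific shard copy is below, assuming the cluster is reachable at localhost:9200 as in the other commands; the helper names `explain_body` and `explain_shard` are hypothetical.

```shell
# Build the JSON body for _cluster/allocation/explain.
# $1 = index name, $2 = shard number, $3 = true for the primary, false for a replica
explain_body() {
  printf '{"index":"%s","shard":%s,"primary":%s}' "$1" "$2" "$3"
}

# Ask the cluster why that shard copy is unassigned.
explain_shard() {
  curl -s -XGET "localhost:9200/_cluster/allocation/explain?pretty" \
    -H 'Content-Type: application/json' \
    -d "$(explain_body "$1" "$2" "$3")"
}

# Usage against a live cluster (not run here):
#   explain_shard hoge-index-1 0 false
```

Calling the API without a body should also work; in that case Elasticsearch picks an unassigned shard on its own and explains that one.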

{
  "index" : "hoge-index-1",
  "shard" : 0,
  "primary" : false,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "ALLOCATION_FAILED",
    "at" : "2020-04-30T04:18:24.034Z",
    "failed_allocation_attempts" : 5,
    "details" : "failed to create shard, failure IOException[failed to obtain in-memory shard lock]; nested: ShardLockObtainFailedException[[discovery_details_45][2]: obtaining shard lock timed out after 5000ms]; ",
    "last_allocation_status" : "no"
  },
  "can_allocate" : "no",
  "allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes that hold an in-sync shard copy",
  ...
}

I used the following similar thread as a reference: ES Cluster State Red - cannot allocate because allocation is not permitted to any of the nodes that hold an in-sync shard copy - Elasticsearch - Discuss the Elastic Stack

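When shard allocation keeps failing, it apparently helps to trigger the allocation again by hand with the command below. Elasticsearch stops retrying once a shard has failed index.allocation.max_retries times (5 by default, which matches failed_allocation_attempts in the output above), and retry_failed asks it to try once more.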

curl -XPOST "http://host.docker.internal:9200/_cluster/reroute?retry_failed"

Note: to keep the problem simple, before running it I disabled automatic shard allocation and set the number of replica shards to 0.

curl -XPUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "transient": {
    "cluster.routing.allocation.enable": "none"
  }
}'

curl -XPUT "localhost:9200/scored-vacation-rental-v9-vrbo/_settings" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "index": {
      "number_of_replicas": 0
    }
  }
}'

The shard was still UNASSIGNED, but running _cluster/allocation/explain again showed that the reason had changed:

{
  "index" : "hoge-index-1",
  "shard" : 0,
  "primary" : false,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "ALLOCATION_FAILED",
    "at" : "2020-04-30T04:45:24.034Z",
    "failed_allocation_attempts" : 5,
    "details" : "failed to create shard, failure IOException[No space left on device]",
    "last_allocation_status" : "no"
  },
  "can_allocate" : "no",
  "allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes that hold an in-sync shard copy",
  ...
}
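"No space left on device" points at disk usage on the data nodes. One way to see which nodes are affected is the _cat/allocation API, which reports disk usage per node. The sketch below is an illustration only: the function name `nodes_low_on_disk` is made up, and the default threshold of 85 is an assumption chosen to match Elasticsearch's default low disk watermark.

```shell
# Print the names of nodes whose disk.percent is at or above a threshold.
# $1 = threshold in percent (defaults to 85, an assumed value).
# Expects rows of "node disk.percent disk.avail" with no header line.
nodes_low_on_disk() {
  awk -v limit="${1:-85}" '$2 + 0 >= limit + 0 { print $1 }'
}

# Usage against a live cluster (not run here):
#   curl -s "localhost:9200/_cat/allocation?h=node,disk.percent,disk.avail" | nodes_low_on_disk 85
```

Node names containing spaces would break the column split, so this is only suitable for a quick manual check.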

It turned out the disk was running out of space. After deleting indices I no longer needed, I ran _cluster/reroute?retry_failed once more and the cluster went back to green.
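The post does not say whether the temporary settings from the note earlier (allocation disabled, replicas set to 0) were reverted afterwards. If they are still in place, something like the following sketch could restore them. The helper names are hypothetical, and the original replica count is not given in the post, so it has to be passed in.

```shell
# Body that resets the transient allocation setting to its default by setting it to null.
reset_allocation_body() {
  printf '{"transient":{"cluster.routing.allocation.enable":null}}'
}

# Body that sets the replica count. $1 = number of replicas (not stated in the post).
replicas_body() {
  printf '{"index":{"number_of_replicas":%s}}' "$1"
}

# Apply both settings. $1 = index name, $2 = number of replicas.
restore_settings() {
  curl -s -XPUT "localhost:9200/_cluster/settings" \
    -H 'Content-Type: application/json' -d "$(reset_allocation_body)"
  curl -s -XPUT "localhost:9200/$1/_settings" \
    -H 'Content-Type: application/json' -d "$(replicas_body "$2")"
}

# Usage against a live cluster (not run here), assuming one replica was configured before:
#   restore_settings hoge-index-1 1
```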

No data was lost either, which was a relief.

References

Red Cluster State: failed to obtain in-memory shard lock · Issue #23199 · elastic/elasticsearch · GitHub