ayush sharma (づ｡◕‿‿◕｡)づ - Configuring Elasticsearch snapshots using SLM on Google Compute Engine

A tutorial on configuring automatic Elasticsearch snapshots with the built-in snapshot lifecycle management (SLM) feature on Google Compute Engine.

Elasticsearch comes with a built-in mechanism, snapshot lifecycle management (SLM), to automatically snapshot all indexes in your cluster.

The traditional approach is a simple cron job, but that introduces another moving part to log, monitor, and maintain.

SLM works by creating different “repositories” which store data on different cloud stores (AWS, Google Cloud, Azure) or distributed file systems (HDFS). You can then create “policies” on top of these repos which store the snapshot metadata such as the naming convention, snapshot schedule, and retention policy. All of this can be configured using Elasticsearch’s in-built REST endpoints.

Goal

Our goal today is to achieve the following:

Configure IAM credentials on the Google Compute Engine (GCE) VM which has Elasticsearch installed.
Install the plug-in required to snapshot to GCS.
Configure a snapshot repository pointing to a Google Cloud Storage (GCS) bucket.
Configure an SLM policy to snapshot all indexes to the repo every 3 hours.
Trigger the SLM policy manually and verify the snapshot.

Step 1: Configure IAM credentials on the VM

I’ll assume that you already have a GCE VM with Elasticsearch 7.7.1 up and running and that you have a GCS bucket for the snapshots. I’ll call my bucket gcs-es-snapshots.

To configure IAM for the snapshots, first find out your VM’s service account. GCE VMs use a service account much as AWS EC2 instances use instance profiles. This service account needs the storage.admin role to be able to snapshot. Download the JSON key file for this service account. We will add this key to the Elasticsearch keystore so it can authenticate with Google Cloud.
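If you prefer the command line to the Cloud Console, the same IAM steps could be sketched with the gcloud CLI roughly as follows. This is an illustration under assumptions, not part of the original walkthrough: the VM name es-node-1, the zone us-central1-a, the project my-project, and the three function names are all hypothetical placeholders, and gcloud must already be installed and authenticated with permission to manage IAM.

```shell
#!/usr/bin/env bash
# Sketch only: VM, ZONE, and PROJECT are hypothetical placeholders.
VM="${VM:-es-node-1}"
ZONE="${ZONE:-us-central1-a}"
PROJECT="${PROJECT:-my-project}"

# Print the email of the first service account attached to the VM.
vm_service_account() {
  gcloud compute instances describe "$VM" --zone "$ZONE" \
    --format='value(serviceAccounts[0].email)'
}

# Grant the storage.admin role to the service account passed as $1.
grant_storage_admin() {
  gcloud projects add-iam-policy-binding "$PROJECT" \
    --member="serviceAccount:$1" --role="roles/storage.admin"
}

# Create a JSON key for service account $1 and write it to path $2.
create_key() {
  gcloud iam service-accounts keys create "$2" --iam-account="$1"
}

# Usage:
#   sa="$(vm_service_account)"
#   grant_storage_admin "$sa"
#   create_key "$sa" /tmp/key.json
```

Running create_key on the VM itself writes the key straight to /tmp/key.json, which would make the separate upload unnecessary.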

Upload your JSON key file to the VM, or copy-paste it using vim, to the path /tmp/key.json.

Once the file is in place, run:

/usr/share/elasticsearch/bin/elasticsearch-keystore add-file gcs.client.default.credentials_file /tmp/key.json
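To confirm that the key was stored, you can list the keystore entries and look for the setting name. A minimal sketch, assuming the same install path as the command above; ES_HOME and keystore_has are names introduced here for illustration, and depending on file ownership you may need to run it with sudo.

```shell
ES_HOME="${ES_HOME:-/usr/share/elasticsearch}"

# Succeeds if the keystore contains an entry with exactly the name in $1.
keystore_has() {
  "$ES_HOME/bin/elasticsearch-keystore" list | grep -Fxq -- "$1"
}

# Usage: remove the plaintext key from /tmp once it is in the keystore.
#   keystore_has gcs.client.default.credentials_file && rm -f /tmp/key.json
```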

Step 2: Install the GCS plug-in

Now we’re going to install the GCS snapshot repository plug-in to allow Elasticsearch to interact with our GCS bucket. You can follow the steps here, or directly execute the command below:

sudo /usr/share/elasticsearch/bin/elasticsearch-plugin install repository-gcs

The link above also has options for offline installation in case you want a repeatable, automated setup. Restart Elasticsearch after installing the plug-in so that the node loads it.
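One way to confirm that the plug-in is active is to ask the node which plug-ins it has loaded through the _cat/plugins endpoint. The sketch below assumes Elasticsearch listens on localhost:9200 without authentication, as in the other commands in this tutorial, and that it runs as a systemd service named elasticsearch (an assumption; adjust to however your node is managed). ES_URL and gcs_plugin_loaded are illustrative names.

```shell
ES_URL="${ES_URL:-http://localhost:9200}"

# Succeeds if at least one node reports the repository-gcs plug-in.
gcs_plugin_loaded() {
  curl -s "$ES_URL/_cat/plugins?h=component" | grep -q '^repository-gcs'
}

# Usage (the systemd service name is an assumption):
#   sudo systemctl restart elasticsearch
#   gcs_plugin_loaded && echo "repository-gcs is loaded"
```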

Step 3: Configure the snapshot repository to a GCS bucket

I’m going to create a snapshot repository called my_snapshot and point it to gcs-es-snapshots. I will also put the snapshots in a folder so that my bucket root stays clean.

The command is:

curl -X PUT "http://localhost:9200/_snapshot/my_snapshot?pretty" -H 'Content-Type: application/json' -d'
{
  "type": "gcs",
  "settings": {
    "bucket": "gcs-es-snapshots",
    "base_path": "es_snapshots"
  }
}
'

If there were no errors, you should see the acknowledgement below:

{
  "acknowledged" : true
}

More settings for the GCS repository plug-in are available here.

You can verify that the repo was configured correctly by running:

curl -X GET "http://localhost:9200/_snapshot?pretty=true"

{
  "my_snapshot" : {
    "type" : "gcs",
    "settings" : {
      "bucket" : "gcs-es-snapshots",
      "base_path" : "es_snapshots"
    }
  }
}
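Listing the repository only shows that its settings were saved. To check that the nodes can actually reach the bucket with the configured credentials, you can also call the repository verification endpoint. A small sketch, assuming the same unauthenticated localhost endpoint as above; ES_URL, REPO, and verify_repo are illustrative names.

```shell
ES_URL="${ES_URL:-http://localhost:9200}"
REPO="${REPO:-my_snapshot}"

# Ask the cluster to verify that its nodes can access the repository.
# A successful response lists the nodes that passed; a failure returns
# an error describing what went wrong (for example, missing permissions).
verify_repo() {
  curl -s -X POST "$ES_URL/_snapshot/$REPO/_verify?pretty"
}

# Usage:
#   verify_repo
```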

Step 4: Configure the SLM policy

With all the pieces in place, let’s create the SLM policy called 3hour-snapshots by running the command below:

curl -X PUT "http://localhost:9200/_slm/policy/3hour-snapshots?pretty" -H 'Content-Type: application/json' -d'
{
  "schedule": "0 0 */3 * * ?", 
  "name": "<snap-{now{yyyy.MM.dd.HH.mm}}>", 
  "repository": "my_snapshot", 
  "config": { 
    "indices": ["*"] 
  }
}
'

The output should be:

{
  "acknowledged" : true
}

The JSON command has the following properties:

schedule takes a cron-formatted expression for the snapshot frequency. Note that this expression has six items: second, minute, hour, day of month, month, and day of week.
name is the formatted name of the snapshot. Within the angle-brackets, you can specify a fixed prefix/suffix (in this case, snap-) and a formatted date field. Since this example takes snapshots every 3 hours, I’ve included the hour and minute fields as well.
repository is the name of the snapshot repository we configured in the previous step. This setting allows us to set different policies for different repositories.
config contains the list of indexes we want to snapshot, which in this case is every index. Along with the repository, you can use this field to get creative with your snapshot strategies.
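The policy above has no retention rule, so old snapshots are never cleaned up. SLM policies also accept an optional retention block; the sketch below shows the same policy with one added. The retention values (30 days, at least 8 and at most 240 snapshots) are illustrative assumptions, not recommendations, and policy_json and put_policy are names introduced here. Note that sending a PUT to an existing policy name replaces that policy.

```shell
ES_URL="${ES_URL:-http://localhost:9200}"

# Print the policy body: the policy from this step plus a retention block.
# The retention values are illustrative assumptions.
policy_json() {
  cat <<'EOF'
{
  "schedule": "0 0 */3 * * ?",
  "name": "<snap-{now{yyyy.MM.dd.HH.mm}}>",
  "repository": "my_snapshot",
  "config": {
    "indices": ["*"]
  },
  "retention": {
    "expire_after": "30d",
    "min_count": 8,
    "max_count": 240
  }
}
EOF
}

# Create or replace the SLM policy whose name is passed as $1.
put_policy() {
  curl -s -X PUT "$ES_URL/_slm/policy/$1?pretty" \
    -H 'Content-Type: application/json' -d "$(policy_json)"
}

# Usage:
#   put_policy 3hour-snapshots
```

Expired snapshots are not deleted when a new snapshot is taken. Retention runs as a separate scheduled task, controlled by the slm.retention_schedule cluster setting.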

To see your current policy, you can run:

curl -X GET "http://localhost:9200/_slm/policy?pretty"

Step 5: Manually execute the SLM policy

The SLM policy we configured runs every 3 hours, but we can force an execution to test things out by running:

curl -X POST "http://localhost:9200/_slm/policy/3hour-snapshots/_execute?pretty"

On execution, you should see the name of the new snapshot:

{
  "snapshot_name" : "snap-2020.06.22.09.48-ljjv7mznqyscjzldewkngw"
}

Now let’s check the policy, which reports the last successful snapshot and the next scheduled run:

curl -X GET "http://localhost:9200/_slm/policy/3hour-snapshots?human&pretty"

{
  "3hour-snapshots" : {
    "version" : 4,
    "modified_date" : "2020-06-22T08:25:03.571Z",
    "modified_date_millis" : 1592814303571,
    "policy" : {
      "name" : "<snap-{now{yyyy.MM.dd.HH.mm}}>",
      "schedule" : "0 0 */3 * * ?",
      "repository" : "my_snapshot",
      "config" : {
        "indices" : [
          "*"
        ]
      }
    },
    "last_success" : {
      "snapshot_name" : "snap-2020.06.22.09.48-ljjv7mznqyscjzldewkngw",
      "time_string" : "2020-06-22T09:48:15.353Z",
      "time" : 1592819295353
    },
    "next_execution" : "2020-06-22T12:00:00.000Z",
    "next_execution_millis" : 1592827200000,
    "stats" : {
      "policy" : "3hour-snapshots",
      "snapshots_taken" : 6,
      "snapshots_failed" : 0,
      "snapshots_deleted" : 0,
      "snapshot_deletion_failures" : 0
    }
  }
}

To get a list of all snapshots taken against a repo, you can run:

curl -X GET "http://localhost:9200/_snapshot/my_snapshot/_all?pretty"
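The JSON listing gets long once a few days of snapshots have accumulated. For a quick health check, the _cat/snapshots endpoint prints one line per snapshot. A sketch under the same assumptions as the earlier commands (unauthenticated localhost endpoint); list_snapshots and snapshots_healthy are illustrative names.

```shell
ES_URL="${ES_URL:-http://localhost:9200}"
REPO="${REPO:-my_snapshot}"

# One line per snapshot (id, status, times, shard counts), oldest first.
list_snapshots() {
  curl -s "$ES_URL/_cat/snapshots/$REPO?v&s=start_epoch"
}

# Succeeds only if no snapshot in the repository is FAILED or PARTIAL.
snapshots_healthy() {
  ! curl -s "$ES_URL/_cat/snapshots/$REPO?h=status" | grep -Eq 'FAILED|PARTIAL'
}

# Usage:
#   list_snapshots
#   snapshots_healthy || echo "at least one snapshot needs attention"
```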

Conclusion

The in-built SLM can be more useful than traditional cron jobs for managing index snapshots. With support for the blob storage services of the major cloud providers, flexible snapshot schedules using cron syntax, and the ability to target specific indices, SLM makes a lot more sense than rolling a custom solution with scripts and automation.
