A zero-downtime serverless ingestion architecture for Medicaid’s first cross-agency API

Sev Leonard
Published in The Nuna Blog
Sep 8, 2020 · 8 min read


Illustration by Jesus Chico

In a previous blog post we discussed how the T-MSIS and AREMAC systems laid the groundwork for the Lifeline API Platform. In this post we’ll take a technical deep dive into one of the core components of the platform: the data ingestion and versioning process. We will focus on how we addressed common challenges in serverless environments, including state management, handoffs between cloud services, and the ephemeral nature of serverless components.

In addition to these serverless challenges, the Lifeline API is required to be available 24/7/365. To ensure that the latest Medicaid beneficiary data is available, regular ingestion events of 50 million+ records had to occur while the API remained online. We will discuss how we met this requirement and also ensured batch atomicity: a guarantee that a batch of records was either fully ingested or not ingested at all. As if these challenges weren’t enough, the team also faced a five-month deadline from design to deployment and was restricted to using technologies already approved by CMS.

Lifeline Ingestion Process Overview

The Lifeline API surfaces data on Medicaid beneficiaries, providing Lifeline program verifiers with a fast way to determine eligibility. The figure below illustrates the Lifeline ingestion process:

Lifeline ingestion architecture

Walking through this process from the left of the diagram above: eligibility determination is based on T-MSIS data, and the eligibility data is generated by an ETL process in Databricks.

Once generated, the eligibility data is stored in parquet on S3. Persisting the data in S3 decouples us from a specific storage solution, providing greater flexibility as the platform evolves. In addition, the eligibility data can be surfaced in the AREMAC hive metastore for analytics users, where the parquet format enables performant Spark queries.
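As a rough illustration of this hand-off, the Databricks ETL job might end by writing the eligibility lookup to S3 as Parquet. The source table, bucket, and column names below are hypothetical:

```python
# Sketch of the final ETL step: write the eligibility lookup to S3 as
# Parquet. Source table, bucket, and column names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical table produced earlier in the ETL process.
eligibility_df = spark.table("staging.lifeline_eligibility")

(eligibility_df
    .select("subscriber_key", "is_eligible")  # lookup key + boolean value
    .write
    .mode("overwrite")
    .parquet("s3://example-eligibility-data/latest/"))  # hypothetical bucket
```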

To kick off the ingestion process, we use an S3 event that triggers on the presence of new files in the Eligibility Data bucket. This event is published to an SNS topic, which kicks off a Lambda function, setting up the ingestion pipeline to run in EMR. The ingestion artifacts are linked from an S3 bucket for use by the EMR cluster.
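A minimal sketch of wiring this trigger with boto3 follows; the bucket name, topic ARN, and suffix filter are placeholders, and in practice this configuration would more likely live in infrastructure-as-code:

```python
# Sketch: configure the Eligibility Data bucket to publish object-created
# events to an SNS topic, which in turn invokes the ingestion Lambda.
# Bucket name and topic ARN are hypothetical.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_notification_configuration(
    Bucket="example-eligibility-data",
    NotificationConfiguration={
        "TopicConfigurations": [
            {
                "TopicArn": "arn:aws:sns:us-east-1:123456789012:eligibility-ingest",
                "Events": ["s3:ObjectCreated:*"],
                "Filter": {
                    "Key": {"FilterRules": [{"Name": "suffix", "Value": ".parquet"}]}
                },
            }
        ]
    },
)
```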

To ensure batch atomicity, each ingestion event creates a new version of eligibility data in DynamoDB. This also enables us to quickly roll back to a prior state if needed. If an ingestion is successful, the AWS Parameter Store is updated to reflect the latest version information. When the Lifeline API receives a request, it will retrieve the version information from the Parameter Store and issue a query to the correct version in DynamoDB. The eligibility data is a lookup, containing a boolean eligibility value for a given key based on identifying information. This makes a key-value store like DynamoDB a good choice.
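As a sketch of the read path (the parameter name, key schema, and attribute names below are hypothetical), the API-side lookup might look like this:

```python
# Sketch of how the Lifeline API might resolve the current version table
# and perform the eligibility lookup. Parameter name, key, and attribute
# names are hypothetical.
import boto3

ssm = boto3.client("ssm")
dynamodb = boto3.resource("dynamodb")


def is_eligible(subscriber_key: str) -> bool:
    # Read the pointer to the latest successfully ingested version table.
    resp = ssm.get_parameter(Name="/lifeline/current-eligibility-table")
    current_table = resp["Parameter"]["Value"]

    # Look up the boolean eligibility value in the versioned table.
    item = dynamodb.Table(current_table).get_item(
        Key={"subscriber_key": subscriber_key}).get("Item")
    return bool(item and item.get("is_eligible"))
```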

Versioning for Batch Atomicity

Each ingestion event creates a new, versioned table in DynamoDB and updates the AWS Parameter Store to point to the most recent, successful version. This enables us to:

  • Use a pointer to the most recent version table for API queries, moving it only after an ingestion event has succeeded, thus allowing the API to stay online.
  • Roll back to prior versions by simply reverting the AWS parameter to a previous value (see the sketch after this list).
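A minimal rollback sketch, with hypothetical parameter and table names:

```python
# Sketch: roll back the API to a prior eligibility table by rewriting the
# pointer parameter. The parameter name and table name are hypothetical.
import boto3

ssm = boto3.client("ssm")

ssm.put_parameter(
    Name="/lifeline/current-eligibility-table",
    Value="eligibility_v41",  # a previously SUCCEEDED version
    Type="String",
    Overwrite=True,
)
```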

A version info table is used to keep track of the status of current and prior data versions, as shown below:

Lifeline version info table

The status reflects the result of the ingestion process, showing “IN PROGRESS” for ingestions that are running. The created and updated times can be used to monitor ingestion duration.
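As a rough illustration (the attribute names are hypothetical, not the actual Lifeline schema), a version info record might look like the following:

```python
# Hypothetical shape of a version info record; attribute names are
# illustrative only.
version_info_record = {
    "version": 42,
    "status": "IN PROGRESS",          # later updated to SUCCEEDED or FAILED
    "table_name": "eligibility_v42",  # the versioned DynamoDB eligibility table
    "created_at": "2020-08-30T02:15:00Z",
    "updated_at": "2020-08-30T02:15:00Z",
}
```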

Serverless Orchestration and State Management

“Orchestration” refers to the activity of connecting various cloud services together to create a system. As mentioned above, we use AWS Lambda to kick off the ingestion process. With a maximum runtime of 15 minutes, a Lambda function alone cannot complete the ingestion into DynamoDB. Instead, we used AWS Lambda to configure and launch an ingestion pipeline in EMR. To that end, the Lambda performs the following operations (a sketch follows the list):

  1. Determine the next version number by querying the DynamoDB version info table.
  2. Retrieve the path to the ingestion artifacts from the AWS Parameter Store.
  3. Configure the EMR cluster and steps for running ingestion.
  4. Launch the cluster.
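A condensed sketch of this launcher Lambda is shown below. The table name, parameter name, step scripts, release label, and instance settings are illustrative assumptions, not the production configuration:

```python
# Sketch of the launcher Lambda, assuming hypothetical table, parameter,
# and script names. It determines the next version, pulls the artifact
# location, and launches an EMR cluster configured with the ingestion steps.
import boto3

dynamodb = boto3.resource("dynamodb")
ssm = boto3.client("ssm")
emr = boto3.client("emr")


def handler(event, context):
    # 1. Determine the next version number from the version info table.
    versions = dynamodb.Table("lifeline_version_info").scan()["Items"]
    next_version = max((int(v["version"]) for v in versions), default=0) + 1

    # 2. Retrieve the path to the latest ingestion artifacts.
    artifact_path = ssm.get_parameter(
        Name="/lifeline/ingestion-artifact-path")["Parameter"]["Value"]

    # 3 & 4. Configure the ingestion steps and launch the cluster.
    emr.run_job_flow(
        Name=f"lifeline-ingest-v{next_version}",
        ReleaseLabel="emr-5.30.0",
        Applications=[{"Name": "Spark"}],
        Instances={
            "InstanceGroups": [
                {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
            ],
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
        Steps=[
            {
                "Name": step_name,
                "ActionOnFailure": "TERMINATE_CLUSTER",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": ["spark-submit", f"{artifact_path}/{script}",
                             "--version", str(next_version)],
                },
            }
            for step_name, script in [
                ("CreateNewVersion", "create_new_version.py"),
                ("TransformToJson", "transform_to_json.py"),
                ("IngestToDynamo", "ingest_to_dynamo.py"),
                ("Cleanup", "cleanup.py"),
            ]
        ],
    )
```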

In serverless environments, boundaries between services present an interesting challenge. In order to set up the ingestion steps, the Lambda function needs to know what the next version number will be. We considered a few options for how to approach this within the Lambda function:

  • Identify the next version number and create a new entry in the DynamoDB version info table, setting the state to “IN PROGRESS.”
  • Identify the next version number, but use an EMR step to create the new record in the version info table.

We chose the second approach to reduce the possibility of creating a dangling “IN PROGRESS” record in the version info table. From our experience with EMR pipelines on the AREMAC project, we knew clusters could fail to launch for a variety of reasons. If a new version record was created by the Lambda and the cluster failed to launch, this would leave an orphaned “IN PROGRESS” record in the version info table. Deferring creation of the new version enables us to encapsulate version generation and ingestion cleanup within the EMR cluster. This also enabled us to use existing CloudWatch rules to send Slack alerts if the EMR cluster failed, eliminating the need to spend extra engineering time setting up monitoring.

State is another tricky element to manage in a serverless environment. In addition to tracking the version number, we also had to ensure the EMR pipeline was set up to use the latest ingestion artifacts. Without access to an artifact store, we accomplished this by including the ingestion artifacts with the Lambda code and leveraging AWS SAM to package and deploy them to S3. The deploy process would then update the Parameter Store with the path to the latest ingestion artifacts.
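A sketch of that final deploy step (the parameter name and artifact prefix are hypothetical):

```python
# Sketch: after `sam package` / `sam deploy` uploads the ingestion artifacts,
# record their S3 location so the launcher Lambda can point EMR at the
# latest build. Names are hypothetical.
import boto3

ssm = boto3.client("ssm")

ssm.put_parameter(
    Name="/lifeline/ingestion-artifact-path",
    Value="s3://example-lifeline-artifacts/builds/2020-09-08",
    Type="String",
    Overwrite=True,
)
```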

EMR Ingestion Pipeline

An example of the EMR ingestion steps is shown below:

Lifeline ingestion pipeline in EMR

Starting at the bottom of the image, we will walk through the ingestion process:

  1. The CreateNewVersion step generates the new version in DynamoDB, creating both a new record in the version info table and a new eligibility table bearing the new version number. This table will be used by the IngestToDynamo step for transferring the data into DynamoDB.
  2. The data is then transformed from parquet to JSON to enable ingest into DynamoDB.
  3. Using map reduce, the IngestToDynamo step loads the versioned eligibility table with the eligibility data.
  4. A cleanup step determines if the previous steps have been successful. If so, the corresponding version info record is updated with a SUCCEEDED state, and a parameter in AWS Systems Manager is updated with the name of the versioned eligibility table. This parameter is used by the Lifeline API to identify the DynamoDB table to query for eligibility requests. Because this parameter is only moved on completion of a successful ingest, we can guarantee batch atomicity and allow the API to remain online during ingestion. If the ingestion failed, the version info record is marked as FAILED and the step fails, triggering an alert to the team to debug. (A sketch of this step follows the list.)
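Below is a minimal sketch of that cleanup logic as it might run on the cluster; the table, attribute, and parameter names are hypothetical:

```python
# Sketch of the final cleanup step, with hypothetical table, attribute, and
# parameter names. On success it flips the pointer the API reads; on failure
# it marks the version FAILED and fails the step.
import sys
from datetime import datetime, timezone

import boto3

dynamodb = boto3.resource("dynamodb")
ssm = boto3.client("ssm")


def finalize(version: int, ingest_succeeded: bool) -> None:
    version_info = dynamodb.Table("lifeline_version_info")
    now = datetime.now(timezone.utc).isoformat()

    if ingest_succeeded:
        version_info.update_item(
            Key={"version": version},
            UpdateExpression="SET #s = :s, updated_at = :t",
            ExpressionAttributeNames={"#s": "status"},
            ExpressionAttributeValues={":s": "SUCCEEDED", ":t": now},
        )
        # Only now does the API start reading the new table; this flip is
        # what gives us batch atomicity.
        ssm.put_parameter(
            Name="/lifeline/current-eligibility-table",
            Value=f"eligibility_v{version}",
            Type="String",
            Overwrite=True,
        )
    else:
        version_info.update_item(
            Key={"version": version},
            UpdateExpression="SET #s = :s, updated_at = :t",
            ExpressionAttributeNames={"#s": "status"},
            ExpressionAttributeValues={":s": "FAILED", ":t": now},
        )
        # Failing the step fails the cluster, which triggers the existing
        # CloudWatch rules and Slack alerts.
        sys.exit(1)
```

Because the parameter only moves inside this step, a failure anywhere earlier in the pipeline leaves the API reading the previous version table untouched.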

The final cleanup step addresses another idiosyncrasy in serverless environments. Serverless components operate in an ephemeral manner, completing the task they have been assigned and terminating when their job is done. In a system where there are many moving parts, this can quickly lead to an unmaintainable string of events and functions to monitor, and introduce additional points of failure. Running the ingestion process in an EMR cluster was an attractive solution that enabled us to encapsulate a variety of tasks into a single service, reducing the maintenance and monitoring burden.

Other Potential Solutions

One of the challenges of developing technology for the government is the restriction on which technologies can be used. Many cloud services are now available, thanks in part to the pioneering work in the serverless space by T-MSIS and AREMAC, but the authorization process is time-consuming and not something we could afford within the short development window for Lifeline. Here are some alternatives we considered that couldn’t be used at the time of development due to these constraints:

  • Hive provides a DynamoDB storage handler that allows for syncing data between the two systems. The underlying engine for this mechanism, Tez, was not approved for use in this manner at the time.
  • AWS Step Functions provide a way to orchestrate AWS Lambda and other services, potentially addressing several of the issues we highlighted in this article. This technology was provisionally approved at the time; given the potential hurdles in getting it approved for production use, the team chose to move forward with the approaches described above.

Summary

Serverless technologies make it possible to develop systems quickly without having to deploy and maintain the underlying infrastructure. This was crucial for the Lifeline API platform to meet its five-month deadline. As we discussed in this article, serverless systems also present several unique challenges:

  • Maintaining state: We found the AWS Parameter Store to be a great way to share state between serverless components, enabling asynchronous communication between the ingest process and the API about both the current version table and the versioning of ingestion artifacts.
  • Handoff between serverless components: It is important to be cognizant of potential points of failure when handing off between components, such as with the DynamoDB versioning passed from our Lambda ingestion function to the EMR cluster. Keeping these edge cases in mind when designing serverless systems will save future headaches.
  • The ephemeral nature of serverless components: Serverless systems can quickly turn into a web of components, becoming difficult to maintain and incurring additional costs to monitor. By encapsulating our ingestion process in an EMR cluster, we can track and clean up after each ingestion run, keeping our service footprint small and easy to debug and maintain.

In addition to serverless challenges, we also presented our batch-atomic ingest methodology. Using versioned DynamoDB tables for each ingestion event enables us to keep the API online during ingestion, quickly roll back to previous versions if needed, and track ingestion runtimes over time.
