Zero-latency SQLite storage in every Durable Object


Traditional cloud storage is inherently slow, because it is normally accessed over a network and must carefully synchronize across many clients that could be accessing the same data. But what if we could instead put your application code deep into the storage layer, such that your code runs directly on the machine where the data is stored, and the database itself executes as a local library embedded inside your application?

Durable Objects (DO) are a novel approach to cloud computing which accomplishes just that: Your application code runs exactly where the data is stored. Not just on the same machine: your storage lives in the same thread as the application, requiring not even a context switch to access. With proper use of caching, storage latency is essentially zero, while nevertheless being durable and consistent.

Until today, DOs only offered key/value oriented storage. But now, they support a full SQL query interface with tables and indexes, through the power of SQLite.

SQLite is the most-used SQL database implementation in the world, with billions of installations. It’s on practically every phone and desktop computer, and many embedded devices use it as well. It's known to be blazingly fast and rock solid. But it's been less common on the server. This is because traditional cloud architecture favors large distributed databases that live separately from application servers, while SQLite is designed to run as an embedded library. In this post, we'll show you how Durable Objects turn this architecture on its head and unlock the full power of SQLite in the cloud.

Durable Objects (DOs) are a part of the Cloudflare Workers serverless platform. A DO is essentially a small server that can be addressed by a unique name and can keep state both in-memory and on-disk. Workers running anywhere on Cloudflare's network can send messages to a DO by its name, and all messages addressed to the same name — from anywhere in the world — will find their way to the same DO instance.

DOs are intended to be small and numerous. A single application can create billions of DOs distributed across our global network. Cloudflare automatically decides where a DO should live based on where it is accessed, automatically starts it up as needed when requests arrive, and shuts it down when idle. A DO has in-memory state while running and can also optionally store long-lived durable state. Since there is exactly one DO for each name, a DO can be used to coordinate between operations on the same logical object.

For example, imagine a real-time collaborative document editor application. Many users may be editing the same document at the same time. Each user's changes must be broadcast to other users in real time, and conflicts must be resolved. An application built on DOs would typically create one DO for each document. The DO would receive edits from users, resolve conflicts, broadcast the changes back out to other users, and keep the document content updated in its local storage.

DOs are especially good at real-time collaboration, but are by no means limited to this use case. They are general-purpose servers that can implement any logic you desire to serve requests. Even more generally, DOs are a basic building block for distributed systems.

When using Durable Objects, it's important to remember that they are intended to scale out, not up. A single object is inherently limited in throughput since it runs on a single thread of a single machine. To handle more traffic, you create more objects. This is easiest when different objects can handle different logical units of state (like different documents, different users, or different "shards" of a database), where each unit of state has low enough traffic to be handled by a single object. But sometimes, a lot of traffic needs to modify the same state: consider a vote counter with a million users all trying to cast votes at once. To handle such cases with Durable Objects, you would need to create a set of objects that each handle a subset of traffic and then replicate state to each other. Perhaps they use CRDTs in a gossip network, or perhaps they implement a fan-in/fan-out approach to a single primary object. Whatever approach you take, Durable Objects make it fast and easy to create more stateful nodes as needed.
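
To make the fan-in idea concrete, here is a minimal sketch. The VOTE_PRIMARY binding, the totals table, and the one-second flush interval are illustrative assumptions; the rest uses the standard Durable Objects alarm, RPC, and SQL APIs:

import {DurableObject} from "cloudflare:workers";

// Each shard absorbs votes from its own slice of users and periodically
// forwards a subtotal to a single primary object.
export class VoteShard extends DurableObject {
  pending = 0;  // in-memory only; a real implementation would persist this

  async castVote() {
    this.pending++;
    if (this.pending === 1) {
      // First vote since the last flush: schedule one, coalescing later votes into it.
      await this.ctx.storage.setAlarm(Date.now() + 1000);
    }
  }

  async alarm() {
    if (this.pending === 0) return;
    let id = this.env.VOTE_PRIMARY.idFromName("global");
    await this.env.VOTE_PRIMARY.get(id).addVotes(this.pending);
    this.pending = 0;
  }
}

// The primary sees one aggregated write per shard per second instead of one
// write per user vote. (Assumes a `totals` table was created elsewhere.)
export class VotePrimary extends DurableObject {
  addVotes(n) {
    this.ctx.storage.sql.exec(`UPDATE totals SET count = count + ? WHERE id = 0`, n);
  }
}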

In traditional cloud architecture, stateless application servers run business logic and communicate over the network to a database. Even if the network is local, database requests still incur latency, typically measured in milliseconds.

When a Durable Object uses SQLite, SQLite is invoked as a library. This means the database code runs not just on the same machine as the DO, not just in the same process, but in the very same thread. Latency is effectively zero, because there is no communication barrier between the application and SQLite. A query can complete in microseconds.

Reads and writes are synchronous

The SQL query API in DOs does not require you to await results — they are returned synchronously:

// No awaits!
let cursor = sql.exec("SELECT name, email FROM users");
for (let user of cursor) {
  console.log(user.name, user.email);
}

This may come as a surprise to some. Querying a database is I/O, right? I/O should always be asynchronous, right? Isn't this a violation of the natural order of JavaScript?

It's OK! The database content is probably cached in memory already, and SQLite is being called as a library in the same thread as the application, so the query often actually won't spend any time at all waiting for I/O. Even if it does have to go to disk, it's a local SSD. You might as well consider the local disk as just another layer in the memory cache hierarchy: L5 cache, if you will. In any case, it will respond quickly.

Meanwhile, synchronous queries provide some big benefits. First, the logistics of asynchronous event loops have a cost, so in the common case where the data is already in memory, a synchronous query will actually complete faster than an async one.

More importantly, though, synchronous queries help you avoid subtle bugs. Any time your application awaits a promise, it's possible that some other code executes while you wait. The state of the world may have changed by the time your await completes. Maybe even other SQL queries were executed. This can lead to subtle bugs that are hard to reproduce because they require events to happen at just the wrong time. With a synchronous API, though, none of that can happen. Your code always executes in the order you wrote it, uninterrupted.
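
As a tiny illustration, consider a read-then-write sequence; the async db client in the first half is hypothetical, shown only for contrast:

// With an asynchronous client, other code can run between these two lines,
// so the balance may have changed by the time the UPDATE executes:
let row = await db.query("SELECT balance FROM accounts WHERE id = ?", id);
await db.query("UPDATE accounts SET balance = ? WHERE id = ?", row.balance + amount, id);

// With the synchronous API, nothing can interleave between the two statements:
let account = sql.exec("SELECT balance FROM accounts WHERE id = ?", id).one();
sql.exec("UPDATE accounts SET balance = ? WHERE id = ?", account.balance + amount, id);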

Fast writes with Output Gates

Database experts might have a deeper objection to synchronous queries: Yes, caching may mean we can perform reads and writes very fast. However, in the case of a write, just writing to cache isn't good enough. Before we return success to our client, we must confirm that the write is actually durable, that is, it has actually made it onto disk or network storage such that it cannot be lost if the power suddenly goes out.

Normally, a database would confirm all writes before returning to the application. So if the query is successful, it is confirmed. But confirming writes can be slow, because it requires waiting for the underlying storage medium to respond. Normally, this is OK because the write is performed asynchronously, so the program can go on and work on other things while it waits for the write to finish. It looks kind of like this:

But I just told you that in Durable Objects, writes are synchronous. While a synchronous call is running, no other code in the program can run (because JavaScript does not have threads). This is convenient, as mentioned above, because it means you don't need to worry that the state of the world may have changed while you were waiting. However, if write queries have to wait a while, and the whole program must pause and wait for them, then throughput will suffer.

Luckily, in Durable Objects, writes do not have to wait, due to a little trick we call "Output Gates".

In DOs, when the application issues a write, it continues executing without waiting for confirmation. However, when the DO then responds to the client, the response is blocked by the "Output Gate". This system holds the response until all storage writes relevant to the response have been confirmed, then sends the response on its way. In the rare case that the write fails, the response will be replaced with an error and the Durable Object itself will restart. So, even though the application constructed a "success" response, nobody can ever see that this happened, and thus nobody can be misled into believing that the data was stored.

Let's see what this looks like with multiple requests:

If you compare this against the first diagram above, you should notice a few things:

  • The timing of requests and confirmations is the same.

  • But, all responses were sent to the client sooner than in the first diagram. Latency was reduced! This is because the application is able to work on constructing the response in parallel with the storage layer confirming the write.

  • Request handling is no longer interleaved between the three requests. Instead, each request runs to completion before the next begins. The application does not need to worry, during the handling of one request, that its state might change unexpectedly due to a concurrent request.

With Output Gates, we get the ease-of-use of synchronous writes, while also getting lower latency and no loss of throughput.

N+1 selects? No problem.

Zero-latency queries aren't just faster, they allow you to structure your code differently, often making it simpler. A classic example is the "N+1 selects" or "N+1 queries" problem. Let's illustrate this problem with an example:

// N+1 SELECTs example

// Get the 100 most-recently-modified docs.
let docs = sql.exec(`
  SELECT title, authorId FROM documents
  ORDER BY lastModified DESC
  LIMIT 100
`).toArray();

// For each returned document, get the author name from the users table.
for (let doc of docs) {
  doc.authorName = sql.exec(
      "SELECT name FROM users WHERE id = ?", doc.authorId).one().name;
}

If you are an experienced SQL user, you are probably cringing at this code, and for good reason: this code does 101 queries! If the application is talking to the database across a network with 5ms latency, this will take 505ms to run, which is slow enough for humans to notice.

// Do it all in one query with a join?
let docs = sql.exec(`
  SELECT documents.title, users.name
  FROM documents JOIN users ON documents.authorId = users.id
  ORDER BY documents.lastModified DESC
  LIMIT 100
`).toArray();

Here we've used SQL features to turn our 101 queries into one query. Great! Except, what does it mean? We used an inner join, which is not to be confused with a left, right, or cross join. What's the difference? Honestly, I have no idea! I had to look up joins just to write this example and I'm already confused.

Well, good news: You don't need to figure it out. Because when using SQLite as a library, the first example above works just fine. It'll perform about the same as the second fancy version.

More generally, when using SQLite as a library, you don't have to learn how to do fancy things in SQL syntax. Your logic can be in regular old application code in your programming language of choice, orchestrating the most basic SQL queries that are easy to learn. It's fine. The creators of SQLite have made this point themselves.

Point-in-Time Recovery

While not necessarily related to speed, SQLite-backed Durable Objects offer another feature: any object can be reverted to the state it had at any point in time in the last 30 days. So if you accidentally execute a buggy query that corrupts all your data, don't worry: you can recover. There's no need to opt into this feature in advance; it's on by default for all SQLite-backed DOs. See the docs for details.
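
As a rough sketch of what a recovery could look like from inside the object, assuming bookmark-style methods along these lines (the exact names and signatures are in the docs, so treat this as illustrative only):

// Hypothetical sketch: roll this Durable Object back to its state 24 hours ago.
async function revertToYesterday(ctx) {
  // Assumed API: look up a bookmark for a past timestamp...
  let bookmark = await ctx.storage.getBookmarkForTime(Date.now() - 24 * 60 * 60 * 1000);
  // ...ask that the next session start from that bookmark...
  await ctx.storage.onNextSessionRestoreBookmark(bookmark);
  // ...then restart the object so it comes back up with the old state.
  ctx.abort();
}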

Let's say we're an airline, and we are implementing a way for users to choose their seats on a flight. We will create a new Durable Object for each flight. Within that DO, we will use a SQL table to track the assignments of seats to passengers. The code might look something like this:

import {DurableObject} from "cloudflare:workers";

// Manages seat assignment for a flight.
//
// This is an RPC interface. The methods can be called remotely by other Workers
// running anywhere in the world. All Workers that specify the same object ID
// (probably based on the flight number and date) will reach the same instance of
// FlightSeating.
export class FlightSeating extends DurableObject {
  sql = this.ctx.storage.sql;

  // Application calls this when the flight is first created to set up the seat map.
  initializeFlight(seatList) {
    this.sql.exec(`
      CREATE TABLE seats (
        seatId TEXT PRIMARY KEY,  -- e.g. "3B"
        occupant TEXT             -- null if available
      )
    `);

    for (let seat of seatList) {
      this.sql.exec(`INSERT INTO seats VALUES (?, null)`, seat);
    }
  }

  // Get a list of available seats.
  getAvailable() {
    let results = [];

    // Query returns a cursor.
    let cursor = this.sql.exec(`SELECT seatId FROM seats WHERE occupant IS NULL`);

    // Cursors are iterable.
    for (let row of cursor) {
      // Each row is an object with a property for each column.
      results.push(row.seatId);
    }

    return results;
  }

  // Assign passenger to a seat.
  assignSeat(seatId, occupant) {
    // Check that seat isn't occupied.
    let cursor = this.sql.exec(`SELECT occupant FROM seats WHERE seatId = ?`, seatId);
    let result = [...cursor][0];  // Get the first result from the cursor.
    if (!result) {
      throw new Error("No such seat: " + seatId);
    }
    if (result.occupant !== null) {
      throw new Error("Seat is occupied: " + seatId);
    }

    // If the occupant is already in a different seat, remove them.
    this.sql.exec(`UPDATE seats SET occupant = null WHERE occupant = ?`, occupant);

    // Assign the seat. Note: We don't have to worry that a concurrent request may
    // have grabbed the seat between the two queries, because the code is synchronous
    // (no `await`s) and the database is private to this Durable Object. Nothing else
    // could have changed since we checked that the seat was available earlier!
    this.sql.exec(`UPDATE seats SET occupant = ? WHERE seatId = ?`, occupant, seatId);
  }
}

(With just a little more code, we could extend this example to allow clients to subscribe to seat changes with WebSockets, so that if multiple people are choosing their seats at the same time, they can see in real time as seats become unavailable. But, that's outside the scope of this blog post, which is just about SQL storage.)

Then in wrangler.toml, define a migration setting up your DO class like usual, but instead of using new_classes, use new_sqlite_classes:

[[migrations]]
tag = "v1"
new_sqlite_classes = ["FlightSeating"]

SQLite-backed objects also support the existing key/value-based storage API: KV data is stored into a hidden table in the SQLite database. So, existing applications built on DOs will work when deployed using SQLite-backed objects.
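
For example, a minimal object that sticks to the key/value API works unchanged on the new backend:

import {DurableObject} from "cloudflare:workers";

export class Counter extends DurableObject {
  async increment() {
    // These key/value calls are stored in a hidden table of the SQLite database.
    let value = (await this.ctx.storage.get("count")) ?? 0;
    await this.ctx.storage.put("count", value + 1);
    return value + 1;
  }
}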

However, because SQLite-backed objects are based on an all-new storage backend, it is currently not possible to switch an existing deployed DO class to use SQLite. You must ask for SQLite when initially deploying the new DO class; you cannot change it later. We plan to begin migrating existing DOs to the new storage backend in 2025.

Pricing

We’ve kept pricing for SQLite-in-DO similar to D1, Cloudflare’s serverless SQL database, by billing for SQL queries (based on rows) and SQL storage. SQL storage per object is limited to 1 GB during the beta period, and will be increased to 10 GB on general availability. DO requests and duration billing are unchanged and apply to all DOs regardless of storage backend. 

During the initial beta, billing is not enabled for SQL queries (rows read and rows written) and SQL storage. SQLite-backed objects will incur charges for requests and duration. We plan to enable SQL billing in the first half of 2025 with advance notice.

Workers Paid
  • Rows read: first 25 billion / month included, then $0.001 / million rows
  • Rows written: first 50 million / month included, then $1.00 / million rows
  • SQL storage: 5 GB-month included, then $0.20 / GB-month

For more on how to use SQLite-in-Durable Objects, check out the documentation.

Cloudflare Workers already offers another SQLite-backed database product: D1. In fact, D1 is itself built on SQLite-in-DO. So, what's the difference? Why use one or the other?

In short, you should think of D1 as a more "managed" database product, while SQLite-in-DO is more of a lower-level “compute with storage” building block.

D1 fits into a more traditional cloud architecture, where stateless application servers talk to a separate database over the network. Those application servers are typically Workers, but could also be clients running outside of Cloudflare. D1 also comes with a pre-built HTTP API and managed observability features like query insights. With D1, where your application code and SQL database queries are not colocated like in SQLite-in-DO, Workers has Smart Placement to dynamically run your Worker in the best location to reduce total request latency, considering everything your Worker talks to, including D1. By the end of 2024, D1 will support automatic read replication for scalability and low-latency access around the world. If this managed model appeals to you, use D1.

Durable Objects require a bit more effort, but in return, give you more power. With DO, you have two pieces of code that run in different places: a front-end Worker which routes incoming requests from the Internet to the correct DO, and the DO itself, which runs on the same machine as the SQLite database. You may need to think carefully about which code to run where, and you may need to build some of your own tooling that exists out-of-the-box with D1. But because you are in full control, you can tailor the solution to your application's needs and potentially achieve more.

When Durable Objects first launched in 2020, it offered only a simple key/value-based interface for durable storage. Under the hood, these keys and values were stored in a well-known off-the-shelf database, with regional instances of this database deployed to locations in our data centers around the world. Durable Objects in each region would store their data to the regional database.

For SQLite-backed Durable Objects, we have completely replaced the persistence layer with a new system built from scratch, called Storage Relay Service, or SRS. SRS has already been powering D1 for over a year, and can now be used more directly by applications through Durable Objects.

SRS is based on a simple idea:

Local disk is fast and randomly-accessible, but expensive and prone to disk failures. Object storage (like R2) is cheap and durable, but much slower than local disk and not designed for database-like access patterns. Can we get the best of both worlds by using a local disk as a cache on top of object storage?

So, how does it work?

The mismatch in functionality between local disk and object storage

A SQLite database on disk tends to undergo many small changes in rapid succession. Any row of the database might be updated by any particular query, but the database is designed to avoid rewriting parts that didn't change. Read queries may randomly access any part of the database. Assuming the right indexes exist to support the query, they should not require reading parts of the database that aren't relevant to the results, and should complete in microseconds.

Object storage, on the other hand, is designed for an entirely different usage model: you upload an entire "object" (blob of bytes) at a time, and download an entire blob at a time. Each blob has a different name. For maximum efficiency, blobs should be fairly large, from hundreds of kilobytes to gigabytes in size. Latency is relatively high, measured in tens or hundreds of milliseconds.

So how do we back up our SQLite database to object storage? An obviously naive strategy would be to simply make a copy of the database file from time to time and upload it as a new "object". But, uploading the database on every change — and making the application wait for the upload to complete — would obviously be way too slow. We could choose to upload the database only occasionally — say, every 10 minutes — but this means in the case of a disk failure, we could lose up to 10 minutes of changes. Data loss is, uh, bad! And even then, for most databases, it's likely that most of the data doesn't change every 10 minutes, so we'd be uploading the same data over and over again.

Trick one: Upload a log of changes

Instead of uploading the entire database, SRS records a log of changes, and uploads those.

Conveniently, SQLite itself already has a concept of a change log: the Write-Ahead Log, or WAL. SRS always configures SQLite to use WAL mode. In this mode, any changes made to the database are first written to a separate log file. From time to time, the database is "checkpointed", merging the changes back into the main database file. The WAL format is well-documented and easy to understand: it's just a sequence of "frames", where each frame is an instruction to write some bytes to a particular offset in the database file.
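
Schematically, you can picture each frame like this (not the actual on-disk encoding, which also carries checksums and commit information):

// One WAL frame, conceptually: "write these bytes at this offset in the main database file".
const frame = {
  offset: 4096 * 41,            // where in the database file the change lands
  bytes: new Uint8Array(4096),  // the new contents for that region (one page)
  isCommit: false,              // the last frame of a transaction marks the commit
};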

SRS monitors changes to the WAL file (by hooking SQLite's VFS to intercept file writes) to discover the changes being made to the database, and uploads those to object storage.

Unfortunately, SRS cannot simply upload every single change as a separate "object", as this would result in too many objects, each of which would be inefficiently small. Instead, SRS batches changes over a period of up to 10 seconds, or up to 16 MB worth, whichever happens first, then uploads the whole batch as a single object.

When reconstructing a database from object storage, we must download the series of change batches and replay them in order. Of course, if the database has undergone many changes over a long period of time, this can get expensive. In order to limit how far back it needs to look, SRS also occasionally uploads a snapshot of the entire content of the database. SRS will decide to upload a snapshot any time that the total size of logs since the last snapshot exceeds the size of the database itself. This heuristic implies that the total amount of data that SRS must download to reconstruct a database is limited to no more than twice the size of the database. Since we can delete data from object storage that is older than the latest snapshot, this also means that our total stored data is capped to 2x the database size.
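
A schematic sketch of that heuristic (not SRS's actual code):

// Upload a fresh snapshot once the WAL batches uploaded since the last snapshot
// outweigh the database itself.
function shouldUploadSnapshot(logBytesSinceSnapshot, databaseSizeBytes) {
  return logBytesSinceSnapshot > databaseSizeBytes;
}

// Worst case for reconstruction: one snapshot (at most the database size) plus
// the logs since then (also at most the database size), i.e. about 2x the database.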

Credit where credit is due: This idea — uploading WAL batches and snapshots to object storage — was inspired by Litestream, although our implementation is different.

Trick two: Relay through other servers in our global network

Batches are only uploaded to object storage every 10 seconds. But obviously, we cannot make the application wait for 10 whole seconds just to confirm a write. So what happens if the application writes some data, returns a success message to the user, and then the machine fails 9 seconds later, losing the data?

To solve this problem, we take advantage of our global network. Every time SQLite commits a transaction, SRS will immediately forward the change log to five "follower" machines across our network. Once at least three of these followers respond that they have received the change, SRS informs the application that the write is confirmed. (As discussed earlier, the write confirmation opens the Durable Object's "output gate", unblocking network communications to the rest of the world.)
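
Here is a minimal sketch of the three-of-five confirmation rule; follower.send() is a hypothetical stand-in for whatever transport SRS actually uses:

// A write is considered confirmed as soon as any three followers acknowledge it.
function confirmWrite(followers /* five of them */, changeLog) {
  return new Promise((resolve) => {
    let acks = 0;
    for (const follower of followers) {
      follower.send(changeLog)
        .then(() => { if (++acks === 3) resolve(); })  // confirmed at three acks
        .catch(() => {});  // a slow or failed follower doesn't block confirmation
    }
  });
}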

When a follower receives a change, it temporarily stores it in a buffer on local disk, and then awaits further instructions. Later on, once SRS has successfully uploaded the change to object storage as part of a batch, it informs each follower that the change has been persisted. At that point, the follower can simply delete the change from its buffer.

However, if the follower never receives the persisted notification, then, after some timeout, the follower itself will upload the change to object storage. Thus, if the machine running the database suddenly fails, as long as at least one follower is still running, it will ensure that all confirmed writes are safely persisted.

Each of a database's five followers is located in a different physical data center. Cloudflare's network consists of hundreds of data centers around the world, which means it is always easy for us to find four other data centers near any Durable Object (in addition to the one it is running in). In order for a confirmed write to be lost, then, at least four different machines in at least three different physical buildings would have to fail simultaneously (three of the five followers, plus the Durable Object's host machine). Of course, anything can happen, but this is exceedingly unlikely.

Followers also come in handy when a Durable Object's host machine is unresponsive. We may not know for sure if the machine has died completely, or if it is still running and responding to some clients but not others. We cannot start up a new instance of the DO until we know for sure that the previous instance is dead – or, at least, that it can no longer confirm writes, since the old and new instances could then confirm contradictory writes. To deal with this situation, if we can't reach the DO's host, we can instead try to contact its followers. If we can contact at least three of the five followers, and tell them to stop confirming writes for the unreachable DO instance, then we know that instance is unable to confirm any more writes going forward. We can then safely start up a new instance to replace the unreachable one.

Bonus feature: Point-in-Time Recovery

I mentioned earlier that SQLite-backed Durable Objects can be asked to revert their state to any time in the last 30 days. How does this work?

This was actually an accidental feature that fell out of SRS's design. Since SRS stores a complete log of changes made to the database, we can restore to any point in time by replaying the change log from the last snapshot. The only thing we have to do is make sure we don't delete those logs too soon.

Normally, whenever a snapshot is uploaded, all previous logs and snapshots can then be deleted. But instead of deleting them immediately, SRS merely marks them for deletion 30 days later. In the meantime, if a point-in-time recovery is requested, the data is still there to work from.

For a database with a high volume of writes, this may mean we store a lot of data for a lot longer than needed. As it turns out, though, once data has been written at all, keeping it around for an extra month is pretty cheap — typically cheaper, even, than writing it in the first place. It's a small price to pay for always-on disaster recovery.

SQLite-backed DOs are available in beta starting today. You can start building with SQLite-in-DO by visiting the developer documentation and providing beta feedback via the #durable-objects channel on our Developer Discord.

Do distributed systems like SRS excite you? Would you like to be part of building them at Cloudflare? We're hiring!


How the CSI (Container Storage Interface) Works


If you work with persistent storage in Kubernetes, maybe you've seen articles about how to migrate from in-tree to CSI volumes, but aren't sure what all the fuss is about? Or perhaps you're trying to debug a stuck VolumeAttachment that won't unmount from a node, holding up your important StatefulSet rollout? A clear understanding of what the Container Storage Interface (or CSI for short) is and how it works will give you confidence when dealing with persistent data in Kubernetes, allowing you to answer these questions and more!

The Container Storage Interface is an API specification that enables developers to build custom drivers which handle the provisioning, attaching, and mounting of volumes in containerized workloads. As long as a driver correctly implements the CSI API spec, it can be used in any supported Container Orchestration system, like Kubernetes. This decouples persistent storage development efforts from core cluster management tooling, allowing for the rapid development and iteration of storage drivers across the cloud native ecosystem.

In Kubernetes, the CSI has replaced legacy in-tree volumes with a more flexible means of managing storage mediums. Previously, in order to take advantage of new storage types, one would have had to upgrade an entire cluster's Kubernetes version to access new PersistentVolume API fields for a new storage type. But now, with the plethora of independent CSI drivers available, you can add any type of underlying storage to your cluster instantly, as long as there's a driver for it.

But what if existing drivers don't provide the features that you require and you want to build a new custom driver? Maybe you're concerned about the ramifications of migrating from in-tree to CSI volumes? Or, you simply want to learn more about how persistent storage works in Kubernetes? Well, you're in the right place! This article will describe what the CSI is and detail how it's implemented in Kubernetes.

It's APIs All the Way Down

Like many things in the Kubernetes ecosystem, the Container Storage Interface is actually just an API specification. In the container-storage-interface/spec GitHub repo, you can find this spec in 2 different versions:

  1. A protobuf file that defines the API schema in gRPC terms
  2. A markdown file that describes the overall system architecture and goes into detail about each API call

What I'm going to discuss in this section is an abridged version of that markdown file, while borrowing some nice ASCII diagrams from the repo itself!

Architecture

A CSI Driver has 2 components, a Node Plugin and a Controller Plugin. The Controller Plugin is responsible for high-level volume management; creating, deleting, attaching, detaching, snapshotting, and restoring physical (or virtualized) volumes. If you're using a driver built for a cloud provider, like EBS on AWS, the driver's Controller Plugin communicates with AWS HTTPS APIs to perform these operations. For other storage types like NFS, iSCSI, ZFS, and more, the driver sends these requests to the underlying storage's API endpoint, in whatever format that API accepts.

On the other hand, the Node Plugin is responsible for mounting and provisioning a volume once it's been attached to a node. These low-level operations usually require privileged access, so the Node Plugin is installed on every node in your cluster's data plane, wherever a volume could be mounted.

The Node Plugin is also responsible for reporting metrics like disk usage back to the Container Orchestration system (referred to as the "CO" in the spec). As you might have guessed already, I'll be using Kubernetes as the CO in this post! But what makes the spec so powerful is that it can be used by any container orchestration system, like Nomad for example, as long as it abides by the contract set by the API guidelines.

The specification doc provides a few possible deployment patterns, so let's start with the most common one.

              CO "Master" Host
+-------------------------------------------+
|                                           |
|  +------------+           +------------+  |
|  |     CO     |   gRPC    | Controller |  |
|  |            +----------->   Plugin   |  |
|  +------------+           +------------+  |
|                                           |
+-------------------------------------------+

              CO "Node" Host(s)
+-------------------------------------------+
|                                           |
|  +------------+           +------------+  |
|  |     CO     |   gRPC    |    Node    |  |
|  |            +----------->   Plugin   |  |
|  +------------+           +------------+  |
|                                           |
+-------------------------------------------+

Figure 1: The Plugin runs on all nodes in the cluster: a centralized
Controller Plugin is available on the CO master host and the Node
Plugin is available on all of the CO Nodes.

Since the Controller Plugin is concerned with higher-level volume operations, it does not need to run on a host in your cluster's data plane. For example, in AWS, the Controller makes AWS API calls like ec2:CreateVolume, ec2:AttachVolume, or ec2:CreateSnapshot to manage EBS volumes. These functions can be run anywhere, as long as the caller is authenticated with AWS. All the CO needs is to be able to send messages to the plugin over gRPC. So in this architecture, the Controller Plugin is running on a "master" host in the cluster's control plane.

On the other hand, the Node Plugin must be running on a host in the cluster's data plane. Once the Controller Plugin has done its job by attaching a volume to a node for a workload to use, the Node Plugin (running on that node) will take over by mounting the volume to a well-known path and optionally formatting it. At this point, the CO is free to use that path as a volume mount when creating a new containerized process; so all data on that mount will be stored on the underlying volume that was attached by the Controller Plugin. It's important to note that the Container Orchestrator, not the Controller Plugin, is responsible for letting the Node Plugin know that it should perform the mount.

Volume Lifecycle

The spec provides a flowchart of basic volume operations, also in the form of a cool ASCII diagram:

   CreateVolume +------------+ DeleteVolume
 +------------->|  CREATED   +--------------+
 |              +---+----^---+              |
 |       Controller |    | Controller       v
+++         Publish |    | Unpublish       +++
|X|          Volume |    | Volume          | |
+-+             +---v----+---+             +-+
                | NODE_READY |
                +---+----^---+
               Node |    | Node
            Publish |    | Unpublish
             Volume |    | Volume
                +---v----+---+
                | PUBLISHED  |
                +------------+

Figure 5: The lifecycle of a dynamically provisioned volume, from
creation to destruction.

Mounting a volume is a synchronous process: each step requires the previous one to have run successfully. For example, if a volume does not exist, how could we possibly attach it to a node?

When publishing (mounting) a volume for use by a workload, the Node Plugin first requires that the Controller Plugin has successfully published a volume at a directory that it can access. In practice, this usually means that the Controller Plugin has created the volume and attached it to a node. Now that the volume is attached, it's time for the Node Plugin to do its job. At this point, the Node Plugin can access the volume at its device path to create a filesystem and mount it to a directory. Once it's mounted, the volume is considered to be published and it is ready for a containerized process to use. This ends the CSI mounting workflow.

Continuing the AWS example, when the Controller Plugin publishes a volume, it calls ec2:CreateVolume followed by ec2:AttachVolume. These two API calls allocate the underlying storage by creating an EBS volume and attaching it to a particular instance. Once the volume is attached to the EC2 instance, the Node Plugin is free to format it and create a mount point on its host's filesystem.

Here is an annotated version of the above volume lifecycle diagram, this time with the AWS calls included in the flow chart.

   CreateVolume +------------+ DeleteVolume
 +------------->|  CREATED   +--------------+
 |              +---+----^---+              |
 |       Controller |    | Controller       v
+++         Publish |    | Unpublish       +++
|X|          Volume |    | Volume          | |
+-+                 |    |                 +-+
                    |    |
 <ec2:CreateVolume> |    | <ec2:DeleteVolume>
                    |    |
 <ec2:AttachVolume> |    | <ec2:DetachVolume>
                    |    |
                +---v----+---+
                | NODE_READY |
                +---+----^---+
               Node |    | Node
            Publish |    | Unpublish
             Volume |    | Volume
                +---v----+---+
                | PUBLISHED  |
                +------------+

If a Controller wants to delete a volume, it must first wait for the Node Plugin to safely unmount the volume to preserve data and system integrity. Otherwise, if a volume is forcibly detached from a node before unmounting it, we could experience bad things like data corruption. Once the volume is safely unpublished (unmounted) by the Node Plugin, the Controller Plugin would then call ec2:DetachVolume to detach it from the node and finally ec2:DeleteVolume to delete it, assuming that you don't want to reuse the volume elsewhere.

What makes the CSI so powerful is that it does not prescribe how to publish a volume. As long as your driver correctly implements the required API methods defined in the CSI spec, it will be compatible with the CSI and by extension, be usable in COs like Kubernetes and Nomad.

Running CSI Drivers in Kubernetes

What I haven't entirely made clear yet is why the Controller and Node Plugins are plugins themselves! How does the Container Orchestrator call them, and where do they plug into?

Well, the answer depends on which Container Orchestrator you are using. Since I'm most familiar with Kubernetes, I'll be using it to demonstrate how a CSI driver interacts with a CO.

Deployment Model

Since the Node Plugin, responsible for low-level volume operations, must be running on every node in your data plane, it is typically installed using a DaemonSet. If you have heterogeneous nodes and only want to deploy the plugin to a subset of them, you can use node selectors, affinities, or anti-affinities to control which nodes receive a Node Plugin Pod. Since the Node Plugin requires root access to modify host volumes and mounts, these Pods will be running in privileged mode. In this mode, the Node Plugin can escape its container's security context to access the underlying node's filesystem when performing mounting and provisioning operations. Without these elevated permissions, the Node Plugin could only operate inside of its own containerized namespace without the system-level access that it requires to provision volumes on the node.

The Controller Plugin is usually run in a Deployment because it deals with higher-level primitives like volumes and snapshots, which don't require filesystem access to every single node in the cluster. Again, let's think about the AWS example I used earlier. If the Controller Plugin is just making AWS API calls to manage volumes and snapshots, why would it need access to a node's root filesystem? Most Controller Plugins are stateless and highly-available, both of which lend themselves to the Deployment model. The Controller also does not need to be run in a privileged context.

Now that we know how CSI plugins are deployed in a typical cluster, it's time to focus on how Kubernetes calls each plugin to perform CSI-related operations. A series of sidecar containers, registered with the Kubernetes API server to react to different events across the cluster, are deployed alongside each Controller and Node Plugin. In a way, this is similar to the typical Kubernetes controller pattern, where controllers react to changes in cluster state and attempt to reconcile the current cluster state with the desired one.

There are currently 6 different sidecars that work alongside each CSI driver to perform specific volume-related operations. Each sidecar registers itself with the Kubernetes API server and watches for changes in a specific resource type. Once the sidecar has detected a change that it must act upon, it calls the relevant plugin with one or more API calls from the CSI specification to perform the desired operations.
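
Schematically, each sidecar follows the same loop. The sketch below is a JavaScript rendering of that pattern, not the real Go implementations; watchResource and callPlugin are hypothetical stand-ins for the Kubernetes watch API and the CSI gRPC client:

// Watch one resource type and translate each relevant change into a CSI gRPC call.
async function runProvisionerSidecar(watchResource, callPlugin) {
  for await (const event of watchResource("PersistentVolumeClaim")) {
    if (event.type === "ADDED") {
      await callPlugin("CreateVolume", { name: event.object.metadata.name });
    } else if (event.type === "DELETED") {
      await callPlugin("DeleteVolume", { name: event.object.metadata.name });
    }
  }
}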

Here is a table of the sidecars that run alongside a Controller Plugin:

Sidecar Name           K8s Resources Watched      CSI API Endpoints Called
external-provisioner   PersistentVolumeClaim      CreateVolume, DeleteVolume
external-attacher      VolumeAttachment           Controller(Un)PublishVolume
external-snapshotter   VolumeSnapshot(Content)    CreateSnapshot, DeleteSnapshot
external-resizer       PersistentVolumeClaim      ControllerExpandVolume

How do these sidecars work together? Let's use an example of a StatefulSet to demonstrate. In this example, we're dynamically provisioning our PersistentVolumes (PVs) instead of mapping PersistentVolumeClaims (PVCs) to existing PVs. We start at the creation of a new StatefulSet with a VolumeClaimTemplate.

---
apiVersion: apps/v1
kind: StatefulSet
spec:
  volumeClaimTemplates:
  - metadata:
      name: www
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: "my-storage-class"
      resources:
        requests:
          storage: 1Gi

Creating this StatefulSet will trigger the creation of a new PVC based on the above template. Once the PVC has been created, the Kubernetes API will notify the external-provisioner sidecar that this new resource was created. The external-provisioner will then send a CreateVolume message to its neighbor Controller Plugin over gRPC. From here, the CSI driver's Controller Plugin takes over by processing the incoming gRPC message and will create a new volume based on its custom logic. In the AWS EBS driver, this would be an ec2:CreateVolume call.

At this point, the control flow moves to the built-in PersistentVolume controller, which will create a matching PV and bind it to the PVC. This allows the StatefulSet's underlying Pod to be scheduled and assigned to a Node.

Here, the external-attacher sidecar takes over. It will be notified of the new PV and call the Controller Plugin's ControllerPublishVolume endpoint, attaching the volume to the StatefulSet's assigned node. This is the equivalent of ec2:AttachVolume in AWS.

At this point, we have an EBS volume that is attached to an EC2 instance, all based on the creation of a StatefulSet, PersistentVolumeClaim, and the work of the AWS EBS CSI Controller Plugin.

There is only one unique sidecar that is deployed alongside the Node Plugin; the node-driver-registrar. This sidecar, running as part of a DaemonSet, registers the Node Plugin with a Node's kubelet. During the registration process, the Node Plugin will inform the kubelet that it is able to mount volumes using the CSI driver that it is part of. The kubelet itself will then wait until a Pod is scheduled to its corresponding Node, at which point it is then responsible for making the relevant CSI calls (NodePublishVolume) to the Node Plugin over gRPC.

There is also a livenessprobe sidecar that runs in both the Controller and Node Plugin Pods that monitors the health of the CSI driver and reports back to the Kubernetes Liveness Probe mechanism.

Communication Over Sockets

How do these sidecars communicate with the Controller and Node Plugins? Over gRPC through a shared socket! So each sidecar and plugin contains a volume mount pointing to a single unix socket.

CSI Controller Deployment

This diagram highlights the pluggable nature of CSI Drivers. To replace one driver with another, all you have to do is simply swap the CSI Driver container with another and ensure that it's listening to the unix socket that the sidecars are sending gRPC messages to. Because all drivers advertise their own different capabilities and communicate over the shared CSI API contract, it's literally a plug-and-play solution.

Conclusion

In this article, I only covered the high-level concepts of the Container Storage Interface spec and implementation in Kubernetes. While hopefully it has provided a clearer understanding of what happens once you install a CSI driver, writing one requires significant low-level knowledge of both your nodes' operating system(s) and the underlying storage mechanism that your driver is implementing. Luckily, CSI drivers exist for a variety of cloud providers and distributed storage solutions, so it's likely that you can find a CSI driver that already fulfills your requirements. But it always helps to know what's happening under the hood in case your particular driver is misbehaving.

If this article interests you and you want to learn more about the topic, please let me know! I'm always happy to answer questions about CSI Drivers, Kubernetes Operators, and a myriad of other DevOps-related topics.


Using the Page Visibility API

This post takes a look at what page visibility is, how you can use the Page Visibility API in your applications, and describes pitfalls to avoid if you build features around this functionality.


A few notes on AWS Nitro Enclaves: Attack surface


By Paweł Płatek

In the race to secure cloud applications, AWS Nitro Enclaves have emerged as a powerful tool for isolating sensitive workloads. But with great power comes great responsibility—and potential security pitfalls. As pioneers in confidential computing security, we at Trail of Bits have scrutinized the attack surface of AWS Nitro Enclaves, uncovering potential bugs that could compromise even these hardened environments.

This post distills our hard-earned insights into actionable guidance for developers deploying Nitro Enclaves. After reading, you’ll be equipped to:

  • Identify and mitigate key security risks in your enclave deployment
  • Implement best practices for randomness, side-channel protection, and time management
  • Avoid common pitfalls in virtual socket handling and attestation

We’ll cover a number of topics, including vsock communication, randomness, side channels, memory, and time.

Whether you’re new to Nitro Enclaves or looking to harden existing deployments, this guide will help you navigate the unique security landscape of confidential computing on AWS.

A brief threat model

First, a brief threat model. Enclaves can be attacked from the parent Amazon EC2 instance, which is the only component that has direct access to an enclave. In the context of an attack on an enclave, we should assume that the parent instance’s kernel (including its nitro_enclaves drivers) is controlled by the attacker. DoS attacks from the instance are not really a concern, as the parent can always shut down its enclaves.

If the EC2 instance forwards user traffic from the internet, then attacks on its enclaves could come from that direction and could involve all the usual attack vectors (business-logic, memory corruption, cryptographic, etc.). And in the other direction, users could be targeted by malicious EC2 instances with impersonation attacks.

In terms of trust zones, an enclave should be treated as a single trust zone. Enclaves run normal Linux and can theoretically use its access control features to “draw lines” within themselves. But that would be pointless—adversarial access (e.g., via a supply-chain attack) to anything inside the enclave would diminish the benefits of its strong isolation and of attestation. Therefore, compromise of a single enclave component should be treated as a total enclave compromise.

Finally, the hypervisor is trusted—we must assume it behaves correctly and not maliciously.

Figure 1: A simplified model of the AWS Nitro Enclaves system

Vsocks

The main entrypoint to an enclave is the local virtual socket (vsock). Only the parent EC2 instance can use the socket. Vsocks are managed by the hypervisor—the hypervisor provides the parent EC2 instance’s and the enclave’s kernels with /dev/vsock device nodes.

Vsocks are identified by a context identifier (CID) and port. Every enclave must use a unique CID (which can be set during initialization) and can listen on multiple ports. There are a few predefined CIDs:

  • VMADDR_CID_HYPERVISOR = 0
  • VMADDR_CID_LOCAL = 1
  • VMADDR_CID_HOST = 2
  • VMADDR_CID_PARENT = 3 (the parent EC2 instance)
  • VMADDR_CID_ANY = 0xFFFFFFFF = -1U (listen on all CIDs)

Enclaves usually use only the VMADDR_CID_PARENT CID (to send data) and the VMADDR_CID_ANY CID (to listen for data). An example use of the VMADDR_CID_PARENT can be found in the init.c module of AWS’s enclaves SDK—the enclave sends a “heartbeat” signal to the parent EC2 instance just after initialization. The signal is handled by the nitro-cli tool.

Standard socket-related issues are the main issues to worry about when it comes to vsocks. When developing an enclave, consider the following to ensure such issues cannot enable certain attack vectors:

  • Does the enclave accept connections asynchronously (with multithreading)? If not, a single user may block other users from accessing the enclave for a long period of time.
  • Does the enclave time out connections? If not, a single user may persistently occupy a socket or open multiple connections to the enclave and drain available resources (like file descriptors).
  • If the enclave uses multithreading, is its state synchronization correctly implemented?
  • Does the enclave handle errors correctly? Reading from a socket with the recv method is especially tricky. A common pattern is to loop over the recv call until the desired number of bytes is received, but this pattern should be carefully implemented:
    • If the EINTR error is returned, the enclave should retry the recv call. Otherwise, the enclave may drop valid and live connections.
    • If there is no error but the returned length is 0, the enclave should break the loop. Otherwise, the peer may shut down the connection before sending the expected number of bytes, making the enclave loop infinitely.
    • If the socket is non-blocking, then reading data correctly is even more tricky.

The main risk of these issues is DoS. The parent EC2 instance may shut down any of its enclaves, so the actual risks are present only if a DoS can be triggered by external users. Providing timely access to the system is the responsibility of both the enclave and the EC2 instance communicating with the enclave.

Another vulnerability class involving vsocks is CID confusion: if an EC2 instance runs multiple enclaves, it may send data to the wrong one (e.g., due to a race condition issue). However, even if such a bug exists, it should not pose much risk or contribute much to an enclave’s attack surface, because traffic between users and the enclave should be authenticated end to end.

Finally, note that enclaves use the SOCK_STREAM socket type by default. If you change the type to SOCK_DGRAM, do some research to learn about the security properties of this communication type.

Randomness

Enclaves must have access to secure randomness. The word “secure” in this context means that adversaries don’t know or control all the entropy used to produce random data. On Linux, a few entropy sources are mixed together by the kernel. Among them are the CPU-provided RDRAND/RDSEED source and platform-provided hardware random number generators (RNGs). The AWS Nitro Trusted Platform Module provides its own hardware RNG (called nsm-hwrng).

Figure 2: Randomness sources in the Linux kernel

The final randomness can be obtained via the getrandom system call or from (less reliable) /dev/{u}random devices. There is also the /dev/hwrng device, which gives more direct access to the selected hardware RNG. This device should not be used by user-space applications.

When a new hardware RNG is registered by the kernel, it is used right away to add entropy to the system. A list of available hardware RNGs can be found in the /sys/class/misc/hw_random/rng_available file. One of the registered RNGs is selected automatically to periodically add entropy and is indicated in the /sys/devices/virtual/misc/hw_random/rng_current file.

We recommend configuring your enclaves to explicitly check that the current RNG (rng_current) is set to nsm-hwrng. This check will ensure that the AWS Nitro RNG was successfully registered and that it’s the one the kernel uses periodically to add entropy.
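
For example, a minimal runtime check might look like this (a Node.js sketch, assuming a JavaScript runtime is available inside your enclave):

import { readFileSync } from "node:fs";

// Verify that the kernel's currently selected hardware RNG is the Nitro one.
const current = readFileSync(
    "/sys/devices/virtual/misc/hw_random/rng_current", "utf8").trim();
if (current !== "nsm-hwrng") {
  throw new Error(`unexpected hardware RNG: ${current}`);
}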

To further boost the security of your enclave’s randomness, have it pull entropy from external sources whenever there are convenient sources available. A common external source is the AWS Key Management Service, which provides a convenient GenerateRandom method that enclaves can use to bring in entropy over an encrypted channel.
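
A sketch of what that might look like with the AWS SDK for JavaScript (assuming the enclave can reach KMS, typically through a vsock proxy; mix the result into your own randomness pool rather than using it as-is):

import { KMSClient, GenerateRandomCommand } from "@aws-sdk/client-kms";

const kms = new KMSClient({ region: "us-east-1" });

// Ask KMS for 32 bytes of randomness over TLS and treat it as one extra
// entropy input, not as the sole source.
const { Plaintext } = await kms.send(new GenerateRandomCommand({ NumberOfBytes: 32 }));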

If you want to follow NIST/AIS standards (see section 5.3.1 in “Documentation and Analysis of the Linux Random Number Generator”) or suspect issues with the RDRAND/RDSEED instructions (see also this LWN article and this tweet), you can disable the random.trust_{bootloader,cpu} kernel parameters. That will inform the kernel not to include these sources for estimation of available entropy.

Lastly, make sure that your enclaves use a kernel version greater than 5.17.12, as important changes were introduced to the kernel’s random algorithm in that release.

Side channels

Application-level timing side-channel attacks are a threat to enclaves, as they are to any application. Applications running inside enclaves must process confidential data in constant time. Attacks from the parent EC2 instance can use almost system-clock-precise time measurements, so don’t count on network jitter for mitigations. You can read more about timing attack vectors in our blog post “The life of an optimization barrier.”

Also, though this doesn’t really constitute a side-channel attack, error messages returned by an enclave can be used by attackers to reason about the enclave’s state. Think about issues like padding oracles and account enumeration. We recommend keeping errors returned by enclaves as generic as possible. How generic errors should be will depend on the given business requirements, as users of any application will need some level of error tracing.

CPU memory side channels

The main type of side-channel attack to know about involves CPU memory. CPUs share some memory—most notably the cache lines. If memory is simultaneously accessible to two components from different trust zones—like an enclave and its parent EC2 instance—then it may be possible for one component to indirectly leak the other component’s data via measurements of memory access patterns. Even if an application processes secret data in constant time, attackers with access to this type of side channel can exploit data-dependent branching.

In a typical architecture, CPUs can be categorized into NUMA nodes, CPU cores, and CPU threads. The smallest physical processing unit is the CPU core. The core may have multiple logical threads (virtual CPUs)—the smallest logical processing units—and threads share L1 and L2 cache lines. The L3 line (also called the last-level cache) is shared between all cores in a NUMA node.

Figure 3: Example CPU arrangement of a system, obtained by the lstopo command

Parent EC2 instances may have been allocated only a few CPU cores from a NUMA node. Therefore, they may share an L3 cache with other instances. However, the AWS white paper “The Security Design of the AWS Nitro System” claims that the L3 cache is never shared simultaneously. Unfortunately, there is not much more information on the topic.

Figure 4: An excerpt from the AWS white paper, stating that instances with one-half the max amount of CPUs should fill a whole CPU core (socket?)

What about CPUs in enclaves? CPUs are taken from the parent EC2 instance and assigned to an enclave. According to the AWS and nitro-cli source code, the hypervisor enforces the following:

  • The CPU #0 core (all its threads) is not assignable to enclaves.
  • Enclaves must use full cores.
  • All cores assigned to an enclave must be from the same NUMA node.

In the worst case, an enclave will share the L3 cache with its parent EC2 instance (or with other enclaves). However, whether the L3 cache can be used to carry out side-channel attacks is debatable. On one hand, the AWS white paper doesn’t make a big deal of this attack vector. On the other hand, recent research indicates the practicality of such an attack (see “Last-Level Cache Side-Channel Attacks Are Feasible in the Modern Public Cloud”).

If you are very concerned about L3 cache side-channel attacks, you can run the enclave on a full NUMA node. To do so, you would have to allocate more than one full NUMA node to the parent EC2 instance so that one NUMA node can be used for the enclave while saving some CPUs on the other NUMA node for the parent. Note that this mitigation is resource-inefficient and costly.

Alternatively, you can experiment with Intel’s Cache Allocation Technology (CAT) to isolate the enclave’s L3 cache (see the intel-cmt-cat software) from the parent. Note, however, that we don’t know whether CAT can be changed dynamically for a running enclave—that would render this solution useless.

If you implement any of the above mitigations, you will have to add relevant information to the attestation. Otherwise, users won’t be able to ensure that the L3 side-channel attack vector was really mitigated.

Anyway, you want your security-critical code (like cryptography) to be implemented with secrets-independent memory access patterns. Both hardware- and software-level security controls are important here.

Memory

Memory for enclaves is carved out from parent EC2 instances. It is the hypervisor’s responsibility to protect access to an enclave’s memory and to clear it after it’s returned to the parent. When it comes to enclave memory as an attack vector, developers really only need to worry about DoS attacks. Applications running inside an enclave should have limits on how much data external users can store. Otherwise, a single user may be able to consume all of an enclave’s available memory and crash the enclave (try running cat /dev/zero inside the enclave to see how it behaves when a large amount of memory is consumed).

So how much space does your enclave have? The answer is a bit complicated. First of all, the enclave’s init process doesn’t mount a new root filesystem, but keeps the initial initramfs and chroots to a directory (though there is a pending PR that will change this behavior once merged). This puts some limits on the filesystem’s size. Also, data saved in the filesystem will consume available RAM.

You can check the total available RAM and filesystem space by executing the free command inside the enclave. The filesystem’s size limit should be around 40–50% of that total space. You can confirm that by filling the whole filesystem’s space and checking how much data ends up being stored there:

# fill the filesystem with zeros until the write fails
dd count=9999999999 if=/dev/zero > /fillspace
# then check how much data was actually stored
du -h -d1 /

Another issue with memory is that the enclave doesn’t have any persistent storage. Once it is shut down, all its data is lost. Moreover, AWS Nitro doesn’t provide any specific data sealing mechanism. It’s your application’s responsibility to implement it. Read our blog post “A trail of flipping bits” for more information.

Time

A less common source of security issues is an enclave’s time source—namely, from where the enclave gets its time. An attacker who can control an enclave’s time could perform rollback and replay attacks. For example, the attacker could switch the enclave’s time to the past and make the enclave accept expired TLS certificates.

Getting a trusted source of time may be a somewhat complex problem in the space of confidential computing. Fortunately, enclaves can rely on the trusted hypervisor for delivery of secure clock sources. From the developer’s side, there are only three actions worth taking to improve the security and correctness of your enclave’s time sources:

  • Ensure that current_clocksource is set to kvm-clock in the enclave’s kernel configuration; consider even adding an application-level runtime check for the clock, in case something goes wrong during enclave bootstrapping and the enclave ends up with a different clock source (a minimal check is sketched after this list).
  • Enable the Precision Time Protocol (PTP) for better clock synchronization between the enclave and the hypervisor. It’s similar to the Network Time Protocol (NTP), but the time is delivered directly by the hypervisor rather than over the network, so it should be more secure (it has a smaller attack surface) and easier to set up than NTP.
  • For security-critical functionalities (like replay protections) use Unix time. Be careful with UTC and time zones, as daylight saving time and leap seconds may “move time backwards.”
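
Here is a minimal sketch of such an application-level runtime check in Rust; it simply reads the same sysfs file discussed in the next section and refuses to continue if the enclave booted with anything other than kvm-clock.

use std::fs;

/// Minimal sketch of a clock-source check (assumes a standard Linux sysfs
/// layout inside the enclave).
fn assert_kvm_clock() -> Result<(), String> {
    let path = "/sys/devices/system/clocksource/clocksource0/current_clocksource";
    let source = fs::read_to_string(path)
        .map_err(|e| format!("cannot read {path}: {e}"))?;
    if source.trim() == "kvm-clock" {
        Ok(())
    } else {
        Err(format!("unexpected clock source: {}", source.trim()))
    }
}

fn main() {
    // Abort early if the enclave booted with the wrong clock source.
    assert_kvm_clock().expect("clock-source check failed");
}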

Why kvm-clock?

Machines using an x86 architecture can have a few different sources of time. We can use the following command to check the sources available to enclaves:

cat /sys/devices/system/clocksource/clocksource0/available_clocksource

Enclaves should have two sources: tsc and kvm-clock (you can see them if you run a sample enclave and check its sources); the latter is enabled by default, as can be checked in the current_clocksource file. How do these sources work?

The TSC mechanism is based on the Time Stamp Counter register. It is a per-CPU monotonic counter implemented as a model-specific register (MSR). Every (virtual) CPU has its own register. The counter increments with every CPU cycle (more or less). Linux computes the current time based on the counter scaled by the CPU’s frequency and some initial date.

We can read (and write!) TSC values if we have root privileges. To do so, we need the TSC MSR’s offset (16, i.e., 0x10) and its size (8 bytes). MSR registers can be accessed through the /dev/cpu device files:

# read the 8-byte TSC value from MSR 0x10 (offset 16) on CPU 0
dd iflag=count_bytes,skip_bytes count=8 skip=16 if=/dev/cpu/0/msr
# write a new 8-byte value back to the same MSR
dd if=<(echo "34d6 f1dc 8003 0000" | xxd -r -p) of=/dev/cpu/0/msr seek=16 oflag=seek_bytes

The TSC can also be read with the clock_gettime method using the CLOCK_MONOTONIC_RAW clock ID, and with the RDTSC assembly instruction.
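
To illustrate the userland path, here is a minimal Rust sketch that reads the counter with the RDTSC instruction (x86_64 only); it bypasses whatever clock source the kernel has selected.

// Minimal sketch: reading the Time Stamp Counter directly from userland on
// x86_64, bypassing the kernel's selected clock source entirely.
#[cfg(target_arch = "x86_64")]
fn read_tsc() -> u64 {
    // RDTSC is not a privileged instruction, so no root access is required.
    unsafe { std::arch::x86_64::_rdtsc() }
}

#[cfg(target_arch = "x86_64")]
fn main() {
    let t0 = read_tsc();
    let t1 = read_tsc();
    println!("tsc = {t0}, delta between two reads = {}", t1 - t0);
}

#[cfg(not(target_arch = "x86_64"))]
fn main() {}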

Theoretically, if we change the TSC, the wall clock reported by clock_gettime with the CLOCK_REALTIME clock ID, by the gettimeofday function, and by the date command should change. However, the Linux kernel works hard to try to make TSCs behave reasonably and be synchronized with each other (for example, check out the tsc watchdog code and functionality related to the MSR_IA32_TSC_ADJUST register). So breaking the clock is not that easy.

The TSC can be used to track time elapsed, but where do enclaves get the “some initial date” from which the time elapsed is counted? Usually, in other systems, that date is obtained using the NTP. However, enclaves do not have out-of-the-box access to the network and don’t use the NTP (see slide 26 of this presentation from AWS’s 2020 re:Invent conference).

Figure 5: Possible sources of time for an enclave

With the tsc clock and no NTP, the initial date is somewhat randomly selected—the truth is we haven’t determined where it comes from. You can force an enclave to boot without the kvm-clock by passing the no-kvmclock no-kvmclock-vsyscall kernel parameters (but note that these parameters should not be provided at runtime) and check the initial date for yourself. In our experiments, the date was:

Tue Nov 30 00:00:00 UTC 1999

As you can see, the TSC mechanism doesn’t work well with enclaves. Moreover, it breaks badly when the machine is virtualized. Because of that, AWS introduced the kvm-clock as the default source of time for enclaves. It is an implementation of the paravirtual clock driver (pvclock) protocol (see this article and this blog post for more info on pvclock). With this protocol, the host (the AWS Nitro hypervisor in our case) provides the pvclock_vcpu_time_info structure to the guest (the enclave). The structure contains information that enables the guest to adjust its time measurements—most notably, the host’s wall clock (system_time field), which is used as the initial date.

Interestingly, the guest’s userland applications can use the TSC mechanism even if the kvm-clock is enabled. That’s because the RDTSC instruction is (usually) not emulated and therefore may provide non-adjusted TSC register readings.

Please note that if your enclaves use different clock sources or enable NTP, you should do some additional research to see if there are related security issues.

Attestation

Cryptographic attestation is the source of trust for end users. It is essential that users correctly parse and validate attestations. Fortunately, AWS provides good documentation on how to consume attestations.

The most important attestation data is protocol-specific, but we have a few generally applicable tips for developers to keep in mind (in addition to what’s written in the AWS documentation):

  • The enclave should enforce a minimal nonce length.
  • Users should check the timestamp provided in the attestation in addition to nonces (a minimal freshness check is sketched after this list).
  • The attestation’s timestamp should not be used to reason about the enclave’s time. This timestamp may differ from the enclave’s time, as the former is generated by the hypervisor, and the latter by whatever clock source the enclave is using.
  • Don’t use RSA for the public_key feature.
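
To make the nonce and timestamp checks concrete, here is a minimal Rust sketch. The field names and limits are hypothetical: the real attestation document is a COSE-signed CBOR structure whose signature and certificate chain must be validated first, as described in the AWS documentation.

use std::time::{SystemTime, UNIX_EPOCH};

// Hypothetical, already-verified attestation fields.
struct AttestationFields {
    timestamp_ms: u64, // set by the hypervisor, in milliseconds since the Unix epoch
    nonce: Vec<u8>,    // echoed back from the attestation request
}

const MIN_NONCE_LEN: usize = 16;        // enforce a minimal nonce length
const MAX_AGE_MS: u64 = 5 * 60 * 1000;  // reject stale attestations (illustrative)

fn check_attestation(doc: &AttestationFields, expected_nonce: &[u8]) -> Result<(), String> {
    if doc.nonce.len() < MIN_NONCE_LEN || doc.nonce.as_slice() != expected_nonce {
        return Err("bad or missing nonce".into());
    }
    let now_ms = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map_err(|e| e.to_string())?
        .as_millis() as u64;
    // The timestamp is produced by the hypervisor's clock, so allow some skew
    // relative to the verifier's clock.
    if now_ms.saturating_sub(doc.timestamp_ms) > MAX_AGE_MS {
        return Err("attestation document is too old".into());
    }
    Ok(())
}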

The NSM driver

Your enclave applications will use the NSM driver, which is accessible via the /dev/nsm node. Its source code can be found in the aws-nitro-enclaves-sdk-bootstrap and kernel repositories. Applications communicate with the driver via the IOCTL system call and can use the nsm-api library to do so.

Developers should be aware that applications running inside an enclave may misuse the driver or the library. However, there isn’t much that can go wrong if developers take these steps:

  • The driver lets you extend and lock more platform configuration registers (PCRs) than the basic 0–4 and 8 PCRs. Locked PCRs cannot be extended, and they are included in enclave attestations. How these additional PCRs are used depends on how you configure your application. Just make sure that it distinguishes between locked and unlocked ones.
  • Remember to make the application check the PCRs’ lock state when sending the DescribePCR request to the NSM driver. Otherwise, it may rely on a PCR that can still be manipulated (see the sketch after this list).
  • Requests and responses are CBOR-encoded. Make sure to get the encoding right. Incorrectly decoded responses may provide false data to your application.
  • It is not recommended to use the nsm_get_random method directly. It skips the kernel’s algorithm for mixing multiple entropy sources and therefore is more prone to errors. Instead, use common randomness APIs (like getrandom).
  • The nsm_init method returns -1 on error, which is an unusual behavior in Rust, so make sure your application accounts for that.
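
As an example of the lock-state check, here is a minimal Rust sketch built on the nsm-api library; the exact request and response shapes are assumptions based on its public API and should be checked against the crate version you use.

use aws_nitro_enclaves_nsm_api::api::{Request, Response};
use aws_nitro_enclaves_nsm_api::driver::{nsm_exit, nsm_init, nsm_process_request};

fn pcr_is_locked(index: u16) -> Result<bool, String> {
    let fd = nsm_init();
    // nsm_init signals failure by returning -1 rather than a Result.
    if fd < 0 {
        return Err("failed to open /dev/nsm".into());
    }
    let response = nsm_process_request(fd, Request::DescribePCR { index });
    nsm_exit(fd);
    match response {
        // Only rely on PCRs whose lock flag is set.
        Response::DescribePCR { lock, .. } => Ok(lock),
        _ => Err("unexpected response from the NSM driver".into()),
    }
}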

That’s (not) all folks

Securing AWS Nitro Enclaves requires vigilance across multiple attack vectors. By implementing the recommendations in this post—from hardening virtual sockets to verifying randomness sources—you can significantly reduce the risk of compromise to your enclave workloads, helping shape a more secure future for confidential computing.

Key takeaways:

  1. Treat enclaves as a single trust zone and implement end-to-end security.
  2. Mitigate side-channel risks through proper CPU allocation and constant-time processing.
  3. Verify enclave entropy sources at runtime.
  4. Use the right time sources inside the enclave.
  5. Implement robust attestation practices, including nonce and timestamp validation.

For more security considerations, see our first post on enclave images and attestation. If your enclave uses external systems—like AWS Key Management Service or AWS Certificate Manager—review the systems and supporting tools for additional security footguns.

We encourage you to critically evaluate your own Nitro Enclave deployments. Trail of Bits offers in-depth security assessments and custom hardening strategies for confidential computing environments. If you’re ready to take your Nitro Enclaves’ security to the next level, contact us to schedule a consultation with our experts and ensure that your sensitive workloads remain truly confidential.


Multiple Anchors | CSS-Tricks


Only Chris, right? You’ll want to view this in a Chromium browser:

This is exactly the sort of thing I love, not for its practicality (cuz it ain’t), but for how it illustrates a concept. Generally, tutorials and demos try to follow the “rules” — whatever those may be — yet breaking them helps you understand how a certain thing works. This is one of those.

The concept is pretty straightforward: one target element can be attached to multiple anchors on the page.

<div class="anchor-1"></div>
<div class="anchor-2"></div>
<div class="target"></div>

We’ve gotta register the anchors and attach the .target to them:

.anchor-1 {
  anchor-name: --anchor-1;
}

.anchor-2 {
  anchor-name: --anchor-2;
}

.target {
  
}

Wait, wait! I didn’t attach the .target to the anchors. That’s because we have two ways to do it. One is using the position-anchor property.

.target {
  position-anchor: --anchor-1;
}

That establishes a target-anchor relationship between the two elements. But it only accepts a single anchor value. Hmm. We need more than that. That’s what the anchor() function can do. Well, it doesn’t take multiple values, but we can declare it multiple times on different inset properties, each referencing a different anchor.

.target {
  top: anchor(--anchor-1 bottom);
}

The second piece of anchor()’s function is the anchor edge we’re positioned against, and it’s gotta be some sort of physical or logical inset — top, bottom, start, end, inside, outside, etc. — or a percentage. We’re basically saying, “Take that .target and slap its top edge against --anchor-1’s bottom edge.”

That also works for other inset properties:

.target {
  top: anchor(--anchor-1 bottom);
  left: anchor(--anchor-1 right);
  bottom: anchor(--anchor-2 top);
  right: anchor(--anchor-2 left);
}

Notice how both anchors are declared on different properties by way of anchor(). That’s rad. But we aren’t actually anchored yet because the .target is just like any other element that participates in the normal document flow. We have to yank it out with absolute positioning for the inset properties to take hold.

.target {
  position: absolute;

  top: anchor(--anchor-1 bottom);
  left: anchor(--anchor-1 right);
  bottom: anchor(--anchor-2 top);
  right: anchor(--anchor-2 left);
}

In his demo, Chris cleverly attaches the .target to two <textarea> elements. What makes it clever is that <textarea> allows you to click and drag it to change its dimensions. The two of them are absolutely positioned, one pinned to the viewport’s top-left edge and one pinned to the bottom-right.

If we attach the .target's top and left edges to --anchor-1‘s bottom and right edges, then attach the target's bottom and right edges to --anchor-2‘s top and left edges, we’re effectively anchored to the two <textarea> elements. This is what allows the .target element to stretch with the <textarea> elements when they are resized.

But there’s a small catch: a <textarea> is resized from its bottom-right corner. The second <textarea> is positioned in a way where the resizer isn’t directly attached to the .target. If we rotate(180deg), though, it’s all good.

Again, you’ll want to view that in a Chromium browser at the time I’m writing this. Here’s a clip instead if you prefer.

That’s just a background-color on the .target element. We can put a little character in there instead as a background-image like Chris did to polish this off.

Fun, right?! It still blows my mind this is all happening in CSS. It wasn’t many days ago that something like this would’ve been a job for JavaScript.

Direct Link →


Scaling virtio-blk disk I/O with IOThread Virtqueue Mapping


This article covers the IOThread Virtqueue Mapping feature for Kernel-based virtual machine (KVM) guests that was introduced in Red Hat Enterprise Linux (RHEL) 9.4.

The problem

Modern storage evolved to keep pace with growing numbers of CPUs by providing multiple queues through which I/O requests can be submitted. This allows CPUs to submit I/O requests and handle completion interrupts locally. The result is good performance and scalability on machines with many CPUs.

Although virtio-blk devices in KVM guests have multiple queues by default, they do not take advantage of multi-queue on the host. For guests with the <driver io='native' …> libvirt domain XML setting, I/O requests from all queues are processed in a single thread on the host. This single thread can become a bottleneck for I/O-bound workloads.

KVM guests can now benefit from multiple host threads for a single device through the new IOThread Virtqueue Mapping feature. This improves I/O performance for workloads where the single thread is a bottleneck. Guests with many vCPUs should use this feature to take advantage of additional capacity provided by having multiple threads.

If you are interested in the QEMU internals involved in developing this feature, you can find out more in this blog post and this KVM Forum presentation. Making QEMU’s block layer thread safe was a massive undertaking that we are proud to have contributed upstream.

How IOThread Virtqueue Mapping works

IOThread Virtqueue Mapping lets users assign individual virtqueues to host threads, called IOThreads, so that a virtio-blk device is handled by more than one thread. Each virtqueue can be assigned to one IOThread.

Most users will opt for round-robin assignment so that virtqueues are automatically spread across a set of IOThreads. Figure 1 illustrates how 4 queues are assigned in round-robin fashion across 2 IOThreads.

Figure 1: A virtio-blk device with 4 queues assigned to 2 IOThreads.

The libvirt domain XML for this configuration looks like this:

<domain>
  …
  <vcpu>4</vcpu>
  <iothreads>2</iothreads>
  …
  <devices>
    <disk …>
      <driver name='qemu' cache='none' io='native' …>
        <iothreads>
          <iothread id='1'></iothread>
          <iothread id='2'></iothread>
        </iothreads>
      </driver>
      …
    </disk>
  </devices>
</domain>

More details on the syntax can be found in the libvirt documentation.

Configuration tips

The following recommendations are based on our experience developing and benchmarking this feature:

  • Use 4-8 IOThreads. Usually this is sufficient to saturate disks. Adding more threads beyond the point of saturation does not increase performance and may harm it.

  • Share IOThreads between devices unless you know in advance that certain devices are heavily utilized. Keeping a few IOThreads busy but not too busy is ideal.

  • Pin IOThreads away from vCPUs with <iothreadpin> and <vcpupin> if you have host CPUs to spare (a sketch follows this list). IOThreads need to respond quickly when the guest submits I/O. Therefore they should not compete for CPU time with the guest’s vCPU threads.

  • Use <driver io='native' cache='none' …>. IOThread Virtqueue Mapping was designed for io='native'. Using io='threads' is not recommended, as it does not combine with IOThread Virtqueue Mapping in a useful way.
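
As an illustration of the pinning tip above (the CPU numbers are arbitrary examples, matching the 4-vCPU/2-IOThread configuration shown earlier), the <cputune> section of the domain XML might look like this, keeping vCPUs and IOThreads on disjoint host CPUs:

<cputune>
  <vcpupin vcpu='0' cpuset='0'/>
  <vcpupin vcpu='1' cpuset='1'/>
  <vcpupin vcpu='2' cpuset='2'/>
  <vcpupin vcpu='3' cpuset='3'/>
  <iothreadpin iothread='1' cpuset='8'/>
  <iothreadpin iothread='2' cpuset='9'/>
</cputune>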

Performance

The following random read disk I/O benchmark compares IOThread Virtqueue Mapping with 2 and 4 IOThreads against a guest without IOThread Virtqueue Mapping (only 1 IOThread). The guest was configured with 8 vCPUs all submitting I/O in parallel. See Figure 2.

Figure 2: Random read 4 KB benchmark results for iodepth 1 and 64 with IOPS increasing when comparing 1, 2, and 4 IOThreads.

The most important fio benchmark options are shown here:

fio --ioengine=libaio --rw=randread --bs=4k --numjobs=8 --direct=1
    --cpus_allowed=0-7 --cpus_allowed_policy=split

This microbenchmark shows that when 1 IOThread is unable to saturate a disk, adding more IOThreads with IOThread Virtqueue Mapping is a significant improvement. Virtqueues were assigned round-robin to the IOThreads. The disk was an Intel Optane SSD DC P4800X and the guest was running Fedora 39 x86_64. The libvirt domain XML, fio options, benchmark output, and an Ansible playbook are available here.

Real workloads may benefit less depending on how I/O bound they are and whether they submit I/O from multiple vCPUs. We recommend benchmarking your workloads to understand the effect of IOThread Virtqueue Mapping.

A companion blog post explores database performance with IOThread Virtqueue Mapping.

Conclusion

The new IOThread Virtqueue Mapping feature in RHEL 9.4 improves scalability of disk I/O for guests with many vCPUs. Enabling this feature on your KVM guests with virtio-blk devices can boost performance of I/O bound workloads.

