RLMs in DSPy

Recursive Language Models are a new strategy for dealing with long context problems. We've implemented them in DSPy so you can quickly and easily try them with your existing DSPy programs or with new tasks.

Many of us are familiar with the perils of context rot. As our contexts grow, LLM performance drops significantly for many types of tasks. This is particularly problematic for agentic and exploration tasks, since the context keeps growing the longer the agent works.

Recursive Language Models, a new strategy developed by Alex Zhang and Omar Khattab, address the context rot problem by providing LLMs with a separate environment to store information (in this case, a Python instance), from which the LLM can dynamically load context into the token space as needed. This environment is persisted and shared among subagents, allowing the LLM to ask questions about and explore the information without loading it into its main context.

This simple harness - a shared environment where LLMs can recursively interact with input context as variables - proves to be incredibly effective when dealing with very large inputs. We've used RLMs to summarize hundreds of megabytes of logs, perform coding tasks across massive multi-project codebases, and source evidence across a large collection of books.

We have implemented the RLM pattern in DSPy, allowing you to quickly and easily try RLMs with your existing DSPy programs or with new tasks. Today we're going to walk through how RLMs work, to establish a mental model for when and how you might want to apply them, then get you up and running with an example in DSPy.

RLMs Manage Two Buckets of Context

RLMs work by providing an LLM with a REPL-like interface (think: a Jupyter Notebook), where it can explore, analyze, and load information by writing Python code. There is the variable space (the information stored in the REPL) and the token space (the context extracted from the variable space).

In a normal coding agent, you might provide the following context:

Your inputs are the following: Context: {LONG_context}, Other Inputs: {LONG_other_inputs}

If your inputs are sufficiently long, you could already be triggering context rot. Or, if your context is really long, it might not even fit in the model's context window.

With an RLM, on the other hand, the following context is provided:

Your inputs are the following: Context, Other Inputs.

You can access them inside your repl as variables. The variables are `context` and `other_inputs` respectively.

Previews:
context: {context[:100]}
other_inputs: {other_inputs[:100]}

We then prompt the LLM to write code in whatever language the REPL is implemented in, which for both Alex's and DSPy's implementations is Python.

Then you run the code, append the output to history, and repeat.
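
To make that loop concrete, here's a minimal sketch of the harness. This is a simplification for intuition, not DSPy's actual implementation; the llm and run_python callables are stand-ins for a model call and the sandboxed REPL.

def rlm_loop(prompt_with_previews, llm, run_python, max_iterations=10):
    # `llm` is any text-in/text-out model call; `run_python` executes code in the
    # persistent REPL and returns its output. Both are stand-ins for illustration.
    history = [prompt_with_previews]
    for _ in range(max_iterations):
        step = llm("\n".join(history) +
                   "\n\nWrite Python to make progress, or reply FINISH: <answer> when done.")
        if step.strip().startswith("FINISH:"):
            return step.split("FINISH:", 1)[1].strip()
        output = run_python(step)                # run the code in the shared environment
        history.append(f">>> {step}\n{output}")  # append the output to history and repeat
    return llm("\n".join(history) + "\n\nGive your best final answer.")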

Recursively Prompting LLMs in the REPL

The "Recursion" in "RLM" describes the LLM's ability to prompt itself, which we allow it to do in the REPL. This ability is exposed as a function.

In the case of dspy.RLM, we implement a single sub_llm() call. The main LLM can prepare a prompt and task a sub LLM with working on some information in the variable space. The results are returned in the variable space, as with any other function call in a REPL, and the main LLM can choose whether or not to load them into its token space.

Part of the beauty of this is that how the LLM splits up the work is undefined. Given a list of 10 long documents, the LLM could choose to split the work into 10 subcalls, or combine the work and parse the outputs, chunk sequentially, etc.
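
For instance, given that list of 10 documents (here assumed to live in a REPL variable called documents), the code the main LLM writes inside the REPL might look something like this; the exact sub_llm signature shown is illustrative:

# Hypothetical code emitted by the main LLM inside the REPL.
findings = []
for doc in documents:                       # `documents` is an input variable in the REPL
    findings.append(sub_llm(
        "List every claim in this document that mentions pricing:\n" + doc[:50_000]
    ))

# Results stay in the variable space; only a compact digest enters the token space.
print("\n---\n".join(f[:500] for f in findings))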

This kinda sounds like Claude Code, or the way most coding agents work. They fire off subagents to do work, then return the output to the main context. It's similar, but there's a crucial difference: Claude Code, out of the box, doesn't save outputs to a variable space that it can manipulate. For example, a Claude Code subagent returns a blob of text back into the context by default.

If Claude Code were to adopt a pattern where subagents write their results to files, we could consider this an RLM pattern.

And this turns out to be the difference maker. By providing the LLMs with a shared space to explore and store information outside the token space, RLMs unlock some incredible capabilities. Context rot is mitigated and tasks that can't fit into a single context window are suddenly addressable.

DSPy is the Easiest Way to Try RLMs

By extending DSPy with the RLM-based paradigm, we are able to increase its capabilities and enforce some structure onto the RLM call.

For example, dspy.RLM gets to take advantage of the structure of the provided Signature. If your inputs include typed parameters or arbitrary data structures, that information is immediately provided to the RLM. When passing only strings, we find RLMs will spend the first few iterations just exploring the shape of the information. Signatures help us avoid this step.

Perhaps the best feature of dspy.RLM is that it works with all your existing Signatures. No need to tweak them, redesign your parameters, or issue special instructions. dspy.RLM is simply a new inference time strategy (just like Predict or ChainOfThought) that we can modularly swap in or out.

The only detail to note is that RLMs require LLMs with strong reasoning and coding capabilities. The RLM strategy leverages the coding skills of larger models to solve long context problems - that's the unlock. GPT-5 and Opus versions work great with RLMs, though we continue to be surprised at how effective Kimi K2 is as well, given how cheap and fast it is.
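
Pointing DSPy at such a model is the usual one-liner (the model identifier below is illustrative):

import dspy

# Any strong reasoning + coding model; the identifier here is just an example.
dspy.configure(lm=dspy.LM("openai/gpt-5"))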

An Example RLM with DSPy

Creating an RLM with DSPy is easy:

signature = "logs, question -> answer"
rlm = dspy.RLM(signature)
result = rlm(
    logs = all_my_logs,
    question = "Did anyone ask my agent about ice cream this week?"
)

The only line above that's specific to RLMs is dspy.RLM, which is the Module we use instead of Predict, ChainOfThought, or ReAct.

When you call a program using the RLM module, DSPy creates and manages a local, isolated Python sandbox using Deno.

You can install Deno with: curl -fsSL https://deno.land/install.sh | sh. See the Deno Installation Docs for more details.

Your inputs are loaded into this environment as variables and the LLM is given a prompt DSPy prepares.

In our example above, we're using a string signature, but dspy.RLM works perfectly well with class-based signatures:

class CodebaseUnderstanding(dspy.Signature):
    """
    Find all of the files from the provided codebase that would be helpful for understanding the given feature.
    """
    code_tree: dict = dspy.InputField()
    feature: str = dspy.InputField()
    relevant_filepaths: list[str] = dspy.OutputField()

codebase_subsetter = dspy.RLM(CodebaseUnderstanding)

What's important to note here is that all the input variables - in this case code_tree and feature - are treated the same way.

If you've read about RLMs and/or tried Alex's library, you may be used to the pattern where an RLM is set up with one very long context resource (loaded into the REPL, of course) that is then used to answer a given query. It's helpful to realize that we don't need to follow this pattern - one big context and one question - with dspy.RLM. Every input can be large or small; it doesn't matter: they're all loaded into the REPL.

And as usual, DSPy helpfully provides your typed outputs in the response object. No need to worry about data extraction:

result = codebase_subsetter(
    code_tree = dspy_repo,
    feature = "RLM"
)
rlm_relevant_files = result.relevant_filepaths

We can also pass in Python functions as tools the LLM can call within the REPL:

def web_search(search_term):
    """Search the web and return relevant results as text."""
    ...  # Web search stuff

def github_search(search_term):
    """Search GitHub and return relevant results as text."""
    ...  # Gh search stuff

codebase_subsetter = dspy.RLM(
    CodebaseUnderstanding,
    tools = [web_search, github_search]
)

For harder problems, RLMs can run for quite a while. There are a few things we can do to keep a leash on the AI and keep our wallet intact.

First, we can adjust the budget we give the RLM. We have two levers here:

  1. max_iterations: This specifies how many turns (each consisting of reasoning plus a REPL call) our RLM is given to complete the task. By default this is set to 10, but for many tasks 5 works well. Check your logs (or pass in verbose=True) and try a few runs to get a feel.
  2. max_llm_calls: This parameter defines how many sub-LLM calls the main RLM can fire off from the REPL. This figure is separate from the parameter above because the RLM can fire off many sub-LLM calls within a single REPL turn.

Let me give you an example of max_llm_calls in practice:

In one task, after a couple of iterations, the model had developed and tested a prompt that performed well when given a subset of the very large context. The main LLM did some quick math and realized that the 20 LLM calls remaining in its budget were more than enough to process the entire large context in 20 separate chunks. So it did.

The final lever we have to rein in costs is the ability to specify a different LLM as the sub_lm. For example:

codebase_subsetter = dspy.RLM(
    CodebaseUnderstanding,
    tools = [web_search, github_search],
    max_iterations = 5,
    max_llm_calls = 20,
    sub_lm = gpt_5_mini
)

Just set up the LLM as you would any other DSPy LLM.
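
For example, a cheaper model for the sub-LLM calls can be defined like this (the model identifier is illustrative):

# A smaller, cheaper model used only for sub-LLM calls fired from the REPL
gpt_5_mini = dspy.LM("openai/gpt-5-mini")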

Optimize Your RLM

dspy.RLM can be optimized like any other DSPy program. Behind the scenes, it's handled similarly to dspy.ReAct: tool descriptions and signature instructions are compiled together into an instruction block that is then optimized with GEPA, MiPRO, or whatever.
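
As a rough sketch, optimizing an RLM-based program looks the same as optimizing any other module; my_metric and trainset below are placeholders you would define for your task:

# Assumes `my_metric(example, prediction, trace=None)` and `trainset` exist for your task.
optimizer = dspy.MIPROv2(metric=my_metric, auto="light")
optimized_subsetter = optimizer.compile(codebase_subsetter, trainset=trainset)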

The way dspy.RLM works with signatures and optimizers is consistent and modular. Existing programs run with RLMs just by switching out the module. This is the killer feature of DSPy: when there's a new optimizer or test-time strategy, your existing signatures should just work. Applied AI moves fast; the tasks you define shouldn't have to change.

Use Cases for RLMs

The main use case for an RLM is tasks that require reasoning across long contexts. Below are five problem shapes where RLMs shine - each involves some combination of long input, fuzzy structure, and multi-step reasoning that would be painful to decompose by hand.

  1. Search and filtering

Given a large set of documents, an RLM can search through them to find the ones that fit a given set of criteria. Downstream applications include:

  • Fuzzily filtering data or logs from a certain app or service
  • Finding outlier reviews in a large dataset
  • Scanning for incorrect traces from an LLM service
  2. Long context summarization/QA

An easy target use case for this is codebase QA. If you need to find all relevant files for a given feature, an RLM can perform grep-style operations along with things that are harder to do in bash, such as AST parsing.

  3. Multi-hop reasoning

One of the primary benchmarks used by Alex is BrowseComp. BrowseComp is a multi-hop reasoning benchmark: you must find a fact inside a corpus, then chain multiple facts from across the corpus together to answer the final question.

Most complex QA tasks involve some kind of multi-hop reasoning, and we are encouraged by the improvements that RLMs can help offer in this area.

  4. Clustering and categorization

Given a long list of items, an RLM can investigate those items and come up with clusters based on what it sees. We see this as being especially useful in analyzing data from users - it could be reviews, traces, conversation intent, etc.

  5. Dynamic symbolic manipulation of long fuzzy contexts

It may be the case that you need to do some emergent decomposition based on fuzzy properties of the data. Let's say that in each document, you know that the date is referenced somewhere but you don't know where. It is very feasible to have an RLM investigate all the possible cases, and come up with a number of formats to extract, or even to use a sub_llm to extract the date from the file.
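
As a sketch of what that might look like inside the REPL (with documents assumed to be a dict of filename to text, and sub_llm as the recursion hook described earlier):

import re

# Hypothetical REPL code: try cheap regexes first, fall back to a sub-LLM call.
date_patterns = [r"\d{4}-\d{2}-\d{2}", r"\d{1,2}/\d{1,2}/\d{4}"]
dates = {}
for name, doc in documents.items():
    for pattern in date_patterns:
        match = re.search(pattern, doc)
        if match:
            dates[name] = match.group(0)
            break
    else:
        # No known format matched; delegate just this file to a sub-LLM
        dates[name] = sub_llm("Extract this document's date (reply in ISO format only):\n" + doc[:20_000])
print(dict(list(dates.items())[:5]))   # inspect a small sample in the token space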

Isometric NYC

Fun-reliable side-channels for cross-container communication

While exploring the Linux kernel we discovered a fun side-channel that allows for cross-container communication in the most common, default container deployment scenarios on modern kernels. This is cool because it doesn’t require sharing volume mounts, nor does it involve modifying any of the default namespaces (NET, PID, IPC, etc.), or adding special privileges (no new CAP_-abilities, nor changes to seccomp or AppArmor). It works out of the box with default Docker and Kubernetes configurations, and it even works with no network at all, as we demonstrate in this post by using docker run --network none sidechannel /h4x0rchat to showcase a full cross-container IRC-style chatroom implemented on top of this side-channel.

We originally set out to find this side-channel because we wanted a way for a given container to know if another instance of its same image was already running on the host. Consider a scenario where you want to collect environmental telemetry from your containers when they first start running. Now consider that, to handle real workloads, container deployments are often scaled up with a given image running many times over simultaneously on the same host.

Humoring further consideration, if you scale the same container image thousands of times over, and the environmental telemetry is effectively the same for each instance on the same host, you’ll probably want a way to throttle how many instances report back, to save compute time and bandwidth that would be otherwise wasted on duplicate reports. Finally, imagine that you work with many teams, all of which operate with varying requirements and constraints, and as such you can’t always control (or, maybe even never control) how these containers are deployed. If only there was a way the container could identify the presence of itself already running on the same host?

Because this side-channel circumvents the intended isolation behavior of containers, it could technically be considered a vulnerability, even though we see it more as functionality we previously wished we had.

Components

The first component of this side-channel involves nsfs (the namespace filesystem), which is a special filesystem made available to userland through /proc/<pid>/ns/. The nsfs is similar to procfs in that its entries are not actual files, but instead special file-like objects which can be used for interfacing with the kernel. In particular, nsfs entries are like magical symlinks that point to namespace inode identifiers, with each namespace type being represented by its own named entry in the /proc/<pid>/ns/ directory. In practice, these magical symlinks can be used by opening a file descriptor to one and passing it to setns to enter a namespace, for example.

Unlike procfs, the nsfs entries are not unique across different mounts of the parent procfs containing the ns/* directory. This means that any namespace shared by multiple processes will result in them having the same file-like nsfs entry representing that namespace, reachable relative to each process at /proc/self/ns/<namespacename>.

The next component of this side-channel is time namespaces, which apply offsets to the system’s monotonic and boot-time clocks. The issue is not with how time namespaces are used, but in the fact that they are generally not used.

The utility of the time namespace applies only to niche scenarios like cross-host container migration, which is probably why (as far as I can tell) Docker doesn’t support setting the time namespace, and the documentation available instead instructs users to manually run unshare. In other words, not only are time namespaces shared, but there’s no easy way for the average container user to unshare them.

The important result of all of this is that by default, container and host processes all share the same /proc/self/ns/time entry, which more-or-less behaves like a file resource (or enough like one that it enables our side-channel).

It’s common that a single user namespace is shared by default across containers, and it would also lend itself to exercising this same side-channel. However, some security conscious users set up separate user namespaces to reduce kernel attack surface a tiny bit, so we don’t expect it to be shared as ubiquitously as time.

Now, let’s talk about POSIX Advisory Locks, the official Linux docs for which can be read by running man fcntl and scrolling down to the “Advisory record locking” section. In short, POSIX advisory locks provide a cooperative (vs mandatory) file locking mechanism that operates on byte offset ranges (intervals) within a given file. These locks are “process-associated”, meaning that their acquisition and entire life-cycle is bound to a single process (and its threads). These locks are not inherited by child processes, and they clear once the owning process exits. By operating on intervals, these advisory locks allow for a more explicit expression of file content usage than whole-file locking mechanisms. For example, one process might hold a read-lock for byte range 10-200, and another might hold a write-lock for range 500-600 on the same file, and because those ranges don’t overlap, neither lock would contend with the other. Since these are cooperative, holding a lock doesn’t stop other processes from reading or writing the files, and instead only stops other processes from acquiring locks of a conflicting type that intersect with the same interval.

These advisory locks have some additional interesting properties, which, when combined with a shared file resource (or even pseudo-file resource, like /proc/self/ns/time) can facilitate a side-channel:

  1. A user only needs to have a file-like resource open for reading in order to acquire a read-lock (and conversely must have the file open for writing, in order to acquire a write-lock).
  2. The file doesn’t need to actually have readable content (note that /proc/self/ns/time does not actually have anything to read, for example).
  3. The lock intervals do not need to reflect the real size of the file, and are specified using off_t, which means there are effectively 63 bits of space available in which a lock interval can be set (off_t is signed and locks cannot be placed below offset 0).
  4. A file open for reading can be queried to determine if a write-lock would hypothetically contend with any other lock, even if the querying process does not actually possess the privileges to open the file for writing.

These properties combined are enough to provide a basic cross-container side-channel primitive, because a process in one container can set a read-lock at some interval on /proc/self/ns/time, and a process in another container can observe the presence of that lock by querying for a hypothetically intersecting write-lock.

There are still yet more properties about these locks that can be used for synchronization across this side-channel, but before getting into those, presented below are small programs demonstrating cross-container communication using the fundamentals discussed above.

POSIX Advisory: Explicit Content

// setlock.c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char *argv[]) {
    if (argc < 3) {
        printf("usage: %s <offset> <len>\n", argv[0]);
        exit(1);
    }
    off_t offset = atol(argv[1]);
    off_t len = atol(argv[2]);
    int fd = open("/proc/self/ns/time", O_RDONLY);
    if (fd < 0) {
        printf("failed to open /proc/self/ns/time\n");
        exit(1);
    }
    struct flock lock;
    memset(&lock, 0, sizeof(lock));
    lock.l_type = F_RDLCK;
    lock.l_whence = SEEK_SET;
    lock.l_start = offset;
    lock.l_len = len;
    if (fcntl(fd, F_SETLK, &lock) < 0) {
        printf("fcntl() failed\n");
        exit(1);
    }
    printf("lock set at %ld:%ld, press enter to exit\n", offset, offset+len);
    getchar();
}

The above setlock.c program takes two arguments, an offset and a length, which are used as the interval for an advisory read lock on /proc/self/ns/time. Below is a counterpart program which similarly takes two arguments, instead querying the interval for hypothetical contention, using an advisory write-lock:

// querylock.c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char *argv[]) {
    if (argc < 3) {
        printf("usage: %s <offset> <len>\n", argv[0]);
        exit(1);
    }
    off_t offset = atol(argv[1]);
    off_t len = atol(argv[2]);
    int fd = open("/proc/self/ns/time", O_RDONLY);
    if (fd < 0) {
        printf("failed to open /proc/self/ns/time\n");
        exit(1);
    }
    struct flock lock;
    memset(&lock, 0, sizeof(lock));
    lock.l_type = F_WRLCK;
    lock.l_whence = SEEK_SET;
    lock.l_start = offset;
    lock.l_len = len;
    if (fcntl(fd, F_GETLK, &lock) < 0) {
        printf("fcntl() failed\n");
        exit(1);
    }
    if (F_UNLCK != lock.l_type) {
        printf("collision: %ld:%ld\n", lock.l_start, lock.l_start+lock.l_len);
    } else {
        printf("no lock intersects with %ld:%ld\n", offset, offset+len);
    }
}

In querylock.c we set the struct flock.l_type to F_WRLCK, but when calling fcntl() we specify the command argument as F_GETLK to query for possible lock contention rather than attempt to set a lock. If there is no contention, the struct flock member field l_type is updated by the kernel to contain F_UNLCK. As shown below, once a lock is set in one container, any other container (or any process on the host for that matter) can see it:

Synchronization

The ability to set and query for the presence of read-locks across containers is itself pretty cool, but to use this for proper communication, we would ideally have some way to synchronize how containers access the 63-bit space available in /proc/self/ns/time. Luckily POSIX advisory locks have some other nuances which we can use to achieve this:

  1. When the presence of a contending lock is found, the kernel updates the struct flock member field l_pid to contain the PID of the process holding the lock, or 0 if that process is in another PID namespace.

  2. If there are multiple processes with contending locks, the kernel selects and reports the PID of a “primary” lock holder. Ironically, the ordering for selecting this primary lock holder is not based on which of the contestants was first to acquire an intersecting lock, but instead by which of the contestants has held any advisory lock on the file the longest.

Given that the kernel imposes ordering when reporting the PID of lock owners, and that the ordering is preserved across PID namespaces (even if that means the owning PID is reported as 0), for a process to know if it is the “primary” lock holder, all it needs to do is create a child process to query the lock and see if the owning PID is the parent process. Also, given how “primary” lock holders are determined by the kernel, to participate with “fairness” in this race, the process competing to acquire this lock should not hold any prior locks. For example, let’s say two containers both want to compete in a race for “ownership” of offsets 500-501; they could each take the following steps to attempt the lock and determine if they “won” the competition (a code sketch follows the list):

  1. The locking process, here called P1, holding no prior locks, acquires a read-lock at offsets 500-501.
  2. P1 fork()s to create process P2, which has no affiliation with P1’s lock state.
  3. P2 queries for hypothetical write-lock contention at offsets 500-501; the reply will always indicate contention (P1 definitely has a lock, and possibly others do too). P2 then compares the struct flock.l_pid field to see if it matches P1’s PID: if so, P1 won the race for ownership; otherwise it did not. If, instead, it sees 0 for the PID, a process in another PID namespace was first to get the lock and P1 is not the owner.
  4. P2 tells P1 (either by pipe, or any form of IPC) the result, and now P1 is coordinated with all other container instances which are following this same protocol to race for ownership of byte offsets 500-501.
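
Below is a minimal sketch of that race in C. It is illustrative only: it hard-codes the 500-501 interval from the example and reports the result via the child's exit status, standing in for step 4's IPC.

// race.c -- illustrative sketch of the ownership race described above
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    int fd = open("/proc/self/ns/time", O_RDONLY);
    if (fd < 0) { perror("open"); exit(1); }

    // Step 1: P1, holding no prior locks, acquires a read-lock at offsets 500-501
    struct flock lock;
    memset(&lock, 0, sizeof(lock));
    lock.l_type = F_RDLCK;
    lock.l_whence = SEEK_SET;
    lock.l_start = 500;
    lock.l_len = 2;
    if (fcntl(fd, F_SETLK, &lock) < 0) { perror("F_SETLK"); exit(1); }

    pid_t p1 = getpid();

    // Step 2: fork P2, which has no affiliation with P1's lock state
    pid_t p2 = fork();
    if (p2 == 0) {
        // Step 3: P2 queries hypothetical write-lock contention on the same interval
        struct flock query;
        memset(&query, 0, sizeof(query));
        query.l_type = F_WRLCK;
        query.l_whence = SEEK_SET;
        query.l_start = 500;
        query.l_len = 2;
        if (fcntl(fd, F_GETLK, &query) < 0) { perror("F_GETLK"); exit(1); }
        // l_pid holds the "primary" holder's PID (0 if it is in another PID namespace)
        exit(query.l_pid == p1 ? 0 : 1);
    }

    // Step 4: P2 reports back via its exit status (stand-in for a pipe or other IPC)
    int status;
    waitpid(p2, &status, 0);
    if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
        printf("we won ownership of offsets 500-501\n");
    else
        printf("another process owns offsets 500-501\n");
    getchar();  // keep P1's lock held until enter is pressed
}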

These additional properties provide us enough functionality to construct a sort of protocol for containers that normally are unaware of each other’s existence to synchronize with each other, build more traditional read-write-lock mechanisms, and to perform tasks like leader selection. To demonstrate that this is not purely theoretical and can be used for practical communication, we’ve written a cross-container h4x0rchat program built on top of this side-channel:

To support an arbitrary number of users with proper message ordering, and provide a real-time chat without absolutely slamming the CPU in tight query loops, h4x0rchat expands on the described synchronization primitive to create a system of ever-forward-rolling message slots. Clients sync to claim slots in order to post messages, and check for new messages periodically between reading stdin. Messages are written bit-by-bit, with each byte offset in the message slot representing a 1 or 0 depending on if a lock is held at that offset. A “ready” bit is used to indicate when a claimed message slot has been fully written. As long as everyone is following the same protocol, this chat avoids racy data collisions…mostly. A full walk-through of the h4x0rchat protocol, and its shortcomings, would be too much to unpack in this post, but we’re considering writing a follow-up if there is reader interest— as in, if both of our readers like it, ha!

NOTE: the demo setlock.c program shown earlier will interfere with this chat program, so if you see any error messages about there being “interference” when trying out h4x0rchat, make sure you’re not also running setlock!

The h4x0rchat, or any other communication mechanism built atop this same side-channel, is open to disruption (and likely complete denial-of-service) because there are no security guarantees over how other processes apply locks. For example, any process can acquire a lock that spans the entire space of all lockable offsets, and if they’re the first to hold such a lock they can ruin the party line for everyone. Maybe for a defender that’s a good thing? While this side-channel doesn’t present a dire threat to container security, there are definitely scenarios where it could support nefarious activity, and so we’ll close with some security considerations:

  1. You can use the demo setlock program to be the rude user who jams the whole party line by running our demo program with ./setlock 0 9223372036854775807, but if one or more users have held any lock before you, they might be able to devise a protocol for communicating still (just not with the same freedom and ease).

  2. We thought that it would be possible to write a simple AppArmor profile using deny /proc/*/ns/time rwklx, to deny access to /proc/*/ns/time. But, from a first pass of experiments, it seems that this doesn’t work. We will follow up as we learn more; my gut is telling me that this is some specific behavior related to nsfs, but who knows?

  3. You could also put in the grueling work of manually invoking unshare on the time namespace (this feels tedious, who’s got time for that??)

Special thanks to Robert Prast, Jay Beale, and Lee T. Hacker for their feedback.

Combining NVIDIA DGX Spark + Apple Mac Studio for 4x Faster LLM Inference with EXO 1.0

We recently received early access to 2 NVIDIA DGX Spark™ units. NVIDIA calls it the world's smallest AI supercomputer. It has ~100 TFLOPs of FP16 performance with 128GB of CPU-GPU coherent memory at 273 GB/s.

With EXO, we've been running LLMs on clusters of Apple Mac Studios with M3 Ultra chips. The Mac Studio has 512GB of unified memory at 819 GB/s, but the GPU only has ~26 TFLOPs of FP16 performance.

The DGX Spark has 4x the compute, the Mac Studio has 3x the memory bandwidth.

What if we combined them? What if we used DGX Spark for what it does best and Mac Studio for what it does best, in the same inference request?

NVIDIA DGX Spark™ early access units (with quality control supervisor)

Mac Studio M3 Ultra stack used for LLM inference with EXO

What Determines LLM Inference Performance?

What you see as a user boils down to two numbers:

  • TTFT (time‑to‑first‑token): delay from sending a prompt to seeing the first token.
  • TPS (tokens per second): cadence of tokens after the first one appears.

Everything we do in the system exists to improve those two numbers. The reason they're hard to optimize together is that they're governed by two different phases of the same request: prefill and decode.

The lifecycle of a request (from the user's point of view)

  1. You send a prompt.
  2. You wait. Nothing appears. This is the prefill phase, and it determines TTFT.
  3. The first token appears.
  4. A stream of tokens follows. This is the decode phase, and it determines TPS.

What's happening under the hood in those two phases, and why do they behave so differently?

Figure 1: Request lifecycle showing prefill phase (yellow, determines TTFT) followed by decode phase (blue, determines TPS)

Prefill is compute-bound

Prefill processes the prompt and builds a KV cache for each transformer layer. The KV cache consists of a bunch of vectors for each token in the prompt.

These vectors are stored during prefill so we don't need to recompute them during decode.

For large contexts, the amount of compute grows quadratically with the prompt length (Θ(s²)) since every token needs to attend to all the other tokens in the prompt.

With modern techniques like Flash Attention, the data moved can be made to grow linearly with the prompt length (Θ(s)).

So the ratio between the compute and the data moved, i.e. the arithmetic intensity, is linear in the prompt length.

This makes prefill with large contexts compute-bound.

Decode is memory-bound

Decode is the auto‑regressive loop after prefill. Each step generates one token by attending against the entire KV cache built so far.

In decode, we are doing vector-matrix multiplications which have lower arithmetic intensity than matrix-matrix multiplications.

This makes decode memory-bound.

Use different hardware for each phase

Once you separate the phases, the hardware choice is clear.

  • Prefill → high compute device.
  • Decode → high memory-bandwidth device.

Prefill on DGX Spark, transfer KV, decode on M3 Ultra

If you prefill on one device and decode on another, you must send the KV cache across the network. The naive approach is to run prefill, wait for it to finish, transfer the KV cache, then start decode.

Figure 2: Naive split showing prefill (yellow), KV transfer (green), then decode (blue)

This adds a communication cost between the two phases. If the transfer time is too large, you lose the benefit.

Overlap communication with compute

The KV cache doesn't have to arrive as one blob at the end. It can arrive layer by layer.

As soon as Layer 1's prefill completes, two things happen simultaneously. Layer 1's KV starts transferring to the M3 Ultra, and Layer 2's prefill begins on the DGX Spark. The communication for each layer overlaps with the computation of subsequent layers.

Figure 3: Layer-by-layer pipeline showing prefill (yellow) and KV transfer (green) overlapping across layers. Decode (blue) starts immediately when all layers complete.

In practice, EXO transfers the KV vectors of a layer while the layer is being processed, since the KV vectors are computed before the heavy compute operations. To hide the communication overhead, we just need the layer processing time (tcomp) to be larger than the KV transfer time (tsend).

Full overlap is possible when the context is large enough

The compute time is tcomp = F / P, where F is the FLOPs per layer and P is the machine's FLOPs/s. For large contexts, F scales quadratically with the prompt length: F ∼ c1·s², where c1 is a model-dependent constant.

The transfer time is tsend = D / B, where D is KV data in bits and B is network bandwidth in bits/s. The KV cache has a constant number of vectors per token, so D ∼ q·c2·s, where q is quantization (4-bit, 8-bit, etc.) and c2 is model-dependent.

To fully hide communication, we need the transfer time to be less than the compute time: tsend < tcomp. This means P/B < F/D ∼ (c1/c2)·s/q. With DGX Spark at 100 TFLOPs FP16 and a 10 GbE (10 Gbps) link between the DGX Spark and the M3 Ultra, the ratio P/B = 10,000. This means we need s > 10,000·q/(c1/c2).

The constant K = c1/c2 depends on the attention architecture. For older models with multi-head attention (MHA) like Llama-2 7B, K = 2. For models with grouped query attention (GQA), K is larger: Llama-3 8B has K = 8, while Llama-3 70B and Qwen-2.5 72B have K = 16.

With 8-bit KV streaming and K = 16 (Llama-3 70B), the threshold is s > 5k tokens. For K = 8 (Llama-3 8B), it's s > 10k tokens. For K = 2 (Llama-2 7B), it's s > 40k tokens.
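
A quick back-of-the-envelope check of those thresholds, using the values quoted above:

# Back-of-the-envelope check of the overlap threshold s > (P/B) * q / K
P = 100e12   # DGX Spark FP16 compute, FLOPs/s
B = 10e9     # 10 GbE link, bits/s
q = 8        # KV cache quantization, bits per value

for model, K in [("Llama-2 7B (MHA)", 2), ("Llama-3 8B (GQA)", 8), ("Llama-3 70B (GQA)", 16)]:
    s_min = (P / B) * q / K   # minimum prompt length for full overlap
    print(f"{model}: full overlap needs s > {s_min:,.0f} tokens")
# -> 40,000 / 10,000 / 5,000 tokens, matching the figures above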

Benchmark results: Llama-3.1 8B with 8k context

Running Llama-3.1 8B (FP16) with an 8,192 token prompt and generating 32 tokens:

Configuration          Prefill Time   Generation Time   Total Time   Speedup
DGX Spark              1.47s          2.87s             4.34s        1.9×
M3 Ultra Mac Studio    5.57s          0.85s             6.42s        1.0× (baseline)
DGX Spark + M3 Ultra   1.47s          0.85s             2.32s        2.8×

The combined setup achieves the best of both worlds: DGX Spark's fast prefill (3.8× faster than M3 Ultra) and M3 Ultra's fast generation (3.4× faster than DGX Spark), delivering 2.8× overall speedup compared to M3 Ultra alone.

EXO 1.0 does this automagically

Disaggregated prefill and decode, layer-by-layer KV streaming, and hardware-aware phase placement are all automated in EXO.

When you start EXO, it automatically discovers all devices connected in your ad-hoc mesh network and profiles each for compute throughput, memory bandwidth, memory capacity, and network characteristics.

Given a model and your topology, EXO plans which device should handle prefill, which should handle decode, whether to pipeline across layers, when to stream KV, and how to adapt if network conditions change. You don't write the schedule. You don't compute the thresholds. You just run the model, and EXO figures out how to make your heterogeneous cluster fast.

Inference is no longer constrained by what one box can do, but by what your whole cluster can do together.

NVIDIA DGX Spark and Mac Studio M3 Ultra working together for optimized inference

Verify Cosign bring-your-own PKI signature on OpenShift | Red Hat Developer

Red Hat OpenShift 4.16 introduced sigstore signature verification as a Technology Preview feature through the ClusterImagePolicy and ImagePolicy Custom Resource Definitions (CRDs). These initial implementations supported two policy types:

  • Fulcio CA with Rekor: Leverages Sigstore's certificate authority and transparency log for verification.
  • Public key: Uses Cosign-generated private and public key pairs.

In this article, we will introduce bring-your-own PKI (BYO-PKI) signature verification through the ClusterImagePolicy and ImagePolicy API. This Developer Preview feature (available from 4.19) enables you to validate container images using an existing X.509 certificate while aligning with Cosign's BYO-PKI signing workflow.

Cosign bring-your-own PKI signing

The following example generates the certificate chain using OpenSSL commands. We then use Cosign BYO-PKI to sign the image and attach the signature to the quay.io registry.

ClusterImagePolicy requires a subject alternative name (SAN) to authenticate the user’s identity, which can be either a hostname or an email address. In this case, both a hostname and an email address were specified when generating the certificate.

# Generate Root CA
openssl req -x509 -newkey rsa:4096 -keyout root-ca-key.pem -sha256 -noenc -days 9999 -subj "/C=ES/L=Valencia/O=IT/OU=Security/CN=Linuxera Root Certificate Authority" -out root-ca.pem
# Intermediate CA
openssl req -noenc -newkey rsa:4096 -keyout intermediate-ca-key.pem \
-addext "subjectKeyIdentifier = hash" \
-addext "keyUsage = keyCertSign" \
-addext "basicConstraints = critical,CA:TRUE,pathlen:2"  \
-subj "/C=ES/L=Valencia/O=IT/OU=Security/CN=Linuxera Intermediate Certificate Authority" \
-out intermediate-ca.csr
openssl x509 -req -days 9999 -sha256 -in intermediate-ca.csr -CA root-ca.pem -CAkey root-ca-key.pem -copy_extensions copy -out intermediate-ca.pem
# Leaf CA
openssl req -noenc -newkey rsa:4096 -keyout leaf-key.pem \
-addext "subjectKeyIdentifier = hash" \
-addext "keyUsage = digitalSignature" \
-addext "subjectAltName = email:qiwan@redhat.com,DNS:myhost.example.com" \
-subj "/C=ES/L=Valencia/O=IT/OU=Security/CN=Team A Cosign Certificate" -out leaf.csr
openssl x509 -req -in leaf.csr -CA intermediate-ca.pem -CAkey intermediate-ca-key.pem -copy_extensions copy -days 9999 -sha256 -out leaf.pem
# Bundle CA chain (Intermediate + Root)
cat intermediate-ca.pem root-ca.pem > ca-bundle.pem
# Sign the image using cosign
podman pull quay.io/libpod/busybox
podman tag quay.io/libpod/busybox quay.io/qiwanredhat/byo:latest
podman push --tls-verify=false --creds=<username>:<password> quay.io/qiwanredhat/byo:latest
IMAGE=quay.io/qiwanredhat/byo
PAYLOAD=payload.json
cosign generate $IMAGE >$PAYLOAD
openssl dgst -sha256 -sign leaf-key.pem -out $PAYLOAD.sig $PAYLOAD
cat $PAYLOAD.sig | base64 >$PAYLOAD.base64.sig
cosign attach signature $IMAGE \
	--registry-password=<password> \
	--registry-username=<username> \
	--payload $PAYLOAD \
	--signature $PAYLOAD.base64.sig \
	--cert leaf.pem \
	--cert-chain ca-bundle.pem

The next section will show how to configure ClusterImagePolicy to verify this signature.

Configure OpenShift for PKI verification

This section will guide you through verifying the quay.io/qiwanredhat/byo image. This involves enabling DevPreviewNoUpgrade features and configuring the ClusterImagePolicy CRD.

Enable Developer Preview features

First, enable the required Developer Preview features for your cluster by editing the FeatureGate CR named cluster:

$ oc edit featuregate cluster
apiVersion: config.openshift.io/v1
kind: FeatureGate
metadata:
  name: cluster
spec:
  featureSet: DevPreviewNoUpgrade

Define ClusterImagePolicy

This section creates the following ClusterImagePolicy CR for image verification. In the CR spec, it specifies the image to be verified and the details of the PKI certificate. It also sets matchPolicy to MatchRepository because the image was signed against the repository (the value of docker-reference from payload.json) rather than a specific tag or digest. If not specified, the default matchPolicy is MatchRepoDigestOrExact, which requires the signature's docker-reference to match the image specified in the pod spec.

apiVersion: config.openshift.io/v1alpha1
kind: ClusterImagePolicy
metadata:
  name: pki-quay-policy
spec:
  scopes:
  - quay.io/qiwanredhat/byo
  policy:
    rootOfTrust:
      policyType: PKI
      pki:
        caRootsData: <base64-encoded-root-ca>
        caIntermediatesData: <base64-encoded-intermediate-ca>
        pkiCertificateSubject:
          email: qiwan@redhat.com
          hostname: myhost.example.com
    signedIdentity:
      # set matchPolicy (default is MatchRepoDigestOrExact) since the above signature was signed on the repository, not a specific tag or digest
      matchPolicy: MatchRepository

This ClusterImagePolicy object is rolled out to /etc/containers/policy.json and updates /etc/containers/registries.d/sigstore-registries.yaml, adding an entry that enables sigstore verification for the quay.io/qiwanredhat/byo scope.

Validate signature requirements

Create the following test pod to confirm that CRI-O verifies the signature. To see debug-level logs, follow the documentation to configure ContainerRuntimeConfig; a minimal example is shown below.
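
For reference, a minimal ContainerRuntimeConfig that raises the CRI-O log level to debug might look like this (shown for the worker pool; adjust the pool selector for your cluster):

apiVersion: machineconfiguration.openshift.io/v1
kind: ContainerRuntimeConfig
metadata:
  name: crio-debug-loglevel
spec:
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/worker: ""
  containerRuntimeConfig:
    logLevel: debug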

Create a test pod as follows:

kind: Pod
apiVersion: v1
metadata:
  generateName: img-test-pod-
spec:
  serviceAccount: default
  containers:
    - name: step-hello
      command:
        - sleep
        - infinity
      image: quay.io/qiwanredhat/byo:latest

Check CRI-O logs for verification.

sh-5.1# journalctl -u crio | grep -A 100 "Pulling image: quay.io/qiwanredhat"
Apr 21 08:09:07 ip-10-0-27-44 crio[2371]: time="2025-04-21T08:09:07.381322395Z" level=debug msg="IsRunningImageAllowed for image docker:quay.io/qiwanredhat/byo:latest" file="signature/policy_eval.go:274"
Apr 21 08:09:07 ip-10-0-27-44 crio[2371]: time="2025-04-21T08:09:07.381485828Z" level=debug msg=" Using transport \"docker\" specific policy section \"quay.io/qiwanredhat/byo\"" file="signature/policy_eval.go:150"

Policy enforcement failure modes and diagnostics

For an image to be accepted by CRI-O during container creation, all the signature requirements must be satisfied. Pod events should show SignatureValidationFailed from the kubelet on verification failures. The CRI-O log provides more details.

The following is the result of an attempt to deploy an unsigned image quay.io/qiwanredhat/byo:latest.

$ oc get pods
NAME                 READY   STATUS             RESTARTS   AGE
img-test-pod-sdk47   0/1     ImagePullBackOff   0          13m

Events:
  Type 	Reason      	Age               	From           	Message
  ---- 	------      	----              	----           	-------
  Normal   Scheduled   	13m               	default-scheduler  Successfully assigned default/img-test-pod-sdk47 to ip-10-0-56-56.us-east-2.compute.internal
  Normal   AddedInterface  13m               	multus         	Add eth0 [10.131.2.23/23] from ovn-kubernetes
  Normal   Pulling     	10m (x5 over 13m) 	kubelet        	Pulling image "quay.io/qiwanredhat/busybox-byo:latest"
  Warning  Failed      	10m (x5 over 13m) 	kubelet        	Failed to pull image "quay.io/qiwanredhat/busybox-byo:latest": SignatureValidationFailed: Source image rejected: A signature was required, but no signature exists
  Warning  Failed      	10m (x5 over 13m) 	kubelet        	Error: SignatureValidationFailed
  Normal   BackOff     	3m16s (x42 over 13m)  kubelet        	Back-off pulling image "quay.io/qiwanredhat/busybox-byo:latest"
  Warning  Failed      	3m16s (x42 over 13m)  kubelet        	Error: ImagePullBackOff

journalctl -u crio | grep "byo"
Apr 23 06:12:38 ip-10-0-56-56 crio[2366]: time="2025-04-23T06:12:38.141197504Z" level=debug msg="Fetching sigstore attachment manifest failed, assuming it does not exist: reading manifest sha256-8677cb90773f20fecd043e6754e548a2ea03a232264c92a17a5c77f1c4eda43e.sig in quay.io/qiwanredhat/byo: manifest unknown" file="docker/docker_client.go:1129"

Final thoughts

This article demonstrated how to perform signature verification on images signed with Cosign's bring-your-own PKI feature in OpenShift using the ClusterImagePolicy CRD. We walked through the end-to-end process of signing an image with Cosign and BYO-PKI, followed by configuring OpenShift to verify that signature.

As we progress toward general availability (GA) for this feature, organizations can leverage their existing PKI infrastructure to enhance the security and integrity of container images running on OpenShift.

Fil-C
