Zero-latency SQLite storage in every Durable Object


Traditional cloud storage is inherently slow, because it is normally accessed over a network and must carefully synchronize across many clients that could be accessing the same data. But what if we could instead put your application code deep into the storage layer, such that your code runs directly on the machine where the data is stored, and the database itself executes as a local library embedded inside your application?

Durable Objects (DO) are a novel approach to cloud computing which accomplishes just that: Your application code runs exactly where the data is stored. Not just on the same machine: your storage lives in the same thread as the application, requiring not even a context switch to access. With proper use of caching, storage latency is essentially zero, while nevertheless being durable and consistent.

Until today, DOs only offered key/value oriented storage. But now, they support a full SQL query interface with tables and indexes, through the power of SQLite.

SQLite is the most-used SQL database implementation in the world, with billions of installations. It’s on practically every phone and desktop computer, and many embedded devices use it as well. It's known to be blazingly fast and rock solid. But it's been less common on the server. This is because traditional cloud architecture favors large distributed databases that live separately from application servers, while SQLite is designed to run as an embedded library. In this post, we'll show you how Durable Objects turn this architecture on its head and unlock the full power of SQLite in the cloud.

Durable Objects (DOs) are a part of the Cloudflare Workers serverless platform. A DO is essentially a small server that can be addressed by a unique name and can keep state both in-memory and on-disk. Workers running anywhere on Cloudflare's network can send messages to a DO by its name, and all messages addressed to the same name — from anywhere in the world — will find their way to the same DO instance.

DOs are intended to be small and numerous. A single application can create billions of DOs distributed across our global network. Cloudflare automatically decides where a DO should live based on where it is accessed, automatically starts it up as needed when requests arrive, and shuts it down when idle. A DO has in-memory state while running and can also optionally store long-lived durable state. Since there is exactly one DO for each name, a DO can be used to coordinate between operations on the same logical object.

For example, imagine a real-time collaborative document editor application. Many users may be editing the same document at the same time. Each user's changes must be broadcast to other users in real time, and conflicts must be resolved. An application built on DOs would typically create one DO for each document. The DO would receive edits from users, resolve conflicts, broadcast the changes back out to other users, and keep the document content updated in its local storage.

DOs are especially good at real-time collaboration, but are by no means limited to this use case. They are general-purpose servers that can implement any logic you desire to serve requests. Even more generally, DOs are a basic building block for distributed systems.

When using Durable Objects, it's important to remember that they are intended to scale out, not up. A single object is inherently limited in throughput since it runs on a single thread of a single machine. To handle more traffic, you create more objects. This is easiest when different objects can handle different logical units of state (like different documents, different users, or different "shards" of a database), where each unit of state has low enough traffic to be handled by a single object. But sometimes, a lot of traffic needs to modify the same state: consider a vote counter with a million users all trying to cast votes at once. To handle such cases with Durable Objects, you would need to create a set of objects that each handle a subset of traffic and then replicate state to each other. Perhaps they use CRDTs in a gossip network, or perhaps they implement a fan-in/fan-out approach to a single primary object. Whatever approach you take, Durable Objects make it fast and easy to create more stateful nodes as needed.
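
To make the fan-in idea concrete, here is a minimal sketch. The VOTE_PRIMARY binding, the totals table, and the one-second flush interval are illustrative assumptions; the rest uses the standard Durable Objects alarm, RPC, and SQL APIs:

import {DurableObject} from "cloudflare:workers";

// Each shard absorbs votes from its own slice of users and periodically
// forwards a subtotal to a single primary object.
export class VoteShard extends DurableObject {
  pending = 0;  // in-memory only; a real implementation would persist this

  async castVote() {
    this.pending++;
    if (this.pending === 1) {
      // First vote since the last flush: schedule one, coalescing later votes into it.
      await this.ctx.storage.setAlarm(Date.now() + 1000);
    }
  }

  async alarm() {
    if (this.pending === 0) return;
    let id = this.env.VOTE_PRIMARY.idFromName("global");
    await this.env.VOTE_PRIMARY.get(id).addVotes(this.pending);
    this.pending = 0;
  }
}

// The primary sees one aggregated write per shard per second instead of one
// write per user vote. (Assumes a `totals` table was created elsewhere.)
export class VotePrimary extends DurableObject {
  addVotes(n) {
    this.ctx.storage.sql.exec(`UPDATE totals SET count = count + ? WHERE id = 0`, n);
  }
}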

In traditional cloud architecture, stateless application servers run business logic and communicate over the network to a database. Even if the network is local, database requests still incur latency, typically measured in milliseconds.

When a Durable Object uses SQLite, SQLite is invoked as a library. This means the database code runs not just on the same machine as the DO, not just in the same process, but in the very same thread. Latency is effectively zero, because there is no communication barrier between the application and SQLite. A query can complete in microseconds.

Reads and writes are synchronous

The SQL query API in DOs does not require you to await results — they are returned synchronously:

// No awaits!
let cursor = sql.exec("SELECT name, email FROM users");
for (let user of cursor) {
  console.log(user.name, user.email);
}

This may come as a surprise to some. Querying a database is I/O, right? I/O should always be asynchronous, right? Isn't this a violation of the natural order of JavaScript?

It's OK! The database content is probably cached in memory already, and SQLite is being called as a library in the same thread as the application, so the query often actually won't spend any time at all waiting for I/O. Even if it does have to go to disk, it's a local SSD. You might as well consider the local disk as just another layer in the memory cache hierarchy: L5 cache, if you will. In any case, it will respond quickly.

Meanwhile, synchronous queries provide some big benefits. First, the logistics of asynchronous event loops have a cost, so in the common case where the data is already in memory, a synchronous query will actually complete faster than an async one.

More importantly, though, synchronous queries help you avoid subtle bugs. Any time your application awaits a promise, it's possible that some other code executes while you wait. The state of the world may have changed by the time your await completes. Maybe even other SQL queries were executed. This can lead to subtle bugs that are hard to reproduce because they require events to happen at just the wrong time. With a synchronous API, though, none of that can happen. Your code always executes in the order you wrote it, uninterrupted.
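
As a tiny illustration, consider a read-then-write sequence; the async db client in the first half is hypothetical, shown only for contrast:

// With an asynchronous client, other code can run between these two lines,
// so the balance may have changed by the time the UPDATE executes:
let row = await db.query("SELECT balance FROM accounts WHERE id = ?", id);
await db.query("UPDATE accounts SET balance = ? WHERE id = ?", row.balance + amount, id);

// With the synchronous API, nothing can interleave between the two statements:
let account = sql.exec("SELECT balance FROM accounts WHERE id = ?", id).one();
sql.exec("UPDATE accounts SET balance = ? WHERE id = ?", account.balance + amount, id);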

Fast writes with Output Gates

Database experts might have a deeper objection to synchronous queries: Yes, caching may mean we can perform reads and writes very fast. However, in the case of a write, just writing to cache isn't good enough. Before we return success to our client, we must confirm that the write is actually durable, that is, it has actually made it onto disk or network storage such that it cannot be lost if the power suddenly goes out.

Normally, a database would confirm all writes before returning to the application. So if the query is successful, it is confirmed. But confirming writes can be slow, because it requires waiting for the underlying storage medium to respond. Normally, this is OK because the write is performed asynchronously, so the program can go on and work on other things while it waits for the write to finish. It looks kind of like this:

But I just told you that in Durable Objects, writes are synchronous. While a synchronous call is running, no other code in the program can run (because JavaScript does not have threads). This is convenient, as mentioned above, because it means you don't need to worry that the state of the world may have changed while you were waiting. However, if write queries have to wait a while, and the whole program must pause and wait for them, then throughput will suffer.

Luckily, in Durable Objects, writes do not have to wait, due to a little trick we call "Output Gates".

In DOs, when the application issues a write, it continues executing without waiting for confirmation. However, when the DO then responds to the client, the response is blocked by the "Output Gate". This system holds the response until all storage writes relevant to the response have been confirmed, then sends the response on its way. In the rare case that the write fails, the response will be replaced with an error and the Durable Object itself will restart. So, even though the application constructed a "success" response, nobody can ever see that this happened, and thus nobody can be misled into believing that the data was stored.

Let's see what this looks like with multiple requests:

If you compare this against the first diagram above, you should notice a few things:

  • The timing of requests and confirmations is the same.

  • But, all responses were sent to the client sooner than in the first diagram. Latency was reduced! This is because the application is able to work on constructing the response in parallel with the storage layer confirming the write.

  • Request handling is no longer interleaved between the three requests. Instead, each request runs to completion before the next begins. The application does not need to worry, during the handling of one request, that its state might change unexpectedly due to a concurrent request.

With Output Gates, we get the ease-of-use of synchronous writes, while also getting lower latency and no loss of throughput.

N+1 selects? No problem.

Zero-latency queries aren't just faster, they allow you to structure your code differently, often making it simpler. A classic example is the "N+1 selects" or "N+1 queries" problem. Let's illustrate this problem with an example:

// N+1 SELECTs example

// Get the 100 most-recently-modified docs.
let docs = sql.exec(`
  SELECT title, authorId FROM documents
  ORDER BY lastModified DESC
  LIMIT 100
`).toArray();

// For each returned document, get the author name from the users table.
for (let doc of docs) {
  doc.authorName = sql.exec(
      "SELECT name FROM users WHERE id = ?", doc.authorId).one().name;
}

If you are an experienced SQL user, you are probably cringing at this code, and for good reason: this code does 101 queries! If the application is talking to the database across a network with 5ms latency, this will take 505ms to run, which is slow enough for humans to notice.

// Do it all in one query with a join?
let docs = sql.exec(`
  SELECT documents.title, users.name
  FROM documents JOIN users ON documents.authorId = users.id
  ORDER BY documents.lastModified DESC
  LIMIT 100
`).toArray();

Here we've used SQL features to turn our 101 queries into one query. Great! Except, what does it mean? We used an inner join, which is not to be confused with a left, right, or cross join. What's the difference? Honestly, I have no idea! I had to look up joins just to write this example and I'm already confused.

Well, good news: You don't need to figure it out. Because when using SQLite as a library, the first example above works just fine. It'll perform about the same as the second fancy version.

More generally, when using SQLite as a library, you don't have to learn how to do fancy things in SQL syntax. Your logic can be in regular old application code in your programming language of choice, orchestrating the most basic SQL queries that are easy to learn. It's fine. The creators of SQLite have made this point themselves.

Point-in-Time Recovery

While not necessarily related to speed, SQLite-backed Durable Objects offer another feature: any object can be reverted to the state it had at any point in time in the last 30 days. So if you accidentally execute a buggy query that corrupts all your data, don't worry: you can recover. There's no need to opt into this feature in advance; it's on by default for all SQLite-backed DOs. See the docs for details.
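
As a rough sketch of what a recovery could look like from inside the object, assuming bookmark-style methods along these lines (the exact names and signatures are in the docs, so treat this as illustrative only):

// Hypothetical sketch: roll this Durable Object back to its state 24 hours ago.
async function revertToYesterday(ctx) {
  // Assumed API: look up a bookmark for a past timestamp...
  let bookmark = await ctx.storage.getBookmarkForTime(Date.now() - 24 * 60 * 60 * 1000);
  // ...ask that the next session start from that bookmark...
  await ctx.storage.onNextSessionRestoreBookmark(bookmark);
  // ...then restart the object so it comes back up with the old state.
  ctx.abort();
}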

Let's say we're an airline, and we are implementing a way for users to choose their seats on a flight. We will create a new Durable Object for each flight. Within that DO, we will use a SQL table to track the assignments of seats to passengers. The code might look something like this:

import {DurableObject} from "cloudflare:workers";

// Manages seat assignment for a flight.
//
// This is an RPC interface. The methods can be called remotely by other Workers
// running anywhere in the world. All Workers that specify the same object ID
// (probably based on the flight number and date) will reach the same instance of
// FlightSeating.
export class FlightSeating extends DurableObject {
  sql = this.ctx.storage.sql;

  // Application calls this when the flight is first created to set up the seat map.
  initializeFlight(seatList) {
    this.sql.exec(`
      CREATE TABLE seats (
        seatId TEXT PRIMARY KEY,  -- e.g. "3B"
        occupant TEXT             -- null if available
      )
    `);

    for (let seat of seatList) {
      this.sql.exec(`INSERT INTO seats VALUES (?, null)`, seat);
    }
  }

  // Get a list of available seats.
  getAvailable() {
    let results = [];

    // Query returns a cursor.
    let cursor = this.sql.exec(`SELECT seatId FROM seats WHERE occupant IS NULL`);

    // Cursors are iterable.
    for (let row of cursor) {
      // Each row is an object with a property for each column.
      results.push(row.seatId);
    }

    return results;
  }

  // Assign passenger to a seat.
  assignSeat(seatId, occupant) {
    // Check that seat isn't occupied.
    let cursor = this.sql.exec(`SELECT occupant FROM seats WHERE seatId = ?`, seatId);
    let result = [...cursor][0];  // Get the first result from the cursor.
    if (!result) {
      throw new Error("No such seat: " + seatId);
    }
    if (result.occupant !== null) {
      throw new Error("Seat is occupied: " + seatId);
    }

    // If the occupant is already in a different seat, remove them.
    this.sql.exec(`UPDATE seats SET occupant = null WHERE occupant = ?`, occupant);

    // Assign the seat. Note: We don't have to worry that a concurrent request may
    // have grabbed the seat between the two queries, because the code is synchronous
    // (no `await`s) and the database is private to this Durable Object. Nothing else
    // could have changed since we checked that the seat was available earlier!
    this.sql.exec(`UPDATE seats SET occupant = ? WHERE seatId = ?`, occupant, seatId);
  }
}

(With just a little more code, we could extend this example to allow clients to subscribe to seat changes with WebSockets, so that if multiple people are choosing their seats at the same time, they can see in real time as seats become unavailable. But, that's outside the scope of this blog post, which is just about SQL storage.)

Then in wrangler.toml, define a migration setting up your DO class like usual, but instead of using new_classes, use new_sqlite_classes:

[[migrations]]
tag = "v1"
new_sqlite_classes = ["FlightSeating"]

SQLite-backed objects also support the existing key/value-based storage API: KV data is stored into a hidden table in the SQLite database. So, existing applications built on DOs will work when deployed using SQLite-backed objects.
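
For example, a minimal object that sticks to the key/value API works unchanged on the new backend:

import {DurableObject} from "cloudflare:workers";

export class Counter extends DurableObject {
  async increment() {
    // These key/value calls are stored in a hidden table of the SQLite database.
    let value = (await this.ctx.storage.get("count")) ?? 0;
    await this.ctx.storage.put("count", value + 1);
    return value + 1;
  }
}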

However, because SQLite-backed objects are based on an all-new storage backend, it is currently not possible to switch an existing deployed DO class to use SQLite. You must ask for SQLite when initially deploying the new DO class; you cannot change it later. We plan to begin migrating existing DOs to the new storage backend in 2025.

Pricing

We’ve kept pricing for SQLite-in-DO similar to D1, Cloudflare’s serverless SQL database, by billing for SQL queries (based on rows) and SQL storage. SQL storage per object is limited to 1 GB during the beta period, and will be increased to 10 GB on general availability. DO requests and duration billing are unchanged and apply to all DOs regardless of storage backend. 

During the initial beta, billing is not enabled for SQL queries (rows read and rows written) and SQL storage. SQLite-backed objects will incur charges for requests and duration. We plan to enable SQL billing in the first half of 2025 with advance notice.

Workers Paid
  • Rows read: first 25 billion / month included, then $0.001 / million rows
  • Rows written: first 50 million / month included, then $1.00 / million rows
  • SQL storage: 5 GB-month included, then $0.20 / GB-month

For more on how to use SQLite-in-Durable Objects, check out the documentation.

Cloudflare Workers already offers another SQLite-backed database product: D1. In fact, D1 is itself built on SQLite-in-DO. So, what's the difference? Why use one or the other?

In short, you should think of D1 as a more "managed" database product, while SQLite-in-DO is more of a lower-level “compute with storage” building block.

D1 fits into a more traditional cloud architecture, where stateless application servers talk to a separate database over the network. Those application servers are typically Workers, but could also be clients running outside of Cloudflare. D1 also comes with a pre-built HTTP API and managed observability features like query insights. With D1, where your application code and SQL database queries are not colocated like in SQLite-in-DO, Workers has Smart Placement to dynamically run your Worker in the best location to reduce total request latency, considering everything your Worker talks to, including D1. By the end of 2024, D1 will support automatic read replication for scalability and low-latency access around the world. If this managed model appeals to you, use D1.

Durable Objects require a bit more effort, but in return, give you more power. With DO, you have two pieces of code that run in different places: a front-end Worker which routes incoming requests from the Internet to the correct DO, and the DO itself, which runs on the same machine as the SQLite database. You may need to think carefully about which code to run where, and you may need to build some of your own tooling that exists out-of-the-box with D1. But because you are in full control, you can tailor the solution to your application's needs and potentially achieve more.

When Durable Objects first launched in 2020, it offered only a simple key/value-based interface for durable storage. Under the hood, these keys and values were stored in a well-known off-the-shelf database, with regional instances of this database deployed to locations in our data centers around the world. Durable Objects in each region would store their data to the regional database.

For SQLite-backed Durable Objects, we have completely replaced the persistence layer with a new system built from scratch, called Storage Relay Service, or SRS. SRS has already been powering D1 for over a year, and can now be used more directly by applications through Durable Objects.

SRS is based on a simple idea:

Local disk is fast and randomly-accessible, but expensive and prone to disk failures. Object storage (like R2) is cheap and durable, but much slower than local disk and not designed for database-like access patterns. Can we get the best of both worlds by using a local disk as a cache on top of object storage?

So, how does it work?

The mismatch in functionality between local disk and object storage

A SQLite database on disk tends to undergo many small changes in rapid succession. Any row of the database might be updated by any particular query, but the database is designed to avoid rewriting parts that didn't change. Read queries may randomly access any part of the database. Assuming the right indexes exist to support the query, they should not require reading parts of the database that aren't relevant to the results, and should complete in microseconds.

Object storage, on the other hand, is designed for an entirely different usage model: you upload an entire "object" (blob of bytes) at a time, and download an entire blob at a time. Each blob has a different name. For maximum efficiency, blobs should be fairly large, from hundreds of kilobytes to gigabytes in size. Latency is relatively high, measured in tens or hundreds of milliseconds.

So how do we back up our SQLite database to object storage? An obviously naive strategy would be to simply make a copy of the database file from time to time and upload it as a new "object". But, uploading the database on every change — and making the application wait for the upload to complete — would obviously be way too slow. We could choose to upload the database only occasionally — say, every 10 minutes — but this means in the case of a disk failure, we could lose up to 10 minutes of changes. Data loss is, uh, bad! And even then, for most databases, it's likely that most of the data doesn't change every 10 minutes, so we'd be uploading the same data over and over again.

Trick one: Upload a log of changes

Instead of uploading the entire database, SRS records a log of changes, and uploads those.

Conveniently, SQLite itself already has a concept of a change log: the Write-Ahead Log, or WAL. SRS always configures SQLite to use WAL mode. In this mode, any changes made to the database are first written to a separate log file. From time to time, the database is "checkpointed", merging the changes back into the main database file. The WAL format is well-documented and easy to understand: it's just a sequence of "frames", where each frame is an instruction to write some bytes to a particular offset in the database file.
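
Schematically, you can picture each frame like this (not the actual on-disk encoding, which also carries checksums and commit information):

// One WAL frame, conceptually: "write these bytes at this offset in the main database file".
const frame = {
  offset: 4096 * 41,            // where in the database file the change lands
  bytes: new Uint8Array(4096),  // the new contents for that region (one page)
  isCommit: false,              // the last frame of a transaction marks the commit
};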

SRS monitors changes to the WAL file (by hooking SQLite's VFS to intercept file writes) to discover the changes being made to the database, and uploads those to object storage.

Unfortunately, SRS cannot simply upload every single change as a separate "object", as this would result in too many objects, each of which would be inefficiently small. Instead, SRS batches changes over a period of up to 10 seconds, or up to 16 MB worth, whichever happens first, then uploads the whole batch as a single object.

When reconstructing a database from object storage, we must download the series of change batches and replay them in order. Of course, if the database has undergone many changes over a long period of time, this can get expensive. In order to limit how far back it needs to look, SRS also occasionally uploads a snapshot of the entire content of the database. SRS will decide to upload a snapshot any time that the total size of logs since the last snapshot exceeds the size of the database itself. This heuristic implies that the total amount of data that SRS must download to reconstruct a database is limited to no more than twice the size of the database. Since we can delete data from object storage that is older than the latest snapshot, this also means that our total stored data is capped to 2x the database size.
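
A schematic sketch of that heuristic (not SRS's actual code):

// Upload a fresh snapshot once the WAL batches uploaded since the last snapshot
// outweigh the database itself.
function shouldUploadSnapshot(logBytesSinceSnapshot, databaseSizeBytes) {
  return logBytesSinceSnapshot > databaseSizeBytes;
}

// Worst case for reconstruction: one snapshot (at most the database size) plus
// the logs since then (also at most the database size), i.e. about 2x the database.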

Credit where credit is due: This idea — uploading WAL batches and snapshots to object storage — was inspired by Litestream, although our implementation is different.

Trick two: Relay through other servers in our global network

Batches are only uploaded to object storage every 10 seconds. But obviously, we cannot make the application wait for 10 whole seconds just to confirm a write. So what happens if the application writes some data, returns a success message to the user, and then the machine fails 9 seconds later, losing the data?

To solve this problem, we take advantage of our global network. Every time SQLite commits a transaction, SRS will immediately forward the change log to five "follower" machines across our network. Once at least three of these followers respond that they have received the change, SRS informs the application that the write is confirmed. (As discussed earlier, the write confirmation opens the Durable Object's "output gate", unblocking network communications to the rest of the world.)
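
Here is a minimal sketch of the three-of-five confirmation rule; follower.send() is a hypothetical stand-in for whatever transport SRS actually uses:

// A write is considered confirmed as soon as any three followers acknowledge it.
function confirmWrite(followers /* five of them */, changeLog) {
  return new Promise((resolve) => {
    let acks = 0;
    for (const follower of followers) {
      follower.send(changeLog)
        .then(() => { if (++acks === 3) resolve(); })  // confirmed at three acks
        .catch(() => {});  // a slow or failed follower doesn't block confirmation
    }
  });
}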

When a follower receives a change, it temporarily stores it in a buffer on local disk, and then awaits further instructions. Later on, once SRS has successfully uploaded the change to object storage as part of a batch, it informs each follower that the change has been persisted. At that point, the follower can simply delete the change from its buffer.

However, if the follower never receives the persisted notification, then, after some timeout, the follower itself will upload the change to object storage. Thus, if the machine running the database suddenly fails, as long as at least one follower is still running, it will ensure that all confirmed writes are safely persisted.

Each of a database's five followers is located in a different physical data center. Cloudflare's network consists of hundreds of data centers around the world, which means it is always easy for us to find four other data centers near any Durable Object (in addition to the one it is running in). In order for a confirmed write to be lost, then, at least four different machines in at least three different physical buildings would have to fail simultaneously (three of the five followers, plus the Durable Object's host machine). Of course, anything can happen, but this is exceedingly unlikely.

Followers also come in handy when a Durable Object's host machine is unresponsive. We may not know for sure if the machine has died completely, or if it is still running and responding to some clients but not others. We cannot start up a new instance of the DO until we know for sure that the previous instance is dead – or, at least, that it can no longer confirm writes, since the old and new instances could then confirm contradictory writes. To deal with this situation, if we can't reach the DO's host, we can instead try to contact its followers. If we can contact at least three of the five followers, and tell them to stop confirming writes for the unreachable DO instance, then we know that instance is unable to confirm any more writes going forward. We can then safely start up a new instance to replace the unreachable one.

Bonus feature: Point-in-Time Recovery

I mentioned earlier that SQLite-backed Durable Objects can be asked to revert their state to any time in the last 30 days. How does this work?

This was actually an accidental feature that fell out of SRS's design. Since SRS stores a complete log of changes made to the database, we can restore to any point in time by replaying the change log from the last snapshot. The only thing we have to do is make sure we don't delete those logs too soon.

Normally, whenever a snapshot is uploaded, all previous logs and snapshots can then be deleted. But instead of deleting them immediately, SRS merely marks them for deletion 30 days later. In the meantime, if a point-in-time recovery is requested, the data is still there to work from.

For a database with a high volume of writes, this may mean we store a lot of data for a lot longer than needed. As it turns out, though, once data has been written at all, keeping it around for an extra month is pretty cheap — typically cheaper, even, than writing it in the first place. It's a small price to pay for always-on disaster recovery.

SQLite-backed DOs are available in beta starting today. You can start building with SQLite-in-DO by visiting the developer documentation and providing beta feedback via the #durable-objects channel on our Developer Discord.

Do distributed systems like SRS excite you? Would you like to be part of building them at Cloudflare? We're hiring!


How the CSI (Container Storage Interface) Works


If you work with persistent storage in Kubernetes, maybe you've seen articles about how to migrate from in-tree to CSI volumes, but aren't sure what all the fuss is about? Or perhaps you're trying to debug a stuck VolumeAttachment that won't unmount from a node, holding up your important StatefulSet rollout? A clear understanding of what the Container Storage Interface (or CSI for short) is and how it works will give you confidence when dealing with persistent data in Kubernetes, allowing you to answer these questions and more!

The Container Storage Interface is an API specification that enables developers to build custom drivers which handle the provisioning, attaching, and mounting of volumes in containerized workloads. As long as a driver correctly implements the CSI API spec, it can be used in any supported Container Orchestration system, like Kubernetes. This decouples persistent storage development efforts from core cluster management tooling, allowing for the rapid development and iteration of storage drivers across the cloud native ecosystem.

In Kubernetes, the CSI has replaced legacy in-tree volumes with a more flexible means of managing storage mediums. Previously, in order to take advantage of new storage types, one would have had to upgrade an entire cluster's Kubernetes version to access new PersistentVolume API fields for a new storage type. But now, with the plethora of independent CSI drivers available, you can add any type of underlying storage to your cluster instantly, as long as there's a driver for it.

But what if existing drivers don't provide the features that you require and you want to build a new custom driver? Maybe you're concerned about the ramifications of migrating from in-tree to CSI volumes? Or, you simply want to learn more about how persistent storage works in Kubernetes? Well, you're in the right place! This article will describe what the CSI is and detail how it's implemented in Kubernetes.

It's APIs All the Way Down

Like many things in the Kubernetes ecosystem, the Container Storage Interface is actually just an API specification. In the container-storage-interface/spec GitHub repo, you can find this spec in 2 different versions:

  1. A protobuf file that defines the API schema in gRPC terms
  2. A markdown file that describes the overall system architecture and goes into detail about each API call

What I'm going to discuss in this section is an abridged version of that markdown file, while borrowing some nice ASCII diagrams from the repo itself!

Architecture

A CSI Driver has 2 components, a Node Plugin and a Controller Plugin. The Controller Plugin is responsible for high-level volume management; creating, deleting, attaching, detaching, snapshotting, and restoring physical (or virtualized) volumes. If you're using a driver built for a cloud provider, like EBS on AWS, the driver's Controller Plugin communicates with AWS HTTPS APIs to perform these operations. For other storage types like NFS, iSCSI, ZFS, and more, the driver sends these requests to the underlying storage's API endpoint, in whatever format that API accepts.

On the other hand, the Node Plugin is responsible for mounting and provisioning a volume once it's been attached to a node. These low-level operations usually require privileged access, so the Node Plugin is installed on every node in your cluster's data plane, wherever a volume could be mounted.

The Node Plugin is also responsible for reporting metrics like disk usage back to the Container Orchestration system (referred to as the "CO" in the spec). As you might have guessed already, I'll be using Kubernetes as the CO in this post! But what makes the spec so powerful is that it can be used by any container orchestration system, like Nomad for example, as long as it abides by the contract set by the API guidelines.

The specification doc provides a few possible deployment patterns, so let's start with the most common one.

              CO "Master" Host
+-------------------------------------------+
|                                           |
|  +------------+           +------------+  |
|  |     CO     |   gRPC    | Controller |  |
|  |            +----------->   Plugin   |  |
|  +------------+           +------------+  |
|                                           |
+-------------------------------------------+

              CO "Node" Host(s)
+-------------------------------------------+
|                                           |
|  +------------+           +------------+  |
|  |     CO     |   gRPC    |    Node    |  |
|  |            +----------->   Plugin   |  |
|  +------------+           +------------+  |
|                                           |
+-------------------------------------------+

Figure 1: The Plugin runs on all nodes in the cluster: a centralized
Controller Plugin is available on the CO master host and the Node
Plugin is available on all of the CO Nodes.

Since the Controller Plugin is concerned with higher-level volume operations, it does not need to run on a host in your cluster's data plane. For example, in AWS, the Controller makes AWS API calls like ec2:CreateVolume, ec2:AttachVolume, or ec2:CreateSnapshot to manage EBS volumes. These functions can be run anywhere, as long as the caller is authenticated with AWS. All the CO needs is to be able to send messages to the plugin over gRPC. So in this architecture, the Controller Plugin is running on a "master" host in the cluster's control plane.

On the other hand, the Node Plugin must be running on a host in the cluster's data plane. Once the Controller Plugin has done its job by attaching a volume to a node for a workload to use, the Node Plugin (running on that node) will take over by mounting the volume to a well-known path and optionally formatting it. At this point, the CO is free to use that path as a volume mount when creating a new containerized process; so all data on that mount will be stored on the underlying volume that was attached by the Controller Plugin. It's important to note that the Container Orchestrator, not the Controller Plugin, is responsible for letting the Node Plugin know that it should perform the mount.

Volume Lifecycle

The spec provides a flowchart of basic volume operations, also in the form of a cool ASCII diagram:

   CreateVolume +------------+ DeleteVolume
 +------------->|  CREATED   +--------------+
 |              +---+----^---+              |
 |       Controller |    | Controller       v
+++         Publish |    | Unpublish       +++
|X|          Volume |    | Volume          | |
+-+             +---v----+---+             +-+
                | NODE_READY |
                +---+----^---+
               Node |    | Node
            Publish |    | Unpublish
             Volume |    | Volume
                +---v----+---+
                | PUBLISHED  |
                +------------+

Figure 5: The lifecycle of a dynamically provisioned volume, from
creation to destruction.

Mounting a volume is a synchronous process: each step requires the previous one to have run successfully. For example, if a volume does not exist, how could we possibly attach it to a node?

When publishing (mounting) a volume for use by a workload, the Node Plugin first requires that the Controller Plugin has successfully published a volume at a directory that it can access. In practice, this usually means that the Controller Plugin has created the volume and attached it to a node. Now that the volume is attached, it's time for the Node Plugin to do its job. At this point, the Node Plugin can access the volume at its device path to create a filesystem and mount it to a directory. Once it's mounted, the volume is considered to be published and it is ready for a containerized process to use. This ends the CSI mounting workflow.

Continuing the AWS example, when the Controller Plugin publishes a volume, it calls ec2:CreateVolume followed by ec2:AttachVolume. These two API calls allocate the underlying storage by creating an EBS volume and attaching it to a particular instance. Once the volume is attached to the EC2 instance, the Node Plugin is free to format it and create a mount point on its host's filesystem.

Here is an annotated version of the above volume lifecycle diagram, this time with the AWS calls included in the flow chart.

   CreateVolume +------------+ DeleteVolume
 +------------->|  CREATED   +--------------+
 |              +---+----^---+              |
 |       Controller |    | Controller       v
+++         Publish |    | Unpublish       +++
|X|          Volume |    | Volume          | |
+-+                 |    |                 +-+
                    |    |
 <ec2:CreateVolume> |    | <ec2:DeleteVolume>
                    |    |
 <ec2:AttachVolume> |    | <ec2:DetachVolume>
                    |    |
                +---v----+---+
                | NODE_READY |
                +---+----^---+
               Node |    | Node
            Publish |    | Unpublish
             Volume |    | Volume
                +---v----+---+
                | PUBLISHED  |
                +------------+

If a Controller wants to delete a volume, it must first wait for the Node Plugin to safely unmount the volume to preserve data and system integrity. Otherwise, if a volume is forcibly detached from a node before unmounting it, we could experience bad things like data corruption. Once the volume is safely unpublished (unmounted) by the Node Plugin, the Controller Plugin would then call ec2:DetachVolume to detach it from the node and finally ec2:DeleteVolume to delete it, assuming that you don't want to reuse the volume elsewhere.

What makes the CSI so powerful is that it does not prescribe how to publish a volume. As long as your driver correctly implements the required API methods defined in the CSI spec, it will be compatible with the CSI and by extension, be usable in COs like Kubernetes and Nomad.

Running CSI Drivers in Kubernetes

What I haven't entirely made clear yet is why the Controller and Node Plugins are plugins themselves! How does the Container Orchestrator call them, and where do they plug into?

Well, the answer depends on which Container Orchestrator you are using. Since I'm most familiar with Kubernetes, I'll be using it to demonstrate how a CSI driver interacts with a CO.

Deployment Model

Since the Node Plugin, responsible for low-level volume operations, must be running on every node in your data plane, it is typically installed using a DaemonSet. If you have heterogeneous nodes and only want to deploy the plugin to a subset of them, you can use node selectors, affinities, or anti-affinities to control which nodes receive a Node Plugin Pod. Since the Node Plugin requires root access to modify host volumes and mounts, these Pods will be running in privileged mode. In this mode, the Node Plugin can escape its container's security context to access the underlying node's filesystem when performing mounting and provisioning operations. Without these elevated permissions, the Node Plugin could only operate inside of its own containerized namespace without the system-level access that it requires to provision volumes on the node.

The Controller Plugin is usually run in a Deployment because it deals with higher-level primitives like volumes and snapshots, which don't require filesystem access to every single node in the cluster. Again, let's think about the AWS example I used earlier. If the Controller Plugin is just making AWS API calls to manage volumes and snapshots, why would it need access to a node's root filesystem? Most Controller Plugins are stateless and highly-available, both of which lend themselves to the Deployment model. The Controller also does not need to be run in a privileged context.

Now that we know how CSI plugins are deployed in a typical cluster, it's time to focus on how Kubernetes calls each plugin to perform CSI-related operations. A series of sidecar containers, registered with the Kubernetes API server to react to different events across the cluster, are deployed alongside each Controller and Node Plugin. In a way, this is similar to the typical Kubernetes controller pattern, where controllers react to changes in cluster state and attempt to reconcile the current cluster state with the desired one.

There are currently 6 different sidecars that work alongside each CSI driver to perform specific volume-related operations. Each sidecar registers itself with the Kubernetes API server and watches for changes in a specific resource type. Once the sidecar has detected a change that it must act upon, it calls the relevant plugin with one or more API calls from the CSI specification to perform the desired operations.
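
Schematically, each sidecar follows the same loop. The sketch below is a JavaScript rendering of that pattern, not the real Go implementations; watchResource and callPlugin are hypothetical stand-ins for the Kubernetes watch API and the CSI gRPC client:

// Watch one resource type and translate each relevant change into a CSI gRPC call.
async function runProvisionerSidecar(watchResource, callPlugin) {
  for await (const event of watchResource("PersistentVolumeClaim")) {
    if (event.type === "ADDED") {
      await callPlugin("CreateVolume", { name: event.object.metadata.name });
    } else if (event.type === "DELETED") {
      await callPlugin("DeleteVolume", { name: event.object.metadata.name });
    }
  }
}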

Here is a table of the sidecars that run alongside a Controller Plugin:

Sidecar Name           K8s Resources Watched      CSI API Endpoints Called
external-provisioner   PersistentVolumeClaim      CreateVolume, DeleteVolume
external-attacher      VolumeAttachment           Controller(Un)PublishVolume
external-snapshotter   VolumeSnapshot(Content)    CreateSnapshot, DeleteSnapshot
external-resizer       PersistentVolumeClaim      ControllerExpandVolume

How do these sidecars work together? Let's use an example of a StatefulSet to demonstrate. In this example, we're dynamically provisioning our PersistentVolumes (PVs) instead of mapping PersistentVolumeClaims (PVCs) to existing PVs. We start at the creation of a new StatefulSet with a VolumeClaimTemplate.

---
apiVersion: apps/v1
kind: StatefulSet
spec:
  volumeClaimTemplates:
  - metadata:
      name: www
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: "my-storage-class"
      resources:
        requests:
          storage: 1Gi

Creating this StatefulSet will trigger the creation of a new PVC based on the above template. Once the PVC has been created, the Kubernetes API will notify the external-provisioner sidecar that this new resource was created. The external-provisioner will then send a CreateVolume message to its neighbor Controller Plugin over gRPC. From here, the CSI driver's Controller Plugin takes over by processing the incoming gRPC message and will create a new volume based on its custom logic. In the AWS EBS driver, this would be an ec2:CreateVolume call.

At this point, the control flow moves to the built-in PersistentVolume controller, which will create a matching PV and bind it to the PVC. This allows the StatefulSet's underlying Pod to be scheduled and assigned to a Node.

Here, the external-attacher sidecar takes over. It will be notified of the new PV and call the Controller Plugin's ControllerPublishVolume endpoint, attaching the volume to the StatefulSet's assigned node. This is the equivalent of ec2:AttachVolume in AWS.

At this point, we have an EBS volume that is attached to an EC2 instance, all based on the creation of a StatefulSet, PersistentVolumeClaim, and the work of the AWS EBS CSI Controller Plugin.

There is only one unique sidecar that is deployed alongside the Node Plugin; the node-driver-registrar. This sidecar, running as part of a DaemonSet, registers the Node Plugin with a Node's kubelet. During the registration process, the Node Plugin will inform the kubelet that it is able to mount volumes using the CSI driver that it is part of. The kubelet itself will then wait until a Pod is scheduled to its corresponding Node, at which point it is then responsible for making the relevant CSI calls (NodePublishVolume) to the Node Plugin over gRPC.

There is also a livenessprobe sidecar that runs in both the Controller and Node Plugin Pods that monitors the health of the CSI driver and reports back to the Kubernetes Liveness Probe mechanism.

Communication Over Sockets

How do these sidecars communicate with the Controller and Node Plugins? Over gRPC through a shared socket! So each sidecar and plugin contains a volume mount pointing to a single unix socket.

CSI Controller Deployment

This diagram highlights the pluggable nature of CSI Drivers. To replace one driver with another, all you have to do is simply swap the CSI Driver container with another and ensure that it's listening to the unix socket that the sidecars are sending gRPC messages to. Because all drivers advertise their own different capabilities and communicate over the shared CSI API contract, it's literally a plug-and-play solution.

Conclusion

In this article, I only covered the high-level concepts of the Container Storage Interface spec and implementation in Kubernetes. While hopefully it has provided a clearer understanding of what happens once you install a CSI driver, writing one requires significant low-level knowledge of both your nodes' operating system(s) and the underlying storage mechanism that your driver is implementing. Luckily, CSI drivers exist for a variety of cloud providers and distributed storage solutions, so it's likely that you can find a CSI driver that already fulfills your requirements. But it always helps to know what's happening under the hood in case your particular driver is misbehaving.

If this article interests you and you want to learn more about the topic, please let me know! I'm always happy to answer questions about CSI Drivers, Kubernetes Operators, and a myriad of other DevOps-related topics.


Using the Page Visibility API

This post takes a look at what page visibility is, how you can use the Page Visibility API in your applications, and describes pitfalls to avoid if you build features around this functionality.


A few notes on AWS Nitro Enclaves: Attack surface


By Paweł Płatek

In the race to secure cloud applications, AWS Nitro Enclaves have emerged as a powerful tool for isolating sensitive workloads. But with great power comes great responsibility—and potential security pitfalls. As pioneers in confidential computing security, we at Trail of Bits have scrutinized the attack surface of AWS Nitro Enclaves, uncovering potential bugs that could compromise even these hardened environments.

This post distills our hard-earned insights into actionable guidance for developers deploying Nitro Enclaves. After reading, you’ll be equipped to:

  • Identify and mitigate key security risks in your enclave deployment
  • Implement best practices for randomness, side-channel protection, and time management
  • Avoid common pitfalls in virtual socket handling and attestation

We’ll cover a number of topics, including vsock communication, randomness, side channels, memory, and time.

Whether you’re new to Nitro Enclaves or looking to harden existing deployments, this guide will help you navigate the unique security landscape of confidential computing on AWS.

A brief threat model

First, a brief threat model. Enclaves can be attacked from the parent Amazon EC2 instance, which is the only component that has direct access to an enclave. In the context of an attack on an enclave, we should assume that the parent instance’s kernel (including its nitro_enclaves drivers) is controlled by the attacker. DoS attacks from the instance are not really a concern, as the parent can always shut down its enclaves.

If the EC2 instance forwards user traffic from the internet, then attacks on its enclaves could come from that direction and could involve all the usual attack vectors (business-logic, memory corruption, cryptographic, etc.). And in the other direction, users could be targeted by malicious EC2 instances with impersonation attacks.

In terms of trust zones, an enclave should be treated as a single trust zone. Enclaves run normal Linux and can theoretically use its access control features to “draw lines” within themselves. But that would be pointless—adversarial access (e.g., via a supply-chain attack) to anything inside the enclave would diminish the benefits of its strong isolation and of attestation. Therefore, compromise of a single enclave component should be treated as a total enclave compromise.

Finally, the hypervisor is trusted—we must assume it behaves correctly and not maliciously.

Figure 1: A simplified model of the AWS Nitro Enclaves system

Vsocks

The main entrypoint to an enclave is the local virtual socket (vsock). Only the parent EC2 instance can use the socket. Vsocks are managed by the hypervisor—the hypervisor provides the parent EC2 instance’s and the enclave’s kernels with /dev/vsock device nodes.

Vsocks are identified by a context identifier (CID) and port. Every enclave must use a unique CID (which can be set during initialization) and can listen on multiple ports. There are a few predefined CIDs:

  • VMADDR_CID_HYPERVISOR = 0
  • VMADDR_CID_LOCAL = 1
  • VMADDR_CID_HOST = 2
  • VMADDR_CID_PARENT = 3 (the parent EC2 instance)
  • VMADDR_CID_ANY = 0xFFFFFFFF = -1U (listen on all CIDs)

Enclaves usually use only the VMADDR_CID_PARENT CID (to send data) and the VMADDR_CID_ANY CID (to listen for data). An example use of the VMADDR_CID_PARENT can be found in the init.c module of AWS’s enclaves SDK—the enclave sends a “heartbeat” signal to the parent EC2 instance just after initialization. The signal is handled by the nitro-cli tool.

Standard socket-related issues are the main issues to worry about when it comes to vsocks. When developing an enclave, consider the following to ensure such issues cannot enable certain attack vectors:

  • Does the enclave accept connections asynchronously (with multithreading)? If not, a single user may block other users from accessing the enclave for a long period of time.
  • Does the enclave time out connections? If not, a single user may persistently occupy a socket or open multiple connections to the enclave and drain available resources (like file descriptors).
  • If the enclave uses multithreading, is its state synchronization correctly implemented?
  • Does the enclave handle errors correctly? Reading from a socket with the recv method is especially tricky. A common pattern is to loop over the recv call until the desired number of bytes is received, but this pattern should be carefully implemented:
    • If the EINTR error is returned, the enclave should retry the recv call. Otherwise, the enclave may drop valid and live connections.
    • If there is no error but the returned length is 0, the enclave should break the loop. Otherwise, the peer may shut down the connection before sending the expected number of bytes, making the enclave loop infinitely.
    • If the socket is non-blocking, then reading data correctly is even more tricky.

The main risk of these issues is DoS. The parent EC2 instance may shut down any of its enclaves, so the actual risks are present only if a DoS can be triggered by external users. Providing timely access to the system is the responsibility of both the enclave and the EC2 instance communicating with the enclave.

Another vulnerability class involving vsocks is CID confusion: if an EC2 instance runs multiple enclaves, it may send data to the wrong one (e.g., due to a race condition issue). However, even if such a bug exists, it should not pose much risk or contribute much to an enclave’s attack surface, because traffic between users and the enclave should be authenticated end to end.

Finally, note that enclaves use the SOCK_STREAM socket type by default. If you change the type to SOCK_DGRAM, do some research to learn about the security properties of this communication type.

Randomness

Enclaves must have access to secure randomness. The word “secure” in this context means that adversaries don’t know or control all the entropy used to produce random data. On Linux, a few entropy sources are mixed together by the kernel. Among them are the CPU-provided RDRAND/RDSEED source and platform-provided hardware random number generators (RNGs). The AWS Nitro Trusted Platform Module provides its own hardware RNG (called nsm-hwrng).

Figure 2: Randomness sources in the Linux kernel

The final randomness can be obtained via the getrandom system call or from (less reliable) /dev/{u}random devices. There is also the /dev/hwrng device, which gives more direct access to the selected hardware RNG. This device should not be used by user-space applications.

When a new hardware RNG is registered by the kernel, it is used right away to add entropy to the system. A list of available hardware RNGs can be found in the /sys/class/misc/hw_random/rng_available file. One of the registered RNGs is selected automatically to periodically add entropy and is indicated in the /sys/devices/virtual/misc/hw_random/rng_current file.

We recommend configuring your enclaves to explicitly check that the current RNG (rng_current) is set to nsm-hwrng. This check will ensure that the AWS Nitro RNG was successfully registered and that it’s the one the kernel uses periodically to add entropy.
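
For example, a minimal runtime check might look like this (a Node.js sketch, assuming a JavaScript runtime is available inside your enclave):

import { readFileSync } from "node:fs";

// Verify that the kernel's currently selected hardware RNG is the Nitro one.
const current = readFileSync(
    "/sys/devices/virtual/misc/hw_random/rng_current", "utf8").trim();
if (current !== "nsm-hwrng") {
  throw new Error(`unexpected hardware RNG: ${current}`);
}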

To further boost the security of your enclave’s randomness, have it pull entropy from external sources whenever there are convenient sources available. A common external source is the AWS Key Management Service, which provides a convenient GenerateRandom method that enclaves can use to bring in entropy over an encrypted channel.
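
A sketch of what that might look like with the AWS SDK for JavaScript (assuming the enclave can reach KMS, typically through a vsock proxy; mix the result into your own randomness pool rather than using it as-is):

import { KMSClient, GenerateRandomCommand } from "@aws-sdk/client-kms";

const kms = new KMSClient({ region: "us-east-1" });

// Ask KMS for 32 bytes of randomness over TLS and treat it as one extra
// entropy input, not as the sole source.
const { Plaintext } = await kms.send(new GenerateRandomCommand({ NumberOfBytes: 32 }));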

If you want to follow NIST/AIS standards (see section 5.3.1 in “Documentation and Analysis of the Linux Random Number Generator”) or suspect issues with the RDRAND/RDSEED instructions (see also this LWN article and this tweet), you can disable the random.trust_{bootloader,cpu} kernel parameters. That will inform the kernel not to include these sources for estimation of available entropy.

Lastly, make sure that your enclaves use a kernel version greater than 5.17.12, as important changes were introduced to the kernel’s random algorithm in that release.

Side channels

Application-level timing side-channel attacks are a threat to enclaves, as they are to any application. Applications running inside enclaves must process confidential data in constant time. Attacks from the parent EC2 instance can use almost system-clock-precise time measurements, so don’t count on network jitter for mitigations. You can read more about timing attack vectors in our blog post “The life of an optimization barrier.”

Also, though this doesn’t really constitute a side-channel attack, error messages returned by an enclave can be used by attackers to reason about the enclave’s state. Think about issues like padding oracles and account enumeration. We recommend keeping errors returned by enclaves as generic as possible. How generic errors should be will depend on the given business requirements, as users of any application will need some level of error tracing.

CPU memory side channels

The main type of side-channel attack to know about involves CPU memory. CPUs share some memory—most notably the cache lines. If memory is simultaneously accessible to two components from different trust zones—like an enclave and its parent EC2 instance—then it may be possible for one component to indirectly leak the other component’s data via measurements of memory access patterns. Even if an application processes secret data in constant time, attackers with access to this type of side channel can exploit data-dependent branching.

In a typical architecture, CPUs can be categorized into NUMA nodes, CPU cores, and CPU threads. The smallest physical processing unit is the CPU core. The core may have multiple logical threads (virtual CPUs)—the smallest logical processing units—and threads share L1 and L2 cache lines. The L3 line (also called the last-level cache) is shared between all cores in a NUMA node.

Figure 3: Example CPU arrangement of a system, obtained by the lstopo command

Parent EC2 instances may have been allocated only a few CPU cores from a NUMA node. Therefore, they may share an L3 cache with other instances. However, the AWS white paper “The Security Design of the AWS Nitro System” claims that the L3 cache is never shared simultaneously. Unfortunately, there is not much more information on the topic.

Figure 4: An excerpt from the AWS white paper, stating that instances with one-half the max amount of CPUs should fill a whole CPU core (socket?)

What about CPUs in enclaves? CPUs are taken from the parent EC2 instance and assigned to an enclave. According to the AWS and nitro-cli source code, the hypervisor enforces the following:

  • The CPU #0 core (all its threads) is not assignable to enclaves.
  • Enclaves must use full cores.
  • All cores assigned to an enclave must be from the same NUMA node.

In the worst case, an enclave will share the L3 cache with its parent EC2 instance (or with other enclaves). However, whether the L3 cache can be used to carry out side-channel attacks is debatable. On one hand, the AWS white paper doesn’t make a big deal of this attack vector. On the other hand, recent research indicates the practicality of such an attack (see “Last-Level Cache Side-Channel Attacks Are Feasible in the Modern Public Cloud”).

If you are very concerned about L3 cache side-channel attacks, you can run the enclave on a full NUMA node. To do so, you would have to allocate more than one full NUMA node to the parent EC2 instance so that one NUMA node can be used for the enclave while saving some CPUs on the other NUMA node for the parent. Note that this mitigation is resource-inefficient and costly.

Alternatively, you can experiment with Intel’s Cache Allocation Technology (CAT) to isolate the enclave’s L3 cache (see the intel-cmt-cat software) from the parent. Note, however, that we don’t know whether CAT can be changed dynamically for a running enclave—that would render this solution useless.

If you implement any of the above mitigations, you will have to add relevant information to the attestation. Otherwise, users won’t be able to ensure that the L3 side-channel attack vector was really mitigated.

Anyway, you want your security-critical code (like cryptography) to be implemented with secrets-independent memory access patterns. Both hardware- and software-level security controls are important here.

Memory

Memory for enclaves is carved out from parent EC2 instances. It is the hypervisor’s responsibility to protect access to an enclave’s memory and to clear it after it’s returned to the parent. When it comes to enclave memory as an attack vector, developers really only need to worry about DoS attacks. Applications running inside an enclave should have limits on how much data external users can store. Otherwise, a single user may be able to consume all of an enclave’s available memory and crash the enclave (try running cat /dev/zero inside the enclave to see how it behaves when a large amount of memory is consumed).

So how much space does your enclave have? The answer is a bit complicated. First of all, the enclave’s init process doesn’t mount a new root filesystem, but keeps the initial initramfs and chroots to a directory (though there is a pending PR that will change this behavior once merged). This puts some limits on the filesystem’s size. Also, data saved in the filesystem will consume available RAM.

You can check the total available RAM and filesystem space by executing the free command inside the enclave. The filesystem’s size limit should be around 40–50% of that total space. You can confirm that by filling the whole filesystem’s space and checking how much data ends up being stored there:

# fill the filesystem with zeros until the write fails
dd count=9999999999 if=/dev/zero > /fillspace
# then check how much data was actually stored
du -h -d1 /

Another issue with memory is that the enclave doesn’t have any persistent storage. Once it is shut down, all its data is lost. Moreover, AWS Nitro doesn’t provide any specific data sealing mechanism. It’s your application’s responsibility to implement it. Read our blog post “A trail of flipping bits” for more information.

Time

A less common source of security issues is an enclave’s time source—namely, from where the enclave gets its time. An attacker who can control an enclave’s time could perform rollback and replay attacks. For example, the attacker could switch the enclave’s time to the past and make the enclave accept expired TLS certificates.

Getting a trusted source of time may be a somewhat complex problem in the space of confidential computing. Fortunately, enclaves can rely on the trusted hypervisor for delivery of secure clock sources. From the developer’s side, there are only three actions worth taking to improve the security and correctness of your enclave’s time sources:

  • Ensure that current_clocksource is set to kvm-clock in the enclave’s kernel configuration; consider even adding an application-level runtime check for the clock, in case something goes wrong during enclave bootstrapping and the enclave ends up with a different clock source (a minimal check is sketched after this list).
  • Enable the Precision Time Protocol (PTP) for better clock synchronization between the enclave and the hypervisor. It’s similar to the Network Time Protocol (NTP), but the time is delivered directly by the hypervisor rather than over the network, so it should be more secure (it has a smaller attack surface) and easier to set up than NTP.
  • For security-critical functionalities (like replay protections) use Unix time. Be careful with UTC and time zones, as daylight saving time and leap seconds may “move time backwards.”
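
Here is a minimal sketch of such an application-level runtime check in Rust; it simply reads the same sysfs file discussed in the next section and refuses to continue if the enclave booted with anything other than kvm-clock.

use std::fs;

/// Minimal sketch of a clock-source check (assumes a standard Linux sysfs
/// layout inside the enclave).
fn assert_kvm_clock() -> Result<(), String> {
    let path = "/sys/devices/system/clocksource/clocksource0/current_clocksource";
    let source = fs::read_to_string(path)
        .map_err(|e| format!("cannot read {path}: {e}"))?;
    if source.trim() == "kvm-clock" {
        Ok(())
    } else {
        Err(format!("unexpected clock source: {}", source.trim()))
    }
}

fn main() {
    // Abort early if the enclave booted with the wrong clock source.
    assert_kvm_clock().expect("clock-source check failed");
}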

Why kvm-clock?

Machines using an x86 architecture can have a few different sources of time. We can use the following command to check the sources available to enclaves:

cat /sys/devices/system/clocksource/clocksource0/available_clocksource

Enclaves should have two sources: tsc and kvm-clock (you can see them if you run a sample enclave and check its sources); the latter is enabled by default, as can be checked in the current_clocksource file. How do these sources work?

The TSC mechanism is based on the Time Stamp Counter register. It is a per-CPU monotonic counter implemented as a model-specific register (MSR). Every (virtual) CPU has its own register. The counter increments with every CPU cycle (more or less). Linux computes the current time based on the counter scaled by the CPU’s frequency and some initial date.

We can read (and write!) TSC values if we have root privileges. To do so, we need the TSC MSR’s offset (16, i.e., 0x10) and its size (8 bytes). MSR registers can be accessed through the /dev/cpu device files:

# read the 8-byte TSC value from MSR 0x10 (offset 16) on CPU 0
dd iflag=count_bytes,skip_bytes count=8 skip=16 if=/dev/cpu/0/msr
# write a new 8-byte value back to the same MSR
dd if=<(echo "34d6 f1dc 8003 0000" | xxd -r -p) of=/dev/cpu/0/msr seek=16 oflag=seek_bytes

The TSC can also be read with the clock_gettime method using the CLOCK_MONOTONIC_RAW clock ID, and with the RDTSC assembly instruction.
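
To illustrate the userland path, here is a minimal Rust sketch that reads the counter with the RDTSC instruction (x86_64 only); it bypasses whatever clock source the kernel has selected.

// Minimal sketch: reading the Time Stamp Counter directly from userland on
// x86_64, bypassing the kernel's selected clock source entirely.
#[cfg(target_arch = "x86_64")]
fn read_tsc() -> u64 {
    // RDTSC is not a privileged instruction, so no root access is required.
    unsafe { std::arch::x86_64::_rdtsc() }
}

#[cfg(target_arch = "x86_64")]
fn main() {
    let t0 = read_tsc();
    let t1 = read_tsc();
    println!("tsc = {t0}, delta between two reads = {}", t1 - t0);
}

#[cfg(not(target_arch = "x86_64"))]
fn main() {}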

Theoretically, if we change the TSC, the wall clock reported by clock_gettime with the CLOCK_REALTIME clock ID, by the gettimeofday function, and by the date command should change. However, the Linux kernel works hard to try to make TSCs behave reasonably and be synchronized with each other (for example, check out the tsc watchdog code and functionality related to the MSR_IA32_TSC_ADJUST register). So breaking the clock is not that easy.

The TSC can be used to track time elapsed, but where do enclaves get the “some initial date” from which the time elapsed is counted? Usually, in other systems, that date is obtained using the NTP. However, enclaves do not have out-of-the-box access to the network and don’t use the NTP (see slide 26 of this presentation from AWS’s 2020 re:Invent conference).

Figure 5: Possible sources of time for an enclave

With the tsc clock and no NTP, the initial date is somewhat randomly selected—the truth is we haven’t determined where it comes from. You can force an enclave to boot without the kvm-clock by passing the no-kvmclock no-kvmclock-vsyscall kernel parameters (but note that these parameters should not be provided at runtime) and check the initial date for yourself. In our experiments, the date was:

Tue Nov 30 00:00:00 UTC 1999

As you can see, the TSC mechanism doesn’t work well with enclaves. Moreover, it breaks badly when the machine is virtualized. Because of that, AWS introduced the kvm-clock as the default source of time for enclaves. It is an implementation of the paravirtual clock driver (pvclock) protocol (see this article and this blog post for more info on pvclock). With this protocol, the host (the AWS Nitro hypervisor in our case) provides the pvclock_vcpu_time_info structure to the guest (the enclave). The structure contains information that enables the guest to adjust its time measurements—most notably, the host’s wall clock (system_time field), which is used as the initial date.

Interestingly, the guest’s userland applications can use the TSC mechanism even if the kvm-clock is enabled. That’s because the RDTSC instruction is (usually) not emulated and therefore may provide non-adjusted TSC register readings.

Please note that if your enclaves use different clock sources or enable NTP, you should do some additional research to see if there are related security issues.

Attestation

Cryptographic attestation is the source of trust for end users. It is essential that users correctly parse and validate attestations. Fortunately, AWS provides good documentation on how to consume attestations.

The most important attestation data is protocol-specific, but we have a few generally applicable tips for developers to keep in mind (in addition to what’s written in the AWS documentation):

  • The enclave should enforce a minimal nonce length.
  • Users should check the timestamp provided in the attestation in addition to nonces (a minimal freshness check is sketched after this list).
  • The attestation’s timestamp should not be used to reason about the enclave’s time. This timestamp may differ from the enclave’s time, as the former is generated by the hypervisor, and the latter by whatever clock source the enclave is using.
  • Don’t use RSA for the public_key feature.
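
To make the nonce and timestamp checks concrete, here is a minimal Rust sketch. The field names and limits are hypothetical: the real attestation document is a COSE-signed CBOR structure whose signature and certificate chain must be validated first, as described in the AWS documentation.

use std::time::{SystemTime, UNIX_EPOCH};

// Hypothetical, already-verified attestation fields.
struct AttestationFields {
    timestamp_ms: u64, // set by the hypervisor, in milliseconds since the Unix epoch
    nonce: Vec<u8>,    // echoed back from the attestation request
}

const MIN_NONCE_LEN: usize = 16;        // enforce a minimal nonce length
const MAX_AGE_MS: u64 = 5 * 60 * 1000;  // reject stale attestations (illustrative)

fn check_attestation(doc: &AttestationFields, expected_nonce: &[u8]) -> Result<(), String> {
    if doc.nonce.len() < MIN_NONCE_LEN || doc.nonce.as_slice() != expected_nonce {
        return Err("bad or missing nonce".into());
    }
    let now_ms = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map_err(|e| e.to_string())?
        .as_millis() as u64;
    // The timestamp is produced by the hypervisor's clock, so allow some skew
    // relative to the verifier's clock.
    if now_ms.saturating_sub(doc.timestamp_ms) > MAX_AGE_MS {
        return Err("attestation document is too old".into());
    }
    Ok(())
}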

The NSM driver

Your enclave applications will use the NSM driver, which is accessible via the /dev/nsm node. Its source code can be found in the aws-nitro-enclaves-sdk-bootstrap and kernel repositories. Applications communicate with the driver via the IOCTL system call and can use the nsm-api library to do so.

Developers should be aware that applications running inside an enclave may misuse the driver or the library. However, there isn’t much that can go wrong if developers take these steps:

  • The driver lets you extend and lock more platform configuration registers (PCRs) than the basic 0–4 and 8 PCRs. Locked PCRs cannot be extended, and they are included in enclave attestations. How these additional PCRs are used depends on how you configure your application. Just make sure that it distinguishes between locked and unlocked ones.
  • Remember to make the application check the PCRs’ lock state when sending the DescribePCR request to the NSM driver. Otherwise, it may rely on a PCR that can still be manipulated (see the sketch after this list).
  • Requests and responses are CBOR-encoded. Make sure to get the encoding right. Incorrectly decoded responses may provide false data to your application.
  • It is not recommended to use the nsm_get_random method directly. It skips the kernel’s algorithm for mixing multiple entropy sources and therefore is more prone to errors. Instead, use common randomness APIs (like getrandom).
  • The nsm_init method returns -1 on error, which is an unusual behavior in Rust, so make sure your application accounts for that.
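
As an example of the lock-state check, here is a minimal Rust sketch built on the nsm-api library; the exact request and response shapes are assumptions based on its public API and should be checked against the crate version you use.

use aws_nitro_enclaves_nsm_api::api::{Request, Response};
use aws_nitro_enclaves_nsm_api::driver::{nsm_exit, nsm_init, nsm_process_request};

fn pcr_is_locked(index: u16) -> Result<bool, String> {
    let fd = nsm_init();
    // nsm_init signals failure by returning -1 rather than a Result.
    if fd < 0 {
        return Err("failed to open /dev/nsm".into());
    }
    let response = nsm_process_request(fd, Request::DescribePCR { index });
    nsm_exit(fd);
    match response {
        // Only rely on PCRs whose lock flag is set.
        Response::DescribePCR { lock, .. } => Ok(lock),
        _ => Err("unexpected response from the NSM driver".into()),
    }
}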

That’s (not) all folks

Securing AWS Nitro Enclaves requires vigilance across multiple attack vectors. By implementing the recommendations in this post—from hardening virtual sockets to verifying randomness sources—you can significantly reduce the risk of compromise to your enclave workloads, helping shape a more secure future for confidential computing.

Key takeaways:

  1. Treat enclaves as a single trust zone and implement end-to-end security.
  2. Mitigate side-channel risks through proper CPU allocation and constant-time processing.
  3. Verify enclave entropy sources at runtime.
  4. Use the right time sources inside the enclave.
  5. Implement robust attestation practices, including nonce and timestamp validation.

For more security considerations, see our first post on enclave images and attestation. If your enclave uses external systems—like AWS Key Management Service or AWS Certificate Manager—review the systems and supporting tools for additional security footguns.

We encourage you to critically evaluate your own Nitro Enclave deployments. Trail of Bits offers in-depth security assessments and custom hardening strategies for confidential computing environments. If you’re ready to take your Nitro Enclaves’ security to the next level, contact us to schedule a consultation with our experts and ensure that your sensitive workloads remain truly confidential.


Multiple Anchors | CSS-Tricks


Only Chris, right? You’ll want to view this in a Chromium browser:

This is exactly the sort of thing I love, not for its practicality (cuz it ain’t), but for how it illustrates a concept. Generally, tutorials and demos try to follow the “rules” — whatever those may be — yet breaking them helps you understand how a certain thing works. This is one of those.

The concept is pretty straightforward: one target element can be attached to multiple anchors on the page.

<div class="anchor-1"></div>
<div class="anchor-2"></div>
<div class="target"></div>

We’ve gotta register the anchors and attach the .target to them:

.anchor-1 {
  anchor-name: --anchor-1;
}

.anchor-2 {
  anchor-name: --anchor-2;
}

.target {
  
}

Wait, wait! I didn’t attach the .target to the anchors. That’s because we have two ways to do it. One is using the position-anchor property.

.target {
  position-anchor: --anchor-1;
}

That establishes a target-anchor relationship between the two elements. But it only accepts a single anchor value. Hmm. We need more than that. That’s what the anchor() function can do. Well, it doesn’t take multiple values, but we can declare it multiple times on different inset properties, each referencing a different anchor.

.target {
  top: anchor(--anchor-1 bottom);
}

The second piece of anchor()’s function is the anchor edge we’re positioned against, and it’s gotta be some sort of physical or logical inset — top, bottom, start, end, inside, outside, etc. — or a percentage. We’re basically saying, “Take that .target and slap its top edge against --anchor-1’s bottom edge.”

That also works for other inset properties:

.target {
  top: anchor(--anchor-1 bottom);
  left: anchor(--anchor-1 right);
  bottom: anchor(--anchor-2 top);
  right: anchor(--anchor-2 left);
}

Notice how both anchors are declared on different properties by way of anchor(). That’s rad. But we aren’t actually anchored yet because the .target is just like any other element that participates in the normal document flow. We have to yank it out with absolute positioning for the inset properties to take hold.

.target {
  position: absolute;

  top: anchor(--anchor-1 bottom);
  left: anchor(--anchor-1 right);
  bottom: anchor(--anchor-2 top);
  right: anchor(--anchor-2 left);
}

In his demo, Chris cleverly attaches the .target to two <textarea> elements. What makes it clever is that <textarea> allows you to click and drag it to change its dimensions. The two of them are absolutely positioned, one pinned to the viewport’s top-left edge and one pinned to the bottom-right.

If we attach the .target's top and left edges to --anchor-1‘s bottom and right edges, then attach the target's bottom and right edges to --anchor-2‘s top and left edges, we’re effectively anchored to the two <textarea> elements. This is what allows the .target element to stretch with the <textarea> elements when they are resized.

But there’s a small catch: a <textarea> is resized from its bottom-right corner. The second <textarea> is positioned in a way where the resizer isn’t directly attached to the .target. If we rotate(180deg), though, it’s all good.

Again, you’ll want to view that in a Chromium browser at the time I’m writing this. Here’s a clip instead if you prefer.

That’s just a background-color on the .target element. We can put a little character in there instead as a background-image like Chris did to polish this off.

Fun, right?! It still blows my mind this is all happening in CSS. It wasn’t many days ago that something like this would’ve been a job for JavaScript.

Direct Link →


Scaling virtio-blk disk I/O with IOThread Virtqueue Mapping


This article covers the IOThread Virtqueue Mapping feature for Kernel-based virtual machine (KVM) guests that was introduced in Red Hat Enterprise Linux (RHEL) 9.4.

The problem

Modern storage evolved to keep pace with growing numbers of CPUs by providing multiple queues through which I/O requests can be submitted. This allows CPUs to submit I/O requests and handle completion interrupts locally. The result is good performance and scalability on machines with many CPUs.

Although virtio-blk devices in KVM guests have multiple queues by default, they do not take advantage of multi-queue on the host. For guests with the <driver io='native' …> libvirt domain XML setting, I/O requests from all queues are processed in a single thread on the host. This single thread can become a bottleneck for I/O-bound workloads.

KVM guests can now benefit from multiple host threads for a single device through the new IOThread Virtqueue Mapping feature. This improves I/O performance for workloads where the single thread is a bottleneck. Guests with many vCPUs should use this feature to take advantage of additional capacity provided by having multiple threads.

If you are interested in the QEMU internals involved in developing this feature, you can find out more in this blog post and this KVM Forum presentation. Making QEMU’s block layer thread safe was a massive undertaking that we are proud to have contributed upstream.

How IOThread Virtqueue Mapping works

IOThread Virtqueue Mapping lets users assign individual virtqueues to host threads, called IOThreads, so that a virtio-blk device is handled by more than one thread. Each virtqueue can be assigned to one IOThread.

Most users will opt for round-robin assignment so that virtqueues are automatically spread across a set of IOThreads. Figure 1 illustrates how 4 queues are assigned in round-robin fashion across 2 IOThreads.

Figure 1: A virtio-blk device with 4 queues assigned to 2 IOThreads.

The libvirt domain XML for this configuration looks like this:

<domain>
  …
  <vcpu>4</vcpu>
  <iothreads>2</iothreads>
  …
  <devices>
    <disk …>
      <driver name='qemu' cache='none' io='native' …>
        <iothreads>
          <iothread id='1'></iothread>
          <iothread id='2'></iothread>
        </iothreads>
      </driver>
      …
    </disk>
  </devices>
</domain>

More details on the syntax can be found in the libvirt documentation.

Configuration tips

The following recommendations are based on our experience developing and benchmarking this feature:

  • Use 4-8 IOThreads. Usually this is sufficient to saturate disks. Adding more threads beyond the point of saturation does not increase performance and may harm it.

  • Share IOThreads between devices unless you know in advance that certain devices are heavily utilized. Keeping a few IOThreads busy but not too busy is ideal.

  • Pin IOThreads away from vCPUs with <iothreadpin> and <vcpupin> if you have host CPUs to spare (a sketch follows this list). IOThreads need to respond quickly when the guest submits I/O. Therefore they should not compete for CPU time with the guest’s vCPU threads.

  • Use <driver io='native' cache='none' …>. IOThread Virtqueue Mapping was designed for io='native'. Using io='threads' is not recommended, as it does not combine with IOThread Virtqueue Mapping in a useful way.
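
As an illustration of the pinning tip above (the CPU numbers are arbitrary examples, matching the 4-vCPU/2-IOThread configuration shown earlier), the <cputune> section of the domain XML might look like this, keeping vCPUs and IOThreads on disjoint host CPUs:

<cputune>
  <vcpupin vcpu='0' cpuset='0'/>
  <vcpupin vcpu='1' cpuset='1'/>
  <vcpupin vcpu='2' cpuset='2'/>
  <vcpupin vcpu='3' cpuset='3'/>
  <iothreadpin iothread='1' cpuset='8'/>
  <iothreadpin iothread='2' cpuset='9'/>
</cputune>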

Performance

The following random read disk I/O benchmark compares IOThread Virtqueue Mapping with 2 and 4 IOThreads against a guest without IOThread Virtqueue Mapping (only 1 IOThread). The guest was configured with 8 vCPUs all submitting I/O in parallel. See Figure 2.

Figure 2: Random read 4 KB benchmark results for iodepth 1 and 64 with IOPS increasing when comparing 1, 2, and 4 IOThreads.

The most important fio benchmark options are shown here:

fio --ioengine=libaio --rw=randread --bs=4k --numjobs=8 --direct=1
    --cpus_allowed=0-7 --cpus_allowed_policy=split

This microbenchmark shows that when 1 IOThread is unable to saturate a disk, adding more IOThreads with IOThread Virtqueue Mapping is a significant improvement. Virtqueues were assigned round-robin to the IOThreads. The disk was an Intel Optane SSD DC P4800X and the guest was running Fedora 39 x86_64. The libvirt domain XML, fio options, benchmark output, and an Ansible playbook are available here.

Real workloads may benefit less depending on how I/O bound they are and whether they submit I/O from multiple vCPUs. We recommend benchmarking your workloads to understand the effect of IOThread Virtqueue Mapping.

A companion blog post explores database performance with IOThread Virtqueue Mapping.

Conclusion

The new IOThread Virtqueue Mapping feature in RHEL 9.4 improves scalability of disk I/O for guests with many vCPUs. Enabling this feature on your KVM guests with virtio-blk devices can boost performance of I/O bound workloads.

