Optimizations that matter (and myths that don’t)
Introduction
When we first announced to our engineering team that we needed to handle 1 million requests per second on our Node.js API server, we were met with skepticism. “Just migrate to Go or Rust,” some suggested. “You’ll need a cluster of at least 50 machines,” others claimed. The conventional wisdom was clear: Node.js wasn’t built for this level of throughput.
Six weeks later, we were processing 1.2 million requests per second on a single server with headroom to spare.
This article documents our journey, focusing on the optimizations that delivered real results and debunking the myths that wasted our time. While your specific bottlenecks may differ, the methodical approach to performance optimization remains universally applicable.
The Challenge
Our payment processing API needed a major overhaul. As our platform grew to serve over 30 million users, our existing infrastructure (a cluster of 42 Node.js servers) was struggling to keep up with peak traffic. We set an ambitious goal: handle our entire load on fewer, more powerful machines to reduce operational complexity and costs.
Our target was to reach 1 million HTTP requests per second on a single server with:
- 99th percentile latency under 50ms
- CPU utilization below 70%
- Memory usage under 80%
The server we used for benchmarking and production was an AWS c6g.16xlarge instance (64 vCPUs, 128GB RAM) running Ubuntu 22.04.
Initial Performance
Our starting point was painfully low:
- 42,000 requests per second
- 99th percentile latency: 210ms
- CPU utilization: 85%
- Memory usage: 45%
Even on our 64-core machine, we were far from the target, and simply adding more hardware wouldn't get us there.
The Optimization Process
Instead of randomly applying “performance tips” from the internet, we followed a methodical approach:
- Measure current performance with production-like workloads
- Profile to identify actual bottlenecks
- Optimize the critical path based on profiling data
- Validate improvements through benchmarking
- Repeat
Let’s examine what actually moved the needle.
Optimizations That Matter
1. Proper HTTP Parser Configuration
Impact: +120,000 RPS
Node.js’s HTTP parser is highly configurable, but most applications use the defaults. We discovered significant overhead in header parsing, particularly for large cookie headers.
// Before
const http = require('http');

const server = http.createServer((req, res) => {
  // Default HTTP parser settings
});

// After
const server = http.createServer({
  maxHeaderSize: 8192, // Cap total header size at 8 KB
}, (req, res) => {
  // Request handling
});

// These are set as server properties, which works across Node.js versions
server.headersTimeout = 30000;  // Drop clients that take over 30s to send headers
server.keepAliveTimeout = 5000; // Recycle idle keep-alive sockets after 5s
Additionally, we configured our load balancer to compress cookies and minimize header size, which further reduced parsing overhead.
2. Strategic Use of Worker Threads
Impact: +280,000 RPS
While Node.js is single-threaded for JavaScript execution, worker threads allow CPU-intensive tasks to run in parallel. We identified CPU-bound operations and moved them to worker threads:
// Before
app.post('/process-payment', async (req, res) => {
  const transaction = parseTransaction(req.body);
  const validated = await validateTransaction(transaction); // CPU intensive
  const risk = await assessRisk(transaction); // CPU intensive
  const result = await processPayment(validated, risk);
  res.json(result);
});

// After
const { Worker } = require('worker_threads');

const workerPool = createWorkerPool(32); // Pool of workers for CPU tasks

app.post('/process-payment', async (req, res) => {
  const transaction = parseTransaction(req.body);

  // Run CPU-intensive tasks in worker threads in parallel
  const [validated, risk] = await Promise.all([
    workerPool.runTask('validateTransaction', transaction),
    workerPool.runTask('assessRisk', transaction)
  ]);

  const result = await processPayment(validated, risk);
  res.json(result);
});
Our worker pool implementation ensured optimal resource utilization across all 64 cores. Crucially, we only moved genuinely CPU-intensive operations to worker threads—moving I/O operations would have decreased performance.
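We haven't shown the pool itself here; below is a minimal sketch of what a createWorkerPool helper along these lines might look like. The tasks.js file name, the stub task handlers, and the round-robin dispatch are illustrative assumptions rather than our production implementation, which also tracks busy workers, queues excess tasks, and restarts crashed threads.

// worker-pool.js – minimal illustrative pool (not our production code)
const { Worker } = require('worker_threads');
const path = require('path');

function createWorkerPool(size) {
  const workers = Array.from({ length: size }, () =>
    new Worker(path.join(__dirname, 'tasks.js')));
  const pending = new Map(); // task id -> { resolve, reject }
  let taskId = 0;
  let next = 0;

  for (const worker of workers) {
    worker.on('message', ({ id, result, error }) => {
      const task = pending.get(id);
      pending.delete(id);
      if (error) task.reject(new Error(error));
      else task.resolve(result);
    });
  }

  return {
    runTask(task, payload) {
      return new Promise((resolve, reject) => {
        const id = taskId++;
        pending.set(id, { resolve, reject });
        next = (next + 1) % workers.length; // simple round-robin dispatch
        workers[next].postMessage({ id, task, payload });
      });
    }
  };
}

// tasks.js – worker side; the stubs stand in for the real CPU-bound functions
const { parentPort } = require('worker_threads');

const handlers = {
  validateTransaction: (tx) => ({ ...tx, valid: true }), // stub
  assessRisk: (tx) => ({ score: 0 }),                    // stub
};

parentPort.on('message', ({ id, task, payload }) => {
  try {
    parentPort.postMessage({ id, result: handlers[task](payload) });
  } catch (err) {
    parentPort.postMessage({ id, error: err.message });
  }
});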
3. Zero-Copy Buffer Handling
Impact: +175,000 RPS
Node.js excels at buffer manipulation, but unnecessary conversions between strings and buffers can kill performance. We refactored our code to minimize these conversions:
// Before
app.use(express.json()); // Parses the JSON body into JavaScript objects

app.post('/api/data', (req, res) => {
  const data = req.body;
  // Process data as JavaScript objects
  res.json({ success: true, data });
});

// After - zero-copy approach for high-throughput endpoints
app.post('/api/data', (req, res) => {
  // Collect raw chunks and concatenate once at the end, rather than
  // re-allocating a growing buffer on every 'data' event
  const chunks = [];
  req.on('data', (chunk) => chunks.push(chunk));
  req.on('end', () => {
    const body = chunks.length === 1 ? chunks[0] : Buffer.concat(chunks);

    // Process only the fields we need without a full parse
    const id = extractIdFromBuffer(body);
    const amount = extractAmountFromBuffer(body);

    // Construct the response buffer directly
    const response = createResponseBuffer({ id, amount, status: 'success' });
    res.setHeader('Content-Type', 'application/json');
    res.end(response);
  });
});
For our highest-throughput endpoints, we built custom parsers that extracted only the fields we needed from JSON or form data without parsing the entire payload. This approach reduced both CPU usage and memory allocations dramatically.
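To give a flavor of what such a parser looks like, here is an illustrative sketch (not our production extractor) that pulls one numeric field out of a raw buffer, assuming compact JSON like {"id":123,...} from a trusted client:

// Illustrative only: a real extractor must also handle whitespace,
// string escapes, signs, and malformed input
const ID_KEY = Buffer.from('"id":');

function extractIdFromBuffer(body) {
  const start = body.indexOf(ID_KEY);
  if (start === -1) return null;
  const from = start + ID_KEY.length;
  let end = from;
  while (end < body.length && body[end] >= 0x30 && body[end] <= 0x39) {
    end++; // scan ASCII digits until the next delimiter
  }
  return Number(body.toString('ascii', from, end));
}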
4. Connection Pooling and Reuse
Impact: +190,000 RPS
Database and HTTP client connections are expensive to establish. We implemented aggressive connection pooling:
// Before
const mysql = require('mysql2/promise'); // promise API implied by the awaits below

async function fetchUserData(userId) {
  const db = await mysql.createConnection(config);
  const [rows] = await db.query('SELECT * FROM users WHERE id = ?', [userId]);
  await db.end();
  return rows[0];
}

// After
const pool = mysql.createPool({
  waitForConnections: true,
  connectionLimit: 1000,
  queueLimit: 5000,
  // Additional optimized settings
  ...config
});

async function fetchUserData(userId) {
  const [rows] = await pool.query('SELECT * FROM users WHERE id = ?', [userId]);
  return rows[0];
}
We also applied similar pooling to Redis connections and HTTP clients for external API calls. This eliminated connection establishment overhead and significantly reduced latency.
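Applied to outbound HTTP, the same idea looks like the sketch below: a shared keep-alive agent avoids a fresh TCP (and TLS) handshake per request. The host name and socket limits are illustrative:

const http = require('http');

// Reuse sockets to upstream services instead of opening one per request
const keepAliveAgent = new http.Agent({
  keepAlive: true,
  maxSockets: 256,    // illustrative per-host cap on concurrent sockets
  maxFreeSockets: 64, // idle sockets kept warm for reuse
});

function callUpstream(path) {
  return new Promise((resolve, reject) => {
    http.get(
      { host: 'internal-api.example', path, agent: keepAliveAgent },
      (res) => {
        const chunks = [];
        res.on('data', (c) => chunks.push(c));
        res.on('end', () => resolve(Buffer.concat(chunks)));
      }
    ).on('error', reject);
  });
}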
5. Custom JSON Serialization
Impact: +135,000 RPS
JSON parsing and stringification showed up as a bottleneck in our profiles. We implemented a custom serialization strategy:
// Before
app.get('/api/products', async (req, res) => {
  const products = await getProducts();
  res.json(products); // Standard JSON serialization
});

// After
const flatstr = require('flatstr');

// Pre-compute the static parts of the response
const responsePrefix = Buffer.from('{"products":');
const responseSuffix = Buffer.from(',"success":true}');

app.get('/api/products', async (req, res) => {
  const products = await getProducts();

  // Optimize JSON serialization for known structures
  const productsJson = JSON.stringify(products);
  flatstr(productsJson); // Flatten the string's rope representation for V8

  // Construct the response with minimal allocations
  res.setHeader('Content-Type', 'application/json');
  res.write(responsePrefix);
  res.write(productsJson);
  res.end(responseSuffix);
});
For frequently accessed endpoints with large responses, we also implemented response caching with carefully tuned TTLs.
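A minimal version of such a cache is sketched below. The two-second TTL and single cache key are illustrative; our production cache also handled invalidation and memory limits:

// Tiny in-process TTL cache (illustrative, not production-grade)
function createTtlCache(ttlMs) {
  const entries = new Map(); // key -> { value, expires }
  return {
    get(key) {
      const entry = entries.get(key);
      if (!entry || Date.now() > entry.expires) {
        entries.delete(key);
        return undefined;
      }
      return entry.value;
    },
    set(key, value) {
      entries.set(key, { value, expires: Date.now() + ttlMs });
    }
  };
}

const productsCache = createTtlCache(2000); // e.g. a 2-second TTL

async function getProductsJson() {
  let json = productsCache.get('all');
  if (json === undefined) {
    json = JSON.stringify(await getProducts()); // getProducts as above
    productsCache.set('all', json);
  }
  return json;
}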
The Results
After implementing these optimizations (plus several smaller ones), our benchmarks showed:
- 1,203,000 requests per second
- 99th percentile latency: 42ms
- CPU utilization: 68%
- Memory usage: 62%
We had exceeded our target by a comfortable margin.
Myths That Didn’t Help
Not every “optimization” is worth implementing. Here are some commonly suggested approaches that yielded minimal improvements or actually hurt performance in our testing:
1. Myth: “Always use the Cluster Module”
The cluster module is often presented as mandatory for production Node.js. In our testing, a well-optimized single process with worker threads outperformed multiple Node.js processes on the same machine for our workload. The overhead of inter-process communication and shared memory limitations made clustering less efficient than a properly configured worker thread pool.
2. Myth: “Promises are Slower Than Callbacks”
When we replaced some Promise-based code with callbacks, we saw negligible performance improvements (under 1%) but significantly reduced code readability. Modern V8 has optimized Promise execution to the point where the difference is minimal for most use cases.
3. Myth: “Use noAsync Flags Everywhere”
Flags like --no-force-async-hooks-checks are often recommended for production. In our testing, they provided less than a 3% performance improvement while making debugging much harder. Given that tradeoff, this “optimization” wasn’t worth it.
4. Myth: “TypeScript Compilation Kills Performance”
Our application uses TypeScript, and we were advised to rewrite performance-critical paths in pure JavaScript. When we tested this, the compiled TypeScript performed identically to handwritten JavaScript. The TypeScript compiler mostly erases type annotations rather than transforming logic, so the emitted code is essentially the JavaScript you would have written by hand, and V8’s JIT treats both the same.
5. Myth: “V8 Optimization Hints Are Game-Changers”
Techniques like hidden-class preservation and function-inlining hints are often presented as critical optimizations. When we benchmarked code with and without these patterns, the difference was negligible for our HTTP-heavy workload. V8’s optimizer has become sophisticated enough to identify these patterns automatically in most cases.
The Real MVPs: Tools That Made a Difference
The tools that proved most valuable in our optimization journey were:
- clinic.js: For identifying bottlenecks in CPU, memory, and the event loop
- 0x: For generating flame graphs that pinpointed specific hot functions
- autocannon: For realistic HTTP benchmarking
- perf: For system-level performance analysis
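For example, a typical benchmarking run against a single endpoint, using autocannon's Node API, looked roughly like this (the URL and load parameters are illustrative):

const autocannon = require('autocannon');

// 500 keep-alive connections for 30s, 10 pipelined requests per connection
autocannon({
  url: 'http://localhost:3000/api/data',
  connections: 500,
  duration: 30,
  pipelining: 10,
}, (err, result) => {
  if (err) throw err;
  console.log(`avg RPS: ${result.requests.average}, p99 latency: ${result.latency.p99} ms`);
});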
Without proper measurement and profiling, we would have wasted time on insignificant optimizations.
Implementing This In Your Environment
While our specific optimizations might not directly apply to your application, the methodology is universal:
- Establish a baseline: Create reproducible benchmarks that simulate production traffic.
- Profile systematically: Let data, not intuition, guide your optimization efforts.
- Focus on the critical path: Optimize the 20% of code that accounts for 80% of execution time.
- Validate each change: Measure the impact of each optimization independently.
- Question conventional wisdom: Test advice before applying it wholesale.
The Most Surprising Lesson
The most counterintuitive discovery was that many “performance best practices” for Node.js are based on outdated information. The JavaScript engine and Node.js itself have evolved dramatically, making some previously valuable optimizations obsolete.
For example, V8’s Turbofan optimization engine handles modern JavaScript patterns much better than older versions. Techniques that were critical in Node.js 8 might be irrelevant or even counterproductive in Node.js 18+.
Conclusion: Node.js Is More Capable Than You Think
Our journey from 42,000 to 1.2 million requests per second on a single server demonstrates that Node.js can handle extreme throughput when properly optimized. While languages like Go and Rust have performance advantages for certain workloads, the gap is narrower than commonly believed.
The key is to optimize based on data, not opinions. Profile your specific application, address actual bottlenecks, and validate your improvements with rigorous benchmarking.
Node.js’s single-threaded event loop is not a limitation when you understand how to leverage its strengths while mitigating its weaknesses. When pushed to its limits, Node.js might surprise you with just how far it can go.
Have you optimized Node.js for high throughput? What techniques made the biggest difference in your environment? Share your experiences in the comments below.
About the Author: Rizqi Mulki is a Principal Backend Engineer specializing in high-performance distributed systems. With extensive experience scaling Node.js applications to handle millions of users, Rizqi Mulki focuses on practical performance optimization and system architecture.
Tags: Node.js, Performance Optimization, Backend Development, Scaling, HTTP Server, Worker Threads, Benchmarking