Optimizations that matter (and myths that don’t)
Introduction
When we first announced to our engineering team that we needed to handle 1 million requests per second on our Node.js API server, we were met with skepticism. “Just migrate to Go or Rust,” some suggested. “You’ll need a cluster of at least 50 machines,” others claimed. The conventional wisdom was clear: Node.js wasn’t built for this level of throughput.
Six weeks later, we were processing 1.2 million requests per second on a single server with headroom to spare.
This article documents our journey, focusing on the optimizations that delivered real results and debunking the myths that wasted our time. While your specific bottlenecks may differ, the methodical approach to performance optimization remains universally applicable.
The Challenge
Our payment processing API needed a major overhaul. As our platform grew to serve over 30 million users, our existing infrastructure (a cluster of 42 Node.js servers) was struggling to keep up with peak traffic. We set an ambitious goal: handle our entire load on fewer, more powerful machines to reduce operational complexity and costs.
Our target was to reach 1 million HTTP requests per second on a single server with:
- 99th percentile latency under 50ms
- CPU utilization below 70%
- Memory usage under 80%
The server we used for benchmarking and production was an AWS c6g.16xlarge instance (64 vCPUs, 128GB RAM) running Ubuntu 22.04.
Initial Performance
Our starting point was painfully low:
- 42,000 requests per second
- 99th percentile latency: 210ms
- CPU utilization: 85%
- Memory usage: 45%
Even on our 64-core machine, we were far from the target, and simply adding more hardware wouldn't get us there.
The Optimization Process
Instead of randomly applying “performance tips” from the internet, we followed a methodical approach:
- Measure current performance with production-like workloads
- Profile to identify actual bottlenecks
- Optimize the critical path based on profiling data
- Validate improvements through benchmarking
- Repeat
Let’s examine what actually moved the needle.
Optimizations That Matter
1. Proper HTTP Parser Configuration
Impact: +120,000 RPS
Node.js’s HTTP parser is highly configurable, but most applications use the defaults. We discovered significant overhead in header parsing, particularly for large cookie headers.
// Before
const http = require('http');

const server = http.createServer((req, res) => {
  // Default HTTP parser settings
});

// After
const server = http.createServer({
  maxHeaderSize: 8192, // Cap total header size at 8 KB
}, (req, res) => {
  // Request handling
});

// These are set as server properties, which works across Node.js versions
server.headersTimeout = 30000;  // Drop clients that take over 30s to send headers
server.keepAliveTimeout = 5000; // Recycle idle keep-alive sockets after 5s
Additionally, we configured our load balancer to compress cookies and minimize header size, which further reduced parsing overhead.
2. Strategic Use of Worker Threads
Impact: +280,000 RPS
While Node.js is single-threaded for JavaScript execution, worker threads allow CPU-intensive tasks to run in parallel. We identified CPU-bound operations and moved them to worker threads:
// Before
app.post('/process-payment', async (req, res) => {
  const transaction = parseTransaction(req.body);
  const validated = await validateTransaction(transaction); // CPU intensive
  const risk = await assessRisk(transaction); // CPU intensive
  const result = await processPayment(validated, risk);
  res.json(result);
});

// After
const { Worker } = require('worker_threads');

const workerPool = createWorkerPool(32); // Pool of workers for CPU tasks

app.post('/process-payment', async (req, res) => {
  const transaction = parseTransaction(req.body);

  // Run CPU-intensive tasks in worker threads in parallel
  const [validated, risk] = await Promise.all([
    workerPool.runTask('validateTransaction', transaction),
    workerPool.runTask('assessRisk', transaction)
  ]);

  const result = await processPayment(validated, risk);
  res.json(result);
});
Our worker pool implementation ensured optimal resource utilization across all 64 cores. Crucially, we only moved genuinely CPU-intensive operations to worker threads—moving I/O operations would have decreased performance.
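We haven't shown the pool itself here; below is a minimal sketch of what a createWorkerPool helper along these lines might look like. The tasks.js file name, the stub task handlers, and the round-robin dispatch are illustrative assumptions rather than our production implementation, which also tracks busy workers, queues excess tasks, and restarts crashed threads.

// worker-pool.js – minimal illustrative pool (not our production code)
const { Worker } = require('worker_threads');
const path = require('path');

function createWorkerPool(size) {
  const workers = Array.from({ length: size }, () =>
    new Worker(path.join(__dirname, 'tasks.js')));
  const pending = new Map(); // task id -> { resolve, reject }
  let taskId = 0;
  let next = 0;

  for (const worker of workers) {
    worker.on('message', ({ id, result, error }) => {
      const task = pending.get(id);
      pending.delete(id);
      if (error) task.reject(new Error(error));
      else task.resolve(result);
    });
  }

  return {
    runTask(task, payload) {
      return new Promise((resolve, reject) => {
        const id = taskId++;
        pending.set(id, { resolve, reject });
        next = (next + 1) % workers.length; // simple round-robin dispatch
        workers[next].postMessage({ id, task, payload });
      });
    }
  };
}

// tasks.js – worker side; the stubs stand in for the real CPU-bound functions
const { parentPort } = require('worker_threads');

const handlers = {
  validateTransaction: (tx) => ({ ...tx, valid: true }), // stub
  assessRisk: (tx) => ({ score: 0 }),                    // stub
};

parentPort.on('message', ({ id, task, payload }) => {
  try {
    parentPort.postMessage({ id, result: handlers[task](payload) });
  } catch (err) {
    parentPort.postMessage({ id, error: err.message });
  }
});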
3. Zero-Copy Buffer Handling
Impact: +175,000 RPS
Node.js excels at buffer manipulation, but unnecessary conversions between strings and buffers can kill performance. We refactored our code to minimize these conversions:
// Before
app.use(express.json()); // Parses the JSON body into JavaScript objects

app.post('/api/data', (req, res) => {
  const data = req.body;
  // Process data as JavaScript objects
  res.json({ success: true, data });
});

// After - zero-copy approach for high-throughput endpoints
app.post('/api/data', (req, res) => {
  // Collect raw chunks and concatenate once at the end, rather than
  // re-allocating a growing buffer on every 'data' event
  const chunks = [];
  req.on('data', (chunk) => chunks.push(chunk));
  req.on('end', () => {
    const body = chunks.length === 1 ? chunks[0] : Buffer.concat(chunks);

    // Process only the fields we need without a full parse
    const id = extractIdFromBuffer(body);
    const amount = extractAmountFromBuffer(body);

    // Construct the response buffer directly
    const response = createResponseBuffer({ id, amount, status: 'success' });
    res.setHeader('Content-Type', 'application/json');
    res.end(response);
  });
});
For our highest-throughput endpoints, we built custom parsers that extracted only the fields we needed from JSON or form data without parsing the entire payload. This approach reduced both CPU usage and memory allocations dramatically.
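To give a flavor of what such a parser looks like, here is an illustrative sketch (not our production extractor) that pulls one numeric field out of a raw buffer, assuming compact JSON like {"id":123,...} from a trusted client:

// Illustrative only: a real extractor must also handle whitespace,
// string escapes, signs, and malformed input
const ID_KEY = Buffer.from('"id":');

function extractIdFromBuffer(body) {
  const start = body.indexOf(ID_KEY);
  if (start === -1) return null;
  const from = start + ID_KEY.length;
  let end = from;
  while (end < body.length && body[end] >= 0x30 && body[end] <= 0x39) {
    end++; // scan ASCII digits until the next delimiter
  }
  return Number(body.toString('ascii', from, end));
}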
4. Connection Pooling and Reuse
Impact: +190,000 RPS
Database and HTTP client connections are expensive to establish. We implemented aggressive connection pooling:
// Before
const mysql = require('mysql2/promise'); // promise API implied by the awaits below

async function fetchUserData(userId) {
  const db = await mysql.createConnection(config);
  const [rows] = await db.query('SELECT * FROM users WHERE id = ?', [userId]);
  await db.end();
  return rows[0];
}

// After
const pool = mysql.createPool({
  waitForConnections: true,
  connectionLimit: 1000,
  queueLimit: 5000,
  // Additional optimized settings
  ...config
});

async function fetchUserData(userId) {
  const [rows] = await pool.query('SELECT * FROM users WHERE id = ?', [userId]);
  return rows[0];
}
We also applied similar pooling to Redis connections and HTTP clients for external API calls. This eliminated connection establishment overhead and significantly reduced latency.
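Applied to outbound HTTP, the same idea looks like the sketch below: a shared keep-alive agent avoids a fresh TCP (and TLS) handshake per request. The host name and socket limits are illustrative:

const http = require('http');

// Reuse sockets to upstream services instead of opening one per request
const keepAliveAgent = new http.Agent({
  keepAlive: true,
  maxSockets: 256,    // illustrative per-host cap on concurrent sockets
  maxFreeSockets: 64, // idle sockets kept warm for reuse
});

function callUpstream(path) {
  return new Promise((resolve, reject) => {
    http.get(
      { host: 'internal-api.example', path, agent: keepAliveAgent },
      (res) => {
        const chunks = [];
        res.on('data', (c) => chunks.push(c));
        res.on('end', () => resolve(Buffer.concat(chunks)));
      }
    ).on('error', reject);
  });
}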
5. Custom JSON Serialization
Impact: +135,000 RPS
JSON parsing and stringification showed up as a bottleneck in our profiles. We implemented a custom serialization strategy:
// Before
app.get('/api/products', async (req, res) => {
  const products = await getProducts();
  res.json(products); // Standard JSON serialization
});

// After
const flatstr = require('flatstr');

// Pre-compute the static parts of the response
const responsePrefix = Buffer.from('{"products":');
const responseSuffix = Buffer.from(',"success":true}');

app.get('/api/products', async (req, res) => {
  const products = await getProducts();

  // Optimize JSON serialization for known structures
  const productsJson = JSON.stringify(products);
  flatstr(productsJson); // Flatten the string's rope representation for V8

  // Construct the response with minimal allocations
  res.setHeader('Content-Type', 'application/json');
  res.write(responsePrefix);
  res.write(productsJson);
  res.end(responseSuffix);
});
For frequently accessed endpoints with large responses, we also implemented response caching with carefully tuned TTLs.
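A minimal version of such a cache is sketched below. The two-second TTL and single cache key are illustrative; our production cache also handled invalidation and memory limits:

// Tiny in-process TTL cache (illustrative, not production-grade)
function createTtlCache(ttlMs) {
  const entries = new Map(); // key -> { value, expires }
  return {
    get(key) {
      const entry = entries.get(key);
      if (!entry || Date.now() > entry.expires) {
        entries.delete(key);
        return undefined;
      }
      return entry.value;
    },
    set(key, value) {
      entries.set(key, { value, expires: Date.now() + ttlMs });
    }
  };
}

const productsCache = createTtlCache(2000); // e.g. a 2-second TTL

async function getProductsJson() {
  let json = productsCache.get('all');
  if (json === undefined) {
    json = JSON.stringify(await getProducts()); // getProducts as above
    productsCache.set('all', json);
  }
  return json;
}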
The Results
After implementing these optimizations (plus several smaller ones), our benchmarks showed:
- 1,203,000 requests per second
- 99th percentile latency: 42ms
- CPU utilization: 68%
- Memory usage: 62%
We had exceeded our target by a comfortable margin.
Myths That Didn’t Help
Not every “optimization” is worth implementing. Here are some commonly suggested approaches that yielded minimal improvements or actually hurt performance in our testing:
1. Myth: “Always use the Cluster Module”
The cluster module is often presented as mandatory for production Node.js. In our testing, a well-optimized single process with worker threads outperformed multiple Node.js processes on the same machine for our workload. The overhead of inter-process communication and shared memory limitations made clustering less efficient than a properly configured worker thread pool.
2. Myth: “Promises are Slower Than Callbacks”
When we replaced some Promise-based code with callbacks, we saw negligible performance improvements (under 1%) but significantly reduced code readability. Modern V8 has optimized Promise execution to the point where the difference is minimal for most use cases.
3. Myth: “Use noAsync Flags Everywhere”
Flags like --no-force-async-hooks-checks are often recommended for production. In our testing, they provided less than a 3% performance improvement while making debugging much harder. Given that tradeoff, this “optimization” wasn’t worth it.
4. Myth: “TypeScript Compilation Kills Performance”
Our application uses TypeScript, and we were advised to rewrite performance-critical paths in pure JavaScript. When we tested this, the compiled TypeScript performed identically to handwritten JavaScript. The TypeScript compiler mostly erases type annotations rather than transforming logic, so the emitted code is essentially the JavaScript you would have written by hand, and V8’s JIT treats both the same.
5. Myth: “V8 Optimization Hints Are Game-Changers”
Techniques like hidden-class preservation and function-inlining hints are often presented as critical optimizations. When we benchmarked code with and without these patterns, the difference was negligible for our HTTP-heavy workload. V8’s optimizer has become sophisticated enough to identify these patterns automatically in most cases.
The Real MVPs: Tools That Made a Difference
The tools that proved most valuable in our optimization journey were:
- clinic.js: For identifying bottlenecks in CPU, memory, and the event loop
- 0x: For generating flame graphs that pinpointed specific hot functions
- autocannon: For realistic HTTP benchmarking
- perf: For system-level performance analysis
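For example, a typical benchmarking run against a single endpoint, using autocannon's Node API, looked roughly like this (the URL and load parameters are illustrative):

const autocannon = require('autocannon');

// 500 keep-alive connections for 30s, 10 pipelined requests per connection
autocannon({
  url: 'http://localhost:3000/api/data',
  connections: 500,
  duration: 30,
  pipelining: 10,
}, (err, result) => {
  if (err) throw err;
  console.log(`avg RPS: ${result.requests.average}, p99 latency: ${result.latency.p99} ms`);
});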
Without proper measurement and profiling, we would have wasted time on insignificant optimizations.
Implementing This In Your Environment
While our specific optimizations might not directly apply to your application, the methodology is universal:
- Establish a baseline: Create reproducible benchmarks that simulate production traffic.
- Profile systematically: Let data, not intuition, guide your optimization efforts.
- Focus on the critical path: Optimize the 20% of code that accounts for 80% of execution time.
- Validate each change: Measure the impact of each optimization independently.
- Question conventional wisdom: Test advice before applying it wholesale.
The Most Surprising Lesson
The most counterintuitive discovery was that many “performance best practices” for Node.js are based on outdated information. The JavaScript engine and Node.js itself have evolved dramatically, making some previously valuable optimizations obsolete.
For example, V8’s Turbofan optimization engine handles modern JavaScript patterns much better than older versions. Techniques that were critical in Node.js 8 might be irrelevant or even counterproductive in Node.js 18+.
Conclusion: Node.js Is More Capable Than You Think
Our journey from 42,000 to 1.2 million requests per second on a single server demonstrates that Node.js can handle extreme throughput when properly optimized. While languages like Go and Rust have performance advantages for certain workloads, the gap is narrower than commonly believed.
The key is to optimize based on data, not opinions. Profile your specific application, address actual bottlenecks, and validate your improvements with rigorous benchmarking.
Node.js’s single-threaded event loop is not a limitation when you understand how to leverage its strengths while mitigating its weaknesses. When pushed to its limits, Node.js might surprise you with just how far it can go.
Have you optimized Node.js for high throughput? What techniques made the biggest difference in your environment? Share your experiences in the comments below.
About the Author: Rizqi Mulki is a Principal Backend Engineer specializing in high-performance distributed systems. With extensive experience scaling Node.js applications to handle millions of users, Rizqi Mulki focuses on practical performance optimization and system architecture.
Tags: Node.js, Performance Optimization, Backend Development, Scaling, HTTP Server, Worker Threads, Benchmarking