Introduction
Scaling a database system to handle more than one million transactions per second (TPS) presents significant challenges even for the most seasoned database architects. This case study examines how our team at GlobalTech Financial Services transformed our MariaDB implementation from a standard setup struggling with 10,000 TPS to a high-performance system capable of processing over 1 million transactions per second.
Background
The Business Challenge
GlobalTech Financial Services operates a payment processing platform that experienced exponential growth over 18 months, with our customer base expanding from regional markets to global operations. This growth brought a dramatic increase in transaction volume, threatening to overwhelm our existing database infrastructure.
Initial System Metrics:
- Average throughput: ~8,000 TPS
- Peak loads: ~12,000 TPS
- 99th percentile response time: 850ms (well above our 200ms SLA)
- Database size: 4TB
- Hardware: 8-node cluster with 16 cores and 128GB RAM per node
The Technical Challenge
Our analysis identified several bottlenecks:
- Connection management: High connection overhead with thousands of ephemeral connections
- Query efficiency: Suboptimal query patterns causing excessive disk I/O
- Schema design: Normalized schema requiring multiple joins for common operations
- Resource contention: Locks and waits causing processing delays
- Hardware limitations: Storage I/O becoming a significant bottleneck
The Solution Architecture
Phase 1: Foundation Optimization
Before making radical changes, we optimized the existing setup:
Connection Pooling Implementation
We implemented ProxySQL as a connection pooling layer:
# Sample ProxySQL configuration excerpt
mysql_servers =
(
    {
        address="mariadb-node1.internal"
        port=3306
        hostgroup=0
        max_connections=2000
        max_replication_lag=5
    },
    # Additional nodes configuration...
)
mysql_users =
(
    {
        username="app_user"
        password="****"
        default_hostgroup=0
        max_connections=1000
        transaction_persistent=1
    }
)
This dramatically reduced connection overhead: application servers held persistent connections to ProxySQL, which multiplexed them onto a much smaller pool of long-lived backend connections to MariaDB.
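To verify that the pooling layer was actually multiplexing traffic, ProxySQL's admin interface exposes per-backend pool statistics; a query along these lines (run against the admin port, typically 6032) shows connection reuse per hostgroup:
-- Connection reuse per backend, queried via the ProxySQL admin interface
SELECT hostgroup, srv_host, srv_port,
       ConnUsed, ConnFree, ConnOK, ConnERR, Queries
FROM stats.stats_mysql_connection_pool
ORDER BY hostgroup, srv_host;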
Query Optimization
We analyzed and optimized the top 20 most resource-intensive queries:
-- Before optimization
SELECT c.customer_id, c.name, c.email,
o.order_id, o.amount, o.status,
p.payment_method, p.transaction_id
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
JOIN payments p ON o.order_id = p.order_id
WHERE o.status = 'PROCESSING'
ORDER BY o.created_at DESC;
-- After optimization
SELECT c.customer_id, c.name, c.email,
o.order_id, o.amount, o.status,
p.payment_method, p.transaction_id
FROM orders o
FORCE INDEX (idx_status_created)
JOIN customers c ON o.customer_id = c.customer_id
JOIN payments p ON o.order_id = p.order_id
WHERE o.status = 'PROCESSING'
ORDER BY o.created_at DESC
LIMIT 1000;
Key optimizations included:
- Changing join order to start with the most filtered table
- Adding composite indexes to support common query patterns (see the sketch after this list)
- Using FORCE INDEX where necessary
- Implementing LIMIT clauses to prevent runaway result sets
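The idx_status_created index forced in the optimized query is not shown in the DDL above; based on the WHERE and ORDER BY clauses it supports, it would look roughly like this:
-- Composite index assumed from the status filter and created_at ordering
ALTER TABLE orders
ADD INDEX idx_status_created (status, created_at);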
Initial Schema Optimization
We implemented strategic denormalization for hot tables:
-- Adding frequently accessed columns directly to orders table
ALTER TABLE orders
ADD COLUMN customer_name VARCHAR(255) AFTER customer_id,
ADD COLUMN customer_email VARCHAR(255) AFTER customer_name;
-- Creating triggers to maintain data consistency
DELIMITER //
CREATE TRIGGER after_customer_update
AFTER UPDATE ON customers
FOR EACH ROW
BEGIN
IF OLD.name != NEW.name OR OLD.email != NEW.email THEN
UPDATE orders
SET customer_name = NEW.name, customer_email = NEW.email
WHERE customer_id = NEW.customer_id;
END IF;
END//
DELIMITER ;
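The trigger only keeps future changes in sync; rows that existed before the denormalization also need their new columns populated. A minimal sketch of that one-time backfill (in practice it would run in bounded batches):
-- One-time backfill of the denormalized columns for existing orders
UPDATE orders o
JOIN customers c ON c.customer_id = o.customer_id
SET o.customer_name = c.name,
    o.customer_email = c.email
WHERE o.customer_name IS NULL;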
Phase 2: Horizontal Scaling Implementation
With the foundation optimized, we moved to a horizontal scaling strategy:
Sharding Strategy
We implemented a hybrid sharding approach:
- Customer-based sharding for account data
- Transaction ID-based sharding for transactional data
Our sharding function hashed each key (customer ID or transaction ID) into one of 128 logical shards; logical shards were then mapped to physical nodes via a shard map, so data could be rebalanced without rehashing keys:
// Map a sharding key (customer ID or transaction ID) to one of 128 logical shards
function getShardId($key, $shardCount = 128) {
    // crc32() can return a negative value on 32-bit platforms, hence the abs()
    $crc = crc32((string) $key);
    $shardId = abs($crc % $shardCount);
    return $shardId;
}
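MariaDB's CRC32() uses the same CRC-32 polynomial as PHP's crc32(), so shard assignments can be cross-checked from SQL, assuming the key is hashed as the same plain string on both sides (the key below is just an example):
-- Cross-check the application-side shard calculation for a sample key
SELECT CRC32('1000042') % 128 AS shard_id;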
Schema for Sharded Environment
We created a shard management database and modified application logic to route queries:
CREATE TABLE shard_map (
shard_id INT NOT NULL,
db_host VARCHAR(255) NOT NULL,
db_port INT NOT NULL DEFAULT 3306,
db_name VARCHAR(64) NOT NULL,
shard_status ENUM('ACTIVE','MAINTENANCE','REBALANCING','INACTIVE') NOT NULL,
weight INT NOT NULL DEFAULT 100,
created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
PRIMARY KEY (shard_id)
);
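With this mapping in place, routing a request comes down to looking up the logical shard returned by getShardId(); in essence (application-side caching of the map is omitted):
-- Resolve the physical database for a given logical shard
SELECT db_host, db_port, db_name
FROM shard_map
WHERE shard_id = 57          -- value returned by getShardId()
  AND shard_status = 'ACTIVE';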
Data Distribution and Rebalancing
For initial data distribution, we developed a custom migration tool that:
- Created shard schemas on target servers
- Transferred data in batches based on sharding keys (sketched after this list)
- Verified data integrity post-migration
- Switched traffic gradually to the new sharded system
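The batch-transfer step can be illustrated in plain SQL. This is a simplified sketch, not the tool itself, which moved rows between servers; here both schemas are assumed reachable from one connection, and the shard number, key range, and schema names are placeholders:
-- Copy one batch of orders belonging to logical shard 57 to its target schema
INSERT INTO shard_57.orders
SELECT *
FROM legacy.orders
WHERE CRC32(customer_id) % 128 = 57      -- rows owned by this logical shard
  AND order_id BETWEEN 1 AND 100000;     -- batch boundary advanced by the tool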
Phase 3: Advanced Optimizations
With our sharded architecture in place, we implemented advanced optimizations:
Columnar Storage for Analytics
For reporting workloads, we implemented ColumnStore engine:
CREATE TABLE transaction_analytics (
transaction_id BIGINT NOT NULL,
customer_id BIGINT NOT NULL,
amount DECIMAL(20,2) NOT NULL,
currency VARCHAR(3) NOT NULL,
transaction_type VARCHAR(20) NOT NULL,
status VARCHAR(15) NOT NULL,
processing_time_ms INT NOT NULL,
transaction_date DATE NOT NULL,
created_at TIMESTAMP NOT NULL
) ENGINE=ColumnStore;
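Reporting queries against this table aggregate a handful of columns over large date ranges, which is exactly where the columnar layout pays off. A representative example (the status value is an assumption):
-- Daily volume and processing-time summary per currency, last 30 days
SELECT transaction_date,
       currency,
       COUNT(*)                AS txn_count,
       SUM(amount)             AS total_amount,
       AVG(processing_time_ms) AS avg_processing_ms
FROM transaction_analytics
WHERE transaction_date >= CURDATE() - INTERVAL 30 DAY
  AND status = 'COMPLETED'
GROUP BY transaction_date, currency
ORDER BY transaction_date, currency;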
Memory-Optimized Tables
For high-velocity data, we used memory tables with automated archiving:
CREATE TABLE active_sessions (
session_id VARCHAR(64) NOT NULL,
user_id BIGINT NOT NULL,
start_time TIMESTAMP NOT NULL,
last_activity TIMESTAMP NOT NULL,
-- MEMORY tables cannot store TEXT/JSON columns, so session data is kept in a bounded VARCHAR
session_data VARCHAR(2048),
PRIMARY KEY (session_id),
KEY (user_id),
-- BTREE so the archiving job can range-scan on idle time (MEMORY indexes default to HASH)
KEY (last_activity) USING BTREE
) ENGINE=MEMORY;
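One way to implement the automated archiving is a scheduled job; this is a minimal sketch using the MariaDB event scheduler and a hypothetical InnoDB table archived_sessions with the same columns (batching and retry handling omitted):
-- Requires event_scheduler=ON
DELIMITER //
CREATE EVENT archive_stale_sessions
ON SCHEDULE EVERY 5 MINUTE
DO
BEGIN
-- Move sessions idle for more than 30 minutes out of the MEMORY table
INSERT INTO archived_sessions
SELECT * FROM active_sessions
WHERE last_activity < NOW() - INTERVAL 30 MINUTE;
DELETE FROM active_sessions
WHERE last_activity < NOW() - INTERVAL 30 MINUTE;
END//
DELIMITER ;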
Custom MariaDB Server Configuration
We fine-tuned MariaDB configuration for high throughput:
# Buffer settings
innodb_buffer_pool_size = 96G
innodb_log_buffer_size = 256M
# I/O optimization
innodb_flush_log_at_trx_commit = 2
innodb_flush_method = O_DIRECT
innodb_io_capacity = 20000
innodb_io_capacity_max = 40000
# Concurrency settings
innodb_thread_concurrency = 0
innodb_read_io_threads = 16
innodb_write_io_threads = 16
max_connections = 5000
# Transaction isolation
transaction_isolation = READ-COMMITTED
# Table settings
table_open_cache = 16384
table_definition_cache = 8192
# Thread pooling
thread_handling = pool-of-threads
thread_pool_size = 32
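With thread_handling = pool-of-threads enabled, MariaDB exposes thread-pool status counters that make it easy to confirm the pool size is adequate, for example:
-- Worker and idle thread counts for the thread pool
SHOW GLOBAL STATUS LIKE 'Threadpool%';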
Hardware Optimization
We moved to an NVMe-based storage infrastructure:
- NVMe SSDs in RAID 10 configuration
- Separate volumes for data, logs, and temporary files
- Direct I/O enabled to bypass OS caching
Implementation Results
Performance Metrics After Optimization
After completing all three phases, our system achieved:
- Sustained throughput: 1.2M TPS
- Peak capacity: 1.8M TPS
- 99th percentile response time: 45ms
- Database size: 18TB distributed across shards
- Hardware: 32-node cluster with 32 cores and 256GB RAM per node
Scalability Tests
We conducted stress tests to verify scalability:
Concurrent Users | Transactions per Second | Avg. Response Time | 99th Percentile Response Time
---|---|---|---
100,000 | 350,000 | 18ms | 35ms
250,000 | 820,000 | 22ms | 42ms
500,000 | 1,250,000 | 28ms | 45ms
750,000 | 1,650,000 | 35ms | 52ms
1,000,000 | 1,820,000 | 38ms | 65ms
Cost-Benefit Analysis
The optimization project required:
- 8 months of engineering effort
- $1.2M in hardware investments
- 20% increase in monthly infrastructure costs
Benefits realized:
- ~95% reduction in 99th percentile response time (850ms down to 45ms)
- 150x increase in throughput capacity
- Supported 10x customer growth without performance degradation
- Eliminated performance-related customer complaints
- Extended hardware lifecycle by 2 years due to more efficient resource utilization
Key Lessons Learned
Technical Insights
- Connection management matters: Connection pooling provided immediate relief without schema changes.
- Sharding complexity: Choosing the right sharding key is critical; our initial attempt using transaction date caused hot spots and required reworking.
- Query patterns dictate indexes: Understanding query patterns is more important than generalized indexing strategies.
- Hardware isn’t always the answer: Software optimizations provided 70% of our gains; hardware accelerated already-optimized software.
- Monitoring is essential: Comprehensive monitoring allowed us to identify issues before they impacted customers:
-- Example of monitoring query we run every minute
SELECT
SUBSTRING_INDEX(SUBSTRING_INDEX(host, ':', 1), '[', 1) AS client_ip,
COUNT(*) AS connection_count,
SUM(IF(command != 'Sleep', 1, 0)) AS active_count,
AVG(IF(command != 'Sleep', time, NULL)) AS avg_active_time
FROM information_schema.processlist
GROUP BY client_ip
ORDER BY active_count DESC;
Organizational Learnings
- DevOps collaboration: Cross-functional teams of database administrators and application developers produced solutions that addressed both database and application issues.
- Incremental migration: Our phased approach allowed business continuity while improving performance.
- Documentation matters: Detailed documentation of database design decisions streamlined onboarding and troubleshooting.
Conclusion
Scaling MariaDB to handle 1M+ transactions per second required a multifaceted approach combining:
- Foundation optimization of the existing system
- Horizontal scaling through strategic sharding
- Advanced optimizations at both hardware and software levels
- Continuous monitoring and performance tuning
The key to success was not just technical implementation but the methodical, phased approach that validated improvements at each step before moving forward. While not every organization requires this level of throughput, the principles and patterns we applied are valuable for any team looking to scale their database infrastructure efficiently.
Our journey demonstrates that MariaDB can indeed handle extreme transaction volumes when properly architected and optimized, making it a viable alternative to more complex and costly database solutions for high-throughput scenarios.