OFFSET pagination is fine… until page 10,000

Your SQL query returns 50 rows. So why does it get slower every week?

Because the database is not optimizing for the 50 rows you keep. It is paying for the rows you skip.

In a Java + Spring Boot API audit, I saw an admin endpoint paginating a large orders table like this:

SELECT id, customer_id, status, created_at
FROM orders
ORDER BY created_at DESC
LIMIT 50 OFFSET 500000;

The application side looked harmless:

PageRequest.of(page, 50, Sort.by(Sort.Direction.DESC, "createdAt"));

At first, everything was fine. Then the table grew to around 12 million rows, and deep pagination became a real problem.

🚨 What happened

Early pages were fine:
page 1: fast
page 10: still fine
page 100: acceptable

But deep pages degraded badly:
page 1 → ~40 ms
page 1000 → ~220 ms
page 10000 → ~1.8 s to 3 s

Even though the API still returned only 50 rows.

Why? Because with offset pagination, the database often has to:
scan or index-walk through a huge number of rows
discard the first N rows
only then return the next 50

So OFFSET 500000 does not mean "jump instantly to row 500001". It often means "walk past 500,000 rows, then give me 50".

💥 Real production impact

On this endpoint:
p95 latency kept increasing as data grew
DB CPU spiked during heavy admin usage
replicas got unnecessary read pressure
users felt the app was "randomly slow"

The worst part: it looked perfectly fine in dev with a small dataset.

❌ Typical Java code

Page<Order> page = orderRepository.findAll(
    PageRequest.of(pageNumber, 50, Sort.by("createdAt").descending())
);

This is convenient. But on very large tables, convenience gets expensive.
✅ Better approach: keyset pagination

Instead of asking for "page 10000", ask for "the next 50 rows after this last seen value".

Example:

SELECT id, customer_id, status, created_at
FROM orders
WHERE created_at < :lastCreatedAt
ORDER BY created_at DESC
LIMIT 50;

And if created_at is not unique, use a tie-breaker:

SELECT id, customer_id, status, created_at
FROM orders
WHERE (created_at, id) < (:lastCreatedAt, :lastId)
ORDER BY created_at DESC, id DESC
LIMIT 50;

✅ Why it scales better

With a proper index like:

INDEX (created_at DESC, id DESC)

the database can seek directly to the next position. That means:
no huge skip cost
more stable latency
much better behavior on large datasets

⚠️ Important nuance

Offset pagination is not always wrong. It is often fine for:
small tables
internal tools with low volume
shallow pagination
cases where jumping directly to page 7 matters more than raw performance

But for:
millions of rows
infinite scroll
APIs under load
deep historical navigation

keyset pagination usually wins.

🧠 Takeaway

OFFSET pagination does not fail when your query gets bigger. It fails when your data gets bigger.

Are you still paginating large tables with LIMIT ... OFFSET ... in production?

https://www.joptimize.io/

#JavaDev #SpringBoot #PostgreSQL #JavaPerformance #Backend
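The keyset logic above can be sketched in plain Java. This is a minimal in-memory illustration of the (created_at, id) tuple comparison and tie-breaking, not a real database seek; the names `Row` and `nextPage` are hypothetical, invented for this sketch.

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

public class KeysetDemo {
    record Row(long id, long createdAt) {}

    // Returns the next page: rows strictly "before" the cursor in
    // (createdAt DESC, id DESC) order, limited to pageSize.
    static List<Row> nextPage(List<Row> table, long lastCreatedAt, long lastId, int pageSize) {
        return table.stream()
                // tuple comparison: (createdAt, id) < (lastCreatedAt, lastId)
                .filter(r -> r.createdAt() < lastCreatedAt
                        || (r.createdAt() == lastCreatedAt && r.id() < lastId))
                .sorted(Comparator.comparingLong(Row::createdAt).reversed()
                        .thenComparing(Comparator.comparingLong(Row::id).reversed()))
                .limit(pageSize)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Row> table = List.of(
                new Row(1, 100), new Row(2, 100), new Row(3, 90), new Row(4, 80));
        // Cursor = last row of the previous page: (createdAt=100, id=2).
        // Row (1,100) shares the cursor's timestamp but has a smaller id,
        // so the tie-breaker still includes it.
        System.out.println(nextPage(table, 100, 2, 2));
    }
}
```

In production the filter and sort happen inside the database, driven by the composite index, so only the cursor values travel between pages.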
JOptimize’s Post
More Relevant Posts
🔥 𝗗𝗜𝗦𝗧𝗜𝗡𝗖𝗧 𝗱𝗶𝗱𝗻’𝘁 𝗳𝗶𝘅 𝘆𝗼𝘂𝗿 𝗾𝘂𝗲𝗿𝘆. 𝗜𝘁 𝗷𝘂𝘀𝘁 𝗵𝗶𝗱 𝘁𝗵𝗲 𝗽𝗿𝗼𝗯𝗹𝗲𝗺.

You saw duplicates → added DISTINCT → the result “looks correct”. But the database still did the wrong work.

Here are 4 real cases where DISTINCT lies to you 👇

1️⃣ 𝗢𝗻𝗲-𝘁𝗼-𝗺𝗮𝗻𝘆 𝗝𝗢𝗜𝗡 (𝗶𝗻𝘀𝘁𝗲𝗮𝗱 𝗼𝗳 𝗘𝗫𝗜𝗦𝗧𝗦)

SELECT o.*
FROM orders o
JOIN order_items oi ON oi.order_id = o.id
WHERE oi.status = 'ACTIVE';

🤕 duplicates
🚑 DISTINCT
🔍 Problem: You multiplied rows. One order → many items → many rows.
✅ Fix: WHERE EXISTS (...)
👉 If you don’t need child data, don’t JOIN it.

2️⃣ 𝗖𝗮𝗿𝘁𝗲𝘀𝗶𝗮𝗻 𝗺𝘂𝗹𝘁𝗶𝗽𝗹𝗶𝗰𝗮𝘁𝗶𝗼𝗻 (𝗺𝘂𝗹𝘁𝗶𝗽𝗹𝗲 𝗝𝗢𝗜𝗡𝘀)

FROM orders o
JOIN order_items oi ON oi.order_id = o.id
JOIN payments p ON p.order_id = o.id

3 items × 4 payments = 12 rows

🤕 data explosion
🚑 DISTINCT
🔍 Problem: You multiplied relationships.
✅ Fix: WHERE EXISTS (...) AND EXISTS (...), or split the queries.
👉 Multiple one-to-many JOINs = red flag.

3️⃣ 𝗝𝗢𝗜𝗡 𝗙𝗘𝗧𝗖𝗛 𝗲𝘅𝗽𝗹𝗼𝘀𝗶𝗼𝗻 (𝗝𝗣𝗔)

SELECT o FROM Order o
JOIN FETCH o.items
JOIN FETCH o.payments

🤕 looks fine in Java
🚑 DISTINCT
🔍 Problem: The SQL still produces 3×4 = 12 rows. Hibernate deduplicates objects, not SQL work.
✅ Fix: fetch one collection only; load the others separately or in batches.
👉 JOIN FETCH hides the explosion; it does not remove it.

4️⃣ 𝗪𝗿𝗼𝗻𝗴 𝗝𝗢𝗜𝗡 𝗰𝗼𝗻𝗱𝗶𝘁𝗶𝗼𝗻

JOIN payments p ON p.user_id = u.id   (should be order_id)

🤕 random duplicates
🚑 DISTINCT
🔍 Problem: Wrong relationship → quasi-cartesian result.
✅ Fix: verify the FK, verify the cardinality.
👉 DISTINCT can’t fix wrong logic.

🧠 𝗧𝗵𝗲 𝗽𝗮𝘁𝘁𝗲𝗿𝗻

In all four cases you created extra rows, then removed them. DISTINCT = post-processing, not a fix.

💬 𝗧𝗮𝗸𝗲𝗮𝘄𝗮𝘆

If DISTINCT is fixing your result, your query is already wrong. Or simpler: you didn’t remove duplicates. You paid the database to create them and then clean them up.

#backend #java #sql #databases #performance #systemdesign #softwareengineering #jpa
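The row-multiplication effect in case 1 can be shown without a database. This in-memory sketch (class and method names are invented for illustration) contrasts a join-then-deduplicate pipeline with an EXISTS-style filter:

```java
import java.util.List;
import java.util.stream.Collectors;

public class DistinctTrap {
    record Order(long id) {}
    record Item(long orderId, String status) {}

    // JOIN-style: one output row per matching (order, item) pair -> duplicates,
    // which are then removed again with distinct(), mimicking SELECT DISTINCT.
    static List<Long> joinThenDistinct(List<Order> orders, List<Item> items) {
        List<Long> joined = orders.stream()
                .flatMap(o -> items.stream()
                        .filter(i -> i.orderId() == o.id() && i.status().equals("ACTIVE"))
                        .map(i -> o.id()))
                .collect(Collectors.toList());
        System.out.println("rows produced before dedup: " + joined.size());
        return joined.stream().distinct().collect(Collectors.toList());
    }

    // EXISTS-style: at most one output row per order, no dedup needed.
    static List<Long> existsStyle(List<Order> orders, List<Item> items) {
        return orders.stream()
                .filter(o -> items.stream()
                        .anyMatch(i -> i.orderId() == o.id() && i.status().equals("ACTIVE")))
                .map(Order::id)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Order> orders = List.of(new Order(1), new Order(2));
        List<Item> items = List.of(
                new Item(1, "ACTIVE"), new Item(1, "ACTIVE"), new Item(1, "ACTIVE"),
                new Item(2, "CANCELLED"));
        System.out.println(joinThenDistinct(orders, items));
        System.out.println(existsStyle(orders, items));
    }
}
```

Both return the same order ids, but the join path materializes three intermediate rows for order 1 before throwing two away, which is exactly the work DISTINCT hides in SQL.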
Day 21: 🧑‍💻 Database Sharding: Horizontal Partitioning to Scale Beyond One Server (Java + Spring Boot)

What Is Database Sharding?

Sharding is horizontal partitioning: splitting a large dataset across multiple databases (shards), each holding a subset of the data. A shard key determines which shard a record belongs to.

Without sharding:
500M users → 1 database server
Full table scan → slow
Index too large for RAM → slow
Max: vertical scale (bigger machine = expensive)

With sharding:
500M users → 4 shard databases
Shard 0: userId hash 0-24% → 125M rows
Shard 1: userId hash 25-49% → 125M rows
Shard 2: userId hash 50-74% → 125M rows
Shard 3: userId hash 75-99% → 125M rows
Each query hits ONE shard → fast ✅

🏗️ Four Sharding Strategies

1. Hash Sharding (Most Common)

int shardIndex = Math.abs(userId.hashCode() % totalShards);
// userId "u-123" → hash → shard 2
// userId "u-456" → hash → shard 0
// Even distribution, no hotspots

2. Range Sharding

// userId < 1M → Shard 0
// userId 1M-2M → Shard 1
// userId > 2M → Shard 2
// Good for time-series: Jan→Mar shard 0, Apr→Jun shard 1

3. Directory Sharding

// Lookup table: userId → shardId
Map<String, Integer> shardDirectory = cache.get("shard-map");
int shardId = shardDirectory.get(userId);
// Flexible, but the directory is a single point of failure

4. Geo Sharding

// India users → Shard ap-south-1
// US users → Shard us-east-1
// Europe users → Shard eu-west-1
// Low latency for users in the same region

Configuration: 4 Shard DataSources

# application.yml
spring:
  datasource:
    shard0:
      url: jdbc:postgresql://shard0-db:5432/users
      username: app
    shard1:
      url: jdbc:postgresql://shard1-db:5432/users
      username: app
    shard2:
      url: jdbc:postgresql://shard2-db:5432/users
      username: app
    shard3:
      url: jdbc:postgresql://shard3-db:5432/users
      username: app

Key Takeaways:
Sharding = split data horizontally across multiple databases
Shard key = the field used to decide which shard (choose carefully!)
Hash sharding = most common; even distribution, but no range queries
Same key = same shard; always route the same userId to the same shard
Cross-shard = expensive; avoid COUNT, JOIN, ORDER BY across shards
Try first: table partitioning → read replicas → THEN sharding
Sharding is hard to undo; choose your shard key once and commit
Rule: if a single table > 100M rows AND performance is suffering → consider sharding

#JavaInProduction #RealWorldJava #Java #SpringBoot #BackendDevelopment #ProductionIssues #DataStructures #DSA #SystemDesign #SoftwareEngineering #JavaDeveloper #Programming #Sharding #Database
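The hash-routing snippet above can be made a bit more robust. A sketch of a hypothetical `ShardRouter` (not from the post), using `Math.floorMod` so that negative hash codes can never map outside the shard range:

```java
public class ShardRouter {
    private final int totalShards;

    public ShardRouter(int totalShards) {
        this.totalShards = totalShards;
    }

    // Math.floorMod always yields a value in [0, totalShards), even when
    // hashCode() is negative, so the same key always lands on the same shard.
    public int shardFor(String key) {
        return Math.floorMod(key.hashCode(), totalShards);
    }

    public static void main(String[] args) {
        ShardRouter router = new ShardRouter(4);
        for (String id : new String[]{"u-123", "u-456", "u-123"}) {
            System.out.println(id + " -> shard " + router.shardFor(id));
        }
    }
}
```

A resolver like this would sit in front of the four configured DataSources: compute the shard index, then pick the matching `shard0`..`shard3` connection. Note the determinism requirement from the takeaways: `shardFor` must return the same index for the same key on every call, or routing silently breaks.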
What is Connection Pooling?

Before understanding what it is, let's first understand the problem.

Imagine you have 10 API services running horizontally behind a load balancer. Each time a user requests some data (their profile, chat history, or feed), whichever API service handles that request opens a brand-new connection to the database.

Every single time, these steps happen:
1. TCP handshake (3-way to establish; teardown costs additional round trips)
2. Authentication / authorization
3. Memory allocation on both client and server
4. Session setup

Now think about the scale. You have 10 API servers, each handling, say, 100 requests/sec. That's 1,000 new connections being created every second, each one going through all 4 steps above before even touching your data.

This causes real problems:
High latency: the user waits for connection setup before their query even runs (adds 20–100 ms+ per request)
Database gets overwhelmed: databases have a hard connection limit (PostgreSQL defaults to 100), and 1,000 connections/sec will crush it
Wasted resources: CPU and memory burned on setup/teardown, not actual queries
Traffic spikes kill you: a sudden surge means thousands of simultaneous connection attempts, a thundering herd that can bring the DB down entirely

So the core problem is simple: opening a fresh DB connection per request is slow, expensive, and doesn't scale. This is exactly the problem connection pooling solves.

So what is connection pooling? Instead of opening and closing a connection on every request, you create a pool of connections once at startup and reuse them. The pool keeps, say, 20 connections open and alive.
When a request comes in:
1. It borrows a connection from the pool
2. Runs the query
3. Returns the connection; it stays open, ready for the next request

In other words:
At startup, the pool opens N connections to the DB and keeps them alive
When a request needs the DB, it borrows a connection from the pool
After the query, it returns the connection; the connection stays open
If all connections are busy, new requests wait in a queue (with a timeout)

Types of Connection Poolers

In-process: built into the library/driver (e.g., HikariCP for Java, SQLAlchemy's pool for Python). Lives inside your app.
External/sidecar: a separate process that proxies DB connections (e.g., PgBouncer for PostgreSQL, ProxySQL for MySQL). Shared across multiple app instances.

Real-world Impact

Without pooling, a simple app might spend 50–90% of query time just on connection setup. With pooling, that overhead drops to near zero for most requests, and you can serve far more traffic with the same database resources.

The most critical scenario is serverless or short-lived processes (Lambda functions, containers), where every invocation would otherwise create a fresh connection. An external pooler like PgBouncer becomes essential there.

NOTE: Connection pooling improves performance by reusing database connections and prevents exhausting PostgreSQL's connection limit.

#systemdesign
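The borrow/release cycle described above can be sketched in a few lines of Java. This is a toy pool over a `BlockingQueue`, not production code (real pools like HikariCP add health checks, leak detection, and sizing); the name `MiniPool` and its methods are invented for this sketch:

```java
import java.util.List;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

public class MiniPool<C> {
    private final BlockingQueue<C> idle;

    // At "startup" the pool is seeded with already-open connections.
    public MiniPool(List<C> connections) {
        this.idle = new LinkedBlockingQueue<>(connections);
    }

    // Borrow waits up to timeoutMs if every connection is busy,
    // mirroring the queue-with-timeout behavior described above.
    public C borrow(long timeoutMs) {
        try {
            C c = idle.poll(timeoutMs, TimeUnit.MILLISECONDS);
            if (c == null) throw new IllegalStateException("pool exhausted: all connections busy");
            return c;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for a connection", e);
        }
    }

    // Returning a connection keeps it open for the next caller.
    public void release(C c) {
        idle.offer(c);
    }

    public static void main(String[] args) {
        MiniPool<String> pool = new MiniPool<>(List.of("conn-1", "conn-2"));
        String c = pool.borrow(100);
        System.out.println("borrowed " + c);
        pool.release(c);
    }
}
```

The type parameter stands in for a real `java.sql.Connection`; the point is only the lifecycle: seed once, borrow, use, release, and fail fast (or queue) when the pool is drained.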
Are you building with Claude Code? Here is something to add to your Claude.md file to put a unique tag in each SQL statement. It works well, and you'll be VERY happy when attempting to identify the source of a long-running SQL statement.

Exit and restart Claude to ensure it takes effect. Once it is working and tested on a few new SQL statements, tell Claude to go back, refactor all existing code, and add the SQL tags based on the standard.

Paste in everything below, and adjust relative to your project and database engine:

## SQL Tagging

- Every SQL statement in Python code (batch jobs, APIs, shared modules) must include a comment tag as the first line of the SQL string identifying the source module and function/step.
- Format: `/* module:function */` for batch jobs and shared code; `/* api:namespace:operation */` for API endpoints.
- Examples:
```python
# Batch job / shared code
"""
/* obsprocess:processpending */
SELECT "SystemObservationUUID" ...
"""

# API endpoint
"""
/* api:admin:get_logs */
SELECT "LogUUID" ...
"""
```
- Tags appear in `pg_stat_statements`, `pg_stat_activity`, RDS Performance Insights, and PostgreSQL logs, enabling instant identification of which code path is responsible for a slow or blocking query.
- When a shared function (e.g., `commonobs.py`) is called by multiple batch jobs, tag with the shared module name (e.g., `/* commonobs:ootreesync */`), not the calling job.
- Tags must be kept in sync when SQL is refactored or moved between modules.
- **Rename rule:** When renaming a batch job, API namespace, or shared Python module, scan all SQL strings across the codebase for tags referencing the old name and update them. SQL tags are plain text inside string literals; they will NOT be caught by IDE rename refactoring tools.
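The same convention works in any language, since the tag is just a leading SQL comment. A small Java helper in the same spirit (the post's examples are Python; `SqlTag`, `tag`, and `isTagged` are hypothetical names for this sketch) could build and check the tags:

```java
import java.util.regex.Pattern;

public class SqlTag {
    // A tag is the first line of the SQL string: /* module:function */
    // (or /* api:namespace:operation */ with an extra segment).
    private static final Pattern TAG =
            Pattern.compile("^/\\* [\\w-]+(:[\\w-]+)+ \\*/");

    // Prepend a tag so it survives into pg_stat_activity etc.
    public static String tag(String module, String function, String sql) {
        return "/* " + module + ":" + function + " */\n" + sql;
    }

    // Useful in a lint step or test that rejects untagged SQL strings.
    public static boolean isTagged(String sql) {
        return TAG.matcher(sql.stripLeading()).find();
    }

    public static void main(String[] args) {
        String sql = tag("obsprocess", "processpending", "SELECT 1");
        System.out.println(sql);
        System.out.println("tagged: " + isTagged(sql));
    }
}
```

A check like `isTagged` is the cheap way to enforce the "tags must be kept in sync" rule: run it over every SQL constant in CI, since IDE rename refactoring will never touch these string literals.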
How SQL might lie to you.

The short version: SQL equality checks ignore trailing spaces. Java doesn't. Your DB client hides them. If you don't sanitize at system boundaries, dirty data will silently break your pipeline while tests pass.

The investigation trail:

Steps 1–2: No errors, and the data "looked" perfect. A batch service produced zero output with no exceptions, just a log: "0 records processed". The natural assumption was an empty input. But a direct query WHERE CATEGORY = 'Active' returned thousands of rows. Even a DISTINCT check looked clean. The database said: the data is here and it's fine.

Steps 3–4: The code was fine. The test data wasn't. Since data existed, I suspected the Java code. But the logic was a simple "Active".equals(j.getCategory()). So I seeded the integration tests by copying rows straight from the DB client's UI grid. Every test passed. What I only realized later: that was the contamination point. DB clients silently strip trailing spaces on display, so I had handed the test suite pre-sanitized strings.

Step 5: Inspecting the raw payload. If the DB had data and the code worked, something in between was wrong. Checking the raw JSON payload revealed it: the value wasn't "Active", it was "Active ". Every field had trailing spaces. This was a VARCHAR column, and VARCHAR doesn't pad, so the spaces had been physically inserted by an upstream ETL.

Step 6: Why did the DB hide this? Per the ANSI SQL standard, when comparing strings of unequal length with =, the shorter string is space-padded to match. So 'Active ' = 'Active' evaluates to TRUE. The database wasn't lying, exactly; it was following the spec. But the spec is surprising.

Step 7: The full delivery chain:
The DB stores dirty VARCHAR strings
JDBC extracts the exact dirty string
Spring JdbcTemplate maps it directly to the DTO
Jackson serializes it verbatim

No layer sanitizes. But "Active".equals("Active ") returns FALSE. Every record was silently filtered out.
Key takeaways:
1️⃣ SQL and Java equals() speak different languages. SQL space-pads for comparison; Java is strictly character-by-character.
2️⃣ Standard SQL diagnostics lie. DISTINCT, =, and GROUP BY are all space-insensitive. Detect dirty data with LENGTH(col) != LENGTH(RTRIM(col)).
3️⃣ DB clients sanitize display output. UI-copied test data will pass against clean strings while prod fails against dirty ones.
4️⃣ Trim at the boundary, not in the logic. Configure Jackson or your RowMapper to trim at deserialization; don't scatter .trim() calls across business logic.

Over to you:
What is the most frustrating "silent bug" you've chased down?
Where do you enforce data sanitization: the DB, the API boundary, or business logic?

#Java #DB2 #SQL #JDBC #Spring #JdbcTemplate #H2 #BDD #Debugging #DataEngineering #BackendEngineering
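The mismatch between the two comparison semantics is easy to reproduce in plain Java. This sketch simulates ANSI PAD SPACE comparison next to Java's strict `equals()`; the class and method names are invented for illustration, and `sqlPaddedEquals` only approximates what a PAD SPACE collation does:

```java
public class TrailingSpaces {
    // Java equality: strict, character by character.
    static boolean javaEquals(String a, String b) {
        return a.equals(b);
    }

    // ANSI PAD SPACE semantics: pad the shorter operand with spaces
    // before comparing, which is (roughly) what the database did.
    static boolean sqlPaddedEquals(String a, String b) {
        int len = Math.max(a.length(), b.length());
        return pad(a, len).equals(pad(b, len));
    }

    static String pad(String s, int len) {
        StringBuilder sb = new StringBuilder(s);
        while (sb.length() < len) sb.append(' ');
        return sb.toString();
    }

    public static void main(String[] args) {
        String clean = "Active", dirty = "Active ";
        System.out.println("Java equals:       " + javaEquals(clean, dirty));
        System.out.println("SQL-style equals:  " + sqlPaddedEquals(clean, dirty));
    }
}
```

The same pair of strings compares unequal in Java and equal under padded semantics, which is exactly why the batch filter dropped every row while the SQL diagnostics all looked clean.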
Why should UI and Java developers have all the fun with AI? 😄

This one feels personal. Having worked with data for years, I've always wanted to chat with my database the way UI and backend (Java and Python) developers chat with their code. Now it's finally happening. Databases are evolving from storage systems into intelligent, interactive collaborators. About time data folks had some fun too 😉

In this DZone article I cover everything you need to do that. Let me know if you have any questions.

#AI #SQL #DataEngineering #Oracle
https://lnkd.in/ePZ85dki
I read two JetBrains articles recently that seem to completely contradict each other.

One says: ❌ don't use data classes for entities in Kotlin.
The other says: ✅ data classes are a great fit for entities.

Same company. Opposite advice. The difference? One is about JPA. The other is about Spring Data JDBC. Here's why it matters.

──

Spring Data JPA (Hibernate) manages entities as tracked, mutable objects. To do that, it needs to:
→ create proxy subclasses for lazy loading (the class must be non-final)
→ call a no-arg constructor when loading from the DB
→ inject fields via reflection after construction (fields must be mutable)

Kotlin's data class breaks all three: final, immutable, no no-arg constructor.

──

Spring Data JDBC is a completely different philosophy. No dirty checking. No proxies. No lazy loading. You call save() → SQL runs. You call findById() → the result maps to your object. That's it.

And because it uses constructor-based mapping, data classes are a natural fit. Immutable val fields? Great. A final class? No problem. copy() to update instead of mutation? That's exactly the pattern.

──

So "don't use data classes for entities" really means "...when using JPA." It's not a rule about Kotlin + databases in general.

The two frameworks look similar from the outside: both are Spring Data, both talk to relational DBs. But they have fundamentally different models for how objects and databases interact. Once you see that, the contradiction disappears.

📝 Wrote a full breakdown on Medium; link in the comments.

#Kotlin #SpringBoot #SpringData #JPA #Backend
What is the config block in Dataform?

The config block of a SQLX file is the instruction manual beyond the actual SQL query. It defines how and what should be materialized. Let's look at a few examples!

The simplest config you will ever see is:

config {
  type: "table"
}

This tells Dataform: "I want to materialize the output of this query as a table that is rebuilt every time this action runs."

But we can get a bit crazy with JavaScript, and a config block can also look like this:

config {
  type: require("includes/core/modules/ga4/helpers").helpers.getModuleConfig('ga4').CUSTOM_LINEAGE.ga4_events_custom === "incremental" ? "incremental" : "view",
  schema: dataform.projectConfig.vars.OUTPUTS_DATASET,
  tags: ["module_ga4", "events"],
  description: "Custom lineage: intercept ga4_events before it flows into downstream tables. Edit to add/modify columns.",
  ...(require("includes/core/modules/ga4/helpers").helpers.getModuleConfig('ga4').CUSTOM_LINEAGE.ga4_events_custom === "incremental" ? {
    onSchemaChange: "EXTEND",
    bigquery: {
      partitionBy: "event_date",
      clusterBy: [...((require("includes/core/modules/ga4/helpers").helpers.getModuleConfig("ga4").CLUSTER_BY || {}).ga4_events || []).slice(0, 2), "event_name", "session_id"],
      labels: require("includes/core/helpers.js").helpers.storageLabels()
    }
  } : {}),
  ...require("includes/core/helpers.js").helpers.isModuleEnabled('ga4')
}

This tells a rather different story, as you can imagine, but after all the compiled JavaScript mumbo jumbo it reads something like:

"I want to materialize a date-partitioned (bigquery.partitionBy), clustered (bigquery.clusterBy) incremental table (type) in the outputs dataset (schema), with a defined table description (description) and labels (bigquery.labels). If there is a new field in the output query that didn't exist before, add it to the existing table's schema before running the incremental INSERT (onSchemaChange)."
It may look complex in the config block, but in the config.js file, you just see:

CUSTOM_LINEAGE: {
  ga4_events_custom: 'incremental',
  int_ga4_sessions_custom: 'view',
  ga4_sessions_custom: false
}

This is one of the key pillars of GA4Dataform, since it allows us to control how the pipeline behaves from a ~single place. To be fair, it is nothing groundbreaking. But when you build your project with this in mind, you will be surprised how much easier it makes maintaining your code as it grows.

#GA4 #Dataform #BigQuery #DataEngineering #Analytics #GCP #GoogleCloud #MarketingAnalytics #GoogleAnalytics4
I spent hours staring at this SQL query, confused 😅

SELECT u.name
FROM users u
WHERE NOT EXISTS (
  SELECT 1 FROM products p
  WHERE p.category = 'electronics'
  AND NOT EXISTS (
    SELECT 1 FROM order_items oi
    JOIN orders o ON oi.order_id = o.id
    WHERE o.user_id = u.id
    AND oi.product_id = p.id
  )
);

My first thought: "We want users who bought ALL electronics products, so why are we using NOT EXISTS?"

That one question opened up everything. Here's what I finally understood 👇

SQL does not have a "FOR ALL" keyword. You can't directly ask: "Did this user buy every electronics product?"

So you flip the question: "Is there any electronics product this user did NOT buy?"

Then negate it: "No such product exists" = the user bought everything ✅

That's the power of double NOT EXISTS. Three levels work together:
→ The main query loops through every USER
→ The outer subquery loops through every ELECTRONICS PRODUCT
→ The inner subquery checks: did this user buy this product?

If even ONE product is missed → the outer subquery catches it → the user is excluded ❌
If ZERO products are missed → the outer subquery returns nothing → the user is included ✅

The rule I'll never forget:
EXISTS = "at least one" → partial match
NOT EXISTS + NOT EXISTS = "every single one" → complete match

SQL thinking is not English thinking. Sometimes you have to flip the question to get the answer.

Currently building my SQL + Java backend skills, targeting product companies. Save this one; it's worth keeping.

Has SQL ever made you flip your thinking completely? Drop it below 👇

#SQL #BackendDevelopment #Java #LearningInPublic #WomenInTech
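The "for all" pattern (relational division) is easier to see in plain Java, where a FOR ALL primitive does exist. A small in-memory sketch (class and method names invented for illustration): `containsAll` is exactly "there is no required product this user did NOT buy".

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

public class BoughtAll {
    // purchases: user -> set of productIds that user bought.
    // Returns users who bought EVERY product in 'required',
    // i.e. the double-NOT-EXISTS query expressed directly.
    static List<String> usersWhoBoughtAll(Map<String, Set<Long>> purchases,
                                          Set<Long> required) {
        return purchases.entrySet().stream()
                // containsAll == "no required product is missing for this user"
                .filter(e -> e.getValue().containsAll(required))
                .map(Map.Entry::getKey)
                .sorted()
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Set<Long> electronics = Set.of(1L, 2L, 3L);
        Map<String, Set<Long>> purchases = Map.of(
                "alice", Set.of(1L, 2L, 3L, 9L),  // bought all three
                "bob", Set.of(1L, 3L));           // missed product 2
        System.out.println(usersWhoBoughtAll(purchases, electronics));
    }
}
```

SQL has no `containsAll`, so the query encodes the same test negatively: "NOT EXISTS a required product for which NOT EXISTS a matching purchase". Same logic, flipped twice.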
Big news for Java and Spring developers: Spring Data Valkey is now generally available.

This is a big step for teams that want to use Valkey in Spring apps without changing the familiar developer experience. You still get the Spring Data model you already know, with support for templates, repositories, cache abstractions, and Spring Boot workflows. It is also designed to make migration from Spring Data Redis push-button.

Head over to Mas Kubo's announcement to learn more: https://lnkd.in/ga2CFfkq