A question for every dev who's ever designed a database table: Did you design it for 10,000 rows or 10 million? Because too many schemas are designed for the demo. For the MVP. For "let's just ship it and optimize later." And "later" arrives as a 3 AM P1 page, 18 months down the road, when the table that "works fine" has grown 1000x and suddenly nothing works fine. We call these the haunting patterns: schema decisions that feel harmless at small scale and become structural nightmares at large scale. Data Drop #6 covers the big three: → UUIDs as primary keys — Random values fragment your B-tree indexes. At 500M rows, your index is bloated, your writes scatter across random pages, and your cache hit ratio craters. Sequential IDs exist for a reason. → "Just make it nullable" — The path of least resistance at design time. The source of a thousand bugs at query time. NULL doesn't equal NULL. Your aggregations silently skip rows. Your joins produce unexpected results. Nullable should be a conscious choice, not a default. → The EAV trap — Entity-Attribute-Value: the schema pattern that says "I don't want to commit to a data model." Congratulations, you now have a key-value store with the performance of a relational database and the flexibility of neither. Design for the table size you're going to have. Not the one you have today. Data Drop #6. Day 6 of 23. #AprilDataDrops #PostgreSQL #DataDrop6 #SchemaDesign #Database #Performance #OpenSourceDB OpenSource DB | Lahari Giddi
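A minimal PostgreSQL sketch of the first fix, assuming a hypothetical orders table: keep an append-friendly sequential key for the B-tree and expose a UUID as a secondary unique column when you need opaque public identifiers. All names here are illustrative, not from the post.

-- Sequential PK keeps B-tree inserts append-only and cache-friendly;
-- the UUID stays available for public-facing lookups.
-- gen_random_uuid() is built in from PostgreSQL 13 (pgcrypto before that).
CREATE TABLE orders (
    id          bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    public_id   uuid NOT NULL UNIQUE DEFAULT gen_random_uuid(),
    customer_id bigint NOT NULL,  -- NOT NULL as a conscious choice, not a default
    created_at  timestamptz NOT NULL DEFAULT now()
);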
How Do You Optimize a Query with Multiple Joins, Filters, and Pagination? 🤔 If you think pagination alone will optimize DB performance, you are 𝘄𝗿𝗼𝗻𝗴 — pagination is applied at the very end, after the joins and filters have already run. It does not really optimize the query; it only reduces network transfer. 📡 In this case, we need to carefully analyze the query and database structure. 🧠 If we join all tables first and then apply filters and pagination, here is what the database actually does: • First, it joins the tables and builds an intermediate result set in memory. 🗂️ • Then it applies filters. 🔎 • And at the end, it applies pagination. 📄 All the memory and CPU cost happens during the joins and filters. ⚙️ A GROUP BY clause requires even more processing power. 📊 If the dataset is too large, the database's CPU can 𝗲𝗮𝘀𝗶𝗹𝘆 𝘀𝗽𝗶𝗸𝗲. 📈 Now the question is: 𝗵𝗼𝘄 𝗱𝗼 𝘄𝗲 𝗼𝗽𝘁𝗶𝗺𝗶𝘇𝗲 𝗶𝘁? 🤔 If our DB design and query allow step-by-step filtering, like: • First apply filters on the user table • Then apply filters on another table • And so on... Then we can reduce the data being joined stage by stage using the 𝗪𝗜𝗧𝗛 clause, known as a 𝗖𝗼𝗺𝗺𝗼𝗻 𝗧𝗮𝗯𝗹𝗲 𝗘𝘅𝗽𝗿𝗲𝘀𝘀𝗶𝗼𝗻 (𝗖𝗧𝗘). 🧩 CTEs keep the intermediate result sets small during joins, so filters apply faster. ⚡ Each later stage then joins only the already-filtered data, and pagination comes last. 📄 If a join is only needed for data viewing and not for filtering, we can also apply that 𝗷𝗼𝗶𝗻 𝗮𝗳𝘁𝗲𝗿 𝗽𝗮𝗴𝗶𝗻𝗮𝘁𝗶𝗼𝗻, when the dataset is already very small. This also helps optimize the query. 🚀 I have been using the WITH clause (CTE) in many of my large queries, and it has helped me a lot in improving query performance — see the sketch below. 💡 #realMoneyLearnings #Databases #SQL #MySQL #DatabasePerformance #QueryOptimization #BackendEngineering #SoftwareEngineering #SystemDesign #TechLearning #LearningInPublic
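A sketch of that staged-filtering pattern, with hypothetical users and orders tables (all names are assumptions):

-- Filter each table first, join the reduced sets, paginate last.
WITH active_users AS (
    SELECT id, name
    FROM users
    WHERE status = 'active'
),
recent_orders AS (
    SELECT user_id, total
    FROM orders
    WHERE created_at >= DATE '2024-01-01'
)
SELECT u.name, o.total
FROM active_users u
JOIN recent_orders o ON o.user_id = u.id
ORDER BY o.total DESC
LIMIT 20 OFFSET 0;

One hedge worth adding: PostgreSQL 12+ may inline CTEs back into the main query, so write WITH recent_orders AS MATERIALIZED (...) if you specifically want the staged behavior, and check the plan with EXPLAIN either way.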
Your Database Is Lying to You About Performance 🗄️ It works fine in staging. 50ms response times, clean query plans, zero complaints. Then you hit production. 3 million rows later - everything falls apart. Here's what nobody tells you about database optimization: 1. An index on the wrong column is worse than no index. Indexes cost you on every INSERT and UPDATE. If your query doesn't use it, you're paying the write penalty for nothing. Run EXPLAIN ANALYZE. Look at what's actually being scanned. 2. N+1 queries are silent killers. One endpoint. Looks innocent. Under the hood it fires 1 query to get users, then 1 query per user to get their orders. 200 users = 201 queries. Your ORM is very good at hiding this from you. 3. LIKE '%keyword%' ignores every index you have. Leading wildcard = full table scan. Every time. If you need full-text search - use full-text search (PostgreSQL's tsvector, Elasticsearch, whatever fits). Don't fight SQL with SQL. 4. Pagination with OFFSET doesn't scale. OFFSET 100000 LIMIT 20 doesn't skip 100000 rows - it reads them all and throws them away. Use keyset pagination. Cursor-based. Your DBAs will stop avoiding eye contact with you. 5. The query that runs in 10ms on 10k rows runs in 40 seconds on 10M. Linear doesn't stay linear. Test with production-scale data. Always. 6. Slow query log is your best friend you never talk to. Enable it. Most performance issues announce themselves long before they become incidents. The database is rarely the problem. The queries are. What's the most expensive query bug you've ever shipped to production? 👇 #Database #SQL #BackendEngineering #Performance #SoftwareEngineering
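A hedged sketch of point 4, assuming an orders table with an index on (created_at, id); the :last_* placeholders stand for values from the previous page:

-- OFFSET version: reads and discards 100,000 rows before returning 20.
SELECT id, created_at
FROM orders
ORDER BY created_at, id
LIMIT 20 OFFSET 100000;

-- Keyset version: seeks straight past the last row the client saw.
SELECT id, created_at
FROM orders
WHERE (created_at, id) > (:last_created_at, :last_id)
ORDER BY created_at, id
LIMIT 20;

The row-value comparison is valid in PostgreSQL and MySQL 8; verify with EXPLAIN that your engine actually turns it into an index range scan.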
I used to think indexes were the 𝗲𝗮𝘀𝗶𝗲𝘀𝘁 way to 𝗳𝗶𝘅 𝘀𝗹𝗼𝘄 𝗾𝘂𝗲𝗿𝗶𝗲𝘀. Query slow? 👉 Add an index. Another query slow? 👉 Add another index. For a while, it actually works. ⚡ Queries become faster. 📊 Dashboards load quickly. Everyone is happy. But something interesting starts happening later. 🐢 Writes begin to slow down. INSERT, UPDATE, and DELETE operations take longer than expected. And the reason is simple: Every time data changes, the database must also update every related index. So if a table has too many indexes, each write operation becomes heavier. ⚖️ That’s the 𝘁𝗿𝗮𝗱𝗲-𝗼𝗳𝗳 many developers discover a bit late. Indexes are powerful, but creating them blindly can introduce new problems. Some common side effects: ✅ 𝗣𝗿𝗼𝘀 of adding indexes 🔎 Faster search and filtering (WHERE) 🔗 Faster joins between tables 📈 Better performance for sorting and grouping 🗂️ Large datasets become manageable ⚠️ 𝗖𝗼𝗻𝘀 of adding indexes 𝗯𝗹𝗶𝗻𝗱𝗹𝘆 🐌 Slower inserts, updates, and deletes 💾 Extra disk space for each index ⚙️ More work for the database to maintain them ❓ Some indexes may never even get used That’s why indexing is less about adding more, and 𝘮𝘰𝘳𝘦 𝘢𝘣𝘰𝘶𝘵 𝘢𝘥𝘥𝘪𝘯𝘨 𝘵𝘩𝘦 𝘳𝘪𝘨𝘩𝘵 𝘰𝘯𝘦𝘴. 𝘼 𝙜𝙤𝙤𝙙 𝙞𝙣𝙙𝙚𝙭 𝙪𝙨𝙪𝙖𝙡𝙡𝙮 𝙘𝙤𝙢𝙚𝙨 𝙛𝙧𝙤𝙢 𝙪𝙣𝙙𝙚𝙧𝙨𝙩𝙖𝙣𝙙𝙞𝙣𝙜 𝙝𝙤𝙬 𝙩𝙝𝙚 𝙙𝙖𝙩𝙖 𝙞𝙨 𝙖𝙘𝙩𝙪𝙖𝙡𝙡𝙮 𝙦𝙪𝙚𝙧𝙞𝙚𝙙. 🧠 Databases reward thoughtful design. Blind optimization rarely stays optimal for long. #realMoneyLearnings #Databases #MySQL #SQL #DatabasePerformance #BackendEngineering #SoftwareEngineering #SystemDesign #PerformanceOptimization #DatabaseIndexes #LearningInPublic
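A practical starting point for the "some indexes may never even get used" problem, in PostgreSQL, via the pg_stat_user_indexes statistics view:

-- Indexes never scanned since statistics were last reset.
-- idx_scan = 0 marks a candidate for dropping, not a verdict:
-- check replicas, rare batch jobs, and constraint-backing indexes first.
SELECT schemaname,
       relname      AS table_name,
       indexrelname AS index_name,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;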
I was debugging a slow query. Turns out… the database was just taking the scenic route 🌳 Everything looked fine. Indexes? Present. Query? Clean. Still… latency said “not today.” Went one layer deeper. Not the query. Not the data. 👉 The tree underneath it. Databases like MySQL and MongoDB don’t magically fetch data. They walk a path. And if that path is messy… your query is basically sightseeing before doing real work. 💡 What actually fixed it: Not rewriting the query 10 times. Instead: Rebuilding / optimizing indexes (hello B-Trees 👀) Fixing index choice based on access patterns Dropping useless indexes that were bloating traversal Checking if data distribution was skewing the tree ⚙️ The shift: I stopped asking: ❌ “Is my query efficient?” and started asking: ✅ “Is my data structure forcing extra work?” Because in production: Bad query → obvious problem Bad tree → silent performance killer And yeah… Sometimes the fix isn’t clever code. It’s just admitting: 👉 your tree grew wild and needs pruning. Next time your query is slow, don’t just optimize SQL. Trim the tree—you might save milliseconds (and your weekend). #BackendEngineering #SystemDesign #PerformanceEngineering #Databases #DSA #SoftwareEngineering
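For the PostgreSQL case, a minimal sketch of "trimming the tree"; index and table names are hypothetical, and MySQL and MongoDB have their own equivalents:

-- First confirm which path the planner is actually walking.
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders WHERE customer_id = 42;

-- Rebuild a bloated index without blocking writes (PostgreSQL 12+).
REINDEX INDEX CONCURRENTLY orders_customer_id_idx;

-- Drop indexes that only bloat traversal and slow down writes.
DROP INDEX IF EXISTS orders_legacy_status_idx;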
Why you can't "just" concurrently index partitioned tables in Postgres We’ve been working on partitioning some of our fastest-growing tables recently. Since these are high-traffic, write-heavy tables, we tried to add indexes concurrently to avoid downtime. That's when we hit a classic wall: ERROR: cannot create index on partitioned table "..." concurrently. The "Why": CREATE INDEX CONCURRENTLY is designed to avoid locking your table. But since a partitioned table is a virtual parent, Postgres can’t coordinate a "no-lock" build across multiple child tables at once from a single command. It defaults to an Access Exclusive Lock, which, in a production environment, is a dealbreaker for performance. The "Create-and-Link" Pattern we used instead: • The Shell: Run CREATE INDEX ... ON ONLY parent_table. This adds the metadata to the parent but keeps it Invalid at first. • The Background Build: Run CREATE INDEX CONCURRENTLY on each individual partition. This builds the index without blocking our production traffic. • The Link: Attach each partition index to the parent: ALTER INDEX parent_idx ATTACH PARTITION child_idx. The Result: Once the final partition is linked, the parent index automatically becomes Valid. We got zero downtime, and any future partitions will now inherit this index automatically. It's a bit more manual work, but it's a lifesaver for keeping a high-growth system responsive. #PostgreSQL #DatabaseEngineering #SystemDesign #Backend #Scalability #SDE
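The pattern in SQL, using hypothetical orders partitions (this is the sequence the PostgreSQL docs describe for indexing partitioned tables):

-- 1. The shell: metadata on the parent only; the index starts INVALID.
CREATE INDEX orders_created_at_idx ON ONLY orders (created_at);

-- 2. The background build: one concurrent build per partition.
CREATE INDEX CONCURRENTLY orders_2024_created_at_idx ON orders_2024 (created_at);
CREATE INDEX CONCURRENTLY orders_2025_created_at_idx ON orders_2025 (created_at);

-- 3. The link: attach each child index; once every partition is
-- attached, the parent index flips to VALID automatically.
ALTER INDEX orders_created_at_idx ATTACH PARTITION orders_2024_created_at_idx;
ALTER INDEX orders_created_at_idx ATTACH PARTITION orders_2025_created_at_idx;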
Most developers add database indexes expecting instant magic speed… …but many accidentally slow down their entire system instead. Here’s exactly how database indexing works under the hood — and why it’s a double-edged sword: Indexes are separate data structures that store a sorted map of your column values and point directly to the actual rows in the table. Instead of scanning every single row (a slow full table scan), the database can quickly jump to the right data — often in just a few steps. The Major Advantages: Lightning-fast reads: B-Tree indexes (the default in most databases) give O(log n) search time. They efficiently handle equality (=), range queries (>, <, BETWEEN), sorting, and JOINs. Specialized indexes unlock extra power: Hash indexes deliver true O(1) speed for exact matches, Bitmap indexes excel with low-cardinality data in analytics, and GiST/GIN handle full-text or spatial searches beautifully. Result: Queries that dragged for seconds now return in milliseconds, even on million-row tables. The Real Trade-Offs (Where It Hurts): Extra storage cost: Indexes can easily double or triple the size of your table. Slower writes: Every INSERT, UPDATE, or DELETE has to update all related indexes. This adds significant overhead and disk I/O, especially on high-write workloads. Maintenance burden: Choosing the wrong index type (like Hash for range queries) or creating too many indexes wastes space and can actually hurt performance. The smart approach: Focus indexes on columns frequently used in WHERE, ORDER BY, or JOIN conditions — especially on read-heavy tables. Regularly check which indexes are actually being used and drop the unused ones. Test changes carefully. Mastering this trade-off is what turns good backend systems into highly scalable ones. What’s your biggest indexing win — or the hardest lesson you learned about indexes? Drop it in the comments 👇 I read every single one. #DatabaseEngineering #SQL #PerformanceOptimization #BackendDevelopment #PostgreSQL #MySQL #DataEngineering #SystemDesign
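A small sketch of that "smart approach," assuming a hypothetical read-heavy orders table:

-- Index the columns the hot query filters and sorts on.
CREATE INDEX orders_customer_created_idx
    ON orders (customer_id, created_at DESC);

-- Then verify the planner actually uses it before declaring victory.
EXPLAIN ANALYZE
SELECT *
FROM orders
WHERE customer_id = 42
ORDER BY created_at DESC
LIMIT 50;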
The most important engineering decision isn't which framework to use. It's your database schema. What I've learned building with PostgreSQL: 1. Design for the queries you'll actually run. Write your 10 most critical queries first. Then design the schema. 2. Normalise - but know when to stop. 3NF is great until you're doing 7-table JOINs for a simple dashboard. 3. Index what you filter and sort on. Every column in WHERE, ORDER BY, or JOIN on high-traffic tables needs an index. 4. Use UUIDs as primary keys for public-facing data. 5. Add created_at and updated_at to every table. Every. Single. Table. 6. Never delete data. Add a deleted_at column instead. Soft deletes are recoverable. What's the hardest schema decision you've had to make? Follow Sachin Shah - backend engineering decisions that matter in production. #PostgreSQL #BackendDevelopment #SystemDesign #SoftwareEngineering #BuildInPublic #SQL #TechTips
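Tips 4-6 in one hedged PostgreSQL sketch (table and column names are illustrative):

CREATE TABLE documents (
    id         uuid PRIMARY KEY DEFAULT gen_random_uuid(),  -- opaque public-facing key
    title      text NOT NULL,
    created_at timestamptz NOT NULL DEFAULT now(),
    updated_at timestamptz NOT NULL DEFAULT now(),
    deleted_at timestamptz                                  -- NULL means the row is live
);

-- Soft delete: recoverable, auditable.
UPDATE documents SET deleted_at = now() WHERE id = '...';

-- Partial index so queries over live rows stay fast despite soft deletes.
CREATE INDEX documents_live_idx
    ON documents (created_at)
    WHERE deleted_at IS NULL;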
🚀 Database Indexing (Part 1): The Foundation of Fast Queries Before scaling systems with partitioning or distributed caching, the first step is Database Indexing. If your queries are slow, you’re likely missing the right indexes. 🔹 What is Database Indexing? Database Indexing is a technique used to improve query performance by creating a structure that allows faster data lookup. 👉 Like a book index — jump directly to the data instead of scanning everything. 🔹 How It Works Without Index ❌ ➡ Full Table Scan (O(n)) With Index ✅ ➡ Faster Lookup (O(log n)) 🔹 Types of Indexes 1️⃣ B-Tree Index (Most Common) Default index in most databases Supports: Equality (=) Range (>, <, BETWEEN) Sorting 2️⃣ Hash Index Best for exact match (=) Very fast lookup 👉 Limitation: ❌ No range queries ❌ No sorting 3️⃣ Composite Index Multiple columns Example: (user_id, created_at) 👉 Follows left-to-right rule 4️⃣ Unique Index Ensures no duplicate values Example: email, username 5️⃣ Full-Text Index Used for search functionality Example: product search, keyword search 🔹 Benefits ✅ Faster query execution ✅ Efficient searching ✅ Reduced full table scans ✅ Better performance for large datasets 💬 In Part 2, I’ll cover real-world problems, trade-offs, and best practices. #Database #BackendDevelopment #Java #SQL #Performance #Optimization
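A quick illustration of the left-to-right rule from point 3, with a hypothetical events table:

CREATE INDEX idx_user_created ON events (user_id, created_at);

-- Can use the index: the leading column is constrained.
SELECT * FROM events WHERE user_id = 7 AND created_at > '2024-01-01';
SELECT * FROM events WHERE user_id = 7;

-- Typically cannot use it efficiently: the leading column is skipped.
SELECT * FROM events WHERE created_at > '2024-01-01';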
Two things from last night's SQL session that I will not forget. First: a database is just a disk. All the complexity of schemas, queries, indexes and transactions is software built on top of a file sitting on your hard drive. That one realization made databases feel less magical and more understandable. Second: never DELETE without a WHERE clause. Ever. Without WHERE you are not deleting a row. You are deleting everything. One missing condition and your entire table is gone. No undo button. We also covered SQL vs NoSQL differences, creating and altering tables, primary keys, filtering with WHERE, AND/OR logic, pattern matching with LIKE and ILIKE, pagination with LIMIT and OFFSET, aggregations like COUNT, SUM, AVG, MIN, MAX, and multi-column aggregations. A lot in one session. Grateful to Hitesh Choudhary and Piyush Garg and the Chai Aur Code team for sessions that make you think. #SQL #Database #Backend #ChaiCode
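A habit that gives you the missing undo button, sketched for PostgreSQL (table name hypothetical):

-- Wrap risky deletes in a transaction.
BEGIN;
DELETE FROM customers WHERE id = 42;   -- never without a WHERE
SELECT COUNT(*) FROM customers;        -- sanity-check before it's permanent
COMMIT;                                -- or ROLLBACK; if the count looks wrong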
Many developers model everything as a table in PostgreSQL without knowing there's a feature that can make queries faster and simplify the database structure in certain cases: Composite Types. The idea is simple: you define a structured data type and use it as a column in any table. When it makes sense to use one: - The data has no identity of its own; it only exists alongside the parent record - You'll never query that data in isolation - You don't need history, auditing or traceability - It's a fixed, immutable value that just enriches the main row When it doesn't make sense and a 1:1 table is better: - The data can be referenced by other tables - You need history (the address changed and you want to keep the previous one) - It will grow over time and gain new columns - Other parts of the application need to access that data directly The main tradeoff: a composite type eliminates the JOIN and makes reads faster, but you give up flexibility. If requirements change and that data needs to become its own entity, the migration is a pain. *If you use Prisma, pay attention: it doesn't natively map composite types; it treats the column as `Unsupported`. To read the data, you need to unpack the fields directly in SQL via `$queryRaw`. The rule I use to decide: if the data is a value that describes something, make it a type. If the data is an entity with a life of its own, make it a table. Follow me for more technical posts like this. #PostgreSQL #SQL #Database #Backend #FullStack #DatabaseDesign #Prisma
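A minimal sketch of the pattern (type and table names are illustrative):

-- A value type: an address here has no identity of its own.
CREATE TYPE address AS (
    street text,
    city   text,
    zip    text
);

CREATE TABLE customers (
    id   bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    name text NOT NULL,
    home address
);

-- Writing the composite value, then unpacking a field.
-- Note the parentheses around the column when reading a subfield.
INSERT INTO customers (name, home)
VALUES ('Ada', ROW('12 Main St', 'Springfield', '49001'));

SELECT name, (home).city FROM customers;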