🚨 𝗔 𝗺𝗶𝘀𝘁𝗮𝗸𝗲 𝗜 𝘀𝘁𝗶𝗹𝗹 𝘀𝗲𝗲 𝘁𝗵𝗮𝘁 𝗺𝗮𝗸𝗲𝘀 𝗺𝗮𝗻𝘆 𝘀𝘆𝘀𝘁𝗲𝗺𝘀 𝘀𝗹𝗼𝘄… 𝗲𝘃𝗲𝗻 𝗶𝗻 𝗽𝗿𝗼𝗱𝘂𝗰𝘁𝗶𝗼𝗻

Poorly designed search operations.

A while ago, I reviewed a system where:
👉 Each query took several seconds
👉 The data volume was constantly growing

And the problem wasn’t the infrastructure… It was this 👇

🔎 Linear search over an unordered collection
➡️ O(n) complexity
➡️ Every request required scanning the entire dataset 😬

👉 This pattern is common, for example, in offline-first applications where data is downloaded once and then queried in memory.

The solution was simple:
👉 Switch the lookup structure to a HashSet

Results:
⚡ From seconds → milliseconds
🚀 No changes to business logic
🚀 No need to scale servers

🧠 𝗛𝗲𝗿𝗲’𝘀 𝘁𝗵𝗲 𝗸𝗲𝘆 (𝗮𝗽𝗽𝗹𝗶𝗲𝗱 𝘁𝗵𝗲𝗼𝗿𝘆):
Not all search operations are the same:
🔹 Unordered collection → O(n)
🔹 Binary Search (sorted data) → O(log n)
🔹 HashSet → O(1) average 🚀
🔹 TreeSet → O(log n) + keeps ordering

🎯 𝐏𝐫𝐚𝐜𝐭𝐢𝐜𝐚𝐥 𝐜𝐨𝐧𝐜𝐥𝐮𝐬𝐢𝐨𝐧:
👉 The problem isn’t working in memory
👉 The problem is failing to design data access properly

💡 Quick rule of thumb:
🔹 Speed → HashSet
🔹 Order → TreeSet
🔹 Already sorted data → Binary Search

💬 Curious to know: Have you seen this issue in real-world apps or offline systems?

#SoftwareEngineering #Java #BackendDevelopment #SystemDesign #PerformanceOptimization #DataStructures #Algorithms
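To make the contrast concrete, here is a minimal, self-contained Java sketch (the collection size and key names are invented for illustration; the post doesn't show the original system's code):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class LookupDemo {
    public static void main(String[] args) {
        int n = 2_000_000;
        List<String> list = new ArrayList<>(n);
        Set<String> set = new HashSet<>(n * 2);
        for (int i = 0; i < n; i++) {
            String id = "user-" + i;
            list.add(id);
            set.add(id);
        }

        String target = "user-" + (n - 1);        // worst case for the linear scan

        long t0 = System.nanoTime();
        boolean inList = list.contains(target);   // O(n): walks the whole list
        long t1 = System.nanoTime();
        boolean inSet = set.contains(target);     // O(1) average: one hash lookup
        long t2 = System.nanoTime();

        // Rough single-shot timing, enough to show the order-of-magnitude gap
        System.out.printf("list.contains -> %b in %.2f ms%n", inList, (t1 - t0) / 1e6);
        System.out.printf("set.contains  -> %b in %.4f ms%n", inSet, (t2 - t1) / 1e6);
    }
}
```

The list scan grows with the size of the data on every request, while the HashSet lookup stays essentially flat, which is the seconds-to-milliseconds gap described above.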
Optimize Search Operations with HashSet
More Relevant Posts
Choosing the wrong data structure can make your code 100x slower. Here is how to pick the right one!

Every data structure has a specific use case. Using the wrong one is like using a hammer to cut wood.

Array
✅ Fast random access by index (O(1))
❌ Fixed size, slow insertions/deletions
Use case: When you know the size and need fast lookups

Queue (FIFO)
✅ First In, First Out operations
Use case: Task scheduling, breadth-first search, handling requests

Stack (LIFO)
✅ Last In, First Out operations
Use case: Undo/redo, function calls, depth-first search, expression evaluation

Linked List
✅ Fast insertions/deletions (O(1) at head)
❌ Slow search (O(n))
Use case: When you need frequent insertions/deletions, implementing queues/stacks

Tree
✅ Hierarchical data, fast search in balanced trees (O(log n))
Use case: File systems, databases, decision trees, BST for sorted data

Graph
✅ Represents relationships between entities
Use case: Social networks, maps/routing, recommendation systems

Matrix
✅ 2D data representation
Use case: Image processing, game boards, mathematical computations

Max Heap
✅ Fast access to maximum element (O(1))
Use case: Priority queues, finding top K elements, median streaming

Trie
✅ Fast prefix searches (O(m) where m is string length)
Use case: Autocomplete, spell checkers, IP routing

HashMap
✅ Fast key-value lookups (O(1) average)
Use case: Caching, counting occurrences, fast lookups

HashSet
✅ Fast membership checks, no duplicates (O(1) average)
Use case: Removing duplicates, checking existence

Pro tip: The best data structure is not always the most complex one. Sometimes a simple array is all you need.

Which data structure do you find yourself using the most? Share below!

#DataStructures #Programming #Java #BackendDevelopment #Algorithms #SoftwareDevelopment
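As a quick illustration of two rows from that list, here is a small sketch using only java.util (the word list is made up):

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

public class PickTheStructure {
    public static void main(String[] args) {
        String[] words = {"cache", "index", "cache", "queue", "index", "cache"};

        // HashMap: counting occurrences, O(1) average per word
        Map<String, Integer> counts = new HashMap<>();
        for (String w : words) {
            counts.merge(w, 1, Integer::sum);
        }

        // Heap (PriorityQueue): the most frequent entry sits at the top,
        // no need to sort the whole map
        PriorityQueue<Map.Entry<String, Integer>> heap =
            new PriorityQueue<>(Map.Entry.comparingByValue(Comparator.reverseOrder()));
        heap.addAll(counts.entrySet());

        System.out.println("Counts: " + counts);
        System.out.println("Most frequent: " + heap.peek().getKey());
    }
}
```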
I almost shipped a performance disaster to production today. 💀

We talk a lot about the classic "N+1 query problem" when reading data (forgetting .Include()). But today, I faced something much worse:

𝗧𝗵𝗲 𝗡+𝟭 𝗪𝗿𝗶𝘁𝗲 𝗽𝗿𝗼𝗯𝗹𝗲𝗺.

I was building an endpoint to sync a user’s subjects and skills. The logic was standard:
1️⃣ Loop through the incoming payload.
2️⃣ If it's new, insert it.
3️⃣ If it exists, update it.
4️⃣ If it’s missing, soft-delete it.

At first glance, the code looked fine. But looking closer, I realized I was calling _repository.Insert() and .Update() INSIDE the loop.

If a user updated 50 subjects, my API would trigger 50 separate database round-trips. This is the pattern known as RBAR ("Row By Agonizing Row"), and it destroys performance under load.

And the worst part? It works perfectly fine in development… until real traffic hits. ⚠️

👉 𝗛𝗼𝘄 𝗜 𝗳𝗶𝘅𝗲𝗱 𝗶𝘁:
I stopped treating the database like an array. Instead of executing queries inside the loop, I created in-memory lists (toInsert, toUpdate, toDelete). Inside the loop, I simply populated these lists. Then, completely outside the loop, I used .BulkInsert() and .BulkUpdate().

𝗧𝗵𝗲 𝗥𝗲𝘀𝘂𝗹𝘁?
What could have been 50+ database round-trips was reduced to exactly 3 efficient batch operations. 🚀

💡 𝗟𝗲𝘀𝘀𝗼𝗻 𝗹𝗲𝗮𝗿𝗻𝗲𝗱:
Your application memory is incredibly fast. Your database network is incredibly slow. Do the heavy lifting in memory, and hit the database only when you are ready to process everything at once.

How do you handle complex master-detail updates in your APIs? Do you rely on standard EF Core change tracking, or do you prefer explicit bulk operations? 👇

#csharp #dotnet #softwarearchitecture #efcore #backend #performance #database
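The post is about EF Core, but the collect-then-batch idea is just as visible in plain JDBC. A sketch with made-up table and column names, not the author's code:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

public class BatchedWrites {
    // Hypothetical shape of the incoming payload
    record Skill(long userId, String name) {}

    static void insertSkills(Connection conn, List<Skill> toInsert) throws SQLException {
        String sql = "INSERT INTO user_skills (user_id, name) VALUES (?, ?)";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            for (Skill s : toInsert) {
                ps.setLong(1, s.userId());
                ps.setString(2, s.name());
                ps.addBatch();        // queued in memory, no round-trip yet
            }
            ps.executeBatch();        // one round-trip for the whole batch
        }
    }
}
```

One executeBatch() call replaces the per-row round-trips; the same approach applies to the update and delete lists.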
🚨 Database Integrity: Small Mistake, Big Disaster!

Ever inserted data into multiple tables… and one query failed? Now your DB is half-updated, inconsistent, and debugging becomes a nightmare 😅

👉 This is where Transactions save you.

💡 Real Scenario: Placing an order
1. Insert into orders
2. Insert into order_items
3. Update inventory

If step 2 fails and you didn’t use a transaction…
❌ Order created
❌ Items missing
❌ Inventory unchanged
Welcome to data corruption.

With a transaction:
👉 If anything fails → auto rollback
👉 Either everything saves OR nothing saves

🧠 Things to Take Care Of:
Use foreign keys (order_id → orders.id)
Keep transactions short (avoid locks)
Never call external APIs inside transactions
Handle retries for deadlocks
Log failures clearly

🔥 Golden Rule: “If your operation spans multiple tables → ALWAYS use a transaction.”

😄 Think of it like this: Either the entire food order arrives 🍔🍟🥤 OR you cancel the whole thing… not just fries 😆

📌 Follow for more real backend engineering insights
💬 Suggest what topic you want next (IVR, AI calls, scaling, etc.)

#Backend #NestJS #Database #SystemDesign #SoftwareEngineering #TechArchitecture #Developers #Coding #CleanCode
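A minimal JDBC sketch of the same order flow, assuming hypothetical orders, order_items and inventory tables (column names invented, error handling trimmed to the essentials):

```java
import java.sql.*;

public class PlaceOrder {
    static void placeOrder(Connection conn, long customerId, long productId, int qty) throws SQLException {
        conn.setAutoCommit(false);                                   // start the transaction
        try {
            long orderId;
            try (PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO orders (customer_id) VALUES (?)",
                    Statement.RETURN_GENERATED_KEYS)) {
                ps.setLong(1, customerId);
                ps.executeUpdate();
                try (ResultSet keys = ps.getGeneratedKeys()) {
                    keys.next();
                    orderId = keys.getLong(1);
                }
            }
            try (PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO order_items (order_id, product_id, qty) VALUES (?, ?, ?)")) {
                ps.setLong(1, orderId);
                ps.setLong(2, productId);
                ps.setInt(3, qty);
                ps.executeUpdate();
            }
            try (PreparedStatement ps = conn.prepareStatement(
                    "UPDATE inventory SET stock = stock - ? WHERE product_id = ?")) {
                ps.setInt(1, qty);
                ps.setLong(2, productId);
                ps.executeUpdate();
            }
            conn.commit();                                           // everything saves together
        } catch (SQLException e) {
            conn.rollback();                                         // or nothing saves at all
            throw e;
        } finally {
            conn.setAutoCommit(true);
        }
    }
}
```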
Day 1 - first run: completely frozen, no crash, no output, no error, just a black screen.

traced the full execution path:
_onCreate() opens godown_inventory.db → starts transaction → calls _seedData()
→ _seedData() calls IdGenerator
→ IdGenerator opens id_sequences.db
→ SQLite lock — waiting for first transaction
→ DEADLOCK

two databases - one waiting for the other to release a lock that would never be released.

fix:

Future<void> _onCreate(Database db, int version) async {
  await db.transaction((txn) async {
    // CREATE TABLE statements only — no IdGenerator here
  });
  await _seedData(db); // AFTER transaction closes
}

the better long-term fix: replace IdGenerator entirely with the uuid package. no database, no sequences, no locks. trade-off: loses the human-readable format that staff actually use.

--------------------------------------------------------------
#Flutter #SQLite #Debugging #BuildInPublic #MobileEngineering
Stop writing "perfect" queries that die in production.

I used to treat SQL indexing as a "Day 2" task. Something to optimize "later" once the data grew. I was wrong.

I wrote a query that was clean and logical. It passed every dev-environment test. Then it hit a table with 10 million rows.

The result?
• CPU usage spiked to 98%.
• Execution time went from milliseconds to "go grab a coffee."
• The logic was perfect—the performance was a disaster.

That was my wake-up call. An index isn’t a tuning step; it’s a structural requirement.

The 3 Mindset Shifts:
1. Scale over Correctness: If it doesn’t work at 10M rows, it doesn’t work. Period.
2. The Full Table Scan is the enemy: Without an index, your database is reading every single page. With one, it’s a direct flight to the data.
3. Indexing isn't free: Bad or redundant indexes are just as dangerous as none. They bloat storage and slow down your INSERTs.

My new "Pre-Flight" checklist:
• What is the SARGability of my WHERE clause? (For example, WHERE YEAR(order_date) = 2024 hides the column inside a function and forces a scan, while WHERE order_date >= '2024-01-01' AND order_date < '2025-01-01' can seek on an index over order_date.)
• Are my JOIN columns indexed on both sides?
• Am I creating a new index, or can I refine an existing one?

I’m currently diving deeper into the nuances of Clustered vs. Non-Clustered structures and learning to read execution plans like a map.

The lesson: Performance isn’t an afterthought. It’s part of the design.

#SQL #SQLServer #DatabaseDesign #DataEngineering #BackendDevelopment #TechJourney #PerformanceTuning #Programming
I spent the last week building a data science environment on my homelab from scratch. 150GB of social media post data, a full congressional record, and a PostgreSQL instance tuned to eat all of it. Here's how it went.

The dataset was 1.7 million tiny JSONL files. Python can parse fast, but opening 1.7M files means 1.7M syscalls — that's the actual bottleneck, not the data size. So I skipped Python entirely for the merge step. One bash pipeline:

find ... -name '*.jsonl' -print0 | xargs -0 cat | split -C 1G

cat and split are C programs doing buffered I/O. No interpreter overhead, no per-file open/close. That turned 1.7M files into ~150 clean 1GB chunks in about 3 hours.

From there, Python took over. Each chunk gets read with orjson, loaded into PostgreSQL via COPY over a Unix socket — no TCP overhead. The table uses declarative range partitioning by week, 208 partitions spanning 2023–2026, all managed through SQLAlchemy and Alembic migrations.

The ingestion pipeline uses a staging table pattern: COPY a batch into a temp table, then INSERT INTO ... SELECT ... ON CONFLICT DO NOTHING into the partitioned table. When a batch fails, it splits in half and retries recursively until it isolates the single bad row, which gets logged to a failed_ingestion table. No silent data loss, no full-batch failures.

ALL of this planning meant 250 million rows ingested in about 30 minutes.

Same database also holds a full congressional dataset — 164,753 bills with their full text, vote records, legislator profiles, and social media accounts. Proper relational models with foreign keys and cascading deletes, loaded from congress-tracker YAML/JSON sources.

One of the things this led to was moving my PostgreSQL WAL to its own ZFS dataset on Optane drives — because I was seeing 1GB/s writes when I ingested the first time.

All open source: https://lnkd.in/eu2xm2Md

I needed all of this data because Matt, the data scientist I'm working with, is a wonderful crazy person that said we need more data.

#DataEngineering #PostgreSQL #ZFS #NixOS #Python #Infrastructure
Case Study EP-01: Multi-Agent T2SQL Running Fully Local — No API Key Needed 🤖

Text-to-SQL isn't new. Making it actually work in production is.

Most implementations throw one prompt at one model and call it done. This one splits the work properly across four agents using AgentScope: schema understanding, query generation, validation, and execution. Each agent has one job. Nothing gets stuffed into a single overloaded prompt hoping for the best.

Local inference runs on a quantized model via Ollama. Fast enough to use daily, small enough for your laptop. Default is Qwen, but the part I enjoyed building was making it model-agnostic. Pull any GGUF-compatible model, change one field in model_configs.json, and the entire pipeline reroutes. No agent code touched. Mistral, Phi-3, Llama3, CodeLlama, whatever your hardware can handle.

The full stack is AgentScope for orchestration, Ollama for local inference, Qdrant so agents remember good past queries, and Celery + Redis to handle async tasks when generation runs long.

Fully open source. Link in the comments.
🔗 GitHub: https://lnkd.in/dCsz8F5v

👉 Disclaimer: This is built to understand how multi-agent orchestration works with local models, not to claim it's production-perfect. Treat it as a starting point and swap whatever makes sense for your setup.

#text2sql #qdrant #productionsystems #agentscope #multiagentsystems
🌳 𝗧𝗿𝗲𝗲𝘀: 𝗧𝗵𝗲 𝗠𝗼𝘀𝘁 𝗨𝗻𝗱𝗲𝗿𝗿𝗮𝘁𝗲𝗱 𝗗𝗮𝘁𝗮 𝗦𝘁𝗿𝘂𝗰𝘁𝘂𝗿𝗲 𝗶𝗻 𝗦𝘆𝘀𝘁𝗲𝗺 𝗗𝗲𝘀𝗶𝗴𝗻

When developers think about data structures, arrays and lists come first. But the real power behind scalable systems lies in Trees.
• A tree is not just a structure
• It’s a way to represent hierarchy efficiently

𝗨𝘀𝗮𝗴𝗲:
• File Systems → Folder structure
• Databases → B-Tree indexing
• HTML DOM → Web page structure

𝗪𝗵𝘆 𝗧𝗿𝗲𝗲𝘀 𝗠𝗮𝘁𝘁𝗲𝗿?
Because they give you:
• Fast search → O(log n) in balanced trees
• Structured data organization
• Efficient insert/delete operations

𝗖𝗼𝗿𝗲 𝗖𝗼𝗻𝗰𝗲𝗽𝘁𝘀 𝗘𝘃𝗲𝗿𝘆 𝗘𝗻𝗴𝗶𝗻𝗲𝗲𝗿 𝗠𝘂𝘀𝘁 𝗞𝗻𝗼𝘄:
• Binary Trees & BST
• Tree Traversals (DFS, BFS)
• Balanced Trees (AVL, Red-Black)
• Heaps & Tries

𝗦𝗲𝗻𝗶𝗼𝗿-𝗟𝗲𝘃𝗲𝗹 𝗜𝗻𝘀𝗶𝗴𝗵𝘁:
In real-world systems, trees are used to:
• Handle millions of records efficiently
• Build database indexes
• Design scalable architectures

#DataStructures #SystemDesign #Trees #SoftwareEngineering
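A small java.util.TreeMap example of that "O(log n) plus ordering" point (the data is invented):

```java
import java.util.NavigableMap;
import java.util.TreeMap;

public class TreeIndexDemo {
    public static void main(String[] args) {
        // TreeMap is a red-black tree: O(log n) lookups that also keep keys sorted,
        // which makes range queries cheap, much like a database index.
        NavigableMap<Integer, String> ordersByAmount = new TreeMap<>();
        ordersByAmount.put(120, "order-A");
        ordersByAmount.put(75, "order-B");
        ordersByAmount.put(310, "order-C");
        ordersByAmount.put(45, "order-D");

        // All orders with an amount between 50 and 200, already in sorted key order
        System.out.println(ordersByAmount.subMap(50, true, 200, true));

        // The cheapest order costing at least 100
        System.out.println(ordersByAmount.ceilingEntry(100));
    }
}
```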
Everyone talks about RAG like it starts and ends with a vector database. That’s lazy thinking.

RAG ≠ “dump everything into embeddings and pray.”

Here’s the uncomfortable truth: Most use cases don’t need a vector DB as the first step. They need structure.

Before jumping to embeddings, ask:
- Is my data already structured (SQL, APIs, logs)?
- Can exact match or filtered queries solve 80% of the problem?
- Do I really need semantic search here?

Because vector search comes with cost:
- Approximate results (not always correct)
- Extra infra + latency
- Re-ranking complexity
- Debugging nightmare

Now compare that with a well-designed local database (SQL/NoSQL):
- Deterministic queries (you know why results came back)
- Faster for exact retrieval
- Easier to maintain and scale early
- Cheaper

Better RAG architecture (practical approach):
1. Start with primary retrieval from your local DB (filters, joins, indexed search)
2. Add a semantic layer ONLY where meaning matters (unstructured text)
3. Hybrid retrieval → combine both
4. Re-rank + validate before sending to the LLM

This is how you build production-grade systems. Not by blindly plugging in a vector DB.

Vector DBs are powerful — but they are a tool, not the foundation.

If your system is slow, expensive, or hallucinating… It’s probably because you skipped the basics.

Build like an engineer, not like a tutorial.

#RAG #AIEngineering #SystemDesign #LLM #Backend
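A toy sketch of that "deterministic filter first, semantic ranking second" ordering. The SQL filter is stood in for by an in-memory predicate and embeddings are plain arrays, so treat it as an illustration of the flow rather than a real retrieval stack:

```java
import java.util.Comparator;
import java.util.List;

public class HybridRetrievalSketch {
    record Doc(String id, String category, double[] embedding) {}

    // Step 1: deterministic filter, the kind of narrowing a SQL WHERE clause does
    static List<Doc> filterByCategory(List<Doc> docs, String category) {
        return docs.stream()
                   .filter(d -> d.category().equals(category))
                   .toList();
    }

    // Step 2: semantic ranking, applied only to the already-filtered candidates
    static List<Doc> rankBySimilarity(List<Doc> candidates, double[] query, int topK) {
        return candidates.stream()
                         .sorted(Comparator.comparingDouble((Doc d) -> -cosine(d.embedding(), query)))
                         .limit(topK)
                         .toList();
    }

    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb) + 1e-12);
    }
}
```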
“Username already taken.” — but checked in milliseconds. How? 🤔

At first glance, it feels simple. You type a username on Gmail, and instantly the system tells you whether it’s available. But behind that tiny interaction lies a powerful concept that scales to billions of users without scanning massive databases every time.

The secret? Bloom Filters.

Instead of querying a huge dataset on every keystroke, systems use a probabilistic data structure that answers a smarter question:
👉 “Is this username definitely NOT taken… or maybe taken?”

Here’s how it works in practice:
• A Bloom Filter stores hashed representations of existing usernames
• It uses multiple hash functions to map each username to bits in a compact array
• When you check a new username, the system tests those same hash positions
⚡ If any bit is 0 → username is definitely available
⚡ If all bits are 1 → username is probably taken (then verified with a real database check)

This approach is:
✅ Extremely fast (constant-time lookups)
✅ Memory efficient (no need to store full datasets in memory)
✅ Scalable to billions of entries

The trade-off?
👉 Occasional false positives (it may say “taken” when it’s actually available)
👉 But never false negatives (it will never say “available” when it’s taken)

That’s why systems combine Bloom Filters with a final database validation step — giving you both speed and accuracy.

💡 Next time you see “Username already taken,” remember: It’s not just a check — it’s a carefully engineered system designed for scale.

Curious—where else have you seen Bloom Filters used in real-world systems?

#SoftwareEngineering #SystemDesign #Scalability #BackendDevelopment #TechInsights #Programming
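A simplified Java sketch of the bit-array idea. A real deployment would size the array and hash count from the expected number of usernames and the target false-positive rate, and would use stronger hash functions than this:

```java
import java.util.BitSet;

public class UsernameBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int numHashes;

    UsernameBloomFilter(int size, int numHashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.numHashes = numHashes;
    }

    // Derive several cheap "hash functions" from one hash by mixing in a seed
    private int position(String username, int seed) {
        int h = (username.hashCode() * 31 + seed) * 0x9E3779B9;
        return Math.floorMod(h, size);
    }

    void add(String username) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(position(username, i));
        }
    }

    boolean mightContain(String username) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(position(username, i))) {
                return false;   // a 0 bit: definitely not taken
            }
        }
        return true;            // all bits set: probably taken, confirm in the DB
    }

    public static void main(String[] args) {
        UsernameBloomFilter filter = new UsernameBloomFilter(1 << 20, 3);
        filter.add("alice");
        System.out.println(filter.mightContain("alice"));     // true
        System.out.println(filter.mightContain("bob_1984"));  // false (almost certainly)
    }
}
```

As the post says, a "probably taken" answer still goes to the database for the final check; only the "definitely available" answers skip it.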