Choosing the wrong data structure can make your code 100x slower. Here is how to pick the right one!

Every data structure has a specific use case. Using the wrong one is like using a hammer to cut wood.

Array
✅ Fast random access by index (O(1))
❌ Fixed size, slow insertions/deletions
Use case: When you know the size and need fast lookups

Queue (FIFO)
✅ First In, First Out operations
Use case: Task scheduling, breadth-first search, handling requests

Stack (LIFO)
✅ Last In, First Out operations
Use case: Undo/redo, function calls, depth-first search, expression evaluation

Linked List
✅ Fast insertions/deletions (O(1) at head)
❌ Slow search (O(n))
Use case: When you need frequent insertions/deletions, implementing queues/stacks

Tree
✅ Hierarchical data, fast search in balanced trees (O(log n))
Use case: File systems, databases, decision trees, BST for sorted data

Graph
✅ Represents relationships between entities
Use case: Social networks, maps/routing, recommendation systems

Matrix
✅ 2D data representation
Use case: Image processing, game boards, mathematical computations

Max Heap
✅ Fast access to maximum element (O(1))
Use case: Priority queues, finding top K elements, median streaming

Trie
✅ Fast prefix searches (O(m), where m is the string length)
Use case: Autocomplete, spell checkers, IP routing

HashMap
✅ Fast key-value lookups (O(1) average)
Use case: Caching, counting occurrences, fast lookups

HashSet
✅ Fast membership checks, no duplicates (O(1) average)
Use case: Removing duplicates, checking existence

Pro tip: The best data structure is not always the most complex one. Sometimes a simple array is all you need.

Which data structure do you find yourself using the most? Share below!

#DataStructures #Programming #Java #BackendDevelopment #Algorithms #SoftwareDevelopment
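To make the lookup-cost gap concrete, here is a small Python sketch (the sample data is invented for illustration; the same trade-offs apply to Java's ArrayList, HashSet, and HashMap):

```python
import time

n = 1_000_000
ids_list = list(range(n))   # array-like: membership check scans elements, O(n)
ids_set = set(ids_list)     # hash set: membership check is O(1) on average
target = n - 1              # worst case for the linear scan

start = time.perf_counter()
_ = target in ids_list      # scans up to every element
list_ms = (time.perf_counter() - start) * 1000

start = time.perf_counter()
_ = target in ids_set       # single hash lookup
set_ms = (time.perf_counter() - start) * 1000

print(f"list lookup: {list_ms:.3f} ms | set lookup: {set_ms:.6f} ms")

# Counting occurrences with a hash map (dict) instead of repeated scans
words = ["cat", "dog", "cat", "bird", "dog", "cat"]
counts = {}
for w in words:
    counts[w] = counts.get(w, 0) + 1
print(counts)  # {'cat': 3, 'dog': 2, 'bird': 1}
```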
Most people use Claude Code like a smarter autocomplete. That's not what it is.

If you structure your repo correctly, Claude Code operates more like a disciplined junior engineer — one that reads the docs before touching anything, follows your conventions, guards against dangerous operations, and leaves a clean audit trail after every session.

The difference isn't the model. It's the project structure. Here's what actually matters:

1. CLAUDE.md — your AI onboarding doc. Client context, architecture diagram, coding conventions, known gaps. Auto-loaded every session.

2. A session brief (read.md) — what today's focus is, what was decided last time, what's locked. Prevents you from repeating the same discovery work.

3. Slash commands — package your multi-step workflows as markdown files. /add-bronze-object, /add-gold-transform, /check-pipeline-status. One command, done correctly every time.

4. Hooks — Python scripts that intercept Claude before it runs a bash command or writes a file. Block destructive CLI calls. Catch bad SQL. Surface a git diff on exit. (A minimal hook sketch follows after this post.)

5. Discovery docs — let Claude query your actual source DB and document what it finds. Real column names, real data patterns, real gotchas. No guesswork in the SQL.

I ran this setup on a full Snowflake medallion pipeline — MSSQL source, Bronze → Silver → Gold, 25 objects. 25/25 built. 0 failures. One session.

I also wrote a section on prompt pollution — what happens when vague or exploratory prompts silently contaminate your session context and why it's so hard to catch. Worth reading if you use any LLM in your data work.

#DataEngineering #SnowflakeDB #ClaudeCode #ETL #ArtificialIntelligence #Python #DataPipeline #MLOps

Full article 👇
https://lnkd.in/gc7tAXDA
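For anyone curious what point 4 could look like in practice, here is a minimal, hypothetical pre-execution hook sketch. It assumes the hook receives the pending tool call as JSON on stdin and that a blocking exit code stops the command; check the Claude Code hooks documentation for the exact payload fields and exit-code contract before relying on anything like this.

```python
#!/usr/bin/env python3
"""Hypothetical pre-tool-use hook: block obviously destructive commands.

Assumptions (verify against the hooks docs): the pending tool call arrives
as JSON on stdin, and a blocking exit code prevents the command from running.
"""
import json
import re
import sys

BLOCKED_PATTERNS = [
    r"\brm\s+-rf\b",                  # recursive filesystem deletes
    r"\bDROP\s+(TABLE|DATABASE)\b",   # destructive SQL
    r"\bTRUNCATE\s+TABLE\b",
    r"\bgit\s+push\s+--force\b",
]

payload = json.load(sys.stdin)
# Field names below are assumed, not confirmed from the docs.
command = payload.get("tool_input", {}).get("command", "")

for pattern in BLOCKED_PATTERNS:
    if re.search(pattern, command, flags=re.IGNORECASE):
        print(f"Blocked potentially destructive command: {command}", file=sys.stderr)
        sys.exit(2)  # assumed blocking exit code

sys.exit(0)  # allow everything else
```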
Docker can be overwhelming to start with. Most data projects use Docker to set up the data infrastructure.

Here are six commands to quickly get you started building your own Dockerfiles (a minimal example that puts them together follows after this post):

1. FROM:
* This indicates the base operating system on which we can install the necessary libraries.
* You can use a pre-built image for a tool, or install it yourself on any *nix OS image.
* Search DockerHub for prebuilt images.

2. COPY:
* Used to copy files from the local filesystem into the image that you are creating.
* Used to copy requirements.txt or files necessary for setup.
* The files are copied over during image build.

3. RUN:
* Used to run commands as part of the image build.
* E.g., install the requests library as part of the image build with RUN uv add requests.

4. ENV:
* Used to set environment variables.
* E.g., set the Python path as part of the image build: ENV PYTHONPATH="/home/airflow:${PYTHONPATH}"

5. ENTRYPOINT:
* Used to execute a command or script that starts the main process for the image.
* It will always be run when you use the image to run a container.

6. Port and Volume:
* When running a container, expose a port to your local OS with the port flag (-p).
* Use a volume (-v) to mount files that need bidirectional syncing, e.g., code.

Like this thread? Please let me know what you think in the comments below. Also, follow me for more actionable data content.

#dataengineering #datapipeline #docker #data
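A minimal Dockerfile sketch tying the six pieces together (the base image, file names, and start command are placeholders, not from any particular project):

```dockerfile
# FROM: the base image we build on top of
FROM python:3.11-slim

# COPY: bring the dependency list and code into the image at build time
WORKDIR /app
COPY requirements.txt .
COPY src/ ./src/

# RUN: install dependencies as part of the image build
RUN pip install --no-cache-dir -r requirements.txt

# ENV: environment variables baked into the image
ENV PYTHONPATH="/app/src:${PYTHONPATH}"

# ENTRYPOINT: the main process started when a container runs from this image
ENTRYPOINT ["python", "src/main.py"]

# At run time (not part of the Dockerfile):
#   docker run -p 8080:8080 -v "$(pwd)/src:/app/src" my-image
```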
We’ve all been there. When we’re learning a new language or designing a schema, it’s easy to overlook data types. They feel like a "set it and forget it" step, but choosing the wrong one can lead to some serious headaches down the road.

🧠 Have you ever wondered why many developers steer clear of CHAR in favor of VARCHAR or dynamic strings?

📏 𝗧𝗵𝗲 𝗧𝗿𝗮𝗽 𝗼𝗳 𝗙𝗶𝘅𝗲𝗱 𝗟𝗲𝗻𝗴𝘁𝗵 (𝗖𝗛𝗔𝗥)
• CHAR is a fixed-length beast. If you define a column as CHAR(23), every single cell in that column will occupy space for exactly 23 characters.
• The "Storage Illusion": You might think you’re being precise, but if a name is only 5 characters long, the database pads the remaining 18 spaces with blanks.
• The Query Nightmare: This is where it gets tricky. When you filter or query that data, names with fewer than 23 characters might not return the results you expect because of those trailing spaces. It’s a silent bug waiting to happen! 🐛

🏗️ 𝗧𝗵𝗲 𝗗𝘆𝗻𝗮𝗺𝗶𝗰 𝗔𝗹𝘁𝗲𝗿𝗻𝗮𝘁𝗶𝘃𝗲
While VARCHAR is the standard go-to, it’s important to remember that it also carries a tiny bit of overhead to manage that variable length. However, when dealing with Big Data, the best practice is often to lean toward dynamic string handling. Why? Because dynamic strings adapt to the data you actually have, rather than forcing your data into a rigid, pre-defined box. 📦

💡 The Bottom Line
In the world of data engineering and backend dev, precision matters. Don't just pick a type because it's the default—pick it because you understand how it handles your data at scale.

#DataEngineering #Databricks #SQL #BackendDevelopment #BigData #DatabaseDesign #CodingLife
🚨 𝗔 𝗺𝗶𝘀𝘁𝗮𝗸𝗲 𝗜 𝘀𝘁𝗶𝗹𝗹 𝘀𝗲𝗲 𝘁𝗵𝗮𝘁 𝗺𝗮𝗸𝗲𝘀 𝗺𝗮𝗻𝘆 𝘀𝘆𝘀𝘁𝗲𝗺𝘀 𝘀𝗹𝗼𝘄… 𝗲𝘃𝗲𝗻 𝗶𝗻 𝗽𝗿𝗼𝗱𝘂𝗰𝘁𝗶𝗼𝗻

Poorly designed search operations.

A while ago, I reviewed a system where:
👉 Each query took several seconds
👉 The data volume was constantly growing

And the problem wasn’t the infrastructure… It was this 👇

🔎 Linear search over an unordered collection
➡️ O(n) complexity
➡️ Every request required scanning the entire dataset 😬

👉 This pattern is common, for example, in offline-first applications where data is downloaded once and then queried in memory.

The solution was simple:
👉 Change the data access structure (HashSet)

Results:
⚡ From seconds → milliseconds
🚀 No changes to business logic
🚀 No need to scale servers

🧠 𝗛𝗲𝗿𝗲’𝘀 𝘁𝗵𝗲 𝗸𝗲𝘆 (𝗮𝗽𝗽𝗹𝗶𝗲𝗱 𝘁𝗵𝗲𝗼𝗿𝘆):
Not all search operations are the same:
🔹 Unordered collection → O(n)
🔹 Binary Search (sorted data) → O(log n)
🔹 HashSet → O(1) 🚀
🔹 TreeSet → O(log n) + keeps ordering

🎯 𝐏𝐫𝐚𝐜𝐭𝐢𝐜𝐚𝐥 𝐜𝐨𝐧𝐜𝐥𝐮𝐬𝐢𝐨𝐧:
👉 The problem is not working in memory.
👉 The problem is failing to design data access properly.

💡 Quick rule of thumb:
🔹 Speed → HashSet
🔹 Order → TreeSet
🔹 Already sorted data → Binary Search

💬 Curious to know: Have you seen this issue in real-world apps or offline systems?

#SoftwareEngineering #Java #BackendDevelopment #SystemDesign #PerformanceOptimization #DataStructures #Algorithms
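A small sketch of the refactor described above, written in Python for brevity (the post's context is Java, where the equivalent structures are HashSet/HashMap; the names and data here are invented):

```python
# Before: every request scans the whole in-memory list, O(n) per lookup.
def is_allowed_slow(user_id: str, downloaded_users: list[dict]) -> bool:
    for user in downloaded_users:
        if user["id"] == user_id:
            return True
    return False

# After: build a hash-based index once at sync time, then O(1) average lookups.
def build_user_index(downloaded_users: list[dict]) -> set[str]:
    return {user["id"] for user in downloaded_users}

def is_allowed_fast(user_id: str, user_index: set[str]) -> bool:
    return user_id in user_index

# Usage: pay O(n) once when the data is downloaded, not on every query.
downloaded_users = [{"id": "u1"}, {"id": "u2"}, {"id": "u3"}]
index = build_user_index(downloaded_users)
print(is_allowed_fast("u2", index))  # True
```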
I built a Fast Data Analytics CLI Tool in Modern C++20 that processes million-row CSV datasets 10–12x faster than Python's Pandas — using custom multithreading, STL data structures, and a from-scratch CSV parser.

Here's what it can do in a single command:

fast_analytics.exe --group City --filter "Gender=Female" --filter "Rating>7" --chart

✅ Filters 1,000,000 rows down to matching ones
✅ Groups results by any column in parallel
✅ Prints an ASCII bar chart right in the terminal
✅ Exports to JSON + CSV + HTML report automatically

The performance numbers speak for themselves:

Dataset    | Python/Pandas | My C++ Tool | Speedup
100K rows  | ~1.8 sec      | ~180 ms     | ~10x
500K rows  | ~7.2 sec      | ~620 ms     | ~11x
1M rows    | ~14 sec       | ~1.2 sec    | ~12x

What I built from scratch:
🔹 Custom thread pool (producer-consumer pattern, 4 parallel workers)
🔹 CSV parser that handles quoted fields correctly
🔹 Row filter engine with 6 operators (=, !=, >, <, >=, <=)
🔹 Deep stats engine: median, stddev, P25/P75/P90/P95/P99, skewness
🔹 Benchmark system that reports rows/second throughput
🔹 HTML report exporter with dark-theme visual tables

The hardest bug I fixed: a race condition where the main thread was calling get_results() before worker threads finished. The fix was elegant — wrapping the ThreadPool in a { } block so the destructor's join() call guarantees all threads complete before we read results.

Tech stack: C++20 · STL · std::thread · mutex · condition_variable · CMake

GitHub: https://lnkd.in/dQdsmzqz

I'd love feedback from any C++ engineers or systems programmers in my network. What would you add next?

#CPlusPlus #SystemsProgramming #DataEngineering #OpenSource #SoftwareEngineering #Programming #CPP20
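The scope-based fix described above has analogues in most languages. As a rough illustration (in Python rather than the project's C++, and not the author's actual code), concurrent.futures gives the same guarantee: leaving the with-block waits for every worker before results are read.

```python
from concurrent.futures import ThreadPoolExecutor

def process_chunk(rows: list[int]) -> int:
    # Stand-in for per-chunk filtering/aggregation work.
    return sum(r for r in rows if r % 2 == 0)

chunks = [list(range(i, i + 1000)) for i in range(0, 10_000, 1000)]

# The with-block plays the role of the C++ { } scope: on exit it calls
# shutdown(wait=True), so all workers are done before results are read.
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(process_chunk, c) for c in chunks]

# Safe to read here: every future has completed.
results = [f.result() for f in futures]
print(sum(results))
```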
We cut peak-time dashboard resource usage by ~50% without adding new servers. Here’s the breakdown. 🚀

As traffic grew, one of our internal dashboards started slowing down exactly when usage was highest. Response times increased, database load spiked, and unnecessary queries were consuming resources. The issue wasn’t infrastructure. It was application-level inefficiency.

The Challenge
The dashboard was making repeated database hits while rendering data-heavy views. Classic symptoms:
• Slow response times during peak hours
• Increased DB utilization
• Higher CPU/memory pressure on the app layer

After profiling the flow, the root cause was clear:
👉 N+1 query patterns + repeated data-fetching logic

What I Changed

1️⃣ Consolidated Data Fetching
Used Django ORM features like:
• select_related() for ForeignKey joins
• prefetch_related() for reverse/M2M relationships
This ensured related data was fetched in batches instead of per record (a sketch follows after this post).

2️⃣ Reduced Repeated Query Execution
• Removed queryset evaluations inside loops
• Cached reusable datasets during the request lifecycle
• Avoided duplicate ORM calls across helper methods

3️⃣ Shifted Transformations to Python
Once the required data was fetched efficiently, grouping/filtering/manipulation was done in memory rather than by repeatedly querying the DB.

4️⃣ Leaner Payloads
Used .values() / targeted field selection where full model objects were unnecessary.

The Impact ⚡
• ~50% reduction in resource usage during peak load
• Significant drop in DB hits
• Faster dashboard response times
• Better stability under concurrent traffic

🚀 3 Lessons for Scaling Django Backends
1. Query count matters more than query elegance. One clean query repeated 500 times is still expensive.
2. Fetch once, process many. Databases should retrieve data; business logic can often run in memory.
3. Profile peak-traffic scenarios. Many bottlenecks only appear under real concurrency.

Performance wins don’t always come from bigger infra. Sometimes they come from better data flow design.

#Django #Python #BackendEngineering #PerformanceOptimization #Scalability #SoftwareEngineering
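Here is a minimal sketch of the N+1 consolidation described in point 1️⃣, using hypothetical Order/Customer models (the post does not show its actual models or fields):

```python
# Hypothetical models: Order has a ForeignKey to Customer and a reverse
# relation named "items". These names are illustrative only.
from myapp.models import Order  # assumed app/model names

# Before: one query for the orders, plus one query per order for its customer (N+1).
def order_rows_slow():
    return [
        (order.id, order.customer.name, order.total)   # each .customer hits the DB
        for order in Order.objects.all()
    ]

# After: related data joined in the same query, and only the needed columns returned.
def order_rows_fast():
    qs = (
        Order.objects
        .select_related("customer")                 # FK joined in a single query
        .values("id", "customer__name", "total")    # leaner payload than full objects
    )
    return list(qs)

# For reverse/many-to-many relations, prefetch_related() batches them into one extra query:
orders_with_items = Order.objects.prefetch_related("items")
```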
New project unlocked 🔓

I just finished building a 𝗖𝘂𝘀𝘁𝗼𝗺𝗲𝗿 𝗟𝗶𝗳𝗲𝘁𝗶𝗺𝗲 𝗩𝗮𝗹𝘂𝗲 (𝗖𝗟𝗩) 𝗣𝗿𝗲𝗱𝗶𝗰𝘁𝗶𝗼𝗻 𝗦𝘆𝘀𝘁𝗲𝗺.

The starting question: 𝘩𝘰𝘸 𝘮𝘶𝘤𝘩 𝘳𝘦𝘷𝘦𝘯𝘶𝘦 𝘸𝘪𝘭𝘭 𝘦𝘢𝘤𝘩 𝘤𝘶𝘴𝘵𝘰𝘮𝘦𝘳 𝘨𝘦𝘯𝘦𝘳𝘢𝘵𝘦 𝘰𝘷𝘦𝘳 𝘵𝘩𝘦𝘪𝘳 𝘭𝘪𝘧𝘦𝘵𝘪𝘮𝘦 𝘪𝘯 𝘰𝘶𝘳 𝘣𝘶𝘴𝘪𝘯𝘦𝘴𝘴?

Using the PostgreSQL DVD Rental dataset, I built an end-to-end pipeline:
- Designed an ETL pipeline that processes ~14,000 transactions from 9 tables into a customer-level OLAP star schema
- Engineered RFM-based features (Recency, Frequency, Monetary) for CLV modeling
- Trained and compared multiple ML models (Linear Regression, Random Forest, Gradient Boosting) using a chronological split and TimeSeriesSplit to avoid data leakage
- Deployed everything into an interactive Django web app with a prediction form and business recommendations
- The final model (Gradient Boosting) achieved strong performance, with R² close to 0.99 and low prediction error

One insight that came out of the analysis: customers who rent frequently, even at lower spend per transaction, often generate more lifetime value than occasional high spenders. Frequency matters more than the monetary average!

One limitation is that the dataset is static (historical DVD rental data), so the model reflects past behavior patterns rather than real-time customer activity. Additionally, some features like recency and tenure showed very low importance, likely due to the limited time range of the dataset, but they were still kept to ensure the model remains interpretable, aligned with business logic, and more generalizable to real-world scenarios beyond this dataset.

This project helped me understand how data engineering, machine learning, and business thinking come together in a real system, not just a model.

🖇️ GitHub → https://lnkd.in/g4k7iQuy

Would love any feedback or thoughts! 🖖🏻

#DataAnalytics #MachineLearning #Django #Python #PostgreSQL #PortfolioProject
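For readers new to RFM features, here is a rough pandas sketch of how Recency/Frequency/Monetary are typically derived from a transactions table (column names and values are illustrative, not the project's actual schema):

```python
import pandas as pd

# Illustrative transactions table: one row per payment/rental.
transactions = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2, 3],
    "payment_date": pd.to_datetime([
        "2007-02-15", "2007-03-20", "2007-02-01",
        "2007-02-10", "2007-04-05", "2007-03-01",
    ]),
    "amount": [2.99, 4.99, 0.99, 5.99, 2.99, 7.99],
})

# Snapshot date: the day after the last observed transaction.
snapshot = transactions["payment_date"].max() + pd.Timedelta(days=1)

# One row per customer: days since last purchase, purchase count, total spend.
rfm = transactions.groupby("customer_id").agg(
    recency_days=("payment_date", lambda d: (snapshot - d.max()).days),
    frequency=("payment_date", "count"),
    monetary=("amount", "sum"),
).reset_index()

print(rfm)
```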
The cleanest data in the world won’t save a product if you’re asking the wrong questions.

I’ve spent years deep in SQL and Python, building ETL pipelines to ensure our dashboards were flawless. But I’ve also seen teams obsess over "statistically significant" metrics while completely ignoring why users were dropping off in the first place.

Data is excellent at telling you what is happening. It’s terrible at telling you why.

Early on, I thought a dip in a specific conversion rate meant we needed to refactor the backend logic. After digging into the logs and running the queries, the data supported a technical "fix." Then I actually talked to a customer. The issue wasn’t the data latency or the algorithm. It was a confusing UI shift that made the "Submit" button look like a secondary link. No amount of Python scripting was going to surface that human frustration.

Data should inform your roadmap, but it shouldn't hold the pen. Use your queries to find the friction, but use your empathy to solve it.

#DataStrategy
𝗛𝗼𝘄 𝗜 𝗦𝗲𝘁 𝗨𝗽 𝗦𝗻𝗼𝘄𝗳𝗹𝗮𝗸𝗲 𝗖𝗼𝗿𝘁𝗲𝘅 𝗖𝗼𝗱𝗲 𝗖𝗟𝗜 𝗶𝗻 𝗩𝗦 𝗖𝗼𝗱𝗲

Just got Cortex Code CLI running in VS Code: an AI assistant that lives right in your terminal and talks to your Snowflake data warehouse using plain English.

Here's how to set it up in 5 simple steps:

𝗦𝘁𝗲𝗽 𝟭: 𝗜𝗻𝘀𝘁𝗮𝗹𝗹 𝗦𝗻𝗼𝘄𝗳𝗹𝗮𝗸𝗲 𝗖𝗟𝗜
pip install pipx
pipx install snowflake-cli --force
pipx ensurepath
snow --version
This installs the Snowflake CLI ("snow"), a prerequisite that Cortex Code depends on behind the scenes. pipx keeps it in an isolated environment so it doesn't mess with your other Python packages.

𝗦𝘁𝗲𝗽 𝟮: 𝗘𝗻𝗮𝗯𝗹𝗲 𝗖𝗿𝗼𝘀𝘀-𝗥𝗲𝗴𝗶𝗼𝗻 𝗜𝗻𝗳𝗲𝗿𝗲𝗻𝗰𝗲 (𝗿𝘂𝗻 𝗶𝗻 𝗦𝗻𝗼𝘄𝗳𝗹𝗮𝗸𝗲 𝗮𝘀 𝗔𝗖𝗖𝗢𝗨𝗡𝗧𝗔𝗗𝗠𝗜𝗡)
ALTER ACCOUNT SET CORTEX_ENABLED_CROSS_REGION = 'AWS_US';
This allows Snowflake to use AI models hosted in the AWS US region. Without this, Cortex Code can't reach the AI models it needs.

𝗦𝘁𝗲𝗽 𝟯: 𝗜𝗻𝘀𝘁𝗮𝗹𝗹 𝗖𝗼𝗿𝘁𝗲𝘅 𝗖𝗼𝗱𝗲 𝗖𝗟𝗜
- Windows PowerShell: check the commands in the comments

𝗦𝘁𝗲𝗽 𝟰: 𝗩𝗲𝗿𝗶𝗳𝘆 𝗜𝗻𝘀𝘁𝗮𝗹𝗹𝗮𝘁𝗶𝗼𝗻
cortex --version

𝗦𝘁𝗲𝗽 𝟱: 𝗟𝗮𝘂𝗻𝗰𝗵 & 𝗖𝗼𝗻𝗻𝗲𝗰𝘁
cortex
Follow the wizard, enter your Account ID, username, and auth method. Done!

Now you can ask things like:
- "What databases do I have access to?"
- "Show me top 10 orders by amount"
- "Fix the bug in @src/app.py"

It reads your local files, runs terminal commands, writes SQL, and even builds Streamlit apps, all from a chat interface in VS Code.

I ran into a few issues during my Windows installation (8.3 short filename TEMP path errors); I've documented the problems and fixes in detail on my Medium blog, along with a full demo video attached.

#Snowflake #CortexCode #AI #DataEngineering #VSCode #CLI #Tutorial
Most data analysts on my team spent more time writing SQL than actually analysing data. So I built a fix — without touching our existing Superset setup.

It's called a Text-to-SQL Sidecar: a standalone FastAPI microservice that sits alongside Apache Superset and turns plain English into validated, safe SQL.

You ask: "which products had the highest return rate last quarter?"
It generates, validates, and executes the SQL — then hands the results back.

A few things I was deliberate about:
→ AST-level SQL validation (not string matching — trivially bypassable)
→ Per-database table allowlists so the LLM can only touch what it's supposed to
→ Schema caching so we're not hammering the DB on every request
→ LLM-agnostic design — swap the endpoint URL, change the model
→ Reasoning traces returned alongside SQL so analysts can actually trust the output

Superset never needs to know it exists. It just receives SQL.

I wrote up the full implementation — architecture, code walkthrough, and the design decisions that make it production-ready. Link in the comments 👇

#DataEngineering #AI #SQL #FastAPI #ApacheSuperset #LLM #Python
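To make "AST-level validation plus a table allowlist" concrete, here is a minimal sketch using the sqlglot parser. The library choice, allowlist contents, and function shape are my assumptions for illustration, not necessarily what the sidecar actually uses.

```python
# Sketch: reject anything that isn't a plain SELECT over allowlisted tables.
import sqlglot
from sqlglot import exp

ALLOWED_TABLES = {"products", "orders", "returns"}  # hypothetical per-database allowlist

def validate_sql(sql: str) -> str:
    tree = sqlglot.parse_one(sql)              # parse into an AST; raises on invalid SQL

    if not isinstance(tree, exp.Select):       # only read-only SELECT statements pass
        raise ValueError("Only SELECT statements are allowed")

    for table in tree.find_all(exp.Table):     # walk every table reference in the AST
        if table.name.lower() not in ALLOWED_TABLES:
            raise ValueError(f"Table not allowlisted: {table.name}")

    return tree.sql()                          # return normalized SQL for execution

print(validate_sql("SELECT product_id, COUNT(*) FROM returns GROUP BY product_id"))
# validate_sql("DROP TABLE orders")  -> raises ValueError
```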