#Post7 In the previous post (https://lnkd.in/d6yT-aqc), we saw the different components of a process and learnt about the thread-specific components:
• Stack
• Program Counter
• Registers

Stack and Program Counter were explained there, but an important concept about Registers is still pending, which we will cover in this post.

Now let's understand something very important: how does a CPU actually execute multiple threads?

A CPU cannot run all threads at the same time; it depends on the number of cores. Example:
• 2 CPU cores → only 2 threads can run at a time

But real applications often have many more threads than cores. So what happens to the remaining threads? The CPU switches between them. This is called context switching.

What is context switching?
When the CPU switches from one thread to another, it:
• saves the current thread's state
• loads the next thread's state
• continues execution

Where does this state live while the thread runs? In registers. Registers hold a thread's intermediate execution data, so when a thread is paused, its register state is saved, and it can later resume from exactly the same point.

Important insight: more threads ≠ better performance. If there are too many threads:
• the CPU spends more time switching
• and less time doing actual work

This leads to performance degradation.

Key takeaway: context switching lets multiple threads share CPU time, but excessive switching reduces performance. Understanding this is crucial when designing multi-threaded systems.

In the next post, we'll explore how to choose the right number of threads and avoid performance issues.

#Java #SoftwareEngineering #Multithreading #BackendDevelopment #Programming
Context Switching in Multi-Threading: Understanding CPU Performance
I changed one word in C++ and my loop got faster. The word I removed? Nothing. It was already there, silently, by default.

Here's something I didn't know until recently: every time you write flag.store(true), C++ quietly applies the strongest possible memory guarantee under the hood. No warning. No footnote. Just the default. It's called memory_order_seq_cst.

And on x86-64, that one store doesn't compile to a plain memory write. It compiles to an MFENCE (or XCHG): a CPU instruction whose job is to freeze the pipeline, force every pending write to flush through the cache hierarchy, wait for every other core to acknowledge it, and only then let your program continue. That's roughly 40 to 100 cycles of your CPU sitting idle. Once? Noise. In a loop running millions of times per second? That's your latency budget.

The fix is literally one word:

// before — seq_cst store: full fence, every iteration
flag.store(true);

// after — plain atomic store, no fence
flag.store(true, std::memory_order_relaxed);

But you shouldn't copy-paste that without understanding why it's safe here and catastrophic somewhere else. The difference between relaxed, acquire, release, acq_rel, and seq_cst isn't just performance; it's correctness. You write std::atomic, it works, you move on, and the whole time the compiler is making decisions for you with real hardware consequences you never see.

So I wrote the post I wish existed when I first touched atomics. It covers:
* What all 5 memory orderings actually mean, with mental models, not just definitions
* Why your CPU has a store buffer and why that changes everything
* What LOCK XCHG and MFENCE physically do to your pipeline
* How to use godbolt.org to see the assembly your compiler silently generates
* The one question to ask yourself when picking an ordering

Link in comments 👇

If you've ever written std::atomic and moved on without a second thought, this one's for you.

#cpp #systemsprogramming #concurrency #performance #programming
Padding vs Register Spilling

Ex1: Padding (memory alignment)

struct A {
    char a; // 1 byte
    int b;  // 4 bytes
};

Actual memory layout: [a][pad][pad][pad][b]

Why? The CPU prefers aligned memory access, so the compiler adds padding → faster access.

Ex2: Running out of registers (real + assembly)

int compute(int a, int b, int c, int d, int e, int f, int g) {
    return a + b + c + d + e + f + g;
}

On x86-64 (System V ABI), only six integer arguments are passed in registers (rdi, rsi, rdx, rcx, r8, r9). What the compiler may generate:

mov eax, edi   ; eax = a
add eax, esi   ; eax += b
add eax, edx   ; eax += c
add eax, ecx   ; eax += d
add eax, r8d   ; eax += e
add eax, r9d   ; eax += f
; g does not fit in an argument register → passed on the stack
mov r10d, [rsp+8]  ; load g from the stack
add eax, r10d      ; add g to the sum

Registers (eax, edi, esi, …) hold the variables that fit in CPU registers. The stack ([rsp+8]) is used when there are more live values than registers; when the compiler has to move a value it wanted in a register out to the stack, that is spilling. Each add instruction accumulates the sum. Notice how g is loaded from memory because the registers ran out.

Impact:
Register → very fast
Memory (stack) → slower
More spilling = performance drop

Key difference:
padding → added for performance
spilling → happens due to limited registers

One-line takeaway:
padding = better memory layout
spilling = register shortage

#linkedin #embedded #system #design #vlsi #c #computerarchitecture
Why can't application code talk to the CPU, RAM, and devices the way the OS does? Because the machine would be impossible to share safely. Every program would need its own drivers, fight over hardware, and one bug could corrupt another process or the whole system.

The operating system exists to fix that: abstract hardware, split CPU time and memory, route I/O, and enforce rules. The part that actually runs in privileged mode and makes those rules real is the kernel — not the shell, not your app.

Here's what you need in your head as an engineer:
• User mode vs kernel mode — Normal code runs with limited rights. It cannot map arbitrary physical memory or poke device registers. That boundary is what makes multitasking and isolation possible.
• System calls — When your program opens a file, allocates memory, or sends on a socket, it asks the kernel to do the privileged work. That's the stable interface, not direct hardware access.
• Kernel vs "the OS" — People say "the OS" for the whole stack (kernel, libraries, services, sometimes GUI). The kernel is the trusted core inside that stack — scheduling, memory management, device control, interrupt handling.
• Why it shows up in debugging — Permission errors, segfaults, "too many open files", syscall latency — you're already at this layer before you argue about languages or frameworks.

Get the privilege model right and the rest of OS coursework (processes, memory, I/O) snaps into place.

What clicked first for you — syscalls, or user vs kernel mode?

#OperatingSystems #Kernel #ComputerScience #SystemsProgramming #DevOps #SoftwareEngineering #TechLearning #LearninginPublic
Two C++ loops. Same code. Same flags.
One: 0.25s. The other: 1.25s.

👉 The code didn't change.
👉 The CPU's behavior did.

If instructions stay ≈ the same, but cycles ↑, IPC ↓, and cache misses ↑:
👉 It's not compute
👉 It's memory

I wrote a deep dive on using perf to see this clearly.
📖 https://lnkd.in/gjzwJWTP

If you're not measuring, you're guessing.

#CPP #Performance #Systems #LLVM #compilersutra
The free lunch is over. Clock speeds stopped growing.

For many years, every new computer was faster than the last because of higher clock speeds. Your code would run quicker without you changing anything. Programmers called this *"the free lunch"*.

But around 2004, this changed. Clock speeds hit a limit at about 3–4 GHz and stayed there. The reason: faster clocks create more heat than can be cooled easily. So chip designers started adding *more cores* instead of making each core faster. For people who write software, this is one of the biggest shifts in computing history.

Common mistake: writing single-threaded code and waiting for hardware to make it faster automatically. That time has passed. Code that doesn't use multiple cores will not get much faster even if you upgrade your CPU. The improvement now comes from more parallelism, not higher speed per core.

Understanding why parallel programming matters is the first step to wanting to learn it. The free lunch ended 20 years ago. Now only parallel programs can fully use modern hardware.

*All product names, trademarks, and registered trademarks mentioned herein are the property of their respective owners and are used solely for educational and informational purposes, with no implication of endorsement or affiliation.

#BackToBasics #SoftwareEngineering #Concurrency #Programming #ParallelComputing #HPC #Threading #HighPerformanceComputing #ParallelProgramming #GPGPU
99% of developers think this is trivial:
👉 'C = A + B'

They're wrong. And this exact assumption is why systems feel "mysteriously slow".

Inside an ARM CPU, this one line triggers a full execution pipeline:
⚡ Instruction fetch from the L1 I-cache (or worse… a miss 😬)
⚡ The control unit decodes the instruction
⚡ Registers (X1, X2) supply the operands
⚡ The ALU performs the computation (10 + 5 = 15)
⚡ On a miss, data may travel L1 → L2 → L3 → RAM
⚡ The pipeline executes multiple instructions in parallel

All of this… just to compute **15** 💭

Here's what most developers miss: your code doesn't run.
👉 Your **hardware executes it**

🚨 My take:
• Cache misses hurt more than most bad code
• Memory latency matters more than raw CPU speed in real-world systems
• Pipeline stalls are silent performance killers

📌 Most developers try to optimize code… but ignore the thing that actually slows it down.

🧠 The best engineers don't just write code… they think like the CPU.

📊 I built a visual to show what ACTUALLY happens from code → result 👇

What's one thing about CPU performance that changed how you write code?

🔁 Follow me for deep dives into how code really runs on hardware.

#EmbeddedSystems #ComputerArchitecture #Performance #ARM #LowLevel
Quantifying the Cost of the User–Kernel Boundary: A 215x Study in Latency 🐧

In systems programming, we often treat system calls as "slightly slower" function calls. To challenge this abstraction, I designed a micro-benchmark to measure the actual cycle tax of crossing from user mode (Ring 3) to kernel mode (Ring 0).

The result: a 215x latency multiplier between a user-space instruction and a getpid() syscall.

To ensure the results weren't skewed, I applied two critical constraints:
1. Memory barriers & volatility: used the volatile qualifier to prevent the compiler's optimizer from pruning the loop during dead-code elimination.
2. Monotonic high-resolution timing: leveraged clock_gettime(CLOCK_MONOTONIC) to avoid non-linearities caused by NTP updates or time-of-day jumps.

Why? The 215x isn't just code execution; it is the cost of the hardware trap:
1. Register save: the CPU must save the user register state (GPRs) to the kernel stack.
2. Privilege transition: the processor switches from user mode (Ring 3) to kernel mode (Ring 0).
3. Dispatch: the kernel looks up the syscall vector, verifies permissions, and executes the routine before restoring the user-space context.

In high-performance systems engineering, minimizing the "syscall tax" is critical. Whether optimizing Kubernetes pod lifecycles or building low-latency APIs, understanding the boundary between user-land and the kernel is what separates "code that works" from "code that scales".

#SystemsProgramming #LinuxInternals #PerformanceEngineering #Kernel #LowLatency #SoftwareArchitecture
Day 7: Your CPU is a gambler. And when it loses, you pay.

This one goes below the language, straight into hardware.

Modern CPUs don't wait. While executing your current instruction, they're already fetching and decoding the next several ones. When they hit an 𝒊𝒇 statement, they don't stop and think; they guess which branch you'll take and speculatively execute it ahead of time.

Guess right? Free performance. The work was already done.
Guess wrong? The CPU throws away everything it speculatively executed, rewinds, and starts over. That's called a branch misprediction, and in a tight loop running millions of iterations, it adds up fast.

Here's the classic demonstration: process a vector of random numbers, filtering values above a threshold. Unsorted, the branch is unpredictable; the CPU guesses wrong constantly. Sort the data first, and the pattern becomes predictable: all the failing values, then all the passing ones. The CPU locks in, stops mispredicting, and flies.

Same data. Same logic. Potentially dramatic difference in runtime.

🧠 Key insight: this isn't a C++ problem, it's a hardware problem. The language doesn't abstract it away. Writing performance-critical code means thinking about what your CPU is doing, not just what your compiler is doing.

Worth knowing:
· C++20 gives you [[𝒍𝒊𝒌𝒆𝒍𝒚]] and [[𝒖𝒏𝒍𝒊𝒌𝒆𝒍𝒚]] attributes to hint the compiler about branch probability.
· Branchless techniques using arithmetic instead of conditionals are sometimes the right answer in hot paths.
· Profile first. Sorting your data just to help branch prediction is a tradeoff, not a free lunch.

The best C++ programmers know where the language ends and the machine begins.

Day 7 of my C++ deep-dive series. Missed Day 6? Go check out the 𝒔𝒕𝒅::𝒐𝒑𝒕𝒊𝒐𝒏𝒂𝒍 breakdown.

Have you ever caught a branch misprediction bottleneck in a profiler? What did the fix look like? 👇

#cpp #cplusplus #performance #hardware #softwareengineering
Debugging High CPU in Production

Ever faced a system where CPU is at 100%… but you don't know why? I've learned that debugging high CPU is not about tools; it's about having a structured approach. Here's a framework to follow:

🔹 1. First question: where is the CPU going?
Before touching any tool:
* Is it user CPU (%user) → application logic?
* Or system CPU (%system) → kernel, I/O, drivers?
👉 Why this matters:
* High user CPU → inefficient code / loops
* High system CPU → deeper issue (I/O, kernel path, interrupts)
👉 Tools: top, htop

🔹 2. Identify the real consumer, not just the process
Don't stop at the process level.
👉 Go deeper:
* Which thread is consuming CPU?
* Is it doing useful work?
👉 Tools:
* top -H
* ps -L -p <pid>
👉 Insight: high CPU at the thread level often exposes:
* tight loops
* stuck workers
* imbalance in workload

🔹 3. Find the hot path
Most debugging fails because people don't reach here.
👉 Tools:
* perf top
* perf record + perf report
👉 What you're looking for:
* Which function is dominating CPU?
* Is it your code, a kernel path, or a library?
👉 This answers: "Where exactly is CPU being burned?"

🔹 4. Check for "fake work"
High CPU ≠ useful work. Common real causes:
* Busy waiting (spin loops)
* Lock contention
* Retry loops due to failures
* Inefficient polling
👉 Signals:
* Threads active but no output progress
* High CPU + low throughput

🔹 5. Trace behavior, not just functions
Sometimes the issue is not compute; it's a behavior pattern.
👉 Tools:
* strace -p <pid>
👉 Look for:
* Repeated syscalls
* Fast loops (read/write/poll)
* Unexpected retries

🔹 6. Correlate with system context
Now step back and ask:
* Did the workload change?
* Is I/O latency causing retries?
* Is there a downstream bottleneck?
* Are threads competing for a shared resource?
👉 This is where the root cause emerges.

💡 Key insight: in system software, high CPU is often a symptom, not the problem. It usually points to:
* contention
* retry storms
* poor synchronization
* or design inefficiencies

🔚 Final thought:
➡ Don't ask "Why is CPU high?"
➡ Instead ask "What useless work is the system doing repeatedly?"

That question almost always leads to the real root cause. Understanding why the system is busy is far more important than just reducing CPU usage. High CPU isn't always a performance issue; it's often a design problem in disguise.
"Mastering Compiler Attributes in Embedded C"

This presentation explores how compiler attributes enable the precise control required for efficient firmware development. It walks through key concepts from the sources, such as:
• Memory management: techniques for using section to place data in specific Flash or EEPROM regions, and aligned to ensure data integrity.
• Performance and optimization: how to use always_inline for timing-critical GPIO paths and optimization levels to balance speed and code size.
• Firmware architecture: utilizing weak linking for HAL customization and packed structures for precise protocol parsing.
• Advanced control: practical patterns for combining attributes to manage complex tasks like firmware patching.

#learningbytutorials #learning #embeddedprogramming #embeddedsystems #cprogramming