Day 24/30 ML Challenge: Bahdanau Attention

Standard encoder-decoder architectures suffer from catastrophic amnesia: they force the encoder to compress the entire input sequence into a single fixed-size vector. To solve this, I engineered an attention bridge that dynamically computes alignment scores, letting the decoder "look back" at specific encoder hidden states at every single generation step.

Core mechanics:
1. Architecture: a bidirectional GRU encoder paired with an attention-driven unidirectional GRU decoder (scoring function sketched below).
2. Tokenization: strict character-level mapping to prevent the vocabulary explosion inherent to mathematical domains.
3. Evaluation: Exact Match Accuracy (EMA). Character Error Rate is useless in calculus; a single hallucinated token invalidates the entire equation.
4. Data pipeline: a deterministic synthetic generator built on SymPy that constructs abstract syntax trees and exact ground-truth targets.

The architecture works, but the mathematical engine is too slow to scale. Full explanation, math, and Python code in the repository.

Repo: https://lnkd.in/gj-pd8dg

#MachineLearning #PyTorch #DeepLearning #ArtificialIntelligence #SequenceModeling #Engineering #DataScience #AI
Bahdanau Attention for Sequence Modeling with PyTorch
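For readers skimming the feed, here is a minimal sketch of the additive (Bahdanau) scoring function the post describes. It assumes the bidirectional encoder's outputs have already been projected down to the decoder's hidden size; the repository's actual implementation may differ.

```python
import torch
import torch.nn as nn

class BahdanauAttention(nn.Module):
    """Additive attention: score(s, h) = v^T tanh(W_s s + W_h h)."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.W_s = nn.Linear(hidden_dim, hidden_dim, bias=False)  # projects decoder state
        self.W_h = nn.Linear(hidden_dim, hidden_dim, bias=False)  # projects encoder states
        self.v = nn.Linear(hidden_dim, 1, bias=False)

    def forward(self, dec_state, enc_states):
        # dec_state: (batch, hidden); enc_states: (batch, src_len, hidden)
        energy = torch.tanh(self.W_s(dec_state).unsqueeze(1) + self.W_h(enc_states))
        scores = self.v(energy).squeeze(-1)          # (batch, src_len)
        weights = torch.softmax(scores, dim=-1)      # alignment distribution over source
        context = torch.bmm(weights.unsqueeze(1), enc_states).squeeze(1)
        return context, weights                      # context feeds the next GRU decoder step
```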
Day 25/30: Transformer Cross-Attention Bridge

Problem: PGN-to-FEN translation, mapping a 1D sequence of events (chess move history) into a static 2D spatial snapshot (board layout).

Architecture implemented:
1. Zero recurrence: replaced the RNN's O(n) sequential bottleneck with parallel matrix multiplication (O(1) sequential steps per layer) using Scaled Dot-Product Attention (sketched below).
2. Positional encoding: injected sine/cosine frequencies directly into the token embeddings to encode each token's position in time geometrically.
3. Multi-head bridge: the decoder (space) fires a Query (Q) at the encoder's historical Keys (K), computing a dot-product alignment score to extract the corresponding Value (V) payload of each move.
4. VRAM optimization: aggressively cropped the context window and combined boolean padding masks with causal masking to prevent future-state leakage during autoregressive training.

Genuinely one of the most interesting problems I have solved; I thank God for letting me experience the beauty of this model and learn it. I'm gonna call this the QKV Gambit 😆. The math is just beautiful: the model doesn't "memorize" boards; it learns to simulate piece trajectories across an attention matrix.

Explanation, math, and PyTorch implementation in the repository.

Repo: https://lnkd.in/g8swe4yC

#MachineLearning #DeepLearning #Transformers #AI #Python #DataScience #SoftwareEngineering #30DayChallenge
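For reference on point 1, here is a minimal sketch of scaled dot-product attention with a causal mask. Shapes and the function signature are illustrative assumptions, not the repo's exact code.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, causal=False):
    # q: (..., tgt_len, d_k); k, v: (..., src_len, d_k)
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    if causal:
        tgt_len, src_len = scores.shape[-2:]
        future = torch.triu(
            torch.ones(tgt_len, src_len, dtype=torch.bool, device=scores.device),
            diagonal=1,
        )
        scores = scores.masked_fill(future, float("-inf"))  # hide future positions
    return torch.softmax(scores, dim=-1) @ v
```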
🔔 Exciting Updates on "KKT-HardNet" 🔔

KKT-HardNet, a general physics-constrained ML tool that combines data and domain knowledge for scientific machine learning (SciML), is now officially available as a Python package. We've significantly upgraded the framework with CUDA support and a more modular problem-construction pipeline, making it faster and easier to use than before.

Key improvements:
- 📈 Improved prediction accuracy and inference time
- ⏱️ Faster and more efficient training on both CPU and GPU
- 🧩 Modular design for flexible problem setup

If you've been using earlier versions, we strongly recommend switching to this optimized implementation.

📄 Paper: https://lnkd.in/gD7p7G6Z
💻 Code: https://lnkd.in/gzqrEVgf
📦 Package: https://lnkd.in/gNrDxF3t

⚙️ Install via pip:
CPU: pip install kkt-hardnet
GPU (CUDA 12): pip install "kkt-hardnet[cuda12]"

We'd love to hear your feedback. Feel free to reach out with questions or thoughts on the documentation and examples.

Bimol Nath Roy Rahul Golder Ashfaq Iftakher

#PhysicsConstrainedMachineLearning #PCML #ConstrainedLearning #MachineLearning #Optimization #DeepLearning #JAX #CUDA #Research #AI #Engineering #PSE
The current modular stack of robotic perception (separate blocks for detection, depth, and tracking) is a temporary artifact of our computational limits. Unified architectures eliminate the latency and accumulated error that come from chaining separate modules. By 2030, we won't be fusing distinct outputs; we will be querying a single, holistic scene representation that encodes geometry, semantics, and affordance simultaneously. We are moving from "pipelines" to "foundation world models." Simplifying the stack will be the greatest driver of reliability. Here is my prediction for the next decade of perception. 👇 #SpatialAI #python #3d #research
Next month, Ayoub Belhadji and I are traveling to #AISTATS 2026 to talk about our work with Youssef Marzouk, "Weighted Quantization Using MMD: From Mean Field to Mean Shift via Gradient Flows". While the paper is already up on arXiv (I'll drop a link in the comments), I just wrote a short exploratory post about my intuition for one of the algorithmic contributions, mean shift interacting particles (MSIP): how it relates to Lloyd's k-means algorithm, how to reformulate the MSIP system as a differential-algebraic equation (DAE), and why, ultimately, you should probably avoid that reformulation. If you're interested, check it out here: https://lnkd.in/e8-XTY2t. While this isn't the focus of the post, I also want to point out that the core algorithm takes about 15 lines of code. I wrote the post in MATLAB to use its DAE solvers, but it's just as easy in NumPy, Torch, Julia, JAX, or any other modern numerical software tool (honestly, it probably wouldn't be much harder in a modern statically typed compiled language, either).
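For readers who want the flavor without clicking through: classic mean shift moves each point toward a kernel-weighted local mean, which is where the k-means analogy starts. Below is a minimal NumPy sketch of that classic update; it is not the paper's MSIP system, just the textbook iteration the post's intuition builds on.

```python
import numpy as np

def mean_shift_step(particles, data, bandwidth):
    """One classic mean-shift update: move each particle to the Gaussian
    kernel-weighted mean of the data around it."""
    diffs = data[None, :, :] - particles[:, None, :]             # (m, n, d)
    w = np.exp(-0.5 * (diffs ** 2).sum(axis=-1) / bandwidth**2)  # (m, n) kernel weights
    return (w[:, :, None] * data[None, :, :]).sum(axis=1) / w.sum(axis=1, keepdims=True)
```

Iterating this update to a fixed point drives particles toward modes of the kernel density estimate, much like Lloyd's algorithm drives centroids to cluster means.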
Adding more AI agents doesn't make your system smarter. It mathematically breaks it.

It's called joint probability failure: if three sequential agents each have 90% reliability, the pipeline only succeeds about 0.9³ ≈ 73% of the time, so it fails roughly 30% of the time.

Enterprise architectures fix this by abandoning fragile DAGs for fault-tolerant state machines:
- Contract-based handoffs: strict Pydantic schemas prevent silent hallucination cascades (see the sketch below).
- Deterministic splits: LLMs handle the intent; Python handles the execution.
- Validation gates: cyclic graphs that loop back and self-correct errors.

Stop building fragile chains. Architect for reliability.

#GenAI #SystemDesign #LangGraph #MultiAgent #AIArchitecture
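As a concrete illustration of the contract-based handoff, here is a minimal Pydantic (v2) sketch; the schema fields are hypothetical, not from any specific system.

```python
from pydantic import BaseModel, ValidationError

class TriageResult(BaseModel):
    # Hypothetical contract: strict fields stop malformed agent output cold.
    ticket_id: str
    category: str
    urgency: int  # 1 (low) to 5 (critical)

def handoff(raw_agent_output: str) -> TriageResult:
    """Validate an upstream agent's JSON before the next agent consumes it."""
    try:
        return TriageResult.model_validate_json(raw_agent_output)
    except ValidationError:
        # Route back through a correction loop instead of propagating garbage.
        raise
```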
"Fine-tuning costs thousands." I hear this every week. It's wrong. Here's the real cost. This myth comes from traditional full fine-tuning, especially on large proprietary models like GPT-4, where costs can indeed be high due to compute and infrastructure requirements. But modern techniques have completely changed the game. LoRA (Low-Rank Adaptation) allows you to train only a tiny fraction of model parameters — often less than 0.1%. This dramatically reduces compute requirements while maintaining strong performance. Then comes QLoRA, which uses 4-bit quantization to reduce memory usage even further. This means you can fine-tune powerful models on much smaller hardware. In reality, you can fine-tune a 7B parameter model on Google Colab (free tier) with zero cost. What you actually need is simple: one GPU, a clean dataset, a few hours (4–8), and basic Python knowledge. If you are starting, try models like Mistral 7B or LLaMA 3. The real barrier is not money. It is understanding the workflow. #FineTuning #LoRA #LLM #GenerativeAI #HuggingFace #QLoRA #Python
Coding Transformer From Scratch - PyTorch Tutorial

In this tutorial you'll code a transformer from scratch in PyTorch. We cover multi-head attention using PyTorch's built-in module, feed-forward networks with ReLU/SwiGLU activations, rotary positional embeddings (RoPE), a decoder-only architecture with residual connections and RMS normalization, and the output projection to vocabulary logits. By the end you'll understand every building block of a modern GPT-style language model.

0:00 Introduction & multi-head attention
0:42 Feed-forward layer & activation functions
1:29 Skool community
1:47 Rotary positional embedding (RoPE)
3:23 Decoder layer, residual connections & normalization
5:01 Output projection & causal masking
5:53 RMS Norm & attention formula
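As a taste of one building block from the chapter list, here is a generic RMSNorm sketch in PyTorch; it is a standard implementation, not necessarily the video's exact code.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Rescale by root-mean-square; unlike LayerNorm, no mean subtraction."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learned per-channel gain

    def forward(self, x):
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)
```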
Spent some time this week reading through TEASER++ from MIT SPARK Lab. It's a 2020 point cloud registration library that solves a problem most pipelines still work around.

Point cloud registration is the task of aligning two 3D scans. Classical ICP works well when you already have a rough alignment, but it gets stuck in local minima when you don't. RANSAC handles noisier inputs by sampling, but its runtime blows up once outliers dominate the correspondence set.

TEASER++ takes a different route. Instead of estimating the full transform at once, it solves for scale, rotation, and translation as three separate sub-problems. It filters outliers by finding the largest group of correspondences that all agree with each other, a max-clique step that throws out anything inconsistent. The rotation step then uses Graduated Non-Convexity, a relaxation technique that avoids the local minima. In the paper's benchmarks, it handles correspondence sets with up to 99% outliers in sub-second time. It ships bindings for C++, Python, MATLAB, and ROS (a minimal usage sketch follows below).

What stands out to me is that it can certify its own answer as optimal. Most estimators give you a result and you trust it. TEASER++ gives you the result and the proof. That matters for any pipeline where downstream decisions depend on the registration being right, not just plausible.

Worth a look if you are working on SLAM loop closure, multi-sensor calibration, scene reconstruction, or any problem where correspondence quality is in question.

Tutorial:
Part 1: https://lnkd.in/ePswdwXZ
Part 2: https://lnkd.in/eHypqYpi
GitHub link: https://lnkd.in/eGjm4nEN

#PointClouds #SLAM #Registration
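A minimal usage sketch, reconstructed from memory of the repository's Python example; treat the parameter and attribute names as assumptions and check them against the repo before use.

```python
import numpy as np
import teaserpp_python

# src and dst are 3xN arrays of putative correspondences.
src = np.random.rand(3, 50)
dst = src + 0.005 * np.random.rand(3, 50)  # toy example: nearly clean matches

params = teaserpp_python.RobustRegistrationSolver.Params()
params.noise_bound = 0.01        # expected inlier noise level
params.estimate_scaling = False  # rigid registration: solve R, t only

solver = teaserpp_python.RobustRegistrationSolver(params)
solver.solve(src, dst)
solution = solver.getSolution()
print(solution.rotation)     # 3x3 rotation estimate
print(solution.translation)  # 3-vector translation estimate
```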
Paper 2 is live.

AnimaCore: A Multi-Project Ecosystem Architecture for Inductive AI-Assisted Software Development

If Paper 1 was about how to learn and build inductively, Paper 2 is about what happens when that protocol needs to scale across multiple independent projects without losing coherence. AnimaCore documents the architecture governing four sub-systems (PAEChaka, CodeShelter, EduSim, and MMVM) under a single Silo/Control Plane structure. It also formalizes the tri-language paradigm (Python/C++20/Rust) and the multi-model AI pipeline (Gemini CLI, Claude, Qwen) used to maintain architectural integrity across boundaries.

The short version: spec-first development + inductive reasoning at ecosystem scale. This is Paper 2 of a series. More coming as the system gets built.

https://lnkd.in/gjy-2g7c

Looking for an arXiv endorser for cs.AI: if you're an active arXiv author in this area and find this work relevant, I'd appreciate your support.

#SoftwareArchitecture #Rust #AI #Research #SystemsDesign #InductiveReasoning
What if dividing by zero wasn't an error, but information?

That's the question behind Zero Domain Algebra (ZDA), a framework I've been developing to give degenerate arithmetic operations a formal algebraic home. Instead of throwing an exception when you divide by zero, ZDA captures the event as a labeled object in a parallel domain 𝒵, preserving full provenance, enabling structural analysis, and, under controlled conditions, allowing partial resurrection back into a usable result (a toy sketch of the capture idea follows below).

The framework defines four core operations (fusion, erasure, projection, resurrection), proves key structural theorems, and comes with a complete Python prototype.

Is it a solution looking for a problem? Maybe. But the mathematical structure is sound, and the applications in error tracking, log analysis, and formal verification are starting to take shape.

Full paper (with proofs and implementation) now on Zenodo:
👉 https://lnkd.in/eD82fets

Feedback welcome, especially from people who've hit the wall of zero-related failures in production systems.

#AbstractAlgebra #FormalMethods #ErrorHandling #Observability #Mathematics
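To make the capture idea concrete, here is a toy sketch (mine, not the paper's prototype): division by zero returns a labeled object carrying provenance instead of raising.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ZeroEvent:
    """Toy stand-in for a labeled object in the parallel domain Z:
    the degenerate operation is captured with provenance, not raised."""
    numerator: float
    label: str  # provenance: where/why the degenerate op occurred

def zda_div(a: float, b: float, label: str = "div-by-zero"):
    return ZeroEvent(a, label) if b == 0 else a / b

print(zda_div(3.0, 0.0, label="sensor_7/calibration"))
# ZeroEvent(numerator=3.0, label='sensor_7/calibration')
```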