Name: Data Extraction Pipeline for 400+ Veterinarians in Vermont | Sandi Ridwan posted on the topic | LinkedIn
Uploaded: 2026-04-18T02:49:22.764Z
Duration: 1 min 19 s
Channel: Sandi Ridwan

Sandi Ridwan

Automation & Data Engineering | 4+ Years Exp. in Electrical Certification | Web Scraping & Data Extraction Specialist | Python & Selenium | Transforming Industrial Data into Scalable Solutions & Marketing Insights

🔍 Just shipped a full-stack data extraction pipeline. Here's what I learned.

A client needed contact data for 400+ veterinarians across Vermont. Simple enough, right? Wrong.

Here's what I was up against:

🧱 Target 1: vtvets.org, an Angular SPA running on the MemberClicks CMS. Static requests returned nothing but a loading shell. The real data? Hidden behind an internal service-router endpoint that can't be reached outside a live browser session.

🧱 Target 2: Google Maps, with infinite scroll, 44 city queries, and email extraction from 180+ individual clinic websites.

After 3 iterations of failed approaches (requests → JS injection → CORS block), the breakthrough:

💡 Instead of calling the API directly, let the Angular app call it, and intercept the response.

Using Playwright's page.on("response") handler, I captured all 236 records across 24 pages without a single CORS error. The browser made the requests. I just listened.

Final pipeline:
→ VVMA Directory: 236 records via API intercept
→ Google Maps: ~300 records via Playwright + stealth
→ Smart merge with phone-based dedup: 293 unique records
→ Output: Excel + CSV, clean and formatted

Key stats:
📞 91% of records with phone
📧 34% of records with email (public directories rarely expose this)
🌐 61% of records with website
⏱️ Total runtime: ~3 hours

The biggest lesson? Every scraping problem is unique. The stack that works is the one you discover after understanding WHY the obvious approach fails.

Full pipeline on GitHub 👇 [https://lnkd.in/gGhSyqjb]

#WebScraping #DataEngineering #Python #Playwright #LeadGeneration #APIReverseEngineering
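The two core ideas in the post, listening to responses instead of calling the API and merging sources by phone number, can be sketched roughly as follows. This is not the author's code: the URL fragment, the "results" wrapper key, the record field names, and the US-style phone normalization are all assumptions made for illustration. The handler is written for Playwright's sync API, where response.url is a property and response.json() is a method.

```python
import re


def make_collector(url_fragment, sink):
    """Build a handler for Playwright's page.on("response", handler).

    url_fragment is a hypothetical substring that identifies the
    directory's API calls; sink is a list that receives the records.
    """
    def handle(response):
        if url_fragment not in response.url:
            return
        try:
            payload = response.json()
        except Exception:
            return  # body was not JSON or was unavailable; skip it
        # Assumption: the payload is a list, or a dict wrapping one under "results".
        if isinstance(payload, list):
            sink.extend(payload)
        elif isinstance(payload, dict):
            sink.extend(payload.get("results", []))
    return handle


def normalize_phone(raw):
    """Reduce a phone string to 10 digits (assumes US numbers); '' if unusable."""
    digits = re.sub(r"\D", "", raw or "")
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    return digits if len(digits) == 10 else ""


def merge_by_phone(primary, secondary):
    """Merge two record lists, treating equal normalized phones as one clinic.

    Records from primary win on conflicts; later duplicates only fill
    fields that are still empty. Records without a usable phone are kept.
    """
    merged, by_phone = [], {}
    for record in list(primary) + list(secondary):
        key = normalize_phone(record.get("phone"))
        if key and key in by_phone:
            existing = by_phone[key]
            for field, value in record.items():
                if value and not existing.get(field):
                    existing[field] = value
            continue
        row = dict(record)
        merged.append(row)
        if key:
            by_phone[key] = row
    return merged
```

In a live run the handler would be registered before navigation, for example page.on("response", make_collector("/directory", records)), and the directory pages would then be clicked through while the list fills up. Keying the merge on a normalized phone number rather than on the clinic name avoids false splits caused by naming differences between the two sources.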


More Relevant Posts

Ashish Kumar
5d Edited
😅 I once changed a value in a “copied” object… and somehow the original data changed too 💥

👉 That’s when I realized… I didn’t understand shallow vs deep copy properly.

🚀 Let’s break it down (this will save you from real bugs)

🧠 Why this matters
In JavaScript, objects & arrays are reference types.
So copying them incorrectly = you might accidentally modify the original data 😬

📦 1. Shallow Copy
A shallow copy only copies top-level values.
👉 Nested objects are still shared (same reference)
So:
- Changing top-level → ✅ safe
- Changing nested → 💥 affects original too

⚠️ The common mistake
You think you created a new object… but deep inside, it’s still pointing to the same memory 😵

🔁 How to create a shallow copy
• Spread → {...obj}
• Object.assign
• Array methods → slice, Array.from

🔐 2. Deep Copy
A deep copy creates a fully independent clone.
👉 Every level is copied
👉 No shared references
So:
- Changing nested data → ✅ completely safe

🔁 How to create a deep copy
👉 structuredClone() (Recommended)
- Handles most data types
- Modern & reliable
👉 JSON.parse(JSON.stringify())
- Quick but limited
- Drops functions and undefined, and turns Dates into strings

💡 Real Dev Insight
Shallow copy is fast ⚡
Deep copy is safe 🛡️
👉 Use shallow → for simple data
👉 Use deep → for nested structures

🚀 Final Thought:
Most bugs don’t come from logic…
👉 They come from unexpected mutations
Understand copying → write safer code 💪

#JavaScript #FrontendDevelopment #WebDevelopment #CodingTips #ShallowCopy #DeepCopy #LearnJavaScript #BuildInPublic #100DaysOfCode #LearnInPublic
North Park

1 follower
1mo
Claude doesn't see your web page.

We read all 512,000 lines of the Claude Code source that leaked yesterday. Most of the coverage focused on the Tamagotchi pet and the undercover mode. I was looking for something else: how does Claude actually handle citations?

The biggest thing we found: when Claude fetches your URL, it doesn't pass your content to the main model. It sends it through Haiku (Anthropic's small model) first, which summarizes it. Only the summary reaches Claude. Your page has to survive lossy compression by a less capable model before it can influence an answer.

The exception is ~96 domains, all developer docs, that bypass the summarization layer entirely and get their full content passed through raw. react.dev, docs.python.org, kubernetes.io... there's a hardcoded whitelist in the source code.

There's also a silent domain blocklist, a hardcoded cap of 8 searches per query, a 125-character quote limit, and a mandatory citation instruction appended to every single search result.

We wrote up all 9 findings with the actual code excerpts and file paths. https://lnkd.in/e9KfXhX8

Claude Code Source Leak: Web Search, Citations & GEO Signals Revealed northparkgroup.co
Jeremy Meindl
1mo Edited
Claude doesn't see your web page.

I read all 512,000 lines of the Claude Code source that leaked yesterday. Most of the coverage focused on the Tamagotchi pet and the undercover mode. I was looking for something else: how does Claude actually handle citations?

The biggest thing I found: when Claude fetches your URL, it doesn't pass your content to the main model. It sends it through Haiku (Anthropic's small model) first, which summarizes it. Only the summary reaches Claude. Your page has to survive lossy compression by a less capable model before it can influence an answer.

The exception is ~96 domains, all developer docs, that bypass the summarization layer entirely and get their full content passed through raw. react.dev, docs.python.org, kubernetes.io... there's a hardcoded whitelist in the source code.

There's also a silent domain blocklist, a hardcoded cap of 8 searches per query, a 125-character quote limit, and a mandatory citation instruction appended to every single search result.

I wrote up all 9 findings with the actual code excerpts and file paths. https://lnkd.in/e3q9ZQv6?

Claude Code Source Leak: Web Search, Citations & GEO Signals Revealed northparkgroup.co

David 🏴󠁧󠁢󠁳󠁣󠁴󠁿 Young
1mo
Ruby can process chest X-rays. Sobel edge detection, histogram analysis, region segmentation, all running in milliseconds through native C filters that ruby-libgd already ships.

That's one of six reads this week, all connected by the same idea: Rails developers owning more of their stack than they used to.

Here's what caught my eye:

🩻 ruby-libgd ships native C filters for Sobel, Gaussian blur, and Laplacian. Operations that take 90 seconds in pure Ruby run in milliseconds. Ruby handles the orchestration, C does the heavy lifting. Clinically meaningful results on real chest X-rays.

🚀 A complete Rails 8 deployment guide: from ordering a Hetzner server to production with Kamal, the Solid stack on SQLite, Litestream backups, and Netdata monitoring. No Postgres, no Redis, no PaaS. One server, zero external services.

📊 A full analytics dashboard built with Ahoy and pure SVG charts in under 600 lines. No Google Analytics, no JS charting library, no cookie banners. The author used Claude Code with Ariadna for structured AI pair programming and the dashboard is public.

📱 A Hotwire Native Bridge Component triggers native iOS and Android calendar APIs directly from a Rails view. One Stimulus controller, one native implementation per platform. Clean pattern for any native API integration.

🤖 Five changes to make a Rails monolith AI-friendly: namespace-level .context.md files, sub-5-second test files with Guard, gradual Sorbet adoption, and service objects under 100 lines with keyword arguments.

🖼️ I wrote up how I generate OG images in Rails: Ferrum screenshots an HTML template at 1200x630, Active Storage puts it on Cloudflare R2, no Node.js or external service involved. One after_create_commit and it's done.

The common thread: the tools for Rails teams to own their full stack keep getting better, and more people are choosing to use them.

Full issue with actionable notes: https://lnkd.in/ePjKd6Uu

What's the last thing you moved in-house that you used to outsource to a third-party service?

#RubyOnRails #Kamal #Ruby #AIdev #HotwireNative

Reading Roundup: Owning more of your stack with Rails dcyoungdev.substack.com
Atharv Khare
4w Edited
Stop Googling "JSON Formatter" and hoping they aren't logging your data.

Most online dev tools are bloated, ad-ridden, or worst of all, send your sensitive inputs to a backend server. I got tired of it, so I built DevLoft: a collection of 19 essential utilities built purely with Vanilla JS.

No React. No Webpack. No Node modules. Just index.html, style.css, and a bunch of scripts.

Why did I build it this way?
- Zero Latency: It loads faster than a framework can even initialize.
- True Privacy: Since there’s no backend, your inputs are never sent to a server.
- Low Barrier to Entry: Want to add a tool? You don't need to learn a framework. Just write a function and open a PR.

The Toolkit includes:
Data Science: Z-Score outliers, Haversine distance, and Log Parsers.
Security: PII Redaction and XSS Sanitizers.
AI/LLM: Recursive text chunkers (RAG prep) and Token cost estimators.
Classic Dev: Regex testers, SQL Schema generators, and Text diffs.

This is an open-source "sandbox" for all of us. If you’ve ever written a quick script to solve a repetitive task, don't let it die in your Gists. Add it to DevLoft and let the community use it.

Explore the tools (link in comments) and feel free to contribute on GitHub. I’m looking for contributors to help optimize the CSS and add more niche data-science utilities.

What’s the one script you’re currently running locally that should be a UI tool?

#OpenSource #VanillaJS #BuildInPublic #DataScience #WebDev #SoftwareEngineering
Bhoge Yuvakumar
3w
🗺️ Web Scraping Roadmap (Beginner → Advanced)

If you’re planning to learn Web Scraping, here’s a clear roadmap to guide your journey step by step 👇

🔰 1. Foundation (Basics)
• Understand HTTP (request/response)
• Learn HTML, tags, DOM structure
• Identify Static vs Dynamic websites

🧩 2. Static Web Scraping
• Use Requests to fetch HTML
• Parse using BeautifulSoup
• Learn CSS selectors & data extraction
• Store data in CSV / Excel

⚙️ 3. Dynamic Web Scraping
• Learn browser automation
• Tools: Playwright / Selenium
• Handle clicks, typing, scrolling
• Use waits for loading elements

🧠 4. Data Extraction Mastery
• XPath & advanced selectors
• Handle lists, nested elements
• Extract clean and structured data

⚠️ 5. Real-World Challenges
• Infinite scroll
• Timeout errors
• CAPTCHA & blocking
• Login-required websites

🚀 6. Advanced Techniques
• API scraping (Network → XHR/Fetch)
• Use Pandas for data cleaning
• Build pipelines (CSV → Scrape → Store)

📦 7. Storage & Scaling
• Store in CSV / Database
• Handle large-scale scraping
• Optimize performance

💡 Pro Tip: Start simple (static scraping), then move to dynamic websites. Strong fundamentals make advanced scraping much easier.

🎯 End Goal: Build automated systems that collect, clean, and store data efficiently from any website.

#WebScraping #Python #Automation #Playwright #Selenium #DataEngineering #LearningPath #TechRoadmap
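Stage 2 of the roadmap (parse static HTML, extract fields, store as CSV) can be illustrated with a small sketch. To stay dependency-free it swaps in the standard library's html.parser and csv modules in place of the Requests and BeautifulSoup tools the post names, and it parses an HTML string rather than fetching a page. The class and function names are hypothetical.

```python
import csv
import io
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect the text and href of every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the anchor currently open, if any
        self._text = []     # text fragments seen inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = "".join(self._text).strip()
            self.links.append({"text": text, "href": self._href})
            self._href = None


def links_to_csv(html):
    """Parse an HTML string and return its links as CSV text."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["text", "href"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(parser.links)
    return buffer.getvalue()


if __name__ == "__main__":
    sample = '<ul><li><a href="/a">Clinic A</a></li><li><a href="/b">Clinic B</a></li></ul>'
    print(links_to_csv(sample))
```

The same shape carries over to the later stages: only the source of the HTML changes (a rendered page from a browser automation tool instead of a static response), while extraction and storage stay the same.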
Shafi M.
2w Edited
“set (revalidate: 60) and move on”

This works in Next.js, until it doesn’t.

Now you’re:
• serving stale data for up to 60s
• re-fetching even when nothing changed
• adding load for no reason

This is where caching stops being an optimization and becomes about who owns freshness.

At a high level:
Static rendering → speed
Dynamic rendering → freshness
Caching + revalidation → where we balance the two

But this abstraction starts to break down at scale.

👉 Time-based revalidation (revalidate) periodically refetches data
This works well when data becomes stale at predictable intervals
Think dashboards, blogs, or analytics snapshots

👉 On-demand revalidation (revalidateTag, revalidatePath) flips the model
Don't blindly revalidate; react to change

In one of our systems, moving from time-based to event-driven invalidation:
• reduced redundant fetches significantly
• made cache behavior predictable under load
This becomes the default once writes are frequent.

👉 Full Route Cache vs Data Cache
• Full Route Cache → caches the rendered output
• Data Cache → caches the underlying fetch calls
That separation is powerful:
Don't rebuild the entire page; refresh just the data

🧠 My takeaway
Stop thinking in time → start thinking in events
Instead of → “revalidate every X seconds”
Think → “what event should make this data stale?”

❓ Interested to hear how this plays out in write-heavy or multi-region setups.

#NextJS #Caching #ReactJS #WebDevelopment #FullStack #JavaScript #SoftwareEngineering #SystemDesign #FrontendDevelopment
Krishna Rohilla
1w Edited
🚀 I just shipped something I'm genuinely proud of: Text2SQL Studio.

The idea is simple: what if anyone (analyst, manager, founder) could query a database just by asking a question in plain English? No SQL expertise. No bottlenecks. Just answers.

---

🔍 Here's what makes it special:
✅ ReAct Agent Loop: the system *reasons* about your schema before writing a single line of SQL
✅ Self-Healing Queries: if a query fails, it doesn't just crash. It automatically passes the error to GPT-4, which diagnoses the issue and returns a corrected query. Zero manual intervention.
✅ Visual Schema Explorer: interactive graph to explore tables, columns & relationships in real-time
✅ Upload .sqlite or .csv files: system auto-converts them into relational schemas
✅ Streaming Responses: watch the agent *think* step by step, live
✅ Enterprise-grade security: read-only execution, Firebase Auth, MongoDB + GridFS persistence

---

The self-healing part was honestly the hardest to build, and the most satisfying when it worked. When a query breaks, the error context gets fed back to GPT-4, which figures out *why* it broke and fixes it on the fly. Like having a senior engineer reviewing every query in real time.

---

🛠️ Stack: React 18 · FastAPI · Tailwind CSS · Firebase · MongoDB · Docker · Prometheus
🌐 Live here → https://lnkd.in/e3VkQr7B
📹 Demo video attached 👇

If you've ever wished your data could just *talk back to you*, this is it. Would love your feedback! Drop a comment or DM me 🙌

#AI #LLM #Text2SQL #NaturalLanguageProcessing #FullStackDevelopment #OpenAI #FastAPI #React #MachineLearning #SideProject #BuildInPublic #DataEngineering #Python #WebDevelopment
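The self-healing loop described in the post (run the query, feed any error back, retry with the corrected SQL) might look something like the sketch below. It is an illustration under assumptions, not the project's code: the repair callable is a hypothetical stand-in for the model call, the retry limit is arbitrary, and SQLite's query_only pragma is used here as one way to get read-only execution.

```python
import sqlite3


def run_with_repair(conn, sql, repair, max_attempts=3):
    """Execute sql; on a database error, ask repair(sql, error) for a new query.

    Returns (rows, attempts), where attempts lists each failed query with
    its error message. Raises RuntimeError once max_attempts is used up.
    """
    attempts = []
    for _ in range(max_attempts):
        try:
            return conn.execute(sql).fetchall(), attempts
        except sqlite3.Error as exc:
            attempts.append((sql, str(exc)))
            sql = repair(sql, str(exc))
    raise RuntimeError(
        f"query still failing after {max_attempts} attempts: {attempts[-1][1]}"
    )


def typo_fixer(sql, error):
    """Hypothetical repair step; a real system would send sql and error to a model."""
    return sql.replace("totl", "total")


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
    conn.executemany("INSERT INTO orders (total) VALUES (?)", [(10.0,), (20.0,)])
    conn.commit()
    conn.execute("PRAGMA query_only = ON")  # refuse writes from here on
    rows, attempts = run_with_repair(conn, "SELECT SUM(totl) FROM orders", typo_fixer)
    print(rows)      # result of the repaired query
    print(attempts)  # the failed query and its error message
```

Capping the number of attempts and returning the failure history keeps the loop bounded and makes each repair auditable, which matters when the repair step is a model whose output cannot be assumed correct.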

Abhishek Rohatgi
1w
⚡ Stop writing useEffect for data fetching in React.

We've all written this at least once:

    useEffect(() => {
      setLoading(true);
      fetch('/api/users')
        .then(res => res.json())
        .then(data => setData(data))
        .catch(err => setError(err))
        .finally(() => setLoading(false));
    }, []);

It works. But you're manually handling things React Query does for free.

Here's what your useEffect approach is missing:
❌ No caching: refetches on every single mount
❌ No background refetching
❌ No automatic retries on failure
❌ No deduplication of requests
❌ Boilerplate loading/error state every time

✅ The fix? React Query.

    const { data, isLoading, error } = useQuery({
      queryKey: ['users'],
      queryFn: () => fetch('/api/users').then(r => r.json())
    });

That's it. 4 lines replace 15.

What you get out of the box:
✅ Smart caching: won't refetch if data is fresh
✅ Background sync: keeps data up to date silently
✅ Auto retry on network failure
✅ Built-in loading & error states
✅ Request deduplication

useEffect is great, but it was never meant to be a data fetching tool. Work smarter, not harder. 🚀

#React #ReactQuery #JavaScript #Frontend #WebDevelopment #Performance #Programming
Shivam Bhilare
1w
Turning Messy Data into Visual Strategy 📈

Handling raw sales data is time-consuming and complicated. I built ExGrowth to bridge that gap: a full-stack web app that transforms raw CSV/Excel files into instant, visual intelligence.

What ExGrowth handles:
Real-time KPIs: Revenue, Profit, and Average Order Value (AOV).
Deep Segmentation: Performance by Product Category, Region, and Top Customers.
Trend Tracking: Monthly charts for sales and customer acquisition.
Secure Management: File history with interactive previews and cloud deletion.

The Technical Backbone:
I prioritized a modular architecture using the Service Layer Pattern to keep logic decoupled from the UI.
Stack: Python, Django 6.0.4, Pandas, and NumPy.
Fuzzy Matching: Integrated RapidFuzz to intelligently map inconsistent column headers.
Infrastructure: Supabase (Postgres & Storage) with local fallback for resilience.

The Evolution & Process:
This project is a complete V1 overhaul of a Flask prototype I built in my first year. I migrated to Django to implement JSON caching for near-instant dashboard loads. To move faster, I used AI as a "technical partner" to troubleshoot edge cases while I maintained control over the architecture.

Bonus Skill:
This was also my first time editing a software demo video! It was a great challenge to align the visual flow with the app's logic.

It’s not perfect, but it was a massive deep dive into building cloud-native data pipelines. I'd love your feedback!

🔗 GitHub: https://lnkd.in/dAtZu2MV
🔗 Profile: https://lnkd.in/d-vXp97X

#Python #Django #DataScience #FullStack #ExGrowth #BuildInPublic #SoftwareEngineering #WebDevelopment #Supabase
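Mapping inconsistent column headers onto a fixed schema and then computing the headline KPIs could be sketched as below. This is not ExGrowth's code: it swaps in the standard library's difflib for RapidFuzz so it runs without extra packages, and the canonical column names, the 0.6 similarity cutoff, and the definition of AOV as revenue divided by the number of distinct orders are assumptions.

```python
import re
from difflib import get_close_matches

# Hypothetical canonical schema for an uploaded sales file.
CANONICAL = ["order_id", "revenue", "profit", "region", "category"]


def normalize(header):
    """Lowercase a header and collapse punctuation and spaces to underscores."""
    return re.sub(r"[^a-z0-9]+", "_", header.strip().lower()).strip("_")


def map_headers(headers, canonical=CANONICAL, cutoff=0.6):
    """Map each raw header to its closest canonical name, or None if no match."""
    mapping = {}
    for header in headers:
        match = get_close_matches(normalize(header), canonical, n=1, cutoff=cutoff)
        mapping[header] = match[0] if match else None
    return mapping


def kpis(rows):
    """Compute revenue, profit, and average order value from canonical rows."""
    revenue = sum(row["revenue"] for row in rows)
    profit = sum(row["profit"] for row in rows)
    orders = {row["order_id"] for row in rows}
    aov = revenue / len(orders) if orders else 0.0
    return {"revenue": revenue, "profit": profit, "aov": aov}


if __name__ == "__main__":
    print(map_headers(["Order ID", "Revenu", " PROFIT ", "Regoin", "Notes"]))
    print(kpis([
        {"order_id": "1", "revenue": 100.0, "profit": 30.0},
        {"order_id": "1", "revenue": 50.0, "profit": 10.0},
        {"order_id": "2", "revenue": 50.0, "profit": 20.0},
    ]))
```

Returning None for unmatched headers, rather than forcing the nearest name, leaves room for the application to ask the user about columns it cannot place with confidence.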

