Sandi Ridwan

Automation & Data Engineering | 4+ Years Exp. in Electrical Certification | Web Scraping & Data Extraction Specialist | Python & Selenium | Transforming Industrial Data into Scalable Solutions & Marketing Insights

🔍 Just shipped a full-stack data extraction pipeline. Here's what I learned.

A client needed contact data for 400+ veterinarians across Vermont. Simple enough, right? Wrong.

Here's what I was up against:

🧱 Target 1: vtvets.org, an Angular SPA running on MemberClicks CMS. Static requests returned nothing but a loading shell. The real data? Hidden behind an internal service-router endpoint that can't be reached outside a live browser session.

🧱 Target 2: Google Maps, with infinite scroll, 44 city queries, and email extraction from 180+ individual clinic websites.

After 3 iterations of failed approaches (requests → JS inject → cross-origin CORS block), the breakthrough:

💡 Instead of calling the API directly, let the Angular app call it, and intercept the response.

Using Playwright's page.on("response") handler, I captured all 236 records across 24 pages without a single CORS error. The browser made the requests. I just listened.

Final pipeline:
→ VVMA Directory: 236 records via API intercept
→ Google Maps: ~300 records via Playwright + stealth
→ Smart merge with phone-based dedup: 293 unique records
→ Output: Excel + CSV, clean and formatted

Key stats:
📞 91% records with phone
📧 34% records with email (public directories rarely expose this)
🌐 61% records with website
⏱️ Total runtime: ~3 hours

The biggest lesson? Every scraping problem is unique. The stack that works is the one you discover after understanding WHY the obvious approach fails.

Full pipeline on GitHub 👇 [https://lnkd.in/gGhSyqjb]

#WebScraping #DataEngineering #Python #Playwright #LeadGeneration #APIReverseEngineering
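For anyone curious what the intercept idea looks like in practice, here is a minimal sketch. It is not the code from the repo. The function names, the "service-router" URL substring used as a filter, and the payload shape (a JSON list of records, or an object with a "results" list) are all assumptions for illustration. Pagination through the directory pages is left out.

```python
from functools import partial


def capture_records(response, url_marker, sink):
    """Append records from a matching JSON response to `sink`.

    `response` only needs `.url`, `.ok` and `.json()`, which Playwright's
    Response object provides.
    """
    # Skip everything except successful responses from the endpoint of interest.
    if url_marker not in response.url or not response.ok:
        return
    try:
        payload = response.json()
    except Exception:
        return  # body was not JSON; nothing to collect
    # ASSUMPTION: the payload is a list of records, or a dict holding
    # the list under a "results" key. The real shape may differ.
    rows = payload.get("results", []) if isinstance(payload, dict) else payload
    if not isinstance(rows, list):
        return
    sink.extend(row for row in rows if isinstance(row, dict))


def scrape_directory(directory_url, url_marker="service-router"):
    """Open the page in a real browser and listen to the app's own API calls."""
    # Imported here so the helper above can be used without Playwright installed.
    from playwright.sync_api import sync_playwright

    records = []
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        # Register the listener before navigating so no response is missed.
        page.on("response", partial(capture_records, url_marker=url_marker, sink=records))
        page.goto(directory_url, wait_until="networkidle")
        browser.close()
    return records
```

The handler never issues a request itself, so the browser's same-origin rules are not involved: the page's own code fetches the data and the script only reads what comes back.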
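The phone-based dedup step can be sketched in a similar way. This is a rough illustration under stated assumptions, not the actual merge logic from the repo: records are plain dicts with a "phone" field, numbers are US 10-digit numbers, the first source wins on conflicts, and records without a usable phone are kept as-is. The names normalize_phone and merge_by_phone are made up for this example.

```python
import re


def normalize_phone(raw):
    """Reduce a phone string to 10 digits, or return None if that is not possible."""
    digits = re.sub(r"\D", "", str(raw or ""))
    # Drop a leading US country code, e.g. "+1 802-555-0100".
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    return digits if len(digits) == 10 else None


def merge_by_phone(primary, secondary):
    """Merge two record lists, treating equal normalized phones as one record.

    Fields from `primary` take precedence; `secondary` only fills blanks.
    """
    merged = {}   # normalized phone -> merged record (insertion ordered)
    unkeyed = []  # records with no usable phone cannot be matched, so keep them
    for record in list(primary) + list(secondary):
        key = normalize_phone(record.get("phone"))
        if key is None:
            unkeyed.append(dict(record))
        elif key not in merged:
            merged[key] = dict(record)
        else:
            # Same clinic seen again: fill in only the fields still missing.
            for field, value in record.items():
                if value and not merged[key].get(field):
                    merged[key][field] = value
    return list(merged.values()) + unkeyed
```

Example usage with made-up records:

```python
directory = [{"name": "Clinic A", "phone": "(802) 555-0100", "email": ""}]
maps = [{"name": "Clinic A Vet", "phone": "+1 802-555-0100", "email": "a@example.test"}]
combined = merge_by_phone(directory, maps)  # one record, name from directory, email from maps
```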
