Email File Management

Explore top LinkedIn content from expert professionals.

  • View profile for Maxime Seligman

    Senior Salesforce Architect - 5X Salesforce Certified

    8,118 followers

    Duplicate Leads in Salesforce? It’s not just messy — it’s dangerous!

    As a Salesforce Architect, one of the most underestimated pain points I see across orgs is poor duplicate management. It silently:
    ⚠️ Breaks automation
    📉 Skews reporting
    ❌ Slows down sales
    🛑 Violates GDPR rules

    Last week, I worked on optimizing a lead flow where 80,000+ leads were sitting unvalidated — many of them potential contacts already in the system. Here’s how we tackled it:

    📌 Step 1: Smart Duplicate Check
    → We built a flow that compares incoming Leads to Contacts & Leads using fuzzy logic (Email, Phone, Name, etc.)

    📌 Step 2: Decision Branch
    → If a duplicate is found, we flag it or merge it automatically (using Apex + native Merge tools).
    → If not, we convert the Lead cleanly to a Contact, ensuring no clutter.

    📌 Step 3: Automation with Guardrails
    → All of this runs inside a scalable Salesforce Flow — enriched with Apex where needed — and leaves a full audit trail.

    💡 Architecture isn’t just about building — it’s about protecting your data layer. If you’re still relying on name-only matching or manual checks, you’re setting your CRM up for failure.

    Let’s talk if you want a duplicate management framework that scales 👇

    #Salesforce #CRMStrategy #DuplicateCheck #SalesforceFlow #Architect #RevOps #DataIntegrity
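
A flavor of the Step 1 fuzzy check can be sketched in plain Python with `difflib`. This is an illustrative sketch only: the field names, the exact-email shortcut, and the 0.85 threshold are my assumptions, not the author's actual Flow/Apex logic.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Ratio in [0, 1]; 1.0 means identical after normalization."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def is_probable_duplicate(lead: dict, contact: dict, threshold: float = 0.85) -> bool:
    # An exact email match is the strongest duplicate signal.
    if lead["email"].lower().strip() == contact["email"].lower().strip():
        return True
    # Otherwise require both name and phone to be close (hypothetical rule).
    return (similarity(lead["name"], contact["name"]) >= threshold
            and similarity(lead["phone"], contact["phone"]) >= threshold)

lead = {"name": "Jhon Doe", "email": "john@gmal.com", "phone": "555-0100"}
contact = {"name": "John Doe", "email": "john@gmail.com", "phone": "555-0100"}
print(is_probable_duplicate(lead, contact))  # True
```

In a real org this logic would live in Flow/Apex against Lead and Contact records; the sketch only shows why combining several fields beats name-only matching.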

  • View profile for Sanjeev Sharma

    Data Architect & Sr Data Engineer with expertise in data modernisation, Azure Data Factory, Databricks (PySpark), Microsoft Fabric, Synapse, Snowflake, Python, data migration and AI-enabled data enrichment.

    3,251 followers

    Data Engineer Interview Scenario!

    You are tasked with deduplicating a massive dataset where processing the entire dataset at once is not feasible due to memory constraints. You need to handle it in chunks, ensuring deduplication based on specific columns while keeping one record per duplicate set. How would you approach this in PySpark?

    Key Interview Flow

    1. Initial Question: Partitioning Logic
    - How would you process such a huge dataset without loading all the data into memory at once?
    - Expected Answer: Repartition the dataset by a logical key to split it into manageable chunks.

    2. Code Discussion: Partition-Based Deduplication
    Present this code to the candidate and ask them to explain each step:

    ```python
    chunked_df = df.repartition("key")  # Step 1: Repartition by key

    def process_partition(iterator):
        partition_data = list(iterator)  # Step 2: Collect data in the current chunk
        if not partition_data:
            return  # Skip empty chunks

        # Convert the chunk into a DataFrame
        # NOTE: on a real cluster this fails -- `spark` and `df` are not
        # available inside executor-side functions; a strong candidate
        # should spot this and flag it.
        partition_df = spark.createDataFrame(partition_data, schema=df.schema)

        # Deduplicate within the chunk using a Window function
        window_spec = Window.partitionBy("key").orderBy(col("date").desc())
        deduplicated = (partition_df
                        .withColumn("row_number", row_number().over(window_spec))
                        .filter(col("row_number") == 1)
                        .drop("row_number"))

        # Simulate saving results
        for row in deduplicated.collect():
            print(row)  # Replace with save logic

    chunked_df.foreachPartition(process_partition)  # Step 3: Apply logic to each partition
    ```

    Follow-Up Questions

    Partitioning Challenges
    1. Why is repartitioning critical in this approach?
       - Expected Answer: Repartitioning ensures that all duplicate records for a given key land in the same partition, enabling efficient local deduplication.
    2. What could go wrong if some keys have significantly more data than others?
       - Hint: Lead the discussion toward data skew and its impact on memory.
    3. How would you handle a situation where a single partition exceeds memory limits?
       - Expected Answer: Use salting techniques, further chunking, or write intermediate results to disk.

    Deduplication Logic
    1. What is the role of the `Window` function here?
       - Expected Answer: It assigns a `row_number` to each duplicate set based on the ordering criteria (`date` in this case).
    2. Can `dropDuplicates()` replace the `Window` function here? Why or why not?
       - Expected Answer: No, because `dropDuplicates()` does not allow control over which duplicate to keep.

    Performance/Optimization
    1. How does `foreachPartition` improve performance compared to `mapPartitions`?
       - Expected Answer: `foreachPartition` is designed for side effects (like saving to a database) and avoids unnecessary data collection.
    2. What optimizations can you suggest for this approach?
       - Repartitioning based on expected skew.
       - Caching intermediate results.
       - Using efficient file formats like Parquet.
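
The keep-latest-per-key idea behind the Spark snippet can be exercised in plain Python. This is a minimal stand-in, not the post's code: `sorted` plays the role of `repartition("key")`, each `groupby` group plays the role of a partition, and `max` over the date column plays the role of the window plus `row_number() == 1` filter (field names are illustrative).

```python
from itertools import groupby
from operator import itemgetter

def dedupe_chunked(rows, key="email", order="date"):
    """Deduplicate rows group by group, keeping the latest record per key.

    Sorting by key mirrors Spark's repartition-by-key guarantee that
    all duplicates for a key land together.
    """
    rows = sorted(rows, key=itemgetter(key))          # stand-in for repartition("key")
    result = []
    for _, group in groupby(rows, key=itemgetter(key)):  # one "partition" per key
        # window.partitionBy(key).orderBy(date desc) + row_number == 1
        result.append(max(group, key=itemgetter(order)))
    return result

rows = [
    {"email": "a@x.com", "date": "2025-01-01", "name": "A v1"},
    {"email": "a@x.com", "date": "2025-02-01", "name": "A v2"},
    {"email": "b@x.com", "date": "2025-01-15", "name": "B"},
]
print(dedupe_chunked(rows))  # keeps "A v2" and "B"
```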

  • View profile for Pooja Pawar, PhD

    Data Analyst | Business Intelligence & Data Visualization | Data Insights & Practical Learning | Top 127 Global Data Science Creators (Favikon)

    19,302 followers

    SQL Interview Series – Day 12

    Task: Write a SQL query to identify email addresses that appear more than once in the customers table.

    This type of question is commonly asked in interviews at companies like PwC, KPMG, and Infosys, especially when the role involves data quality audits, reporting, or data migration tasks. The focus here is on identifying duplicates — an essential skill in data cleaning and preprocessing workflows.

    How to frame it:
    Start by grouping the table by the email column. Then apply the COUNT(*) function to count how many times each email appears in the dataset. To find duplicates, use a HAVING clause to return only those email groups where the count is greater than one.

    This logic helps detect data integrity issues such as multiple records with the same email due to failed validations or duplicate imports.

    Concepts used:

    1. GROUP BY Clause: Groups records by email so that aggregate functions can count how many times each unique email appears.

    2. COUNT(*) Function: Counts the total number of records for each grouped email value. If an email appears more than once, its count will be greater than one.

    3. Aliasing Aggregates: The result of COUNT(*) is aliased as occurrences for better readability and downstream use in reporting or debugging queries.

    4. HAVING Clause to Filter Aggregates: Since WHERE cannot filter on aggregated values, the HAVING clause is applied to the grouped data. It returns only those emails with more than one record.

    5. Data Quality Relevance: Identifying duplicates is critical in ETL pipelines, CRM data syncs, and compliance checks. Interviewers expect you to write efficient queries that surface these issues clearly.

    6. Advanced Follow-Up Strategies: Once duplicates are identified, interviewers may ask how to remove or resolve them. You can suggest using ROW_NUMBER() to isolate the latest record, or DISTINCT to retain unique values based on business rules.

    7. Practical Application: This pattern is useful in fraud detection, contact deduplication, lead cleanup, or preparing customer data for machine learning models.

    Why this is asked:
    This question tests your understanding of SQL’s grouping and filtering logic, and how to detect and report anomalies in data. Clean data is the foundation of every analytics project, and the ability to identify duplicates is a must-have skill for any analyst or engineer. Interviewers look for candidates who can think practically and solve messy data problems with precision.

    #SQL #SQLInterview #PwCInterview #DataCleaning #HAVINGClause #DataAnalytics #SQLQuery #LearnSQL #DataQuality #BusinessIntelligence #InterviewPreparation #DataEngineering #AnalyticsJobs #SQLTips
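
The GROUP BY / COUNT(*) / HAVING pattern described above can be run end to end against an in-memory SQLite table (the sample rows here are made up for illustration; table and column names follow the post):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, email TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "a@x.com"), (2, "b@x.com"), (3, "a@x.com"), (4, "c@x.com"), (5, "a@x.com")],
)

# Group by email, count occurrences, keep only groups seen more than once.
rows = conn.execute("""
    SELECT email, COUNT(*) AS occurrences
    FROM customers
    GROUP BY email
    HAVING COUNT(*) > 1
""").fetchall()

print(rows)  # [('a@x.com', 3)]
```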

  • View profile for Gowtham SB

    Sr Data Engineer | PayPal | YouTuber

    87,568 followers

    💡 PySpark Interview Problem – #day12

    🎯 Topic: Detecting Near-Duplicate Records (Fuzzy Logic)

    🚀 Question: You’re given a PySpark DataFrame containing user registration data. Due to human entry or OCR errors, names and emails might have minor typos. Your task:
    ✅ Detect potential duplicate users based on similar name and email patterns (e.g., John Doe vs Jhon Doe, john@gmail.com vs john@gmal.com)
    ✅ Flag pairs where the Levenshtein distance between names or emails is below a certain threshold.

    📦 Sample Input (user_data):

    +---------+--------------+--------------------+
    | user_id | full_name    | email              |
    +---------+--------------+--------------------+
    | U001    | John Doe     | john@gmail.com     |
    | U002    | Jhon Doe     | john@gmal.com      |
    | U003    | Alice Wonder | alice@xyz.com      |
    | U004    | Ailce Wonder | alice@xyz.com      |
    | U005    | Mark Smith   | mark.smith@abc.com |
    +---------+--------------+--------------------+

    🎯 Expected Output:

    +--------+--------+---------------+----------------+
    | user_1 | user_2 | name_distance | email_distance |
    +--------+--------+---------------+----------------+
    | U001   | U002   | 2             | 1              |
    | U003   | U004   | 2             | 0              |
    +--------+--------+---------------+----------------+

    ✅ PySpark Code Solution:

    ```python
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, levenshtein

    spark = SparkSession.builder.appName("FuzzyDuplicateDetection").getOrCreate()

    data = [
        ("U001", "John Doe", "john@gmail.com"),
        ("U002", "Jhon Doe", "john@gmal.com"),
        ("U003", "Alice Wonder", "alice@xyz.com"),
        ("U004", "Ailce Wonder", "alice@xyz.com"),
        ("U005", "Mark Smith", "mark.smith@abc.com"),
    ]
    df = spark.createDataFrame(data, ["user_id", "full_name", "email"])

    # Self-join to compare each pair; a.user_id < b.user_id avoids
    # self-matches and duplicate (A, B)/(B, A) comparisons
    df1 = df.alias("a")
    df2 = df.alias("b")
    joined = df1.join(df2, col("a.user_id") < col("b.user_id"))

    # Calculate Levenshtein distances and keep near-duplicate pairs
    result = joined.select(
        col("a.user_id").alias("user_1"),
        col("b.user_id").alias("user_2"),
        levenshtein(col("a.full_name"), col("b.full_name")).alias("name_distance"),
        levenshtein(col("a.email"), col("b.email")).alias("email_distance"),
    ).filter((col("name_distance") <= 2) | (col("email_distance") <= 2))

    result.show()
    ```

    🧠 Why this is different and useful:
    • Applies fuzzy matching with levenshtein() — rarely used in basic PySpark problems
    • Used in data deduplication, identity matching, master data cleanup
    • Real-world use case in CRM systems, KYC, user onboarding platforms

    🏢 Companies Asked: Razorpay, LinkedIn, Gojek

    Socials
    🤝 Connect for 1:1 - https://lnkd.in/ggMJgs_k
    📖 My Books - https://lnkd.in/ggMJgs_k
    🎥 YouTube - https://lnkd.in/db-XNeP9
    🎥 YouTube - https://lnkd.in/gc-8rdjM
    📸 Instagram - https://lnkd.in/gwH84mRW
    📸 Instagram - https://lnkd.in/gccKJZek
    💼 LinkedIn - Gowtham SB

    #PySpark #BigData #DataCleaning #FuzzyMatching #Levenshtein #DataDeduplication #ApacheSpark #DataEngineering #100DaysOfPySpark #RealWorldUseCase
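
The expected distances are easy to sanity-check with a scratch edit-distance implementation in plain Python (standard dynamic programming, not part of the post's code). Note that a transposition like "John" → "Jhon" costs 2 in plain Levenshtein, while dropping one character, as in "gmail" → "gmal", costs 1:

```python
def levenshtein(a: str, b: str) -> int:
    """Plain Levenshtein distance via the classic two-row DP."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

print(levenshtein("John Doe", "Jhon Doe"))             # 2 (transposition = 2 edits)
print(levenshtein("john@gmail.com", "john@gmal.com"))  # 1 (one deleted character)
print(levenshtein("Alice Wonder", "Ailce Wonder"))     # 2
```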

  • View profile for Okan YILDIZ

    Global Cybersecurity Leader | Innovating for Secure Digital Futures | Trusted Advisor in Cyber Resilience

    83,917 followers

    🔍 Level Up Your Email Investigations with Email OSINT Tools! 📧✨

    Ever struggled to track down email addresses or verify their authenticity during your investigations? This Email OSINT Cheat Sheet is a treasure trove of powerful tools and tricks to streamline your email-based recon and investigations.

    Here’s a snapshot of some standout tools:
    🚩 The Harvester • Scrape emails and usernames from multiple sources (Google, LinkedIn, Twitter).
    🔎 SimplyEmail • Effortlessly gather emails linked to domains with rapid enumeration.
    🎯 Holehe • Quickly verify if an email is registered on platforms like Twitter or Instagram.
    📌 Quidam & Mailcat • Investigate usernames and identify connected email addresses.
    🛠️ MOSINT • Automated email OSINT framework to uncover social media profiles and breaches.
    ✨ Bonus Tools: Zen, Yopmail & more!

    Email OSINT is critical for:
    • Threat hunting and cybersecurity investigations.
    • Digital forensics and incident response.
    • OSINT enthusiasts and analysts.

    👉 Pro tip: Combine multiple tools for comprehensive recon and cross-verification.

    Which email OSINT tool do you prefer in your toolkit? Drop your recommendations below!

    #OSINT #EmailOSINT #CyberSecurity #DigitalForensics #ThreatIntelligence #InfoSec #Recon #SecurityTools #Investigation #BlueTeam #ThreatHunting

  • View profile for vinesh diddi

    DataEngineer| Bigdata Engineer| Data Analyst|Bigdata Developer|Works at callaway golf| Hdfs| Hive|Mysql|Shellscripting|Python|scala|DSA|Pyspark|Scala Spark|SparkSQl|Aws|Aws s3|Aws Lambda| Aws Glue|Aws Redshift |AWsEmr

    5,152 followers

    PySpark scenario-based interview questions & answers:

    1) Deduplicate and normalize messy user data (Beginner)

    # Scenario: You receive a user signup CSV with messy names, mixed-case emails, and multiple signups per email. Keep the most recent signup per email and normalize fields.

    # Purpose: Data hygiene — prevents duplicate users and inconsistent keys that break joins and metrics.

    Question & data (sample):
    # Schema: user_id: int, full_name: string, email: string, signup_ts: string, country: string
    # Sample rows:
    (1, " john DOE ", "JOHN@EX.COM ", "2025-11-20 10:00", "US")
    (2, "John Doe", "john@ex.com", "2025-11-21 09:00", "US")
    (3, "alice", "alice@mail.com", "11/20/2025 12:00", "IN")

    # Approach: Read the CSV with a header. Trim and normalize fields (full_name → title case, email → lower case). Parse multiple timestamp formats to a timestamp. Filter obviously invalid emails (basic regex). Deduplicate by email, keeping the row with the latest signup_ts.

    # Explanation: Lowercasing emails and trimming prevents false-unique keys. Multiple to_timestamp attempts handle variable input formats. Window + row_number() deterministically selects the most recent record per email. Caveat: a basic regex filters obvious invalid addresses but is not full RFC validation.

    Karthik K.

    #PySpark #DataCleaning #ETL #DataEngineering

    code:
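
The post's code attachment is not included above. As a stand-in, here is a plain-Python sketch of the same steps (normalize, parse mixed timestamp formats, regex-filter emails, keep the latest signup per email); the PySpark version would use `to_timestamp`, `Window`, and `row_number()` as the explanation says. All names here are illustrative:

```python
import re
from datetime import datetime

TS_FORMATS = ["%Y-%m-%d %H:%M", "%m/%d/%Y %H:%M"]     # the two formats in the sample
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")  # basic check, not full RFC

def parse_ts(raw):
    """Try each known timestamp format; return None if none match."""
    for fmt in TS_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt)
        except ValueError:
            pass
    return None

def clean_signups(rows):
    latest = {}
    for user_id, name, email, ts_raw, country in rows:
        email = email.strip().lower()        # normalize the join key
        if not EMAIL_RE.match(email):
            continue                         # drop obviously invalid emails
        ts = parse_ts(ts_raw)
        if ts is None:
            continue                         # drop unparseable timestamps
        record = (user_id, name.strip().title(), email, ts, country)
        if email not in latest or ts > latest[email][3]:
            latest[email] = record           # keep most recent signup per email
    return list(latest.values())

rows = [
    (1, " john DOE ", "JOHN@EX.COM ", "2025-11-20 10:00", "US"),
    (2, "John Doe", "john@ex.com", "2025-11-21 09:00", "US"),
    (3, "alice", "alice@mail.com", "11/20/2025 12:00", "IN"),
]
for rec in clean_signups(rows):
    print(rec)  # user 2 wins for john@ex.com; alice is kept as-is
```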

  • View profile for Sudhanshu Tiwari

    Data Scientist | Ex - Internshala | Data Analytics | Python | SQL | ML | Gen AI | Azure

    12,414 followers

    SQL interview question: How to identify and delete duplicates (with code)

    Handling duplicate records in SQL is a common task, especially when dealing with raw or legacy datasets, and interviewers love to ask about it. Here are 3 reliable methods to identify and delete duplicates using SQL:

    1. Using ROW_NUMBER() (best for complex duplicate conditions)

    ```sql
    WITH CTE AS (
        SELECT *,
               ROW_NUMBER() OVER (PARTITION BY name, email ORDER BY id) AS rn
        FROM users
    )
    DELETE FROM users
    WHERE id IN (SELECT id FROM CTE WHERE rn > 1);
    ```

    ✅ Why this works: It keeps the first occurrence (based on id) and removes the rest. Super handy when deduplication depends on multiple columns.

    2. Using GROUP BY with MIN() or MAX()

    ```sql
    DELETE FROM users
    WHERE id NOT IN (
        SELECT MIN(id)
        FROM users
        GROUP BY name, email
    );
    ```

    ✅ Why this works: Suits simple datasets with clear duplicate keys. It just keeps the record with the smallest id.

    3. Using SELF JOIN (MySQL syntax)

    ```sql
    DELETE u1 FROM users u1
    JOIN users u2
      ON u1.name = u2.name AND u1.email = u2.email
    WHERE u1.id > u2.id;
    ```

    ✅ Why this works: No CTE required; straightforward and readable.

    How would you answer this? Comment below!
    ------------------------------------------------------------------
    #SQL #interviewquestions
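
Method 1 can be exercised end to end on SQLite (3.25+ for window functions; the sample data here is made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?, ?)", [
    (1, "Ann", "ann@x.com"),
    (2, "Ann", "ann@x.com"),   # duplicate of id 1
    (3, "Bob", "bob@x.com"),
    (4, "Ann", "ann@x.com"),   # another duplicate
])

# ROW_NUMBER() per (name, email) group; delete everything but the first row.
conn.execute("""
    WITH CTE AS (
        SELECT id,
               ROW_NUMBER() OVER (PARTITION BY name, email ORDER BY id) AS rn
        FROM users
    )
    DELETE FROM users WHERE id IN (SELECT id FROM CTE WHERE rn > 1)
""")

print(conn.execute("SELECT id, name FROM users ORDER BY id").fetchall())
# [(1, 'Ann'), (3, 'Bob')]
```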

  • View profile for Varun Sagar Theegala

    Senior Consultant - Product Analytics @ Eli Lilly | Building Scalable HCP, Patient & Real-World Analytics Products | Master’s in Data Science (AI/ML) @ Deakin University (25-27) | DataBricks | AI | SQL-Python-Dashboards

    9,012 followers

    Every advanced SQL interview question — YoY growth, top-N per group, running totals, percent contribution — has the same answer: window functions.

    Here are the 5 patterns that cover 90% of real analytics work — no theory, just the queries you'll actually use.

    Pattern 1: Period-over-period comparison
    • revenue - LAG(revenue) OVER (ORDER BY month)
    • Month-over-month, week-over-week, YoY — one line. No self-joins.

    Pattern 2: Top-N per group
    • ROW_NUMBER() OVER (PARTITION BY region ORDER BY sales DESC)
    • Filter WHERE rn <= 3 in a CTE. Top 3 products per region, top 5 customers per segment — same pattern every time.

    Pattern 3: Running totals & moving averages
    • SUM(revenue) OVER (ORDER BY date ROWS UNBOUNDED PRECEDING)
    • AVG(revenue) OVER (ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW)
    • Cumulative revenue and a 7-day rolling average. No temp tables, no loops.

    Pattern 4: Percentage of total
    • revenue * 100.0 / SUM(revenue) OVER ()
    • Each row's share of overall revenue. OVER() with empty parentheses = the entire result set as one partition.

    Pattern 5: Deduplication
    • ROW_NUMBER() OVER (PARTITION BY email ORDER BY updated_at DESC)
    • Keep rn = 1. The cleanest way to deduplicate without DELETE or DISTINCT ON.

    Three things to remember:
    → Window functions run after WHERE and GROUP BY — you can't filter on them directly; wrap the query in a CTE.
    → LAST_VALUE with the default frame only sees up to the current row — always set ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING.
    → Every major platform supports these — BigQuery, Snowflake, Postgres, Databricks, Redshift.

    If you're writing self-joins or correlated subqueries for any of the above, you're writing 5x the SQL you need to. Learn these 5 patterns. They'll cover most of what analytics actually asks for.

    #SQL #DataAnalytics #WindowFunctions #DataEngineering #Analytics
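
Patterns 1 and 3 can be tried directly in SQLite (3.25+ supports these window functions; the revenue figures below are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (month TEXT, revenue INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("2025-01", 100), ("2025-02", 150), ("2025-03", 120)])

# LAG gives the previous month's revenue (NULL on the first row);
# SUM ... ROWS UNBOUNDED PRECEDING gives the cumulative total.
rows = conn.execute("""
    SELECT month,
           revenue - LAG(revenue) OVER (ORDER BY month) AS mom_change,
           SUM(revenue) OVER (ORDER BY month ROWS UNBOUNDED PRECEDING) AS running_total
    FROM sales
    ORDER BY month
""").fetchall()

for row in rows:
    print(row)
# ('2025-01', None, 100)
# ('2025-02', 50, 250)
# ('2025-03', -30, 370)
```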

  • View profile for Pranjali Awasthi

    18, ceo/co-founder @ slashy (yc s25)

    15,249 followers

    I EA'd for a Fortune 500 CEO for a week. Here's what I learned: email isn't communication. It's a task list pretending to be communication.

    What a morning looked like:
    9:47am - "Can you send me your availability?" Opens calendar. Checks timezones. Types out three slots. 8 minutes.
    10:23am - "None of those work, what about next week?" Does it again. 6 minutes.
    11:15am - "Remember that budget conversation?" Searches 400 emails. Tries five different keywords. 12 minutes.

    Four hours of my day doing tasks that emails contained. The actual reading? Five minutes total.

    Every tool treats AI like a feature: autocomplete, summaries, drafts. Nobody recognizes that email is just tasks with extra steps.

    By day three, I kept thinking in commands that don't exist:
    "Just give them my availability"
    "Handle this scheduling thread"
    "Find that conversation"
    "Write the polite no"

    Not "help me write faster." Just do it. The same way developers use command lines instead of clicking menus.

    We have AI that can pass the bar exam. And I spent four hours manually typing out calendar availability. What if you could type /give-availability and it just worked? What if /schedule-this-meeting handled the entire back-and-forth?

    I spent 20+ hours this week manually doing things a computer should handle. It felt absurd.

    Email isn't broken because we read too much or write too slowly. Email is broken because it's a to-do list that makes you do everything manually. And nobody's fixing the actual problem.

    Working on this. DM me if you think email is secretly just tasks.

  • View profile for Dennis Hoffman

    📬 Direct Mail Fundraising Ops | Lockbox, Caging & Donor Data for Nonprofits | 🏆 4x Inc. 5000 CEO | 👨👨👦👦 3 great kids & 1 patient husband

    12,319 followers

    4.37 billion pieces of mail were undeliverable last year. That’s the part no one talks about.

    We worry about creative. We argue about channels. We obsess about timing. But most campaigns lose money long before any of that matters. We lose it in the data.

    I learned this early in my career. I was running an agency, obsessed with creative. I lived and breathed it. I really believed great copy could fix anything. Then I saw a client’s donor file for the first time.

    > Wrong addresses…
    > Duplicates…
    > People who moved years ago…
    > People who passed away…
    > Records coded so poorly nothing matched…

    It didn’t matter how good the letter was. The data broke it before it ever had a chance.

    That moment changed the course of my career. It’s the reason I eventually got into the lockbox business. Because if the data isn’t right, nothing else works.

    And here’s the part that still keeps me up: even with better tools and better hygiene than we had in the 90s, we still wasted more than $1.3 billion on undeliverable mail last year.

    This isn’t about postage. It’s about trust. It’s about connection. It’s about whether your donor ever even sees what you sent.

    Good data isn’t back-office work. It’s stewardship.

    You want higher response rates? Cleaner matchbacks? Better retention? Fix the inputs. Because the best creative in the world can’t overcome a bad address.
