I’ve been building and managing data systems at Amazon for the last 8 years. Now that AI is everywhere, the way we work as data engineers is changing fast. Here are 5 real ways I (and many in the industry) use LLMs to work smarter every day as a Senior Data Engineer:

1. Code Review and Refactoring: LLMs help break down complex pull requests into simple summaries, making it easier to review changes across big codebases. They can also identify anti-patterns in PySpark, SQL, and Airflow code, helping you catch bugs or risky logic before they land in prod. If you’re refactoring old code, LLMs can point out where your abstractions are weak or naming is inconsistent, so your codebase stays cleaner as it grows.

2. Debugging Data Pipelines: When Spark jobs fail or SQL breaks in production, LLMs help translate ugly error logs into plain English. They can suggest troubleshooting steps or highlight which part of the pipeline to inspect next, helping you zero in on root causes faster. If you’re stuck on a recurring error, LLMs can propose code-level changes or optimizations you might have missed.

3. Documentation and Knowledge Sharing: Turning notebooks, scripts, or undocumented DAGs into clear internal docs is much easier with LLMs. They can help structure your explanations, highlight the “why” behind key design choices, and make onboarding or handover notes quick to produce. Keeping platform wikis and technical documentation up to date becomes much less of a chore.

4. Data Modeling and Architecture Decisions: When you’re designing schemas, deciding on partitioning, or picking between technologies (like Delta, Iceberg, or Hudi), LLMs can offer quick pros and cons, highlight trade-offs, and provide code samples. If you need to visualize a pipeline or architecture, LLMs can help you draft Mermaid or PlantUML diagrams for clearer communication with stakeholders.

5. Cross-Team Communication: When collaborating with PMs, analytics, or infra teams, LLMs help you draft clear, focused updates, whether it’s a Slack message, an email, or a JIRA comment. They’re useful for summarizing complex issues, outlining next steps, or translating technical decisions into language that business partners understand.

LLMs won’t replace data engineers, but they’re rapidly raising the bar for what you can deliver each week. Start by picking one recurring pain point in your workflow, then see how an LLM can speed it up. This is the new table stakes for staying sharp as a data engineer.
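The debugging workflow in point 2 is easy to operationalize. As a minimal sketch (the helper name, prompt wording, and log trimming strategy are my own, not from the post), trimming a noisy stack trace before sending it to a model keeps the prompt focused on the root-cause exception, which usually appears near the end of the log:

```python
def build_debug_prompt(error_log: str, max_lines: int = 40) -> str:
    """Trim a noisy Spark/Airflow stack trace and wrap it in a focused prompt."""
    lines = error_log.strip().splitlines()
    # Keep only the tail of the log: the root-cause exception usually appears last.
    tail = lines[-max_lines:]
    return (
        "You are helping debug a production data pipeline.\n"
        "Explain the root cause of this error in plain English, then list\n"
        "the pipeline components to inspect next:\n\n" + "\n".join(tail)
    )

prompt = build_debug_prompt(
    "java.net.ConnectException: Connection refused\n"
    "Py4JJavaError: An error occurred while calling o123.save"
)
```

The returned string would then be sent to whatever LLM endpoint your team uses; the point is to curate the context rather than paste raw logs.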
Using LLMs to Strengthen Data Strategy
Summary
Using large language models (LLMs) to strengthen data strategy means applying advanced AI tools that understand and generate human language to help organizations manage, organize, and make sense of their data for improved decision-making and efficiency. LLMs can quickly translate business questions into data queries, support smart data integration, and help create clear data documentation, making data strategy more responsive and scalable.
- Streamline data access: Use LLMs to turn natural language questions into accurate database queries, so everyone in your company can get the answers they need without waiting for analysts.
- Clarify documentation: Let LLMs automatically create and update readable technical documents and internal guides, saving time and making onboarding easier for new team members.
- Improve data integration: Apply LLMs to identify patterns, connect scattered information, and suggest ways to organize your company’s data in line with business goals and key definitions.
RIP Tableau

Tableau is a business intelligence tool owned by Salesforce. For years it was part of how we worked at Voi. In the beginning it felt powerful, but over time it turned into what many legacy SaaS tools become: expensive, clunky and slow. Every ad hoc request ended up in an analyst backlog. Local teams across our 100 plus cities were left waiting for insights, costs kept going up and speed disappeared.

So we ripped it out, saving at least 500k EUR, potentially millions (from speed). The direct savings are hundreds of thousands of euros in licenses. The indirect savings are even bigger since analysts can now focus on high impact work instead of repetitive reporting. The biggest shift is speed. What once took weeks now happens in seconds.

Here is how we made it possible:

1. We fixed the foundations. Years of work on data governance. Every metric has an owner, quality checks, semantics and definitions. Everyone in the company knows what a number means. With that in place, self serve became possible, which is essential when local teams in 100 plus cities need the right data at the right time.

2. We defined what we need, not what we paid for. A single source of truth, real time data streaming and self serve for non technical users. Analysts no longer spend their days on small one off requests.

3. We used LLMs as the bridge. Together with a design partner we built a UI that supports continuous business intelligence, and we created an AI data analyst that lives inside Slack and Sheets. LLMs translate natural language into SQL, query the warehouse and return insights or visuals in natural language again. This step is what unlocked true self serve at scale.

But LLMs alone are not enough. In an enterprise setting you need strict guidelines and guardrails. Without governance you risk inconsistent answers, wrong definitions or even compliance issues. The combination of solid data governance with the power of LLMs is what makes this work.

The results are clear:

1. Millions saved on SaaS and labor
2. One source of truth for all key metrics
3. Self serve for everyone in the company within clear constraints
4. Up to 100x faster time to insight and decision making

LLMs made this shift possible. Strong governance made it safe. RIP Tableau. And it will not be the last legacy SaaS tool we replace.
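The guardrails in step 3 can start as a simple validation layer between the LLM and the warehouse. A minimal sketch (the table names and the allowlist approach are illustrative, not Voi’s actual implementation): only read-only queries that reference governed tables are allowed through.

```python
import re

# Governed tables with owners, quality checks, and agreed definitions (illustrative names).
ALLOWED_TABLES = {"rides", "cities", "daily_metrics"}

def is_safe_query(sql: str) -> bool:
    """Guardrail for LLM-generated SQL: read-only queries on governed tables only."""
    stmt = sql.strip().rstrip(";")
    if not stmt.lower().startswith("select"):
        return False  # reject INSERT/UPDATE/DELETE/DDL outright
    referenced = {
        m.lower()
        for m in re.findall(r"\b(?:from|join)\s+([A-Za-z_][A-Za-z0-9_]*)", stmt, re.IGNORECASE)
    }
    return bool(referenced) and referenced <= ALLOWED_TABLES
```

A real deployment would parse the SQL with a proper parser and enforce row-level permissions in the warehouse; the point is that the LLM’s output is never trusted blindly.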
-
AI runs on data. If you want to train your AI, you need to give it the right data. This is why the priority of any organisation’s AI strategy must be to first get its data strategy in order.

🔵 The Failure of Centralisation: But this is no small task. For decades, we’ve tried to solve the data problem with centralised solutions: first data warehouses, then data lakes, and later data lakehouses. These approaches share a common flaw: they require putting all your data into one central store. It’s time to acknowledge that this method has had its chance. If centralisation were the answer, the problem would have been solved by now.

🔵 The Real Issue is Human, Not Technical: The problem isn’t technical, it’s human. A centralised store is typically managed by a single team. Expecting that team to handle all the complexity of an organisation’s data landscape is unrealistic. The sheer scale and variety of data make centralisation an impossible task.

🔵 The Shift to Data Products: Enter the concept of data products. This idea has gained momentum because it flips the script. Instead of placing the burden of cleaning, linking, and organising data on the consumer, we shift it to the data publisher, the one creating and maintaining the data. This is what I call inverting the cost of data integration. The responsibility for making data usable now falls on the creator, not the end user.

🔵 The Urgency Created by AI: Since the rise of large language models (LLMs), this shift is no longer just a good idea; it’s essential. LLMs have dramatically raised the stakes. Organisations need to invert the cost of data integration and start building data products immediately, so they can inject relevant data into the context of the foundational models and begin building much higher levels of automated intelligence into their operations.

🔵 Semantic Data Products: But here’s the catch: those data products now need to be smarter, clearer, and better defined than ever before. They must have consistent, precise semantics so they can work seamlessly with LLMs. To achieve this, data products need to be built around the concepts and language used every day in your business. They must reflect your organisation’s ontological core, the key ideas and terms that drive your operations. Without a solid semantic framework, LLMs are left to guess the meaning of your data, and when AI guesses, it can go wrong fast, turning potential insights into embarrassing mistakes.

⭕ Ontological Core: https://lnkd.in/e9HZbkFY
⭕ Product Vector Search: https://lnkd.in/et3DTN2w
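One lightweight way to encode such a semantic core is to make metric definitions first-class, machine-readable objects that a data product ships alongside its data, so an LLM resolves business terms against agreed definitions instead of guessing. A hypothetical sketch (the field names, metric, and `resolve` helper are my own illustration, not a prescribed standard):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MetricDefinition:
    """The semantic contract a data product publishes with its data."""
    name: str
    definition: str       # the business meaning, in the organisation's own terms
    owner: str            # the accountable publishing team
    unit: str
    synonyms: tuple = ()  # everyday names an LLM should resolve to this metric

def resolve(term, metrics):
    """Map a business term, as a user or LLM might phrase it, to its governed metric."""
    t = term.strip().lower()
    for m in metrics:
        if t == m.name.lower() or t in (s.lower() for s in m.synonyms):
            return m
    return None

CATALOG = [
    MetricDefinition(
        name="net_revenue",
        definition="Gross revenue minus refunds and discounts",
        owner="finance-data",
        unit="EUR",
        synonyms=("revenue", "net sales"),
    ),
]
```

Feeding these definitions into the model’s context is one way to ground its answers in the organisation’s ontological core.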
-
If you want an example of how AI empowers data scientists, consider this one: a new study shows how we can use LLMs to harness unstructured data and the knowledge embedded in their pre-training to drive significant variance reduction in experiments (think CUPED, but for multimodal data).

LLMs lack guarantees on accuracy, and naively using them to predict counterfactuals isn’t statistically valid. Few-shot LLMs rely on a small set of demo examples, which makes predictions variable and correlated across observations, breaking the independence assumption required for valid causal analysis.

The study shows how to use LLMs to reduce variance in a principled way:
• Calibration - give more weight to subpopulations where the LLM predictions align with observed outcomes.
• Resampling-based aggregation - average across many random demo sets to neutralize variability.
• Three-way sample splitting - separate the data used for examples, prediction, and estimation to preserve independence.

The framework is conceptually similar to double machine learning (DML), blending causal inference and machine learning.

So how do we get this kind of unstructured data? Surveys are one way to go. Ping us at Causara if you’d like to chat about marketing experimentation.
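The resampling-based aggregation bullet is simple to illustrate. In this toy sketch (my own illustration, not the paper’s code), `few_shot_predict` stands in for a real few-shot LLM call whose output shifts with the particular demo examples sampled; averaging over many random demo sets damps that demo-set noise:

```python
import random
import statistics

def few_shot_predict(demo_set, x):
    # Stand-in for a few-shot LLM call: the prediction wobbles depending
    # on which demonstration examples happened to be sampled.
    noise = random.Random(hash((tuple(demo_set), x))).gauss(0, 1.0)
    return x + noise

def aggregated_prediction(pool, x, n_resamples=200, k=4):
    """Resampling-based aggregation: average predictions over many random
    demo sets to neutralize the variability of any single choice of demos."""
    rng = random.Random(0)
    preds = [few_shot_predict(rng.sample(pool, k), x) for _ in range(n_resamples)]
    return statistics.mean(preds)
```

A single call to `few_shot_predict` is off by a unit-variance noise term; the aggregated estimate shrinks that noise by roughly the square root of the number of resamples, which is the intuition the paper formalizes.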
-
Leading large language models (LLMs) are trained on public data. However, the majority of the world’s data is “dark data” that is not publicly accessible, mainly private organizational data or enterprise data.

The authors show that the performance of LLM-based methods seriously degrades when tested on real-world enterprise datasets. Current benchmarks, based on public data, overestimate the performance of LLMs. They release a new benchmark dataset, the Goby Benchmark, to advance discovery in enterprise data integration.

Based on their experience with this enterprise benchmark, the authors propose techniques to lift the performance of LLMs on enterprise data, including: (1) hierarchical annotation, (2) runtime class-learning, and (3) ontology synthesis. They show that, once these techniques are deployed, performance on enterprise data becomes on par with that on public data.

SOURCE: https://lnkd.in/gBC53NPb