After cleaning and preparing the dataset, today I made the chatbot talk to a real database. No CSV reading anymore. No in-memory DataFrame queries. The data is now stored in SQLite and accessed using real SQL.

What I did today:
• Established a connection between Python and SQLite
• Converted the cleaned Pandas DataFrame into a SQL table using to_sql()
• Designed the table structure directly from the dataset
• Ensured the data is permanently stored and queryable
• Closed the connection properly to avoid database locks

The system architecture now looks like this:
User Question → Rule Logic → SQL Query → SQLite Database → Answer

This is where the project stops being a script… and starts becoming a real data system.

Why this step matters: AI systems don't answer from files; they answer from structured, queryable data sources. The chatbot can now answer questions directly from the database, not from Python memory.

Next step: use if / elif logic to map user questions to SQL queries so the chatbot answers real questions from the database. Screenshots from the Jupyter Notebook will be shared in the final project.

#Python #SQL #SQLite #DataEngineering #AI #MachineLearning
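For readers following along, here is a minimal sketch of the step described above. It assumes a cleaned DataFrame named df and a database file named chatbot.db; the table and column names are placeholders, not the project's actual schema.

import sqlite3
import pandas as pd

# Stand-in for the cleaned DataFrame from the earlier preparation step.
df = pd.DataFrame({
    "question_topic": ["pricing", "delivery", "returns"],
    "answer": ["See the price list.", "3-5 business days.", "30-day window."],
})

# Open (or create) the SQLite database file.
conn = sqlite3.connect("chatbot.db")

# Write the DataFrame to a SQL table; the table structure is derived
# from the DataFrame's columns and dtypes. Replace the table if it exists.
df.to_sql("faq", conn, if_exists="replace", index=False)

# The data is now permanently stored and queryable with real SQL.
print(pd.read_sql_query("SELECT * FROM faq WHERE question_topic = 'delivery'", conn))

# Close the connection to avoid database locks.
conn.close()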
Connecting Python to SQLite Database for AI Systems
More Relevant Posts
We need to start caring about data packaging again. I migrated Rahu's Python AST from a pointer-heavy recursive structure to an arena-backed one, and it improved both analysis and lookup much more than I expected.

Rahu is a Python language server I'm building from scratch in Go. The old AST used separate structs, pointers, and slices to model recursive trees. That made it easy to work with, but it also meant many small allocations, pointer chasing, and poor cache locality in hot paths.

The new AST is stored as a flat arena: compact nodes in a contiguous slice, stable NodeIDs, sibling-linked children, and side tables for names, strings, and numbers. A good example is attribute access. In the old AST, obj.field was an Attribute node pointing to both the base expression and a separate Name node. In the new one, it's just a NodeAttribute plus child IDs into the same array. Traversal involves indexed access instead of following heap pointers.

The result:
AnalysisSmall: ~84 µs → ~55 µs
AnalysisMedium: ~183 µs → ~117 µs
AnalysisLarge: ~2.15 ms → ~1.85 ms
DefinitionLookup: ~205 ns → ~30 ns
HoverLookup: ~207 ns → ~34 ns
DefinitionLookupAll: ~12.2 µs → ~1.36 µs

The geomean across the benchmark set dropped by about 45%. Some construction-heavy paths worsened slightly, which is expected: the arena model added bookkeeping and shifted work into indexing and side tables. The edit-time analysis path improved, and lookup improved significantly, which matters more for the actual LSP experience.

The main takeaway for me was simple: data layout matters. I didn't change the language features. I changed AST storage and traversal, and that had a large effect on end-to-end performance.
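For readers unfamiliar with the arena idea, a rough, language-neutral sketch in Python of the layout being described. Rahu itself is written in Go, where the contiguous-slice layout also buys cache locality; this sketch only illustrates the indices-instead-of-pointers structure, and none of the names are Rahu's actual types.

# Toy arena: every node lives in one flat list and refers to other nodes
# by integer NodeID instead of an object reference. Identifier text lives
# in a separate side table.
NODE_NAME = "Name"
NODE_ATTRIBUTE = "Attribute"

class Arena:
    def __init__(self):
        self.kinds = []       # node kind, indexed by NodeID
        self.children = []    # lists of child NodeIDs, indexed by NodeID
        self.name_refs = []   # index into self.names, or -1 if no name
        self.names = []       # side table of identifier strings

    def add(self, kind, children=(), name=None):
        node_id = len(self.kinds)
        self.kinds.append(kind)
        self.children.append(list(children))
        if name is None:
            self.name_refs.append(-1)
        else:
            self.names.append(name)
            self.name_refs.append(len(self.names) - 1)
        return node_id

# obj.field becomes two compact rows instead of two heap-linked objects.
arena = Arena()
base = arena.add(NODE_NAME, name="obj")
attr = arena.add(NODE_ATTRIBUTE, children=[base], name="field")
print(arena.kinds[attr], arena.children[attr], arena.names[arena.name_refs[attr]])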
📊 Understanding Frequency in Data Analysis

In data analysis, frequency shows which values appear more often.

🔹 What is Frequency?
Frequency tells us how many times a value appears in a dataset.

🔹 Why is it Important?
It helps us understand the data distribution and spot patterns easily.

📊 Types of Frequency in Data Analysis

🔹 1. Absolute Frequency
The exact number of times a value appears in a dataset.
👉 Example: 5 appears 3 times → frequency = 3

🔹 2. Relative Frequency
The proportion (or percentage) of each value compared to the total.
👉 Formula: frequency ÷ total observations

🔹 3. Cumulative Frequency
The running total of frequencies, adding values step by step.
👉 Helps understand the distribution up to a certain value

💻 Python Code – Frequency Example

import pandas as pd

# Sample data
data = [1, 2, 2, 3, 3, 3, 4]
df = pd.DataFrame(data, columns=["Values"])

# Absolute frequency
frequency = df["Values"].value_counts()

# Relative frequency (proportion of each value)
relative_frequency = df["Values"].value_counts(normalize=True)

# Cumulative frequency: sort by value first so the running total
# accumulates in value order rather than in count order
cumulative_frequency = frequency.sort_index().cumsum()

print(frequency)
print(relative_frequency)
print(cumulative_frequency)

✨ Output Insight
value_counts() → count of each value
normalize=True → proportion (relative frequency)
sort_index().cumsum() → cumulative frequency (running total in value order)

#HOPE AI Logeshwari Parthiban Ramisha Rani K
#DataScience #DataAnalysis #Python #Pandas #Statistics #MachineLearning #Analytics #LearningJourney #DataVisualization #ExploratoryDataAnalysis #Coding #LinkedInLearning
In real-world Machine Learning and Data Science workflows, handling JSON data is a fundamental skill. JSON (JavaScript Object Notation) is a widely used data format because it is lightweight, human-readable, and supported across almost all programming languages. It is commonly used for data exchange between APIs, servers, and web applications.

🔹 Working with Local JSON Files
JSON data stored locally can be loaded directly into a DataFrame with Pandas:
pd.read_json("train.json")

🔹 Fetching JSON Data from APIs
Data can also be fetched from external sources by passing a URL:
pd.read_json(url)
APIs typically return data in JSON format, which makes it easy to parse and analyze.

🔹 Handling Nested JSON Data
In many real-world scenarios, JSON data is nested. To flatten it into a structured tabular format, use:
pd.json_normalize()

🔹 Key Takeaways
• JSON is a universal and API-friendly data format
• Pandas simplifies reading JSON from both files and URLs
• Nested JSON requires normalization for proper analysis
• Always explore and understand the data after loading

Understanding how to work with JSON efficiently is an essential step in building robust data pipelines and ML systems.

#MachineLearning #DataScience #Python #Pandas #AI #LearningInPublic #DeepLearning #DataScientist
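To make the nested-JSON point concrete, here is a small sketch; the field names are invented, not taken from any particular API.

import pandas as pd

# Hypothetical nested API response (field names are illustrative).
response = [
    {"id": 1, "user": {"name": "Asha", "country": "IN"},
     "orders": [{"item": "book", "qty": 2}, {"item": "pen", "qty": 5}]},
    {"id": 2, "user": {"name": "Ben", "country": "US"},
     "orders": [{"item": "lamp", "qty": 1}]},
]

# Flatten the nested 'user' object into columns like user.name, user.country.
flat = pd.json_normalize(response)

# Explode the 'orders' list into one row per order, carrying id and user name along.
orders = pd.json_normalize(response, record_path="orders", meta=["id", ["user", "name"]])

print(flat)
print(orders)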
Post 4/7 - Text2SQL Project (Post 3 link in 👇 comment)

My AI agent was confidently wrong. That's the scariest thing about LLMs in production. They don't say "I don't know." They make something up. And it SOUNDS right.

Here are the 4 hardest problems I solved building this agent:

1. HALLUCINATION - The LLM invented column names that didn't exist
Fix: Schema injection + "Echo Test" guardrail

2. MULTI-CURRENCY MATH - Sales in INR + YEN got summed together
Fix: Shadow Discovery + Single-Pass CASE normalization

3. SQL DIALECT WARS - SQLite syntax ≠ SQL Server syntax
Fix: Dialect-aware prompts with a db_mode parameter

4. THE AGENT "FORGETS" - Same mistake, different day
Fix: Automated Learning Loop that saves corrections permanently

Each one of these took a day, not hours, to test and validate the implementation. And each one taught me something no tutorial covers: production AI is 20% building, 80% defending against edge cases.

The real lesson wasn't technical. It was this: LLMs need constraints, not just prompts. The architecture is what makes it reliable.

Swipe through - every challenge is mapped to the exact fix. Post 5 drops tomorrow: the full architecture deep dive.

What's the nastiest edge case YOU'VE hit with AI? Let's swap war stories 👇

#AI #LLM #TextToSQL #DataEngineering #RAG #BuildInPublic #Gemini #Python #MachineLearning
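For anyone curious what "schema injection" can look like in practice, here is a rough sketch of one way to do it. This is an illustration, not the author's implementation; the table, columns, and db_mode values are placeholders, and the guardrail shown is a simpler variant of the idea: letting SQLite itself reject queries that reference hallucinated tables or columns.

import sqlite3

SCHEMA = {"sales": ["order_id", "amount", "currency", "order_date"]}  # placeholder schema

def build_prompt(question, db_mode="sqlite"):
    # Schema injection: the prompt lists only real tables and columns,
    # and names the target SQL dialect explicitly.
    schema_text = "\n".join(
        f"TABLE {table} ({', '.join(cols)})" for table, cols in SCHEMA.items()
    )
    return (
        f"You write {db_mode} SQL only.\n"
        f"Use only these tables and columns, nothing else:\n{schema_text}\n\n"
        f"Question: {question}\nSQL:"
    )

def is_valid_against_db(sql, db_path="sales.db"):
    # Guardrail: EXPLAIN parses and plans the query without running it,
    # so hallucinated tables or columns raise an OperationalError.
    conn = sqlite3.connect(db_path)
    try:
        conn.execute(f"EXPLAIN QUERY PLAN {sql}")
        return True
    except sqlite3.OperationalError:
        return False
    finally:
        conn.close()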
💡 𝗦𝗤𝗟 & 𝗣𝘆𝘁𝗵𝗼𝗻 𝗶𝗻 𝗥𝗲𝗮𝗹-𝗪𝗼𝗿𝗹𝗱 𝗦𝗰𝗲𝗻𝗮𝗿𝗶𝗼𝘀 — 𝗪𝗵𝗲𝗿𝗲 𝗗𝗮𝘁𝗮 𝗠𝗲𝗲𝘁𝘀 𝗔𝗰𝘁𝗶𝗼𝗻

Knowing SQL and Python is one thing, but applying them to real-world problems is where true impact happens. In most modern data workflows, SQL and Python don't compete—they complement each other. SQL helps you quickly extract, filter, and aggregate structured data, while Python gives you the flexibility to clean, transform, analyze, and even predict outcomes using that data.

Think about everyday business problems like understanding customer behavior, detecting fraud, forecasting sales, or building automated dashboards. SQL plays a critical role in pulling the right data efficiently, and Python takes it further by adding logic, automation, and advanced analytics. Together, they power everything from ETL pipelines to machine learning models and real-time data processing systems.

What makes this combination powerful is not just the tools themselves, but how seamlessly they integrate into solving end-to-end data challenges. SQL gives you speed and precision with data access, while Python unlocks deeper insights and scalability. If you're aiming to grow in data engineering or analytics, mastering both isn't optional anymore—it's a necessity.

👉 𝗪𝗵𝗲𝗿𝗲 𝗵𝗮𝘃𝗲 𝘆𝗼𝘂 𝘂𝘀𝗲𝗱 𝗦𝗤𝗟 𝗮𝗻𝗱 𝗣𝘆𝘁𝗵𝗼𝗻 𝘁𝗼𝗴𝗲𝘁𝗵𝗲𝗿 𝗶𝗻 𝗿𝗲𝗮𝗹-𝘄𝗼𝗿𝗹𝗱 𝗽𝗿𝗼𝗷𝗲𝗰𝘁𝘀?

#SQL #Python #DataEngineering #DataScience #Analytics #ETL #BigData #MachineLearning #DataAnalytics
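As one concrete, hedged illustration of that split (all database, table, and column names below are invented): SQL does the filtering and aggregation close to the data, and Python picks up where SQL becomes awkward.

import sqlite3
import pandas as pd

# SQL side: push filtering and aggregation into the database.
conn = sqlite3.connect("shop.db")  # placeholder database
monthly = pd.read_sql_query(
    """
    SELECT strftime('%Y-%m', order_date) AS month,
           SUM(amount) AS revenue
    FROM orders
    WHERE status = 'completed'
    GROUP BY month
    ORDER BY month
    """,
    conn,
)
conn.close()

# Python side: derived metrics and simple logic on the aggregated result.
monthly["growth_pct"] = monthly["revenue"].pct_change() * 100
monthly["unusual"] = monthly["growth_pct"].abs() > 30
print(monthly)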
📊 Deep dive into Exploratory Data Analysis (EDA) - Real-world dataset analysis with Python

Recently, I completed a hands-on Jupyter Notebook focused on Exploratory Data Analysis (EDA) using a raw employee dataset. This exercise helped me understand how Python can be used to clean, transform, and analyze real-world messy data effectively.

Key learnings:
1) Cleaning raw data using string operations and regex
2) Handling missing values using mean, mode, and other appropriate imputation techniques
3) Converting data types for accurate analysis (categorical, numerical)
4) Performing data transformations to create structured, analysis-ready datasets
5) Exploring visualization techniques with Matplotlib and Seaborn (distribution plots, regression plots)
6) Applying encoding techniques like one-hot encoding for categorical variables
7) Practicing indexing, slicing, and feature-target separation
(A small illustrative sketch of a few of these steps is shown below.)

💡 Key Insight: Clean and well-structured data is the foundation of any successful data analysis or machine learning model. EDA plays a critical role in understanding data patterns, detecting anomalies, and preparing datasets for advanced analytics.

This milestone was completed under the guidance of KODI PRAKASH SENAPATI Sir, whose structured and practical teaching approach made these concepts easy to understand and apply.

This project strengthened my ability to work with real-world messy data and transform it into meaningful insights using Python 🚀

Continuing to build strong fundamentals in Data Analytics step by step!

#PythonProgramming #EDA #DataCleaning #DataVisualization #MachineLearning
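A minimal sketch of the cleaning, imputation, and encoding steps mentioned above, using made-up employee-style data (the column names and values are placeholders, not the actual dataset).

import pandas as pd

# Made-up messy employee data.
df = pd.DataFrame({
    "salary": ["$55,000", "$72,500", None, "$48,000"],
    "department": [" HR", "IT", "IT ", None],
})

# String/regex cleaning: strip spaces, drop currency symbols, cast to numeric.
df["department"] = df["department"].str.strip()
df["salary"] = df["salary"].str.replace(r"[$,]", "", regex=True).astype(float)

# Imputation: mean for the numeric column, mode for the categorical one.
df["salary"] = df["salary"].fillna(df["salary"].mean())
df["department"] = df["department"].fillna(df["department"].mode()[0])

# One-hot encode the categorical column.
df = pd.get_dummies(df, columns=["department"])
print(df)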
Python Series – Day 22: Data Cleaning (Make Raw Data Useful!)

Yesterday, we learned Pandas 🐼 Today, let's learn one of the most important real-world skills in Data Science:
👉 Data Cleaning

🧠 What is Data Cleaning?
Data Cleaning means fixing messy data before analysis. It includes:
✔️ Missing values
✔️ Duplicate rows
✔️ Wrong formats
✔️ Extra spaces
✔️ Incorrect values
📌 Clean data = Better results

Why It Matters?
Imagine this data:
| Name | Age |
| ---- | --- |
| Ali  | 22  |
| Sara | NaN |
| Ali  | 22  |
Problems:
❌ Missing value
❌ Duplicate row

💻 Example 1: Check Missing Values
import pandas as pd
df = pd.read_csv("data.csv")
print(df.isnull().sum())
👉 Shows missing values in each column.

💻 Example 2: Fill Missing Values
df["Age"] = df["Age"].fillna(df["Age"].mean())
👉 Replaces missing Age with the average value (assigning back to the column is safer than inplace=True on a single column).

💻 Example 3: Remove Duplicates
df.drop_duplicates(inplace=True)

💻 Example 4: Remove Extra Spaces
df["Name"] = df["Name"].str.strip()

🎯 Why Data Cleaning is Important?
✔️ Better analysis
✔️ Better machine learning models
✔️ Accurate reports
✔️ Professional workflow

⚠️ Pro Tip
👉 Real projects spend more time cleaning data than modeling

🔥 One-Line Summary
Data Cleaning = Convert messy data into useful data

📌 Tomorrow: Data Visualization (Matplotlib Basics)
Follow me to master Python step-by-step 🚀

#Python #Pandas #DataCleaning #DataScience #DataAnalytics #Coding #MachineLearning #LearnPython #MustaqeemSiddiqui
Every analytical database can aggregate, filter, and join. None of them can tell you "something is wrong with this data" as a first-class operation.

We just shipped native anomaly detection in Stratum. Train an isolation forest, score millions of rows — all from SQL:

SELECT * FROM transactions WHERE ANOMALY_SCORE('fraud_model') > 0.7;

No Python. No export pipeline. No serialization boundary. 6 microseconds per transaction, SIMD-accelerated, running inside the query engine.

The standard workflow today — export to pandas, fit scikit-learn, write results back — adds seconds of latency and an entire second runtime to maintain. For fraud detection on live transactions, those seconds matter.

Full write-up on why we built it and how it works: https://lnkd.in/gJkjgKaH

#Clojure #SQL #Analytics #DuckDB #AnomalyDetection #MachineLearning #DataScience
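For contrast, a rough sketch of the "export to pandas, score with scikit-learn, write back" baseline the post is comparing against. The table, column, and file names are invented, and this is not Stratum code.

import sqlite3
import pandas as pd
from sklearn.ensemble import IsolationForest

# Export: pull the rows out of the database into pandas.
conn = sqlite3.connect("transactions.db")  # placeholder database
tx = pd.read_sql_query("SELECT id, amount, merchant_risk FROM transactions", conn)

# Fit an isolation forest on the numeric features.
features = tx[["amount", "merchant_risk"]]
model = IsolationForest(n_estimators=100, random_state=0).fit(features)

# score_samples is higher for normal points, so negate it to get an anomaly score.
tx["anomaly_score"] = -model.score_samples(features)

# Write back: persist the scores so they can be queried in SQL again.
tx[["id", "anomaly_score"]].to_sql("transaction_scores", conn,
                                   if_exists="replace", index=False)
conn.close()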
🔷 My model could see all the right information. It was still getting things wrong. And I could not figure out why.

Then I plotted the total_amount_spent column and saw the problem immediately. A few customers had spent 50,000 shillings. Most had spent between 500 and 3,000. The column was not a bell curve. It was a spike on the left and a long flat tail stretching to the right.

The model was spending most of its energy trying to understand those big spenders at the far end. The regular customers in the middle were getting ignored. The data was right. The scale was wrong. So I transformed it.

import numpy as np
df["amount_spent_log"] = np.log1p(df["total_amount_spent"])

log1p adds 1 before taking the log, so zero values do not break everything. After the transformation, the distribution looked like a proper curve. The model could now treat the difference between a 500 shilling and a 1,000 shilling customer with the same attention it gave to the difference between a 20,000 and a 40,000 shilling customer.

Same data. Completely different picture. That is feature transformation. You are not creating new columns. You are not extracting hidden ones. You are changing the shape of what already exists so the model can actually read it properly.

• Engineering asks: what new information can I create?
• Extraction asks: what hidden information can I uncover?
• Transformation asks: what shape does this information need to be in?

📍 All three are different tools. All three are necessary. Knowing which one your data needs is the skill.

❓ Have you ever had a model improve significantly just by transforming a column you already had?

#DataScience #MachineLearning #Python
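A quick way to see the same effect numerically, on made-up numbers that mirror the spread described above: compare the skew of the column before and after the transform.

import numpy as np
import pandas as pd

# Made-up spend values with the kind of long right tail described above.
df = pd.DataFrame({"total_amount_spent": [500, 750, 900, 1200, 1800, 2500, 3000, 50000]})

print(df["total_amount_spent"].skew())   # large positive skew: long right tail

df["amount_spent_log"] = np.log1p(df["total_amount_spent"])
print(df["amount_spent_log"].skew())     # much closer to 0: roughly symmetric

# np.expm1 reverses log1p if you need model outputs back in shillings.
back = np.expm1(df["amount_spent_log"])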
Day 12/30: what I learned today

Introduction to Pandas: one of the most important libraries in Python for working with data.
import pandas as pd

It is used for loading and saving data in different formats, data cleaning and processing, and feature engineering for machine learning.

It provides:

Series (1D labelled array), like a single column of data with an index.
Creating a Series with pd.Series, either from:

A list:
sales = pd.Series([100, 150, 200, 175, 225])

A dictionary:
population = pd.Series({
    'California': 39538223,
    'Texas': 29145505,
    'Florida': 21538187,
    'New York': 20201249
})

Or with a custom index:
sales = pd.Series(
    [100, 150, 200, 175, 225],
    index=['Mon', 'Tue', 'Wed', 'Thu', 'Fri']
)

DataFrame (2D labelled data structure), like a spreadsheet or SQL table.
Creating a DataFrame from a dictionary:
data = {
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [25, 30, 35, 28],
    'city': ['New York', 'Los Angeles', 'Chicago', 'Houston'],
    'salary': [70000, 80000, 90000, 75000]
}
df = pd.DataFrame(data)

Reading data (CSV, JSON, Excel): Pandas can read data in different formats given a file path, and supports slicing and indexing.

Exploring DataFrames with Pandas functions such as:
head(): view a specific number of rows from the start of a DataFrame.
tail(): view a specific number of rows from the end of a DataFrame.
sample(): view a specific number of random rows from a DataFrame.
(A short sketch of these exploring functions is shown below.)

#30daysofTech #TsAcademy #Datascience
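A minimal sketch of the exploring functions mentioned above, reusing the same illustrative DataFrame; the two indexing lines at the end are extra illustration, not from the original post.

import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie", "Diana"],
    "age": [25, 30, 35, 28],
    "city": ["New York", "Los Angeles", "Chicago", "Houston"],
    "salary": [70000, 80000, 90000, 75000],
})

print(df.head(2))          # first 2 rows
print(df.tail(2))          # last 2 rows
print(df.sample(1))        # 1 random row
print(df.loc[0, "city"])   # label-based indexing: row 0, column 'city'
print(df.iloc[1:3, 0:2])   # position-based slicing: rows 1-2, first two columns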
This project is being built step by step. No shortcuts. Only engineering.