Here's a small thing that changed how I read code. Python naming convention: `lower_case_with_underscores`. PySpark: `groupBy`, `orderBy`, `printSchema`. For a while I just accepted it as "quirky." Then I learned PySpark is a Python API sitting on top of Apache Spark — a JVM engine built in Scala, where camelCase is standard. That "quirk" wasn't random. It was a signal. Abstractions leak. And when they do, the details they leave behind — naming conventions, error formats, edge case behaviors — are actually clues about the system underneath. Once you start seeing code this way, you stop being confused by inconsistencies. You start getting curious about them. What's leaking through tells you more than the documentation ever will. #Python #PySpark #Engineering #DataEngineering
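A minimal illustration of the leak: the camelCase methods on a PySpark DataFrame mirror the Scala API underneath, while the idiomatic Python around them stays snake_case. (This sketch is not from the post; the toy DataFrame is made up.)

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a", 1), ("b", 2)], ["label", "value"])

# camelCase, straight from the Scala/JVM side:
df.groupBy("label").count().orderBy("label").show()
df.printSchema()

# ...while the surrounding Python stays snake_case:
rows_as_dicts = [row.asDict() for row in df.collect()]
```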
Standardized date columns to yyyy-MM-dd

👉 Convert all date formats to a standard format:

```python
from pyspark.sql.functions import to_date, date_format

# Parse string dates into a DateType column:
df = df.withColumn("loan_date", to_date("loan_date", "yyyy-MM-dd"))

# If the column is already a date but you want formatted strings:
df = df.withColumn("loan_date", date_format("loan_date", "yyyy-MM-dd"))
```
Working on Real-World Data Problems Using Pure Python

Recently worked on a project focused on handling and analyzing structured data using core Python, without relying on libraries like NumPy or Pandas. The goal was to understand the logic from the ground up.

- Cleaned and structured raw JSON data
- Built logic for “People You May Know” (mutual connections; see the sketch below)
- Implemented “Pages You Might Like” recommendations
- Focused on problem-solving using basic data structures

This approach helped me strengthen my core data handling and logical thinking rather than depending on pre-built tools. Late nights after work, but worth it for the growth.

#Python #DataProcessing #DataScience #ProblemSolving #CorePython #Algorithms #NumPy #pandas
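A minimal sketch of the mutual-connections idea in pure Python; the data shape and function name are assumptions for illustration, not the project's actual code.

```python
# Hypothetical friendship map: user -> set of direct connections.
friends = {
    "alice": {"bob", "carol"},
    "bob": {"alice", "dave"},
    "carol": {"alice", "dave"},
    "dave": {"bob", "carol"},
}

def people_you_may_know(user):
    """Rank non-friends by how many connections they share with `user`."""
    mutual_counts = {}
    for friend in friends[user]:
        for candidate in friends[friend]:
            if candidate != user and candidate not in friends[user]:
                mutual_counts[candidate] = mutual_counts.get(candidate, 0) + 1
    # Candidates with the most mutual connections first.
    return sorted(mutual_counts, key=mutual_counts.get, reverse=True)

print(people_you_may_know("alice"))  # ['dave'] (shares bob and carol)
```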
Python Data Types — One Post Cheat Sheet

Understanding data types is fundamental to writing efficient Python code. Here’s a quick overview:

🔢 Numeric → int (10), float (10.5), complex (2+3j)
🔤 String (str) → ordered & immutable. Example: "Hello Python"
📋 List → ordered, mutable, allows duplicates. Example: [10, 20, 30]
📦 Tuple → ordered, immutable. Example: (10, 20, 30)
🔁 Set → unordered, no duplicates. Example: {10, 20, 30}
📖 Dictionary → key–value pairs, mutable. Example: {"name": "Maha", "age": 25}
🧠 Boolean → True / False, used in conditions
🔍 Check a type with type(variable)

Choosing the right data type improves performance, readability, and data handling.

#Python #DataTypes #PythonBasics #Programming #LearnPython #Coding #DataAnalytics #PythonForBeginners
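A quick sanity check of the cheat sheet, using the same example values; type() reports each built-in type.

```python
print(type(10))                           # <class 'int'>
print(type(10.5))                         # <class 'float'>
print(type(2 + 3j))                       # <class 'complex'>
print(type("Hello Python"))               # <class 'str'>
print(type([10, 20, 30]))                 # <class 'list'>
print(type((10, 20, 30)))                 # <class 'tuple'>
print(type({10, 20, 30}))                 # <class 'set'>
print(type({"name": "Maha", "age": 25}))  # <class 'dict'>
print(type(True))                         # <class 'bool'>
```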
📘 **Day 11 – File Handling in Python**

Today I learned about **File Handling in Python** 📂

👉 File handling allows us to **create, read, write, and update files**. It helps store data permanently instead of keeping it only in memory.

🔹 **Types of Modes:**

* `r` → read file
* `w` → write (overwrites file)
* `a` → append (adds data)
* `x` → create new file

🔹 **Basic Example:**

```python
# Writing to a file
file = open("example.txt", "w")
file.write("Hello, Python!")
file.close()

# Reading from a file
file = open("example.txt", "r")
print(file.read())
file.close()
```

💡 **Best Practice:** use a `with` statement, which closes the file automatically:

```python
with open("example.txt", "r") as file:
    data = file.read()
print(data)
```

✨ **Key Learning:** file handling is important for saving data like logs, user input, and reports.

🚀 Step by step becoming better in Python!

#Day11 #Python #CodingJourney #FileHandling #SkillCourse #DataAnalyst
Mastering Data Ingestion: Why NumPy is the Standard

For anyone working with numerical data in Python, the transition from built-in functions to NumPy is a game-changer. While Python’s open() function handles the basics, NumPy arrays offer a level of efficiency and speed that standard lists simply cannot match.

Why use NumPy for flat files?

- The industry standard: NumPy arrays are the backbone of the Python data ecosystem.
- Essential for ML: if you plan to use libraries like scikit-learn, your data needs to be in NumPy format.
- Built-in efficiency: functions like loadtxt() and genfromtxt() make importing arrays seamless.

Pro tips for np.loadtxt(): the real power lies in the customization arguments (see the sketch below).

- delimiter: the default is whitespace. For CSVs, always specify delimiter=','.
- skiprows: perfect for bypassing headers (e.g., skiprows=1) so string labels don't break your numerical array.
- usecols: optimization starts at ingestion. Grab only what you need by passing a list of indices, like usecols=[0, 2].
- dtype: control your data types from the start (e.g., dtype='str').

The catch: while loadtxt() is excellent for clean, uniform datasets, it hits a wall with mixed data types (like the Titanic dataset). When your columns vary between strings and floats, it’s time to level up to genfromtxt() or move into the world of Pandas.

#DataEngineering #python #Numpy #Learninginpublic
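A minimal sketch putting those arguments together; the file names and column indices are placeholders, not from the post.

```python
import numpy as np

# Clean, uniform numeric data: loadtxt with explicit arguments.
data = np.loadtxt(
    "measurements.csv",   # placeholder file name
    delimiter=",",        # default is whitespace; CSVs need an explicit comma
    skiprows=1,           # skip the header row so labels don't break the float array
    usecols=[0, 2],       # load only the columns you need
)

# Mixed string/float columns: genfromtxt can infer a dtype per column.
mixed = np.genfromtxt(
    "titanic.csv",        # placeholder file name
    delimiter=",",
    names=True,           # take field names from the header row
    dtype=None,           # let NumPy infer each column's type
    encoding="utf-8",
)
```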
New blog post on dynamic vertical scaling in Microsoft Fabric Python notebooks. It’s a nice trick that can be really useful, especially with unpredictable workloads. It isn’t new, but it wasn’t really documented; another reason why Fabric Pipelines are awesome. I tested it with 158 GB of CSV. https://lnkd.in/gpKysemw #onelake #python #Microsoftfabric #pipeline #polars #duckdb #lakesail
Most Python classes I've seen in DS projects do too much! They load data, clean it, transform it, run the model, and log results... all in one place. It feels efficient until you need to change one thing and have to re-test everything else. That's the cost of ignoring the Single Responsibility Principle. 🐍 In my latest article, I break down what SRP actually means for Python data pipelines: https://lnkd.in/esKz_ARk This is post 1 of 5 in a series on SOLID principles applied to Data Science code. What's the messiest class you've inherited on a DS project? 👇 #Python #DataScience #SoftwareEngineering #SOLID #DataEngineering
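A minimal sketch of the principle (class names and methods are illustrative, not from the linked article): each class has exactly one reason to change, so editing the cleaning logic never forces a re-test of loading or logging.

```python
import pandas as pd

class Loader:
    """Single responsibility: getting raw data in."""
    def __init__(self, path: str):
        self.path = path

    def load(self) -> pd.DataFrame:
        return pd.read_csv(self.path)

class Cleaner:
    """Single responsibility: making the data usable."""
    def clean(self, df: pd.DataFrame) -> pd.DataFrame:
        return df.dropna().drop_duplicates()

class ResultLogger:
    """Single responsibility: reporting what happened."""
    def log(self, df: pd.DataFrame) -> None:
        print(f"{len(df)} clean rows ready for modelling")

# Compose small pieces instead of one god-class doing everything.
df = Cleaner().clean(Loader("loans.csv").load())  # "loans.csv" is a placeholder
ResultLogger().log(df)
```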
Most beginners learn SQL and Python separately. Today, I connected both. I extracted real data from MySQL and processed it using Python (pandas) in VS Code. This is where data science actually starts: not just writing queries, but turning raw data into something useful. Still early in my journey, but now I’m focusing on building complete workflows instead of isolated skills. Next step: deeper analysis and real insights. #DataScience #SQL #MySQL #Python #Pandas #DataAnalytics #DataWorkflow #LearningByDoing #TechSkills #FutureDataScientist
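A minimal sketch of that MySQL-to-pandas handoff; the connection string, credentials, and table name are placeholders.

```python
import pandas as pd
from sqlalchemy import create_engine

# Placeholder credentials and database name.
engine = create_engine("mysql+pymysql://user:password@localhost:3306/shop")

# The query result lands directly in a DataFrame, ready for analysis.
df = pd.read_sql("SELECT * FROM orders", engine)
print(df.head())
print(df.describe())
```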
If you want to improve your Data Science and Python skills, this course is for you. You'll use popular Python libraries like Pandas, scikit-learn, and NumPy to extract and clean data, then analyze it. You'll also learn about grouping & aggregation functions, merging datasets, and using regex, plus some Machine Learning techniques, too. https://lnkd.in/gK3gfthg