Is R Becoming a Niche Guest in Its Own House?

For those of us who grew up in the Tidyverse, the recent ripples in the data ecosystem feel more like a tidal wave. After planning a 200,000-line codebase transition from R to Python, I've been reflecting on five pivotal shifts that signal a "New World Order" in data science:

1. The "Pandas" Effect and the Memory Revolution
Wes McKinney didn't just give Python a DataFrame; he (and the subsequent Apache Arrow movement) unified the underlying data infrastructure. With Posit bringing Wes into the fold, the industry's focus shifted from language-specific tools to language-agnostic, high-performance kernels.

2. The End of an Era: From R Markdown to Quarto
The departure of Yihui Xie from Posit wasn't just a personnel change; it was a symbolic turning point. As Quarto supersedes R Markdown, we see a move toward a multi-language future. R is no longer the center of the solar system; it's just one of the planets orbiting the "publish anything" sun.

3. The Shiny Expansion (and Dilution?)
Shiny for Python is a technical marvel, but it marks the fall of R's last "monopoly." When the most efficient tool for interactive dashboards goes cross-platform, gravity inevitably pulls toward the broader Python ecosystem for production-grade deployment.

4. The SparkR Sunset
With SparkR deprecated and the baton passed to sparklyr, the message from big-data platforms is clear: core development is moving elsewhere. R is being reframed as a specialized "interface" rather than a first-class citizen in massive-scale parallel computing.

5. The Infrastructure Barrier: The "Shared Cluster" Problem
In modern cloud environments like Databricks, the lack of R support on shared clusters is a deal-breaker for many enterprise architects. When you can't share resources or scale multi-user environments in R, you aren't just losing a language; you're losing the battle for ROI and stability.

My takeaway: I am not pessimistic about R's survival. It will always remain the gold standard for deep statistical rigor and validated research, especially in the pharmaceutical industry. For AI automation and big-data engineering, however, the "Great Consolidation" toward Python is no longer a trend; it's a finished reality. If you are building for the next 10 years of stability (and avoiding the 3-year re-validation nightmare), it's time to stop fighting the current and start mastering the new stack.

What do you think? Is R returning to its roots as a specialist's tool, or is it losing its seat at the head of the table?

#DataScience #RStats #Python #BigData #AI #Databricks #Pharmaceuticals #Quarto #TechTrends #DataEngineering
🚀 New Release: xarray-eopf v0.2.7

We're excited to share the latest release of the xarray EOPF backend, bringing easier, analysis-ready access to the new ESA Earth Observation Processing Framework (EOPF) data in Zarr format: https://lnkd.in/eyPiyjCc

🌍 What is the new EOPF Zarr format?
The new EOPF Zarr format is designed to modernize and unify how Sentinel mission data is stored and accessed. It replaces the legacy SAFE formats with a hierarchical, cloud-optimized data structure, aiming to streamline data access across different Sentinel missions. By adopting Zarr, the format enables more efficient, scalable workflows, especially in cloud and large-scale processing environments.

🌍 What is xarray-eopf?
xarray-eopf is a Python package that extends xarray with a new backend: "eopf-zarr". It enables you to:
- Read the EOPF Sentinel Zarr Samples: https://lnkd.in/eZydDqU5
- Access data via the EOPF STAC Catalog: https://lnkd.in/dZDzNV8w
- Work with analysis-ready data models for faster insights

✨ Key Highlights

🔹 𝗦𝗶𝗺𝗽𝗹𝗲 𝗮𝗰𝗰𝗲𝘀𝘀 𝘄𝗶𝘁𝗵 𝘅𝗮𝗿𝗿𝗮𝘆
xr.open_dataset(..., engine="eopf-zarr")
xr.open_datatree(..., engine="eopf-zarr")

🔹 𝗧𝘄𝗼 𝗳𝗹𝗲𝘅𝗶𝗯𝗹𝗲 𝗺𝗼𝗱𝗲𝘀
- analysis mode → optimized, user-friendly, and ready for analysis
- native mode → close to the original Zarr structure with minimal changes

🔹 𝗕𝘂𝗶𝗹𝘁 𝗳𝗼𝗿 𝗺𝗼𝗱𝗲𝗿𝗻 𝗘𝗢 𝘄𝗼𝗿𝗸𝗳𝗹𝗼𝘄𝘀
- Supports Sentinel-1, Sentinel-2, and Sentinel-3 data in Zarr
- Access subgroups directly as datasets
- Lazy loading with Dask for scalable, cloud-native processing

🔹 𝗔𝗱𝘃𝗮𝗻𝗰𝗲𝗱 𝗰𝗮𝗽𝗮𝗯𝗶𝗹𝗶𝘁𝗶𝗲𝘀 (𝗮𝗻𝗮𝗹𝘆𝘀𝗶𝘀 𝗺𝗼𝗱𝗲)
- Subset data efficiently (only load required chunks)
- Band selection & spatial resampling
- Reprojection to a user-defined CRS
- Rectification of satellite-native grids

👉 Resampling & reprojection powered by xcube-resampling: https://lnkd.in/e8TpPEHU

𝗡𝗼𝘁𝗲: The "eopf-zarr" engine extends the "zarr" engine. While the hierarchical data tree structure can also be opened using xarray.open_datatree with the "zarr" engine, direct access to subgroups and the advanced features of the analysis mode are only available when using the "eopf-zarr" engine.

🔍 Learn More & Try It Out
• 🌐 Project page: https://lnkd.in/dbm8Hq3c
• 💻 GitHub: https://lnkd.in/eM3mrTPi
• 📖 Documentation: https://lnkd.in/eSZaqN4H
• 🧪 Example: https://lnkd.in/e2qAwNXK
• 📓 Notebook Gallery: https://lnkd.in/ekDewcqA

If you're working with Sentinel-1, Sentinel-2, or Sentinel-3 data and want to explore the new EOPF Zarr format, give it a try, and share your feedback!

#EarthObservation #RemoteSensing #ESA #Copernicus #EOPF #Python #DataScience #Zarr #Xarray
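If you want to see what this looks like in practice, here is a minimal sketch, assuming the package is installed (pip install xarray-eopf); the store URL below is a placeholder, and the real sample URLs are in the links above:

import xarray as xr

# Placeholder path: substitute a real EOPF Sentinel Zarr sample URL from the links above.
store = "https://example.com/sentinel-2/sample.zarr"

# Open the whole product as a hierarchical tree via the eopf-zarr backend.
tree = xr.open_datatree(store, engine="eopf-zarr")
print(tree)

# Or open it as a flat dataset for analysis-style access.
ds = xr.open_dataset(store, engine="eopf-zarr")
print(ds.data_vars)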
#dataengineering

A data ingestion problem: 20–30K files. Download. Chunk. Vectorize. Every few hours. Forever.

Imagine this. It's 2am. Your pipeline is running, barely. A single Python script, looping through files one by one. Download. Chunk. Vectorize. Wait. Repeat. 25,000 files taking hours. And by the time it finishes, it's almost time to start again. Sound familiar?

Here's how you'd think through it, and why the "obvious" answers are often wrong. Let's walk through the available options.

Option 1: Serial (single-threaded)
Does one thing, finishes, moves on. Simple to write and debug; if it fails, you know exactly where. But for 25K files? You're waiting all night. Fine for a weekend prototype. A disaster in production.

Option 2: Async / concurrent
Send 100 requests before the first one comes back. A right step to take. Python's asyncio lets us fire off dozens of downloads simultaneously. I/O-bound work (waiting for HTTP responses) is where async shines. Runtime dropped dramatically. But we're still on one machine, one CPU core. Vectorization is CPU-heavy, and async won't help there.

Option 3: Multi-threaded / multi-process
Put every core to work. ThreadPoolExecutor or multiprocessing lets us use all CPU cores for the chunking and embedding work. Combined with async for downloads, this was a real upgrade (a sketch of this combination follows below). But Python's GIL limits true CPU parallelism in threads; you need multiprocessing to escape it. Still a single machine. Still a single point of failure.

Option 4: Apache Spark
Distribute the job across a cluster. Spark is extraordinary, when you need it. Petabytes? Millions of files? Yes. 25K files every few hours? You're spending more time on cluster management than on the actual work. Spark has high overhead. Don't bring a rocket ship to a road trip.

Option 5: Highly available distributed service
A queue. Workers. Retries. Observability. Always on. This is where we landed for production. A task queue (Celery, RQ, or a cloud-native option like Cloud Tasks) pulls jobs off a queue. Workers process independently. Failed jobs retry automatically. New files? Push to the queue; workers scale up. It's more complex to set up than the first three options, but it's the only option that handles real-world messiness: flaky APIs, partial failures, midnight spikes.

The lesson? Each step wasn't an upgrade in prestige. It was an upgrade in the problem being solved.
Serial → Async: you're I/O-bound.
Async → Multi-process: you're CPU-bound.
Single node → Distributed: you need fault tolerance.
Spark → HA service: you need continuous operation, not just scale.

Know which problem you actually have before you reach for the hammer.
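Here is a minimal sketch of options 2 and 3 combined: async downloads with a process pool for the CPU-bound stage. It assumes the third-party aiohttp library; the URLs and the vectorize stub are placeholders, not production code.

import asyncio
from concurrent.futures import ProcessPoolExecutor

import aiohttp  # third-party: pip install aiohttp

def chunk_and_vectorize(payload: bytes) -> int:
    # CPU-bound stage: stand-in for real chunking/embedding work.
    return len(payload)

async def fetch(session, url, sem):
    async with sem:  # cap the number of in-flight requests
        async with session.get(url) as resp:
            resp.raise_for_status()
            return await resp.read()

async def main(urls):
    sem = asyncio.Semaphore(100)  # ~100 concurrent downloads
    loop = asyncio.get_running_loop()
    async with aiohttp.ClientSession() as session:
        # I/O-bound: overlap all downloads (for brevity this holds every
        # payload in memory; a production version would process in batches).
        payloads = await asyncio.gather(*(fetch(session, u, sem) for u in urls))
    # CPU-bound: sidestep the GIL with separate processes.
    with ProcessPoolExecutor() as pool:
        results = await asyncio.gather(
            *(loop.run_in_executor(pool, chunk_and_vectorize, p) for p in payloads)
        )
    return results

if __name__ == "__main__":
    urls = [f"https://example.com/files/{i}" for i in range(100)]  # placeholder URLs
    print(sum(asyncio.run(main(urls))))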
Spark is powerful… but it punishes poor fundamentals. And most people learn this too late.

Everyone loves the idea of distributed compute. Massive scale. Parallel processing. "Big data" handled effortlessly. Until things slow down… break… or cost 10x more than expected.

Here's the uncomfortable truth: Spark doesn't fail randomly. It exposes weak thinking.

The most common mistakes I see:

1. Treating Spark like pandas
❌ Bad:
df['new_col'] = df['a'].apply(lambda x: x * 2)
→ Row-by-row thinking
→ Forces Python execution (slow)
✅ Good:
from pyspark.sql.functions import col
df = df.withColumn("new_col", col("a") * 2)
→ Uses Spark's optimiser
→ Runs in parallel across the cluster

2. Ignoring data movement (the real cost)
❌ Bad:
df1.join(df2, "id").groupBy("category").count()
→ Large join + shuffle with no thought
→ Expensive and slow
✅ Good:
from pyspark.sql.functions import broadcast
df1.join(broadcast(df2), "id").groupBy("category").count()
→ Broadcasts the small table
→ Avoids a massive shuffle

3. No partitioning strategy
❌ Bad:
df.write.parquet("s3://bucket/data/")
→ One huge dataset
→ Slow reads later
✅ Good:
df.write.partitionBy("date").parquet("s3://bucket/data/")
→ Queries can prune data
→ Faster + cheaper reads

4. Blindly trusting defaults
❌ Bad:
df.repartition(2000)
→ Arbitrary number
→ Can create tiny partitions or overhead
✅ Good:
df.repartition("country")
→ Partition aligned with query patterns
→ Smarter distribution

5. Debugging at the wrong level
❌ Bad:
df.show()
→ "Looks fine"
→ No idea how it runs
✅ Good:
df.explain()
→ See the execution plan
→ Identify shuffles, scans, inefficiencies

How to think about Spark instead:
Stop thinking in steps. Start thinking in data flow.
Stop thinking in rows. Start thinking in partitions.
Stop thinking in code. Start thinking in execution plans.

The shift that changes everything: instead of "Does this code run?" ask "How will this execute across a cluster?"
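To see these ideas end to end, here is a minimal, self-contained PySpark sketch on toy data in local mode. The table and column names are illustrative, not from the post:

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast, col

spark = SparkSession.builder.master("local[*]").appName("fundamentals").getOrCreate()

# Toy stand-ins for a "large" fact table and a "small" dimension table.
orders = spark.createDataFrame(
    [(1, 1, 10.0), (2, 2, 5.0), (3, 1, 7.5)],
    ["order_id", "customer_id", "amount"],
)
customers = spark.createDataFrame(
    [(1, "DE"), (2, "FR")],
    ["customer_id", "country"],
)

# Column expressions instead of row-by-row Python (point 1).
orders = orders.withColumn("amount_with_vat", col("amount") * 1.19)

# Broadcast the small table to avoid shuffling the big one (point 2).
joined = orders.join(broadcast(customers), "customer_id")

# Inspect the plan instead of trusting .show() (point 5):
# look for a BroadcastHashJoin rather than a SortMergeJoin.
joined.groupBy("country").count().explain()

spark.stop()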
Every analytical database can aggregate, filter, and join. None of them can tell you "something is wrong with this data" as a first-class operation.

We just shipped native anomaly detection in Stratum. Train an isolation forest, score millions of rows, all from SQL:

SELECT * FROM transactions
WHERE ANOMALY_SCORE('fraud_model') > 0.7;

No Python. No export pipeline. No serialization boundary. 6 microseconds per transaction, SIMD-accelerated, running inside the query engine.

The standard workflow today (export to pandas, fit scikit-learn, write the results back) adds seconds of latency and a whole second runtime to maintain alongside the database. For fraud detection on live transactions, those seconds matter.

Full write-up on why we built it and how it works: https://lnkd.in/gJkjgKaH

#Clojure #SQL #Analytics #DuckDB #AnomalyDetection #MachineLearning #DataScience
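For contrast, here is a minimal sketch of the export-to-pandas baseline described above, using standard scikit-learn. The table and column names are illustrative; Stratum's own API is only what's shown in the SQL above:

import pandas as pd
from sklearn.ensemble import IsolationForest

# 1. Export: pull the table out of the database (CSV as an illustrative stand-in).
df = pd.read_csv("transactions.csv")
features = df[["amount", "merchant_risk", "hour_of_day"]]  # hypothetical columns

# 2. Fit: train an isolation forest outside the database.
model = IsolationForest(contamination=0.01, random_state=42)
model.fit(features)

# 3. Score and write back: score_samples is high for inliers and low for
#    outliers, so negate it to get a score where bigger means more anomalous.
df["anomaly_score"] = -model.score_samples(features)
df.to_csv("transactions_scored.csv", index=False)

# Each round trip through steps 1-3 is the latency and operational cost
# that an in-engine ANOMALY_SCORE() avoids.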
I just learned that web scraping is 10% coding and 90% problem-solving.

I've been building a Data Collection repository to house different techniques for my data science projects, and today's session was a massive eye-opener.

Beyond the Script: Building a Resilient Data Acquisition Pipeline

Web scraping is more than just fetching a URL; it's about building a pipeline that can handle the unpredictability of the web. Today, I reached a major milestone in my Data Collection Techniques repository. Instead of a basic "one-and-done" script, I implemented a robust local-first ETL architecture to aggregate 200+ records across multiple pages.

The Logic Breakdown

Persistent Ingestion Layer: I designed the system to stage raw HTML snapshots locally before processing. This "snapshot" approach allows for offline development, reduces redundant network load, and ensures I have a verifiable source of truth.

Strict Endpoint Validation: To ensure high data fidelity, I implemented logic to validate every server response. By verifying HTTP status codes and schema consistency at the point of ingestion, I prevented corrupt or "silent failure" data from ever entering my pipeline.

Multi-Source Aggregation: I built a dynamic loop that traverses my local storage, programmatically extracting and cleaning data from 10+ distinct sources into a single, high-fidelity pandas DataFrame.

And the result: what started as fragmented HTML is now a sanitized, analysis-ready dataset. Data isn't just found; it's engineered.

Check out the architecture and the code here: https://lnkd.in/dwkezQpE

#DataEngineering #WebScraping #Python #Pandas #DataScience #ETL #SoftwareEngineering #AI #AIEngineer
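Here is a minimal sketch of the local-first pattern, not the repository's actual code. It assumes requests, beautifulsoup4, and pandas are installed; the URL and CSS selector are placeholders:

import pathlib

import pandas as pd
import requests
from bs4 import BeautifulSoup

SNAPSHOT_DIR = pathlib.Path("snapshots")
SNAPSHOT_DIR.mkdir(exist_ok=True)

def ingest(url: str, name: str) -> None:
    """Stage a raw HTML snapshot locally, validating the response first."""
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()  # reject non-2xx responses at the point of ingestion
    (SNAPSHOT_DIR / f"{name}.html").write_text(resp.text, encoding="utf-8")

def parse_all() -> pd.DataFrame:
    """Traverse the local snapshots and aggregate them into one DataFrame."""
    rows = []
    for path in sorted(SNAPSHOT_DIR.glob("*.html")):
        soup = BeautifulSoup(path.read_text(encoding="utf-8"), "html.parser")
        for item in soup.select(".record"):  # placeholder CSS selector
            rows.append({"source": path.stem, "text": item.get_text(strip=True)})
    return pd.DataFrame(rows)

# Usage: ingest once (or on a schedule), then develop the parser offline.
# ingest("https://example.com/page1", "page1")
# df = parse_all()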
Hello Techies,

Did you know you can train a machine learning model using just SQL, with no Python, no setup, and no deployment headaches?

I recently explored BigQuery ML and honestly, it changed how I think about ML workflows for data teams. Let me show you what I mean.

The Traditional Way (Python)

Imagine you work at an e-commerce company. Your manager asks: "Can we predict which website visitors are likely to make a purchase?"

As a data scientist, here's your to-do list:

import pickle
from google.cloud import bigquery
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

bigquery_client = bigquery.Client()

# 1. Extract data from the warehouse to your machine
df = bigquery_client.query("SELECT ...").to_dataframe()

# 2. Clean it manually
df['country'] = df['country'].fillna("")
df['pageviews'] = df['pageviews'].fillna(0)

# 3. Encode text columns (ML doesn't understand strings)
df['country'] = LabelEncoder().fit_transform(df['country'])

# 4. Split train/test
X, y = df.drop(columns=['will_purchase']), df['will_purchase']
X_train, X_test, y_train, y_test = train_test_split(X, y)

# 5. Train
model = LogisticRegression()
model.fit(X_train, y_train)

# 6. Save the model
pickle.dump(model, open('model.pkl', 'wb'))

# 7. Deploy an API so others can use it
# ... another few days of engineering work

Total time from question to answer: days to weeks
Skills needed: Python, sklearn, MLOps, deployment knowledge

The BigQuery ML Way (SQL)

Same problem. Same data. Here's your to-do list using BigQuery:

1. Train:

CREATE MODEL `ecommerce.purchase_predictor`
OPTIONS (model_type = 'logistic_reg', input_label_cols = ['will_purchase']) AS
SELECT will_purchase, device_type, country, pageviews, session_duration
FROM `ecommerce.visitor_data`;

2. Evaluate:

SELECT * FROM ML.EVALUATE(MODEL `ecommerce.purchase_predictor`);

3. Predict and get the business insight:

SELECT country, SUM(predicted_will_purchase) AS expected_buyers
FROM ML.PREDICT(MODEL `ecommerce.purchase_predictor`,
  (SELECT * FROM `ecommerce.next_month_visitors`))
GROUP BY country
ORDER BY expected_buyers DESC;

Total time from question to answer: minutes
Skills needed: SQL

What BigQuery ML handles automatically that you'd do manually in Python:
> Null/missing value handling
> Encoding text columns (country, OS, etc.)
> Train/test splitting
> Model storage: saved directly in BigQuery
> Deployment: ML.PREDICT is your API
> Scaling: handles petabytes natively

Supported model types in BigQuery ML today:
> Logistic & linear regression
> K-means clustering
> XGBoost & random forest
> Deep neural networks
> Time-series forecasting (ARIMA_PLUS)
> Imported TensorFlow models (PyTorch via ONNX conversion)

BigQuery ML won't replace data scientists, but it puts ML in the hands of every analyst who knows SQL. And that's a massive unlock for any data-driven organization.

Have you tried BigQuery ML? What was your experience? Drop it in the comments.

#BigQuery #GoogleCloud #MachineLearning #DataScience #SQL #BigQueryML #GCP #DataEngineering #MLOps #Analytics #CloudComputing #AI #DataAnalytics #Python #TechLearning
I wrote a bug today that took me 20 minutes to find. The function looked completely fine.

━━━━━━━━━━━━━━━━━━━━━━

def add_item(item, data=[]):
    data.append(item)
    return data

━━━━━━━━━━━━━━━━━━━━━━

I called it three times, expecting three separate lists. Got this instead:

▶ add_item("apple") → ["apple"]
▶ add_item("banana") → ["apple", "banana"]
▶ add_item("cherry") → ["apple", "banana", "cherry"]

Same list. Growing every time. I never passed a list; Python was reusing the same default list across every single call.

━━━━━━━━━━━━━━━━━━━━━━

This is Python's mutable default argument trap. The default value [] is created once, when the function is defined, not every time it's called. So every call without an argument shares the exact same list object in memory.

My software engineering brain expected fresh memory every time. That's how C++ and Java work. Python doesn't work that way.

━━━━━━━━━━━━━━━━━━━━━━

The fix:

def add_item(item, data=None):
    if data is None:
        data = []
    data.append(item)
    return data

None as the default. Fresh list created inside. Done.

━━━━━━━━━━━━━━━━━━━━━━

The scary part? This bug doesn't crash your program. It silently gives you wrong results. In a data science pipeline, that means corrupted data with zero error messages.

━━━━━━━━━━━━━━━━━━━━━━

Senior developers: what's the silent bug that once corrupted your data without a single error? Would love to know I'm not alone in this.

SE → Data Science | OOP Series #2 | IUB

#Python #OOP #DataScience #100DaysOfCode #SoftwareEngineering
We need to start caring about data packaging again.

I migrated Rahu's Python AST from a pointer-heavy recursive structure to an arena-backed one, and it improved both analysis and lookup much more than I expected.

Rahu is a Python language server I'm building from scratch in Go.

The old AST used separate structs, pointers, and slices to model recursive trees. That made it easy to work with, but it also meant many small allocations, pointer chasing, and poor cache locality in hot paths.

The new AST is stored as a flat arena: compact nodes in a contiguous slice, stable NodeIDs, sibling-linked children, and side tables for names, strings, and numbers.

A good example is attribute access. In the old AST, obj.field was an Attribute node pointing to both the base expression and a separate Name node. In the new one, it's just a NodeAttribute plus child IDs into the same array. Traversal involves indexed access instead of following heap pointers.

The result:
- AnalysisSmall: ~84 µs → ~55 µs
- AnalysisMedium: ~183 µs → ~117 µs
- AnalysisLarge: ~2.15 ms → ~1.85 ms
- DefinitionLookup: ~205 ns → ~30 ns
- HoverLookup: ~207 ns → ~34 ns
- DefinitionLookupAll: ~12.2 µs → ~1.36 µs

The geomean across the benchmark set dropped by about 45%. Some construction-heavy paths worsened slightly, which is expected: the arena model added bookkeeping and shifted work into indexing and side tables. The edit-time analysis path improved, and lookup improved significantly, which matters more for the actual LSP experience.

The main takeaway for me was simple: data layout matters. I didn't change the language features. I changed AST storage and traversal, and that had a large effect on end-to-end performance.
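Rahu itself is written in Go, but the shape of the change is easy to show in a few lines of Python. This is a purely illustrative sketch of the arena idea (flat node storage, integer NodeIDs, a name side table), not Rahu's code:

from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    kind: str
    first_child: int = -1   # NodeID of first child, -1 if none
    next_sibling: int = -1  # NodeID of next sibling, -1 if none
    name_idx: int = -1      # index into a shared name side table

@dataclass
class Arena:
    nodes: List[Node] = field(default_factory=list)
    names: List[str] = field(default_factory=list)

    def add(self, node: Node) -> int:
        self.nodes.append(node)
        return len(self.nodes) - 1  # stable NodeID = index into the slice

    def intern(self, name: str) -> int:
        # Naive interning for brevity; a real arena would deduplicate.
        self.names.append(name)
        return len(self.names) - 1

# obj.field as arena nodes: an Attribute whose child is a Name.
arena = Arena()
obj_id = arena.add(Node("Name", name_idx=arena.intern("obj")))
attr_id = arena.add(Node("Attribute", first_child=obj_id, name_idx=arena.intern("field")))

# Traversal is indexed access into contiguous storage, no pointer chasing:
node = arena.nodes[attr_id]
print(node.kind, arena.names[node.name_idx], arena.nodes[node.first_child].kind)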
In academic theory, datasets are clean. In the industrial trenches, legacy ERPs export garbage.

Attempting to dump a static Bill of Materials (BOM) directly from a corporate Excel into a relational database is the perfect recipe for destroying your system's referential integrity. Hidden blank spaces, inconsistent nomenclatures, and mixed data typing (text vs. numbers) will crash any automated risk model.

In operations engineering, we do not rely on manual data entry; we build firewalls. In part four of the Obsolescence series on Datalaria, I deconstruct how to build a ruthless ETL (Extract, Transform, Load) pipeline to sanitize this structural entropy before it impacts your P&L:

1️⃣ Radical Cleansing (Pandas): Utilizing Python as a strict gatekeeper to standardize MPNs (Manufacturer Part Numbers) and enforce strict numerical typing.

2️⃣ Graph Shattering: Breaking the flat 2D Excel table into true hierarchical vectors (End Product -> Subassembly -> Component).

3️⃣ The Golden Rule (Idempotency): Implementing an upsert architecture in Supabase. The system must allow you to run the ingestion script 1,000 times consecutively without duplicating a single node. (A sketch of the idea follows below.)

Sandbox Strategy: Theory is no longer enough. Don't take my word for it; run it yourself. I have embedded a secure, interactive environment (Google Colab) inside the article. Without installing anything, you will watch a Python script ingest a corrupt CSV and build a relational tree in milliseconds.

👉 Access the interactive Sandbox and full analysis here: https://lnkd.in/eaxyv_mQ

#OperationsEngineering #DataEngineering #Python #Pandas #Supabase #SupplyChain #ETL #BOMManagement #FirstPrinciples
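To make points 1 and 3 concrete, here is a minimal sketch under stated assumptions: toy data, SQLite standing in for Supabase/Postgres (both accept INSERT ... ON CONFLICT DO UPDATE), and illustrative column names. It is not the article's actual pipeline:

import sqlite3
import pandas as pd

# Toy "corrupt" ERP export: stray spaces, inconsistent case, text-typed quantities.
raw = pd.DataFrame({
    "mpn": [" abc-100 ", "ABC-100", "xyz-9", "bad-1"],
    "qty": ["5", "5", "7", "oops"],
})

# 1. Radical cleansing: normalize MPNs, enforce numeric typing, drop rejects.
raw["mpn"] = raw["mpn"].str.strip().str.upper()
raw["qty"] = pd.to_numeric(raw["qty"], errors="coerce")
clean = raw.dropna(subset=["qty"]).drop_duplicates(subset=["mpn"])

# 3. Idempotency: an upsert keyed on the MPN primary key.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE parts (mpn TEXT PRIMARY KEY, qty REAL)")
for _ in range(3):  # re-running the ingestion must not duplicate a single node
    con.executemany(
        "INSERT INTO parts (mpn, qty) VALUES (?, ?) "
        "ON CONFLICT(mpn) DO UPDATE SET qty = excluded.qty",
        clean[["mpn", "qty"]].itertuples(index=False, name=None),
    )
print(con.execute("SELECT COUNT(*) FROM parts").fetchone())  # (2,) every time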
🚀 𝗗𝗮𝘆 𝟭𝟰/𝟯𝟬: 𝗧𝗵𝗲 𝗔𝗹𝗴𝗼𝗿𝗶𝘁𝗵𝗺𝗶𝗰 '𝗖𝗵𝗲𝗮𝘁 𝗖𝗼𝗱𝗲' (𝗧𝗶𝗺𝘀𝗼𝗿𝘁)

Two weeks down! Halfway through my #30DaysOfCode challenge. ⚡

We've seen the "Turtles" (O(n²)), the "Rockets" (O(n log n)), and the "Math Masters" (O(n)). But when you run .sort() in Python, Java, or Swift, which one does the computer actually pick?

The answer: none of them. It uses a hybrid sort called Timsort.

💡 𝗪𝗵𝘆 𝗰𝗼𝗺𝗯𝗶𝗻𝗲 𝗮𝗹𝗴𝗼𝗿𝗶𝘁𝗵𝗺𝘀?
There is no "perfect" algorithm:
- Insertion Sort (O(n²)): Lightning fast for tiny datasets (< 64 items) and "adaptive" (finishes in O(n) if the data is already sorted).
- Merge Sort (O(n log n)): A beast for massive data, but heavy on memory and complex for small tasks.

The Cheat Code: Dynamic Selection 🧠
Timsort is the ultimate pragmatist. It analyzes your data at runtime:
1. Identify "runs": It scans the array for naturally sorted chunks.
2. Sort small: If a chunk is small, it uses insertion sort for instant, low-overhead results.
3. Merge big: It then uses merge sort to "zip" these sorted chunks together into one final, stable O(n log n) result.
(A simplified sketch follows below.)

✅ 𝗪𝗵𝗮𝘁 𝗜 𝘁𝗮𝗰𝗸𝗹𝗲𝗱 𝘁𝗼𝗱𝗮𝘆:
- Synergy analysis: Why merge sort's stability and insertion sort's speed on small data are the "dream team."
- Adaptive power: How Timsort approaches O(n) linear speed on real-world, partially sorted data.
- Stability: Why preserving the order of duplicate items is mandatory for production-grade software.

🤖 𝗧𝗵𝗲 𝗔𝗜 𝗖𝗼𝗻𝗻𝗲𝗰𝘁𝗶𝗼𝗻: This "adaptive synthesis" is key to LLMs. A coherent response depends on maintaining sequential context. Just as Timsort preserves order, AI must preserve the relationship between words to make sense.

⚡ 𝗣𝗿𝗼𝗴𝗿𝗲𝘀𝘀: 𝟭𝟰/𝟯𝟬
The engines are mastered. Tomorrow, we move from how we process data to where we store it: data structures!

𝗤𝘂𝗲𝘀𝘁𝗶𝗼𝗻: Timsort is robust but needs extra memory (O(n) space). Can you name an adaptive hybrid sort that is in-place? (Hint: Go 1.19 uses it!) 👇

#30DaysOfCode #Algorithms #Timsort #HybridSorting #BigO #SoftwareEngineering #GoLang #Java #PHP #Day14 #BackendDevelopment
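Here is a simplified, illustrative Timsort-style sort in Python: insertion sort on fixed-size runs, then pairwise stable merges. Real Timsort also detects natural runs and uses galloping and run-stack heuristics, so this is a sketch of the idea, not CPython's implementation:

def insertion_sort(a, lo, hi):
    """Stable insertion sort of a[lo:hi] in place."""
    for i in range(lo + 1, hi):
        key = a[i]
        j = i - 1
        while j >= lo and a[j] > key:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = key

def merge(left, right):
    """Stable merge of two sorted lists."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:  # <= keeps equal items in order (stability)
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    out.extend(left[i:]); out.extend(right[j:])
    return out

def mini_timsort(a, run=32):
    # 1. Sort each small run with insertion sort (fast on small/partial data).
    for lo in range(0, len(a), run):
        insertion_sort(a, lo, min(lo + run, len(a)))
    # 2. Merge the sorted runs pairwise until one remains (merge sort phase).
    runs = [a[lo:lo + run] for lo in range(0, len(a), run)]
    while len(runs) > 1:
        runs = [merge(runs[i], runs[i + 1]) if i + 1 < len(runs) else runs[i]
                for i in range(0, len(runs), 2)]
    return runs[0] if runs else []

print(mini_timsort([5, 1, 4, 2, 8, 0, 3, 9, 7, 6]))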