📊 4 weeks. 100K+ Wikipedia edits. 1 key finding.

I'm happy to share WikiPulse – my first end-to-end data analytics project.

The question: Do Wikipedia edit spikes happen before or after real-world events?

The finding: Most significant spikes occur 1–2 days before events, suggesting editors anticipate rather than just react. Strongest signal: Academy Awards (r = 0.977, p < 0.05).

Tech stack:
- Python (pandas, NumPy, SciPy, statsmodels)
- Wikipedia API for data collection
- SQLite for local database storage
- Plotly for interactive visualizations
- Streamlit for dashboard & deployment

Live demo: https://lnkd.in/g9bNc3jB
GitHub: https://lnkd.in/ghTQfdng

Open to feedback and suggestions.

#DataAnalytics #Python #Streamlit #PortfolioProject
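The post doesn't show how the lead/lag relationship was measured, but one common approach can be sketched with SciPy: shift an event-intensity series across candidate lags and keep the lag with the highest Pearson correlation. The `best_lag` helper and the toy series below are illustrative assumptions, not WikiPulse's actual code.

```python
import numpy as np
from scipy import stats

def best_lag(edits, events, max_lag=3):
    """Return (lag, r): the day-shift at which edit counts correlate most
    strongly with an event series. A negative lag means edits lead events."""
    best = (0, -1.0)
    for lag in range(-max_lag, max_lag + 1):
        shifted = np.roll(events, lag)  # shift the event series by `lag` days
        r, _ = stats.pearsonr(edits, shifted)
        if r > best[1]:
            best = (lag, r)
    return best

# Toy series: edits spike on days 5-6, the "event" happens on day 7
edits = np.array([1, 1, 2, 1, 1, 9, 8, 3, 1, 1], dtype=float)
events = np.array([0, 0, 0, 0, 0, 0, 0, 1, 0, 0], dtype=float)
lag, r = best_lag(edits, events)  # edits lead the event by ~2 days
```

On this toy data the best alignment is at lag = -2, matching the "spikes occur 1–2 days before events" pattern the post describes.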
Wikipedia Edit Spikes Predict Real-World Events with Python
Day 24/75 — This one Python function helped me understand my data better 👇

When I started analyzing datasets, I felt overwhelmed. Too many rows. Too much information. Then I discovered this:

`df.groupby('city')['price'].mean()`

💡 What it does:
👉 Groups data by a category
👉 Calculates insights (like average, sum, count)

Example: Instead of looking at thousands of rows… I can instantly see:
📊 Average price per city

🚨 Why this is powerful:
• Turns raw data into insights
• Helps you compare groups easily
• Makes analysis faster and clearer

👨‍💻 Now I use it all the time to:
• Compare categories
• Find patterns
• Simplify data

Small function… but a big upgrade in how I analyze data.

Have you used groupby() before? 👇

#DataScience #Python #Pandas #DataAnalysis #LearningInPublic
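For readers who want to try the one-liner, here is a runnable sketch with a tiny made-up dataset (the cities and prices are invented for illustration):

```python
import pandas as pd

# Hypothetical listings: many raw rows collapse into one line per city
df = pd.DataFrame({
    "city":  ["Pune", "Pune", "Mumbai", "Mumbai", "Delhi"],
    "price": [100, 120, 300, 280, 150],
})

avg_price = df.groupby("city")["price"].mean()
# One row per city: Delhi 150.0, Mumbai 290.0, Pune 110.0
```

The same pattern extends to several statistics at once, e.g. `df.groupby("city")["price"].agg(["mean", "count"])`.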
Most pandas slowdowns aren't caused by bad data; they're caused by the loop you wrote to process it.

`iterrows()` is the default most analysts reach for when they need row-level logic. The problem: it converts each row into a Python Series, creating a new Python object per iteration and bypassing the vectorized NumPy operations that make pandas fast in the first place.

Vectorization fixes this: it operates on entire columns at once, no Python loop required.

→ Slow (iterrows):

```python
for idx, row in df.iterrows():
    df.at[idx, 'margin'] = row['revenue'] - row['cost']
```

→ Fast (vectorized):

```python
df['margin'] = df['revenue'] - df['cost']
```

Same result. On a 1M-row dataset, the vectorized version runs 50–100× faster.

This applies to new column calculations, conditional row flags, string transformations, or any operation where you're currently writing a loop.

📌 Pro tip: when your logic genuinely requires row-level access, `.apply(axis=1)` is a solid middle ground: still slower than pure vectorization, but dramatically faster than `iterrows()`.

What's one loop in your current pipeline you could replace today?

#DataAnalytics #Python #Data #DataScience #Analytics #DataEngineering #BI
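The post mentions that conditional row flags can be vectorized too. One common way (a sketch, using `np.where`, which the post itself doesn't show) looks like this:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"revenue": [120.0, 80.0, 200.0],
                   "cost":    [100.0, 90.0, 150.0]})

# The vectorized margin from the post
df["margin"] = df["revenue"] - df["cost"]

# Conditional row flag without a loop: np.where evaluates the whole column
df["profitable"] = np.where(df["margin"] > 0, "yes", "no")
```

`np.where(condition, a, b)` builds the entire flag column in one C-level pass, so there is no per-row Python object creation.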
🚀 Last month, I built and published my first Python package — Pristinizer

I wanted to solve a simple but real problem in data science:
👉 Cleaning and understanding raw datasets takes way too much time.

So I built Pristinizer, a lightweight Python package that helps streamline data cleaning + EDA in just a few lines of code.

🔍 What Pristinizer does:
• Cleans messy datasets (duplicates, missing values, column formatting)
• Generates structured dataset summaries
• Visualizes missing data (heatmap, matrix, bar chart)

⚙️ Tech Stack: Python • pandas • matplotlib • seaborn

📦 Try it out:

```python
# pip install pristinizer
import pristinizer as ps

df = ps.clean(df)
ps.summarize(df)
ps.missing_heatmap(df)
```

🧠 What I learned while building this:
• Designing a clean and intuitive API
• Structuring a real-world Python package
• Publishing to PyPI
• Writing proper documentation for users

📌 Next, I'm planning to add:
• Outlier detection
• Automated preprocessing pipelines
• Advanced EDA reports

Would love to hear your thoughts or feedback!

#Python #DataScience #MachineLearning #OpenSource #Pandas #EDA #Projects
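For context, a one-call cleaner like `ps.clean()` typically bundles a handful of pandas steps. The `clean` function below is a hypothetical sketch of that kind of bundling, not Pristinizer's actual implementation:

```python
import pandas as pd

def clean(df):
    """Illustrative only: the sort of steps a one-call cleaner might bundle
    (duplicates, column formatting, fully-empty rows)."""
    df = df.drop_duplicates()
    df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
    df = df.dropna(how="all")  # drop rows that are entirely empty
    return df

raw = pd.DataFrame({" Name ": ["a", "a", None],
                    "Score X": [1, 1, None]})
tidy = clean(raw)  # 1 row left, columns "name" and "score_x"
```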
Built a quick little project this week: justaskit

The idea was simple: most data tools make you learn SQL just to ask basic questions. So I made one where you just... ask. In plain English.

Upload a CSV, type "show me top 3 products by revenue", and it spits out a chart with an explanation in about 8 seconds.

Under the hood it's a multi-agent system built with LangGraph, where separate agents handle the analysis, visualization, and insights. I added full code transparency too, so you can see exactly what it's doing.

Stack: Python, FastAPI, Next.js 15, LangGraph, pandas

GitHub link in the comments if you want to check it out!

#AI #OpenSource #LangGraph #Python #BuildInPublic
If you are doing data analysis in Python, pandas pivot tables are one of the most powerful tools you can master.

They let you go from raw, messy data to a clean, structured summary in just a few lines of code: grouping by multiple dimensions, applying aggregation functions, handling missing values, and adding totals automatically.

Once you understand pivot tables, your data analysis workflow becomes significantly faster and more insightful. If you are still doing everything manually with loops and conditional logic, it is time to learn pivot tables.

Read the full post here: https://lnkd.in/eCaBFSB5

#Python #Pandas #DataScience #DataAnalysis #DataEngineering #Analytics
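Each of the capabilities listed above maps to a `pivot_table` parameter. A runnable sketch on made-up sales data:

```python
import pandas as pd

sales = pd.DataFrame({
    "region":  ["North", "North", "South", "South"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "revenue": [100, 150, 90, 110],
})

summary = pd.pivot_table(
    sales,
    values="revenue",
    index="region",      # grouping dimension on rows
    columns="quarter",   # grouping dimension on columns
    aggfunc="sum",       # aggregation function
    fill_value=0,        # handle missing region/quarter combinations
    margins=True,        # add "All" row/column totals automatically
)
```

The result is a region-by-quarter grid with an "All" row and column carrying the totals, replacing what would otherwise be nested loops and conditionals.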
The loop that takes 47 seconds becomes 0.3 seconds.

Day 11 of 30 -- Advanced Pandas Optimization

No new hardware. No rewrite. Just one change: replace iterrows() with a vectorized expression.

Here is what most Pandas developers do not realize: a DataFrame is backed by NumPy arrays -- contiguous C memory. When you write df.iterrows(), Python converts every row into a new Series object. You are running a Python for-loop over a C array. That is where the 47 seconds comes from.

Write `df['total'] = df['qty'] * df['price']` instead. That is a C loop on the raw array. 157x faster.

Today's topic covers:
- Why Pandas can be slow -- the Python loop trap explained
- Speed hierarchy -- iterrows 47s vs apply 28s vs itertuples 5s vs vectorized 0.3s
- dtype optimization -- 6 dtype conversions that cut memory by 70% before writing a single query
- An auto dtype downcast function that optimizes an entire DataFrame in 10 lines
- pd.eval and query for complex expressions without intermediate arrays
- Chunked processing -- 50M rows on a laptop with 6GB RAM
- Real scenario -- retail analytics, 48GB to 6GB, 4 hours to 8 minutes
- 8 optimization techniques, including the SettingWithCopyWarning trap
- 5 mistakes, including growing DataFrames in loops and loading unused columns

Key insight: Pandas is not slow. Writing Python loops over Pandas DataFrames is slow.

#Python #Pandas #DataEngineering #Performance #SoftwareEngineering #100DaysOfCode #PythonDeveloper #TechContent #BuildInPublic #TechIndia #DataScience #Analytics #PythonProgramming #LinkedInCreator #LearnPython #PythonTutorial
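The "auto dtype downcast function in 10 lines" mentioned above isn't reproduced in the post; a plausible sketch of the idea, using `pd.to_numeric`'s `downcast` parameter (my wording, not the post's exact function):

```python
import pandas as pd

def downcast(df):
    """Shrink each numeric column to the smallest dtype that fits its values.
    Illustrative sketch, not the post's exact 10-line function."""
    out = df.copy()
    for col in out.select_dtypes("integer"):
        out[col] = pd.to_numeric(out[col], downcast="integer")
    for col in out.select_dtypes("float"):
        out[col] = pd.to_numeric(out[col], downcast="float")
    return out

df = pd.DataFrame({"qty": [1, 2, 3], "price": [9.5, 3.25, 7.0]})
small = downcast(df)  # qty: int64 -> int8, price: float64 -> float32
```

Halving or quartering column widths like this is where large memory cuts (the post claims up to 70%) come from on wide numeric tables.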
One of the most common sources of confusion for pandas beginners, and even experienced analysts, is knowing when to use apply(), map(), and applymap(). They look similar. They sometimes produce the same result. But they are designed for completely different situations.

- `map()` is for single-column transformations and value substitution.
- `apply()` is for complex row-level or column-level logic across a DataFrame.
- `DataFrame.map()` (the successor to `applymap()`) is for applying the same transformation to every individual cell.

And before reaching for any of them, always check whether a vectorized operation can do the job faster.

Getting this right means cleaner code, better performance, and fewer bugs in your data pipelines.

Read the full post here: https://lnkd.in/e8sJfEgh

#Python #Pandas #DataScience #DataEngineering #DataAnalysis #Analytics
🐍 Working with data? Save this.

Honest truth: I keep coming back to these commands more than I'd like to admit. In most data projects, cleaning takes up more time than the actual analysis, and having the right commands at hand makes a real difference.

This Python Data Cleaning cheat sheet covers the 5 essentials I rely on constantly:
✅ Handling nulls and duplicates
✅ Quickly inspecting your dataset
✅ Renaming, converting & cleaning columns
✅ Filtering and slicing rows efficiently
✅ Merging and grouping data

If you work with pandas regularly, this should always be within reach.

Which of these do you use the most? 👇

#Python #DataScience #DataCleaning #Pandas #DataAnalytics
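The cheat sheet itself isn't in the post, so here is one illustrative command per essential, on made-up data (grouping stands in for merging/grouping):

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Ann", "Ann", "Bob", None],
                   "Score": ["10", "10", "7", "9"]})

# 1. Nulls and duplicates
df = df.drop_duplicates().dropna(subset=["Name"])

# 2. Quick inspection (prints dtypes, non-null counts, memory use)
df.info()

# 3. Rename and convert columns
df = df.rename(columns={"Name": "name", "Score": "score"})
df["score"] = df["score"].astype(int)

# 4. Filter rows efficiently
high = df[df["score"] > 8]

# 5. Group and aggregate
avg = df.groupby("name")["score"].mean()
```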
Most beginners write loops in Pandas to modify data. I did the same… until I realized something important 👇

👉 You don't need loops at all. With just one line of code, you can transform an entire column: faster, cleaner, and more efficient.

Example:

`df['value'] = df['value'] * 1.1`

No loops. No complexity. Just clean data transformation.

This is one of those small concepts that completely changes how you write code in Python and Pandas. If you're getting into Data Science, learning these patterns early can save you a lot of time later.

🎥 I've explained this in a short video, simple and practical.
Link: https://lnkd.in/gitJuMU8
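The same loop-free style also covers updates that only touch some rows. A small sketch (the cap value and column are invented) using boolean indexing with `.loc`:

```python
import pandas as pd

df = pd.DataFrame({"value": [100.0, 200.0, 50.0]})

# The one-liner from the post: scale every row at once
df["value"] = df["value"] * 1.1

# Conditional update, still no loop: cap anything above 150
df.loc[df["value"] > 150, "value"] = 150.0
```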
After working with NumPy, one question came to my mind 👇

"If NumPy is so powerful… why do we need Pandas?"

Here's what I understood:

NumPy is great for:
- numerical operations
- fast array computations

But real-world data is not that clean. We deal with:
- missing values
- column names
- mixed data (numbers + text)

That's where Pandas comes in.
👉 Built on top of NumPy
👉 Designed for structured data (tables)

Think of it like this:
NumPy → handles raw numbers efficiently
Pandas → makes data easier to read, clean, and analyze

This helped me connect the dots: it's not about choosing one, it's about using the right tool at the right stage.

Now I'm exploring Pandas to work with real datasets more effectively.

What do you find easier to work with — NumPy or Pandas?

#NumPy #Pandas #Python #DataEngineering #DataScience #CodingJourney #TechLearning
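The contrast is easy to see in a few lines (toy data; a sketch of the general point, not a benchmark):

```python
import numpy as np
import pandas as pd

# NumPy: fast homogeneous arrays; a missing value is just a float NaN
ages = np.array([25.0, 30.0, np.nan])

# Pandas: column names, mixed types, and missing-data handling built in
df = pd.DataFrame({"name": ["Asha", "Ben", "Chi"],
                   "age":  [25, 30, None]})

mean_age = df["age"].mean()  # NaN is skipped automatically
```

With raw NumPy you would need `np.nanmean(ages)` and separate arrays per column; the DataFrame keeps the labels, the text column, and the NaN handling together.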