✅ #PythonJourney | Day 152 — All API Endpoints Tested & Production Ready

Today: Comprehensive endpoint testing. The entire URL Shortener API is now fully operational!

Key accomplishments:

✅ Tested 4 critical endpoints:
• POST /api/v1/urls → Creates a shortened URL with an auto-generated short code
• GET /api/v1/urls → Returns the user's URL list (ordered newest first)
• GET /api/v1/urls/{url_id} → Retrieves specific URL details
• GET /{short_code} → Redirects to the original URL + tracks the click in the database

✅ Fixed the SQLAlchemy Click model:
• Issue: Composite primary key (id + clicked_at) prevented autoincrement
• Solution: Made id the sole primary key; clicked_at is now just a timestamp
• Result: Click tracking now works correctly

✅ Verified the full request/response cycle:
• Authentication: API key validation ✓
• Input validation: Pydantic models ✓
• Database operations: CRUD complete ✓
• Click tracking: Events recorded correctly ✓
• Response serialization: Clean JSON output ✓

✅ Data flow confirmed:
1. User creates URL → Stored in PostgreSQL
2. User accesses short code → Redirect happens
3. Click event → Recorded in clicks table
4. URL counter → Incremented automatically
5. JSON response → Properly formatted

What I learned today:
→ Comprehensive testing reveals edge cases early
→ SQLAlchemy's primary-key behavior affects autoincrement
→ Docker image caching can hide recent code changes
→ Click tracking requires careful database schema design
→ Manual testing validates the entire architecture

The API is now:
- ✅ Accepting requests from multiple sources
- ✅ Storing data reliably in PostgreSQL
- ✅ Returning proper JSON responses
- ✅ Tracking user behavior
- ✅ Handling redirects correctly
- ✅ Managing database transactions safely

Endpoints remaining to test:
- GET /api/v1/urls/{url_id}/analytics (analytics aggregation)
- DELETE /api/v1/urls/{url_id} (soft delete)

Status: The API core is production-ready. Next: a comprehensive pytest test suite.
This is what backend development looks like: build → test → debug → iterate → victory! #Python #FastAPI #API #Testing #Backend #PostgreSQL #Docker #SoftwareDevelopment #StartupLife
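The post doesn't include the Click model itself, so here is a minimal stdlib-only sketch of the autoincrement pitfall it describes, with sqlite3 standing in for SQLAlchemy + PostgreSQL and the table/column names assumed from the post. The same principle applies in SQLAlchemy, where autoincrement is only assumed for a single-column integer primary key.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Composite primary key (id + clicked_at): id is just another column,
# so the engine will NOT auto-assign it and the insert fails.
con.execute("""
    CREATE TABLE clicks_composite (
        id INTEGER NOT NULL,
        clicked_at TEXT NOT NULL,
        PRIMARY KEY (id, clicked_at)
    )
""")
try:
    con.execute("INSERT INTO clicks_composite (clicked_at) VALUES ('2024-01-01')")
    composite_autoincrements = True
except sqlite3.IntegrityError:
    composite_autoincrements = False  # NOT NULL constraint failed: no auto id

# Sole INTEGER PRIMARY KEY: the engine assigns ids automatically,
# and clicked_at becomes a plain timestamp column.
con.execute("""
    CREATE TABLE clicks (
        id INTEGER PRIMARY KEY,
        clicked_at TEXT NOT NULL
    )
""")
new_id = con.execute(
    "INSERT INTO clicks (clicked_at) VALUES ('2024-01-01')"
).lastrowid
```

Making id the sole primary key (as the post's fix does) is what lets the database generate ids again; clicked_at stays queryable for analytics without being part of row identity.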
Marcos Vinicius Thibes Kemer’s Post
More Relevant Posts
🚀 pytest-capquery 0.3 is live!

This release focused on developer experience (DX). We've officially introduced automated SQL snapshot testing, inspired by the Jest framework. Instead of manually hardcoding and maintaining massive SQL strings in your Python tests, you can now seamlessly generate and validate physical .sql execution baselines with zero friction.

To dive deeper into the "why," I've just published a new article breaking down the reality of database performance in production. The post covers:
- 🚨 A painfully familiar SRE late-night "novel"
- 🏢 The cultural divide between developers and DBAs
- 🛡️ Common architectural pitfalls (like the Python GC trap and the JOIN illusion)
- 💡 How pytest-capquery bridges the gap, complete with a Getting Started guide

You can read the full breakdown here: https://lnkd.in/dJzBQ8nV

If you care about engineering excellence, catching N+1 regressions in CI, and building robust backend systems, I invite you to check out the repository! Follow the project, drop a star, or open a PR. Together we can do more! 🤝

🔗 https://lnkd.in/d9EJgd8V

#Python #SQLAlchemy #Pytest #SRE #EngineeringExcellence #OpenSource #DatabasePerformance
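pytest-capquery's actual API isn't shown in the post; purely to illustrate the Jest-style snapshot pattern it describes, here is a hand-rolled sketch (the function name and whitespace normalization are my own, not the library's):

```python
from pathlib import Path

def assert_matches_snapshot(sql: str, snapshot_path: Path) -> None:
    """Jest-style snapshot check: record a .sql baseline on the first run,
    fail on any later run whose captured SQL drifts from that baseline."""
    normalized = " ".join(sql.split())  # collapse whitespace for stable comparisons
    if not snapshot_path.exists():
        snapshot_path.write_text(normalized)  # first run: write the baseline
        return
    baseline = snapshot_path.read_text()
    if baseline != normalized:
        raise AssertionError(
            f"SQL drifted from snapshot {snapshot_path.name}:\n"
            f"  expected: {baseline}\n"
            f"  got:      {normalized}"
        )
```

The appeal of the pattern is that the baseline lives in a reviewable file: an ORM upgrade that silently rewrites a query shows up as a snapshot diff in the PR instead of a production regression.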
dbt-gizmosql — a month of new capabilities

Six releases shipped for dbt-gizmosql, the dbt adapter for GizmoSQL (an Apache Arrow Flight SQL engine backed by DuckDB). The headline: the adapter now supports a lot of things it just... didn't before.

→ Python models (brand new!)
Write dbt transformations in Python. dbt.ref() / dbt.source() pull upstream tables as Arrow; return a DuckDB relation, pandas DataFrame, or PyArrow Table and the result is shipped back to the server via ADBC bulk ingest. Incremental Python models are supported too.

→ session.remote_sql() — server-side pushdown for Python models
Python models run client-side, so dbt.ref('big_table') streams the whole upstream table across the wire before your code sees it. The new remote_sql() escape hatch runs SQL directly on GizmoSQL and returns only the result — the filter executes server-side:

    def model(dbt, session):
        schema = dbt.this.schema
        return session.remote_sql(
            f"select * from {schema}.big_table where name = 'Joe'"
        )

→ External materialization — write straight to files
A new 'external' materialization issues a server-side COPY to Parquet, CSV, or JSON, anywhere the server's DuckDB can reach: local disk, s3://, gs://, azure://, MinIO. Supports partitioning, codecs, and format inference. The result is ref()-able downstream.

→ Microbatch incremental strategy
Time-windowed incrementals via dbt's microbatch strategy — reprocess a recent event_time window each run, with automatic batching.

→ Snapshot merge rewritten around MERGE BY NAME
Snapshots now use DuckDB's native MERGE ... UPDATE / INSERT BY NAME — more robust to column reordering and far clearer than a hand-rolled merge.

→ Much faster seed loading
Seeds are now read with DuckDB's CSV reader (correct null handling, proper type inference) and bulk-ingested as Arrow via ADBC instead of row-by-row INSERTs.

A shout-out to ADBC

None of this would be practical without ADBC (Arrow Database Connectivity). It gives the adapter a columnar, zero-copy path to the server: seeds and Python-model results ship as Arrow record batches, ref() pulls land as Arrow tables, and remote_sql() streams Arrow results back. It's the reason the adapter can move real data volumes without the usual row-by-row ODBC/JDBC tax. Huge thanks to the Apache Arrow community and the adbc-driver-flightsql maintainers.

And finally — the best GizmoSQL features come from the user community. GizmoData thanks our users for their great feedback and engagement!

pip install dbt-gizmosql
https://lnkd.in/ewKGMUCe

#dbt #duckdb #dataengineering #apachearrow #adbc #flightsql
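dbt computes microbatch windows internally; as a rough stdlib sketch of what time-windowed microbatching means (the function name is illustrative, not the adapter's API), each run splits the recent event_time range into fixed batches and reprocesses them independently:

```python
from datetime import datetime, timedelta

def microbatch_windows(start: datetime, end: datetime,
                       batch_size: timedelta = timedelta(days=1)):
    """Split [start, end) into fixed event_time windows; each window can be
    reprocessed on its own, which is the core idea of dbt's microbatch strategy."""
    windows = []
    cursor = start
    while cursor < end:
        # The last window is clipped so nothing past `end` is processed.
        windows.append((cursor, min(cursor + batch_size, end)))
        cursor += batch_size
    return windows
```

A failed batch can then be retried alone instead of rerunning the whole incremental model.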
💻 What "Structured Coding" Means with Claude

Claude is strongest when:
- The problem is clearly defined
- Input → output is well structured
- You want clean, production-style code

👉 Think: APIs, backend logic, data processing, automation scripts.
NOT: messy brainstorming code or vague "build something cool" prompts.

🧠 Why Claude is good at this:
- Follows instructions very strictly
- Writes clean, readable code
- Handles large codebases (huge context window)

🔧 Example: Build a Flask API (structured backend)

📝 Prompt you give Claude:

    Build a Flask API for user login with:
    - JWT authentication
    - Password hashing
    - In-memory database
    - Endpoints: /register, /login, /profile

💡 Claude output (typical quality):

    from flask import Flask, request, jsonify
    import jwt, datetime
    from werkzeug.security import generate_password_hash, check_password_hash

    app = Flask(__name__)
    app.config['SECRET_KEY'] = 'secret'
    users = {}

    @app.route('/register', methods=['POST'])
    def register():
        data = request.json
        users[data['email']] = generate_password_hash(data['password'])
        return jsonify({"message": "User registered"})

    @app.route('/login', methods=['POST'])
    def login():
        data = request.json
        user = users.get(data['email'])
        if user and check_password_hash(user, data['password']):
            token = jwt.encode({
                'email': data['email'],
                'exp': datetime.datetime.utcnow() + datetime.timedelta(hours=1)
            }, app.config['SECRET_KEY'])
            return jsonify({"token": token})
        return jsonify({"error": "Invalid credentials"}), 401

Flow: register → store hashed password → login → verify → generate token → access system.

#Python #DataScientist #DataAnalyst #CS #IT #BCA #MCA
I just published spark-perf-lint to PyPI: the first dedicated Apache Spark performance linter for the Python ecosystem, with built-in pre-commit hooks, CI/PR annotations, and deep audit capabilities.

PyPI: https://lnkd.in/gu_qd5yB
Live webpage: https://lnkd.in/gje5sMac
GitHub: https://lnkd.in/g6WF8-Yn

    pip install spark-perf-lint

One command. That's all it takes for any PySpark team in the world to start catching performance anti-patterns before they reach production.

There are 500,000+ PySpark projects on GitHub. Thousands of organizations run Spark ETL pipelines processing billions of rows daily. Yet until today, there was no dedicated Spark performance linter available on PyPI. We have linters for Python style, type safety, security, even framework-specific rules for Django and FastAPI. But the framework that processes more data than almost anything else in the enterprise? Nothing. Today that changes.

What I've contributed to the ecosystem:

→ 93 Spark-specific performance rules — not generic Python lint. Every rule understands Spark internals: how the Catalyst optimizer works, when shuffles happen, what causes data skew, which join strategy Spark will choose and why.

→ 11 dimensions of coverage — cluster configuration, shuffle optimization, join strategy, partitioning, data skew, caching lifecycle, I/O and file formats, AQE tuning, UDF patterns, Catalyst optimizer, and monitoring gaps. This is the most comprehensive Spark performance rule set available anywhere, open source or commercial.

→ Every finding comes with a fix — not "consider optimizing." Actual before/after code. Specific config changes. A Spark-internals explanation of why the anti-pattern hurts. Estimated performance impact. Effort level to fix.

→ A complete knowledge base — 50+ Spark patterns with decision matrices. When to use broadcast join vs sort-merge vs bucket join. How to size partitions. When to cache vs checkpoint. This alone is worth reading even if you never run the linter.

→ Three tiers of analysis — pre-commit hook, CI/PR analysis, and deep audit.

Why this matters at scale:

A single missed .collect() without a filter can OOM your driver with 200M rows. A join on a low-cardinality column can create straggler tasks that run 50x longer than the median. The default spark.sql.shuffle.partitions=200 on a 500 GB dataset creates 200 partitions of 2.5 GB each: guaranteed spills and GC pressure.

These bugs don't show up in dev. They don't show up in code review. They show up at 2 AM when your production pipeline fails and the on-call engineer is staring at a Spark UI full of red.

spark-perf-lint catches them at commit time. Before they ever run on a cluster. Before they cost compute. Before they wake anyone up.

Try it today!

#ApacheSpark #PySpark #DataEngineering #OpenSource #Python #PerformanceEngineering #ETL #BigData #DevTools #PreCommit #PyPI #ClaudeCode #Fintech #DataPipelines
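The shuffle-partition arithmetic in the post checks out; a tiny helper makes it explicit (the 128 MB target per partition is a common rule of thumb, not a spark-perf-lint API):

```python
def recommended_shuffle_partitions(dataset_bytes: int,
                                   target_partition_bytes: int = 128 * 1024**2) -> int:
    """Choose spark.sql.shuffle.partitions so each partition lands near the
    target size, instead of inheriting the stock default of 200."""
    return max(1, -(-dataset_bytes // target_partition_bytes))  # ceiling division

# The stock default of 200 partitions on a 500 GB shuffle:
per_partition_gb = 500 * 1024**3 / 200 / 1024**3  # 2.5 GB per partition
better = recommended_shuffle_partitions(500 * 1024**3)  # thousands, not 200
```

At 2.5 GB per partition each task's working set dwarfs typical executor memory fractions, which is exactly the spill-and-GC scenario the post describes.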
I spent the last few weeks building Pipekit — a distributed data pipeline orchestrator built from scratch in Python. Here's what I built and what I learned.

WHAT IS PIPEKIT?

A simplified Apache Airflow, built from first principles to deeply understand how pipeline orchestrators actually work under the hood.

You define tasks with a decorator:

    @task(retries=3)
    def fetch_data():
        return "raw_data"

Express dependencies in one line:

    merge.depends_on(fetch_users, fetch_orders, fetch_products)

And Pipekit handles the rest.

WHAT IT CAN DO

- DAG execution — Kahn's algorithm resolves dependencies automatically
- True parallelism — tasks in the same wave run across multiple Celery workers simultaneously
- State machine — every task tracked: pending → running → success / failed
- Persistent state — full audit trail in PostgreSQL
- Exponential backoff retry — 2s, 4s, 8s between attempts
- Artifact passing — task outputs flow automatically to downstream tasks
- REST API — trigger pipelines over HTTP (FastAPI)
- CLI tool — pipekit run, pipekit status
- Cron scheduler — pipelines run automatically on a schedule
- Cycle detection — raises an error immediately for circular dependencies

PROOF OF PARALLELISM

3 tasks × 2 seconds each:
- Sequential → 6.2 seconds
- Pipekit → 2.04 seconds

ForkPoolWorker-7, ForkPoolWorker-8, and ForkPoolWorker-1 all picked up tasks at the same timestamp. That's real parallelism — not threads, actual separate processes.

TECH STACK

- Python (core)
- FastAPI (REST API)
- PostgreSQL (persistent state)
- Redis (message broker)
- Celery (distributed workers)
- APScheduler (cron scheduling)
- Click (CLI)

THE BIGGEST THING I LEARNED

Reliability in distributed systems comes from state. Without persistent state, a crash means you lose everything: you don't know what ran, what failed, or where to restart from. With state, you can observe, recover, and retry. That's the insight behind every production orchestrator, every database, every message queue. It's all about managing state reliably.

Building this from scratch taught me more about distributed systems than any course ever did. If you're learning backend or distributed systems, I highly recommend building something like this from scratch. You understand it on a completely different level when you've written every line yourself.

Project is on GitHub 👇
https://lnkd.in/gHx4Rtjp
Website: https://lnkd.in/g4xC5RsN

#Python #DistributedSystems #BackendEngineering #BuildInPublic #SoftwareEngineering #DataEngineering #OpenSource
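The DAG execution the post describes (Kahn's algorithm grouping tasks into parallel "waves") can be sketched in a few lines of plain Python; this is my illustration of the technique, not Pipekit's actual code:

```python
def kahn_waves(deps):
    """Kahn's algorithm grouped into 'waves': every task in a wave has all of
    its upstream tasks finished, so a whole wave can run in parallel.
    `deps` maps task name -> set of upstream task names."""
    indegree = {t: len(ups) for t, ups in deps.items()}
    downstream = {t: set() for t in deps}
    for t, ups in deps.items():
        for u in ups:
            downstream[u].add(t)
    ready = [t for t, d in indegree.items() if d == 0]
    waves = []
    while ready:
        waves.append(sorted(ready))
        nxt = []
        for t in ready:
            for d in downstream[t]:
                indegree[d] -= 1
                if indegree[d] == 0:
                    nxt.append(d)
        ready = nxt
    # Leftover tasks with nonzero indegree mean a cycle, which an
    # orchestrator must reject up front (as Pipekit's cycle detection does).
    if sum(len(w) for w in waves) != len(deps):
        raise ValueError("cycle detected in task graph")
    return waves
```

With fetch_users and fetch_orders feeding merge, wave 1 holds both fetches (runnable in parallel) and wave 2 holds merge, which matches the post's 2-second wall clock for 3 × 2-second tasks.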
✅ #PythonJourney | Day 154 — Test Suite Complete: 14 Tests, 100% Endpoint Coverage

Today: Completed the comprehensive test suite. Every API endpoint now has automated tests validating behavior, error handling, and authentication.

Key accomplishments:

✅ Full test coverage (14 tests):
• Health check: 1 test
• Create URL: 4 tests (success, invalid format, no auth, invalid auth)
• List URLs: 3 tests (empty, with data, no auth)
• Get URL details: 2 tests (success, not found)
• Delete URL: 2 tests (success, not found)
• Get analytics: 2 tests (success, not found)

✅ Testing patterns implemented:
• Fixture-based setup (conftest.py)
• Isolated database per test
• Mock user creation
• Authentication validation
• Error condition testing
• Status code verification

✅ All edge cases covered:
• Valid requests return proper responses
• Invalid inputs rejected with 422
• Missing auth returns 401
• Non-existent resources return 404
• Successful deletes return 204
• Analytics properly calculated

✅ Test execution:
• 14 passed in 2.51s
• Zero flaky tests
• All database operations isolated
• Clean setup and teardown

What I learned today:
→ Comprehensive testing catches edge cases early
→ Fixtures reduce boilerplate and improve maintainability
→ Test isolation prevents hidden dependencies
→ Fast tests enable rapid development cycles
→ Good test names document expected behavior

The test suite now validates:
- ✅ API contract (request/response format)
- ✅ Authentication (API key validation)
- ✅ Authorization (users see only their data)
- ✅ Error handling (proper HTTP status codes)
- ✅ Business logic (URL creation, deletion)
- ✅ Data persistence (database operations)

This is production-grade testing:
- Every endpoint tested
- Every error case covered
- Fast feedback on code changes
- Confidence to refactor safely
- Documentation through tests

Current status:
- ✅ Backend: Production-ready
- ✅ Tests: 14/14 passing (100%)
- ✅ Code coverage: All endpoints
- ✅ API: Fully validated
- ⏳ Deployment: Next (GCP)

From zero to production-grade in 154 days. The backend is ready for real-world use. Next: deploy to Google Cloud Platform (GCP).

#Python #Testing #Pytest #Backend #API #Quality #SoftwareDevelopment #TDD #Production
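The "isolated database per test" pattern from the post can be shown with stdlib pieces only; under pytest this generator would sit in conftest.py behind @pytest.fixture, and the table name here is assumed from the project description:

```python
import sqlite3

def isolated_db():
    """Generator-style 'fixture': a fresh in-memory DB per test,
    torn down after the test finishes (pytest drives the same protocol)."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE urls (id INTEGER PRIMARY KEY, target TEXT)")
    try:
        yield con  # the test body runs while we're suspended here
    finally:
        con.close()  # teardown always runs, even if the test fails

def test_create_url():
    gen = isolated_db()   # pytest would inject this as a parameter instead
    con = next(gen)
    con.execute("INSERT INTO urls (target) VALUES ('https://example.com')")
    assert con.execute("SELECT COUNT(*) FROM urls").fetchone()[0] == 1
    gen.close()           # triggers the finally-block teardown
```

Because every test gets its own connection and schema, no test can see rows left behind by another one, which is what makes "zero flaky tests" achievable.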
Just shipped a distributed systems project. And honestly, I learned more debugging this than in any project I've ever done 😅

I built a search recommendation engine from scratch: the kind of system that tracks what you click on and uses that data to re-rank future search results in real time.

The stack:
→ FastAPI for the APIs
→ Kafka as the message queue (click events flow through here)
→ PostgreSQL to store click scores
→ Redis to cache hot results
→ Docker Compose to wire everything together

The whole flow: user clicks a result → event goes to Kafka → stream processor reads it → updates scores in Postgres → next search returns re-ranked results, with Redis serving the hot ones in <1ms.

Sounds clean, right? It was NOT clean getting there at all 😭

The issues I ran into (and how I fixed them):

1. Kafka advertising the wrong address. My Python service kept getting "connection refused" even though Kafka was running. Turned out Kafka was advertising itself as localhost:9092, which inside Docker means the container's own localhost, not an address other containers can reach. Fixed it by setting KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:9092 so containers find each other by service name.

2. "depends_on" doesn't mean "ready". Docker's depends_on only waits for a container to start, not for the service inside to be ready. Kafka takes 20-30 seconds to fully boot, so my stream processor was connecting before Kafka was ready and crashing. I fixed it with proper Docker healthchecks plus retry logic in Python that actively probes Kafka every 5 seconds instead of blindly sleeping.

3. WSL2 dropping me into Docker's internal VM. Typing wsl dropped me into Docker Desktop's internal Linux VM instead of a real Ubuntu distro. Everything was broken — no sudo, no apt, wrong environment. Had to install Ubuntu properly via wsl --install -d Ubuntu and work from there.

4. Running Python outside Docker. I spent a while confused about why my local Python couldn't reach Kafka. The fix was containerising the Python services too, so everything runs on the same Docker network. Learned the hard way that localhost means something completely different inside vs. outside a container.

The most satisfying moment was watching the metrics endpoint show:
- cache hit rate: 57%
- kafka consumer lag: 0
- cache HIT latency: 0.6ms vs cache MISS: 2.3ms

That nearly 4x latency gap between Redis and PostgreSQL is real and measurable, not just theory anymore.

Would apply this to an actual application next...

Link to repo: https://lnkd.in/eDt65PvS

#softwareengineering #distributedsystems #kafka #redis #python #docker #buildinpublic #devops

Alright, keep scrollingggg :)
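The retry fix from issue 2 generalizes to any slow-booting dependency; here is a stdlib-only sketch of "actively probe instead of blindly sleep" (the function name and the generic ConnectionError are mine, not the Kafka client's API):

```python
import time

def wait_for_service(connect, attempts=6, delay=5.0):
    """Probe a dependency until it answers instead of sleeping blindly.
    Docker's depends_on only waits for the container to start, not for the
    service inside it (e.g. a Kafka broker) to be ready to accept clients."""
    last_err = None
    for _ in range(attempts):
        try:
            return connect()  # whatever 'connect' returns becomes our client
        except ConnectionError as err:
            last_err = err
            time.sleep(delay)  # broker may still be booting; try again
    raise RuntimeError(f"service never became ready: {last_err}")
```

Pairing this with a Docker healthcheck covers both sides: Compose delays dependent containers, and the application still survives a broker that restarts later.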
How we went from bespoke PySpark scripts to a composable, config-driven ETL framework — inspired by Rust's trait system.

The idea: separate infrastructure from business logic. YAML handles Spark tuning, Iceberg catalogs, and S3 shuffle config. Python mixins with priority-ordered hooks handle the rest — composable at runtime, reusable across pipelines.

A new pipeline looks like this:

    class DataCleaningETL(
        ProcessDateCLIMixin,
        WAPMixin,
        DeduplicationMixin,
        CleaningMixin,
        EnrichmentMixin,
        ComposableETL,
    ):
        pass

No Spark boilerplate. No copy-paste. Write-Audit-Publish, schema evolution, and Iceberg housekeeping are all handled by the framework.

We also use DuckDB as a drop-in for PySpark in unit tests — same DataFrame API, no JVM. Tests run in seconds instead of minutes.

Built on Apache Spark 4.0, Apache Iceberg, and Lakekeeper on Kubernetes at ZeroToOne.AI.

Full writeup: https://lnkd.in/gUUm6mH2
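The framework above composes PySpark hooks; to show the mechanism that makes priority-ordered mixins work (Python's method resolution order and cooperative super() calls), here is a stdlib-only sketch with made-up mixin bodies, not the framework's real classes:

```python
class ComposableETL:
    def run(self, rows):
        return rows  # base of the cooperative chain: no-op passthrough

class CleaningMixin:
    def run(self, rows):
        # super() here resolves via the subclass's MRO, not this class's bases.
        return [r.strip() for r in super().run(rows)]

class DeduplicationMixin:
    def run(self, rows):
        seen, out = set(), []
        for r in super().run(rows):  # cleaning runs first, per MRO order
            if r not in seen:
                seen.add(r)
                out.append(r)
        return out

class DataCleaningETL(DeduplicationMixin, CleaningMixin, ComposableETL):
    pass  # no boilerplate: behavior is composed entirely from the mixins
```

Listing DeduplicationMixin before CleaningMixin means dedup wraps cleaning, so " a " and "a" collapse to one row; swapping the order in the class statement changes the pipeline without touching any mixin, which is the "composable at runtime" property the post describes.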
Most data engineers lint their Python. Nobody lints their SQL.

You've got flake8 on every commit, formatters on your notebooks, type checkers on your APIs. But the SQL that actually touches production data? That gets eyeballed in a PR and merged with a "looks good to me."

I started running SQLFluff on a client's dbt project almost by accident. Someone mentioned it in a thread, I added it to the pre-commit config, and within a week I couldn't imagine working without it. It's open source. It supports most major warehouses (Redshift, BigQuery, Snowflake, Postgres) and has a dbt templater that actually handles Jinja SQL without losing its mind. Basic setup takes minutes. Tuning it for your team's conventions takes a bit longer, but that's time well spent.

What it catches isn't dramatic. Inconsistent casing across models. Ambiguous joins that pass review because everyone reads them differently. Implicit column references that'll break the next time someone touches the schema. None of it is catastrophic on its own. But messy SQL accumulates, and accumulated mess is where bugs hide.

The real value isn't even the linting itself. It's that your team stops arguing about SQL style in pull requests. The linter decides. You move on. And when you onboard someone new, they don't have to reverse-engineer your conventions from 50 different models: the rules are in the config. Run it in CI and it prevents style drift over time, not just during review. That matters more as your team grows.

We treat SQL like it's somehow exempt from the standards we apply to every other language in the stack. It shouldn't be. It's the language closest to your data, and it deserves at least the same rigor as the Python wrapping it. SQLFluff is free. There's no reason not to try it.