Updating a database field in Python is not atomic. Even when it looks like it is.

The common pattern:

order.total += item.price
order.save()

In a concurrent system, it's a race condition waiting to happen. Here's what actually happens:

1. Django fetches order.total into Python memory, adds item.price, and writes the result back.
2. Between the read and the write, another request can fetch the same value, do the same addition, and save.
3. One update silently overwrites the other. No error. No warning.

Solution: F() expressions eliminate this entirely. F() tells Django to write the computation directly into the SQL. One query. No read. No Python. No race condition.

The trap: after an F() update, the Python instance still holds the old value. Accessing order.total returns stale data until refresh_from_db() is called. This catches engineers off guard every time.

Takeaway:
-> field += value in Python → read → compute → write → race condition under concurrency
-> F('field') + value → single atomic SQL update → no Python read → no race condition
-> After an F() update → Python instance is stale → always call refresh_from_db()
-> Works in update(), save(), annotate(), not just single-instance updates

Where have you hit silent data corruption from concurrent updates? F() or something else?

#Python #Django #BackendDevelopment #SoftwareEngineering
Preventing Silent Data Corruption in Django with F() Expressions
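A minimal sketch of the difference, assuming a Django project with an Order model that has a numeric total field and an item object with a price (names taken from the post, not from a real codebase):

```python
from django.db.models import F

# Race-prone read-modify-write: two concurrent requests can read the same
# total, and the later save() silently overwrites the earlier one.
order.total += item.price
order.save()

# Atomic version: the addition happens inside a single SQL UPDATE,
# so the database serializes concurrent increments correctly.
order.total = F("total") + item.price
order.save(update_fields=["total"])

# The in-memory instance now holds an F() expression, not the new number.
# Re-fetch before reading the field again.
order.refresh_from_db()
print(order.total)

# The same expression works in bulk updates too:
Order.objects.filter(pk=order.pk).update(total=F("total") + item.price)
```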
More Relevant Posts
my CLI search took 4.3 seconds for 5000 files. the obvious fix was numpy. i shipped it. it got slower.

the synthetic benchmark said numpy was 300x faster on the cosine loop (N=1000, D=3072). the real test on a fresh subprocess said it was 60% SLOWER at N=5000.

two hidden costs the synthetic benchmark never saw:

1) import numpy takes ~180ms cold on macOS. my CLI paid that on every invocation, even vemb --version.
2) np.asarray on a 5000 x 3072 list-of-lists copies 15M Python floats. 2 full seconds at scale.

so to save 1 second of python cosine, i was paying ~2.2 seconds of overhead.

the real bottleneck wasn't the compute. it was the JSON cache. json.loads on a 317 MB file takes 2-3 seconds by itself. and then numpy has to re-unpack the python objects into a contiguous array.

the fix wasn't a faster loop. it was a different cache format. binary .npy matrix + a small JSON manifest. vectors pre-normalized on write, so cosine reduces to a dot product at query time. np.load can mmap the whole matrix in milliseconds with zero copy.

what shipped:
N=200: tied
N=1000: 1.5x faster
N=5000: 2.6x faster
cache size: 317 MB to 61 MB

the principle: for short-lived python CLIs, your hot-path compute is rarely the bottleneck. it's IO, deserialization, or imports. measure the whole command in a fresh subprocess before you optimize the kernel.

the synthetic benchmark was lying. the real benchmark wasn't.
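Roughly what that cache format could look like. This is a sketch under the assumptions in the post; the function names, file names, and list-of-lists input are illustrative, not the actual vemb code:

```python
import json
import numpy as np

def write_cache(embeddings, ids, npy_path="vectors.npy", manifest_path="manifest.json"):
    # Pay the conversion cost once, at write time, and pre-normalize rows
    # so cosine similarity at query time reduces to a dot product.
    matrix = np.asarray(embeddings, dtype=np.float32)
    matrix /= np.linalg.norm(matrix, axis=1, keepdims=True)
    np.save(npy_path, matrix)
    with open(manifest_path, "w") as f:
        json.dump({"ids": ids}, f)

def search(query_vec, npy_path="vectors.npy", top_k=10):
    # mmap_mode="r" maps the file instead of copying it into memory,
    # so a short-lived CLI invocation starts in milliseconds.
    matrix = np.load(npy_path, mmap_mode="r")
    q = np.asarray(query_vec, dtype=np.float32)
    q /= np.linalg.norm(q)
    scores = matrix @ q          # cosine similarity, since rows are unit-length
    return np.argsort(scores)[::-1][:top_k]
```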
🚀Python Series – Day 26: JSON in Python (Handle API Data Like a Pro!)

Yesterday, we learned APIs in Python🌐
Today, let’s learn how Python works with the most common data format used in APIs: JSON

What is JSON?
JSON stands for JavaScript Object Notation. It is a lightweight format used to store and exchange data.
📌 JSON is easy for humans to read and easy for machines to understand.

🔹 Where is JSON Used?
✔️ APIs
✔️ Web applications
✔️ Config files
✔️ Data exchange between systems

💻 Example of JSON Data

{
  "name": "Mustaqeem",
  "age": 24,
  "skills": ["Python", "SQL", "Power BI"]
}

💻 Convert JSON to Python Dictionary

import json

data = '{"name":"Ali","age":22}'
result = json.loads(data)
print(result)
print(result["name"])

🔍 Output:
{'name': 'Ali', 'age': 22}
Ali

💻 Convert Python Dictionary to JSON

import json

student = {
    "name": "Sara",
    "age": 23
}
json_data = json.dumps(student)
print(json_data)

🔍 Output:
{"name": "Sara", "age": 23}

🎯 Why JSON is Important?
✔️ Used in almost every API
✔️ Easy data exchange format
✔️ Important for Web Development
✔️ Must-know for Data Science projects

⚠️ Pro Tip
👉 Learn dictionary concepts well, because JSON looks similar to Python dictionaries.

🔥 One-Line Summary
👉 JSON = Standard format to store and exchange data

📌 Tomorrow: SQL with Python (Connect Python with Databases!)

Follow me to master Python step-by-step 🚀

#Python #JSON #API #WebDevelopment #DataScience #Coding #Programming #LearnPython #MustaqeemSiddiqui
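The post converts between JSON strings and dicts; for the config-file use case it mentions, you usually go through files with json.dump() and json.load() instead. A small extra example, with a made-up file name:

```python
import json

config = {"theme": "dark", "retries": 3}

# Write a Python dict to a JSON file (e.g. a config file)
with open("config.json", "w") as f:
    json.dump(config, f, indent=2)

# Read it back into a Python dict
with open("config.json") as f:
    loaded = json.load(f)

print(loaded["theme"])  # dark
```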
Django's ORM can't do complex filtering with chained .filter() calls alone. Filter chaining works for AND conditions. For OR conditions, for negations, for dynamic query composition - it silently does the wrong thing or fails entirely.

Q() objects are Django's answer - composable query expressions that map directly to SQL boolean algebra.

1. Each Q() object wraps a condition.
2. Conditions combine using Python operators - & → SQL AND, | → SQL OR, ~ → SQL NOT
3. Django takes the composed expression tree and compiles it into a single WHERE clause.

The trap - operator precedence. Python evaluates & before | - exactly like multiplication before addition.

Takeaway:
-> .filter().filter() → always AND → no OR, no NOT without Q()
-> Q() & Q() → AND, Q() | Q() → OR, ~Q() → NOT → maps directly to SQL boolean algebra
-> & binds tighter than | → always use parentheses in complex expressions
-> Q() objects are composable Python - build query logic dynamically without raw SQL

What's the most complex dynamic filter you've had to build? Did Q() hold up or did you reach for raw SQL?

#Python #Django #BackendDevelopment #SoftwareEngineering
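A hedged sketch of how this looks in practice, assuming hypothetical Order and Product models and field names:

```python
from django.db.models import Q

# OR and NOT, which plain .filter() chaining cannot express
refundable = Order.objects.filter(
    Q(status="paid") | Q(status="partially_paid"),
    ~Q(customer__country="US"),
)

# Precedence trap: & binds tighter than |, so these are different queries
implicit = Order.objects.filter(Q(status="new") | Q(priority=1) & Q(archived=False))
# means: status="new" OR (priority=1 AND archived=False)
explicit = Order.objects.filter((Q(status="new") | Q(priority=1)) & Q(archived=False))
# means: (status="new" OR priority=1) AND archived=False

# Dynamic composition without raw SQL
search_terms = ["usb", "cable"]  # hypothetical user input
q = Q()
for term in search_terms:
    q |= Q(name__icontains=term)
products = Product.objects.filter(q)
```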
🧠 Python Concept: dataclasses (Clean Data Models)

Write less boilerplate code 😎

❌ Traditional Class

class User:
    def __init__(self, name, age):
        self.name = name
        self.age = age

    def __repr__(self):
        return f"User(name={self.name}, age={self.age})"

👉 More boilerplate
👉 Repetitive code

✅ Pythonic Way (dataclass)

from dataclasses import dataclass

@dataclass
class User:
    name: str
    age: int

👉 Automatically generates:
__init__
__repr__
__eq__

🧒 Simple Explanation
Think of it like a shortcut
➡️ You define data
➡️ Python builds the rest

💡 Why This Matters
✔ Cleaner code
✔ Less boilerplate
✔ Easier to maintain
✔ Used in real-world apps

⚡ Bonus Example

@dataclass
class User:
    name: str
    age: int = 18

👉 Default values supported 😎

🧠 Real-World Use
✨ API models
✨ Config objects
✨ Data handling

🐍 Write less code
🐍 Let Python do the work

#Python #AdvancedPython #CleanCode #SoftwareEngineering #BackendDevelopment #Programming #DeveloperLife
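To make the "automatically generates" claim concrete, here is a small runnable check of the generated __repr__, __eq__, and default value, reusing the post's example class:

```python
from dataclasses import dataclass

@dataclass
class User:
    name: str
    age: int = 18

a = User("Sara", 23)
b = User("Sara", 23)

print(a)            # User(name='Sara', age=23)  <- generated __repr__
print(a == b)       # True                       <- generated __eq__, field by field
print(User("Ali"))  # User(name='Ali', age=18)   <- default value applied
```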
I had a Python UDF that was slow. Everyone told me to switch to a Pandas UDF. I switched. It got faster.

I didn't stop there, which is where this gets interesting.

I spent a weekend benchmarking the Arrow serialization overhead across different schema widths and batch sizes because I wanted to actually understand what I was paying for. Here is what I found.

On a narrow schema, 4 columns, a Pandas UDF with the default batch size of 10,000 records was 6.2x faster than the Python UDF. The serialization cost was trivial relative to the computation savings.

On a wide schema, 180 columns, the Pandas UDF at the default batch size was 2.1x faster. Still better. But the Arrow conversion was now a meaningful fraction of total execution time because converting 180 columns per batch is not free.

When I dropped the batch size on the wide schema to 2,000 records, peak memory per conversion dropped and the job stopped spilling to disk on the executor with the largest partition. Total job time: 1.7x faster than the wide-schema default. A 23% improvement just from tuning spark.sql.execution.arrow.maxRecordsPerBatch.

The configuration nobody sets: spark.sql.execution.arrow.pyspark.enabled=true. This is separate from Pandas UDFs. It accelerates toPandas() and createDataFrame() globally. Every time you collect to pandas interactively, you are either paying the Arrow overhead or the row-by-row serialization overhead. Arrow is always cheaper. It is not on by default in all environments.

I set that flag. I set it in every cluster config I control. I set it so reflexively now that I had to think to remember whether it was a default or a choice.

The point is not to memorize my numbers. Your schema is different. My point is that I ran the experiment and found a 23% improvement by changing one integer. You have not run the experiment. Run it. The number is different for your schema. Find it.
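A sketch of the two knobs discussed above, with a toy pandas UDF and made-up data; the 2,000-record batch size is the post's number for one particular schema, not a universal recommendation:

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = (
    SparkSession.builder
    # Accelerates toPandas() / createDataFrame() globally, independent of pandas UDFs
    .config("spark.sql.execution.arrow.pyspark.enabled", "true")
    # Smaller batches = lower peak memory per Arrow conversion; tune per schema
    .config("spark.sql.execution.arrow.maxRecordsPerBatch", "2000")
    .getOrCreate()
)

@pandas_udf("double")
def normalize(v: pd.Series) -> pd.Series:
    # Runs vectorized over a whole Arrow batch instead of row by row
    return (v - v.mean()) / v.std()

df = spark.range(100_000).selectExpr("cast(id as double) as value")
df.select(normalize("value").alias("z")).show(5)
```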
OrJSON looks like a small optimization. Until you realize how much time your API spends just serializing JSON.

In many Python APIs, the bottleneck isn’t only the database or the LLM. Sometimes it’s the most invisible step: turning Python objects into JSON.

What is OrJSON?
A high-performance JSON library for Python, written in Rust. It replaces the default json module and focuses on one thing: speed.

It:
→ serializes faster
→ deserializes faster
→ supports dataclass, datetime, numpy, UUID out of the box
→ returns bytes instead of str

So what’s happening under the hood? The idea is simple: optimize the hottest path in your API.
→ less overhead per operation
→ less work per payload
→ faster UTF-8 writing

And it shows. In its own benchmarks:
→ dumps() can be ~10x faster than json
→ loads() can be ~2x faster

Where this actually matters:
→ large payloads
→ APIs returning a lot of JSON
→ RAG metadata, events, telemetry
→ long lists

Now the part most people ignore: trade-offs.
→ orjson.dumps() returns bytes, not str
→ no built-in file read/write helpers
→ not always a perfect drop-in replacement
→ holds the GIL during serialization

So when should you use it?
→ large responses
→ heavy metadata
→ serialization shows up in profiling

And when won’t it help?
→ DB is your bottleneck
→ LLM latency dominates
→ responses are small
→ network / I/O dominates

OrJSON won’t magically make your API fast. But if serialization is on your hot path, it’s one of the highest ROI optimizations you can make.
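A minimal sketch of the bytes-not-str point and the native type support; the payload fields are made up:

```python
import datetime
import uuid

import orjson

payload = {
    "id": uuid.uuid4(),                                        # UUID handled natively
    "created_at": datetime.datetime.now(datetime.timezone.utc),  # datetime handled natively
    "scores": [0.12, 0.98, 0.45],
}

# dumps() returns bytes, not str - most web frameworks accept bytes directly
body: bytes = orjson.dumps(payload)

# loads() accepts bytes or str
data = orjson.loads(body)

# If you genuinely need a str (e.g. for logging), decode explicitly
text = body.decode()
```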
✅ *Python Basics: Part-5*
*Lambda Functions & Built-in Functions* ⚡🧠

🎯 *1. Lambda Functions*
A *lambda* is an anonymous, one-line function.

Syntax:
```python
lambda arguments: expression
```

🔹 *Example:*
```python
add = lambda x, y: x + y
print(add(3, 4))  # Output: 7
```

Use-case: Often used with functions like `map()`, `filter()`, and `sorted()` (see the example below the post).

🎯 *2. Built-in Functions (Must-Know)*

🔹 `len()` – Returns the length of an object
```python
len("Hello")  # 5
```

🔹 `type()` – Shows the type of a variable
```python
type(10)  # <class 'int'>
```

🔹 `int()`, `float()`, `str()` – Type conversion
```python
int("5")  # 5
```

🔹 `input()` – Takes user input
```python
name = input("Enter your name: ")
```

🔹 `range()` – Generates a sequence of numbers
```python
for i in range(5):
    print(i)  # 0 to 4
```

🔹 `sum()` – Adds up values in a list
```python
sum([1, 2, 3])  # Output: 6
```

🔹 `max()`, `min()` – Find largest/smallest
```python
max([10, 3, 8])  # 10
```

💬 *Double Tap ❤️ for Part-6!*
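A small follow-up sketch showing lambda together with `map()`, `filter()`, and `sorted()`, since that is where it is most often used in practice (sample data is made up):

```python
# lambda as a throwaway key/predicate function
nums = [5, 2, 9, 1]

squares = list(map(lambda x: x * x, nums))         # [25, 4, 81, 1]
evens = list(filter(lambda x: x % 2 == 0, nums))   # [2]
descending = sorted(nums, key=lambda x: -x)        # [9, 5, 2, 1]

people = [("Ali", 22), ("Sara", 23), ("Mustaqeem", 24)]
youngest_first = sorted(people, key=lambda p: p[1])  # sort tuples by age
print(youngest_first)
```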
Here's what actually happens when you run a PySpark script:

Two separate processes start.
◉ A Python process — this is the Python driver
◉ A JVM process — this is the actual Spark driver

They talk through a library called Py4J. Python listens on port 25334. The JVM listens on port 25333. Every method call you write gets routed from Python to the JVM. The JVM driver asks the cluster manager to spawn the executors.

So when you type .read.load("file.parquet"), Python isn't reading anything. It's passing the instruction to the JVM, which does the real work.

The rest follows the normal Spark flow from there. Logical plan. Physical plan. Tasks are distributed to executors. All of it happens in the JVM, not in your Python process.

So far we've assumed all the logic can be handled in the JVM. What happens when you write a Python UDF?

◉ Spark spins up additional Python processes alongside the JVM executor
◉ Data gets serialized out of the JVM and sent over via IPC
◉ The Python process deserializes it, runs your function, and serializes the result
◉ Sends it back to the JVM

Python UDFs are generally slower than native Spark functions because they do not benefit from Spark's optimizations such as Catalyst (Spark’s query optimizer) and Project Tungsten (which improves memory usage by operating directly on binary data rather than Java objects). Another factor that hurts Python UDF performance is that each call operates on a single row at a time.

In Spark 3.5, Spark introduced Arrow-optimized Python UDFs, and the user can choose whether to use the feature. Both the JVM and the Python processes then handle data in the Arrow format, which bypasses the costly serialization and deserialization step. In addition, Arrow organizes in-memory data in a columnar fashion, which improves processing time compared to Pickle, which serializes data row by row.

If you find this helpful, please:
𖤘 Save
↻ Repost

#dataengineering #apachespark

--

If you like this piece, you might love my newsletter, which includes 180+ articles to help you become a "production-ready" data engineer. Join 𝟭𝟴,𝟬𝟬𝟬+ DEs here for 𝗙𝗥𝗘𝗘: https://vutr.substack.com/
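A hedged sketch of opting into Arrow-optimized Python UDFs. This assumes Spark 3.5+ and that the useArrow flag and the pythonUDF Arrow config are available in your environment; check your version's documentation before relying on them:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf

spark = (
    SparkSession.builder
    # Optionally enable Arrow serialization for all Python UDFs (Spark 3.5+)
    .config("spark.sql.execution.pythonUDF.arrow.enabled", "true")
    .getOrCreate()
)

# Per-UDF opt-in (Spark 3.5+): data crosses the JVM/Python boundary in
# Arrow batches instead of pickled rows
@udf(returnType="string", useArrow=True)
def shout(s):
    return s.upper() if s else s

df = spark.createDataFrame([("hello",), ("world",)], ["word"])
df.select(shout(col("word")).alias("loud")).show()
```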
⛓️💥 A silent patch in Airflow's metaclass breaks Python MRO. This is something I was not expecting as an Airflow plugin maintainer.

Recently I received a pull request to my airflow-clickhouse-plugin GitHub repo. A contributor found out that some plugin functionality is unusable because of a strange error:

>>> ClickHouseSQLExecuteQueryOperator(task_id='id', sql='SELECT 1')
❌ TypeError: Invalid arguments were passed to ClickHouseSQLExecuteQueryOperator (task_id: id). Invalid arguments were: **kwargs: {'sql': 'SELECT 1'}

Like, whaaaat? 😱 The only purpose of this operator is to literally execute SQL. The class definition is:

class ClickHouseSQLExecuteQueryOperator(
    ClickHouseBaseDbApiOperator,
    sql.SQLExecuteQueryOperator,
):
    pass

The __init__ method is not defined in ClickHouseBaseDbApiOperator. So it was pretty expected that it should call __init__ of SQLExecuteQueryOperator, which has that sql parameter!

It turned out that the reason was in BaseOperatorMeta, a metaclass that is used to create every Airflow operator class. Because of a dirty way of checking "Does this class override __init__?", it was injecting an unexpected __init__ into the chain.

Funny thing, this issue can be resolved as simply as adding a boilerplate __init__ method calling super().__init__(**kwargs). I have shared details of this investigation in my Medium post. What a sneaky bug it was! See it in the first comment.

💭 Any sneaky bugs you remember?

#Airflow #Python #PythonMRO
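Based on the fix described in the post, the workaround looks roughly like this. ClickHouseBaseDbApiOperator is shown here as a stand-in stub, not the plugin's real implementation:

```python
from airflow.providers.common.sql.operators import sql


class ClickHouseBaseDbApiOperator(sql.BaseSQLOperator):
    """Stand-in stub for the plugin's real base class (defines no __init__)."""


class ClickHouseSQLExecuteQueryOperator(
    ClickHouseBaseDbApiOperator,
    sql.SQLExecuteQueryOperator,
):
    # The boilerplate fix: an explicit __init__ here means Airflow's
    # BaseOperatorMeta sees an override in this class and no longer injects
    # its own __init__ into the MRO, so the `sql` kwarg reaches
    # SQLExecuteQueryOperator.__init__ as intended.
    def __init__(self, **kwargs):
        super().__init__(**kwargs)


# Now this no longer raises "Invalid arguments were passed":
# ClickHouseSQLExecuteQueryOperator(task_id='id', sql='SELECT 1')
```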