Schema-on-Read – Store first, decide later

Schema-on-Read means you don't define structure while storing data. You apply structure only when you read it.

- Data is stored in raw format.
- Schema is applied at query time.

#Example

You are collecting user activity logs; the raw records are stored as:

{"user_id": 101, "action": "click", "time": "2026-04-01"}
{"user": "102", "event": "purchase", "date": "April 1"}

Both records are stored as-is. When you read the data for an analytics use case:

- user_id – standardized
- action/event – mapped to one column
- time/date – converted to a proper date format

So during the query:

"user" → 102
"event" → "purchase"
"April 1" → proper date

#Key points:
- Data is messy when stored, but cleaned when used.
- Faster ingestion and very flexible with changing data.
- Extra work during queries and a risk of inconsistent results.
- Needs strong parsing logic at read time.

Follow Manish Kumar Singh

#dataengineering #snowflake #etl #sql #linkedin
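The read-time mapping above can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline; the target field name `event_date` and the assumption that a year-less date like "April 1" belongs to 2026 are mine, chosen to match the example records.

```python
import json
from datetime import datetime

# Raw records stored as-is, with inconsistent field names and date formats.
RAW_RECORDS = [
    '{"user_id": 101, "action": "click", "time": "2026-04-01"}',
    '{"user": "102", "event": "purchase", "date": "April 1"}',
]

def parse_date(value):
    """Try known formats; return an ISO date string or None."""
    for fmt in ("%Y-%m-%d", "%B %d"):
        try:
            parsed = datetime.strptime(value, fmt)
            # "April 1" carries no year; assume 2026 for this sketch.
            if parsed.year == 1900:
                parsed = parsed.replace(year=2026)
            return parsed.date().isoformat()
        except ValueError:
            continue
    return None

def apply_schema(raw):
    """Apply the schema at read time: one record in, one clean row out."""
    rec = json.loads(raw)
    return {
        # Standardize the user id: accept "user_id" or "user", coerce to int.
        "user_id": int(rec.get("user_id", rec.get("user"))),
        # Map "action"/"event" onto a single column.
        "action": rec.get("action", rec.get("event")),
        # Normalize "time"/"date" to a proper ISO date.
        "event_date": parse_date(rec.get("time", rec.get("date"))),
    }

rows = [apply_schema(r) for r in RAW_RECORDS]
for row in rows:
    print(row)
```

Note how all the cleaning logic lives in the read path; ingestion stays a plain append of raw JSON. That is exactly the trade-off in the key points: fast, flexible writes in exchange for heavier, consistency-critical reads.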
Well explained — this is exactly how modern data systems handle flexibility and scale 👌 #SchemaOnRead #BigData #DataEngineering #Analytics #DataLake #ETL #DataStrategy
Very good points. This touches on the key tenets of data lakehouses: the ability to store any data (structured, semi-structured, unstructured) and the ability to apply a schema either at read time or at write time.

BTW, you highlighted some of the Snowflake capabilities that motivated the person who coined the term "data lakehouse" back in 2017: https://www.garudax.id/pulse/fun-fact-i-coined-term-data-lakehouse-2017-jeremy-engle/

I wouldn't fully agree with the key point that such data is inherently "messy". Yes, it usually requires extra processing so consumers can use it more easily. But there are often many consumers with different requirements, and not all of them are known ahead of time. BI scenarios have different expectations than ML, so the pipelines (at least in their later stages) differ: BI might want a dimensional model with pre-computed aggregates, ML might want one big table with hundreds of columns, and AI might need semantic models... ❄️