📓 Fusion Diaries: Python models in preview, Semantic Layer support, a brand new CSV parser, and lazy compilation in alpha

Author: Anders Swanson, Senior DX Advocate at dbt Labs

Hi everyone, it’s been 52 working days since the last Fusion diaries in November. So, where are we at a high level? Humor me with this simile:

Just as the Eastern seaboard of the US is emerging from a polar vortex with an eye on inevitable spring, the dbt Fusion engine team is emerging from the depths of complex long-tail bugs with an eye on inevitable General Availability!

But we’re getting ahead of our skis here; this is a Fusion diary! In the diarist spirit, we’d love to share the usual tranche of big rocks, work-in-progress, dragons (both at large and vanquished), what to read, and (as always) a meme.

TL;DR

Velocity

  • 302 issues closed as completed across the dbt-fusion and internal repos
  • 788 merged PRs
  • 37 new preview releases (preview.73 to preview.119)

Per usual, check out dbt-fusion’s CHANGELOG for the specifics.

Big rocks

  • Fusion-readiness for package ecosystem and dbt Hub’s compatibility!
  • New CSV parser
  • Preview: Python models for Snowflake, Databricks, and BigQuery
  • Preview: Semantic Layer support
  • Alpha: Lazy compilation
  • One more experimental, top secret thing… 🤫 ‼️

The final push to general availability

Eight months ago, we published this blog: The Path to GA: How the dbt Fusion engine rolls out from beta to production.

Looking back at it, I’m impressed with how prescient it was, especially this Joel Labes classic:

Did you know that there are also over a bajillion undocumented features of dbt?

If we were to add the below items to the bajillion undocumented features, the correct total would be closer to a gazillion.

  • Bugs we’ve found and fixed in Core (stories for another day)
  • Necessary detours to realize the vision of Fusion’s SQL understanding (we’ve learned so much from community feedback! More on that later)
  • Kooky corner cases where Rust doesn’t behave the way Python does. For example, Rust’s HashMaps are not insertion-ordered, but Python’s dicts are (by edict of the BDFL); see the sketch right after this list
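To make that last one concrete, here’s a minimal Python sketch of the ordering guarantee Fusion has to reproduce (the Rust side has to provide the same behavior deliberately, since its standard HashMap doesn’t):

```python
# Python dicts preserve insertion order (guaranteed since Python 3.7),
# and plenty of dbt behavior quietly depends on that guarantee.
config = {}
config["materialized"] = "table"
config["tags"] = ["finance"]
config["enabled"] = True

# Iteration yields keys in exactly the order they were inserted.
assert list(config) == ["materialized", "tags", "enabled"]

# Rust's std::collections::HashMap makes no such promise: iterating it
# can yield keys in any order. That's the kind of kooky corner case a
# Rust reimplementation of a Python tool has to paper over.
```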

In the coming weeks, we’ll share a clear-eyed picture of the remaining work. In the meantime, you can check out the dbt-fusion repo Milestones, which we’ll keep refining as we go.

Big rocks: What shipped since November

Package conformance and hub updates

Exciting update: all but two of the top 50 most popular packages now work on Fusion!

Not only that, but now the dbt Package hub (hub.getdbt.com) will provide you with Fusion compatibility information based not only on what the package maintainer specifies but also on tests that we now perform in the backend.

If you maintain a package, or are waiting for a package you depend on to become compatible with Fusion, check out this great guide.

On top of all of that, the dbt-autofix packages command has continued to improve since November. It auto-upgrades your packages to the lowest Fusion-compatible version available. Make sure you’ve got the latest version and give it a whirl!
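If you haven’t tried it yet, it’s a one-liner from your project root. The exact invocation below is an assumption (uvx is one common way to run it); check the dbt-autofix README for the current interface:

```sh
# Run dbt-autofix's package upgrader from your dbt project root.
# (Invocation is an assumption -- see the dbt-autofix README / --help.)
uvx dbt-autofix packages
```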

Model governance (versioning, contracts, and access groups)!

We’ve had most of the support for model governance for a while now. The last remaining piece, a model’s deprecation_date, just shipped, so we can call this done!

New, more tolerant CSV parser

In Fusion's earliest days, we chose arrow-csv as the CSV parser. It's highly performant and Arrow-native, which was great for integration into the static analysis pipeline. However, it is also much stricter than Python’s agate library, which was holding a lot of the community back from adopting Fusion!

Just as we did with YAML, we have introduced our own new Rust crate: dbt-csv!

This is no easy task; reading CSVs is notoriously difficult (see: So You Want To Write Your Own CSV code? and this very upsetting Community Slack thread). Instead of giving up, @Mengdi Lin leaned in to ship a brand new CSV parser that’s way more user-friendly and tolerant of the kooky problems that “real-world” CSVs inevitably have.
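To make that concrete, here’s a hypothetical seed file with the sort of real-world quirks (whitespace padding, embedded delimiters, a ragged row) that a strict parser tends to reject outright. The specific quirks shown are illustrative assumptions, not an exhaustive list of what dbt-csv handles:

```csv
id, name ,signup_date
1,"O'Brien, Pat",2024-01-05
2, Ada Lovelace , 2024-02-14
3,Grace Hopper
```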

If you’ve had trouble seeding .csv files, most issues should go away after upgrading to 2.0.0-preview.118. But reach out if you’re still seeing something weird!

The ticket below tracks the remaining work (it’s not much!).

In preview: Python models!

We haven't made a big announcement (yet), since we're still testing out this functionality in real-world projects with help from folks in the community.

There are a few known limitations, like this one:

We also need to ensure the equivalent changes land in dbt Core.

Beyond that, the big remaining piece of work before GA is figuring out how "static analysis" can/should work with Python models (which, for obvious reasons, can't be analyzed as SQL). See this discussion for more of the latest.

More than anything, the signal we need to call them “GA” is for more folks to try them out and tell us if they’re working as expected. So please give them a go and report back.
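If you haven’t written one before, a dbt Python model is just a .py file in your models directory that defines a model() function and returns a DataFrame. Here’s a minimal sketch (the upstream model and column names are hypothetical):

```python
# models/active_customers.py -- a minimal dbt Python model.

def model(dbt, session):
    # Configure the model, just as you would in YAML or a SQL config() block.
    dbt.config(materialized="table")

    # dbt.ref() returns a platform-native DataFrame
    # (e.g., Snowpark on Snowflake, PySpark on Databricks).
    customers = dbt.ref("stg_customers")

    # Whatever DataFrame you return is materialized as the model.
    return customers.filter(customers["is_active"] == True)
```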

In preview: Semantic Layer support in Fusion

The Semantic Layer team has worked tirelessly not only to get the Semantic Layer shipped in Fusion but, even more importantly, to ship updates to the spec that make authoring easier.

The new authoring experience and spec are planned for release in dbt Core 1.12, but they’re available now in Fusion.

Learn more in the blog post below, but note the related dragon later in this post with respect to dbt docs generate and catalog.json.

In alpha: “Lazy” compilation

In previous Fusion diaries, we’ve talked about “incremental compilation” in the VS Code extension. The use case is this:

  1. Some dbt projects take at least a minute to fully compile in Fusion because they are large and/or complicated
  2. This is faster than Core, but not as fast as users (and agents) would like

Our first swing at addressing this problem was incremental compilation.

You still run a full compile as usual, but after that first compile completes, Fusion only needs to check the file you've modified and its descendants. This was a big speedup: our internal dbt project took four minutes to fully compile, but only 10 seconds to incrementally recompile after changing one file.

The idea of lazy compilation is to drop that first full compile, and only ever have the VS Code extension and language server analyze the models that you're currently editing.

What’s shipped now is that when you open your first model in VS Code, that model (and its parents and children) are analyzed first. The full background compile waits until after the initial lazy compilation is complete.

The ultimate goal is to make feedback as performant and specific as possible to keep you in “flow state” (credit: 👑 Sung).

We aspire towards truly lazy compilation in which only edited models are analyzed, but this is a great first step.

To try this out, set DBT_LSP_LAZY_COMPILATION_ENABLED=1 in the VS Code extension (guide for how to do so). Let us know what you think!
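One way to set the variable, assuming the extension inherits your shell environment when VS Code is launched from a terminal (see the guide above for the extension-native approach):

```sh
# Enable lazy compilation for the dbt language server, then launch
# VS Code so the extension picks up the variable.
export DBT_LSP_LAZY_COMPILATION_ENABLED=1
code path/to/your/dbt/project
```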

🚧 Work in progress

Heads up on features currently underway

We hope that in the coming weeks, we’ll have the following available. We’ll give more of a shout once they’re closer to ready!

  • microbatch incremental models
  • dbt retry
  • --fail-fast

“Baseline mode”

Something we’re hoping to put in Fusion users' hands next week is “baseline mode”. We’ve observed that migrating to Fusion isn’t just moving your YAML around to be compatible with the new authoring layer. For some customers, it can be more involved, especially in these scenarios:

  1. Users don’t have access to all the sources and models of a large dbt project
  2. Projects with intricate Jinja-based workflows that involve post-hooks and a great deal of introspection
  3. Projects that make use of packages that aren’t yet Fusion compatible
  4. Models and sources that make use of advanced data types (STRUCT, ARRAY, GEOGRAPHY) or built-in functions (AI.PREDICT, JSON_FLATTEN, st_pointfromgeohash) that aren’t yet supported by the dbt-fusion engine

Because of how Fusion, the VS Code extension, and the Language Server are built today, projects in these scenarios can only get a very limited set of Fusion features.

So we went back to the drawing board with a goal of creating a smoother transition. Expect a discussion that goes deeper on this next week!

Dragons

Deferral!!!

There’s a dragon afoot when it comes to deferral. The big challenge is that in Core, there are only ever two states of a given model:

  • The model in production
  • The model in your local dev branch

With Fusion, we’ve introduced a third: the schemas and logical plans for your models that are generated by static analysis (living in target/db/schemas/...)

This can manifest as errors like dbt1053 (can't find a locally cached schema, because we're deferring to prod and it was never locally cached) or dbt1014 (trying to introspect the dev schema since the local cache is missing, but there's nothing to introspect because we're deferring to prod). We’re laser-focused on this now, but we wanted to call it out to y’all.
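For context, a typical deferred invocation looks like the sketch below (the artifact path is hypothetical); it’s exactly this flow where the third state can go missing and surface those errors:

```sh
# Build only modified models and their descendants, resolving unbuilt
# parents against production artifacts instead of your dev schema.
dbt build --select state:modified+ --defer --state path/to/prod-artifacts
```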

Docs generate for those who have adopted the new SL spec

We’re almost ready to announce the new way to create a catalog.json, which is the key artifact that powers what we call dbt docs and dbt Platform Catalog. However, there are still some kinks to iron out.

Until Fusion can properly generate a catalog.json, Platform users may hit some bumps keeping their Catalog updated. Catalog uses a Core-powered docs experience, which requires a job running dbt docs generate. This command only works on Core, so if you are using Fusion on Platform and have a job with documentation, under the hood it is being run with Core.

This may work for your project, but there are some cases where Core throws an error on this command. Here are some scenarios you may observe:

  • Error messages like 'ref' is undefined, or errors about Jinja in .md or .yml files
  • A job that never finishes, or that fails for no apparent reason

This is a short-term degradation; the Fusion docs generate command will soon be released, and Core will soon be able to generate catalog.json for Fusion-enabled projects.

If you’re a dbt Platform customer on Fusion and experiencing this issue, please reach out to support, and we can deploy a hotfix for you.

👓 Stuff you should read:

I’ve been reading a lot about software engineering best practices—partly to LARP as an engineer, and partly to build a clearer mental model for how the dbt Fusion engine will shape the developer experience.

So my brain absolutely lit up when I read this article. It goes into how getting used to a compiler is an adjustment, and how that initial instinct to squelch the compiler’s nagging ends up biting you in the long run. Instead, you have to evolve your mindset and come to consider the compiler your friend.

So for those who have been using Fusion or are Fusion-curious, please give it a read and let me know what resonates with you! Here are a few paragraphs that really spoke to me:

The compiler is always angry. It's always yelling at us for no good reason. It's only happy when we surrender to it and do what it tells us to do. Why do we agree to such an abusive relationship?
And for someone who yells that much, it's not even that smart. Remember all the times when the compiler yelled at you something about a type mismatch, but turns out that you were right and the compiler was wrong? In the end you had to use a cast (or mark something as any) just to shut the compiler up.
We can solve these issues by quitting the relationship, and jumping ship to a dynamic language. I suspect that compiler abuse is why in the past Python grew in popularity compared to Java (that, and Java's verbosity). But we just saw that even in historically dynamic languages, like JavaScript and Python, there's a trend towards a more static, more compiler-centric way of writing code. These days Python has not just one, but multiple competing tools for typechecking (mypy, pyright, pyrefly, ty). Why do people lean that way?

🏁 Made it to the meme

Not only is Valentine’s Day this week, but it’s also the 10th anniversary of Apache Arrow. In celebration of both, we present to you not a meme but a Valentine’s Day card, replete with a dad joke.

Seriously: not only has Apache Arrow revolutionized how dbt works (it’s the predominant data format throughout the dbt Fusion engine), it also dramatically lowers the bar for vendors to build and maintain query engines. Most importantly, data practitioners’ lives improve thanks to the performance gains Arrow brings over archaic data protocols.

P.S. One more thing…

🦆 quack check out 2.0.0-preview.119 quack 🦆
