Lazy/Pushdown in DuckDB/Polars - forced conversion to pandas dataframe

Hello

First of all, I am really impressed with the tool and excited for the new functionalities on the roadmap (especially the scheduler). The tool is very simple to use and already has a lot of functionality, and the code export functionality is a game changer. The problem is that every transformation forces me to convert the result into a pandas DataFrame. Even when using custom Python transforms, I need to convert the result into a pandas df, otherwise it throws an error.

This means each step is materialized and DuckDB cannot use its lazy evaluation. This significantly increases memory usage and leads to freezes/crashes when running on large datasets.

Is there any workaround I can use at the moment, and are there any plans to change this? I’d love to be able to use Amphi, but at this stage this particular limitation makes it unusable for my use cases.

thank you, happy to elaborate on any of the above!

Hi @ad994, thanks for your message.

That’s actually good feedback. I could introduce a new component that would allow using either DuckDB or Polars code. Let me add it to the queue.

What kind of computation are you performing with DuckDB?

Thanks,
Thibaut

Thanks a lot for your response! Very excited to see this implemented!

For context, I am a KNIME power user and do most of my ETL there using a DuckDB JDBC client within the platform (I mostly write SQL pipelines). The datasets I work with are not massive (10-20 million rows, 100+ columns), but big enough to choke a laptop or a small compute node in the cloud if I used pandas instead.

KNIME is a very mature and polished tool and has nice functionality for interacting with databases. You essentially write small code snippets in each node, connect the nodes together, and it merges all of your database nodes into one big SQL query at the end. You can easily port this SQL query anywhere. This ensures none of the intermediate queries are materialized until the very end and lets the query optimizer do its work, while keeping the pipeline modular and very easy to debug/adjust.

KNIME also has Python scripting capabilities but, just like Amphi, accepts outputs as pandas DataFrames only.

While I do most of my transforms in DuckDB, there is still the odd API connection, KNIME-native node, or workflow branch control / variable & loop logic that is not captured in the SQL query itself.

While it’s super fast for development, it creates blockers when porting the workflow to run in production (mostly Redshift/Snowflake or Databricks). The SQL part is usually fairly simple (there are syntax differences but nothing too crazy), but porting the missing non-SQL pieces of the pipeline is sometimes a lot of pain. With Amphi, I love the fact that I can export code that captures the entire workflow logic. I’m very excited for Amphi’s future and believe it could be this missing piece I’m looking for, even though at the moment the tool is not as feature-complete/mature, which is understandable.

Thanks a lot for all the great work!
Adam

Hello Adam. I think there are two different points here.
The first is about using a different execution engine as the backend (which is currently pandas, even if some tools can be executed with Polars or DuckDB… I’m working on a join that runs on Polars and DuckDB, but the input and output are still pandas DataFrames). The other is about in-database execution. I wrote a specification document (more from a user’s point of view) about the latter, hoping to work on it personally next year. I’m an Alteryx user and I basically want the same thing as you, but I would rather have separate components for in-memory and in-database features.

Hey Simon, thanks for your reply and apologies for the delay…

I don’t believe the above requires separating “in-database” & “in-memory” features, as long as we can pass the result of each step along as a DuckDBPyRelation or Polars LazyFrame instead of converting it with the “.df()” call.

DuckDB has a relational API in addition to its SQL API, where queries are constructed incrementally and can reference previous relations. Each step produces a “DuckDBPyRelation” Python object. Polars has an equivalent called a LazyFrame, where computation is only triggered by calling the “.collect()” method.

If you force the output to be a pandas DataFrame each time, you lose about 90% of the benefits these engines (DuckDB, Polars) bring in terms of performance optimizations, so it kind of defeats the purpose of bringing them in at all (plus you introduce a massive DataFrame conversion overhead at every step).

If we could simply write custom SQL or custom Python, construct pipelines incrementally, and execute them in “lazy” mode, this would be huge, and probably easier to implement than the string-based SQL query construction of the “in-database” mode referenced above.

For instance, marimo notebooks implement this very nicely: you can display the results as a pandas DataFrame, Polars LazyFrame, or DuckDB relation, or simply let it be inferred from the code using “auto”.

Hello @ad994, I may have been unclear. The idea is not to use DuckDB for in-database execution but for in-memory (like pandas today… and Polars would also be an option). The reference to in-database was to Redshift/Snowflake/Databricks/PostgreSQL, etc.
In the last release, there are two components where you can choose the engine: the advanced join and the compare dataframe. However, the console still displays only pandas DataFrames, and we have a lot of work to do to make this choice available everywhere.