(Apr-13-2024, 09:36 PM)sawtooth500 Wrote: I've never used dask before, but my understanding is that each function needs to be passed a dask dataframe. I could convert the pandas dataframe to a dask dataframe before passing it - but then would all my pandas functions inside of process_date() still work? I read that each partition of a dask dataframe is essentially a pandas dataframe, so my understanding is it should work, but I'm seeking confirmation.
Yes, a Dask DataFrame works much like a Pandas DataFrame, and it parallelizes Pandas.
Also, it may be easier and faster to use Polars, as it's parallel by default.
I did post an example in your earlier post that you may not have seen.
Polars Wrote: Parallel: Utilises the power of your machine by dividing the workload among the available CPU cores without any additional configuration.
Vectorized Query Engine: Using Apache Arrow, a columnar data format, to process your queries in a vectorized manner and SIMD to optimize CPU usage.
If I write a quick test, you can see that there's no need to tell Polars to use all CPUs; it does that by default.
Pandas takes about 2 min and Polars about 2 sec on this task, which is CPU-heavy.
# --- Pandas ---
import pandas as pd
import numpy as np

# Generate a large DataFrame
np.random.seed(0)
df = pd.DataFrame({
    'A': np.random.randint(1, 100, 1000000),
    'B': np.random.randint(1, 100, 1000000),
    'C': np.random.rand(1000000)
})

df_sorted = df.sort_values(by=['A', 'B'])
grouped = df.groupby('A')['C'].agg(['mean', 'sum'])

# Create DataFrame to merge with
df2 = pd.DataFrame({
    'A': np.random.randint(1, 100, 25000),
    'D': np.random.rand(25000)
})
merged_df = pd.merge(df, df2, on='A')

# --- Polars ---
import polars as pl
import numpy as np

# Generate a large Polars DataFrame
np.random.seed(0)
df = pl.DataFrame({
    'A': np.random.randint(1, 100, 1000000),
    'B': np.random.randint(1, 100, 1000000),
    'C': np.random.rand(1000000)
})

df_sorted = df.sort(['A', 'B'])
grouped = df.group_by('A').agg([
    pl.col('C').mean().alias('mean'),
    pl.col('C').sum().alias('sum')
])

# Create Polars DataFrame to merge with
df2 = pl.DataFrame({
    'A': np.random.randint(1, 1000, 25000),
    'D': np.random.rand(25000)
})
merged_df = df.join(df2, on='A', how='inner')