Seeking advice on dask distributed

***snippsat*** · (This post was last modified: Apr-15-2024, 11:17 AM by snippsat.)

(Apr-13-2024, 09:36 PM)sawtooth500 Wrote: I've never used dask before, but my understanding is that each function needs to be passed a dask dataframe. I could convert the pandas dataframe to a dask dataframe before passing it - but then would all my pandas functions inside of process_date() still work? I read that each partition of a dask dataframe is essentially a pandas dataframe, so my understanding is it should work, but I'm seeking confirmation.

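Yes, a Dask DataFrame works much like a pandas DataFrame: it is made up of many pandas DataFrames (the partitions) and runs pandas operations on them in parallel. Most pandas code carries over, although Dask does not implement the entire pandas API.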

It may also be easier and faster to use Polars, which is parallel by default.

Polars Wrote: Parallel: Utilises the power of your machine by dividing the workload among the available CPU cores without any additional configuration.
Vectorized Query Engine: Using Apache Arrow, a columnar data format, to process your queries in a vectorized manner and SIMD to optimize CPU usage.

I posted an example in your earlier thread that you may not have seen.

In a quick test, I did not need to tell Polars to use all CPUs; it does that by default.
On this CPU-heavy task, pandas takes about 2 minutes and Polars about 2 seconds.

# --- Pandas ---
import pandas as pd
import numpy as np

# Generate a large DataFrame
np.random.seed(0)
df = pd.DataFrame({
    'A': np.random.randint(1, 100, 1000000),
    'B': np.random.randint(1, 100, 1000000),
    'C': np.random.rand(1000000)
})

# Sort by two columns
df_sorted = df.sort_values(by=['A', 'B'])

# Group by 'A' and aggregate 'C'
grouped = df.groupby('A')['C'].agg(['mean', 'sum'])

# Create DataFrame to merge with
df2 = pd.DataFrame({
    'A': np.random.randint(1, 100, 25000),
    'D': np.random.rand(25000)
})

merged_df = pd.merge(df, df2, on='A')

# --- Polars ---
import polars as pl
import numpy as np

# Generate a large Polars DataFrame
np.random.seed(0)
df = pl.DataFrame({
    'A': np.random.randint(1, 100, 1000000),
    'B': np.random.randint(1, 100, 1000000),
    'C': np.random.rand(1000000)
})

# Sort by two columns
df_sorted = df.sort(['A', 'B'])

# Group by 'A' and aggregate 'C'
grouped = df.group_by('A').agg([
    pl.col('C').mean().alias('mean'),
    pl.col('C').sum().alias('sum')
])

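# Create Polars DataFrame to merge with
# Note: 'A' is drawn from 1..999 here but from 1..99 in the pandas version,
# so this join returns roughly 10x fewer rows than the pandas merge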
df2 = pl.DataFrame({
    'A': np.random.randint(1, 1000, 25000),
    'D': np.random.rand(25000)
})

merged_df = df.join(df2, on='A', how='inner')
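
The code above does not show how the timings were taken. One way to measure them is sketched below, using `time.perf_counter` from the standard library. The `timed` helper and the reduced row count are my own assumptions for illustration, not what produced the numbers quoted in this post. The same helper could wrap the Polars version.

```python
import time

import numpy as np
import pandas as pd


def timed(label, func):
    # Hypothetical helper: run func once and report wall-clock time
    start = time.perf_counter()
    result = func()
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.3f} s")
    return result, elapsed


# Smaller frame than in the post so the sketch runs quickly
rng = np.random.default_rng(0)
n = 100_000
df = pd.DataFrame({
    "A": rng.integers(1, 100, n),
    "B": rng.integers(1, 100, n),
    "C": rng.random(n),
})


def pandas_workload():
    # Same sort and group-by steps as the pandas section above
    df.sort_values(by=["A", "B"])
    return df.groupby("A")["C"].agg(["mean", "sum"])


grouped, seconds = timed("pandas", pandas_workload)
```

For a fair comparison, both versions should use the same row counts and the same key ranges, and each should be run more than once.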
