Python Forum
Seeking advice on dask distributed
#4
(Apr-13-2024, 09:36 PM)sawtooth500 Wrote: I've never used dask before, but my understanding is that each function needs to be passed a dask dataframe. I could convert the pandas dataframe to a dask dataframe before passing it - but then would all my pandas functions inside of process_date() still work? I read that each partition of a dask dataframe is essentially a pandas dataframe, so my understanding is it should work, but I'm seeking confirmation.
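Yes, a Dask DataFrame works much like a Pandas DataFrame. It parallelizes Pandas by splitting the data into partitions, and each partition is an ordinary Pandas DataFrame, so Pandas code that only needs the rows of one partition should keep working.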

It may also be easier and faster to use Polars, which is parallel by default.
Polars Wrote:Parallel: Utilises the power of your machine by dividing the workload among the available CPU cores without any additional configuration.
Vectorized Query Engine: Using Apache Arrow, a columnar data format, to process your queries in a vectorized manner and SIMD to optimize CPU usage.
I posted an example in your earlier thread that you may not have seen.

In the quick test below, note that I do not need to tell Polars to use all CPUs; it does that by default.
On this CPU-heavy task, Pandas takes about 2 minutes and Polars about 2 seconds.
# --- Pandas ---
import pandas as pd
import numpy as np

# Generate a large DataFrame
np.random.seed(0)
df = pd.DataFrame({
    'A': np.random.randint(1, 100, 1000000),
    'B': np.random.randint(1, 100, 1000000),
    'C': np.random.rand(1000000)
})

# Sort by two columns
df_sorted = df.sort_values(by=['A', 'B'])

# Group by 'A' and compute the mean and sum of 'C'
grouped = df.groupby('A')['C'].agg(['mean', 'sum'])

# Create DataFrame to merge with
df2 = pd.DataFrame({
    'A': np.random.randint(1, 100, 25000),
    'D': np.random.rand(25000)
})

merged_df = pd.merge(df, df2, on='A')

# --- Polars ---
import polars as pl
import numpy as np

# Generate a large Polars DataFrame
np.random.seed(0)
df = pl.DataFrame({
    'A': np.random.randint(1, 100, 1000000),
    'B': np.random.randint(1, 100, 1000000),
    'C': np.random.rand(1000000)
})

# Sort by two columns
df_sorted = df.sort(['A', 'B'])

# Group by 'A' and compute the mean and sum of 'C'
grouped = df.group_by('A').agg([
    pl.col('C').mean().alias('mean'),
    pl.col('C').sum().alias('sum')
])

# Create Polars DataFrame to merge with
df2 = pl.DataFrame({
    'A': np.random.randint(1, 100, 25000),
    'D': np.random.rand(25000)
})

merged_df = df.join(df2, on='A', how='inner')
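
The scripts above do not show how the times were taken. A minimal sketch of one way to measure them, using only the standard library, is below. The helper name `timed` is my own, and the workload is a stand-in so that the snippet runs without Pandas or Polars installed.

```python
import time
from typing import Any, Callable, Tuple


def timed(label: str, func: Callable[[], Any]) -> Tuple[Any, float]:
    """Run func once, print the wall-clock time, return (result, seconds)."""
    start = time.perf_counter()
    result = func()
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.3f} s")
    return result, elapsed


# Stand-in workload; replace the lambda with the step you want to time.
total, seconds = timed(
    "sum of squares",
    lambda: sum(i * i for i in range(100_000)),
)
```

To time a step from the scripts, wrap it in a lambda, for example `timed("pandas merge", lambda: pd.merge(df, df2, on='A'))` and `timed("polars join", lambda: df.join(df2, on='A', how='inner'))`. A single run is only a rough number; repeating the run a few times gives a steadier figure.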


Messages In This Thread
Seeking advice on dask distributed - by sawtooth500 - Apr-13-2024, 09:36 PM
RE: Seeking advice on dask distributed - by Larz60+ - Apr-14-2024, 10:17 AM
RE: Seeking advice on dask distributed - by snippsat - Apr-15-2024, 11:17 AM
