Python Forum
Seeking advice on dask distributed - Printable Version

+- Python Forum (https://python-forum.io)
+-- Forum: Python Coding (https://python-forum.io/forum-7.html)
+--- Forum: General Coding Help (https://python-forum.io/forum-8.html)
+--- Thread: Seeking advice on dask distributed (/thread-41953.html)



Seeking advice on dask distributed - sawtooth500 - Apr-13-2024

I am using python to do stock backtesting for day trading.

I have a function called process_date(datedf), which is passed a pandas dataframe.

This function backtests the strategy across a single day, and returns me another pandas dataframe with the desired results.

Because this is day trading, each day is completely independent from the next so I parallelized my code to run multiple days at once using joblib, and it works great.

However, I want to take it to the next step. My main computer only has 16 cores; it's the fastest one I have, but I have a menagerie of other computers strewn about. So my thought was: what if I could parallelize this process_date function across multiple computers using dask distributed, just like joblib does on one computer, but now also using the cores of several other machines at once?

I've never used dask before, but my understanding is that each function needs to be passed a dask dataframe. I could convert the pandas dataframe to a dask dataframe before passing it - but then would all my pandas functions inside of process_date() still work? I read that each partition of a dask dataframe is essentially a pandas dataframe, so my understanding is it should work, but I'm seeking confirmation.

Also, I do make use of some global variables within the function. This works fine with joblib, but would this work with dask across multiple machines? Or would I need to rewrite that code?
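
Roughly what I have in mind, as a sketch (the scheduler address is made up, daily_frames stands in for the per-day pandas dataframes I already build, and anything I currently keep in globals would become an explicit argument):

from dask.distributed import Client
import pandas as pd

# Made-up scheduler address; dask workers would run on my other computers
client = Client("tcp://192.168.1.100:8786")

daily_frames = [...]  # one plain pandas DataFrame per trading day, as I build them now

# Each call gets one day's pandas DataFrame, exactly like my joblib setup
futures = client.map(process_date, daily_frames)
results = client.gather(futures)   # list of per-day result DataFrames
combined = pd.concat(results)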

Let me know if conceptually my idea would work, or not. Thanks.


RE: Seeking advice on dask distributed - Larz60+ - Apr-14-2024

You might want to read some of the tutorials and blogs available.
Try "true parallel processing using python" in DuckDuckGo and/or Google to get a large list.
It sounds (without any further research) like, if Dask needs a pandas DataFrame back from each process, it's not truly parallel, other than the presentation step.

Have you investigated the Python built-in multiprocessing module?
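For example, a minimal process-based sketch (the worker function here is just a stand-in for whatever CPU-heavy task you have):

from multiprocessing import Pool

def worker(n):
    # stand-in for a CPU-heavy task; each call runs in its own process
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    with Pool() as pool:              # defaults to one process per CPU core
        results = pool.map(worker, [2_000_000] * 8)
    print(results)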
some things to read:
Process-based parallelism

Parallel Processing in Python
Bypassing the GIL for Parallel Processing in Python

The following blog is more about threading (python section) than multiprocessing, but is still quite interesting:
Parallel processing in Python


RE: Seeking advice on dask distributed - sawtooth500 - Apr-14-2024

Yes, I am currently using Python multiprocessing through joblib, and it works great, but it's just relegated to a single computer. I was hoping to extend that to multiple computers.
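
For reference, the joblib pattern I'm using now is roughly this (simplified; daily_frames stands in for my real list of per-day dataframes):

from joblib import Parallel, delayed

daily_frames = [...]  # one pandas DataFrame per trading day

# n_jobs=-1 uses every core on this one machine
results = Parallel(n_jobs=-1)(
    delayed(process_date)(day_df) for day_df in daily_frames
)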

I found ipyparallel listed in one of the articles you mentioned, which was helpful. Apparently it lets you extend the parallelism across multiple machines, and at first glance it seems like it may be simpler for me to implement than dask, but I will just need to get into it and start coding to figure it out.
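
From a first look at the ipyparallel docs, the basic pattern seems to be something like this (untested on my side; it assumes a controller and engines are already started on the machines):

import ipyparallel as ipp

rc = ipp.Client()                  # connect to the running controller
view = rc.load_balanced_view()     # send tasks to whichever engine is free

daily_frames = [...]               # one pandas DataFrame per trading day
results = view.map_sync(process_date, daily_frames)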


RE: Seeking advice on dask distributed - snippsat - Apr-15-2024

(Apr-13-2024, 09:36 PM)sawtooth500 Wrote: I've never used dask before, but my understanding is that each function needs to be passed a dask dataframe. I could convert the pandas dataframe to a dask dataframe before passing it - but then would all my pandas functions inside of process_date() still work? I read that each partition of a dask dataframe is essentially a pandas dataframe, so my understanding is it should work, but I'm seeking confirmation.
Yes, a Dask DataFrame works much the same as a Pandas DataFrame, and it parallelizes Pandas; each partition is a regular pandas DataFrame, so the pandas code inside process_date() should keep working.
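A minimal sketch of that idea (the scheduler address is hypothetical, and all_days_df stands in for a single pandas DataFrame holding every day):

import dask.dataframe as dd
from dask.distributed import Client

client = Client("tcp://scheduler:8786")   # hypothetical scheduler address

all_days_df = ...                         # one big pandas DataFrame covering all days
ddf = dd.from_pandas(all_days_df, npartitions=32)

# Each partition passed to process_date is an ordinary pandas DataFrame;
# a meta= hint may be needed if Dask cannot infer the output columns.
result = ddf.map_partitions(process_date).compute()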

Also, maybe easier and faster, is to use Polars; it's parallel by default.
Polars Wrote:Parallel: Utilises the power of your machine by dividing the workload among the available CPU cores without any additional configuration.
Vectorized Query Engine: Using Apache Arrow, a columnar data format, to process your queries in a vectorized manner and SIMD to optimize CPU usage.
I did post an example in your earlier post that you may not have seen.

If I write a quick test, I see that I do not need to tell Polars to use all CPUs; it does that by default.
So Pandas takes about 2 minutes and Polars about 2 seconds on this task, which is CPU heavy.
# --- Pandas ---
import pandas as pd
import numpy as np

# Generate a large DataFrame
np.random.seed(0)
df = pd.DataFrame({
    'A': np.random.randint(1, 100, 1000000),
    'B': np.random.randint(1, 100, 1000000),
    'C': np.random.rand(1000000)
})

df_sorted = df.sort_values(by=['A', 'B'])
grouped = df.groupby('A')['C'].agg(['mean', 'sum'])
# Create DataFrame to merge with
df2 = pd.DataFrame({
    'A': np.random.randint(1, 100, 25000),
    'D': np.random.rand(25000)
})

merged_df = pd.merge(df, df2, on='A')

# --- Polars ---
import polars as pl
import numpy as np

# Generate a large Polars DataFrame
np.random.seed(0)
df = pl.DataFrame({
    'A': np.random.randint(1, 100, 1000000),
    'B': np.random.randint(1, 100, 1000000),
    'C': np.random.rand(1000000)
})

df_sorted = df.sort(['A', 'B'])
grouped = df.group_by('A').agg([
    pl.col('C').mean().alias('mean'),
    pl.col('C').sum().alias('sum')
])

# Create Polars DataFrame to merge with
df2 = pl.DataFrame({
    'A': np.random.randint(1, 100, 25000),  # same value range as the Pandas version, so the merge is comparable
    'D': np.random.rand(25000)
})

merged_df = df.join(df2, on='A', how='inner')
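
If you want to reproduce the rough timing comparison yourself, wrapping either script like this is enough (a simple sketch; the 2-min/2-sec figures above are from my machine, not from this exact harness):

import time

t0 = time.perf_counter()
# ... run either the Pandas or the Polars operations from above here ...
print(f"elapsed: {time.perf_counter() - t0:.2f} s")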



RE: Seeking advice on dask distributed - sawtooth500 - Apr-15-2024

Hey snippsat,

My apologies for not responding to you earlier; I actually moved away from Polars back to Pandas. In my benchmarking, yes, I found that Polars is better than Pandas at raw math, but it's a lot slower when it comes to splitting/appending dataframes. I'm doing stock backtesting on time-based series; however, because the time slices vary in length and so cover variable numbers of rows, none of the Polars rolling functions worked out. So I'm stuck with slicing dataframes and then doing math on the sliced dataframes. No single dataframe is greater than perhaps 20000 entries... and the raw math speed provided by Polars is negated by its slower dataframe manipulation.
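
Roughly the pattern I mean, as a toy example (the column names and window boundaries are made up):

import pandas as pd

# toy stand-in for one day's data
df = pd.DataFrame({
    "ts": pd.date_range("2024-04-15 09:30", periods=20000, freq="s"),
    "price": 100.0,
})

# variable-length time slices chosen by the strategy, not fixed-size windows
windows = [("2024-04-15 09:30", "2024-04-15 10:17"),
           ("2024-04-15 10:17", "2024-04-15 12:02")]

stats = []
for start, end in windows:
    sliced = df[(df["ts"] >= start) & (df["ts"] < end)]   # the slicing step
    stats.append(sliced["price"].mean())                  # the math step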