Python Forum
Multiprocessing on python
#1
Hello!


So I've created a script that does the data crunching I want, mostly using numpy and pandas. Execution time depends on the number of input rows and on certain calculation parameters: with my test data of about a million input rows, it currently takes 2-4 minutes. That's bearable for now, but eventually I plan on having input of over a billion rows, so I want to make my code run as fast as I can.

On one hand, I have ideas on how to optimize my code to make it more efficient. On the other hand, I also want to learn about multiprocessing, which I have never done before.

I'm running Windows 11.

My computer has 6 cores with 12 logical processors according to Task Manager. I read https://urban-institute.medium.com/using...ea5ef996ba as a primer on multiprocessing in Python.

So when my script is executing, Windows Task Manager shows python using only 19-22% CPU on average, with my total load varying between 25-29%. So I have a lot of spare CPU time. My guess is that's because Python is only executing on a single thread?

If I run the following script:

import os
import multiprocessing

print(f"Total CPU cores: {os.cpu_count()}")
print(f"Python is using {multiprocessing.cpu_count()} CPU cores")
I get:

Output:
Total CPU cores: 12
Python is using 12 CPU cores
So that tells me that Python sees 12 cores BUT....

From the article I posted: even though Python may tell me there are 12 CPU cores, unless I use something like the Process or Pool class from the multiprocessing module, Python is not actually using more than one core when executing my script. Is my understanding correct? So if I really want to use multiprocessing, then I need to implement something like the Process or Pool class?

Also, I only have 12 cores, not some crazy machine with like 128 cores, for example - so would the additional coding required to implement this be even worth it for a machine with 12 cores?

Thanks for the help!
#2
2 minutes for 1 million rows is absurd. It should be a second or less. You need to work on algorithm improvement instead of thinking about multiprocessing. It is not uncommon to see a better algorithm make a program a thousand times faster. Multiprocessing improvements are more like 2 to 4 times faster. With 12 cores it could theoretically be 11 times faster, but you'd be doing great if it was 5.
#3
Agreed, I need to get better at writing more efficient code. I need to look at it, think about it, and examine areas where I can optimize.
#4
So here is the crux of my issue:

For example, I have a dataframe with 1.8 million input rows. However, from that I extrapolate 62.1 million rows from which I need to find weighted averages. So really I'm crunching 62.1 million rows, and that takes 23-ish minutes.

As a simplification, my dataframe df has two cols 'A' and 'B' filled with numbers, and I need to calculate the average of the numbers in col A weighted by the numbers in col B.
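Concretely, with toy numbers (a quick sketch, not my real data):

```python
import pandas as pd

df = pd.DataFrame({"A": [10.0, 20.0, 30.0], "B": [1.0, 2.0, 3.0]})

# average of A weighted by B: sum(A*B) / sum(B)
wa = (df["A"] * df["B"]).sum() / df["B"].sum()
print(wa)  # (10 + 40 + 90) / 6 ≈ 23.33
```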

But... I don't need to find the weighted average of the original 1.8 million rows. I need to find the weighted averages of overlapping sets of rows - which in example total to 62.1 million rows.

For example - I would take rows

0-2832
672-3293
1189-4102
1382-4204
2902-4680
and so on....

I have an algorithm that determines the relevant row numbers for each subset. There is no consistent pattern of a fixed increment or anything like that with the row numbers.

So it starts with a while loop (this is probably why processing takes so long):

while condition:
    # boolean indexing to create a new dataframe with the relevant subset
    tempdf = ...
    weighted_avg = (tempdf['A'] * tempdf['B']).sum() / tempdf['B'].sum()
    # append the weighted average as a new row to a results dataframe, resultsdf

So, as you see, this is how I turn 1.8 million rows into 62.1 million rows to process: there is a lot of overlap between the subset dataframes because of the math involved in calculating a weighted average. And given that there is no simple incremental pattern in the indices used to create each subset, I don't know how to do this any way other than with a while loop that takes a very long time to cycle through.

So, any better way to do it?
#5
"Why are you doing that?" Is the first question I always ask myself. Do you need to interpolate all these extra points? What is their purpose?
#6
Curiosity to see the data, really. I wrote that an algorithm determines the subsets of rows: it picks out rows for certain time intervals, and the time intervals can overlap. However, there isn't a constant number of rows per time interval; there could be 1000, 2000, 3000, any number really. So I'm looking at how the weighted average changes over time intervals, and I'm interested in graphing it.

Any better way to do it? Or am I just stuck with the while loop?
#7
If you interpolate or extrapolate the extra values, do they have any meaning? Instead of making meaningless points so you can calculate a weighted average, why not calculate the weights based on the existing points?

If you want to distribute the points more evenly, how are you doing that? Did you look at the pandas resample function? Have you looked at what numpy has to offer?
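For example, if each subset really is a contiguous range of row positions, numpy prefix sums let you compute every weighted average without looping over the data at all (a sketch; the arrays and ranges here are made up, and the end positions are treated as exclusive):

```python
import numpy as np

# hypothetical data standing in for columns A (values) and B (weights)
rng = np.random.default_rng(0)
a = rng.random(1_800_000)
b = rng.random(1_800_000)

# prefix sums with a leading zero, so the sum over [start, end)
# is csum[end] - csum[start], an O(1) lookup per interval
csum_ab = np.concatenate(([0.0], np.cumsum(a * b)))
csum_b = np.concatenate(([0.0], np.cumsum(b)))

# hypothetical overlapping ranges like the ones the interval algorithm produces
starts = np.array([0, 672, 1189, 1382, 2902])
ends = np.array([2833, 3294, 4103, 4205, 4681])  # exclusive

# all weighted averages at once, no per-interval scan of the data
weighted_avgs = (csum_ab[ends] - csum_ab[starts]) / (csum_b[ends] - csum_b[starts])
```

With this the cost is one pass to build the prefix sums plus O(1) per interval, instead of rescanning the overlapping rows every time.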
#8
Minor suggestion - look at polars for python, in many cases better than pandas for large record sets.
#9
For right now, I don't want to do any resampling or such because I just want to use the whole dataset as practice.

Polars sounds fascinating, so I actually rewrote my code today to use polars instead of pandas. However, I'm doing something very wrong with polars. Right now I'm taking 614,000 rows that I extrapolate to 20.5 million rows to crunch: my pandas code takes 2:40 to do that, while my polars code takes 15 minutes, and I get the exact same result with both. So I guess I have more learning to do with polars, and more playing around to figure out why it's taking so much longer than pandas, because it clearly shouldn't.

So how do you measure the performance of a script? I'm using the following:

import time
start_time = time.time()

def print_time(timetext):
    end_time = time.time()
    execution_time = end_time - start_time
    minutes, seconds = divmod(execution_time, 60)
    milliseconds = (execution_time - int(execution_time)) * 1000
    formatted_time = "{:02d}:{:02d}.{:03d}".format(int(minutes), int(seconds), int(milliseconds))
    print(f"{timetext}: {formatted_time}")
I just call print_time() wherever I want in my script so I can see how long it took to reach a certain point. It's great for measuring total execution time, or milestones like right before my while loop starts or right after it finishes.

I know the real time killer in my script is my while loop; each iteration goes through fast, there are just a ton of iterations. Is it feasible to use something like my print_time() function inside the while loop so I can break down which operations take how much time? Or is there a better way to determine execution times of components of a script?
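One low-overhead pattern for this is to accumulate per-section totals with time.perf_counter and print once after the loop, instead of printing every iteration (a sketch with placeholder work standing in for the real steps):

```python
import time

totals = {"index": 0.0, "avg": 0.0}

for i in range(1000):  # stand-in for the real while loop
    t0 = time.perf_counter()
    subset = list(range(50))          # stand-in for the boolean-indexing step
    totals["index"] += time.perf_counter() - t0

    t0 = time.perf_counter()
    avg = sum(subset) / len(subset)   # stand-in for the weighted-average step
    totals["avg"] += time.perf_counter() - t0

# one summary at the end, so timing doesn't distort the loop itself
for name, secs in totals.items():
    print(f"{name}: {secs:.4f} s total")
```

For a full breakdown without instrumenting by hand, the standard library's cProfile module can also report cumulative time per function.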
#10
Another thought - so in my while loop, each time the loop runs, I get a result row that I assemble into a results dataframe.

In the pandas version, I can just directly insert values for a new row into resultdf every time through the loop.

In polars, however, you can't insert a row into a dataframe in place; I need to create a new dataframe each iteration with the new row of data, then use vstack to combine the new and old result dataframes.

This is in pandas:
resultdf.loc[loc_counter, 'eastern_time'] = convert_nano_timestamp(time_ender)
resultdf.loc[loc_counter, 'price'] = tempdf.loc[tempdf.index[-1], 'price']
resultdf.loc[loc_counter, 'high'] = tempdf['price'].max()
resultdf.loc[loc_counter, 'low'] = tempdf['price'].min()
resultdf.loc[loc_counter, 'volwa-price'] = wa
resultdf.loc[loc_counter, 'size'] = weightsum
This is in polars:
new_row_data = {
    'int_calc': int_calc2,
    'int_dur': int_dur2,
    'eastern_time': convert_nano_timestamp(time_ender),
    'price': tempdf.select(pl.last('price')).to_series()[0],
    'high': tempdf['price'].max(),
    'low': tempdf['price'].min(),
    'volwa-price': wa,
    'volwa%': 0,
    'size': weightsum
}
resultdf = resultdf.vstack(pl.DataFrame({name: [value] for name, value in new_row_data.items()}, schema=schema))
Could this creation and combination of a new dataframe in each loop iteration in polars be what is killing my performance? I could just build the result dataframe in pandas instead; that's no big deal.

