Apr-01-2024, 01:31 AM
Hello!
So I've created a script that does the data crunching I want, mostly using numpy and pandas. With my test data of about a million input rows, execution takes between 2 and 4 minutes depending on how I set certain calculation parameters. That's bearable for now... but eventually I plan on having input of over a billion rows, and I want to make my code run as fast as I can.
On one hand, I have ideas for optimizing my code to make it more efficient. On the other hand, I also want to learn about multiprocessing, which I have never done before.
I'm running Windows 11.
My computer has 6 cores with 12 logical processors according to Task Manager. I read https://urban-institute.medium.com/using...ea5ef996ba as a primer on multiprocessing in Python.
When my script is executing, Windows Task Manager shows python taking only 19-22% CPU on average, with total load varying between 25-29%. So I have a lot of idle CPU. My guess is that's because python is only executing on a single thread?
If I run the following script:

import os
import multiprocessing

print(f"Total CPU cores: {os.cpu_count()}")
print(f"Python is using {multiprocessing.cpu_count()} CPU cores")

I get:

Output:
Total CPU cores: 12
Python is using 12 CPU cores
So that tells me python is using 12 cores, BUT... from the article I posted, even though python may tell me it has 12 CPU cores available, unless I use something like the Process or Pool class from the multiprocessing module, python is not actually using more than one core when executing my script. Is my understanding correct? So if I really want to use multiprocessing, then I need to implement something like the Process or Pool class?
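Based on my reading of the article, I think the basic Pool pattern would look something like this (the `square` function is just a toy stand-in for a real calculation, and I gather the `if __name__ == "__main__":` guard is required on Windows because worker processes are spawned by re-importing the script):

```python
import multiprocessing

def square(x):
    # Toy stand-in for a real per-item calculation
    return x * x

if __name__ == "__main__":
    # On Windows the Pool must be created inside this guard,
    # otherwise each spawned child would try to create its own pool
    with multiprocessing.Pool(processes=4) as pool:
        # map() splits the work across the worker processes
        # and returns results in the original order
        results = pool.map(square, range(10))
    print(results)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```

Is that roughly the shape of it?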
Also, I only have 12 logical processors, not some crazy machine with 128 cores, for example - so would the additional coding required to implement this even be worth it on a machine like mine?
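To make the question concrete, here's the kind of thing I'm picturing for my own case - splitting the DataFrame into one chunk per worker and processing the chunks in parallel. (`process_chunk` here is just a hypothetical stand-in for my real numpy/pandas logic, not my actual code:)

```python
import multiprocessing

import numpy as np
import pandas as pd

def process_chunk(chunk):
    # Hypothetical stand-in for the real calculation:
    # a cheap column-wise transform on each chunk
    return chunk.assign(c=chunk["a"] * chunk["b"])

if __name__ == "__main__":
    df = pd.DataFrame({"a": np.arange(100_000), "b": 2.0})

    n_workers = 6  # one per physical core
    # Contiguous row ranges, one per worker
    bounds = np.linspace(0, len(df), n_workers + 1, dtype=int)
    chunks = [df.iloc[lo:hi] for lo, hi in zip(bounds[:-1], bounds[1:])]

    with multiprocessing.Pool(n_workers) as pool:
        # Each chunk is pickled and copied to a worker process
        parts = pool.map(process_chunk, chunks)

    result = pd.concat(parts)
    print(len(result))  # 100000
```

One thing I'm unsure about is the overhead - I gather each chunk gets pickled and copied into the worker process, so for huge inputs that copying might eat into the speedup?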
Thanks for the help!