(Apr-28-2023, 09:33 AM)mg24 Wrote: my csv file size is 10 gb,For data this big is Dask or Polars better.
Dask DataFrame mirrors the pandas DataFrame API, so it works much the same as Pandas.
Example with timing:

# pip install "dask[complete]"
import time
from dask import dataframe as dd

start = time.time()
df = dd.read_csv('large.csv')
end = time.time()
print(f"Total Time: {end - start} sec")

Just by doing this, Dask already does a lot. E.g. for a medium-size 230 MB .csv, Dask "reads" it in 0.01 sec, while Pandas reads it in 6.2 sec.
Dask utilizes multiple CPU cores by internally chunking the dataframe into partitions and processing them in parallel.
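Conceptually, the chunk-and-parallelize idea can be sketched with just the standard library (a thread pool is used here for simplicity; Dask can also spread the partitions across processes or a distributed cluster):

```python
from concurrent.futures import ThreadPoolExecutor

def process_chunk(chunk):
    # Stand-in for per-partition work, e.g. a per-chunk aggregation.
    return sum(chunk)

data = list(range(1_000))
# Split into fixed-size chunks, process each in parallel, then combine.
chunks = [data[i:i + 250] for i in range(0, len(data), 250)]
with ThreadPoolExecutor() as pool:
    partials = list(pool.map(process_chunk, chunks))
print(sum(partials))  # same result as sum(data)
```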
Example: you want to import 10 GB of data on a machine with, e.g., 6 GB of RAM. This can't be done with Pandas, since the whole dataset doesn't fit into memory in a single shot (without chunking it up yourself), but Dask can do it.
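For comparison, here is a minimal sketch of the "chunking it up yourself" route in plain Pandas (it builds a small sample CSV in a temp directory so it is self-contained; with chunksize, read_csv returns an iterator of DataFrames, so only one chunk sits in memory at a time):

```python
import os
import tempfile
import pandas as pd

# Build a small sample CSV so the sketch is self-contained.
path = os.path.join(tempfile.mkdtemp(), "large.csv")
pd.DataFrame({"value": range(1_000)}).to_csv(path, index=False)

# Aggregate chunk by chunk instead of loading the whole file at once.
total = 0
for chunk in pd.read_csv(path, chunksize=100):
    total += chunk["value"].sum()

print(total)  # same result as pd.read_csv(path)["value"].sum()
```

Dask does essentially this for you, plus running the chunks in parallel.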
Instead of computing right away, Dask creates a graph of tasks that describes how to perform the computation. This is lazy computation: Dask's task scheduler first builds the graph, and only computes it when the result is requested.
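The lazy model can be sketched in a few lines (the Lazy class here is a hypothetical toy, not a Dask API; it just shows "record the task now, run it on request"):

```python
class Lazy:
    def __init__(self, func, *args):
        # Record the task and its inputs; nothing runs yet.
        self.func, self.args = func, args

    def compute(self):
        # Resolve dependencies recursively, then run this task.
        resolved = [a.compute() if isinstance(a, Lazy) else a
                    for a in self.args]
        return self.func(*resolved)

# Building the graph is cheap and instant (like dd.read_csv above).
a = Lazy(lambda: 10)
b = Lazy(lambda: 32)
c = Lazy(lambda x, y: x + y, a, b)

# Execution happens only when the result is asked for.
print(c.compute())  # 42
```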