please help - Printable Version
Python Forum (https://python-forum.io) > Python Coding > General Coding Help
Thread: please help (/thread-41929.html)
please help - natalie321 - Apr-10-2024

I'm working on a Python script to parse a large CSV file and extract specific data, but I'm encountering performance issues. Here's a simplified version of my code:

    import csv

    def extract_data(csv_file):
        with open(csv_file, 'r') as file:
            reader = csv.reader(file)
            next(reader)  # Skip header row
            for row in reader:
                # Extract data from specific columns
                data = row[1], row[3], row[5]
                process_data(data)

    def process_data(data):
        # Some processing on the extracted data
        print(data)

    csv_file = 'large_file.csv'
    extract_data(csv_file)

The problem is that large_file.csv contains millions of rows, and my script is taking too long to process. I've tried optimizing the code, but it's still not efficient enough. Can someone suggest more efficient ways to parse and extract data from such a large CSV file in Python? Any help would be appreciated!

RE: please help - Larz60+ - Apr-10-2024

You can read the data in as chunks, which will save a lot of time. Also, since you want to display the data, using Pandas may save significant time. Perhaps reconsider displaying the entire file, as no one will be able to read millions of lines anyway. What is the ultimate goal?

RE: please help - Gribouillis - Apr-10-2024

I'm not a pandas user, but pandas.read_csv() seems to have a chunksize argument which allows you to read portions of the entire file.
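A minimal sketch of the chunked reading Larz60+ suggests, using only the standard library; the helper name read_in_chunks and the chunk size of 10000 are illustrative choices, not anything given in the thread:

    import csv
    from itertools import islice

    def read_in_chunks(csv_file, chunksize=10000):
        # Hypothetical helper: yields lists of up to `chunksize` rows at a time,
        # so downstream processing can work on batches instead of single rows.
        with open(csv_file, 'r', newline='') as file:
            reader = csv.reader(file)
            next(reader)  # Skip header row
            while True:
                chunk = list(islice(reader, chunksize))
                if not chunk:
                    break
                yield chunk

    for chunk in read_in_chunks('large_file.csv'):
        for row in chunk:
            data = row[1], row[3], row[5]  # Same columns as the original script
            print(data)

Note that csv.reader already streams rows lazily, so batching alone won't speed things up much; it mainly pays off when the per-chunk processing can be vectorized, which is where the Pandas/Polars suggestions below come in.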
RE: please help - paul18fr - Apr-10-2024

Pandas is the best tool for dealing with a lot of configurations; nonetheless, in some cases other "light" tools can be used (NumPy, for instance). Do you have the same number of columns in all rows? Is the type always the same (float / integer)? If so, you can have a look at np.loadtxt to retrieve an array directly (basic example hereafter):

    import numpy as np

    M = np.loadtxt('data/sample.csv', delimiter=',', dtype='int64')

RE: please help - deanhystad - Apr-10-2024

I don't know if chunks are important or not. I think you would see significant speed gains if you used pandas to load your CSV file and do your processing.

RE: please help - snippsat - Apr-10-2024

You can also use Polars to speed things up: for a 1 GB .csv file, Pandas takes ca. 13.5 seconds versus 350 milliseconds in Polars. In Pandas, your code could look like this; if it is still too slow, use Polars, or another option is Dask.

    import pandas as pd

    def extract_data(csv_file):
        # Use chunksize to read the file in chunks
        chunksize = 10000
        for chunk in pd.read_csv(csv_file, chunksize=chunksize):
            data = chunk.iloc[:, [1, 3, 5]].values.tolist()
            process_data(data)

    def process_data(data):
        for row in data:
            print(row)

    csv_file = 'large_file.csv'
    extract_data(csv_file)

RE: please help - Pedroski55 - Apr-11-2024

Quote: I'm working on a Python script to parse a large CSV file and extract specific data,

Got a sample of your csv? What do you want to extract? Does the first line hold the column headers? Look up data pipelines:

Quote: Creating Data Pipelines With Generators
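A minimal sketch of the generator-pipeline idea Pedroski55 points to; the stage names (read_rows, select_columns) are illustrative, and it assumes the same three columns as the original script:

    import csv

    def read_rows(csv_file):
        # Stage 1: stream rows from disk one at a time
        with open(csv_file, 'r', newline='') as file:
            reader = csv.reader(file)
            next(reader)  # Skip header row
            yield from reader

    def select_columns(rows, indices):
        # Stage 2: keep only the requested columns from each row
        for row in rows:
            yield tuple(row[i] for i in indices)

    # Chain the stages; nothing is read until the loop pulls values through
    pipeline = select_columns(read_rows('large_file.csv'), [1, 3, 5])
    for data in pipeline:
        print(data)

Because every stage is a generator, memory use stays flat no matter how large the file is: only one row is in flight at a time, and further stages (filtering, type conversion) can be chained on the same way.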