Python Forum
please help
#1
I'm working on a Python script to parse a large CSV file and extract specific data, but I'm encountering performance issues. Here's a simplified version of my code:
import csv

def extract_data(csv_file):
    with open(csv_file, 'r') as file:
        reader = csv.reader(file)
        next(reader)  # Skip header row
        for row in reader:
            # Extracting data from specific columns
            data = row[1], row[3], row[5]
            process_data(data)

def process_data(data):
    # Some processing on the extracted data
    print(data)

csv_file = 'large_file.csv'
extract_data(csv_file)
The problem is that large_file.csv contains millions of rows, and my script is taking too long to process. I've tried optimizing the code, but it's still not efficient enough. Can someone suggest more efficient ways to parse and extract data from such a large CSV file in Python? Any help would be appreciated!
#2
You can read the data in as chunks, which will save a lot of time.
Also, since you want to display the data, using Pandas may save significant time.
Perhaps reconsider displaying the entire file, as no one will be able to read millions of lines anyway.
What is the ultimate goal?
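One way to read in chunks without any extra dependency is the standard csv module plus itertools.islice. A minimal sketch only (the chunk size is arbitrary, and process_data is assumed to accept a list of tuples):
import csv
from itertools import islice

def extract_data_in_chunks(csv_file, chunk_size=10_000):
    with open(csv_file, 'r', newline='') as file:
        reader = csv.reader(file)
        next(reader)  # skip the header row
        while True:
            # Pull up to chunk_size rows at once instead of handling them one by one
            chunk = list(islice(reader, chunk_size))
            if not chunk:
                break
            data = [(row[1], row[3], row[5]) for row in chunk]
            process_data(data)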
#3
I'm not a pandas user but pandas.read_csv() seems to have a chunksize argument which allows you to read portions of the entire file.
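For illustration, a minimal sketch of that argument (the file name and chunk size are just placeholders):
import pandas as pd

# With chunksize, read_csv returns an iterator of DataFrames instead of one big frame
for chunk in pd.read_csv('large_file.csv', chunksize=100_000):
    print(chunk.shape)  # each chunk holds at most 100_000 rows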
« We can solve any problem by introducing an extra level of indirection »
#4
Pandas is the best tool for dealing with a lot of configurations; nonetheless, in some cases other "lighter" tools can be used (NumPy, for instance).

Do you have the same number of columns in all rows?
Are the values always the same type (float / integer)?

=> If so, you can have a look at np.loadtxt to retrieve an array directly (basic example hereafter)

import numpy as np

# Load the whole file into an integer array (all rows must share the same layout)
M = np.loadtxt('data/sample.csv', delimiter=',', dtype='int64')
#5
I don't know if chunks are important or not. I think you would see significant speed gains if you used pandas to load your csv file and do your processing.
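A minimal sketch of that idea, assuming only columns 1, 3 and 5 are needed (usecols keeps Pandas from parsing the columns you never use):
import pandas as pd

def extract_data(csv_file):
    # Only parse the three columns of interest; everything else is skipped at read time
    df = pd.read_csv(csv_file, usecols=[1, 3, 5])
    process_data(df.itertuples(index=False, name=None))

def process_data(rows):
    for row in rows:
        print(row)

csv_file = 'large_file.csv'
extract_data(csv_file)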
#6
You can also use Polars to speed things up.
For a 1-GB .csv file, Pandas takes about 13.5 seconds versus 350 milliseconds in Polars.

In Pandas your code could look like this.
Then, if this is still slow, use Polars; another option is Dask.
import pandas as pd

def extract_data(csv_file):
    # Use chunksize to read the file in chunks
    chunksize = 10000  
    for chunk in pd.read_csv(csv_file, chunksize=chunksize):  
        # Pull columns 1, 3 and 5 from the chunk as a list of rows
        data = chunk.iloc[:, [1, 3, 5]].values.tolist()
        process_data(data)

def process_data(data):
    for row in data:
        print(row)

csv_file = 'large_file.csv'
extract_data(csv_file)
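For comparison, a rough Polars equivalent of the same extraction (a sketch only; selecting columns by position assumes the same layout as in the original file):
import polars as pl

def extract_data(csv_file):
    df = pl.read_csv(csv_file)
    # Select the same three columns by position as row[1], row[3], row[5]
    subset = df.select([df.columns[i] for i in (1, 3, 5)])
    process_data(subset.iter_rows())

def process_data(rows):
    for row in rows:
        print(row)

csv_file = 'large_file.csv'
extract_data(csv_file)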
#7
Quote: I'm working on a Python script to parse a large CSV file and extract specific data,

Got a sample of your csv? What do you want to extract?

Is the first line column headers?

Look up Data Pipelines:

Quote: Creating Data Pipelines With Generators

Data pipelines allow you to string together code to process large datasets or streams of data without maxing out your machine’s memory. Imagine that you have a large CSV file:
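As a hedged sketch of that idea applied to this case (the column positions are just placeholders):
import csv

def read_rows(csv_file):
    # Stage 1: stream rows lazily from disk, one at a time
    with open(csv_file, 'r', newline='') as file:
        reader = csv.reader(file)
        next(reader)  # skip the header row
        yield from reader

def pick_columns(rows, indexes=(1, 3, 5)):
    # Stage 2: keep only the columns of interest
    for row in rows:
        yield tuple(row[i] for i in indexes)

# String the stages together; nothing beyond the current row is held in memory
for record in pick_columns(read_rows('large_file.csv')):
    print(record)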