Jan-13-2022, 04:18 PM
Hello all,
I am currently using a Python-based ETL that uses Pandas.
I won't post all my code here because most of it isn't relevant, but I will briefly explain my process.
Step 1: Getting the source data
query = pd.read_sql_query('''SELECT * FROM SourceTable''', se)
source = pd.DataFrame(query)
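As far as I can tell, pandas could stream this query in chunks instead (a rough sketch below, with an arbitrary chunk size), but since the comparison needs the whole table, I'd still end up materialising everything:

# Sketch only: read SourceTable in 100k-row chunks instead of one query.
# Each chunk is an ordinary DataFrame; the concat still holds it all in memory.
parts = pd.read_sql_query('SELECT * FROM SourceTable', se, chunksize=100_000)
source = pd.concat(parts, ignore_index=True)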
Step 2: Get the IDs from the target data

query = pd.read_sql_query('''SELECT Id FROM TargetTable''', te)
target = pd.DataFrame(query)
Step 3: Check for new rows

new = source.merge(target.drop_duplicates(), on=['Id'], how='left', indicator=True)
new = new[new['_merge'] == 'left_only']
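(For what it's worth, I believe the same anti-join can be written as a plain isin filter, which skips the indicator copy that merge builds; a minimal sketch, assuming Id is a single comparable column:)

# Sketch: keep source rows whose Id does not appear among the target IDs.
new = source[~source['Id'].isin(target['Id'])]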
Step 4: Insert new rows into TargetTable

new.to_sql(
    'TargetTable',
    te,
    schema='dbo',
    if_exists='append',
    chunksize=1000,
    index=False
)

The process goes on to include updates and deletes.
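For completeness, to_sql also accepts a method argument that batches several rows per INSERT; a sketch of the same call (whether this helps at all presumably depends on the driver):

# Sketch: identical append, but with multiple rows bound per INSERT statement.
new.to_sql(
    'TargetTable',
    te,
    schema='dbo',
    if_exists='append',
    chunksize=1000,
    index=False,
    method='multi'
)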
It's worth pointing out here that there are no "rowversion" or "datetime" columns on the source tables, so it's not possible to identify new or changed rows at the source. As far as I know, that makes it necessary to bring all the data into memory to work out what's new, changed, or deleted. This works fine on small tables, but it's becoming a problem: when a source table has over 2 million rows, the Python script fails outright, as shown below.
Quote:
ConnectException: Connection refused (Connection refused)
Error while obtaining a new communication channel
ConnectException error: This is often caused by an OOM error that causes the connection to the Python REPL to be closed. Check your query's memory usage.
After some investigation, the error isn't really about the connection being refused: the same script runs fine on 1.5 million rows and only fails beyond roughly 2 million. One million rows of data comes to roughly 160 MB in RAM, and even ten times that wouldn't exceed the server's memory capacity. As some of the tables have tens of millions of rows, I need a way to move this data quickly and reliably.
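(A frame's real footprint can be checked with pandas' own accounting; the sketch below is how a figure like the 160 MB above is typically measured:)

# Sketch: measure the in-memory size of the source frame, in MB.
# deep=True counts the bytes actually held by object/string columns.
mb = source.memory_usage(deep=True).sum() / 1024 ** 2
print(f'roughly {mb:.0f} MB')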
Any suggestions?