Jan-13-2022, 04:18 PM
Hello all,
I am currently using a Python-based ETL that uses Pandas.
I won't post all my code here because most of it isn't relevant, but I will briefly explain my process.
Step 1: Getting the source data
query = pd.read_sql_query('''SELECT * FROM SourceTable''', se)
source = pd.DataFrame(query)
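As far as I can tell, pandas could stream this query in chunks instead (a rough sketch below, with an arbitrary chunk size), but since the comparison needs the whole table, I'd still end up materialising everything:

# Sketch only: read SourceTable in 100k-row chunks instead of one query.
# Each chunk is an ordinary DataFrame; the concat still holds it all in memory.
parts = pd.read_sql_query('SELECT * FROM SourceTable', se, chunksize=100_000)
source = pd.concat(parts, ignore_index=True)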
Step 2: Get the IDs from the target data

query = pd.read_sql_query('''SELECT Id FROM TargetTable''', te)
target = pd.DataFrame(query)
Step 3: Check for new rows

new = source.merge(target.drop_duplicates(), on=['Id'], how='left', indicator=True)
new = new[new['_merge'] == 'left_only']
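(For what it's worth, I believe the same anti-join can be written as a plain isin filter, which skips the indicator copy that merge builds; a minimal sketch, assuming Id is a single comparable column:)

# Sketch: keep source rows whose Id does not appear among the target IDs.
new = source[~source['Id'].isin(target['Id'])]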
Step 4: Insert new rows into TargetTable

new.to_sql(
    'TargetTable',
    te,
    schema='dbo',
    if_exists='append',
    chunksize=1000,
    index=False
)

The process goes on to include updates and deletes.
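For completeness, to_sql also accepts a method argument that batches several rows per INSERT; a sketch of the same call (whether this helps at all presumably depends on the driver):

# Sketch: identical append, but with multiple rows bound per INSERT statement.
new.to_sql(
    'TargetTable',
    te,
    schema='dbo',
    if_exists='append',
    chunksize=1000,
    index=False,
    method='multi'
)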
It's worth pointing out here that there are no "rowversion" or "datetime" columns on the source tables, so it's not possible to identify new or changed rows at the source. As far as I know, that makes it necessary to bring all the data into memory to work out what's new, changed, or deleted. This works fine on small tables, but it's becoming a problem: when a source table has over 2 million rows, the Python script fails outright, as shown below.
Quote:
ConnectException: Connection refused (Connection refused)
Error while obtaining a new communication channel
ConnectException error: This is often caused by an OOM error that causes the connection to the Python REPL to be closed. Check your query's memory usage.
After some investigation, the error isn't really about the connection being refused: the same script runs fine on 1.5 million rows and only fails beyond roughly 2 million. One million rows of data comes to roughly 160 MB in RAM, and even ten times that wouldn't exceed the server's memory capacity. As some of the tables have tens of millions of rows, I need a way to move this data quickly and reliably.
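(A frame's real footprint can be checked with pandas' own accounting; the sketch below is how a figure like the 160 MB above is typically measured:)

# Sketch: measure the in-memory size of the source frame, in MB.
# deep=True counts the bytes actually held by object/string columns.
mb = source.memory_usage(deep=True).sum() / 1024 ** 2
print(f'roughly {mb:.0f} MB')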
Any suggestions?