Jan-28-2023, 06:58 PM
(This post was last modified: Jan-28-2023, 06:58 PM by deanhystad.)
I think I understand your question a bit better now.
df1 - Has most of the information you want.
df2 - Has some additional information you would like to add to df1.
Only some rows in df1 should be updated.
df1 and df1 both have a datetime column that can be used to synchronize the two.
How about something like this? It takes half a second to process a 20,000 row dataframe.
df1 - Has most of the information you want.
df2 - Has some additional information you would like to add to df1.
Only some rows in df1 should be updated.
df1 and df1 both have a datetime column that can be used to synchronize the two.
How about something like this? It takes half a second to process a 20,000 row dataframe.
import pandas as pd import numpy as np from time import time # Make some dummy dataframes # df1 has datetime, windspeed and pressure. Pressures are all 0/empty # df2 has datetime and pressure df1 = pd.DataFrame({ "Datetime": pd.date_range("1/1/2022", "1/1/2023", 20000), "WindSpeed": np.random.randint(2, size=20000), "Pressure": np.zeros(20000) }) df2 = pd.DataFrame({ "Datetime": pd.date_range("1/1/2022", "1/1/2023", 15000), "Pressure": np.random.random(15000) * 5 + 1, }) starting_time = time() # We are only interested in wind events. No reason to look at other rows. # We can roll changes back into df1 later on using the row index values. wind_events = df1[df1["WindSpeed"] > 0] # Extract the time and pressure info. Faster to work with these # as lists instead of getting from dataframes each time. a_times = wind_events["Datetime"].values b_times = df2["Datetime"].values a_pressure = wind_events.columns.get_loc("Pressure") b_pressure = df2["Pressure"].values # For each wind event, find the pressure from df2 that has the # closest timestamp. Only accept pressues within 10 minutes of # the wind event. window = np.timedelta64(10, "m") b = 0 for a, a_time in enumerate(a_times): best_time = window + 1 best_index = None for b in range(b, len(b_times)): diff = b_times[b] - a_time if diff > window: break diff = abs(diff) if diff < best_time: best_time = diff best_index = b if best_index is not None and best_time <= window: wind_events.iat[a, a_pressure] = b_pressure[best_index] df1.update(wind_events["Pressure"]) print(time() - starting_time)
Output:0.54099893569946293