Posts: 6
Threads: 1
Joined: Nov 2020
Hello guys, I have a large dataframe of 10000 rows and 2 columns .The first column is the dates(years,months,days,hours,minutes) as index. My dates column doesnt have a stable frequency. I am trying to get the average value of 2nd column of every 1 hour .
Any ideas on how can I iterate this problem for the whole time series?
My df looks like this and I want to get the average value from 20:00 to 21:00 and so on.
time value
2019-03-21 20:03:30.351000 22.6
2019-03-21 20:33:30.392000 22.7
2019-03-21 20:38:30.428000 22.6
2019-03-21 20:43:30.363000 22.7
2019-03-21 21:13:30.382000 22.6
2019-03-21 22:33:34.762000 22.5
2019-03-21 22:38:34.696000 22.6
Posts: 419
Threads: 34
Joined: May 2019
Something like this maybe?:
number_of_entries = 0
total = 0
For time in column:
number_of_entries += 1
total += time
average = total/number_of_entries
Posts: 6
Threads: 1
Joined: Nov 2020
(Nov-24-2020, 03:38 PM)michael1789 Wrote: Something like this maybe?:
number_of_entries = 0
total = 0
For time in column:
number_of_entries += 1
total += time
average = total/number_of_entries
No I am trying to get the average of every hour. For example if 4 values are in the range of 1 hour, I am trying to get the average of them, if I have 2 values in the next hour I am trying to get the average of those 2. Also, how does this code iterates the datetime index every 1 hour?
Posts: 64
Threads: 16
Joined: Nov 2019
You may want to use df.period.hour() to extract the hour from your datetime, then define a function to get the values for a given hour on a particular date and compute the average. If you can show us what you have tried we can help you if you get stuck.
Read about Periods here: https://pandas.pydata.org/pandas-docs/st...eriod.html
Posts: 6
Threads: 1
Joined: Nov 2020
count = 0
for year, z in df.groupby(df.index.map(lambda t: t.year)):
for month, x in df.groupby(df.index.map(lambda t: t.month)):
for day, i in df.groupby(df.index.map(lambda t: t.day)):
for hour, j in df.groupby(df.index.map(lambda t: t.hour)):
count = count + 1 This is what I was trying to do, problem is that it doesnt work as I thought it would be.
I want to check for example for the hour 14:00:00 of day 1 from my dataset and see how many values are in that range.Then get the average of them and store them in a new dataframe.Then keep doing this for every day and every hour in my dataset.
Posts: 6
Threads: 1
Joined: Nov 2020
(Nov-25-2020, 03:01 AM)palladium Wrote: You may want to use df.period.hour() to extract the hour from your datetime, then define a function to get the values for a given hour on a particular date and compute the average. If you can show us what you have tried we can help you if you get stuck.
Read about Periods here: https://pandas.pydata.org/pandas-docs/st...eriod.html
count = 0
for year, z in df.groupby(df.index.map(lambda t: t.year)):
for month, x in df.groupby(df.index.map(lambda t: t.month)):
for day, i in df.groupby(df.index.map(lambda t: t.day)):
for hour, j in df.groupby(df.index.map(lambda t: t.hour)):
count = count + 1 This is what I was trying to do, problem is that it doesnt work as I thought it would be.
I want to check for example for the hour 14:00:00 of day 1 from my dataset and see how many values are in that range.Then get the average of them and store them in a new dataframe.Then keep doing this for every day and every hour in my dataset.
Posts: 64
Threads: 16
Joined: Nov 2019
Dec-02-2020, 10:01 AM
(This post was last modified: Dec-02-2020, 10:02 AM by palladium.)
import pandas as pd
from datetime import datetime
import numpy as np
def get_hour(string):
"""Gets hour from datestring"""
hr = datetime.fromisoformat(string).hour
return hr
def get_date(string):
"""Gets date from datestring"""
date = datetime.strptime(string, '%Y-%m-%d %H:%M:%S.%f')
return str(date.date())
def get_average(df):
average = df['value'].sum()/len(df['value'])
return average
df = pd.read_csv('pandas.csv', header = 0, names = ['time', 'value'])
print(df)
hr = [get_hour(i) for i in df['time']]
date = [get_date(i) for i in df['time']]
df['hour'] = hr #I'm sure there is a way to vectorize this part
# df['hour'] = datetime.fromisoformat(df['time']).hour # does not work - need help here
# because something like this would be faster
df['date'] = date
df.drop('time', axis = 1, inplace = True) #cleanup
unique_dates = np.unique(df['date']) #returns list
for u in unique_dates: #this probably can be sped up too with large datasets - help!
unique_df = df[df['date'] == u] #returns df containing only u
unique_hr = np.unique(unique_df['hour']) #returns list
for h in unique_hr:
unique_hr_df = unique_df[unique_df['hour'] == h]
average = get_average(unique_hr_df)
print('Date:', u, 'Hour:', h, "Average:", average)
Gives the following output:
Output: time value
0 2019-03-21 20:03:30.351000 22.6
1 2019-03-21 20:33:30.392000 22.7
2 2019-03-21 20:38:30.428000 22.6
3 2019-03-23 20:43:30.363000 22.7
4 2019-03-23 21:13:30.382000 22.6
5 2019-03-23 22:33:34.762000 22.5
6 2019-03-23 22:38:34.696000 22.6
Date: 2019-03-21 Hour: 20 Average: 22.633333333333336
Date: 2019-03-23 Hour: 20 Average: 22.7
Date: 2019-03-23 Hour: 21 Average: 22.6
Date: 2019-03-23 Hour: 22 Average: 22.55
Note I changed your original dataset so that it has 2 different dates, for illustrative purposes.
I am sure some parts of the code can be sped up - see comments in code above. If some pandas guru can give me some tips that would be appreciated.
Posts: 6
Threads: 1
Joined: Nov 2020
Dec-14-2020, 05:40 PM
(This post was last modified: Dec-14-2020, 05:41 PM by Raskou07.)
Hello again and thanks for trying to help me.I've tried what you are doing and seems good to me, but i get this error:
ValueError: Invalid isoformat string: '2019-03-21 14:00:00.011000 '
Note this is the first date of my .csv so this means it doesnt get through dates at all.
Any ideas on how to fix this?
Posts: 64
Threads: 16
Joined: Nov 2019
Dec-15-2020, 09:31 AM
(This post was last modified: Dec-15-2020, 09:31 AM by palladium.)
I plugged in the problematic string you pointed out into my csv file and it worked for me.
Can you post the entire traceback?
Also, check that your csv file is formatted like your OP post, but with commas separating the time and the value. My csv looks like this in Notepad:
time, value
2019-03-21 20:03:30.351000, 22.6
2019-03-21 20:33:30.392000, 22.7
2019-03-21 20:38:30.428000, 22.6
2019-03-23 20:43:30.363000, 22.7
2019-03-23 21:13:30.382000, 22.6
2019-03-23 22:33:34.762000, 22.5
2019-03-23 22:38:34.696000, 22.6
2019-03-21 14:00:00.011000, 22.7 Otherwise, you may need to tweak the code that gets the datetime info from the string, i.e. the get_date and get_hour function. Readup the datetime docs and see what works for you.
Posts: 6
Threads: 1
Joined: Nov 2020
Dec-15-2020, 11:09 AM
(This post was last modified: Dec-15-2020, 11:09 AM by Raskou07.)
Traceback (most recent call last):
File "C:/Users/..../PycharmProjects/...../average_per_hour.py", line 25, in <module>
hr = [get_hour(i) for i in df['time']]
File "C:/Users/...../PycharmProjects/...../average_per_hour.py", line 25, in <listcomp>
hr = [get_hour(i) for i in df['time']]
File "C:/Users/...../PycharmProjects/...../average_per_hour.py", line 8, in get_hour
hr = datetime.fromisoformat(string).hour
ValueError: Invalid isoformat string: '2019-03-21 14:00:00.011000 ' This is the traceback. My .csv file in notepad looks like this(same as yours)
time,value
2019-03-21 14:00:00.011000 ,22.4
2019-03-21 15:00:00.008000 ,22.8
2019-03-21 16:00:00.016000 ,23.0
2019-03-21 17:00:00.012000 ,22.9
2019-03-21 18:00:00.011000 ,22.8
2019-03-21 19:00:00.010000 ,22.7
2019-03-21 20:03:30.351000 ,22.6
2019-03-21 20:33:30.392000 ,22.7
2019-03-21 20:38:30.428000 ,22.6
2019-03-21 20:43:30.363000 ,22.7
2019-03-21 21:13:30.382000 ,22.6
|