Getting the hourly average of a time series dataset

Raskou07 · Nov-24-2020, 03:31 PM

Hello guys, I have a large dataframe of 10000 rows and 2 columns .The first column is the dates(years,months,days,hours,minutes) as index. My dates column doesnt have a stable frequency. I am trying to get the average value of 2nd column of every 1 hour .
Any ideas on how can I iterate this problem for the whole time series?
My df looks like this and I want to get the average value from 20:00 to 21:00 and so on.

time value
2019-03-21 20:03:30.351000 22.6
2019-03-21 20:33:30.392000 22.7
2019-03-21 20:38:30.428000 22.6
2019-03-21 20:43:30.363000 22.7
2019-03-21 21:13:30.382000 22.6
2019-03-21 22:33:34.762000 22.5
2019-03-21 22:38:34.696000 22.6

michael1789 · Nov-24-2020, 03:38 PM

Something like this maybe?:

number_of_entries = 0
total = 0
For time in column:
    number_of_entries += 1
    total += time
average = total/number_of_entries

Raskou07 · Nov-24-2020, 09:19 PM

(Nov-24-2020, 03:38 PM)michael1789 Wrote: Something like this maybe?:
number_of_entries = 0
total = 0
For time in column:
    number_of_entries += 1
    total += time
average = total/number_of_entries
    

No I am trying to get the average of every hour. For example if 4 values are in the range of 1 hour, I am trying to get the average of them, if I have 2 values in the next hour I am trying to get the average of those 2. Also, how does this code iterates the datetime index every 1 hour?

palladium · Nov-25-2020, 03:01 AM

You may want to use df.period.hour() to extract the hour from your datetime, then define a function to get the values for a given hour on a particular date and compute the average. If you can show us what you have tried we can help you if you get stuck.

Read about Periods here: https://pandas.pydata.org/pandas-docs/st...eriod.html

Raskou07 · Nov-30-2020, 03:53 PM

count = 0
for year, z in df.groupby(df.index.map(lambda t: t.year)):
       for month, x  in df.groupby(df.index.map(lambda t: t.month)):
               for day, i in df.groupby(df.index.map(lambda t: t.day)):
                      for hour, j  in df.groupby(df.index.map(lambda t: t.hour)):
                          count = count + 1

This is what I was trying to do, problem is that it doesnt work as I thought it would be.
I want to check for example for the hour 14:00:00 of day 1 from my dataset and see how many values are in that range.Then get the average of them and store them in a new dataframe.Then keep doing this for every day and every hour in my dataset.

Raskou07 · Nov-30-2020, 04:06 PM

(Nov-25-2020, 03:01 AM)palladium Wrote: You may want to use df.period.hour() to extract the hour from your datetime, then define a function to get the values for a given hour on a particular date and compute the average. If you can show us what you have tried we can help you if you get stuck.

Read about Periods here: https://pandas.pydata.org/pandas-docs/st...eriod.html

count = 0
for year, z in df.groupby(df.index.map(lambda t: t.year)):
       for month, x  in df.groupby(df.index.map(lambda t: t.month)):
               for day, i in df.groupby(df.index.map(lambda t: t.day)):
                      for hour, j  in df.groupby(df.index.map(lambda t: t.hour)):
                          count = count + 1

This is what I was trying to do, problem is that it doesnt work as I thought it would be.
I want to check for example for the hour 14:00:00 of day 1 from my dataset and see how many values are in that range.Then get the average of them and store them in a new dataframe.Then keep doing this for every day and every hour in my dataset.

palladium · (This post was last modified: Dec-02-2020, 10:02 AM by palladium.)

import pandas as pd
from datetime import datetime
import numpy as np

def get_hour(string):
    """Gets hour from datestring"""
    hr = datetime.fromisoformat(string).hour
    return hr

def get_date(string):
    """Gets date from datestring"""
    date = datetime.strptime(string, '%Y-%m-%d %H:%M:%S.%f')
    return str(date.date())

def get_average(df):
    average = df['value'].sum()/len(df['value'])
    return average

df = pd.read_csv('pandas.csv', header = 0, names = ['time', 'value'])
print(df)
hr = [get_hour(i) for i in df['time']]
date = [get_date(i) for i in df['time']]

df['hour'] = hr  #I'm sure there is a way to vectorize this part
# df['hour'] = datetime.fromisoformat(df['time']).hour # does not work - need help here
                                                    # because something like this would be faster
df['date'] = date
df.drop('time', axis = 1, inplace = True) #cleanup

unique_dates = np.unique(df['date']) #returns list

for u in unique_dates: #this probably can be sped up too with large datasets - help!
    unique_df = df[df['date'] == u] #returns df containing only u
    unique_hr = np.unique(unique_df['hour']) #returns list
    for h in unique_hr:
        unique_hr_df = unique_df[unique_df['hour'] == h]
        average = get_average(unique_hr_df)
        print('Date:', u, 'Hour:', h, "Average:", average)

Gives the following output:

Output:                         time  value
0  2019-03-21 20:03:30.351000   22.6
1  2019-03-21 20:33:30.392000   22.7
2  2019-03-21 20:38:30.428000   22.6
3  2019-03-23 20:43:30.363000   22.7
4  2019-03-23 21:13:30.382000   22.6
5  2019-03-23 22:33:34.762000   22.5
6  2019-03-23 22:38:34.696000   22.6
Date: 2019-03-21 Hour: 20 Average: 22.633333333333336
Date: 2019-03-23 Hour: 20 Average: 22.7
Date: 2019-03-23 Hour: 21 Average: 22.6
Date: 2019-03-23 Hour: 22 Average: 22.55

Note I changed your original dataset so that it has 2 different dates, for illustrative purposes.

I am sure some parts of the code can be sped up - see comments in code above. If some pandas guru can give me some tips that would be appreciated.

Raskou07 · (This post was last modified: Dec-14-2020, 05:41 PM by Raskou07.)

Hello again and thanks for trying to help me.I've tried what you are doing and seems good to me, but i get this error:
ValueError: Invalid isoformat string: '2019-03-21 14:00:00.011000 '
Note this is the first date of my .csv so this means it doesnt get through dates at all.
Any ideas on how to fix this?

palladium · (This post was last modified: Dec-15-2020, 09:31 AM by palladium.)

I plugged in the problematic string you pointed out into my csv file and it worked for me.

Can you post the entire traceback?

Also, check that your csv file is formatted like your OP post, but with commas separating the time and the value. My csv looks like this in Notepad:

time, value
2019-03-21 20:03:30.351000, 22.6
2019-03-21 20:33:30.392000, 22.7
2019-03-21 20:38:30.428000, 22.6
2019-03-23 20:43:30.363000, 22.7
2019-03-23 21:13:30.382000, 22.6
2019-03-23 22:33:34.762000, 22.5
2019-03-23 22:38:34.696000, 22.6
2019-03-21 14:00:00.011000, 22.7

Otherwise, you may need to tweak the code that gets the datetime info from the string, i.e. the get_date and get_hour function. Readup the datetime docs and see what works for you.

Raskou07 · (This post was last modified: Dec-15-2020, 11:09 AM by Raskou07.)

Traceback (most recent call last):
  File "C:/Users/..../PycharmProjects/...../average_per_hour.py", line 25, in <module>
    hr = [get_hour(i) for i in df['time']]
  File "C:/Users/...../PycharmProjects/...../average_per_hour.py", line 25, in <listcomp>
    hr = [get_hour(i) for i in df['time']]
  File "C:/Users/...../PycharmProjects/...../average_per_hour.py", line 8, in get_hour
    hr = datetime.fromisoformat(string).hour
ValueError: Invalid isoformat string: '2019-03-21 14:00:00.011000 '

This is the traceback. My .csv file in notepad looks like this(same as yours)

time,value
2019-03-21 14:00:00.011000 ,22.4
2019-03-21 15:00:00.008000 ,22.8
2019-03-21 16:00:00.016000 ,23.0
2019-03-21 17:00:00.012000 ,22.9
2019-03-21 18:00:00.011000 ,22.8
2019-03-21 19:00:00.010000 ,22.7
2019-03-21 20:03:30.351000 ,22.6
2019-03-21 20:33:30.392000 ,22.7
2019-03-21 20:38:30.428000 ,22.6
2019-03-21 20:43:30.363000 ,22.7
2019-03-21 21:13:30.382000 ,22.6

Possibly Related Threads…
Thread		Author	Replies	Views	Last Post
	Date Time Series Help...Please	spra8560	2	479	Feb-01-2024, 01:38 PM Last Post: spra8560
	Print names in x-axis of a time-series values	hobbyist	4	1,337	Apr-22-2023, 09:29 PM Last Post: deanhystad
	Time series	JasminQuinn	0	1,072	Apr-22-2022, 10:33 PM Last Post: JasminQuinn
	How to read rainfall time series and insert missing data points	MadsM	4	2,324	Jan-06-2022, 10:39 AM Last Post: amdi40
	Plotting A Time Series With Shaded Recession Bars	adamszymanski	1	3,248	Jan-24-2021, 09:08 PM Last Post: nealc
	Tableau Time Series Prediction using Python Integration	tobimarsh43	0	1,977	Jul-24-2020, 10:38 AM Last Post: tobimarsh43
	Bode plot from time series experiment data	discus	4	7,586	Jun-20-2020, 07:46 AM Last Post: discus
	Convert quarterly time series to monthly time series	donnertrud	1	5,317	May-22-2020, 10:16 AM Last Post: pyzyx3qwerty
	Parsing Date/Time from Metar Reports with 6 hourly weather information	Lawrence	0	2,426	May-03-2020, 08:15 PM Last Post: Lawrence
	Time series manipulation	SinPy	0	1,464	Apr-01-2020, 12:48 PM Last Post: SinPy

Getting the hourly average of a time series dataset

User Panel Messages

Announcements