(Apr-22-2024, 09:51 PM)deanhystad Wrote: Quote: that's a good point using concat instead
That's not what I was trying to say. concat is better than append, but both should be used sparingly. What I was trying to say is that I would use boolean indexing to make the new dataframe, and I would use your late prints identifier to create the boolean list. Maybe I could vectorize some of that process.
I would probably start with a shift of price and time. Now I can compute a change rate (price - shifted_price) / (time - shifted_time). If I see a rapid change, I start marking data rows as suspect. I stop suspecting the data when I see a shift in the opposite direction.
That's more or less the approach I'm taking. Finding the start of late prints is actually very easy - I just shift the dataframe by one row, and if the price change is beyond a certain pricedelta, that's the start. The trick is finding the end of it. Consider the following example price action, and let's set our pricedelta to be 1.00:
180.01
180.01
180.02
180.02
180.03
180.01
181.32 LATE PRINT
181.31 LATE PRINT
181.32 LATE PRINT
180.08
180.08
180.09
180.10
180.19
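As a rough sketch of the one-row shift check on the example above (the names prices and is_jump are just for illustration, and I'm assuming the check compares the absolute change against pricedelta):

```python
import pandas as pd

# Example price action from above; rows 6-8 are the late prints.
prices = pd.Series([180.01, 180.01, 180.02, 180.02, 180.03, 180.01,
                    181.32, 181.31, 181.32,
                    180.08, 180.08, 180.09, 180.10, 180.19])
pricedelta = 1.00

# Compare each row with the previous one (a shift by one row).
# The first row has no predecessor, so its change is NaN and is not flagged.
change = prices - prices.shift(1)
is_jump = change.abs() > pricedelta

# Row positions where the price moved by more than pricedelta.
jump_rows = is_jump[is_jump].index.tolist()
print(jump_rows)
```

On this data the check flags two rows: the first late print, and the first normal row after the run, where the price drops back.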
I feel like once I ID the start of a late sequence, I need to use a for loop because I don't know how many there will be before it goes back to "normal". There are 3 late prints in this case, but I've seen as many as 13 in a row, and even 13 should not be considered an upper limit. If there is no upper limit on the number of bad prints in a row, I don't know how to use a shift to find the end, since that requires knowing how many rows to shift by.
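One possible loop-free way to handle runs of unknown length, following the "stop when I see a shift in the opposite direction" idea: score each big jump up as +1 and each big jump down as -1, then take a running sum. Any row where the running sum is not zero sits between a jump and its opposite jump. This is only a sketch, and it assumes the data starts in a normal state, that every late run ends with a jump back that is also larger than pricedelta, and that a real price move never exceeds pricedelta in one row. The names jump, late and clean are made up for the example.

```python
import numpy as np
import pandas as pd

prices = pd.Series([180.01, 180.01, 180.02, 180.02, 180.03, 180.01,
                    181.32, 181.31, 181.32,
                    180.08, 180.08, 180.09, 180.10, 180.19])
pricedelta = 1.00

# Row-to-row change; the first row is NaN.
step = prices.diff()

# +1 for a big jump up, -1 for a big jump down, 0 otherwise.
# NaN fails the comparison, so the first row becomes 0 as well.
jump = np.sign(step).where(step.abs() > pricedelta, 0.0).astype(int)

# Running sum: nonzero while we are inside a run of late prints,
# back to zero once the opposite jump is seen.
late = jump.cumsum() != 0

# Boolean indexing builds the cleaned series without append or concat.
clean = prices[~late]
print(late.tolist())
```

Because the running sum does not care how many rows sit between the two jumps, the same code covers 3 late prints or 13. It also covers late prints that land below the current price, since the sum goes to -1 and then returns to 0.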