Filling a DataFrame with NaN when several days of data are missing

If I understand correctly, you can drop the unwanted rows with boolean indexing. Assuming you have the day gaps in a column named diff, you can use df.loc[df['diff'].dt.days < 5]

Here is a demonstration:

df = pd.read_clipboard()

               col_1    vals
2017-10-01  0.000000    0.112869
2017-10-02  0.017143    0.112869
2017-10-12  0.003750    0.117274
2017-10-14  0.000000    0.161556
2017-10-17  0.000000    0.116264

Convert the index to a datetime column, and add a new column holding the difference from the previous row in days:

df = df.reset_index()
df['index']=pd.to_datetime(df['index'])
df['diff'] = df['index'] - df['index'].shift(1)


       index    col_1       vals       diff
0   2017-10-01  0.000000    0.112869    NaT
1   2017-10-02  0.017143    0.112869    1 days
2   2017-10-12  0.003750    0.117274    10 days
3   2017-10-14  0.000000    0.161556    2 days
4   2017-10-17  0.000000    0.116264    3 days

Apply a boolean filter:

new_df = df.loc[df['diff'].dt.days < 5]
new_df = new_df.drop('diff', axis=1)
new_df.set_index('index', inplace=True)
new_df

               col_1    vals
index       
2017-10-02  0.017143    0.112869
2017-10-14  0.000000    0.161556
2017-10-17  0.000000    0.116264
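Note that the NaT in the first row compares as False, which is why 2017-10-01 is missing from the filtered output above. If the first row should survive the filter, the gap column can be filled before comparing. A small tweak, not part of the original answer; the data is rebuilt inline here:

```python
import pandas as pd

# Rebuild the sample frame from the demonstration above
df = pd.DataFrame({
    'index': pd.to_datetime(['2017-10-01', '2017-10-02', '2017-10-12',
                             '2017-10-14', '2017-10-17']),
    'col_1': [0.0, 0.017143, 0.00375, 0.0, 0.0],
    'vals': [0.112869, 0.112869, 0.117274, 0.161556, 0.116264],
})
df['diff'] = df['index'] - df['index'].shift(1)

# treat the leading NaT as a zero-day gap so the first row is kept
new_df = df.loc[df['diff'].fillna(pd.Timedelta(0)).dt.days < 5]
```

With this change the filter keeps four rows, including 2017-10-01.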

I have a pandas DataFrame that I interpolate to obtain a daily DataFrame. The original DataFrame looks like this:

               col_1      vals 
2017-10-01  0.000000  0.112869 
2017-10-02  0.017143  0.112869 
2017-10-12  0.003750  0.117274 
2017-10-14  0.000000  0.161556 
2017-10-17  0.000000  0.116264   

In the interpolated DataFrame, I would like to change the data values to NaN wherever the gap between dates was longer than 5 days. For example, in the data above the gap between 2017-10-02 and 2017-10-12 is longer than 5 days, so in the interpolated DataFrame all values between those two dates should be removed. I don't know how to do this; maybe combine_first?

EDIT: the interpolated DataFrame looks like this:

            col_1      vals 
2017-10-01  0.000000  0.112869 
2017-10-02  0.017143  0.112869 
2017-10-03  0.015804  0.113309 
2017-10-04  0.014464  0.113750 
2017-10-05  0.013125  0.114190 
2017-10-06  0.011786  0.114631 
2017-10-07  0.010446  0.115071 
2017-10-08  0.009107  0.115512 
2017-10-09  0.007768  0.115953 
2017-10-10  0.006429  0.116393 
2017-10-11  0.005089  0.116834 
2017-10-12  0.003750  0.117274 
2017-10-13  0.001875  0.139415 
2017-10-14  0.000000  0.161556 
2017-10-15  0.000000  0.146459 
2017-10-16  0.000000  0.131361 
2017-10-17  0.000000  0.116264

Expected output:

               col_1      vals
2017-10-01  0.000000  0.112869
2017-10-02  0.017143  0.112869
2017-10-12  0.003750  0.117274
2017-10-13  0.001875  0.139415
2017-10-14  0.000000  0.161556
2017-10-15  0.000000  0.146459
2017-10-16  0.000000  0.131361
2017-10-17  0.000000  0.116264
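One way to sketch the requested behaviour directly: interpolate to daily frequency, then drop the dates that fall strictly inside any gap longer than 5 days in the original index. The variable names are illustrative, and the sparse frame is rebuilt from the sample above:

```python
import pandas as pd

# Original sparse frame from the question
idx = pd.to_datetime(['2017-10-01', '2017-10-02', '2017-10-12',
                      '2017-10-14', '2017-10-17'])
df = pd.DataFrame({'col_1': [0.0, 0.017143, 0.00375, 0.0, 0.0],
                   'vals': [0.112869, 0.112869, 0.117274, 0.161556, 0.116264]},
                  index=idx)

# Naive daily interpolation, as in the question
daily = df.asfreq('D').interpolate()

# flag original dates that end a gap of more than 5 days
long_gap_end = df.index.to_series().diff() > pd.Timedelta(days=5)

# mask the daily rows that fall strictly inside such a gap
drop = pd.Series(False, index=daily.index)
for pos, end in enumerate(df.index):
    if long_gap_end.iloc[pos]:
        start = df.index[pos - 1]
        drop.loc[start + pd.Timedelta(days=1):end - pd.Timedelta(days=1)] = True

result = daily[~drop]
```

For the sample data this keeps exactly the eight rows shown in the expected output.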

I would first identify where the gaps are larger than five days. From there, I would generate an array that identifies the groups between those gaps. Finally, I would use groupby to convert to daily frequency and interpolate.

import numpy as np
import pandas as pd

# convenience: assign string to variable for easier access
daytype = 'timedelta64[D]'

# define five days for use when evaluating size of gaps
five = np.array(5, dtype=daytype)

# get the size of gaps
deltas = np.diff(df.index.values).astype(daytype)

# identify groups between gaps
groups = np.append(False, deltas > five).cumsum()

# handy function to turn to daily frequency and interpolate
to_daily = lambda x: x.asfreq('D').interpolate()

# and finally...
df.groupby(groups, group_keys=False).apply(to_daily)

               col_1      vals
2017-10-01  0.000000  0.112869
2017-10-02  0.017143  0.112869
2017-10-12  0.003750  0.117274
2017-10-13  0.001875  0.139415
2017-10-14  0.000000  0.161556
2017-10-15  0.000000  0.146459
2017-10-16  0.000000  0.131361
2017-10-17  0.000000  0.116264
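As a quick check of the grouping logic, recomputing groups for the sample index shows how rows on either side of the 10-day gap get different labels:

```python
import numpy as np
import pandas as pd

idx = pd.to_datetime(['2017-10-01', '2017-10-02', '2017-10-12',
                      '2017-10-14', '2017-10-17'])
daytype = 'timedelta64[D]'
five = np.array(5, dtype=daytype)
deltas = np.diff(idx.values).astype(daytype)   # [1, 10, 2, 3] days
groups = np.append(False, deltas > five).cumsum()
print(groups)  # [0 0 1 1 1]
```

The first two dates form group 0 and the last three form group 1, so the interpolation inside each group never crosses the long gap.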

If you want to supply your own interpolation method, you can modify the approach above like this:

daytype = 'timedelta64[D]'
five = np.array(5, dtype=daytype)
deltas = np.diff(df.index.values).astype(daytype)
groups = np.append(False, deltas > five).cumsum()

# custom interpolation function that takes a dataframe
def my_interpolate(df):
    """This can be whatever you want.
    I just provided what will result
    in the same thing as before."""
    return df.interpolate()

to_daily = lambda x: x.asfreq('D').pipe(my_interpolate)

df.groupby(groups, group_keys=False).apply(to_daily)

               col_1      vals
2017-10-01  0.000000  0.112869
2017-10-02  0.017143  0.112869
2017-10-12  0.003750  0.117274
2017-10-13  0.001875  0.139415
2017-10-14  0.000000  0.161556
2017-10-15  0.000000  0.146459
2017-10-16  0.000000  0.131361
2017-10-17  0.000000  0.116264

I added more rows to the example so that there are two blocks of gaps longer than 5 days between rows.
I saved the two tables locally as .csv files, adding date as the first column name, in order to complete the merge below:

Setup

import pandas as pd
import numpy as np
df_1=pd.read_csv('df_1.csv', delimiter=r"\s+")
df_2=pd.read_csv('df_2.csv', delimiter=r"\s+")

Merge (join) the two datasets and rename the columns.
Note that there are two groups with gaps of more than 5 days:

df=df_2.merge(df_1, how='left', on='date').reset_index(drop=True)
df.columns=['date','col','val','col_na','val_na']    # purely aesthetic

df

    date        col         val         col_na      val_na
0   2017-10-01  0.000000    0.112869    0.000000    0.112869
1   2017-10-02  0.017143    0.112869    0.017143    0.112869
2   2017-10-03  0.015804    0.113309    NaN         NaN
3   2017-10-04  0.014464    0.113750    NaN         NaN
4   2017-10-05  0.013125    0.114190    NaN         NaN
5   2017-10-06  0.011786    0.114631    NaN         NaN
6   2017-10-07  0.010446    0.115071    NaN         NaN
7   2017-10-08  0.009107    0.115512    NaN         NaN
8   2017-10-09  0.007768    0.115953    NaN         NaN
9   2017-10-10  0.006429    0.116393    NaN         NaN
10  2017-10-11  0.005089    0.116834    NaN         NaN
11  2017-10-12  0.003750    0.117274    0.003750    0.117274
12  2017-10-13  0.001875    0.139415    NaN         NaN
13  2017-10-14  0.000000    0.161556    0.000000    0.161556
14  2017-10-15  0.000000    0.146459    NaN         NaN
15  2017-10-16  0.000000    0.131361    NaN         NaN
16  2017-10-17  0.000000    0.989999    0.000000    0.116264
17  2017-10-18  0.000000    0.412311    NaN         NaN
18  2017-10-19  0.000000    0.166264    NaN         NaN
19  2017-10-20  0.000000    0.123464    NaN         NaN
20  2017-10-21  0.000000    0.149767    NaN         NaN
21  2017-10-22  0.000000    0.376455    NaN         NaN
22  2017-10-23  0.000000    0.000215    NaN         NaN
23  2017-10-24  0.000000    0.940219    NaN         NaN
24  2017-10-25  0.000000    0.030352    0.000000    0.030352
25  2017-10-26  0.000000    0.111112    NaN         NaN
26  2017-10-27  0.000000    0.002500    NaN         NaN

A function to perform the task:

def my_func(my_df):
    non_na_index = []                                 # empty list of row positions
    for i in range(len(my_df)):
        if not pd.isnull(my_df.iloc[i, 3]):
            non_na_index.append(i)                    # positions of rows with a non-NaN value
    sub = np.roll(non_na_index, shift=-1) - non_na_index  # distance to the next non-NaN row
    sub = sub[:-1]                                    # drop the last element (wrap-around artifact)
    for i in reversed(range(len(sub))):
        if sub[i] >= 5:                               # identify gaps with 5 or more NaN rows in between
            b = non_na_index[i + 1]                   # end position
            a = non_na_index[i] + 1                   # start position
            my_df = my_df.drop(my_df.index[a:b])      # drop the rows within the range
    return my_df

Run the function on df:

new_df = my_func(df)
new_df = new_df.drop(['col_na', 'val_na'], axis=1)    # drop the two helper columns
new_df

    date        col         val
0   2017-10-01  0.000000    0.112869
1   2017-10-02  0.017143    0.112869
11  2017-10-12  0.003750    0.117274
12  2017-10-13  0.001875    0.139415
13  2017-10-14  0.000000    0.161556
14  2017-10-15  0.000000    0.146459
15  2017-10-16  0.000000    0.131361
16  2017-10-17  0.000000    0.989999
24  2017-10-25  0.000000    0.030352
25  2017-10-26  0.000000    0.111112
26  2017-10-27  0.000000    0.002500
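The loop in my_func can also be expressed with vectorized pandas operations. A sketch under the assumption that, as in the merged frame above, the *_na column is non-NaN exactly on the original dates; the toy data here is illustrative, not the question's data:

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the merged layout: val_na is non-NaN only on original dates
df = pd.DataFrame({
    'date': pd.date_range('2017-10-01', periods=12),
    'val': np.arange(12.0),
    'val_na': [0.0, 1.0] + [np.nan] * 9 + [11.0],
})

notna = df['val_na'].notna()
# label each run of consecutive NaNs by the cumulative count of non-NaN rows
run_id = notna.cumsum()
# size of the NaN run each row belongs to
run_len = (~notna).groupby(run_id).transform('sum')
# keep original rows, and interpolated rows only inside short gaps
keep = notna | (run_len < 5)
result = df[keep]
```

The nine interpolated rows inside the long gap are removed in one pass, without any Python-level loop over row positions.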