I want to find all values in a Pandas dataframe that contain whitespace (any arbitrary amount) and replace those values with NaNs.
Any ideas how this can be improved?
Basically I want to turn this:
                   A    B    C
2000-01-01 -0.532681  foo    0
2000-01-02  1.490752  bar    1
2000-01-03 -1.387326  foo    2
2000-01-04  0.814772  baz
2000-01-05 -0.222552         4
2000-01-06 -1.176781  qux
Into this:
                   A    B    C
2000-01-01 -0.532681  foo    0
2000-01-02  1.490752  bar    1
2000-01-03 -1.387326  foo    2
2000-01-04  0.814772  baz  NaN
2000-01-05 -0.222552  NaN    4
2000-01-06 -1.176781  qux  NaN
I've managed to do it with the code below, but man is it ugly. It's not Pythonic and I'm sure it's not the most efficient use of pandas either. I loop through each column and do boolean replacement against a column mask generated by applying a function that does a regex search of each value, matching on whitespace.
for i in df.columns:
    # Mask rows whose value is empty or whitespace-only, then overwrite them
    df[i][df[i].apply(lambda v: True if re.search(r'^\s*$', str(v)) else False)] = None
It could be optimized a bit by only iterating through fields that could contain empty strings:
    if df[i].dtype == np.dtype('object'):
But that's not much of an improvement.
And finally, this code sets the target strings to None, which works with pandas functions like fillna(), but it would be nice for completeness if I could actually insert a NaN directly instead of None.
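For reference, here is a minimal sketch of why None mostly works in practice. The Series is made up for illustration and is not part of the original data; it only shows that `isna()` and `fillna()` treat None and NaN alike in an object column.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: one real value, one None, one NaN, kept as object dtype
s = pd.Series(['foo', None, np.nan], dtype=object)

# Both None and NaN are reported as missing
missing_flags = s.isna().tolist()

# fillna() fills both kinds of missing value
filled = s.fillna('missing').tolist()

print(missing_flags)
print(filled)
```

So the difference is mainly one of consistency: the column ends up holding two different missing-value markers instead of one.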
Best answer 1
I think df.replace() does the job, since pandas 0.13:
import numpy as np
import pandas as pd

df = pd.DataFrame([
    [-0.532681, 'foo', 0],
    [1.490752, 'bar', 1],
    [-1.387326, 'foo', 2],
    [0.814772, 'baz', ' '],
    [-0.222552, ' ', 4],
    [-1.176781, 'qux', ' '],
], columns='A B C'.split(), index=pd.date_range('2000-01-01', '2000-01-06'))

# replace field that's entirely space (or empty) with NaN
print(df.replace(r'^\s*$', np.nan, regex=True))
Produces:
                   A    B    C
2000-01-01 -0.532681  foo    0
2000-01-02  1.490752  bar    1
2000-01-03 -1.387326  foo    2
2000-01-04  0.814772  baz  NaN
2000-01-05 -0.222552  NaN    4
2000-01-06 -1.176781  qux  NaN
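As Temak pointed out, use df.replace(r'^\s+$', np.nan, regex=True) in case your valid data contains white spaces. Note that this pattern requires at least one whitespace character, so empty strings are left as they are.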
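To make the difference between the two patterns concrete, here is a small sketch. The Series is an assumed example, not data from the question. Both patterns are anchored with `^` and `$`, so a value like 'a b' with whitespace inside it is kept either way; they differ only on the empty string.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: empty string, whitespace-only, inner whitespace, plain text
s = pd.Series(['', '   ', 'a b', 'foo'])

# \s* matches zero or more whitespace characters, so '' is replaced too
star = s.replace(r'^\s*$', np.nan, regex=True)

# \s+ needs at least one whitespace character, so '' is left alone
plus = s.replace(r'^\s+$', np.nan, regex=True)

print(star.isna().tolist())
print(plus.isna().tolist())
```

Pick `\s*` if empty strings should count as missing, and `\s+` if an empty string is a legitimate value in your data.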