python columns函数_Python数据清理

数据丢失在现实生活中是一个问题。机器学习和数据挖掘等领域由于数据缺失导致数据质量差，因此在模型预测的准确性方面面临严峻的问题。在这些领域，缺失值处理是使模型更加准确和有效的关键。

# import the pandas libraryimport pandas as pdimport numpy as npdf = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f','h'],columns=['one', 'two', 'three'])df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])print (df)Python

它将输出如下结果 -

 one two threea 0.077988 0.476149 0.965836b NaN NaN NaNc -0.390208 -0.551605 -2.301950d NaN NaN NaNe -2.000303 -0.788201 1.510072f -0.930230 -0.670473 1.146615g NaN NaN NaNh 0.085100 0.532791 0.887415Shell

使用reindexing，创建了一个缺失值的DataFrame。在输出中，NaN表示不是数字。

检查缺失值

为了更容易地检测缺失值(以及跨越不同的数组dtype)，Pandas提供了isnull()和notnull()函数，它们也是Series和DataFrame对象的方法 -

示例

import pandas as pdimport numpy as npdf = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f','h'],columns=['one', 'two', 'three'])df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])print (df['one'].isnull())Shell

它将输出如下结果 -

a Falseb Truec Falsed Truee Falsef Falseg Trueh FalseName: one, dtype: boolShell

清理/填充缺少数据

Pandas提供了各种方法来清除缺失值。 fillna函数可以通过几种方式用非空数据“填充”NA值，我们在后面的章节中将解释说明。

用标量值替换NaN

以下程序显示了如何将“NaN”替换为“0”。

import pandas as pdimport numpy as npdf = pd.DataFrame(np.random.randn(3, 3), index=['a', 'c', 'e'],columns=['one','two', 'three'])df = df.reindex(['a', 'b', 'c'])print (df)print ("NaN replaced with '0':")print (df.fillna(0))Python

它将输出如下结果 -

 one two threea -0.576991 -0.741695 0.553172b NaN NaN NaNc 0.744328 -1.735166 1.749580NaN replaced with '0': one two threea -0.576991 -0.741695 0.553172b 0.000000 0.000000 0.000000c 0.744328 -1.735166 1.749580Shell

在这里，填充零值; 相反，我们也可以填写任何其他值。

正向和反向填充NA

示例代码

import pandas as pdimport numpy as npdf = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f','h'],columns=['one', 'two', 'three'])df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])print (df.fillna(method='pad'))Python

执行上面示例代码，得到以下输出结果 -

 one two threea 0.077988 0.476149 0.965836b 0.077988 0.476149 0.965836c -0.390208 -0.551605 -2.301950d -0.390208 -0.551605 -2.301950e -2.000303 -0.788201 1.510072f -0.930230 -0.670473 1.146615g -0.930230 -0.670473 1.146615h 0.085100 0.532791 0.887415Shell

　丢失缺失值

如果只想排除缺少的值，则使用dropna函数和axis参数。默认情况下，axis = 0，即沿着一行行查找，这意味着如果行内的任何值是NA，那么排除整行。

示例

import pandas as pdimport numpy as npdf = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f','h'],columns=['one', 'two', 'three'])df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])print (df.dropna())Python

它将输出如下结果 -

 one two threea 0.077988 0.476149 0.965836c -0.390208 -0.551605 -2.301950e -2.000303 -0.788201 1.510072f -0.930230 -0.670473 1.146615h 0.085100 0.532791 0.887415Shell

　替换丢失(或)通用值

很多时候，我们必须用一些特定的值替换一个通用值。可以通过应用替换方法来实现这一点。

用标量值替换NA是fillna()函数的等效行为。

示例代码

import pandas as pdimport numpy as npdf = pd.DataFrame({'one':[10,20,30,40,50,2000],'two':[1000,0,30,40,50,60]})print (df.replace({1000:10,2000:60}))Python

执行上面示例代码，得到以下结果 -

 one two0 10 101 20 02 30 303 40 404 50 505 60 60