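1. Reindexing
reindex returns a new object whose data conforms to the new index; labels that were not in the original index get NaN:
In [1]: import numpy as np
In [2]: from pandas import Series, DataFrame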
In [3]: obj = Series([4.5,7.2,-5.3,3.6], index=["d","b","a","c"])
In [4]: obj
Out[4]:
d 4.5
b 7.2
a -5.3
c 3.6
dtype: float64
In [6]: obj2 = obj.reindex(["a","b","c","d","e"])
In [7]: obj2
Out[7]:
a -5.3
b 7.2
c 3.6
d 4.5
e NaN
dtype: float64
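The NaN for label "e" above is only the default. As a minimal sketch, assuming a substitute value such as 0 makes sense for the data, the `fill_value` argument of `reindex` can supply it. The variable names `with_nan` and `with_zero` are illustrative.

```python
import pandas as pd

obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=["d", "b", "a", "c"])

# Default behaviour: the new label "e" has no source value, so it becomes NaN.
with_nan = obj.reindex(["a", "b", "c", "d", "e"])

# fill_value replaces the NaN that reindexing would otherwise introduce.
with_zero = obj.reindex(["a", "b", "c", "d", "e"], fill_value=0)

print(with_nan)
print(with_zero)
```

Neither call modifies `obj`; each returns a new Series.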
Passing method="ffill" forward-fills the gaps, so each new label takes the last preceding value:
In [8]: obj3 = Series(["blue","purple","yellow"], index=[0,2,4])
In [9]: obj3.reindex(range(6), method="ffill")
Out[9]:
0 blue
1 blue
2 purple
3 purple
4 yellow
5 yellow
dtype: object
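For comparison, a short sketch of forward fill next to backward fill (`method="bfill"`), using the same Series as above. The names `forward` and `backward` are illustrative.

```python
import pandas as pd

obj3 = pd.Series(["blue", "purple", "yellow"], index=[0, 2, 4])

# Forward fill while reindexing: each gap takes the last preceding value.
forward = obj3.reindex(range(6), method="ffill")

# Backward fill: each gap takes the next following value. Label 5 has
# nothing after it, so it stays missing.
backward = obj3.reindex(range(6), method="bfill")

print(forward)
print(backward)
```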
2. Dropping entries from an axis
The drop method returns a new object with the specified labels removed from the given axis:
In [12]: obj = Series(np.arange(5.), index=["a","b","c","d","e"])
In [13]: new_obj = obj.drop("c")
In [14]: new_obj
Out[14]:
a 0.0
b 1.0
d 3.0
e 4.0
dtype: float64
On a DataFrame, drop can remove labels from either axis.
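A minimal sketch of dropping from each axis of a DataFrame. The frame is rebuilt here from the 4x4 `data` output printed in section 3; its construction via `np.arange(16)` is an assumption that reproduces those values.

```python
import numpy as np
import pandas as pd

# Assumed construction; it matches the values printed for `data` in section 3.
data = pd.DataFrame(
    np.arange(16).reshape((4, 4)),
    index=["Ohio", "Colorado", "Utah", "New York"],
    columns=["one", "two", "three", "four"],
)

# Drop rows by label (axis=0 is the default).
without_rows = data.drop(["Colorado", "Ohio"])

# Drop columns by label by passing axis=1.
without_cols = data.drop(["two", "four"], axis=1)

print(without_rows)
print(without_cols)
```

Both calls return new objects and leave `data` unchanged.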
3. Indexing, selection, and filtering
A Series can be indexed by labels other than integers:
In [4]: obj = Series(np.arange(4.), index=["a","b","c","d"])
Out[6]:
a 0.0
b 1.0
dtype: float64
In [7]: obj[obj<2]
Out[7]:
a 0.0
b 1.0
dtype: float64
Slicing a Series by label differs from ordinary Python slicing: the endpoint is included.
In [8]: obj["b":"c"]
Out[8]:
b 1.0
c 2.0
dtype: float64
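The difference between label slices and position slices can be sketched as follows. Here `loc` and `iloc` make the intent explicit, and the names `by_label`, `by_position`, and `updated` are illustrative.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4.), index=["a", "b", "c", "d"])

# Label-based slice: both endpoints "b" and "c" are included.
by_label = obj.loc["b":"c"]

# Position-based slice: follows Python rules, so position 3 is excluded.
by_position = obj.iloc[1:3]

# Assigning through a label slice updates the included endpoints too.
updated = obj.copy()
updated.loc["b":"c"] = 5.0

print(by_label)
print(by_position)
print(updated)
```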
Indexing a DataFrame with a column name selects that column, while slicing selects rows:
In [10]: data
Out[10]:
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
Utah 8 9 10 11
New York 12 13 14 15
In [11]: data['two']
Out[11]:
Ohio 1
Colorado 5
Utah 9
New York 13
Name: two, dtype: int32
In [12]: data[:2]
Out[12]:
one two three four
Ohio 0 1 2 3
Colorado 4 5 6 7
Comparing a DataFrame with a scalar yields a boolean DataFrame, which can itself be used for indexing:
In [13]: data > 5
Out[13]:
one two three four
Ohio False False False False
Colorado False False True True
Utah True True True True
New York True True True True
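A short sketch of two common uses of boolean results: filtering rows with a boolean Series, and overwriting cells through a boolean DataFrame mask. The frame is rebuilt from the printed `data` output (an assumed construction), and `rows` and `masked` are illustrative names.

```python
import numpy as np
import pandas as pd

# Assumed construction; it matches the values printed for `data` above.
data = pd.DataFrame(
    np.arange(16).reshape((4, 4)),
    index=["Ohio", "Colorado", "Utah", "New York"],
    columns=["one", "two", "three", "four"],
)

# A boolean Series selects the rows where the condition holds.
rows = data[data["three"] > 5]

# A boolean DataFrame acts as a cell-wise mask for assignment.
masked = data.copy()
masked[masked < 5] = 0

print(rows)
print(masked)
```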
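ix selects a subset of rows and columns. Note that ix was deprecated in pandas 0.20 and removed in pandas 1.0, so the calls below run only on older versions; current code should use loc (labels) or iloc (positions):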
In [18]: data.ix['Colorado',['two','three']]
Out[18]:
two 5
three 6
Name: Colorado, dtype: int32
In [19]: data.ix[['Colorado','Utah'],[3,0,1]]
Out[19]:
four one two
Colorado 7 4 5
Utah 11 8 9
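Since `ix` is not available in pandas 1.0 and later, the same two selections can be sketched with `loc` and `iloc`. The frame is rebuilt from the printed `data` output (an assumed construction), and the result names are illustrative.

```python
import numpy as np
import pandas as pd

# Assumed construction; it matches the values printed for `data` above.
data = pd.DataFrame(
    np.arange(16).reshape((4, 4)),
    index=["Ohio", "Colorado", "Utah", "New York"],
    columns=["one", "two", "three", "four"],
)

# Row label plus column labels: loc is purely label-based.
one_row = data.loc["Colorado", ["two", "three"]]

# Row labels plus column positions: translate the positions to labels first.
mixed = data.loc[["Colorado", "Utah"], data.columns[[3, 0, 1]]]

# The same selection expressed purely by position with iloc.
by_position = data.iloc[[1, 2], [3, 0, 1]]

print(one_row)
print(mixed)
```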
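4. Arithmetic and data alignment
When objects with different indexes are combined arithmetically, the index of the result is the union of the two; labels present in only one operand produce NaN: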
In [20]: s1 = Series([7.3,-2.5,3.4,1.5],index=['a','c','d','e'])
In [21]: s2 = Series([-2.1, 3.6, -1.5, 4, 3.1],index=['a','c','e','f','g'])
In [22]: s1+s2
Out[22]:
a 5.2
c 1.1
d NaN
e 0.0
f NaN
g NaN
dtype: float64
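If NaN for non-overlapping labels is not wanted, one option is the `add` method with `fill_value`, sketched here for the same two Series. The names `plain` and `filled` are illustrative.

```python
import pandas as pd

s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=["a", "c", "d", "e"])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=["a", "c", "e", "f", "g"])

# The + operator: labels found in only one Series give NaN.
plain = s1 + s2

# add with fill_value=0: a label missing from one side is treated as 0 there.
filled = s1.add(s2, fill_value=0)

print(plain)
print(filled)
```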
For a DataFrame, alignment happens on both the rows and the columns:
In [26]: df1
Out[26]:
b d e
Utah 0.0 1.0 2.0
Ohio 3.0 4.0 5.0
Texas 6.0 7.0 8.0
Oregon 9.0 10.0 11.0
In [27]: df2
Out[27]:
b c d
Ohio 0.0 1.0 2.0
Texas 3.0 4.0 5.0
Colorado 6.0 7.0 8.0
In [28]: df1+df2
Out[28]:
b c d e
Colorado NaN NaN NaN NaN
Ohio 3.0 NaN 6.0 NaN
Oregon NaN NaN NaN NaN
Texas 9.0 NaN 12.0 NaN
Utah NaN NaN NaN NaN
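The add method with fill_value treats a value missing from one operand as 0; cells missing from both operands stay NaN: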
In [30]: df2.add(df1,fill_value=0)
Out[30]:
b c d e
Colorado 6.0 7.0 8.0 NaN
Ohio 3.0 1.0 6.0 5.0
Oregon 9.0 NaN 10.0 11.0
Texas 9.0 4.0 12.0 8.0
Utah 0.0 NaN 1.0 2.0
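The transcript does not show how `df1` and `df2` were built. The sketch below assumes a construction with `np.arange` that reproduces the printed values, then repeats both operations so that the remaining NaN cells can be checked.

```python
import numpy as np
import pandas as pd

# Assumed constructions; they match the printed df1 and df2.
df1 = pd.DataFrame(
    np.arange(12.).reshape((4, 3)),
    columns=list("bde"),
    index=["Utah", "Ohio", "Texas", "Oregon"],
)
df2 = pd.DataFrame(
    np.arange(9.).reshape((3, 3)),
    columns=list("bcd"),
    index=["Ohio", "Texas", "Colorado"],
)

# Plain addition: only cells present in both frames get a value.
plain = df1 + df2

# fill_value=0 fills a cell missing from one frame. A cell missing from
# both (for example Colorado/e) is still NaN.
filled = df2.add(df1, fill_value=0)

print(plain)
print(filled)
```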
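5. Operations between DataFrame and Series
First, consider the difference between a two-dimensional array and one of its rows; NumPy broadcasts the row across every row of the array: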
In [31]: arr = np.arange(12.).reshape((3,4))
In [32]: arr
Out[32]:
array([[ 0., 1., 2., 3.],
[ 4., 5., 6., 7.],
[ 8., 9., 10., 11.]])
In [33]: arr - arr[1]
Out[33]:
array([[-4., -4., -4., -4.],
[ 0., 0., 0., 0.],
[ 4., 4., 4., 4.]])
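As a complement, a small sketch of broadcasting in the other direction in plain NumPy: subtracting a column requires keeping it two-dimensional. The names `row_diff` and `col_diff` are illustrative.

```python
import numpy as np

arr = np.arange(12.).reshape((3, 4))

# Subtracting a row: the 1-D row of shape (4,) is broadcast down the 3 rows.
row_diff = arr - arr[1]

# Subtracting a column: keep it 2-D with shape (3, 1) so that it is
# broadcast across the 4 columns instead.
col_diff = arr - arr[:, [0]]

print(row_diff)
print(col_diff)
```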
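Operations between a DataFrame and a Series work similarly. By default, the index of the Series is matched against the columns of the DataFrame and the operation is broadcast down the rows: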
In [35]: frame = DataFrame(np.arange(12.).reshape((4,3)),columns=list('bde'),index=['Utah','Ohio','Texas','Oregon'])
In [39]: series = frame.iloc[0]
In [40]: frame
Out[40]:
b d e
Utah 0.0 1.0 2.0
Ohio 3.0 4.0 5.0
Texas 6.0 7.0 8.0
Oregon 9.0 10.0 11.0
In [41]: series
Out[41]:
b 0.0
d 1.0
e 2.0
Name: Utah, dtype: float64
In [43]: frame - series
Out[43]:
b d e
Utah 0.0 0.0 0.0
Ohio 3.0 3.0 3.0
Texas 6.0 6.0 6.0
Oregon 9.0 9.0 9.0
If a label is not found in both the DataFrame's columns and the Series index, the two objects are reindexed to form the union:
In [45]: frame + series2
Out[45]:
b d e f
Utah 0.0 NaN 3.0 NaN
Ohio 3.0 NaN 6.0 NaN
Texas 6.0 NaN 9.0 NaN
Oregon 9.0 NaN 12.0 NaN
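The transcript never defines `series2`. The sketch below uses a hypothetical definition, chosen only because it reproduces the output shown above: labels b, e, f with values 0, 1, 2 (the value for f cannot be inferred from the output and is a guess).

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(12.).reshape((4, 3)),
    columns=list("bde"),
    index=["Utah", "Ohio", "Texas", "Oregon"],
)

# Hypothetical definition: not in the source, picked to match the output.
series2 = pd.Series(range(3), index=["b", "e", "f"])

# Columns are aligned on the union {b, d, e, f}. "d" is missing from
# series2 and "f" is missing from frame, so both columns are all NaN.
result = frame + series2

print(result)
```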
To match on the rows and broadcast across the columns instead, use one of the arithmetic methods with axis=0:
In [46]: series3 = frame['d']
In [47]: frame.sub(series3, axis=0)
Out[47]:
b d e
Utah -1.0 0.0 1.0
Ohio -1.0 0.0 1.0
Texas -1.0 0.0 1.0
Oregon -1.0 0.0 1.0
6. Function application and mapping
NumPy ufuncs (element-wise array functions) also work on pandas objects:
In [49]: frame = DataFrame(np.random.randn(4,3), columns=list('bde'),index=['Utah','Ohio','Texas','Oregon'])
In [50]: frame
Out[50]:
b d e
Utah 0.913051 -1.289725 -0.590573
Ohio 1.417612 -1.835357 -0.010755
Texas 0.328839 -0.121878 -1.209583
Oregon 1.315330 -1.026557 -1.777427
In [51]: np.abs(frame)
Out[51]:
b d e
Utah 0.913051 1.289725 0.590573
Ohio 1.417612 1.835357 0.010755
Texas 0.328839 0.121878 1.209583
Oregon 1.315330 1.026557 1.777427
The apply method of DataFrame applies a function to the one-dimensional arrays formed by each column (the default) or each row (axis=1):
In [52]: f = lambda x:x.max() - x.min()
In [53]: frame.apply(f)
Out[53]:
b 1.088773
d 1.713479
e 1.766671
dtype: float64
In [54]: frame.apply(f, axis=1)
Out[54]:
Utah 2.202776
Ohio 3.252969
Texas 1.538421
Oregon 3.092757
dtype: float64
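The function passed to `apply` does not have to return a scalar. As a sketch, the hypothetical helper `min_max` below returns a Series, so `apply` yields a DataFrame. Element-wise work on a single column can be done with `Series.map`. The frame reuses the values printed above.

```python
import pandas as pd

# Values copied from the printed frame above.
frame = pd.DataFrame(
    [[0.913051, -1.289725, -0.590573],
     [1.417612, -1.835357, -0.010755],
     [0.328839, -0.121878, -1.209583],
     [1.315330, -1.026557, -1.777427]],
    columns=list("bde"),
    index=["Utah", "Ohio", "Texas", "Oregon"],
)

def min_max(x):
    # Hypothetical helper: returns two labelled values per column.
    return pd.Series([x.min(), x.max()], index=["min", "max"])

# One result column per input column, one row per returned label.
summary = frame.apply(min_max)

# Series.map applies a Python function to each element of one column.
formatted = frame["e"].map(lambda v: f"{v:.2f}")

print(summary)
print(formatted)
```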
7. Sorting and ranking
The sort_index method returns a new object sorted by index:
In [57]: obj = Series(range(4), index=['d','a','b','c'])
In [58]: obj
Out[58]:
d 0
a 1
b 2
c 3
dtype: int64
In [59]: obj.sort_index()
Out[59]:
a 1
b 2
c 3
d 0
dtype: int64
In [62]: frame.sort_index()
Out[62]:
b d e
Ohio 1.417612 -1.835357 -0.010755
Oregon 1.315330 -1.026557 -1.777427
Texas 0.328839 -0.121878 -1.209583
Utah 0.913051 -1.289725 -0.590573
In [63]: frame.sort_index(axis=1)
Out[63]:
b d e
Utah 0.913051 -1.289725 -0.590573
Ohio 1.417612 -1.835357 -0.010755
Texas 0.328839 -0.121878 -1.209583
Oregon 1.315330 -1.026557 -1.777427
Sort in descending order:
In [65]: frame.sort_index(axis=1,ascending=False)
Out[65]:
e d b
Utah -0.590573 -1.289725 0.913051
Ohio -0.010755 -1.835357 1.417612
Texas -1.209583 -0.121878 0.328839
Oregon -1.777427 -1.026557 1.315330
Sort by the values in a column:
In [67]: frame.sort_values(by='b')
Out[67]:
b d e
Texas 0.328839 -0.121878 -1.209583
Utah 0.913051 -1.289725 -0.590573
Oregon 1.315330 -1.026557 -1.777427
Ohio 1.417612 -1.835357 -0.010755
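Two related cases, sketched with made-up data: sorting a DataFrame by several columns, and sorting a Series that contains missing values. All values and names below are illustrative and not from the session above.

```python
import numpy as np
import pandas as pd

# Illustrative data, not taken from the transcript.
df = pd.DataFrame({"b": [4, 7, -3, 2], "a": [0, 1, 0, 1]})

# Sort by "a" first, then by "b" within equal values of "a".
by_two = df.sort_values(by=["a", "b"])

obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])

# Missing values are placed at the end by default.
sorted_obj = obj.sort_values()

print(by_two)
print(sorted_obj)
```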
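Ranking (rank) is similar to sorting: it assigns each value a rank, and ties are broken according to a rule. By default, tied values receive the average of the ranks they span: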
In [70]: obj
Out[70]:
0 7
1 -5
2 7
3 4
4 2
5 0
6 4
dtype: int64
In [71]: obj.rank()
Out[71]:
0 6.5
1 1.0
2 6.5
3 4.5
4 3.0
5 2.0
6 4.5
dtype: float64
With method='first', tied values are ranked in the order in which they appear in the data:
In [72]: obj.rank(method='first')
Out[72]:
0 6.0
1 1.0
2 7.0
3 4.0
4 3.0
5 2.0
6 5.0
dtype: float64
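Other tie-breaking rules can be sketched the same way. The Series is rebuilt from the values printed above; `method="min"`, `method="dense"`, and `ascending=False` are standard options of `rank`, and the result names are illustrative.

```python
import pandas as pd

# Values copied from the printed obj above.
obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

# "min": every member of a tied group gets the lowest rank in the group.
rank_min = obj.rank(method="min")

# "dense": like "min", but ranks increase by 1 between groups (no gaps).
rank_dense = obj.rank(method="dense")

# Descending order, with ties taking the highest rank in the group.
rank_desc_max = obj.rank(ascending=False, method="max")

print(rank_min)
print(rank_dense)
print(rank_desc_max)
```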
8. Axis indexes with duplicate values
Use is_unique to check whether the index labels are unique:
In [73]: obj = Series(range(5),index=['a','a','b','b','c'])
In [74]: obj
Out[74]:
a 0
a 1
b 2
b 3
c 4
dtype: int64
In [75]: obj.index.is_unique
Out[75]: False
Selecting a label that occurs more than once returns a Series, whereas a unique label returns a scalar:
In [76]: obj['a']
Out[76]:
a 0
a 1
dtype: int64
The same logic applies to DataFrame.
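A minimal sketch of the DataFrame case, using an illustrative frame whose row index repeats the label "a". The column names x, y, z are made up for the example.

```python
import numpy as np
import pandas as pd

obj = pd.Series(range(5), index=["a", "a", "b", "b", "c"])

# Illustrative frame with a duplicated row label.
df = pd.DataFrame(
    np.arange(12).reshape((4, 3)),
    index=["a", "a", "b", "c"],
    columns=["x", "y", "z"],
)

# A duplicated label selects every matching row, so the result is a DataFrame.
dup_rows = df.loc["a"]

# A unique label selects a single row, returned as a Series.
single_row = df.loc["c"]

print(df.index.is_unique)
print(dup_rows)
print(single_row)
print(obj["c"])
```

Because the return type depends on whether the label repeats, code that indexes by label may need to check `index.is_unique` first.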