In [1]: data = ('a,b,c,d\n' '1,2,3,4\n' '5,6,7,8\n' '9,10,11')
In [2]: df = pd.read_csv(io.StringIO(data))
In [3]: print(df.dtypes)
Out[3]:
a      int64
b      int64
c      int64
d    float64
dtype: object
In [2]: df = pd.read_csv(io.StringIO(data), dtype=object)
In [3]: print(df.dtypes)
Out[3]:
a    object
b    object
c    object
d    object
dtype: object
In [4]: df['a'][0]
Out[4]: '1'
In [2]: df = pd.read_csv(io.StringIO(data), dtype={'b': object, 'c': np.float64, 'd': 'Int64'})
In [3]: print(df.dtypes)
Out[3]:
a      int64
b     object
c    float64
d      Int64
dtype: object
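The per-column dtype mapping is easiest to appreciate by looking at the values it produces. The following is a minimal sketch that reuses the same sample data; it assumes a pandas version that supports the nullable `Int64` dtype in `read_csv`. The variable name `missing_in_d` is only for illustration.

```python
import io

import numpy as np
import pandas as pd

data = ('a,b,c,d\n'
        '1,2,3,4\n'
        '5,6,7,8\n'
        '9,10,11')

# Per-column dtypes: b stays as strings, c is forced to float,
# d uses the nullable integer dtype so the missing value does not
# turn the whole column into float64.
df = pd.read_csv(io.StringIO(data),
                 dtype={'b': object, 'c': np.float64, 'd': 'Int64'})

missing_in_d = int(df['d'].isna().sum())  # the last row has no value for d
print(df['b'].tolist())                   # strings, not integers
print(df['d'].dtype, missing_in_d)
```

Compared with the default inference, where `d` becomes `float64`, the nullable dtype keeps the present values as integers.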
In [1]: data = ('a,b,c\n' '1,Yes,2\n' '3,No,4')
In [2]: pd.read_csv(io.StringIO(data))
Out[2]:
   a    b  c
0  1  Yes  2
1  3   No  4
In [2]: pd.read_csv(io.StringIO(data), true_values=['Yes'], false_values=['No'])
Out[2]:
   a      b  c
0  1   True  2
1  3  False  4
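To confirm that the column really is converted to booleans and not just displayed differently, one can inspect its dtype. A small sketch using the same sample data:

```python
import io

import pandas as pd

data = ('a,b,c\n'
        '1,Yes,2\n'
        '3,No,4')

# Every value in column b matches either true_values or false_values,
# so the parser can convert the whole column to booleans.
df = pd.read_csv(io.StringIO(data),
                 true_values=['Yes'], false_values=['No'])

print(df['b'].dtype)
print(df['b'].tolist())
```

Without `true_values` and `false_values`, column `b` keeps the strings `Yes` and `No`.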
In [1]: data = ('something,a,b,c,d,message\n'
   ...:         'one,1,2,3,4,NA\n'
   ...:         'two,5,6,,8,world\n'
   ...:         'three,9,10,11,12,foo\n')
In [2]: pd.read_csv(io.StringIO(data))
Out[2]:
  something  a   b     c   d message
0       one  1   2   3.0   4     NaN
1       two  5   6   NaN   8   world
2     three  9  10  11.0  12     foo
As the output shows, both the empty string and the string NA are recognized as NaN by default. If we set na_values manually, foo is treated as NaN as well.
In [1]: pd.read_csv(io.StringIO(data), na_values=['foo'])
Out[1]:
  something  a   b     c   d message
0       one  1   2   3.0   4     NaN
1       two  5   6   NaN   8   world
2     three  9  10  11.0  12     NaN
With the default NaN values disabled, only foo is parsed as NaN; the empty string and the string NA keep their original values.
In [1]: pd.read_csv(io.StringIO(data), na_values=['foo'], keep_default_na=False)
Out[1]:
  something  a   b   c   d message
0       one  1   2   3   4      NA
1       two  5   6       8   world
2     three  9  10  11  12     NaN
You can also specify different NaN values for different columns.
In [1]: sentinels = {'message': ['foo', 'NA'], 'something': ['two']}
In [2]: pd.read_csv(io.StringIO(data), na_values=sentinels)
Out[2]:
  something  a   b     c   d message
0       one  1   2   3.0   4     NaN
1       NaN  5   6   NaN   8   world
2     three  9  10  11.0  12     NaN
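Counting the missing values per column is a quick way to check which sentinels took effect. The sketch below reuses the sample data and the `sentinels` dictionary; the name `na_counts` is introduced here for illustration only.

```python
import io

import pandas as pd

data = ('something,a,b,c,d,message\n'
        'one,1,2,3,4,NA\n'
        'two,5,6,,8,world\n'
        'three,9,10,11,12,foo\n')

# Column-specific sentinels; the default NaN markers still apply
# because keep_default_na is left at its default of True.
sentinels = {'message': ['foo', 'NA'], 'something': ['two']}
df = pd.read_csv(io.StringIO(data), na_values=sentinels)

# Number of missing values in each column
na_counts = df.isna().sum()
print(na_counts)
```

The value `two` is only treated as missing in the `something` column; the same string in any other column would be left untouched.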
In [1]: data = ('# hey!\n'
   ...:         'a,b,c,d,message\n'
   ...:         '# just wanted to make things more difficult for you\n'
   ...:         '# who reads CSV files with computers, anyway?\n'
   ...:         '1,2,3,4,hello\n'
   ...:         '5,6,7,8,world\n'
   ...:         '9,10,11,12,foo\n')
In [2]: pd.read_csv(io.StringIO(data), skiprows=[0, 2, 3])
Out[2]:
   a   b   c   d message
0  1   2   3   4   hello
1  5   6   7   8   world
2  9  10  11  12     foo
In [2]: pd.read_csv(io.StringIO(data), comment='#')
Out[2]:
   a   b   c   d message
0  1   2   3   4   hello
1  5   6   7   8   world
2  9  10  11  12     foo
In [1]: data = 'a,b,c~1,2,3~4,5,6'
In [2]: pd.read_csv(io.StringIO(data), lineterminator='~')
Out[2]:
   a  b  c
0  1  2  3
1  4  5  6
In [1]: data = ('|0|1|2|3\n'
   ...:         '0|0.4691122999071863|-0.2828633443286633|-1.5090585031735124|-1.1356323710171934\n'
   ...:         '1|1.2121120250208506|-0.17321464905330858|0.11920871129693428|-1.0442359662799567\n'
   ...:         '2|-0.8618489633477999|-2.1045692188948086|-0.4949292740687813|1.071803807037338\n'
   ...:         '3|0.7215551622443669|-0.7067711336300845|-1.0395749851146963|0.27185988554282986\n'
   ...:         '4|-0.42497232978883753|0.567020349793672|0.27623201927771873|-1.0874006912859915\n'
   ...:         '5|-0.6736897080883706|0.1136484096888855|-1.4784265524372235|0.5249876671147047\n'
   ...:         '6|0.4047052186802365|0.5770459859204836|-1.7150020161146375|-1.0392684835147725\n'
   ...:         '7|-0.3706468582364464|-1.1578922506419993|-1.344311812731667|0.8448851414248841\n'
   ...:         '8|1.0757697837155533|-0.10904997528022223|1.6435630703622064|-1.4693879595399115\n'
   ...:         '9|0.35702056413309086|-0.6746001037299882|-1.776903716971867|-0.9689138124473498\n')
In [2]: reader = pd.read_csv(io.StringIO(data), sep='|', chunksize=4)
In [3]: print(reader)
Out[3]: <pandas.io.parsers.TextFileReader object at 0x0000027B441839B0>
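The returned reader is an iterator: each step yields a DataFrame with at most `chunksize` rows. Here is a minimal sketch of consuming it. To keep the example short it uses a smaller, made-up sample (five rows, two value columns) in place of the data above, and the names `chunk_lengths` and `full` are illustrative.

```python
import io

import pandas as pd

# Hypothetical, shortened pipe-separated sample with 5 data rows
data = ('|0|1\n'
        '0|0.5|1.5\n'
        '1|2.5|3.5\n'
        '2|4.5|5.5\n'
        '3|6.5|7.5\n'
        '4|8.5|9.5\n')

# With chunksize set, read_csv returns a reader instead of a DataFrame
reader = pd.read_csv(io.StringIO(data), sep='|', chunksize=2)

# Each iteration yields one DataFrame of up to 2 rows
chunks = list(reader)
chunk_lengths = [len(chunk) for chunk in chunks]
print(chunk_lengths)  # [2, 2, 1]

# The pieces can be glued back together if the full table is needed
full = pd.concat(chunks)
print(full.shape)
```

In practice each chunk would usually be processed and discarded inside a `for` loop, so that the whole file never has to sit in memory at once.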
In [1]: data = ('label1,label2,label3\n' 'index1,"a","c""g",e\n' 'index2,b,d,f\n')
In [2]: pd.read_csv(io.StringIO(data))
Out[2]:
       label1 label2 label3
index1      a    c"g      e
index2      b      d      f
In [3]: pd.read_csv(io.StringIO(data), quoting=csv.QUOTE_NONE)
Out[3]:
       label1  label2 label3
index1    "a"  "c""g"      e
index2      b       d      f
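The `csv.QUOTE_NONE` constant comes from Python's standard `csv` module, so its meaning can be illustrated without pandas. The sketch below uses `csv.reader` in place of `pd.read_csv` on the same sample data; it shows what the quoting constants mean in the standard library and does not describe how pandas implements them internally. The names `default_rows` and `raw_rows` are illustrative.

```python
import csv
import io

data = ('label1,label2,label3\n'
        'index1,"a","c""g",e\n'
        'index2,b,d,f\n')

# Default quoting: surrounding quotes are stripped and a doubled
# quote inside a quoted field is read as a single quote character.
default_rows = list(csv.reader(io.StringIO(data)))

# QUOTE_NONE: quote characters get no special treatment and are
# kept as part of the field value.
raw_rows = list(csv.reader(io.StringIO(data), quoting=csv.QUOTE_NONE))

print(default_rows[1])  # ['index1', 'a', 'c"g', 'e']
print(raw_rows[1])      # ['index1', '"a"', '"c""g"', 'e']
```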