utf-8 error with pandas read_csv - logues - Oct-23-2018
I'm trying to read in several large data files (~600-700k rows) as dataframes so I can clean and append them to create a large panel dataset. On import, I get the following error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd6 in position 4: invalid continuation byte
When I restrict the read to nrows=5000 it works, but somewhere between 5000 and 6000 rows the error comes back. There isn't anything wrong with the file, and I've had no issues importing it and the other files into R. Here's the link to the publicly available .xlsx file that I converted to CSV before reading it into Python: https://www.foreignlaborcert.doleta.gov/pdf/PerformanceData/2017/H-1B_Disclosure_Data_FY17.xlsx. Thanks in advance for your help in getting this issue resolved!
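Since the failure appears only past a certain row, one way to narrow it down is to scan the raw bytes for the first physical line that is not valid UTF-8. The sketch below is a diagnostic only, using the standard library; `find_first_bad_line` is a hypothetical helper name (not part of pandas), and the path in the usage comment is a placeholder.

```python
def find_first_bad_line(path, encoding="utf-8"):
    """Return (line_number, raw_bytes) for the first line that fails to
    decode with `encoding`, or None if every line decodes cleanly.

    Line numbers are 1-based physical lines (split on b"\\n"), so they can
    differ from pandas row numbers if quoted fields contain newlines.
    """
    with open(path, "rb") as fh:
        for line_number, raw in enumerate(fh, start=1):
            try:
                raw.decode(encoding)
            except UnicodeDecodeError:
                return line_number, raw
    return None


# Hypothetical usage (placeholder path):
# print(find_first_bad_line("17_H-1B_Disclosure_Data_FY17.csv"))
```

Printing the returned bytes shows the offending character in context, which helps in guessing which encoding the file was actually saved in.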
import pandas as pd
df_17 = pd.read_csv("C:\\Users\\bryanlm\\Python Projects\\Immigration\\LCA Dataset\\Aggregation\\17_H-1B_Disclosure_Data_FY17.csv")
Output:
df_17 = pd.read_csv("C:\\Users\\bryanlm\\Python Projects\\Immigration\\LCA Dataset\\Aggregation\\17_H-1B_Disclosure_Data_FY17.csv", nrows = 5900)
Traceback (most recent call last):
File "<ipython-input-4-c62aa366fb87>", line 1, in <module>
df_17 = pd.read_csv("C:\\Users\\bryanlm\\Python Projects\\Immigration\\LCA Dataset\\Aggregation\\17_H-1B_Disclosure_Data_FY17.csv", nrows = 5900)
File "C:\Users\bryanlm\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\parsers.py", line 678, in parser_f
return _read(filepath_or_buffer, kwds)
File "C:\Users\bryanlm\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\parsers.py", line 446, in _read
data = parser.read(nrows)
File "C:\Users\bryanlm\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\parsers.py", line 1036, in read
ret = self._engine.read(nrows)
File "C:\Users\bryanlm\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\io\parsers.py", line 1848, in read
data = self._reader.read(nrows)
File "pandas\_libs\parsers.pyx", line 876, in pandas._libs.parsers.TextReader.read
File "pandas\_libs\parsers.pyx", line 903, in pandas._libs.parsers.TextReader._read_low_memory
File "pandas\_libs\parsers.pyx", line 968, in pandas._libs.parsers.TextReader._read_rows
File "pandas\_libs\parsers.pyx", line 1094, in pandas._libs.parsers.TextReader._convert_column_data
File "pandas\_libs\parsers.pyx", line 1141, in pandas._libs.parsers.TextReader._convert_tokens
File "pandas\_libs\parsers.pyx", line 1240, in pandas._libs.parsers.TextReader._convert_with_dtype
File "pandas\_libs\parsers.pyx", line 1256, in pandas._libs.parsers.TextReader._string_convert
File "pandas\_libs\parsers.pyx", line 1494, in pandas._libs.parsers._string_box_utf8
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd6 in position 4: invalid continuation byte
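The error says that byte 0xd6 is not valid at that point in a UTF-8 stream. In cp1252 and latin-1 that byte is the letter Ö, so one plausible explanation (an assumption, not verified against this file) is that the CSV was saved from Excel in a Windows code page rather than UTF-8. A minimal sketch of retrying the read with explicit encodings follows. `read_csv_with_fallback` is a hypothetical helper, and the candidate list is a guess. The `encoding` argument is the standard `pd.read_csv` parameter.

```python
import pandas as pd


def read_csv_with_fallback(path, encodings=("utf-8", "cp1252"), **kwargs):
    """Try each encoding in order and return (DataFrame, encoding_used).

    The candidate encodings are assumptions about how the file was saved;
    the last UnicodeDecodeError is re-raised if none of them works.
    """
    if not encodings:
        raise ValueError("at least one encoding is required")
    last_error = None
    for enc in encodings:
        try:
            return pd.read_csv(path, encoding=enc, **kwargs), enc
        except UnicodeDecodeError as exc:
            last_error = exc
    raise last_error


# Hypothetical usage (placeholder path):
# df_17, used = read_csv_with_fallback("17_H-1B_Disclosure_Data_FY17.csv")
```

Note that latin-1 maps every possible byte to some character, so it never raises; that makes it a poor check of whether the guess is right, and the decoded text should still be inspected by eye.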