Removing characters from columns in data frame

kiton · (This post was last modified: Mar-02-2017, 10:20 PM by kiton.)

Hi! So, I came up with the following code to extract Twitter data from JSON and create a data frame with several columns:

# Import libraries
import json
import pandas as pd

# Extract data from JSON
tweets = []
for line in open('00.json'):
  try: 
    tweets.append(json.loads(line))
  except:
    pass

# Tweets often have missing data, therefore use -if- when extracting "keys"
tweet = tweets[0]
ids = [tweet['id_str'] for tweet in tweets if 'id_str' in tweet] 
text = [tweet['text'] for tweet in tweets if 'text' in tweet]
lang = [tweet['lang'] for tweet in tweets if 'lang' in tweet]
geo = [tweet['geo'] for tweet in tweets if 'geo' in tweet]
place = [tweet['place'] for tweet in tweets if 'place' in tweet]

# Create a data frame (using pd.Index may be "incorrect", but I am a noob)
df=pd.DataFrame({'Ids':pd.Index(ids),
                'Text':pd.Index(text),
                'Lang':pd.Index(lang),
                'Geo':pd.Index(geo),
                'Place':pd.Index(place)})

# Convert "object" to "string" type
df.Lang.apply(str)
df.Geo.apply(str)

# Select tweets in English and with geo tag
df[(df['Lang']==('en',)) & (df['Geo'] != (None,))]

So far, everything seems more or less fine.

Now, the problem.

For example:

"Ids" value is recorded as "(396154642666913792,)" ;
Or "Geo" value is recorded as "({'coordinates': [41.63349811, -93.65831894], 'type': 'Point'},)"

Question: How do I remove the "extra" characters -- i.e., (), {}, 'coordinates':, etc.?

Thank you in advance for help!

***zivoni*** · Mar-02-2017, 10:37 PM

If your transformations are not too difficults, you can use pandas' str.replace to clean it - it works as re.sub, so

df.Geo.str.replace("[({':]", "")

to remove matches of regular expression.

If its too complicated, you can define some "filtering function" and map it to column with .apply()

kiton · Mar-03-2017, 07:00 PM

Thank you, zivoni! I'll try this approach and check how it works.

kiton · (This post was last modified: Mar-07-2017, 05:12 PM by kiton.)

Unfortunately, I am having difficulties with the proposed solution. So, here is the value that I have:

df.loc[1921,'Geo']
{'coordinates': [39.11890951, -84.48903638], 'type': 'Point'}

And here is what I tried:

df.Geo.str.replace("[({':]", "") ### results in NaN
# and also this:
df['Geo'] = df['Geo'].map(lambda x: x.lstrip('{'coordinates': [').rstrip('], 'type': 'Point'')) ### results in syntax error

Please advise on the correct way to obtain coordinates values only (i.e., part within square brackets).

kiton · Mar-07-2017, 06:14 PM

Admittedly, I messed up with understanding the data type of the returned object. The following addresses the issue properly:

df2.loc[1921,'Geo']
{'coordinates': [39.11890951, -84.48903638], 'type': 'Point'}
df2.loc[1921,'Geo']['coordinates'][0]
39.11890951
# creating new column with desired value
df2["geo_coord0"] = [x['coordinates'][0] for x in df2['Geo']]
df2["geo_coord1"] = [x['coordinates'][1] for x in df2['Geo']]

***zivoni*** · (This post was last modified: Mar-07-2017, 06:18 PM by zivoni.)

Is your Geo column converted to a string? df.Geo.apply(str) does not work in place and from your output it does not look that you actually converted it (no quotes, but it could be interpreter).

This worked for me:

Output:In [176]: df = pd.DataFrame({'Geo':["{'coordinates': [39.11890951, -84.48903638], 'type': 'Point'}"]})

In [177]: df.loc[0, "Geo"]
Out[177]: "{'coordinates': [39.11890951, -84.48903638], 'type': 'Point'}"

In [178]: pattern = r".*\[(-*\d+\.\d+), (-*\d+\.\d+)\].*"

In [179]: df.Geo = df.Geo.str.replace(pattern, r"\1 \2")

In [180]: df.Geo[0]
Out[180]: '39.11890951 -84.48903638'

If you want to extract coordinates as seperate columns, you will need another split and convert to number. Or you can extract coordinates separately:

df['long'] = np.float(df.Geo.str.replace(pattern, r"\1"))
df['lat'] = np.float(df.Geo.str.replace(pattern, r"\2"))

My re knowlodge is rudimentary, so its possible that the pattern I have used isnt the right one ....

EDIT: yes, if your column consists of dict/list, its much better to do it with direct access

kiton · Mar-07-2017, 06:29 PM

zivoni, thank you for reply. Somehow, everything becomes "bad" once I do df.Geo.apply(str) -- the values become NaN. I'll experiment more with you code though and try to make it work.

kiton · Apr-13-2017, 06:01 PM

Dear zivoni, again I am seeking your help on the following issue that I face trying to run the code you previously suggested. So, I follow the steps outlined in you previous post, but unexpectedly (i.e., it did not occur before) face an error:

df_filter.loc[17765091,'geo']
"{u'type': u'Point', u'coordinates': [36.26511117, -115.10532525]}" 
#Note, the above outcome contains additional components compared to that in your example

pattern = r".*\[(-*\d+\.\d+), (-*\d+\.\d+)\].*"
df_filter.geo = df_filter.geo.str.replace(pattern, r"\1 \2")
df_filter.geo[17765091]
'36.26511117 -115.10532525' # same as yours
# Then extract coordinates to columns
df_filter['lon'] = np.float(df_filter.geo.str.replace(pattern, r"\1"))
df_filter['lat'] = np.float(df_filter.geo.str.replace(pattern, r"\2"))

Error:---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-75-7ac8460a87be> in <module>()
----> 1 df_filter['lon'] = np.float(df_filter.geo.str.replace(pattern, r"\1"))
     2 df_filter['lat'] = np.float(df_filter.geo.str.replace(pattern, r"\2"))

/Users/mymac/anaconda/lib/python3.6/site-packages/pandas/core/series.py in wrapper(self)
    91             return converter(self.iloc[0])
    92         raise TypeError("cannot convert the series to "
---> 93                         "{0}".format(str(converter)))
    94 
    95     return wrapper

TypeError: cannot convert the series to <class 'float'>

I tried multiple ways to convert the values in geo column to "float" using different approaches, but they all failed. Please advise if there is some kind other error.

***zivoni*** · Apr-13-2017, 07:57 PM

(Mar-07-2017, 06:14 PM)kiton Wrote: Admittedly, I messed up with understanding the data type of the returned object. The following addresses the issue properly:
df2.loc[1921,'Geo']
{'coordinates': [39.11890951, -84.48903638], 'type': 'Point'}
df2.loc[1921,'Geo']['coordinates'][0]
39.11890951
# creating new column with desired value
df2["geo_coord0"] = [x['coordinates'][0] for x in df2['Geo']]
df2["geo_coord1"] = [x['coordinates'][1] for x in df2['Geo']]

(Apr-13-2017, 06:01 PM)kiton Wrote:

Error:---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-75-7ac8460a87be> in <module>()
----> 1 df_filter['lon'] = np.float(df_filter.geo.str.replace(pattern, r"\1"))
     2 df_filter['lat'] = np.float(df_filter.geo.str.replace(pattern, r"\2"))

/Users/mymac/anaconda/lib/python3.6/site-packages/pandas/core/series.py in wrapper(self)
    91             return converter(self.iloc[0])
    92         raise TypeError("cannot convert the series to "
---> 93                         "{0}".format(str(converter)))
    94 
    95     return wrapper

TypeError: cannot convert the series to <class 'float'>

I tried multiple ways to convert the values in geo column to "float" using different approaches, but they all failed. Please advise if there is some kind other error.

As was stated previously, str methods work only with string column and your column seems to have object (dictionary) dtype. Instead of using str you should use native access for dictionary/list. Try something like:

df['lat'] = df.geo.apply(lambda x:x['coordinates'][0])
df['lon'] = df.geo.apply(lambda x:x['coordinates'][1])

kiton · Apr-14-2017, 07:01 PM

Yep, I understand that issue with string. It used to work correctly. So, I tried the code as you suggested:

df_filter['lat'] = df_filter.geo.apply(lambda x:x['coordinates'][0])
df_filter['lon'] = df_filter.geo.apply(lambda x:x['coordinates'][1])

But it still gives an error:

Error:---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-81-2ed1bdd10a03> in <module>()
----> 1 df_filter['lat'] = df_filter.geo.apply(lambda x:x['coordinates'][0])
      2 df_filter['lon'] = df_filter.geo.apply(lambda x:x['coordinates'][1])

/Users/mymac/anaconda/lib/python3.6/site-packages/pandas/core/series.py in apply(self, func, convert_dtype, args, **kwds)
   2292             else:
   2293                 values = self.asobject
-> 2294                 mapped = lib.map_infer(values, f, convert=convert_dtype)
   2295 
   2296         if len(mapped) and isinstance(mapped[0], Series):

pandas/src/inference.pyx in pandas.lib.map_infer (pandas/lib.c:66124)()

<ipython-input-81-2ed1bdd10a03> in <lambda>(x)
----> 1 df_filter['lat'] = df_filter.geo.apply(lambda x:x['coordinates'][0])
      2 df_filter['lon'] = df_filter.geo.apply(lambda x:x['coordinates'][1])

TypeError: string indices must be integers

Possibly Related Threads…
Thread		Author	Replies	Views	Last Post
	Grouping in pandas/multi-index data frame	Aleqsie	3	797	Jan-06-2024, 03:55 PM Last Post: deanhystad
	Filtering Data Frame, with another value	NewBiee	9	1,569	Aug-21-2023, 10:53 AM Last Post: NewBiee
	Deleting characters between certain characters	stahorse	7	1,265	Jul-03-2023, 12:59 AM Last Post: Pedroski55
	Exporting data frame to excel	dyerlee91	0	1,678	Oct-05-2021, 11:34 AM Last Post: dyerlee91
	Pandas Data frame column condition check based on length of the value	aditi06	1	2,762	Jul-28-2021, 11:08 AM Last Post: jefsummers
	Adding a new column to a Panda Data Frame	rsherry8	2	2,176	Jun-06-2021, 06:49 PM Last Post: jefsummers
	import columns of data from local csv file	CatherineKan	2	3,439	May-10-2021, 05:10 AM Last Post: ricslato
	pandas.to_datetime: Combine data from 2 columns	ju21878436312	1	2,500	Feb-20-2021, 08:25 PM Last Post: perfringo
	grouped data frame	glitter	0	1,640	Feb-02-2021, 11:22 AM Last Post: glitter
	how to filter data frame dynamically with the columns	psahay	0	2,443	Aug-24-2020, 01:10 PM Last Post: psahay

Removing characters from columns in data frame

User Panel Messages

Announcements