Posts: 70
Threads: 17
Joined: Feb 2017
Mar-02-2017, 10:19 PM
(This post was last modified: Mar-02-2017, 10:20 PM by kiton.)
Hi! So, I came up with the following code to extract Twitter data from JSON and create a data frame with several columns:
# Import libraries
import json
import pandas as pd
# Extract data from JSON
tweets = []
for line in open('00.json'):
try:
tweets.append(json.loads(line))
except:
pass
# Tweets often have missing data, therefore use -if- when extracting "keys"
tweet = tweets[0]
ids = [tweet['id_str'] for tweet in tweets if 'id_str' in tweet]
text = [tweet['text'] for tweet in tweets if 'text' in tweet]
lang = [tweet['lang'] for tweet in tweets if 'lang' in tweet]
geo = [tweet['geo'] for tweet in tweets if 'geo' in tweet]
place = [tweet['place'] for tweet in tweets if 'place' in tweet]
# Create a data frame (using pd.Index may be "incorrect", but I am a noob)
df=pd.DataFrame({'Ids':pd.Index(ids),
'Text':pd.Index(text),
'Lang':pd.Index(lang),
'Geo':pd.Index(geo),
'Place':pd.Index(place)})
# Convert "object" to "string" type
df.Lang.apply(str)
df.Geo.apply(str)
# Select tweets in English and with geo tag
df[(df['Lang']==('en',)) & (df['Geo'] != (None,))] So far, everything seems more or less fine.
Now, the problem.
For example:
"Ids" value is recorded as " (396154642666913792,)" ;
Or "Geo" value is recorded as "({'coordinates': [41.63349811, -93.65831894], 'type': 'Point'},)"
Question: How do I remove the "extra" characters -- i.e., (), {}, 'coordinates':, etc.?
Thank you in advance for help!
Posts: 331
Threads: 2
Joined: Feb 2017
If your transformations are not too difficults, you can use pandas' str.replace to clean it - it works as re.sub, so
df.Geo.str.replace("[({':]", "") to remove matches of regular expression.
If its too complicated, you can define some "filtering function" and map it to column with .apply()
Posts: 70
Threads: 17
Joined: Feb 2017
Thank you, zivoni! I'll try this approach and check how it works.
Posts: 70
Threads: 17
Joined: Feb 2017
Mar-07-2017, 05:11 PM
(This post was last modified: Mar-07-2017, 05:12 PM by kiton.)
Unfortunately, I am having difficulties with the proposed solution. So, here is the value that I have:
df.loc[1921,'Geo']
{'coordinates': [39.11890951, -84.48903638], 'type': 'Point'} And here is what I tried:
df.Geo.str.replace("[({':]", "") ### results in NaN
# and also this:
df['Geo'] = df['Geo'].map(lambda x: x.lstrip('{'coordinates': [').rstrip('], 'type': 'Point'')) ### results in syntax error Please advise on the correct way to obtain coordinates values only (i.e., part within square brackets).
Posts: 70
Threads: 17
Joined: Feb 2017
Admittedly, I messed up with understanding the data type of the returned object. The following addresses the issue properly:
df2.loc[1921,'Geo']
{'coordinates': [39.11890951, -84.48903638], 'type': 'Point'}
df2.loc[1921,'Geo']['coordinates'][0]
39.11890951
# creating new column with desired value
df2["geo_coord0"] = [x['coordinates'][0] for x in df2['Geo']]
df2["geo_coord1"] = [x['coordinates'][1] for x in df2['Geo']]
Posts: 331
Threads: 2
Joined: Feb 2017
Mar-07-2017, 06:18 PM
(This post was last modified: Mar-07-2017, 06:18 PM by zivoni.)
Is your Geo column converted to a string? df.Geo.apply(str) does not work in place and from your output it does not look that you actually converted it (no quotes, but it could be interpreter).
This worked for me:
Output: In [176]: df = pd.DataFrame({'Geo':["{'coordinates': [39.11890951, -84.48903638], 'type': 'Point'}"]})
In [177]: df.loc[0, "Geo"]
Out[177]: "{'coordinates': [39.11890951, -84.48903638], 'type': 'Point'}"
In [178]: pattern = r".*\[(-*\d+\.\d+), (-*\d+\.\d+)\].*"
In [179]: df.Geo = df.Geo.str.replace(pattern, r"\1 \2")
In [180]: df.Geo[0]
Out[180]: '39.11890951 -84.48903638'
If you want to extract coordinates as seperate columns, you will need another split and convert to number. Or you can extract coordinates separately:
df['long'] = np.float(df.Geo.str.replace(pattern, r"\1"))
df['lat'] = np.float(df.Geo.str.replace(pattern, r"\2")) My re knowlodge is rudimentary, so its possible that the pattern I have used isnt the right one ....
EDIT: yes, if your column consists of dict/list, its much better to do it with direct access
Posts: 70
Threads: 17
Joined: Feb 2017
zivoni, thank you for reply. Somehow, everything becomes "bad" once I do df.Geo.apply(str) -- the values become NaN. I'll experiment more with you code though and try to make it work.
Posts: 70
Threads: 17
Joined: Feb 2017
Dear zivoni, again I am seeking your help on the following issue that I face trying to run the code you previously suggested. So, I follow the steps outlined in you previous post, but unexpectedly (i.e., it did not occur before) face an error:
df_filter.loc[17765091,'geo']
"{u'type': u'Point', u'coordinates': [36.26511117, -115.10532525]}"
#Note, the above outcome contains additional components compared to that in your example
pattern = r".*\[(-*\d+\.\d+), (-*\d+\.\d+)\].*"
df_filter.geo = df_filter.geo.str.replace(pattern, r"\1 \2")
df_filter.geo[17765091]
'36.26511117 -115.10532525' # same as yours
# Then extract coordinates to columns
df_filter['lon'] = np.float(df_filter.geo.str.replace(pattern, r"\1"))
df_filter['lat'] = np.float(df_filter.geo.str.replace(pattern, r"\2")) Error: ---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-75-7ac8460a87be> in <module>()
----> 1 df_filter['lon'] = np.float(df_filter.geo.str.replace(pattern, r"\1"))
2 df_filter['lat'] = np.float(df_filter.geo.str.replace(pattern, r"\2"))
/Users/mymac/anaconda/lib/python3.6/site-packages/pandas/core/series.py in wrapper(self)
91 return converter(self.iloc[0])
92 raise TypeError("cannot convert the series to "
---> 93 "{0}".format(str(converter)))
94
95 return wrapper
TypeError: cannot convert the series to <class 'float'>
I tried multiple ways to convert the values in geo column to "float" using different approaches, but they all failed. Please advise if there is some kind other error.
Posts: 331
Threads: 2
Joined: Feb 2017
(Mar-07-2017, 06:14 PM)kiton Wrote: Admittedly, I messed up with understanding the data type of the returned object. The following addresses the issue properly:
df2.loc[1921,'Geo']
{'coordinates': [39.11890951, -84.48903638], 'type': 'Point'}
df2.loc[1921,'Geo']['coordinates'][0]
39.11890951
# creating new column with desired value
df2["geo_coord0"] = [x['coordinates'][0] for x in df2['Geo']]
df2["geo_coord1"] = [x['coordinates'][1] for x in df2['Geo']]
(Apr-13-2017, 06:01 PM)kiton Wrote: Error: ---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-75-7ac8460a87be> in <module>()
----> 1 df_filter['lon'] = np.float(df_filter.geo.str.replace(pattern, r"\1"))
2 df_filter['lat'] = np.float(df_filter.geo.str.replace(pattern, r"\2"))
/Users/mymac/anaconda/lib/python3.6/site-packages/pandas/core/series.py in wrapper(self)
91 return converter(self.iloc[0])
92 raise TypeError("cannot convert the series to "
---> 93 "{0}".format(str(converter)))
94
95 return wrapper
TypeError: cannot convert the series to <class 'float'>
I tried multiple ways to convert the values in geo column to "float" using different approaches, but they all failed. Please advise if there is some kind other error.
As was stated previously, str methods work only with string column and your column seems to have object (dictionary) dtype. Instead of using str you should use native access for dictionary/list. Try something like:
df['lat'] = df.geo.apply(lambda x:x['coordinates'][0])
df['lon'] = df.geo.apply(lambda x:x['coordinates'][1])
Posts: 70
Threads: 17
Joined: Feb 2017
Yep, I understand that issue with string. It used to work correctly. So, I tried the code as you suggested:
df_filter['lat'] = df_filter.geo.apply(lambda x:x['coordinates'][0])
df_filter['lon'] = df_filter.geo.apply(lambda x:x['coordinates'][1]) But it still gives an error:
Error: ---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-81-2ed1bdd10a03> in <module>()
----> 1 df_filter['lat'] = df_filter.geo.apply(lambda x:x['coordinates'][0])
2 df_filter['lon'] = df_filter.geo.apply(lambda x:x['coordinates'][1])
/Users/mymac/anaconda/lib/python3.6/site-packages/pandas/core/series.py in apply(self, func, convert_dtype, args, **kwds)
2292 else:
2293 values = self.asobject
-> 2294 mapped = lib.map_infer(values, f, convert=convert_dtype)
2295
2296 if len(mapped) and isinstance(mapped[0], Series):
pandas/src/inference.pyx in pandas.lib.map_infer (pandas/lib.c:66124)()
<ipython-input-81-2ed1bdd10a03> in <lambda>(x)
----> 1 df_filter['lat'] = df_filter.geo.apply(lambda x:x['coordinates'][0])
2 df_filter['lon'] = df_filter.geo.apply(lambda x:x['coordinates'][1])
TypeError: string indices must be integers
|