Long story short, I managed to deal with the problem using Stata (yeah, sorry, I am more of a statistics guy than a programmer :)). I stripped off the leading and trailing text and extracted lon/lat as integers into new columns.
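For future reference, roughly the same clean-up could probably be done in Python instead of Stata. This is only a sketch under assumptions: it supposes each -geo- cell in the exported file is either empty or the text form of a Python dict such as {'type': 'Point', 'coordinates': [lat, lon]}, and the helper name parse_geo is made up for illustration.

```python
import ast


def parse_geo(cell):
    """Return (lat, lon) from a stringified geo cell, or (None, None).

    Assumes the cell looks like "{'type': 'Point', 'coordinates': [lat, lon]}".
    """
    if not cell or not cell.strip():
        return (None, None)
    try:
        # literal_eval safely turns the text back into a Python object
        geo = ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        return (None, None)
    if not isinstance(geo, dict):
        return (None, None)
    coords = geo.get('coordinates')
    if not isinstance(coords, (list, tuple)) or len(coords) != 2:
        return (None, None)
    try:
        return (float(coords[0]), float(coords[1]))
    except (TypeError, ValueError):
        return (None, None)


print(parse_geo("{'type': 'Point', 'coordinates': [40.5, -74.25]}"))  # (40.5, -74.25)
print(parse_geo(""))  # (None, None)
```

Applied to a pandas column, this would be something like df['geo'].apply(parse_geo), with the resulting pairs then split into two new columns.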
While the task is solved for now, I will have more tasks like this in the future, so I am still curious about what is wrong with the extracted -geo- coordinates. Initially (during test runs), I used the following code, which had no such issues:
# DF Parser
import os
import json
import pandas as pd
import numpy as np
from collections import defaultdict

elements_keys = ['created_at', 'text', 'lang', 'geo']
elements = defaultdict(list)

for dirs, subdirs, files in os.walk('/DIR'):
    for file in files:
        if file.endswith('.json'):
            # os.walk yields bare file names, so join them with their directory
            with open(os.path.join(dirs, file), 'r') as input_file:
                for line in input_file:
                    try:
                        tweet = json.loads(line)
                        items = [(key, tweet[key]) for key in elements_keys]  # raises KeyError if any key is missing
                        for key, value in items:
                            elements[key].append(value)
                    except (ValueError, KeyError, TypeError):
                        # skip malformed lines and tweets lacking a required key
                        continue

df = pd.DataFrame({'created_at': pd.Index(elements['created_at']),
                   'text': pd.Index(elements['text']),
                   'lang': pd.Index(elements['lang']),
                   'geo': pd.Index(elements['geo'])})
df  # then clean it a little bit, then save to CSV
But then, for the "big" task, I adapted the following code suggested by zivoni (its performance was substantially better):
# CSV Parser
import csv, json, os

elements_keys = ['created_at', 'text', 'lang', 'geo']

with open('file.csv', 'w') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(elements_keys)  # header
    for dirs, subdirs, files in os.walk('/DIR'):
        for file in files:
            if file.endswith('.json'):
                # os.walk yields bare file names, so join them with their directory
                with open(os.path.join(dirs, file), 'r') as input_file:
                    for line in input_file:
                        try:
                            tweet = json.loads(line)
                            row = [tweet[key] for key in elements_keys]
                            writer.writerow(row)  # writing tweet into file
                        except (ValueError, KeyError, TypeError):
                            # skip malformed lines and tweets lacking a required key
                            continue
Although the code ran perfectly fine, the extracted -geo- coordinates gave me the problem that we were recently discussing. Does anybody see anything specific in the code above that could lead to that problem (i.e., any possible discrepancy in the extracted coordinates)?
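One thing that may be worth checking (this is a guess, since the exact symptom is not reproduced here): csv.writer converts any non-string value with str(), so a -geo- dict ends up in the file as one quoted text cell rather than as numbers. The sketch below illustrates that on a made-up tweet and shows one possible alternative, writing lat/lon as separate columns. The helper flatten_geo and the sample values are hypothetical, and the [lat, lon] order inside 'coordinates' is an assumption.

```python
import csv
import io


def flatten_geo(geo):
    """Return [lat, lon] from a geo dict, or ['', ''] when it is missing."""
    if isinstance(geo, dict):
        coords = geo.get('coordinates')
        if isinstance(coords, (list, tuple)) and len(coords) == 2:
            return [coords[0], coords[1]]
    return ['', '']


# Hypothetical tweet, for illustration only.
tweet = {'text': 'hello',
         'geo': {'type': 'Point', 'coordinates': [40.5, -74.25]}}

# Writing the dict directly: the whole dict becomes a single text cell.
raw = io.StringIO()
csv.writer(raw).writerow([tweet['text'], tweet['geo']])
print(raw.getvalue())

# Writing flattened coordinates: text, lat and lon become three plain cells.
flat = io.StringIO()
csv.writer(flat).writerow([tweet['text']] + flatten_geo(tweet['geo']))
print(flat.getvalue())
```

In the CSV parser above, the same idea would amount to building the row from the non-geo keys plus flatten_geo(tweet['geo']) and adding matching lat/lon names to the header.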