Sometimes you want to transform or standardize a set of addresses. For example, you might have a list of addresses in different formats and want to match them to official administrative borders. GeoPy can help with this. GeoPy is a Python client for geocoding web services and works via API queries. It lets you connect to a large number of geocoding services; for a full overview, see the documentation. One example is the Nominatim web service, which we will use here. We will closely follow this tutorial.

!pip install geopy
import pandas as pd
from geopy.geocoders import Nominatim

Import data

data = pd.read_pickle(r"C:\Users\Rude\Documents\Jupyter Notebooks\finaltweets_2014_P1_M7.pkl")
data.shape
(215956, 49)
data.columns
Index(['tweet_created_at', 'lang', 'attachments', 'possibly_sensitive',
       'source', 'text', 'tweet_id', 'conversation_id', 'author_id',
       'reply_settings', '__twarc', 'referenced_tweets', 'in_reply_to_user_id',
       'in_reply_to_user', 'withheld', 'coordinates', 'place_id', 'geo',
       'name', 'geo_id', 'full_name', 'place_type', 'country', 'country_code',
       'description', 'verified', 'protected', 'created_at', 'username',
       'pinned_tweet_id', 'url', 'name', 'entities', 'location',
       'profile_image_url', 'withheld', 'followers_count', 'following_count',
       'tweet_count', 'listed_count', 'retweet_count', 'reply_count',
       'like_count', 'quote_count', 'hashtags', 'urls', 'mentions',
       'annotations', 'cashtags'],
      dtype='object')
data[['tweet_id', 'username', 'location']].head()
tweet_id username location
0 489868328748720128 GbxgbxAde Zion
1 489868314748133376 MichalaRudman South Wales
2 489868314068262912 dougmcbot North Texas
3 489868301581836288 DebiJackson50 Cincinnati, OH
4 489868289221595136 spasskultur NaN

Prepare our query

data['query'] = data["location"]
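#keep only the columns we need; .copy() avoids a SettingWithCopyWarning when adding columns later
data = data[['tweet_id', 'username', 'location', 'query']].copy()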
data.head()
tweet_id username location query
0 489868328748720128 GbxgbxAde Zion Zion
1 489868314748133376 MichalaRudman South Wales South Wales
2 489868314068262912 dougmcbot North Texas North Texas
3 489868301581836288 DebiJackson50 Cincinnati, OH Cincinnati, OH
4 489868289221595136 spasskultur NaN NaN
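
The query column above still contains missing values and whatever spacing users typed. As a minimal sketch, a small helper could tidy each value before it is sent to the geocoder. `clean_query` is a hypothetical name, not part of GeoPy or pandas, and it uses only the standard library.

```python
import math


def clean_query(value):
    """Return a tidy query string, or None if there is nothing to geocode."""
    # pandas represents a missing location as a float NaN
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return None
    # collapse runs of whitespace and trim both ends
    text = " ".join(str(value).split())
    # an empty string is treated like a missing value
    return text or None


print(clean_query("  Cincinnati,   OH "))
print(clean_query(float("nan")))
```

Assuming the `data` frame from above, this could be applied with `data['query'] = data['location'].map(clean_query)`, after which missing locations are `None` and easy to skip.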

GeoPy returns the matched address, latitude, and longitude for each query. We add three empty columns to hold them.

data["location_lat"]=""
data["location_long"]=""
data["location_address"]=""
data.head()
tweet_id username location query location_lat location_long location_address
0 489868328748720128 GbxgbxAde Zion Zion
1 489868314748133376 MichalaRudman South Wales South Wales
2 489868314068262912 dougmcbot North Texas North Texas
3 489868301581836288 DebiJackson50 Cincinnati, OH Cincinnati, OH
4 489868289221595136 spasskultur NaN NaN
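
For reference, the object GeoPy returns for a successful query exposes `latitude`, `longitude`, `address`, and a `raw` dictionary holding the service's full response. The sketch below shows one way to flatten such a result into the three columns created above. `FakeLocation`, `location_to_row`, and `admin_fields` are hypothetical names used for illustration. The stand-in object replaces a real web request so the example runs offline, and its `raw` content is an assumed example of what Nominatim returns when `geocode` is called with `addressdetails=True`.

```python
from collections import namedtuple

# Hypothetical stand-in for a GeoPy location object, so the sketch runs offline
FakeLocation = namedtuple("FakeLocation", ["latitude", "longitude", "address", "raw"])


def location_to_row(location):
    """Flatten a geocoding result into the three columns used in this tutorial."""
    if location is None:  # geocode() returns None when nothing matches
        return {"location_lat": None, "location_long": None, "location_address": None}
    return {
        "location_lat": location.latitude,
        "location_long": location.longitude,
        "location_address": location.address,
    }


def admin_fields(location, keys=("county", "state", "country")):
    """Pick administrative units from the raw response, if they are present."""
    address = location.raw.get("address", {}) if location is not None else {}
    return {key: address.get(key) for key in keys}


# Illustrative values; the address components are an assumption about the raw response
example = FakeLocation(
    latitude=39.101454,
    longitude=-84.51246,
    address="Cincinnati, Hamilton County, Ohio, United States",
    raw={"address": {"city": "Cincinnati", "county": "Hamilton County",
                     "state": "Ohio", "country": "United States"}},
)

print(location_to_row(example))
print(admin_fields(example))
```

Separate fields such as county, state, and country are usually easier to match against administrative units than the single formatted address string.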

Use GeoPy to fetch the geocode data

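#work on the first 100 rows only to keep the number of API requests small
#.copy() gives us an independent frame, so the assignments below do not trigger a SettingWithCopyWarning
df = data.iloc[:100].copy()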
df.shape
(100, 7)
geolocator = Nominatim(user_agent="myApp")

for i in df.index:
    try:
        #try to fetch the address from Nominatim via GeoPy
        location = geolocator.geocode(df.loc[i, 'query'])
    except Exception:
        #request failed (e.g. timeout or service error): treat it as no match
        location = None

    if location is not None:
        #write latitude, longitude and address into the row
        df.loc[i, 'location_lat'] = location.latitude
        df.loc[i, 'location_long'] = location.longitude
        df.loc[i, 'location_address'] = location.address
    else:
        #geocode() returns None when nothing matches; leave the fields empty
        df.loc[i, 'location_lat'] = ""
        df.loc[i, 'location_long'] = ""
        df.loc[i, 'location_address'] = ""

#show full column contents, then print the first rows as a sample
pd.set_option("display.max_colwidth", None)
pd.set_option("display.max_rows", None)
df.head()
tweet_id username location query location_lat location_long location_address
0 489868328748720128 GbxgbxAde Zion Zion
1 489868314748133376 MichalaRudman South Wales South Wales 42.708949 -78.57808 South Wales, Town of Wales, Erie County, New York, 14139, United States
2 489868314068262912 dougmcbot North Texas North Texas 36.197937 -76.009923 Texas, Camden County, North Carolina, 2, United States
3 489868301581836288 DebiJackson50 Cincinnati, OH Cincinnati, OH 39.101454 -84.51246 Cincinnati, Hamilton County, Ohio, United States
4 489868289221595136 spasskultur NaN NaN 46.314475 11.048029 Nanno, Ville d'Anaunia, Comunità della Val di Non, Provincia di Trento, Trentino-Alto Adige/Südtirol, 38012, Italia
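
Many users share the same location string, and each request to a public geocoding service should be spaced out; the usage policy of the public Nominatim server asks for at most one request per second. The following is a minimal sketch, not part of the original workflow, that geocodes every distinct, non-missing query only once and pauses between requests. `geocode_unique` and `fake_geocode` are hypothetical names; the fake function stands in for `geolocator.geocode` so the example runs without network access.

```python
import time


def geocode_unique(queries, geocode, pause_seconds=1.0):
    """Geocode each distinct, non-missing query once and return {query: result}."""
    results = {}
    for query in queries:
        # skip None, NaN (the only value not equal to itself) and repeated queries
        if query is None or query != query or query in results:
            continue
        if results:
            time.sleep(pause_seconds)  # wait between consecutive requests
        try:
            results[query] = geocode(query)
        except Exception:
            results[query] = None  # treat a failed request like "no match"
    return results


calls = []


def fake_geocode(query):
    """Stand-in for geolocator.geocode: knows a single place, returns None otherwise."""
    calls.append(query)
    return {"Cincinnati, OH": (39.101454, -84.51246)}.get(query)


found = geocode_unique(
    ["Cincinnati, OH", float("nan"), "Zion", "Cincinnati, OH"],
    fake_geocode,
    pause_seconds=0,  # no need to wait for the fake geocoder
)
print(found)
print(calls)
```

With the real service, `geolocator.geocode` would be passed instead of `fake_geocode` and the default pause kept. GeoPy also ships a `RateLimiter` helper in `geopy.extra.rate_limiter` that wraps a geocode function with a minimum delay, which may be a better fit than a hand-written pause.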

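Check the results before you rely on them. Short or ambiguous place names can resolve to the wrong place: here "South Wales" was matched to a place in Erie County, New York, and "North Texas" to a place in North Carolina. Row 4 has no location at all, yet it came back as Nanno, Italy, most likely because the missing value NaN was sent to the service as text. Remove missing values from the query column before geocoding.

Now, go ahead and extract the necessary geo information from your dataset! :-)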