How to use GeoPy
Sometimes you need to standardize a collection of addresses. For example, you might have a list of addresses in different formats and want to match them to official administrative boundaries. GeoPy can help with this. GeoPy is a Python client for geocoding web services and works by sending API queries. It supports a large number of geocoding services; for a full overview, see the documentation. One example is the Nominatim web service, which we use in this tutorial. We will closely follow this tutorial.
!pip install geopy
import pandas as pd
from geopy.geocoders import Nominatim
Import data
data = pd.read_pickle(r"C:\Users\Rude\Documents\Jupyter Notebooks\finaltweets_2014_P1_M7.pkl")
data.shape
(215956, 49)
data.columns
Index(['tweet_created_at', 'lang', 'attachments', 'possibly_sensitive',
       'source', 'text', 'tweet_id', 'conversation_id', 'author_id',
       'reply_settings', '__twarc', 'referenced_tweets', 'in_reply_to_user_id',
       'in_reply_to_user', 'withheld', 'coordinates', 'place_id', 'geo',
       'name', 'geo_id', 'full_name', 'place_type', 'country', 'country_code',
       'description', 'verified', 'protected', 'created_at', 'username',
       'pinned_tweet_id', 'url', 'name', 'entities', 'location',
       'profile_image_url', 'withheld', 'followers_count', 'following_count',
       'tweet_count', 'listed_count', 'retweet_count', 'reply_count',
       'like_count', 'quote_count', 'hashtags', 'urls', 'mentions',
       'annotations', 'cashtags'],
      dtype='object')
data[['tweet_id', 'username', 'location']].head()
| | tweet_id | username | location |
---|---|---|---|
0 | 489868328748720128 | GbxgbxAde | Zion |
1 | 489868314748133376 | MichalaRudman | South Wales |
2 | 489868314068262912 | dougmcbot | North Texas |
3 | 489868301581836288 | DebiJackson50 | Cincinnati, OH |
4 | 489868289221595136 | spasskultur | NaN |
Prepare our query
data['query'] = data["location"]
data = data[['tweet_id', 'username', 'location', 'query']]
data.head()
| | tweet_id | username | location | query |
---|---|---|---|---|
0 | 489868328748720128 | GbxgbxAde | Zion | Zion |
1 | 489868314748133376 | MichalaRudman | South Wales | South Wales |
2 | 489868314068262912 | dougmcbot | North Texas | North Texas |
3 | 489868301581836288 | DebiJackson50 | Cincinnati, OH | Cincinnati, OH |
4 | 489868289221595136 | spasskultur | NaN | NaN |
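Profile locations are free text, so it may help to tidy them before sending them to a geocoder. The sketch below is one possible cleanup and not part of the original workflow: the helper name `clean_query` is made up for illustration, and it assumes that blank or missing locations should simply be skipped.

```python
import pandas as pd


def clean_query(value):
    """Return a trimmed query string, or None for blank or missing input."""
    if pd.isna(value):
        return None
    # Collapse runs of whitespace and strip both ends.
    text = " ".join(str(value).split())
    return text or None


# Small stand-in for data["location"], loosely based on the rows shown above.
locations = pd.Series(["Zion", "  Cincinnati,   OH ", float("nan")])
queries = locations.map(clean_query)
print(queries.tolist())
```

In the tutorial's DataFrame, the same helper could be applied as `data["query"] = data["location"].map(clean_query)`.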
GeoPy returns the matched address together with its latitude and longitude, so we add three empty columns to hold these values.
data["location_lat"] = ""
data["location_long"] = ""
data["location_address"] = ""
data.head()
| | tweet_id | username | location | query | location_lat | location_long | location_address |
---|---|---|---|---|---|---|---|
0 | 489868328748720128 | GbxgbxAde | Zion | Zion | | | |
1 | 489868314748133376 | MichalaRudman | South Wales | South Wales | | | |
2 | 489868314068262912 | dougmcbot | North Texas | North Texas | | | |
3 | 489868301581836288 | DebiJackson50 | Cincinnati, OH | Cincinnati, OH | | | |
4 | 489868289221595136 | spasskultur | NaN | NaN | | | |
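Initializing the new columns with empty strings makes pandas store them as `object`, so the coordinates are not numeric. As an alternative sketch (an assumption about what is convenient downstream, not what the code above does), the coordinate columns can start out as `NaN`. The small `frame` below is only a stand-in for the tutorial's DataFrame.

```python
import numpy as np
import pandas as pd

# Small stand-in for the tutorial's DataFrame.
frame = pd.DataFrame({"query": ["Zion", "Cincinnati, OH"]})

# NaN keeps the coordinate columns as float64, so numeric operations work later.
frame["location_lat"] = np.nan
frame["location_long"] = np.nan
# Addresses are text, so None marks "not geocoded yet".
frame["location_address"] = None

print(frame.dtypes)
```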
Use GeoPy to fetch the geocode data
df = data.iloc[:100].copy()
df.shape
(100, 7)
geolocator = Nominatim(user_agent="myApp")
for i in df.index:
    try:
        # Try to fetch the location from the geocoding service
        location = geolocator.geocode(df.loc[i, 'query'])
        # Write the result into the matching row of the DataFrame
        df.loc[i, 'location_lat'] = location.latitude
        df.loc[i, 'location_long'] = location.longitude
        df.loc[i, 'location_address'] = location.address
    except Exception:
        # No match (geocode returned None) or the request failed:
        # leave the fields empty
        df.loc[i, 'location_lat'] = ""
        df.loc[i, 'location_long'] = ""
        df.loc[i, 'location_address'] = ""
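The loop above sends one request per row. Nominatim's usage policy asks for at most one request per second, and many rows may share the same location text, so it can pay off to geocode each distinct query once and pause between calls. The following is a minimal sketch of that idea: `geocode_unique` and its `delay_seconds` parameter are hypothetical names, and `geocode` stands for any callable such as `geolocator.geocode`. GeoPy also ships a ready-made throttle, `RateLimiter` in `geopy.extra.rate_limiter`, which could replace the manual pause.

```python
import time

import pandas as pd


def geocode_unique(queries, geocode, delay_seconds=1.0):
    """Geocode each distinct, non-missing query once.

    `geocode` is any callable that takes a query string and returns a
    result object or None (for example `geolocator.geocode`).
    Returns a dict mapping query -> result.
    """
    results = {}
    for query in pd.Series(queries).dropna().unique():
        try:
            results[query] = geocode(query)
        except Exception:
            # Treat a failed request like "no match" and carry on.
            results[query] = None
        # Pause between requests to stay within the service's rate limit.
        time.sleep(delay_seconds)
    return results
```

With the objects from this tutorial, this could be called as `geocode_unique(df['query'], geolocator.geocode)`, and the returned dictionary mapped back onto the rows.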
# Show full column contents, then print the first rows as a sample
pd.set_option("display.max_colwidth", None)
pd.set_option("display.max_rows", None)
df.head()
| | tweet_id | username | location | query | location_lat | location_long | location_address |
---|---|---|---|---|---|---|---|
0 | 489868328748720128 | GbxgbxAde | Zion | Zion | | | |
1 | 489868314748133376 | MichalaRudman | South Wales | South Wales | 42.708949 | -78.57808 | South Wales, Town of Wales, Erie County, New York, 14139, United States |
2 | 489868314068262912 | dougmcbot | North Texas | North Texas | 36.197937 | -76.009923 | Texas, Camden County, North Carolina, 2, United States |
3 | 489868301581836288 | DebiJackson50 | Cincinnati, OH | Cincinnati, OH | 39.101454 | -84.51246 | Cincinnati, Hamilton County, Ohio, United States |
4 | 489868289221595136 | spasskultur | NaN | NaN | 46.314475 | 11.048029 | Nanno, Ville d'Anaunia, Comunità della Val di Non, Provincia di Trento, Trentino-Alto Adige/Südtirol, 38012, Italia |
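Row 4 shows a pitfall: the missing location (`NaN`) still produced a match (Nanno, Italy), presumably because the missing value was sent to the service as text. Rows 1 and 2 also show that short free-text queries can match unexpected places ("South Wales" resolved to New York, "North Texas" to North Carolina), so the results are worth checking. A guard for missing values could look like the sketch below; `safe_geocode` is a hypothetical helper, and `geocode` stands for a callable such as `geolocator.geocode`.

```python
import pandas as pd


def safe_geocode(query, geocode):
    """Return (latitude, longitude, address), or three Nones if there is no usable result."""
    # Skip missing values instead of sending them to the service.
    if pd.isna(query):
        return (None, None, None)
    location = geocode(query)
    # The geocoder returns None when nothing matches.
    if location is None:
        return (None, None, None)
    return (location.latitude, location.longitude, location.address)
```

Inside the loop, the three returned values could be written to `location_lat`, `location_long`, and `location_address` for each row.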
Now, go ahead and extract the necessary geo information from your dataset! :-)