import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings(action='ignore')
The dataset contains 4,505,930 trips taken between Nov. 1, 2019 and Nov. 14, 2019 (two weeks).
trip = pd.read_csv('data/trip.csv')
print('shape of dataset:', trip.shape)
trip.head(3)
trip.info()
Column Name | Description | Type |
---|---|---|
Trip Start Timestamp | When the trip started, rounded to the nearest 15 minutes. | Date & Time |
Trip End Timestamp | When the trip ended, rounded to the nearest 15 minutes. | Date & Time |
Trip Seconds | Time of the trip in seconds. | Number |
Trip Miles | Distance of the trip in miles. | Number |
Pickup Census Tract | The Census Tract where the trip began. This column often will be blank for locations outside Chicago. | Plain Text |
Dropoff Census Tract | The Census Tract where the trip ended. This column often will be blank for locations outside Chicago. | Plain Text |
Pickup Community Area | The Community Area where the trip began. This column will be blank for locations outside Chicago. | Number |
Dropoff Community Area | The Community Area where the trip ended. This column will be blank for locations outside Chicago. | Number |
Fare | The fare for the trip, rounded to the nearest \$2.50. | Number |
Tip | The tip for the trip, rounded to the nearest \$1.00. Cash tips will not be recorded. | Number |
Additional Charges | The taxes, fees, and any other charges for the trip. | Number |
Trip Total | Total cost of the trip. This is calculated as the total of the previous columns, including rounding. | Number |
Shared Trip Authorized | Whether the customer agreed to a shared trip with another customer, regardless of whether the customer was actually matched for a shared trip. | Checkbox |
Trips Pooled | If customers were matched for a shared trip, how many trips, including this one, were pooled. All customer trips from the time the vehicle was empty until it was empty again contribute to this count, even if some customers were never present in the vehicle at the same time. Each trip making up the overall shared trip will have a separate record in this dataset, with the same value in this column. | Number |
Pickup Centroid Latitude | The latitude of the center of the pickup census tract or the community area if the census tract has been hidden for privacy. This column often will be blank for locations outside Chicago. | Number |
Pickup Centroid Longitude | The longitude of the center of the pickup census tract or the community area if the census tract has been hidden for privacy. This column often will be blank for locations outside Chicago. | Number |
Dropoff Centroid Latitude | The latitude of the center of the dropoff census tract or the community area if the census tract has been hidden for privacy. This column often will be blank for locations outside Chicago. | Number |
Dropoff Centroid Longitude | The longitude of the center of the dropoff census tract or the community area if the census tract has been hidden for privacy. This column often will be blank for locations outside Chicago. | Number |
Summary statistics:
trip.describe()
I standardize the column names (lowercase, underscores) and store the census tract IDs as strings so they can later be matched against GEOID strings:
trip.columns = trip.columns.str.replace(' ', '_')
trip.columns = trip.columns.str.lower()
trip['pickup_census_tract'] = trip['pickup_census_tract'].apply(lambda x: "{:.0f}".format(x) if not pd.isnull(x) else x)
trip['dropoff_census_tract'] = trip['dropoff_census_tract'].apply(lambda x: "{:.0f}".format(x) if not pd.isnull(x) else x)
The data include instances of missing values and incorrect observations. I used the following steps to flag and remove incorrect observations:
# flag trips shorter than 60 seconds or longer than 5 hours as missing
trip['trip_seconds'] = trip['trip_seconds'].where((trip['trip_seconds'] > 60) & (trip['trip_seconds'] < 5*3600))
# flag trips with zero or extreme mileage as missing
trip['trip_miles'] = trip['trip_miles'].where((trip['trip_miles'] > 0.25) & (trip['trip_miles'] < 300))
# flag trips with a non-positive total cost as missing
trip['trip_total'] = trip['trip_total'].where(trip['trip_total'] > 0)
# remove trips with extreme average speeds (80 mph or more)
speed_mph = trip['trip_miles'] / (trip['trip_seconds'] / 3600)
trip = trip[speed_mph < 80]
trip.shape
trip.isna().sum()
pickup_census_tract and dropoff_census_tract are the two variables with the most missing values. The City of Chicago has 77 community areas, and each community area contains multiple census tracts. I grouped the data frame by community area and filled missing census tracts within each community area with the median census tract of that group.
trip.groupby('pickup_community_area').get_group(1).head(3)
# tracts are stored as strings, so compute each group's median on the numeric values
fill_median_tract = lambda x: x.fillna('{:.0f}'.format(pd.to_numeric(x).median()))
trip['pickup_census_tract'] = trip.groupby('pickup_community_area')['pickup_census_tract'].transform(fill_median_tract)
trip['dropoff_census_tract'] = trip.groupby('dropoff_community_area')['dropoff_census_tract'].transform(fill_median_tract)
trip.isna().sum()
trip.shape
My objective is to analyze trips within the Chicago boundary, so I remove pickup and dropoff locations outside of Chicago; any trips whose tracts are still missing or fall outside the boundary are removed along with them.
For this purpose, I used the Boundaries - Census Tracts - 2010 dataset.
geoid = pd.read_csv('data/CensusTractsTIGER2010.csv')
geoid.head(2)
geoid['GEOID10'] = geoid['GEOID10'].astype('str')
list_of_chicago_geoid = geoid['GEOID10'].tolist()
trip = trip[trip['pickup_census_tract'].isin(list_of_chicago_geoid)]
trip = trip[trip['dropoff_census_tract'].isin(list_of_chicago_geoid)]
print(trip.shape)
trip.isna().sum()
trip.dropna(inplace=True)
trip.isna().sum()
trip.shape
print(f'{round((4505930 - len(trip)) / 4505930 * 100, 2)}% of the data were removed')
My first question is what portion of total trips were requested to be pooled.
I found that 11% of all trips were requested to be pooled.
print(trip.shared_trip_authorized.value_counts())
percent_of_shared = trip.shared_trip_authorized.value_counts(normalize=True)[True]
print(f'{percent_of_shared:.0%} of total trips were requested to be pooled')
trip.shared_trip_authorized.value_counts().plot(kind='bar')
plt.title('What portion of trips are requested to be pooled?')
plt.xticks(rotation=0)
plt.ylabel('Number of trips')
plt.savefig('images/target_distribution.png', dpi=300)
plt.show()
Since the target variable is imbalanced, we should consider precision and recall rather than accuracy.
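As a quick sanity check (a sketch, assuming shared_trip_authorized is boolean), a trivial classifier that always predicts the majority class already looks deceptively accurate:
# baseline accuracy of always predicting the majority class ('not shared')
majority_acc = 1 - trip['shared_trip_authorized'].mean()
print(f'majority-class baseline accuracy: {majority_acc:.2%}')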
trip.groupby('shared_trip_authorized').mean(numeric_only=True)
We can see that trips authorized to be shared are longer and less expensive.
We can also see that 68% of authorized shared trips were actually pooled.
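A sketch of how that 68% match rate can be computed, assuming trips_pooled > 1 indicates an actual match:
# among trips where pooling was authorized, the fraction actually matched
authorized = trip[trip['shared_trip_authorized'] == True]
print(f"match rate: {(authorized['trips_pooled'] > 1).mean():.0%}")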
import seaborn as sns
to_drop=['trip_start_timestamp', 'trip_end_timestamp', 'pickup_census_tract',
'pickup_community_area','dropoff_community_area',
'dropoff_census_tract','pickup_centroid_latitude', 'pickup_centroid_longitude',
'dropoff_centroid_latitude', 'dropoff_centroid_longitude']
fig, ax = plt.subplots(figsize=(8,8))
matrix = np.triu(trip.drop(to_drop, axis=1).corr())
sns.heatmap(trip.drop(to_drop, axis=1).corr(), annot=False, mask=matrix, fmt='.1g', square=True, linewidths=.5, cbar_kws={"shrink": .5}, \
center=0, cmap='coolwarm',ax=ax)
plt.tight_layout()
plt.savefig('images/heatmap.png', dpi=300)
I found that additional_charges has a conspicuous correlation with shared_trip_authorized, so this variable might be one of the important features in model prediction. This is expected, since a cheaper trip is likely one of the incentives for a person to request a shared trip.
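To put a rough number on that relationship (a sketch): the point-biserial correlation between the binary target and additional_charges can be read off directly:
# correlation between the 0/1 target and additional charges
print(trip['shared_trip_authorized'].astype(int).corr(trip['additional_charges']))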
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss
from sklearn.metrics import classification_report
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
s = trip.sample(n=5000, random_state=4)
X=s[['trip_seconds', 'trip_miles', 'fare', 'tip', 'additional_charges','trip_total']]
y=s['shared_trip_authorized']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, stratify = y )
rf = RandomForestClassifier(max_depth=7).fit(X_train, y_train)
rf_pred = rf.predict(X_test)
rf_train_pred = rf.predict(X_train)
rf_probs = rf.predict_proba(X_test)
print('log_loss:', log_loss(y_test, rf_probs))
print('--'*12)
print('train accuracy', rf.score(X_train, y_train))
print('test accuracy', rf.score(X_test, y_test))
print('--'*12)
print('train precision', precision_score(y_train, rf_train_pred))
print('test precision', precision_score(y_test, rf_pred))
print('--'*12)
print('train recall', recall_score(y_train, rf_train_pred))
print('test recall', recall_score(y_test, rf_pred))
print('--'*12)
print('\n')
print(classification_report(y_test, rf_pred, labels=[0, 1]))
feat_importances = pd.Series(rf.feature_importances_, index=X.columns)
feat_importances.nlargest(10).plot(kind='barh')
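Since these scores come from a single split of a 5,000-trip sample, cross-validated scores give a more stable read (a sketch using the same features and model settings):
from sklearn.model_selection import cross_val_score
# 5-fold cross-validated recall on the sampled data
cv_recall = cross_val_score(RandomForestClassifier(max_depth=7), X, y, cv=5, scoring='recall')
print(f'CV recall: {cv_recall.mean():.2f} +/- {cv_recall.std():.2f}')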
#trip.to_csv('data/data_1to14_eda.csv', index=False)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings(action='ignore')
trip= pd.read_csv('data/data_1to14_eda.csv')
Creating a new variable indicating whether the trip was actually pooled:
A traveler can authorize a shared trip without a match ever being made, so shared_trip_authorized can be true or false regardless of whether the trip was actually pooled.
# 1 if the trip was actually matched with at least one other trip, else 0
trip['trips_pooled'] = np.where(trip['trips_pooled'] > 1, 1, 0)
Changing trip_start_timestamp and trip_end_timestamp to datetime objects:
trip['trip_start_timestamp'] = pd.to_datetime(trip['trip_start_timestamp'])
trip['trip_end_timestamp'] = pd.to_datetime(trip['trip_end_timestamp'])
Creating a column showing the day of the week of each trip.
The week starts on Monday (dayofweek == 0):
trip['dayofweek'] = trip['trip_start_timestamp'].dt.dayofweek
Creating a column indicating whether the trip took place on a weekend:
trip['is_weekend']=np.where(trip['dayofweek'].isin([5,6]), 1, 0)
Creating a column showing the session of the day in which the trip started:
b = [0,5,9,16,20,24]
l = ['midnight', 'm_peak','non_peak','e_peak', 'night']
trip['day_session'] = pd.cut(trip['trip_start_timestamp'].dt.hour, bins=b, labels=l, include_lowest=True)
trip['hour'] = trip['trip_start_timestamp'].dt.hour
trip['month'] = trip['trip_start_timestamp'].dt.month
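As a quick check on the binning (a sketch), every trip should fall into exactly one session:
# distribution of trips across the five day sessions
print(trip['day_session'].value_counts())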
trip.info()
ACS_data = pd.read_csv('data/chicago_sociodem_tract.csv',
sep=',')
ACS_data.info()
data = pd.merge(trip, ACS_data.add_suffix('_pickup'), how='left' , left_on='pickup_census_tract', right_on='GEOID_pickup', suffixes=(None, '_pickup')).drop(['GEOID_pickup'],axis=1 )
data = pd.merge(data, ACS_data.add_suffix('_dropoff'), how='left' , left_on='dropoff_census_tract', right_on='GEOID_dropoff', suffixes=(None, '_dropoff')).drop(['GEOID_dropoff'],axis=1 )
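A quick merge sanity check (a sketch): left joins on the tract IDs should preserve the trip count, with unmatched tracts surfacing as NaNs in the ACS columns:
# row counts should match; NaNs indicate tracts missing from the ACS table
print(len(trip), len(data))
print(data.filter(like='_pickup').isna().sum().sum(), 'missing pickup-side ACS values')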
crime = pd.read_csv('data/Control_Variables/Crime_Rate/Chicago_crime_GEOID_mapped.csv')
data = pd.merge(data, crime.add_suffix('_pickup'), left_on='pickup_census_tract', \
right_on='GEOID_pickup', suffixes=(None, '_pickup')).drop(['GEOID_pickup', 'n_pickup'],axis=1 )
data = pd.merge(data, crime.add_suffix('_dropoff'), left_on='dropoff_census_tract', \
right_on='GEOID_dropoff', suffixes=(None, '_dropoff')).drop(['GEOID_dropoff', 'n_dropoff'],axis=1 )
data.info()
job = pd.read_csv('data/Control_Variables/Employment_Density/Chicago_Job_Density.csv')
data = pd.merge(data, job.add_suffix('_pickup'), left_on='pickup_census_tract', \
right_on='GEOID_pickup', suffixes=(None, '_pickup')).drop(['GEOID_pickup', 'count_jobs_pickup', 'area_pickup'],axis=1 )
data = pd.merge(data, job.add_suffix('_dropoff'), left_on='dropoff_census_tract', \
right_on='GEOID_dropoff', suffixes=(None, '_dropoff')).drop(['GEOID_dropoff', 'count_jobs_dropoff', 'area_dropoff'],axis=1 )
hp = pd.read_csv('data/Control_Variables/Housing_and_Population/Chicago_house_pop_GEOID_mapped.csv')
data = pd.merge(data, hp.add_suffix('_pickup'), left_on='pickup_census_tract', \
right_on='GEOID_pickup', suffixes=(None, '_pickup')).drop(['GEOID_pickup', 'house_pickup', 'pop_pickup'],axis=1 )
data = pd.merge(data, hp.add_suffix('_dropoff'), left_on='dropoff_census_tract', \
right_on='GEOID_dropoff', suffixes=(None, '_dropoff')).drop(['GEOID_dropoff', 'house_dropoff', 'pop_dropoff'],axis=1 )
poi = pd.read_csv('data/Control_Variables/POI/POI_GEOID_mapped.csv')
data = pd.merge(data, poi.add_suffix('_pickup'), left_on='pickup_census_tract', \
right_on='GEOID_pickup', suffixes=(None, '_pickup')).drop(['GEOID_pickup'],axis=1 )
data = pd.merge(data, poi.add_suffix('_dropoff'), left_on='dropoff_census_tract', \
right_on='GEOID_dropoff', suffixes=(None, '_dropoff')).drop(['GEOID_dropoff'],axis=1 )
transit = pd.read_csv('data/Control_Variables/Transit_Supply/Chicago_transit_supply_GEOID_mapped.csv')
data = pd.merge(data, transit.add_suffix('_pickup'), left_on='pickup_census_tract', \
right_on='GEOID_pickup', suffixes=(None, '_pickup')).drop(['GEOID_pickup', 'rail_stops_count_pickup', 'bus_stops_count_pickup'],axis=1 )
data = pd.merge(data, transit.add_suffix('_dropoff'), left_on='dropoff_census_tract', \
right_on='GEOID_dropoff', suffixes=(None, '_dropoff')).drop(['GEOID_dropoff', 'rail_stops_count_dropoff', 'bus_stops_count_dropoff'],axis=1 )
#pd.read_csv('data/Control_Variables/Weather_Data/weatherdata.csv')
data.info()
airport_geoid = ['17031980100', '17031980000']
data['is_airport']=np.where(data['pickup_census_tract'].isin(airport_geoid) | data['dropoff_census_tract'].isin(airport_geoid) , 1, 0)
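A quick check (a sketch) of how many trips touch one of the two airport tracts:
# share of trips that start or end at an airport census tract
print(data['is_airport'].value_counts(normalize=True))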
data.info()
to_drop = ['trip_start_timestamp', 'trip_end_timestamp']
# to_drop = ['trip_start_timestamp', 'trip_end_timestamp',
# 'pickup_community_area', 'dropoff_community_area',
# 'pickup_centroid_latitude', 'pickup_centroid_longitude',
# 'dropoff_centroid_latitude', 'dropoff_centroid_longitude',
# 'portion_twoplus_pickup','portion_twoplus_dropoff',
# 'portion_hispanic_pickup','portion_hispanic_dropoff',
# 'portion_renter_occ_pickup','portion_renter_occ_dropoff']
data.drop(to_drop, axis=1, inplace=True)
data['shared_trip_authorized'] = data['shared_trip_authorized'].astype(int)
data.info()
group = ['pickup_census_tract', 'dropoff_census_tract', 'hour', 'is_weekend', 'is_airport']
s = data.groupby(group)
s = s.agg({'trip_seconds':'mean',
'trip_miles': 'mean',
'trip_total':'mean',
'portion_twoplus_pickup': 'median','portion_twoplus_dropoff':'median',
'portion_hispanic_pickup':'median','portion_hispanic_dropoff':'median',
'portion_renter_occ_pickup':'median','portion_renter_occ_dropoff':'median',
'crime_density_pickup':'median','crime_density_dropoff':'median',
'job_density_pickup': 'median', 'job_density_dropoff': 'median',
'house_density_pickup':'median', 'house_density_dropoff': 'median',
'pop_density_pickup': 'median', 'pop_density_dropoff': 'median',
'POI_density_pickup':'median', 'POI_density_dropoff': 'median',
'bus_stops_density_pickup': 'median', 'bus_stops_density_dropoff': 'median',
'rail_stops_density_pickup': 'median', 'rail_stops_density_dropoff': 'median',
'trips_pooled': 'count',
'shared_trip_authorized': 'mean'  # share of trips in the group that authorized pooling (WTP)
}).reset_index()
s['midnight'] = s['hour'].apply(lambda x: 1 if (x>=0 and x<5) else 0)
s['m_peak'] = s['hour'].apply(lambda x: 1 if (x>=5 and x<9) else 0)
s['non_peak'] = s['hour'].apply(lambda x: 1 if (x>=9 and x<16) else 0)
s['e_peak'] = s['hour'].apply(lambda x: 1 if (x>=16 and x<20) else 0)
s['night'] = s['hour'].apply(lambda x: 1 if (x>=20 and x<24) else 0)
s = s.rename(columns={'trips_pooled': 'count'})
s.shape
s=s[s['count'] > 100]
s.shape
s = s.fillna(0)
s.isna().sum()
s.info()
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
s_scaled = pd.DataFrame(scaler.fit_transform(s), columns=s.columns)
s_scaled.head(2)  # note: the OLS below is fit on the unscaled s; s_scaled is kept for reference
# drop the tract identifiers, the target itself, and one session dummy (to avoid the dummy-variable trap)
X = s.drop(['pickup_census_tract', 'dropoff_census_tract', 'shared_trip_authorized', 'midnight'], axis=1)
y = s['shared_trip_authorized']
import statsmodels.api as sm
X = sm.add_constant(X)
est = sm.OLS(y, X).fit()
est.summary()
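To read the fit more quickly (a sketch using the statsmodels summary table), the significant coefficients can be filtered out:
# keep only coefficients with p < 0.05
coef_table = est.summary2().tables[1]
print(coef_table[coef_table['P>|t|'] < 0.05])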
import seaborn as sns
to_drop=[0,1]
fig, ax = plt.subplots(figsize=(20,20))
matrix = np.triu(s.drop(s.columns[to_drop], axis=1).corr())
sns.heatmap(s.drop(s.columns[to_drop], axis=1).corr(), annot=True, fmt='.1g', square=True, linewidths=.5, cbar_kws={"shrink": .5},ax=ax)
plt.tight_layout()
plt.savefig('images/heatmap_aggregated.png', dpi=300)  # separate file so the earlier heatmap is not overwritten
data.info()
pd.crosstab(data.is_airport, data.shared_trip_authorized, normalize=False).plot(kind="bar")
plt.title('WTP (shared_trip_authorized ~ is_airport)')
plt.xticks(rotation=0)
plt.ylabel('Frequency')
#plt.savefig('1.png')
plt.show()
from scipy.stats import chi2_contingency
chi2, p, dof, ex = chi2_contingency(pd.crosstab(data.is_airport, data.shared_trip_authorized))
print("chi2 = ", chi2)
print("p-val = ", p)
print("degree of freedom = ",dof)
From the bar chart we can see that the chance of WTP is higher in non-airport areas. From the chi-square analysis, the p-value is less than 0.05; therefore, we have enough evidence to reject the null hypothesis that the airport area has no effect on authorizing a trip to be pooled.
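On a dataset this large, a small p-value is almost guaranteed, so an effect-size measure is a useful complement; a sketch using Cramér's V computed from the chi-square statistic above:
# Cramér's V for a 2x2 table reduces to sqrt(chi2 / n), since min(rows-1, cols-1) == 1
n = len(data)
print("Cramér's V =", np.sqrt(chi2 / n))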
pd.crosstab(data.is_weekend, data.shared_trip_authorized, normalize=False).plot(kind="bar")
plt.title('WTP (shared_trip_authorized ~ is_weekend)')
plt.xticks(rotation=0)
plt.ylabel('Frequency')
#plt.savefig('1.png')
plt.show()
from scipy.stats import chi2_contingency
chi2, p, dof, ex = chi2_contingency(pd.crosstab(data.is_weekend, data.shared_trip_authorized))
print("chi2 = ", chi2)
print("p-val = ", p)
print("degree of freedom = ", dof)
From the chi-square analysis, the p-value is less than 0.05; therefore, we have enough evidence to reject the null hypothesis that the weekend has no effect on authorizing a trip to be pooled.
import statsmodels.api as sm
import statsmodels.formula.api as smf
cols = ['trip_seconds', 'trip_miles', 'fare', 'tip', 'additional_charges',
'trip_total', 'hour', 'month', 'portion_low_income_pickup',
'portion_high_income_pickup', 'portion_bachelor_up_pickup',
'portion_female_pickup', 'portion_no_car_pickup',
'portion_white_pickup', 'portion_black_pickup',
'portion_twoplus_pickup', 'portion_hispanic_pickup',
'portion_renter_occ_pickup', 'portion_young18_44_pickup',
'portion_low_income_dropoff', 'portion_high_income_dropoff',
'portion_bachelor_up_dropoff', 'portion_female_dropoff',
'portion_no_car_dropoff', 'portion_white_dropoff',
'portion_black_dropoff', 'portion_twoplus_dropoff',
'portion_hispanic_dropoff', 'portion_renter_occ_dropoff',
'portion_young18_44_dropoff']
dict_pval={}
for item in cols:
    model1 = smf.ols(f'{item} ~ C(shared_trip_authorized)', data=data).fit()
    dict_pval[item] = model1.f_pvalue
dict_pval
pd.DataFrame.from_dict(dict_pval, orient='index', columns=['p_value']).sort_values('p_value')
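With 30 features tested simultaneously, some small p-values are expected by chance alone; a Bonferroni-style correction is a prudent check (a sketch):
# count features that survive a Bonferroni-corrected threshold
alpha = 0.05 / len(cols)
significant = [k for k, v in dict_pval.items() if v < alpha]
print(f'{len(significant)} of {len(cols)} features significant at alpha = {alpha:.4f}')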
#data.to_csv('trip_Nov_1to14_clean.csv', index=False)