In [45]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings(action='ignore')

Load Data and Take a Look

From Data, there are 4505930 trips between Nov. 1 2019 to Nov. 14 2019 (two weeks)

In [46]:
trip = pd.read_csv('data/trip.csv')
In [47]:
print('shape of dataset:', trip.shape)
trip.head(3)
shape of dataset: (4505930, 18)
Out[47]:
Trip Start Timestamp Trip End Timestamp Trip Seconds Trip Miles Pickup Census Tract Dropoff Census Tract Pickup Community Area Dropoff Community Area Fare Tip Additional Charges Trip Total Shared Trip Authorized Trips Pooled Pickup Centroid Latitude Pickup Centroid Longitude Dropoff Centroid Latitude Dropoff Centroid Longitude
0 11/01/2019 12:00:00 AM 11/01/2019 12:00:00 AM 165.0 0.6 1.703132e+10 1.703132e+10 32.0 32.0 2.5 0.0 2.55 5.05 False 1 41.870607 -87.622173 41.877406 -87.621972
1 11/01/2019 12:00:00 AM 11/01/2019 12:15:00 AM 823.0 5.4 1.703103e+10 1.703108e+10 3.0 8.0 10.0 0.0 2.55 12.55 False 1 41.958155 -87.653022 41.893216 -87.637844
2 11/01/2019 12:00:00 AM 11/01/2019 12:15:00 AM 683.0 5.6 1.703110e+10 1.703116e+10 10.0 16.0 10.0 3.0 2.55 15.55 False 1 41.972132 -87.797635 41.950212 -87.712814
In [48]:
trip.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4505930 entries, 0 to 4505929
Data columns (total 18 columns):
 #   Column                      Dtype  
---  ------                      -----  
 0   Trip Start Timestamp        object 
 1   Trip End Timestamp          object 
 2   Trip Seconds                float64
 3   Trip Miles                  float64
 4   Pickup Census Tract         float64
 5   Dropoff Census Tract        float64
 6   Pickup Community Area       float64
 7   Dropoff Community Area      float64
 8   Fare                        float64
 9   Tip                         float64
 10  Additional Charges          float64
 11  Trip Total                  float64
 12  Shared Trip Authorized      bool   
 13  Trips Pooled                int64  
 14  Pickup Centroid Latitude    float64
 15  Pickup Centroid Longitude   float64
 16  Dropoff Centroid Latitude   float64
 17  Dropoff Centroid Longitude  float64
dtypes: bool(1), float64(14), int64(1), object(2)
memory usage: 588.7+ MB

Columns in this Dataset and their description:

Column Name Description Type
Trip Start Timestamp When the trip started, rounded to the nearest 15 minutes. Date & Time
Trip End Timestamp When the trip ended, rounded to the nearest 15 minutes. Date & Time
Trip Seconds Time of the trip in seconds. Number
Trip Miles Distance of the trip in miles. Number
Pickup Census Tract The Census Tract where the trip began.
This column often will be blank for locations outside Chicago.
Plain Text
Dropoff Census Tract The Census Tract where the trip ended.
This column often will be blank for locations outside Chicago.
Plain Text
Pickup Community Area The Community Area where the trip began.
This column will be blank for locations outside Chicago.
Number
Dropoff Community Area The Community Area where the trip ended.
This column will be blank for locations outside Chicago.
Number
Fare The fare for the trip, rounded to the nearest \$2.50. Number
Tip The tip for the trip, rounded to the nearest \$1.00. Cash tips will not be recorded. Number
Additional Charges The taxes, fees, and any other charges for the trip. Number
Trip Total Total cost of the trip.
This is calculated as the total of the previous columns, including rounding.
Number
Shared Trip Authorized Whether the customer agreed to a shared trip with another customer,
regardless of whether the customer was actually matched for a shared trip.
Checkbox
Trips Pooled If customers were matched for a shared trip,
how many trips, including this one, were pooled.
All customer trips from the time the vehicle was empty
until it was empty again contribute to this count,
even if some customers were never present in the vehicle at the same time.
Each trip making up the overall shared trip will have a separate record in this dataset,
with the same value in this column.
Number
Pickup Centroid Latitude The latitude of the center of the pickup census tract or the community area
if the census tract has been hidden for privacy.
This column often will be blank for locations outside Chicago.
Number
Pickup Centroid Longitude The longitude of the center of the pickup census tract or the community area
if the census tract has been hidden for privacy.
This column often will be blank for locations outside Chicago.
Number
Dropoff Centroid Latitude The latitude of the center of the dropoff census tract or the community
area if the census tract has been hidden for privacy.
This column often will be blank for locations outside Chicago.
Number
Dropoff Centroid Longitude The longitude of the center of the dropoff census tract or the community
area if the census tract has been hidden for privacy.
This column often will be blank for locations outside Chicago.
Number

summary statistics

In [49]:
trip.describe()
Out[49]:
Trip Seconds Trip Miles Pickup Census Tract Dropoff Census Tract Pickup Community Area Dropoff Community Area Fare Tip Additional Charges Trip Total Trips Pooled Pickup Centroid Latitude Pickup Centroid Longitude Dropoff Centroid Latitude Dropoff Centroid Longitude
count 4.502375e+06 4.505930e+06 3.280119e+06 3.263206e+06 4.212010e+06 4.181385e+06 4.505927e+06 4.505927e+06 4.505929e+06 4.505927e+06 4.505930e+06 4.215877e+06 4.215877e+06 4.185304e+06 4.185304e+06
mean 1.091856e+03 6.456160e+00 1.703137e+10 1.703138e+10 2.578688e+01 2.649017e+01 1.193603e+01 6.870422e-01 3.035391e+00 1.565847e+01 1.138735e+00 4.189287e+01 -8.766735e+01 4.189376e+01 -8.766967e+01
std 7.964650e+02 7.271941e+00 3.383429e+05 3.427131e+05 2.047745e+01 2.094711e+01 9.477744e+00 1.838201e+00 2.042552e+00 1.145579e+01 4.934787e-01 6.071416e-02 6.397958e-02 6.090235e-02 6.816556e-02
min 0.000000e+00 0.000000e+00 1.703101e+10 1.703101e+10 1.000000e+00 1.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 7.200000e-01 1.000000e+00 4.165022e+01 -8.791362e+01 4.165022e+01 -8.791362e+01
25% 5.380000e+02 1.800000e+00 1.703108e+10 1.703108e+10 8.000000e+00 8.000000e+00 5.000000e+00 0.000000e+00 2.550000e+00 8.550000e+00 1.000000e+00 4.187741e+01 -8.767995e+01 4.187867e+01 -8.768323e+01
50% 8.730000e+02 3.800000e+00 1.703124e+10 1.703124e+10 2.400000e+01 2.400000e+01 1.000000e+01 0.000000e+00 2.550000e+00 1.255000e+01 1.000000e+00 4.189322e+01 -8.765177e+01 4.189322e+01 -8.765156e+01
75% 1.412000e+03 8.200000e+00 1.703183e+10 1.703183e+10 3.200000e+01 3.200000e+01 1.500000e+01 0.000000e+00 2.550000e+00 1.755000e+01 1.000000e+00 4.192908e+01 -8.763141e+01 4.192933e+01 -8.763141e+01
max 6.498000e+04 3.192000e+02 1.703198e+10 1.703198e+10 7.700000e+01 7.700000e+01 7.000000e+02 1.000000e+02 3.069000e+01 7.021000e+02 1.000000e+01 4.202122e+01 -8.752995e+01 4.202122e+01 -8.752995e+01

I modify columns name:

  • Replace space with undeline
  • Make them lower case
In [50]:
trip.columns = trip.columns.str.replace(' ', '_')
trip.columns = trip.columns.str.lower()
In [51]:
trip['pickup_census_tract'] = trip['pickup_census_tract'].apply(lambda x: "{:.0f}".format(x) if not pd.isnull(x) else x)
In [52]:
trip['dropoff_census_tract'] = trip['dropoff_census_tract'].apply(lambda x: "{:.0f}".format(x) if not pd.isnull(x) else x)

Incorrect observations:

The data include instance of missing values and incorrect observations. I used the following steps to remove incorrect observations:

  • remove trips with total trip time less than 1 minutes and longer than 5 hours
  • remove trips with total distance traveled less than 0.25 miles and greater than 300 miles
  • remove trips with total fare equal to zero (fares are already rounded)
  • remove trips with extreme speeds (below 0.2 mph and above 80 mph)
In [53]:
# remove trips less than 60 secs or 5 hours
trip['trip_seconds'] = trip['trip_seconds'][(trip['trip_seconds'] > 60) & (trip['trip_seconds'] < 5*3600)]

# remove trips with extreme or zero miles
trip['trip_miles'] = trip['trip_miles'][(trip['trip_miles'] < 300) & (trip['trip_miles'] > 0.25)]

# trip total greater than zero
trip['trip_total'] = trip['trip_total'][(trip['trip_total'] > 0)]

# remove trips with extreme speeds
trip = trip[(trip['trip_miles']/(trip['trip_seconds']/3600) < 80) & (trip['trip_miles']/(trip['trip_seconds']/3600) < 80)]
In [54]:
trip.shape
Out[54]:
(4486332, 18)

Missing Values

In [55]:
trip.isna().sum()
Out[55]:
trip_start_timestamp                0
trip_end_timestamp                  0
trip_seconds                        0
trip_miles                          0
pickup_census_tract           1219419
dropoff_census_tract          1236069
pickup_community_area          293032
dropoff_community_area         323126
fare                                3
tip                                 3
additional_charges                  1
trip_total                          3
shared_trip_authorized              0
trips_pooled                        0
pickup_centroid_latitude       289195
pickup_centroid_longitude      289195
dropoff_centroid_latitude      319240
dropoff_centroid_longitude     319240
dtype: int64

Remove trips out of the boundary of chicago

Pickup_Census_Tract and Dropoff_Census_Tract are two variables with the most missing values. The City of Chicago has 77 community areas and within each community area there are multiple census tracts. I grouped the data frame by community area, and replaced census tracts within each community area with the median census tract of that group.

In [56]:
trip.groupby('pickup_community_area').get_group(1).head(3)
Out[56]:
trip_start_timestamp trip_end_timestamp trip_seconds trip_miles pickup_census_tract dropoff_census_tract pickup_community_area dropoff_community_area fare tip additional_charges trip_total shared_trip_authorized trips_pooled pickup_centroid_latitude pickup_centroid_longitude dropoff_centroid_latitude dropoff_centroid_longitude
34 11/01/2019 12:00:00 AM 11/01/2019 12:00:00 AM 721.0 5.8 17031010600 17031070101 1.0 7.0 10.0 2.0 2.55 14.55 False 1 42.006279 -87.667684 41.929263 -87.635891
70 11/01/2019 12:00:00 AM 11/01/2019 12:15:00 AM 200.0 1.0 17031010300 17031010502 1.0 1.0 2.5 1.0 2.55 6.05 False 1 42.015934 -87.666536 42.004517 -87.663328
93 11/01/2019 12:00:00 AM 11/01/2019 12:00:00 AM 375.0 1.1 17031010400 17031030300 1.0 77.0 5.0 0.0 2.55 7.55 False 1 42.004765 -87.659122 41.994381 -87.672538
In [57]:
trip['pickup_census_tract'] = trip.groupby('pickup_community_area')['pickup_census_tract'].apply(lambda x:x.fillna(x.median()))
trip['dropoff_census_tract'] = trip.groupby('dropoff_community_area')['dropoff_census_tract'].apply(lambda x:x.fillna(x.median()))
In [58]:
trip.isna().sum()
Out[58]:
trip_start_timestamp               0
trip_end_timestamp                 0
trip_seconds                       0
trip_miles                         0
pickup_census_tract           293032
dropoff_census_tract          323126
pickup_community_area         293032
dropoff_community_area        323126
fare                               3
tip                                3
additional_charges                 1
trip_total                         3
shared_trip_authorized             0
trips_pooled                       0
pickup_centroid_latitude      289195
pickup_centroid_longitude     289195
dropoff_centroid_latitude     319240
dropoff_centroid_longitude    319240
dtype: int64
In [59]:
trip.shape
Out[59]:
(4486332, 18)

My objective is to analysis trips within chicago boundary. So, I remove areas out side of chicago. So, if there are any missing values in outside areas, they will be removed.

For this purpous, I used Boundaries - Census Tracts - 2010 dataset.

In [60]:
geoid = pd.read_csv('data/CensusTractsTIGER2010.csv')
geoid.head(2)
Out[60]:
the_geom STATEFP10 COUNTYFP10 TRACTCE10 GEOID10 NAME10 NAMELSAD10 COMMAREA COMMAREA_N NOTES
0 MULTIPOLYGON (((-87.62404799998049 41.73021699... 17 31 842400 17031842400 8424.0 Census Tract 8424 44 44 NaN
1 MULTIPOLYGON (((-87.6860799999848 41.822956000... 17 31 840300 17031840300 8403.0 Census Tract 8403 59 59 NaN
In [61]:
geoid['GEOID10'] = geoid['GEOID10'].astype('str')
In [62]:
list_of_chicago_geoid = geoid['GEOID10'].tolist()
In [63]:
trip = trip[trip['pickup_census_tract'].isin(list_of_chicago_geoid)]
trip = trip[trip['dropoff_census_tract'].isin(list_of_chicago_geoid)]
print(trip.shape)
(3051325, 18)
In [64]:
trip.isna().sum()
Out[64]:
trip_start_timestamp          0
trip_end_timestamp            0
trip_seconds                  0
trip_miles                    0
pickup_census_tract           0
dropoff_census_tract          0
pickup_community_area         0
dropoff_community_area        0
fare                          1
tip                           1
additional_charges            0
trip_total                    1
shared_trip_authorized        0
trips_pooled                  0
pickup_centroid_latitude      0
pickup_centroid_longitude     0
dropoff_centroid_latitude     0
dropoff_centroid_longitude    0
dtype: int64
In [65]:
trip.dropna(inplace=True)
In [66]:
trip.isna().sum()
Out[66]:
trip_start_timestamp          0
trip_end_timestamp            0
trip_seconds                  0
trip_miles                    0
pickup_census_tract           0
dropoff_census_tract          0
pickup_community_area         0
dropoff_community_area        0
fare                          0
tip                           0
additional_charges            0
trip_total                    0
shared_trip_authorized        0
trips_pooled                  0
pickup_centroid_latitude      0
pickup_centroid_longitude     0
dropoff_centroid_latitude     0
dropoff_centroid_longitude    0
dtype: int64
In [67]:
trip.shape
Out[67]:
(3051324, 18)
In [68]:
print(round((4505930-3051324)/4505930 ,2)*100, '% of data were removed')
32.0 % of data were removed

My first question is what portion of total trips were requested to be pooled.

I found that 11% of all trips had been requested to be pooled.

In [69]:
print(trip.shared_trip_authorized.value_counts())
percent_of_shared = trip.shared_trip_authorized.value_counts()[1]/len(trip)
print (f'{round(percent_of_shared,2)}% of total trips are requested to be pooled')
trip.shared_trip_authorized.value_counts().plot(kind='bar')
plt.title('What portion of trips are requested to be pooled?')
plt.xticks(rotation=0)
plt.ylabel('Number of trips')
plt.savefig('images/target_distribution.png', dpi=300)
plt.show()
False    2724980
True      326344
Name: shared_trip_authorized, dtype: int64
0.11% of total trips are requested to be pooled

Since the target varible is imbalanced, we should consider precison and recall rather than accuracy.

In [70]:
trip.groupby('shared_trip_authorized').mean()
Out[70]:
trip_seconds trip_miles pickup_community_area dropoff_community_area fare tip additional_charges trip_total trips_pooled pickup_centroid_latitude pickup_centroid_longitude dropoff_centroid_latitude dropoff_centroid_longitude
shared_trip_authorized
False 944.781605 4.542668 21.793658 22.581960 10.362360 0.669748 3.128046 14.160153 1.004789 41.900359 -87.657419 41.901192 -87.659827
True 1178.342448 6.319077 24.813819 24.562201 8.572013 0.250236 1.189183 10.011432 2.111447 41.891531 -87.660659 41.892490 -87.659300

We can see that trips that are authorized to be shared are longer and less expensive.

Also, we can see that of authorized shared trips 68% percent of them are pooled.

In [73]:
import seaborn as sns

to_drop=['trip_start_timestamp', 'trip_end_timestamp',  'pickup_census_tract',
         'pickup_community_area','dropoff_community_area',
         'dropoff_census_tract','pickup_centroid_latitude', 'pickup_centroid_longitude',
         'dropoff_centroid_latitude', 'dropoff_centroid_longitude']


fig, ax = plt.subplots(figsize=(8,8)) 
matrix = np.triu(trip.drop(to_drop, axis=1).corr())
sns.heatmap(trip.drop(to_drop, axis=1).corr(), annot=False, mask=matrix, fmt='.1g', square=True, linewidths=.5, cbar_kws={"shrink": .5}, \
            center=0, cmap='coolwarm',ax=ax)
plt.tight_layout()
plt.savefig('images/heatmap.png', dpi=300)

I found that “Additional Charges” has conspicuous correlation with “shared_trip_authorized”. So, this variable might be one of the important features in model prediction. This is expected, since economical trip is most likely to be one of the incentives for a person to request a shared trip.

Baseline

In [31]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss
from sklearn.metrics import classification_report
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import classification_report


s = trip.sample(n=5000, random_state=4)
X=s[['trip_seconds', 'trip_miles', 'fare', 'tip',  'additional_charges','trip_total']]
y=s['shared_trip_authorized']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, stratify = y )


rf = RandomForestClassifier(max_depth=7).fit(X_train, y_train)
rf_pred=rf.predict(X_test)
rf_probs = rf.predict_proba(X_test)

print('log_loss:', log_loss(y_test, rf_probs))
print('--'*12)
print('train score accuracy', rf.score(X_train, y_train))
print('test score accuracy', rf.score(X_test, y_test))
print('--'*12)

print('train score precision', precision_score(y_test, rf_pred))
print('test score precision', precision_score(y_test, rf_pred))
print('--'*12)
print('train score recall', recall_score(y_test, rf_pred))
print('test score recall', recall_score(y_test, rf_pred))
print('--'*12)
print('\n')
print(classification_report(y_test, rf_pred, labels=[0, 1]))
log_loss: 0.09434126345148544
------------------------
train score accuracy 0.9722666666666666
test score accuracy 0.9752
------------------------
train score precision 0.9626168224299065
test score precision 0.9626168224299065
------------------------
train score recall 0.7923076923076923
test score recall 0.7923076923076923
------------------------


              precision    recall  f1-score   support

           0       0.98      1.00      0.99      1120
           1       0.96      0.79      0.87       130

    accuracy                           0.98      1250
   macro avg       0.97      0.89      0.93      1250
weighted avg       0.97      0.98      0.97      1250

In [32]:
feat_importances = pd.Series(rf.feature_importances_, index=X.columns)
feat_importances.nlargest(10).plot(kind='barh')
Out[32]:
<AxesSubplot:>
In [ ]:
#trip.to_csv('data/data_1to14_eda.csv', index=False)

Feature Engineering

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings(action='ignore')

trip= pd.read_csv('data/data_1to14_eda.csv')

Creating a new variable to whether the trip was successfully pooled or not:

  • pooled : 1
  • not pooled (solo): 0

Traveler could authorize a shared trip but no matching could be made through the trip (variable Shared.Trip.Authorized could be true or false).

In [2]:
trip['trips_pooled'] = np.where(trip['trips_pooled']>1, 1, 0)

Change trip_start_timestamp and trip_end_timestamp to datetime object:

In [3]:
trip['trip_start_timestamp'] = pd.to_datetime(trip['trip_start_timestamp'])
trip['trip_end_timestamp'] = pd.to_datetime(trip['trip_end_timestamp'])

Creating a column showing day of week of each trip.

The week starts on Monday:

  • Monday: 0
  • .
  • .
  • .
  • Sunday: 6
In [4]:
trip['dayofweek'] = trip['trip_start_timestamp'].dt.dayofweek

Creating a column showing if the trip is done on weekdays or weekends?

In [5]:
trip['is_weekend']=np.where(trip['dayofweek'].isin([5,6]), 1, 0)

Creating a column showing the session of the trip:

  • Late Night
  • Eearly Morning
  • Morning
  • Noon
  • Eve
  • Night
In [6]:
b = [0,5,9,16,20,24]
l = ['midnight', 'm_peak','non_peak','e_peak', 'night']
trip['day_session'] = pd.cut(trip['trip_start_timestamp'].dt.hour, bins=b, labels=l, include_lowest=True)
In [7]:
trip['hour'] = trip['trip_start_timestamp'].dt.hour

Creating a column showing the month of the trip:

- January:1

-

  • December: 12
In [8]:
trip['month'] = trip['trip_start_timestamp'].dt.month
In [10]:
trip.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3051324 entries, 0 to 3051323
Data columns (total 23 columns):
 #   Column                      Dtype         
---  ------                      -----         
 0   trip_start_timestamp        datetime64[ns]
 1   trip_end_timestamp          datetime64[ns]
 2   trip_seconds                float64       
 3   trip_miles                  float64       
 4   pickup_census_tract         int64         
 5   dropoff_census_tract        int64         
 6   pickup_community_area       float64       
 7   dropoff_community_area      float64       
 8   fare                        float64       
 9   tip                         float64       
 10  additional_charges          float64       
 11  trip_total                  float64       
 12  shared_trip_authorized      int64         
 13  trips_pooled                int64         
 14  pickup_centroid_latitude    float64       
 15  pickup_centroid_longitude   float64       
 16  dropoff_centroid_latitude   float64       
 17  dropoff_centroid_longitude  float64       
 18  dayofweek                   int64         
 19  is_weekend                  int64         
 20  day_session                 category      
 21  hour                        int64         
 22  month                       int64         
dtypes: category(1), datetime64[ns](2), float64(12), int64(8)
memory usage: 515.1 MB
In [ ]:
 
In [11]:
ACS_data = pd.read_csv('data/chicago_sociodem_tract.csv',
                   sep=',')
ACS_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 801 entries, 0 to 800
Data columns (total 12 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   GEOID                801 non-null    int64  
 1   portion_low_income   801 non-null    float64
 2   portion_high_income  801 non-null    float64
 3   portion_bachelor_up  801 non-null    float64
 4   portion_female       801 non-null    float64
 5   portion_no_car       801 non-null    float64
 6   portion_white        801 non-null    float64
 7   portion_black        801 non-null    float64
 8   portion_twoplus      801 non-null    float64
 9   portion_hispanic     801 non-null    float64
 10  portion_renter_occ   801 non-null    float64
 11  portion_young18_44   801 non-null    float64
dtypes: float64(11), int64(1)
memory usage: 75.2 KB
In [12]:
data = pd.merge(trip, ACS_data.add_suffix('_pickup'), how='left' , left_on='pickup_census_tract', right_on='GEOID_pickup',  suffixes=(None, '_pickup')).drop(['GEOID_pickup'],axis=1 )
data = pd.merge(data, ACS_data.add_suffix('_dropoff'), how='left' , left_on='dropoff_census_tract', right_on='GEOID_dropoff',  suffixes=(None, '_dropoff')).drop(['GEOID_dropoff'],axis=1 )

Crime Density

In [13]:
crime = pd.read_csv('data/Control_Variables/Crime_Rate/Chicago_crime_GEOID_mapped.csv')
In [14]:
data = pd.merge(data, crime.add_suffix('_pickup'), left_on='pickup_census_tract', \
         right_on='GEOID_pickup',  suffixes=(None, '_pickup')).drop(['GEOID_pickup', 'n_pickup'],axis=1 )

data = pd.merge(data, crime.add_suffix('_dropoff'), left_on='dropoff_census_tract', \
         right_on='GEOID_dropoff',  suffixes=(None, '_dropoff')).drop(['GEOID_dropoff', 'n_dropoff'],axis=1 )
In [15]:
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3051324 entries, 0 to 3051323
Data columns (total 47 columns):
 #   Column                       Dtype         
---  ------                       -----         
 0   trip_start_timestamp         datetime64[ns]
 1   trip_end_timestamp           datetime64[ns]
 2   trip_seconds                 float64       
 3   trip_miles                   float64       
 4   pickup_census_tract          int64         
 5   dropoff_census_tract         int64         
 6   pickup_community_area        float64       
 7   dropoff_community_area       float64       
 8   fare                         float64       
 9   tip                          float64       
 10  additional_charges           float64       
 11  trip_total                   float64       
 12  shared_trip_authorized       int64         
 13  trips_pooled                 int64         
 14  pickup_centroid_latitude     float64       
 15  pickup_centroid_longitude    float64       
 16  dropoff_centroid_latitude    float64       
 17  dropoff_centroid_longitude   float64       
 18  dayofweek                    int64         
 19  is_weekend                   int64         
 20  day_session                  category      
 21  hour                         int64         
 22  month                        int64         
 23  portion_low_income_pickup    float64       
 24  portion_high_income_pickup   float64       
 25  portion_bachelor_up_pickup   float64       
 26  portion_female_pickup        float64       
 27  portion_no_car_pickup        float64       
 28  portion_white_pickup         float64       
 29  portion_black_pickup         float64       
 30  portion_twoplus_pickup       float64       
 31  portion_hispanic_pickup      float64       
 32  portion_renter_occ_pickup    float64       
 33  portion_young18_44_pickup    float64       
 34  portion_low_income_dropoff   float64       
 35  portion_high_income_dropoff  float64       
 36  portion_bachelor_up_dropoff  float64       
 37  portion_female_dropoff       float64       
 38  portion_no_car_dropoff       float64       
 39  portion_white_dropoff        float64       
 40  portion_black_dropoff        float64       
 41  portion_twoplus_dropoff      float64       
 42  portion_hispanic_dropoff     float64       
 43  portion_renter_occ_dropoff   float64       
 44  portion_young18_44_dropoff   float64       
 45  crime_density_pickup         float64       
 46  crime_density_dropoff        float64       
dtypes: category(1), datetime64[ns](2), float64(36), int64(8)
memory usage: 1.1 GB

Job Density

In [16]:
job = pd.read_csv('data/Control_Variables/Employment_Density/Chicago_Job_Density.csv')
In [17]:
data = pd.merge(data, job.add_suffix('_pickup'), left_on='pickup_census_tract', \
         right_on='GEOID_pickup',  suffixes=(None, '_pickup')).drop(['GEOID_pickup', 'count_jobs_pickup', 'area_pickup'],axis=1 )

data = pd.merge(data, job.add_suffix('_dropoff'), left_on='dropoff_census_tract', \
         right_on='GEOID_dropoff',  suffixes=(None, '_dropoff')).drop(['GEOID_dropoff', 'count_jobs_dropoff', 'area_dropoff'],axis=1 )

Housing

In [18]:
hp = pd.read_csv('data/Control_Variables/Housing_and_Population/Chicago_house_pop_GEOID_mapped.csv')
In [19]:
data = pd.merge(data, hp.add_suffix('_pickup'), left_on='pickup_census_tract', \
         right_on='GEOID_pickup',  suffixes=(None, '_pickup')).drop(['GEOID_pickup', 'house_pickup', 'pop_pickup'],axis=1 )

data = pd.merge(data, hp.add_suffix('_dropoff'), left_on='dropoff_census_tract', \
         right_on='GEOID_dropoff',  suffixes=(None, '_dropoff')).drop(['GEOID_dropoff', 'house_dropoff', 'pop_dropoff'],axis=1 )

POI

In [20]:
poi = pd.read_csv('data/Control_Variables/POI/POI_GEOID_mapped.csv')
In [21]:
data = pd.merge(data, poi.add_suffix('_pickup'), left_on='pickup_census_tract', \
         right_on='GEOID_pickup',  suffixes=(None, '_pickup')).drop(['GEOID_pickup'],axis=1 )

data = pd.merge(data, poi.add_suffix('_dropoff'), left_on='dropoff_census_tract', \
         right_on='GEOID_dropoff',  suffixes=(None, '_dropoff')).drop(['GEOID_dropoff'],axis=1 )

Transit Supply

In [22]:
transit = pd.read_csv('data/Control_Variables/Transit_Supply/Chicago_transit_supply_GEOID_mapped.csv')
In [23]:
data = pd.merge(data, transit.add_suffix('_pickup'), left_on='pickup_census_tract', \
         right_on='GEOID_pickup',  suffixes=(None, '_pickup')).drop(['GEOID_pickup', 'rail_stops_count_pickup', 'bus_stops_count_pickup'],axis=1 )

data = pd.merge(data, transit.add_suffix('_dropoff'), left_on='dropoff_census_tract', \
         right_on='GEOID_dropoff',  suffixes=(None, '_dropoff')).drop(['GEOID_dropoff', 'rail_stops_count_dropoff', 'bus_stops_count_dropoff'],axis=1 )
In [ ]:
 

Weather

In [72]:
#pd.read_csv('data/Control_Variables/Weather_Data/weatherdata.csv')
In [ ]:
data.info()

airport

In [24]:
airport_geoid = ['17031980100', '17031980000']

data['is_airport']=np.where(data['pickup_census_tract'].isin(airport_geoid) | data['dropoff_census_tract'].isin(airport_geoid) , 1, 0)
In [139]:
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3051324 entries, 0 to 3051323
Data columns (total 60 columns):
 #   Column                       Dtype         
---  ------                       -----         
 0   trip_start_timestamp         datetime64[ns]
 1   trip_end_timestamp           datetime64[ns]
 2   trip_seconds                 float64       
 3   trip_miles                   float64       
 4   pickup_census_tract          int64         
 5   dropoff_census_tract         int64         
 6   pickup_community_area        float64       
 7   dropoff_community_area       float64       
 8   fare                         float64       
 9   tip                          float64       
 10  additional_charges           float64       
 11  trip_total                   float64       
 12  shared_trip_authorized       int64         
 13  trips_pooled                 int64         
 14  pickup_centroid_latitude     float64       
 15  pickup_centroid_longitude    float64       
 16  dropoff_centroid_latitude    float64       
 17  dropoff_centroid_longitude   float64       
 18  dayofweek                    int64         
 19  is_weekend                   int64         
 20  day_session                  category      
 21  hour                         int64         
 22  month                        int64         
 23  portion_low_income_pickup    float64       
 24  portion_high_income_pickup   float64       
 25  portion_bachelor_up_pickup   float64       
 26  portion_female_pickup        float64       
 27  portion_no_car_pickup        float64       
 28  portion_white_pickup         float64       
 29  portion_black_pickup         float64       
 30  portion_twoplus_pickup       float64       
 31  portion_hispanic_pickup      float64       
 32  portion_renter_occ_pickup    float64       
 33  portion_young18_44_pickup    float64       
 34  portion_low_income_dropoff   float64       
 35  portion_high_income_dropoff  float64       
 36  portion_bachelor_up_dropoff  float64       
 37  portion_female_dropoff       float64       
 38  portion_no_car_dropoff       float64       
 39  portion_white_dropoff        float64       
 40  portion_black_dropoff        float64       
 41  portion_twoplus_dropoff      float64       
 42  portion_hispanic_dropoff     float64       
 43  portion_renter_occ_dropoff   float64       
 44  portion_young18_44_dropoff   float64       
 45  crime_density_pickup         float64       
 46  crime_density_dropoff        float64       
 47  job_density_pickup           float64       
 48  job_density_dropoff          float64       
 49  house_density_pickup         float64       
 50  pop_density_pickup           float64       
 51  house_density_dropoff        float64       
 52  pop_density_dropoff          float64       
 53  POI_density_pickup           float64       
 54  POI_density_dropoff          float64       
 55  bus_stops_density_pickup     float64       
 56  rail_stops_density_pickup    float64       
 57  bus_stops_density_dropoff    float64       
 58  rail_stops_density_dropoff   float64       
 59  is_airport                   int64         
dtypes: category(1), datetime64[ns](2), float64(48), int64(9)
memory usage: 1.4 GB
In [25]:
to_drop = ['trip_start_timestamp', 'trip_end_timestamp']


# to_drop = ['trip_start_timestamp', 'trip_end_timestamp',
#            'pickup_community_area', 'dropoff_community_area',
#            'pickup_centroid_latitude', 'pickup_centroid_longitude',
#            'dropoff_centroid_latitude', 'dropoff_centroid_longitude',
#            'portion_twoplus_pickup','portion_twoplus_dropoff',
#            'portion_hispanic_pickup','portion_hispanic_dropoff',
#            'portion_renter_occ_pickup','portion_renter_occ_dropoff']

data.drop(to_drop, axis=1, inplace=True)
In [26]:
data.shared_trip_authorized  = np.where(data.shared_trip_authorized == False, 0, 1)
In [27]:
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3051324 entries, 0 to 3051323
Data columns (total 58 columns):
 #   Column                       Dtype   
---  ------                       -----   
 0   trip_seconds                 float64 
 1   trip_miles                   float64 
 2   pickup_census_tract          int64   
 3   dropoff_census_tract         int64   
 4   pickup_community_area        float64 
 5   dropoff_community_area       float64 
 6   fare                         float64 
 7   tip                          float64 
 8   additional_charges           float64 
 9   trip_total                   float64 
 10  shared_trip_authorized       int64   
 11  trips_pooled                 int64   
 12  pickup_centroid_latitude     float64 
 13  pickup_centroid_longitude    float64 
 14  dropoff_centroid_latitude    float64 
 15  dropoff_centroid_longitude   float64 
 16  dayofweek                    int64   
 17  is_weekend                   int64   
 18  day_session                  category
 19  hour                         int64   
 20  month                        int64   
 21  portion_low_income_pickup    float64 
 22  portion_high_income_pickup   float64 
 23  portion_bachelor_up_pickup   float64 
 24  portion_female_pickup        float64 
 25  portion_no_car_pickup        float64 
 26  portion_white_pickup         float64 
 27  portion_black_pickup         float64 
 28  portion_twoplus_pickup       float64 
 29  portion_hispanic_pickup      float64 
 30  portion_renter_occ_pickup    float64 
 31  portion_young18_44_pickup    float64 
 32  portion_low_income_dropoff   float64 
 33  portion_high_income_dropoff  float64 
 34  portion_bachelor_up_dropoff  float64 
 35  portion_female_dropoff       float64 
 36  portion_no_car_dropoff       float64 
 37  portion_white_dropoff        float64 
 38  portion_black_dropoff        float64 
 39  portion_twoplus_dropoff      float64 
 40  portion_hispanic_dropoff     float64 
 41  portion_renter_occ_dropoff   float64 
 42  portion_young18_44_dropoff   float64 
 43  crime_density_pickup         float64 
 44  crime_density_dropoff        float64 
 45  job_density_pickup           float64 
 46  job_density_dropoff          float64 
 47  house_density_pickup         float64 
 48  pop_density_pickup           float64 
 49  house_density_dropoff        float64 
 50  pop_density_dropoff          float64 
 51  POI_density_pickup           float64 
 52  POI_density_dropoff          float64 
 53  bus_stops_density_pickup     float64 
 54  rail_stops_density_pickup    float64 
 55  bus_stops_density_dropoff    float64 
 56  rail_stops_density_dropoff   float64 
 57  is_airport                   int64   
dtypes: category(1), float64(48), int64(9)
memory usage: 1.3 GB
In [29]:
group = ['pickup_census_tract', 'dropoff_census_tract', 'hour', 'is_weekend', 'is_airport']

s = data.groupby(group)
In [30]:
s = s.agg({'trip_seconds':'mean', 
           'trip_miles': 'mean', 
           'trip_total':'mean',
           'portion_twoplus_pickup': 'median','portion_twoplus_dropoff':'median',
           'portion_hispanic_pickup':'median','portion_hispanic_dropoff':'median',
           'portion_renter_occ_pickup':'median','portion_renter_occ_dropoff':'median',
           'crime_density_pickup':'median','house_density_pickup':'median',
           'job_density_pickup': 'median', 'job_density_dropoff': 'median',
           'house_density_pickup':'median', 'house_density_dropoff': 'median',
           'pop_density_pickup': 'median', 'pop_density_dropoff': 'median',
           'POI_density_pickup':'median', 'POI_density_dropoff': 'median',
           'bus_stops_density_pickup': 'median', 'bus_stops_density_dropoff': 'median',
          'rail_stops_density_pickup': 'median', 'rail_stops_density_dropoff': 'median',
          'trips_pooled': 'count',
          'shared_trip_authorized': lambda x: x.sum()/len(x)
         }).reset_index()
In [144]:
s['midnight'] = s['hour'].apply(lambda x: 1 if (x>=0 and x<5) else 0)
s['m_peak'] = s['hour'].apply(lambda x: 1 if (x>=5 and x<9) else 0)
s['non_peak'] = s['hour'].apply(lambda x: 1 if (x>=9 and x<16) else 0)
s['e_peak'] = s['hour'].apply(lambda x: 1 if (x>=16 and x<20) else 0)
s['night'] = s['hour'].apply(lambda x: 1 if (x>=20 and x<24) else 0)
In [31]:
s = s.rename(columns={'trips_pooled': 'count'})
In [32]:
s.shape
Out[32]:
(993986, 31)
In [33]:
s=s[s['count'] > 100]
In [34]:
s.shape
Out[34]:
(992056, 31)
In [ ]:
 
In [35]:
s = s.fillna(0)
s.isna().sum()
Out[35]:
pickup_census_tract           0
dropoff_census_tract          0
hour                          0
is_weekend                    0
is_airport                    0
trip_seconds                  0
trip_miles                    0
additional_charges            0
fare                          0
tip                           0
portion_twoplus_pickup        0
portion_twoplus_dropoff       0
portion_hispanic_pickup       0
portion_hispanic_dropoff      0
portion_renter_occ_pickup     0
portion_renter_occ_dropoff    0
crime_density_pickup          0
house_density_pickup          0
job_density_pickup            0
job_density_dropoff           0
house_density_dropoff         0
pop_density_pickup            0
pop_density_dropoff           0
POI_density_pickup            0
POI_density_dropoff           0
bus_stops_density_pickup      0
bus_stops_density_dropoff     0
rail_stops_density_pickup     0
rail_stops_density_dropoff    0
count                         0
shared_trip_authorized        0
dtype: int64
In [36]:
s.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 992056 entries, 0 to 993985
Data columns (total 31 columns):
 #   Column                      Non-Null Count   Dtype  
---  ------                      --------------   -----  
 0   pickup_census_tract         992056 non-null  int64  
 1   dropoff_census_tract        992056 non-null  int64  
 2   hour                        992056 non-null  int64  
 3   is_weekend                  992056 non-null  int64  
 4   is_airport                  992056 non-null  int64  
 5   trip_seconds                992056 non-null  float64
 6   trip_miles                  992056 non-null  float64
 7   additional_charges          992056 non-null  float64
 8   fare                        992056 non-null  float64
 9   tip                         992056 non-null  float64
 10  portion_twoplus_pickup      992056 non-null  float64
 11  portion_twoplus_dropoff     992056 non-null  float64
 12  portion_hispanic_pickup     992056 non-null  float64
 13  portion_hispanic_dropoff    992056 non-null  float64
 14  portion_renter_occ_pickup   992056 non-null  float64
 15  portion_renter_occ_dropoff  992056 non-null  float64
 16  crime_density_pickup        992056 non-null  float64
 17  house_density_pickup        992056 non-null  float64
 18  job_density_pickup          992056 non-null  float64
 19  job_density_dropoff         992056 non-null  float64
 20  house_density_dropoff       992056 non-null  float64
 21  pop_density_pickup          992056 non-null  float64
 22  pop_density_dropoff         992056 non-null  float64
 23  POI_density_pickup          992056 non-null  float64
 24  POI_density_dropoff         992056 non-null  float64
 25  bus_stops_density_pickup    992056 non-null  float64
 26  bus_stops_density_dropoff   992056 non-null  float64
 27  rail_stops_density_pickup   992056 non-null  float64
 28  rail_stops_density_dropoff  992056 non-null  float64
 29  count                       992056 non-null  int64  
 30  shared_trip_authorized      992056 non-null  float64
dtypes: float64(25), int64(6)
memory usage: 242.2 MB
In [39]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

s_scaled = pd.DataFrame(scaler.fit_transform(s), columns=s.columns)
s_scaled.head(2)
Out[39]:
pickup_census_tract dropoff_census_tract hour is_weekend is_airport trip_seconds trip_miles additional_charges fare tip ... pop_density_pickup pop_density_dropoff POI_density_pickup POI_density_dropoff bus_stops_density_pickup bus_stops_density_dropoff rail_stops_density_pickup rail_stops_density_dropoff count shared_trip_authorized
0 0.0 0.0 0.043478 1.0 0.0 0.002430 0.000000 0.130036 0.017857 0.0 ... 0.116825 0.116825 0.081782 0.081782 0.048466 0.048466 0.0 0.0 0.0 0.0
1 0.0 0.0 0.347826 0.0 0.0 0.032374 0.013544 0.130036 0.035714 0.0 ... 0.116825 0.116825 0.081782 0.081782 0.048466 0.048466 0.0 0.0 0.0 0.0

2 rows × 31 columns

In [40]:
cols = [0,1,30]
X=s.drop(s.columns[cols], axis=1)
y=s['shared_trip_authorized']
In [41]:
import statsmodels.api as sm
X = sm.add_constant(X)
est = sm.OLS(y, X).fit()
est.summary()
Out[41]:
OLS Regression Results
Dep. Variable: shared_trip_authorized R-squared: 0.477
Model: OLS Adj. R-squared: 0.477
Method: Least Squares F-statistic: 3.235e+04
Date: Wed, 02 Dec 2020 Prob (F-statistic): 0.00
Time: 12:43:10 Log-Likelihood: 27697.
No. Observations: 992056 AIC: -5.534e+04
Df Residuals: 992027 BIC: -5.499e+04
Df Model: 28
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 0.7189 0.002 370.860 0.000 0.715 0.723
hour -0.0008 3.72e-05 -21.042 0.000 -0.001 -0.001
is_weekend -0.0199 0.000 -40.566 0.000 -0.021 -0.019
is_airport 0.8791 0.002 388.766 0.000 0.875 0.884
trip_seconds 0.0002 7.27e-07 247.545 0.000 0.000 0.000
trip_miles 0.0314 0.000 267.025 0.000 0.031 0.032
additional_charges -0.1912 0.000 -620.387 0.000 -0.192 -0.191
fare -0.0394 8.99e-05 -437.526 0.000 -0.040 -0.039
tip -0.0058 0.000 -26.742 0.000 -0.006 -0.005
portion_twoplus_pickup -0.2484 0.012 -20.122 0.000 -0.273 -0.224
portion_twoplus_dropoff -0.2822 0.012 -23.060 0.000 -0.306 -0.258
portion_hispanic_pickup 0.0176 0.001 13.457 0.000 0.015 0.020
portion_hispanic_dropoff 0.0429 0.001 32.297 0.000 0.040 0.046
portion_renter_occ_pickup 0.0444 0.002 27.074 0.000 0.041 0.048
portion_renter_occ_dropoff 0.0642 0.002 39.154 0.000 0.061 0.067
crime_density_pickup 1.309e-06 6.14e-08 21.318 0.000 1.19e-06 1.43e-06
house_density_pickup 7.76e-07 7.34e-08 10.573 0.000 6.32e-07 9.2e-07
job_density_pickup -7.195e-08 3.83e-09 -18.800 0.000 -7.94e-08 -6.44e-08
job_density_dropoff -6.437e-08 3.02e-09 -21.297 0.000 -7.03e-08 -5.84e-08
house_density_dropoff 9.711e-07 6.77e-08 14.348 0.000 8.38e-07 1.1e-06
pop_density_pickup -1.596e-06 5.23e-08 -30.507 0.000 -1.7e-06 -1.49e-06
pop_density_dropoff -1.363e-06 5.02e-08 -27.154 0.000 -1.46e-06 -1.26e-06
POI_density_pickup -4.428e-05 1.36e-06 -32.491 0.000 -4.7e-05 -4.16e-05
POI_density_dropoff -4.817e-05 1.29e-06 -37.340 0.000 -5.07e-05 -4.56e-05
bus_stops_density_pickup -8.888e-05 8.65e-06 -10.277 0.000 -0.000 -7.19e-05
bus_stops_density_dropoff -0.0001 8.43e-06 -14.818 0.000 -0.000 -0.000
rail_stops_density_pickup -0.0003 2.66e-05 -13.137 0.000 -0.000 -0.000
rail_stops_density_dropoff -0.0003 2.66e-05 -11.997 0.000 -0.000 -0.000
count 0.0008 4.43e-05 17.291 0.000 0.001 0.001
Omnibus: 428467.968 Durbin-Watson: 1.804
Prob(Omnibus): 0.000 Jarque-Bera (JB): 2108875.746
Skew: 2.091 Prob(JB): 0.00
Kurtosis: 8.790 Cond. No. 5.79e+06


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 5.79e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
In [116]:
import seaborn as sns

to_drop=[0,1]


fig, ax = plt.subplots(figsize=(20,20)) 
matrix = np.triu(s.drop(s.columns[to_drop], axis=1).corr())
sns.heatmap(s.drop(s.columns[to_drop], axis=1).corr(), annot=True, fmt='.1g', square=True, linewidths=.5, cbar_kws={"shrink": .5},ax=ax)
plt.tight_layout()
plt.savefig('images/heatmap.png', dpi=300)

What are underlying factors for WTP?

In [ ]:
data.info()
In [ ]:
pd.crosstab(data.is_airport, data.shared_trip_authorized, normalize=False).plot(kind="bar")
plt.title('WTP (trip shared authorized ~ additional charges)')
plt.xticks(rotation=0)
plt.ylabel('Frequency')
#plt.savefig('1.png')
plt.show()
In [ ]:
from scipy.stats import chi2_contingency
chi2, p, dof, ex = chi2_contingency( pd.crosstab(data.is_airport, data.shared_trip_authorized, margins=True))
print("chi2 = ", chi2)
print("p-val = ", p)
print("degree of freedom = ",dof)

From the bar chart we can see chance of WTP is more in not-airport area. from chi2 analysis, p-value is less than 0.05. Therefor, we have enough evidence to reject the null hypothesis, which was there was no effect of airport area on authorizing a trip to be pooled.

In [85]:
pd.crosstab(data.is_airport, data.shared_trip_authorized, normalize=False).plot(kind="bar")
plt.title('WTP (trip shared authorized ~ is_airport)')
plt.xticks(rotation=0)
plt.ylabel('Frequency')
#plt.savefig('1.png')
plt.show()
In [84]:
from scipy.stats import chi2_contingency
chi2, p, dof, ex = chi2_contingency( pd.crosstab(data.is_airport, data.shared_trip_authorized, margins=True))
print("chi2 = ", chi2)
print("p-val = ", p)
print("degree of freedom = ",dof)
chi2 =  6351.264506938556
p-val =  0.0
degree of freedom =  4

From the bar chart we can see chance of WTP is more in not-airport area. from chi2 analysis, p-value is less than 0.05. Therefor, we have enough evidence to reject the null hypothesis, which was there was no effect of weekend on authorizing a trip to be pooled.

In [77]:
import statsmodels.api as sm
import statsmodels.formula.api as smf

cols = ['trip_seconds', 'trip_miles', 'fare', 'tip', 'additional_charges',
       'trip_total', 'hour', 'month', 'portion_low_income_pickup',
       'portion_high_income_pickup', 'portion_bachelor_up_pickup',
       'portion_female_pickup', 'portion_no_car_pickup',
       'portion_white_pickup', 'portion_black_pickup',
       'portion_twoplus_pickup', 'portion_hispanic_pickup',
       'portion_renter_occ_pickup', 'portion_young18_44_pickup',
       'portion_low_income_dropoff', 'portion_high_income_dropoff',
       'portion_bachelor_up_dropoff', 'portion_female_dropoff',
       'portion_no_car_dropoff', 'portion_white_dropoff',
       'portion_black_dropoff', 'portion_twoplus_dropoff',
       'portion_hispanic_dropoff', 'portion_renter_occ_dropoff',
       'portion_young18_44_dropoff']

dict_pval={}
for item in cols:
    model1 = smf.ols(f'{item} ~ C(shared_trip_authorized)', data=data).fit()
    dict_pval[item] = model1.f_pvalue
In [78]:
dict_pval
Out[78]:
{'trip_seconds': 0.0,
 'trip_miles': 0.0,
 'fare': 0.0,
 'tip': 0.0,
 'additional_charges': 0.0,
 'trip_total': 0.0,
 'hour': 2.277939007292422e-07,
 'month': 1.0,
 'portion_low_income_pickup': 0.0,
 'portion_high_income_pickup': 0.0,
 'portion_bachelor_up_pickup': 0.0,
 'portion_female_pickup': 0.0,
 'portion_no_car_pickup': 3.1344329999657895e-225,
 'portion_white_pickup': 0.0,
 'portion_black_pickup': 0.0,
 'portion_twoplus_pickup': 3.150332215877822e-69,
 'portion_hispanic_pickup': 0.0,
 'portion_renter_occ_pickup': 0.0,
 'portion_young18_44_pickup': 0.0,
 'portion_low_income_dropoff': 0.0,
 'portion_high_income_dropoff': 0.0,
 'portion_bachelor_up_dropoff': 0.0,
 'portion_female_dropoff': 0.0,
 'portion_no_car_dropoff': 8.952571556612613e-49,
 'portion_white_dropoff': 0.0,
 'portion_black_dropoff': 0.0,
 'portion_twoplus_dropoff': 0.10323990411105204,
 'portion_hispanic_dropoff': 0.0,
 'portion_renter_occ_dropoff': 0.0,
 'portion_young18_44_dropoff': 0.0}
In [79]:
pd.DataFrame.from_dict(dict_pval,'index').stack().reset_index(level=0)
Out[79]:
level_0 0
0 trip_seconds 0.000000e+00
0 trip_miles 0.000000e+00
0 fare 0.000000e+00
0 tip 0.000000e+00
0 additional_charges 0.000000e+00
0 trip_total 0.000000e+00
0 hour 2.277939e-07
0 month 1.000000e+00
0 portion_low_income_pickup 0.000000e+00
0 portion_high_income_pickup 0.000000e+00
0 portion_bachelor_up_pickup 0.000000e+00
0 portion_female_pickup 0.000000e+00
0 portion_no_car_pickup 3.134433e-225
0 portion_white_pickup 0.000000e+00
0 portion_black_pickup 0.000000e+00
0 portion_twoplus_pickup 3.150332e-69
0 portion_hispanic_pickup 0.000000e+00
0 portion_renter_occ_pickup 0.000000e+00
0 portion_young18_44_pickup 0.000000e+00
0 portion_low_income_dropoff 0.000000e+00
0 portion_high_income_dropoff 0.000000e+00
0 portion_bachelor_up_dropoff 0.000000e+00
0 portion_female_dropoff 0.000000e+00
0 portion_no_car_dropoff 8.952572e-49
0 portion_white_dropoff 0.000000e+00
0 portion_black_dropoff 0.000000e+00
0 portion_twoplus_dropoff 1.032399e-01
0 portion_hispanic_dropoff 0.000000e+00
0 portion_renter_occ_dropoff 0.000000e+00
0 portion_young18_44_dropoff 0.000000e+00
In [80]:
#data.to_csv('trip_Nov_1to14_clean.csv', index=False)
In [ ]:
 
In [ ]: