import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings(action='ignore')
The dataset contains 4,505,930 trips taken between Nov. 1, 2019 and Nov. 14, 2019 (two weeks).
trip = pd.read_csv('data/trip.csv')
print('shape of dataset:', trip.shape)
trip.head(3)
trip.info()
Column Name | Description | Type |
---|---|---|
Trip Start Timestamp | When the trip started, rounded to the nearest 15 minutes. | Date & Time |
Trip End Timestamp | When the trip ended, rounded to the nearest 15 minutes. | Date & Time |
Trip Seconds | Time of the trip in seconds. | Number |
Trip Miles | Distance of the trip in miles. | Number |
Pickup Census Tract | The Census Tract where the trip began. This column often will be blank for locations outside Chicago. | Plain Text |
Dropoff Census Tract | The Census Tract where the trip ended. This column often will be blank for locations outside Chicago. | Plain Text |
Pickup Community Area | The Community Area where the trip began. This column will be blank for locations outside Chicago. | Number |
Dropoff Community Area | The Community Area where the trip ended. This column will be blank for locations outside Chicago. | Number |
Fare | The fare for the trip, rounded to the nearest \$2.50. | Number |
Tip | The tip for the trip, rounded to the nearest \$1.00. Cash tips will not be recorded. | Number |
Additional Charges | The taxes, fees, and any other charges for the trip. | Number |
Trip Total | Total cost of the trip. This is calculated as the total of the previous columns, including rounding. | Number |
Shared Trip Authorized | Whether the customer agreed to a shared trip with another customer, regardless of whether the customer was actually matched for a shared trip. | Checkbox |
Trips Pooled | If customers were matched for a shared trip, how many trips, including this one, were pooled. All customer trips from the time the vehicle was empty until it was empty again contribute to this count, even if some customers were never present in the vehicle at the same time. Each trip making up the overall shared trip will have a separate record in this dataset, with the same value in this column. | Number |
Pickup Centroid Latitude | The latitude of the center of the pickup census tract or the community area if the census tract has been hidden for privacy. This column often will be blank for locations outside Chicago. | Number |
Pickup Centroid Longitude | The longitude of the center of the pickup census tract or the community area if the census tract has been hidden for privacy. This column often will be blank for locations outside Chicago. | Number |
Dropoff Centroid Latitude | The latitude of the center of the dropoff census tract or the community area if the census tract has been hidden for privacy. This column often will be blank for locations outside Chicago. | Number |
Dropoff Centroid Longitude | The longitude of the center of the dropoff census tract or the community area if the census tract has been hidden for privacy. This column often will be blank for locations outside Chicago. | Number |
Summary statistics:
trip.describe()
I standardize the column names (lowercase, underscores) and store the census tract IDs as strings so they can later be matched against GEOID strings:
trip.columns = trip.columns.str.replace(' ', '_')
trip.columns = trip.columns.str.lower()
trip['pickup_census_tract'] = trip['pickup_census_tract'].apply(lambda x: "{:.0f}".format(x) if not pd.isnull(x) else x)
trip['dropoff_census_tract'] = trip['dropoff_census_tract'].apply(lambda x: "{:.0f}".format(x) if not pd.isnull(x) else x)
The data include instances of missing values and incorrect observations. I used the following steps to flag and remove incorrect observations:
# flag trips shorter than 60 seconds or longer than 5 hours as missing
trip['trip_seconds'] = trip['trip_seconds'].where((trip['trip_seconds'] > 60) & (trip['trip_seconds'] < 5*3600))
# flag trips with zero or extreme mileage as missing
trip['trip_miles'] = trip['trip_miles'].where((trip['trip_miles'] > 0.25) & (trip['trip_miles'] < 300))
# flag trips with a non-positive total cost as missing
trip['trip_total'] = trip['trip_total'].where(trip['trip_total'] > 0)
# remove trips with extreme average speeds (80 mph or more)
speed_mph = trip['trip_miles'] / (trip['trip_seconds'] / 3600)
trip = trip[speed_mph < 80]
trip.shape
trip.isna().sum()
pickup_census_tract and dropoff_census_tract are the two variables with the most missing values. The City of Chicago has 77 community areas, and each community area contains multiple census tracts. I grouped the data frame by community area and filled missing census tracts within each community area with the median census tract of that group.
trip.groupby('pickup_community_area').get_group(1).head(3)
# tracts are stored as strings, so compute each group's median on the numeric values
fill_median_tract = lambda x: x.fillna('{:.0f}'.format(pd.to_numeric(x).median()))
trip['pickup_census_tract'] = trip.groupby('pickup_community_area')['pickup_census_tract'].transform(fill_median_tract)
trip['dropoff_census_tract'] = trip.groupby('dropoff_community_area')['dropoff_census_tract'].transform(fill_median_tract)
trip.isna().sum()
trip.shape
My objective is to analyze trips within the Chicago boundary, so I remove pickup and dropoff locations outside of Chicago; any trips whose tracts are still missing or fall outside the boundary are removed along with them.
For this purpose, I used the Boundaries - Census Tracts - 2010 dataset.
geoid = pd.read_csv('data/CensusTractsTIGER2010.csv')
geoid.head(2)
geoid['GEOID10'] = geoid['GEOID10'].astype('str')
list_of_chicago_geoid = geoid['GEOID10'].tolist()
trip = trip[trip['pickup_census_tract'].isin(list_of_chicago_geoid)]
trip = trip[trip['dropoff_census_tract'].isin(list_of_chicago_geoid)]
print(trip.shape)
trip.isna().sum()
trip.dropna(inplace=True)
trip.isna().sum()
trip.shape
print(f'{round((4505930 - len(trip)) / 4505930 * 100, 2)}% of the data were removed')
My first question is what portion of total trips were requested to be pooled.
I found that 11% of all trips were requested to be pooled.
print(trip.shared_trip_authorized.value_counts())
percent_of_shared = trip.shared_trip_authorized.value_counts(normalize=True)[True]
print(f'{percent_of_shared:.0%} of total trips were requested to be pooled')
trip.shared_trip_authorized.value_counts().plot(kind='bar')
plt.title('What portion of trips are requested to be pooled?')
plt.xticks(rotation=0)
plt.ylabel('Number of trips')
plt.savefig('images/target_distribution.png', dpi=300)
plt.show()
Since the target variable is imbalanced, we should consider precision and recall rather than accuracy.
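As a quick sanity check (a sketch, assuming shared_trip_authorized is boolean), a trivial classifier that always predicts the majority class already looks deceptively accurate:
# baseline accuracy of always predicting the majority class ('not shared')
majority_acc = 1 - trip['shared_trip_authorized'].mean()
print(f'majority-class baseline accuracy: {majority_acc:.2%}')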
trip.groupby('shared_trip_authorized').mean(numeric_only=True)
We can see that trips authorized to be shared are longer and less expensive.
We can also see that 68% of authorized shared trips were actually pooled.
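A sketch of how that 68% match rate can be computed, assuming trips_pooled > 1 indicates an actual match:
# among trips where pooling was authorized, the fraction actually matched
authorized = trip[trip['shared_trip_authorized'] == True]
print(f"match rate: {(authorized['trips_pooled'] > 1).mean():.0%}")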
import seaborn as sns
to_drop=['trip_start_timestamp', 'trip_end_timestamp', 'pickup_census_tract',
'pickup_community_area','dropoff_community_area',
'dropoff_census_tract','pickup_centroid_latitude', 'pickup_centroid_longitude',
'dropoff_centroid_latitude', 'dropoff_centroid_longitude']
fig, ax = plt.subplots(figsize=(8,8))
matrix = np.triu(trip.drop(to_drop, axis=1).corr())
sns.heatmap(trip.drop(to_drop, axis=1).corr(), annot=False, mask=matrix, fmt='.1g', square=True, linewidths=.5, cbar_kws={"shrink": .5}, \
center=0, cmap='coolwarm',ax=ax)
plt.tight_layout()
plt.savefig('images/heatmap.png', dpi=300)
I found that additional_charges has a conspicuous correlation with shared_trip_authorized, so this variable might be one of the important features in model prediction. This is expected, since a cheaper trip is likely one of the incentives for a person to request a shared trip.
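To put a rough number on that relationship (a sketch): the point-biserial correlation between the binary target and additional_charges can be read off directly:
# correlation between the 0/1 target and additional charges
print(trip['shared_trip_authorized'].astype(int).corr(trip['additional_charges']))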
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import log_loss
from sklearn.metrics import classification_report
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
s = trip.sample(n=5000, random_state=4)
X=s[['trip_seconds', 'trip_miles', 'fare', 'tip', 'additional_charges','trip_total']]
y=s['shared_trip_authorized']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, stratify = y )
rf = RandomForestClassifier(max_depth=7).fit(X_train, y_train)
rf_pred = rf.predict(X_test)
rf_train_pred = rf.predict(X_train)
rf_probs = rf.predict_proba(X_test)
print('log_loss:', log_loss(y_test, rf_probs))
print('--'*12)
print('train accuracy', rf.score(X_train, y_train))
print('test accuracy', rf.score(X_test, y_test))
print('--'*12)
print('train precision', precision_score(y_train, rf_train_pred))
print('test precision', precision_score(y_test, rf_pred))
print('--'*12)
print('train recall', recall_score(y_train, rf_train_pred))
print('test recall', recall_score(y_test, rf_pred))
print('--'*12)
print('\n')
print(classification_report(y_test, rf_pred, labels=[0, 1]))
feat_importances = pd.Series(rf.feature_importances_, index=X.columns)
feat_importances.nlargest(10).plot(kind='barh')
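Since these scores come from a single split of a 5,000-trip sample, cross-validated scores give a more stable read (a sketch using the same features and model settings):
from sklearn.model_selection import cross_val_score
# 5-fold cross-validated recall on the sampled data
cv_recall = cross_val_score(RandomForestClassifier(max_depth=7), X, y, cv=5, scoring='recall')
print(f'CV recall: {cv_recall.mean():.2f} +/- {cv_recall.std():.2f}')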
#trip.to_csv('data/data_1to14_eda.csv', index=False)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings(action='ignore')
trip= pd.read_csv('data/data_1to14_eda.csv')
Creating a new variable indicating whether the trip was actually pooled:
A traveler can authorize a shared trip without a match ever being made, so shared_trip_authorized can be true or false regardless of whether the trip was actually pooled.
# 1 if the trip was actually matched with at least one other trip, else 0
trip['trips_pooled'] = np.where(trip['trips_pooled'] > 1, 1, 0)
Changing trip_start_timestamp and trip_end_timestamp to datetime objects:
trip['trip_start_timestamp'] = pd.to_datetime(trip['trip_start_timestamp'])
trip['trip_end_timestamp'] = pd.to_datetime(trip['trip_end_timestamp'])
Creating a column showing the day of the week of each trip.
The week starts on Monday (dayofweek == 0):
trip['dayofweek'] = trip['trip_start_timestamp'].dt.dayofweek
Creating a column indicating whether the trip took place on a weekend:
trip['is_weekend']=np.where(trip['dayofweek'].isin([5,6]), 1, 0)
Creating a column showing the session of the day in which the trip started:
b = [0,5,9,16,20,24]
l = ['midnight', 'm_peak','non_peak','e_peak', 'night']
trip['day_session'] = pd.cut(trip['trip_start_timestamp'].dt.hour, bins=b, labels=l, include_lowest=True)
trip['hour'] = trip['trip_start_timestamp'].dt.hour
trip['month'] = trip['trip_start_timestamp'].dt.month
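As a quick check on the binning (a sketch), every trip should fall into exactly one session:
# distribution of trips across the five day sessions
print(trip['day_session'].value_counts())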
trip.info()
ACS_data = pd.read_csv('data/chicago_sociodem_tract.csv',
sep=',')
ACS_data.info()
data = pd.merge(trip, ACS_data.add_suffix('_pickup'), how='left' , left_on='pickup_census_tract', right_on='GEOID_pickup', suffixes=(None, '_pickup')).drop(['GEOID_pickup'],axis=1 )
data = pd.merge(data, ACS_data.add_suffix('_dropoff'), how='left' , left_on='dropoff_census_tract', right_on='GEOID_dropoff', suffixes=(None, '_dropoff')).drop(['GEOID_dropoff'],axis=1 )
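A quick merge sanity check (a sketch): left joins on the tract IDs should preserve the trip count, with unmatched tracts surfacing as NaNs in the ACS columns:
# row counts should match; NaNs indicate tracts missing from the ACS table
print(len(trip), len(data))
print(data.filter(like='_pickup').isna().sum().sum(), 'missing pickup-side ACS values')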
crime = pd.read_csv('data/Control_Variables/Crime_Rate/Chicago_crime_GEOID_mapped.csv')
data = pd.merge(data, crime.add_suffix('_pickup'), left_on='pickup_census_tract', \
right_on='GEOID_pickup', suffixes=(None, '_pickup')).drop(['GEOID_pickup', 'n_pickup'],axis=1 )
data = pd.merge(data, crime.add_suffix('_dropoff'), left_on='dropoff_census_tract', \
right_on='GEOID_dropoff', suffixes=(None, '_dropoff')).drop(['GEOID_dropoff', 'n_dropoff'],axis=1 )
data.info()
job = pd.read_csv('data/Control_Variables/Employment_Density/Chicago_Job_Density.csv')
data = pd.merge(data, job.add_suffix('_pickup'), left_on='pickup_census_tract', \
right_on='GEOID_pickup', suffixes=(None, '_pickup')).drop(['GEOID_pickup', 'count_jobs_pickup', 'area_pickup'],axis=1 )
data = pd.merge(data, job.add_suffix('_dropoff'), left_on='dropoff_census_tract', \
right_on='GEOID_dropoff', suffixes=(None, '_dropoff')).drop(['GEOID_dropoff', 'count_jobs_dropoff', 'area_dropoff'],axis=1 )
hp = pd.read_csv('data/Control_Variables/Housing_and_Population/Chicago_house_pop_GEOID_mapped.csv')
data = pd.merge(data, hp.add_suffix('_pickup'), left_on='pickup_census_tract', \
right_on='GEOID_pickup', suffixes=(None, '_pickup')).drop(['GEOID_pickup', 'house_pickup', 'pop_pickup'],axis=1 )
data = pd.merge(data, hp.add_suffix('_dropoff'), left_on='dropoff_census_tract', \
right_on='GEOID_dropoff', suffixes=(None, '_dropoff')).drop(['GEOID_dropoff', 'house_dropoff', 'pop_dropoff'],axis=1 )
poi = pd.read_csv('data/Control_Variables/POI/POI_GEOID_mapped.csv')
data = pd.merge(data, poi.add_suffix('_pickup'), left_on='pickup_census_tract', \
right_on='GEOID_pickup', suffixes=(None, '_pickup')).drop(['GEOID_pickup'],axis=1 )
data = pd.merge(data, poi.add_suffix('_dropoff'), left_on='dropoff_census_tract', \
right_on='GEOID_dropoff', suffixes=(None, '_dropoff')).drop(['GEOID_dropoff'],axis=1 )
transit = pd.read_csv('data/Control_Variables/Transit_Supply/Chicago_transit_supply_GEOID_mapped.csv')
data = pd.merge(data, transit.add_suffix('_pickup'), left_on='pickup_census_tract', \
right_on='GEOID_pickup', suffixes=(None, '_pickup')).drop(['GEOID_pickup', 'rail_stops_count_pickup', 'bus_stops_count_pickup'],axis=1 )
data = pd.merge(data, transit.add_suffix('_dropoff'), left_on='dropoff_census_tract', \
right_on='GEOID_dropoff', suffixes=(None, '_dropoff')).drop(['GEOID_dropoff', 'rail_stops_count_dropoff', 'bus_stops_count_dropoff'],axis=1 )
#pd.read_csv('data/Control_Variables/Weather_Data/weatherdata.csv')
data.info()
airport_geoid = ['17031980100', '17031980000']
data['is_airport']=np.where(data['pickup_census_tract'].isin(airport_geoid) | data['dropoff_census_tract'].isin(airport_geoid) , 1, 0)
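A quick check (a sketch) of how many trips touch one of the two airport tracts:
# share of trips that start or end at an airport census tract
print(data['is_airport'].value_counts(normalize=True))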
data.info()
to_drop = ['trip_start_timestamp', 'trip_end_timestamp']
# to_drop = ['trip_start_timestamp', 'trip_end_timestamp',
# 'pickup_community_area', 'dropoff_community_area',
# 'pickup_centroid_latitude', 'pickup_centroid_longitude',
# 'dropoff_centroid_latitude', 'dropoff_centroid_longitude',
# 'portion_twoplus_pickup','portion_twoplus_dropoff',
# 'portion_hispanic_pickup','portion_hispanic_dropoff',
# 'portion_renter_occ_pickup','portion_renter_occ_dropoff']
data.drop(to_drop, axis=1, inplace=True)
data['shared_trip_authorized'] = data['shared_trip_authorized'].astype(int)
data.info()
group = ['pickup_census_tract', 'dropoff_census_tract', 'hour', 'is_weekend', 'is_airport']
s = data.groupby(group)
s = s.agg({'trip_seconds':'mean',
'trip_miles': 'mean',
'trip_total':'mean',
'portion_twoplus_pickup': 'median','portion_twoplus_dropoff':'median',
'portion_hispanic_pickup':'median','portion_hispanic_dropoff':'median',
'portion_renter_occ_pickup':'median','portion_renter_occ_dropoff':'median',
'crime_density_pickup':'median','crime_density_dropoff':'median',
'job_density_pickup': 'median', 'job_density_dropoff': 'median',
'house_density_pickup':'median', 'house_density_dropoff': 'median',
'pop_density_pickup': 'median', 'pop_density_dropoff': 'median',
'POI_density_pickup':'median', 'POI_density_dropoff': 'median',
'bus_stops_density_pickup': 'median', 'bus_stops_density_dropoff': 'median',
'rail_stops_density_pickup': 'median', 'rail_stops_density_dropoff': 'median',
'trips_pooled': 'count',
'shared_trip_authorized': 'mean'  # share of trips in the group that authorized pooling (WTP)
}).reset_index()
s['midnight'] = s['hour'].apply(lambda x: 1 if (x>=0 and x<5) else 0)
s['m_peak'] = s['hour'].apply(lambda x: 1 if (x>=5 and x<9) else 0)
s['non_peak'] = s['hour'].apply(lambda x: 1 if (x>=9 and x<16) else 0)
s['e_peak'] = s['hour'].apply(lambda x: 1 if (x>=16 and x<20) else 0)
s['night'] = s['hour'].apply(lambda x: 1 if (x>=20 and x<24) else 0)
s = s.rename(columns={'trips_pooled': 'count'})
s.shape
s=s[s['count'] > 100]
s.shape
s = s.fillna(0)
s.isna().sum()
s.info()
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
s_scaled = pd.DataFrame(scaler.fit_transform(s), columns=s.columns)
s_scaled.head(2)  # note: the OLS below is fit on the unscaled s; s_scaled is kept for reference
# drop the tract identifiers, the target itself, and one session dummy (to avoid the dummy-variable trap)
X = s.drop(['pickup_census_tract', 'dropoff_census_tract', 'shared_trip_authorized', 'midnight'], axis=1)
y = s['shared_trip_authorized']
import statsmodels.api as sm
X = sm.add_constant(X)
est = sm.OLS(y, X).fit()
est.summary()
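To read the fit more quickly (a sketch using the statsmodels summary table), the significant coefficients can be filtered out:
# keep only coefficients with p < 0.05
coef_table = est.summary2().tables[1]
print(coef_table[coef_table['P>|t|'] < 0.05])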
import seaborn as sns
to_drop=[0,1]
fig, ax = plt.subplots(figsize=(20,20))
matrix = np.triu(s.drop(s.columns[to_drop], axis=1).corr())
sns.heatmap(s.drop(s.columns[to_drop], axis=1).corr(), annot=True, fmt='.1g', square=True, linewidths=.5, cbar_kws={"shrink": .5},ax=ax)
plt.tight_layout()
plt.savefig('images/heatmap_aggregated.png', dpi=300)  # separate file so the earlier heatmap is not overwritten
data.info()
pd.crosstab(data.is_airport, data.shared_trip_authorized, normalize=False).plot(kind="bar")
plt.title('WTP (shared_trip_authorized ~ is_airport)')
plt.xticks(rotation=0)
plt.ylabel('Frequency')
#plt.savefig('1.png')
plt.show()
from scipy.stats import chi2_contingency
chi2, p, dof, ex = chi2_contingency(pd.crosstab(data.is_airport, data.shared_trip_authorized))
print("chi2 = ", chi2)
print("p-val = ", p)
print("degree of freedom = ",dof)
From the bar chart we can see that the chance of WTP is higher in non-airport areas. From the chi-square analysis, the p-value is less than 0.05; therefore, we have enough evidence to reject the null hypothesis that the airport area has no effect on authorizing a trip to be pooled.
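On a dataset this large, a small p-value is almost guaranteed, so an effect-size measure is a useful complement; a sketch using Cramér's V computed from the chi-square statistic above:
# Cramér's V for a 2x2 table reduces to sqrt(chi2 / n), since min(rows-1, cols-1) == 1
n = len(data)
print("Cramér's V =", np.sqrt(chi2 / n))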
pd.crosstab(data.is_weekend, data.shared_trip_authorized, normalize=False).plot(kind="bar")
plt.title('WTP (shared_trip_authorized ~ is_weekend)')
plt.xticks(rotation=0)
plt.ylabel('Frequency')
#plt.savefig('1.png')
plt.show()
from scipy.stats import chi2_contingency
chi2, p, dof, ex = chi2_contingency(pd.crosstab(data.is_weekend, data.shared_trip_authorized))
print("chi2 = ", chi2)
print("p-val = ", p)
print("degree of freedom = ", dof)
From the chi-square analysis, the p-value is less than 0.05; therefore, we have enough evidence to reject the null hypothesis that the weekend has no effect on authorizing a trip to be pooled.
import statsmodels.api as sm
import statsmodels.formula.api as smf
cols = ['trip_seconds', 'trip_miles', 'fare', 'tip', 'additional_charges',
'trip_total', 'hour', 'month', 'portion_low_income_pickup',
'portion_high_income_pickup', 'portion_bachelor_up_pickup',
'portion_female_pickup', 'portion_no_car_pickup',
'portion_white_pickup', 'portion_black_pickup',
'portion_twoplus_pickup', 'portion_hispanic_pickup',
'portion_renter_occ_pickup', 'portion_young18_44_pickup',
'portion_low_income_dropoff', 'portion_high_income_dropoff',
'portion_bachelor_up_dropoff', 'portion_female_dropoff',
'portion_no_car_dropoff', 'portion_white_dropoff',
'portion_black_dropoff', 'portion_twoplus_dropoff',
'portion_hispanic_dropoff', 'portion_renter_occ_dropoff',
'portion_young18_44_dropoff']
dict_pval={}
for item in cols:
    model1 = smf.ols(f'{item} ~ C(shared_trip_authorized)', data=data).fit()
    dict_pval[item] = model1.f_pvalue
dict_pval
pd.DataFrame.from_dict(dict_pval, orient='index', columns=['p_value']).sort_values('p_value')
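With 30 features tested simultaneously, some small p-values are expected by chance alone; a Bonferroni-style correction is a prudent check (a sketch):
# count features that survive a Bonferroni-corrected threshold
alpha = 0.05 / len(cols)
significant = [k for k, v in dict_pval.items() if v < alpha]
print(f'{len(significant)} of {len(cols)} features significant at alpha = {alpha:.4f}')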
#data.to_csv('trip_Nov_1to14_clean.csv', index=False)