Segmenting and Clustering Italian Restaurants in Sao Paulo

Leonardo Janes Carneiro
9 min read · Jun 29, 2020


This post gives an overview of my Capstone project, the final assignment of the IBM Data Science Professional Certificate.

I go through the problem description, data preprocessing and analysis of Italian restaurants in the city of Sao Paulo. Detailed code is available on GitHub, and a link is provided at the end of this post.

1. Introduction

1.1. Background

Sao Paulo is the most populous city in Brazil, with approximately 12.25 million inhabitants as of 2019. It exerts strong international influence in commerce, finance, arts and entertainment and is listed as an alpha global city by the Globalization and World Cities Study Group (GaWC). It has the largest economy by Gross Domestic Product (GDP) in Latin America and the Southern Hemisphere, representing 10.7% of Brazilian GDP, and is home to 63% of the multinational companies established in Brazil.

Sao Paulo is also a cosmopolitan, ethnically diverse melting pot, home to the largest Arab, Italian, Japanese, and Portuguese diasporas. It also has the largest Jewish population in Brazil, with almost 75,000 Jews. In 2016, inhabitants of Sao Paulo were native to 196 different countries.

Such a diverse culture translates to a diverse cuisine. We can find many different categories of restaurants in Sao Paulo: Italian, Asian, Middle Eastern, just to name a few. Sao Paulo attracts many people looking to start businesses in the food industry, either small ones such as mobile food vendors, food trucks and fast food joints, or larger ones such as restaurants. Before starting to operate, though, they need to find an appropriate location. What do they take into account when making this decision?

1.2. Business Problem

By exploring the districts of Sao Paulo, I hope to find out whether opening a restaurant in a neighborhood that already has many restaurants plays a role in the success of the business. I will explore, segment and cluster restaurants in the city of Sao Paulo. To simplify the analysis, this report focuses on Italian cuisine, chosen for the total number of Italian restaurants found in Sao Paulo. The methodology used here applies to other cuisines as well.

2. Data

I list below the necessary data for this report:

a. Sao Paulo's districts data.

Source:

https://pt.wikipedia.org/wiki/Lista_dos_distritos_de_S%C3%A3o_Paulo_por_popula%C3%A7%C3%A3o

Description:

A table containing the names of all of Sao Paulo's districts.

b. Districts' geographical coordinates.

Source:

Nominatim geocoder from the geopy library.

Description:

Using the names of Sao Paulo's districts, I used the Nominatim geocoder from geopy's geocoders module to extract latitude and longitude coordinates.

c. Venues in each district of Sao Paulo.

Source:

Foursquare API.

Description:

From the API, I collect all venues available along with their categories, ratings and counts for likes and tips.
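The coordinate lookup in item (b) could look roughly like the sketch below. It is not the code from my repository: the function names, the column names and the query suffix are assumptions made for illustration. Appending the city and country to each query is one possible way to reduce mismatches with places that share a district's name. The geopy import is kept inside a helper so that the rest of the sketch runs without a network connection.

```python
from typing import Callable, Iterable

import pandas as pd


def make_nominatim_geocode(user_agent: str = "sp_districts_example") -> Callable:
    """Return geopy's Nominatim.geocode (imported lazily; needs geopy and network access)."""
    from geopy.geocoders import Nominatim
    return Nominatim(user_agent=user_agent).geocode


def geocode_districts(names: Iterable[str], geocode: Callable,
                      suffix: str = ", São Paulo, SP, Brazil") -> pd.DataFrame:
    """Look up each district name; districts with no match get missing coordinates."""
    rows = []
    for name in names:
        # Hypothetical disambiguation: qualify the district with city and country.
        location = geocode(name + suffix)
        rows.append({
            "District": name,
            "Latitude": location.latitude if location else None,
            "Longitude": location.longitude if location else None,
        })
    return pd.DataFrame(rows, columns=["District", "Latitude", "Longitude"])
```

With geopy installed, `geocode_districts(district_names, make_nominatim_geocode())` would return one row per district. Any callable that takes a query string and returns an object with `latitude` and `longitude` attributes (or `None`) can be passed instead, which makes the function easy to check offline.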

2.1. Data Preparation

Since many of Sao Paulo's districts share their names with other Brazilian cities, it took a little extra work to collect the latitude and longitude coordinates for all districts. I stored them in a CSV file, available at the following link: https://s3-api.us-geo.objectstorage.service.networklayer.com.

Then, I downloaded the file and read it into a pandas data-frame with district, latitude and longitude columns, such that each row represents a different district. There are 96 districts in the data-frame. Below, I display the first five rows:

Table 1: Districts and geographical coordinates

2.2. Foursquare Data

Next, I used the Foursquare API to get the necessary data. For each district, I used its latitude and longitude coordinates to extract a maximum of 100 venues located within a 500-meter radius. These data were stored in another data-frame consisting of 2,640 venues and 222 different categories.

In another API request, I collected ratings, likes and tips counts for Italian restaurants found in the city of Sao Paulo. I found available data for a total of 48 restaurants.
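The first request above can be sketched as follows. I am assuming here the Foursquare v2 `venues/explore` endpoint and its usual response layout (`response` → `groups` → `items`). The helper names, the version date and the output column names are hypothetical, and no credentials or real responses are included.

```python
import pandas as pd

# Assumed endpoint (Foursquare v2 API); check the current Foursquare docs before use.
EXPLORE_URL = "https://api.foursquare.com/v2/venues/explore"


def build_explore_params(lat, lng, client_id, client_secret,
                         version="20200601", radius=500, limit=100):
    """Query parameters for one district: up to `limit` venues within `radius` meters."""
    return {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,
        "ll": f"{lat},{lng}",
        "radius": radius,
        "limit": limit,
    }


def parse_explore_response(district, payload):
    """Flatten one explore response (already decoded from JSON) into venue rows."""
    groups = payload.get("response", {}).get("groups") or [{}]
    rows = []
    for item in groups[0].get("items", []):
        venue = item["venue"]
        categories = venue.get("categories") or [{}]
        rows.append({
            "District": district,
            "Venue": venue.get("name"),
            "Venue Latitude": venue.get("location", {}).get("lat"),
            "Venue Longitude": venue.get("location", {}).get("lng"),
            "Category": categories[0].get("name"),
        })
    return pd.DataFrame(rows)
```

In practice the payload would come from something like `requests.get(EXPLORE_URL, params=params).json()`, called once per district, with the resulting frames combined using `pd.concat`.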

2.3. Exploratory Data Analysis

After collecting all venues, I filtered the Category column for Italian Restaurants. Then, I grouped them by district, counted the number of Italian restaurants found in each district and plotted a bar chart.

Figure 1: Italian restaurants in Sao Paulo
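The filter-and-count step behind Figure 1 can be sketched with pandas. The rows below are made up for illustration and are not the data collected for this post. The column names `District` and `Category` and the category label `"Italian Restaurant"` are assumptions.

```python
import pandas as pd

# Made-up rows for illustration only; not the venues collected in this post.
venues = pd.DataFrame({
    "District": ["Bela Vista", "Bela Vista", "Itaim Bibi", "Itaim Bibi", "Moema"],
    "Category": ["Italian Restaurant", "Bar", "Italian Restaurant",
                 "Italian Restaurant", "Bakery"],
})


def count_by_district(venues, category="Italian Restaurant"):
    """Count venues of one category per district, largest count first."""
    selected = venues[venues["Category"] == category]
    return selected.groupby("District").size().sort_values(ascending=False)


italian_counts = count_by_district(venues)
```

With matplotlib installed, calling `italian_counts.plot(kind="bar")` would draw a bar chart of the counts.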

There are 49 restaurants distributed across 24 districts. Almost half of them (22), however, are concentrated in the districts of Bela Vista, Itaim Bibi and Jardim Paulista. Bela Vista is home to a large community of Italian immigrants, so it is no surprise to find many restaurants there.

Next, we take a look at the information collected from these restaurants. Only 48 of them had available ratings, likes and tips counts. Thus, due to missing data, we dropped the district of Jaguaré from our database.

I grouped the restaurants by district once again and calculated the average for the ratings column. I display below the top 10 districts based on average rating.

Table 2: Top districts by average rating
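A minimal sketch of how a ranking like Table 2 could be produced, assuming a `Rating` column in which restaurants without data hold missing values. The ratings below are hypothetical and the function name is mine. Dropping the unrated rows first means a district whose restaurants all lack ratings disappears from the result.

```python
import pandas as pd

# Hypothetical ratings, for illustration only.
restaurants = pd.DataFrame({
    "District": ["Bela Vista", "Bela Vista", "Itaim Bibi", "Jaguaré"],
    "Rating": [8.0, 9.0, 8.4, None],
})


def top_districts_by_rating(restaurants, n=10):
    """Average rating per district (unrated restaurants dropped), best n first."""
    rated = restaurants.dropna(subset=["Rating"])
    return (rated.groupby("District")["Rating"].mean()
                 .sort_values(ascending=False)
                 .head(n)
                 .reset_index(name="Average Rating"))


top = top_districts_by_rating(restaurants)
```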

We saw that the district of Itaim Bibi hosts 11 Italian restaurants, the most of any district. Since the average rating for these restaurants is approximately 8.4, it seems that a few of the best places to go for Italian food are found there.

2.4. One Hot Encoding

In order to further inspect the neighborhoods around the Italian restaurants, I use pandas one hot encoding to find the 10 most common venue categories in each of the 23 districts. There are a total of 1,395 venues and 222 different venue categories.

First, I used the get_dummies function from pandas to create one column for each category. Then I grouped the venues by district and calculated the proportion of each category. Finally, I ran a for loop and created a data-frame with the districts as rows and the 1st through 10th most common venue categories as columns. Later, I will merge this data-frame with the feature data-frame and filter by cluster label to examine common patterns across districts within the same cluster.
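The three steps above (one column per category, per-district proportions, then a loop that ranks categories) can be sketched as follows. The toy venues and the output column labels are assumptions made for illustration. If a district set has fewer than `k` categories, the sketch simply ranks as many as exist.

```python
import pandas as pd

# Toy venues for illustration only.
venues = pd.DataFrame({
    "District": ["A", "A", "A", "B"],
    "Category": ["Café", "Café", "Bar", "Bar"],
})


def top_categories(venues, k=10):
    """Return (per-district category proportions, top-k categories per district)."""
    onehot = pd.get_dummies(venues["Category"], dtype=float)
    onehot.insert(0, "District", venues["District"].values)
    # The mean of a 0/1 column within a district is that category's share of venues.
    freq = onehot.groupby("District").mean()
    k = min(k, freq.shape[1])
    rows = []
    for district, shares in freq.iterrows():
        ranked = shares.sort_values(ascending=False).index[:k]
        rows.append([district, *ranked])
    columns = ["District"] + [f"Most Common Venue {i}" for i in range(1, k + 1)]
    return freq, pd.DataFrame(rows, columns=columns)


freq, top_venues = top_categories(venues, k=2)
```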

2.5. Feature Selection

Since this report focuses on Italian restaurants, I selected those districts where we can find at least one Italian restaurant, along with their respective average ratings, and merged these columns with our geographical coordinates data-frame. The first five rows of this merged data-frame are shown below.

Table 3: Feature data-frame

Each row in this data-frame is a different district, with a total of 23. The numeric columns will be used as features for our clustering algorithm.
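As a rough sketch, the feature data-frame can be built with an inner merge, which keeps only districts present in both tables, that is, districts with at least one rated Italian restaurant. All values below are placeholders and the column names are assumptions.

```python
import pandas as pd

# Placeholder values for illustration only.
coords = pd.DataFrame({
    "District": ["A", "B", "C"],
    "Latitude": [-23.50, -23.55, -23.60],
    "Longitude": [-46.60, -46.65, -46.70],
})
avg_rating = pd.DataFrame({
    "District": ["A", "C"],
    "Average Rating": [8.1, 6.9],
})

# Inner merge: district "B" has no rated Italian restaurant, so it is dropped.
features = coords.merge(avg_rating, on="District", how="inner")

# Numeric columns that would be passed to the clustering algorithm.
numeric = features[["Latitude", "Longitude", "Average Rating"]]
```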

3. Methodology

First, I extract the numeric columns from our feature data-frame and normalize them, since they are on different scales. Then, I use the k-means algorithm from the cluster module of the scikit-learn Python library to cluster the districts. I ran the algorithm with two, three, four and five clusters and used the elbow method to choose the optimal number of clusters. The optimal number seems to be 3.
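The procedure above can be sketched with scikit-learn. Two things here are assumptions: `StandardScaler` stands in for the normalization step (the exact scaler is not stated above), and the data is synthetic, generated only so the example runs on its own. The elbow method compares the inertia (within-cluster sum of squared distances) across values of k and looks for the point where the decrease flattens.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the (latitude, longitude, average rating) columns.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0, 0.0], [5.0, 5.0, 5.0], [-5.0, 5.0, 0.0]])
X = np.vstack([c + rng.normal(scale=0.3, size=(8, 3)) for c in centers])


def elbow_inertias(features, ks=(2, 3, 4, 5), seed=0):
    """Scale the features, then record the k-means inertia for each k."""
    scaled = StandardScaler().fit_transform(features)
    inertias = {}
    for k in ks:
        model = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(scaled)
        inertias[k] = float(model.inertia_)
    return scaled, inertias


scaled, inertias = elbow_inertias(X)

# Fit the final model with the k suggested by the elbow (3 for this synthetic data).
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(scaled)
```

Plotting `inertias` against k gives the elbow curve. The labels can then be stored as a new column in the feature data-frame.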

I stored the clusters' labels (0, 1 and 2) as a new column in our features data-frame and used the folium Python library to create a map centered in the city of Sao Paulo, with markers for each district and colored by cluster label. The map is displayed below.

Figure 2: Map of Sao Paulo
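A map like Figure 2 can be sketched with folium as below. The color mapping follows the colors described in the Results section (red for label 0, purple for 1, green for 2). The map center is an approximate coordinate for Sao Paulo, and the helper names, marker radius and zoom level are assumptions. The folium import sits inside `build_map` so that the marker preparation can be checked without folium installed.

```python
import pandas as pd

CLUSTER_COLORS = {0: "red", 1: "purple", 2: "green"}


def marker_specs(features):
    """Turn each district row into the location, popup text and color of a marker."""
    specs = []
    for district, lat, lng, label in zip(features["District"], features["Latitude"],
                                         features["Longitude"], features["Cluster Label"]):
        specs.append({
            "location": [float(lat), float(lng)],
            "popup": f"{district} (cluster {int(label)})",
            "color": CLUSTER_COLORS.get(int(label), "gray"),
        })
    return specs


def build_map(specs, center=(-23.55, -46.63), zoom_start=11):
    """Draw one circle marker per district on a folium map (requires folium)."""
    import folium
    city_map = folium.Map(location=list(center), zoom_start=zoom_start)
    for spec in specs:
        folium.CircleMarker(
            location=spec["location"], radius=6, popup=spec["popup"],
            color=spec["color"], fill=True, fill_color=spec["color"],
            fill_opacity=0.7,
        ).add_to(city_map)
    return city_map
```

Calling `build_map(marker_specs(features)).save("sao_paulo_clusters.html")` would write the map to a standalone HTML file. The file name is only an example.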

4. Results

In order to further inspect each cluster, I merged our features data-frame (now with the cluster labels) with the one hot encoding data-frame and filtered for one cluster label at a time. I selected the district, average rating, cluster label and the most common venues columns, as displayed below for cluster label 1.

Table 4: Cluster label 1 data
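The merge-and-filter step, together with the per-cluster rating statistics used in the discussion of each cluster, can be sketched as follows. The values are placeholders and the column names are assumptions.

```python
import pandas as pd

# Placeholder values for illustration only.
features = pd.DataFrame({
    "District": ["A", "B", "C", "D"],
    "Average Rating": [6.0, 7.0, 8.5, 9.0],
    "Cluster Label": [1, 1, 2, 2],
})
top_venues = pd.DataFrame({
    "District": ["A", "B", "C", "D"],
    "Most Common Venue 1": ["Bakery", "Bar", "Italian Restaurant", "Café"],
})


def cluster_view(features, top_venues, label):
    """Attach the most common venues to each district and keep one cluster."""
    merged = features.merge(top_venues, on="District", how="left")
    return merged[merged["Cluster Label"] == label].reset_index(drop=True)


def cluster_rating_summary(features):
    """Minimum, maximum, mean and number of districts per cluster label."""
    return (features.groupby("Cluster Label")["Average Rating"]
                    .agg(["min", "max", "mean", "count"]))


cluster_1 = cluster_view(features, top_venues, label=1)
summary = cluster_rating_summary(features)
```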

To begin our analysis, we can see from the average rating column that the values range from 5.40 to 7.95, with a mean value of 6.67. It looks like the worst reviewed restaurants, on average, were assigned to this cluster, with a couple of outliers in the district of Vila Leopoldina. The presence of grocery stores, bakeries, pastelarias and bars might suggest a public preference for lighter meals in these districts.

In our map, districts assigned to cluster label 1 are colored purple. It seems that for cluster label 1 the average rating feature had a higher weight than the coordinates, since these districts are more spread around the city.

Cluster label 2, on the other hand, was assigned to districts whose average ratings are considerably higher. The values range from 7.84 to 9.20, with a mean value of 8.24. Unlike cluster label 1, restaurants in these districts seem to be the top rated. Furthermore, Italian restaurants seem to concentrate in the districts assigned to cluster label 2, which hold 30 restaurants, or 62.5% of our database.

Districts within cluster label 2 are shown as green dots in our map. We can see that they seem to concentrate in the south side of Sao Paulo, which hosts the city's wealthiest neighborhoods. Additionally, Bela Vista, the district home to a large Italian community, was assigned to cluster label 2. Thus, restaurants found in cluster label 2 are the most likely to make the best Italian food in Sao Paulo.

Table 5: Cluster label 2 data

Finally, districts assigned to cluster label 0 seem to host the most diverse neighborhoods. Looking at the most common venues columns, we can find a wide variety of categories, from plazas and shopping malls to electronics and toy/game stores. There are not many Italian restaurants listed here (only 8), but most of them have great reviews.

The red dots in our map represent districts within cluster label 0. They seem to concentrate in central Sao Paulo, with a few outliers on the east side. Indeed, there is a great diversity of places and people found in central Sao Paulo and the clustering algorithm seems to have captured this pattern.

Table 6: Cluster label 0 data

5. Discussion

As stated before, Sao Paulo has a diverse culture and, therefore, a diverse cuisine, as shown by the most common venues extracted from the Foursquare API. The number of Italian restaurants found in each district varies, with most districts hosting only one. The restaurants are also concentrated in three districts (Bela Vista, Itaim Bibi and Jardim Paulista). Since the distribution of Italian restaurants seems to be skewed toward these districts, a more detailed and accurate analysis might need to drill down to the neighborhood or street level.

Also, I ignored other factors that may affect the success of a business, such as population density at the district level, number of employees and price range across restaurants, among others, due to a lack of available data for some of them. Hence, our analysis only helps travelers and tourists get an overview of the distribution of Italian restaurants at the district level in the city of Sao Paulo.

I chose the k-means algorithm with the number of clusters set to 3, as suggested by the elbow method, but other clustering approaches, such as DBSCAN, as well as other numbers of clusters, could also be tested. The results may vary should other techniques be used.
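For reference, trying DBSCAN on the same kind of scaled features could look like the sketch below. This was not part of my analysis. The data is synthetic, and the `eps` and `min_samples` values were picked for this toy example only, so they would need tuning on the real features. Unlike k-means, DBSCAN does not take the number of clusters as input, and it labels points that fit no cluster as -1.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the (latitude, longitude, average rating) columns.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0, 0.0], [5.0, 5.0, 5.0], [-5.0, 5.0, 0.0]])
X = np.vstack([c + rng.normal(scale=0.3, size=(8, 3)) for c in centers])

scaled = StandardScaler().fit_transform(X)

# eps and min_samples are assumptions chosen for this toy data.
dbscan_labels = DBSCAN(eps=0.8, min_samples=3).fit_predict(scaled)

# Count the clusters found, ignoring the noise label -1.
n_clusters = len(set(dbscan_labels) - {-1})
```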

I ended this study by visualizing all relevant data and clustering results on a city map of Sao Paulo. In future studies, GeoJSON data with district boundaries could be added to create a choropleth map with information on district name, population density, Italian restaurants' average rating, cluster label and most common venue categories. This map could be further extended to web and smartphone applications to guide future investors.

6. Conclusion

People are turning to big cities to start a new business or work. In a world moving at a fast pace, many real life problems can be solved with the help of data. In this report, data was used to cluster districts in Sao Paulo according to geographical location (latitude and longitude) and average rating for Italian restaurants. With its drawbacks and potential for improvements, our results should still help a traveler find the cluster of districts with the best Italian food.

Similarly, data can be used to solve problems of many different kinds. Investors, and even city or state governments, can manage their affairs better with the help of data analysis.

Code for this report can be found here.


Leonardo Janes Carneiro

Economist, aspiring data scientist. Looking for the right questions to ask.