As a real ski fan, I wanted to take a look at what data we have about ski resorts and how we could label it. This is a very small project, but it will also give me an idea of how complex such a project could be and which steps would be the most difficult.
I decided to use the website www.nevasport.com, which has a lot of users, to understand what kind of information I could extract about ski resorts and their users. I wanted this whole project to be based on Python. I found an interesting Python module called BeautifulSoup that, together with webdriver, allowed me to navigate through the site's web pages and extract the information I wanted. Of course, it would have been much easier if the site provided an API to call and extract information, but that wasn't the case. Maybe sites like this one could start thinking about selling their data or making it available to others.
Using those packages, I was able to extract the list of all users, the list of all the ski resorts in the system, how many ski resorts each user had visited and the ratings each user gave to the resorts they visited. This wasn't an easy task, as I had to consider the format of each page and how the information was displayed in order to extract anything useful.
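The actual page structure is specific to the site and has to be inspected by hand; as a rough sketch of the BeautifulSoup side of the extraction, assuming a hypothetical ratings page where each rating sits in its own div, it could look like this:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment of a user's ratings page; the real
# nevasport.com markup differs and drove the actual parsing code.
html = """
<div class="ratings">
  <div class="rating"><span class="resort">Candanchu</span><span class="score">9</span></div>
  <div class="rating"><span class="resort">Formigal</span><span class="score">7</span></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
ratings = [
    (row.find("span", class_="resort").text,
     int(row.find("span", class_="score").text))
    for row in soup.find_all("div", class_="rating")
]
print(ratings)  # [('Candanchu', 9), ('Formigal', 7)]
```

In practice webdriver handles the navigation (logins, pagination, JavaScript-rendered pages) and BeautifulSoup handles parsing the HTML it returns.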
Once the extraction process was defined, all that info was dumped into flat files on my storage system. This was a time-consuming task, as my computer had to browse the entire website. In fact, extracting all that information was the lengthiest task in the whole process.
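The flat-file format isn't described above; a minimal sketch, assuming one CSV file per data set with a header row and one scraped rating per line:

```python
import csv
import os
import tempfile

# Hypothetical scraped rows: (user, resort, score).
rows = [
    ("user1", "Candanchu", 9),
    ("user1", "Formigal", 7),
    ("user2", "Candanchu", 8),
]

path = os.path.join(tempfile.gettempdir(), "ratings.csv")
with open(path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["user", "resort", "score"])  # header row
    writer.writerows(rows)

# Reading it back gives strings, ready to be loaded elsewhere.
with open(path, newline="") as f:
    loaded = list(csv.reader(f))
print(loaded[0])  # ['user', 'resort', 'score']
```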
The next step in the process was to put some order into the data stored in those flat files. As I wanted to run some queries against that data, I decided to use containers to bring up a MySQL instance and fill a database with the collected data, so I could run queries to better understand what information we had.
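The project used MySQL in a container; to keep the sketch below self-contained I use Python's built-in sqlite3 as a stand-in, with a hypothetical ratings table (not the project's actual schema), to show the kind of query this setup enables:

```python
import sqlite3

# sqlite3 stands in for the containerized MySQL instance here;
# the table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ratings (user TEXT, resort TEXT, score REAL)")
conn.executemany(
    "INSERT INTO ratings VALUES (?, ?, ?)",
    [("user1", "Candanchu", 0.9),
     ("user2", "Candanchu", 0.8),
     ("user1", "Formigal", 0.7)],
)

# Visit count and average rating per resort, the same shape of
# result as the tables shown later in this post.
result = conn.execute(
    "SELECT resort, COUNT(*) AS num_visits, AVG(score) AS avg "
    "FROM ratings GROUP BY resort ORDER BY avg DESC"
).fetchall()
print(result)
```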
On that site, users can rate ski resorts on different characteristics, like easy access, freeride areas, easy slopes, family skiing, etc. I thought this could be quite interesting for classifying ski resorts according to those characteristics.
For example, we could find good resorts for beginners based on the ski resort's access, how easy it is to reach the beginner areas and whether they have good slopes for beginners. With the extracted data, I used sklearn and KMeans to classify the ski resorts based on those characteristics.
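As a sketch of this step (the feature values below are invented; the real ones came from the scraped nevasport ratings), clustering resorts on two beginner-oriented characteristics looks like this:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical (access_score, beginner_slopes_score) per resort.
X = np.array([
    [0.90, 0.95], [0.85, 0.90],   # strong for beginners
    [0.50, 0.55], [0.45, 0.50],   # middling
    [0.10, 0.20], [0.15, 0.10],   # weak
])

# random_state is fixed only to make the sketch reproducible.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
labels = kmeans.predict(X)

# Resorts with similar characteristics end up in the same cluster.
print(labels)
```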
The big stars represent the centroids of the 3 clusters we asked the algorithm to create.
As a note (and I know this will stir up a lot of controversy), these are the top-rated ski resorts in the top cluster, with the number of visits from nevasport users and the average rating (from 0 to 1):
   resort                  num_visits  avg
0  Cerro-Castor                    57  1.000000
1  Teide-Ski-Resort               229  0.933333
2  Courchevel                     437  0.900000
3  Candanchu                     1589  0.896296
4  Vallnord-Pal-Arinsal          1027  0.888889
5  La-Molina                      879  0.868254
Top five in the middle:
   resort                   num_visits  avg
0  Vallnord                       1018  0.760000
1  Formigal                       2163  0.749351
2  Masella                         885  0.748148
3  Vallnord-Ordino-Arcalis        1001  0.746667
4  Val-Thorens                     582  0.733333
And the top five in the low one:
   resort         num_visits  avg
0  Soelden               172  0.666667
1  Tignes                708  0.666667
2  Ischgl                261  0.533333
3  Tavascan              122  0.466667
4  Vall-de-Nuria         253  0.400000
A more interesting classification for ski fans is which ski resorts are good for experts. To find out, I used two metrics: slopes for experts and freeride areas.
This is the initial data, before labeling:
After applying the KMeans algorithm to create three clusters, this is how the data looks, with the cluster centroids:
These are the ski resorts on the top:
   resort                               num_visits  avg
0  Aspen-Snowmass-Highlands-Buttermilk          98  1.000000
1  Val-Thorens                                 582  1.000000
2  Val-d-Isere                                 555  1.000000
3  Teide-Ski-Resort                            229  1.000000
4  Stubaier-Gletscher                           95  1.000000
5  Snowbird                                     34  1.000000
Ski resorts in the middle:
   resort               num_visits  avg
0  Cerler                     1732  0.853333
1  Los-Penitentes               27  0.850000
2  Grandvalira                2130  0.830285
3  Fuentes-de-Invierno         304  0.814286
4  Astun                      1625  0.807516
And the ones on the low:
   resort           num_visits  avg
0  Javalambre              400  0.611905
1  Zillertal               104  0.600000
2  Port-Aine               592  0.580952
3  Sierra-de-Bejar         462  0.580000
4  San-Isidro              601  0.564286
The most interesting part of this project was using Python libraries that make extracting and analyzing information very easy. Around 20 years ago, I spent several months programming algorithms to do things like this, while nowadays, with a few lines of code, I can run the KMeans algorithm:
```python
from sklearn.cluster import KMeans

# Number of clusters
kmeans = KMeans(n_clusters=3)
# Fitting the input data
kmeans = kmeans.fit(X)
# Getting the cluster labels
labels = kmeans.predict(X)
# Centroid values
C = kmeans.cluster_centers_
```
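How the three clusters map to "top", "middle" and "low" isn't automatic: KMeans only assigns arbitrary cluster ids. One way to rank them (my own sketch, using toy one-dimensional data rather than the project's real features) is to sort the clusters by their centroid values:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy average ratings, one per resort, just to illustrate the step.
X = np.array([[0.95], [0.90], [0.75], [0.70], [0.45], [0.40]])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
C = kmeans.cluster_centers_

# Sort cluster ids from highest to lowest centroid value: the first
# id is the "top" cluster, the last one the "low" cluster.
order = np.argsort(-C[:, 0])
top_cluster = order[0]
print([i for i, lab in enumerate(kmeans.labels_) if lab == top_cluster])
```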
To summarize, this was the workflow to extract, classify and analyze the information: scrape the website, dump the raw data into flat files, load it into a MySQL database and run KMeans on the features of interest.
One lesson learnt is that extracting data can be a complex task. I only used one website, but there are many more sites out there with useful data. The same thing happens with the data sources of any company or business. Normalizing that data can be very time-consuming, as each source will represent data with different formats and metrics.
Another point is how to manage that data. This was a small project with a very small data footprint, but real enterprise projects will require storing large amounts of data and making it accessible to all the different users who will be running different projects against that same data. Here containers are a very interesting tool, but you also have to make sure the storage subsystem can provide the speed needed and the flexibility to share that data across all users.
And now, let's snow!!!!