The Background
I used to work for a hotel booking company, where I was asked to clean a .csv file created by querying a database to combine two tables from different hotels. I have already performed some basic cleaning on this data; this task focuses on using functions for basic data manipulation.
Step 1: Load Packages
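As a minimal sketch of this step: the functions used later in this task (`read_csv()`, `glimpse()`, `arrange()`, `filter()`, `group_by()`, `summarize()`) come from the readr and dplyr packages, both part of the tidyverse. Loading exactly these two is my assumption here; the full package list for this task is not shown.

```r
# Install once if needed (left commented out so it does not run every time):
# install.packages("tidyverse")

library(readr)  # read_csv()
library(dplyr)  # glimpse(), arrange(), filter(), group_by(), summarize(), %>%
```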
Step 2: Import Data
You can download this dataset. In the code below, I use the `read_csv()` function to import data from a .csv in the project folder called "hotel_bookings.csv" and save it as a data frame called `hotel_bookings`:
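The import itself is a single `read_csv()` call. So that this sketch runs without the download, it first writes a tiny invented stand-in file; the `hotel` column name and all values are assumptions for illustration, and only `lead_time` is a column name taken from the real data.

```r
library(readr)

# With the real file in the project folder, the import is one line:
# hotel_bookings <- read_csv("hotel_bookings.csv")

# Stand-in for illustration only: a tiny invented .csv in a temporary location
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("hotel,lead_time",
    "City Hotel,10",
    "Resort Hotel,342",
    "City Hotel,85"),
  csv_path
)

# read_csv() returns a data frame (a tibble), saved here as hotel_bookings
hotel_bookings <- read_csv(csv_path)
```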
Step 3: Getting to know the data
I am going to use summary functions to get to know the data. This time, I am going to complete the code below in order to use these different functions. I use the `head()` function to preview the columns and the first several rows of data.
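A minimal sketch of the preview step, using a small invented data frame in place of the real file (the `hotel` column name and the values are assumptions; `lead_time` is from the dataset):

```r
# Invented stand-in for the imported data
hotel_bookings <- data.frame(
  hotel = c("City Hotel", "Resort Hotel", "City Hotel", "Resort Hotel"),
  lead_time = c(10, 342, 85, 7)
)

# head() shows the first six rows by default
head(hotel_bookings)

# A second argument limits the preview to that many rows
head(hotel_bookings, 2)
```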
Now I know this dataset contains information on hotel bookings. Each booking is a row in the dataset, and each column contains information such as what type of hotel was booked, when the booking took place, and how far in advance the booking took place (the 'lead_time' column).
In addition to `head()` I can also use the `str()` and `glimpse()` functions to get summaries of each column in the data arranged horizontally. I can try these two functions by completing and running the code below:
I can see the different column names and some sample values to the right of the colon.
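Both summaries can be sketched like this, again on a small invented data frame (the `hotel` column name and the values are assumptions). `str()` is part of base R, while `glimpse()` comes from dplyr:

```r
library(dplyr)

# Invented stand-in for the imported data
hotel_bookings <- data.frame(
  hotel = c("City Hotel", "Resort Hotel", "City Hotel", "Resort Hotel"),
  lead_time = c(10, 342, 85, 7)
)

# One line per column: name, type, then the first few values
str(hotel_bookings)

# Same idea from dplyr, fitted to the width of the console
glimpse(hotel_bookings)
```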
I can also use `colnames()` to get the names of the columns in the dataset. Run the code below to get the column names:
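A minimal sketch with an invented two-column stand-in (the real file has more columns, and `hotel` is an assumed name):

```r
# Invented stand-in for the imported data
hotel_bookings <- data.frame(
  hotel = c("City Hotel", "Resort Hotel", "City Hotel", "Resort Hotel"),
  lead_time = c(10, 342, 85, 7)
)

# Returns the column names as a character vector
colnames(hotel_bookings)
```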
Manipulating the data
Now I want to arrange the data from most lead time to least lead time, because I want to focus on bookings that were made far in advance. I decide to try the `arrange()` function. I put the column name after the comma, and because `arrange()` sorts in ascending order by default, I need to tell it explicitly to sort in descending order by wrapping the column name in `desc()`.
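Sketched on a small invented data frame (the `hotel` column name and the values are assumptions), the descending sort looks like this:

```r
library(dplyr)

# Invented stand-in for the imported data
hotel_bookings <- data.frame(
  hotel = c("City Hotel", "Resort Hotel", "City Hotel", "Resort Hotel"),
  lead_time = c(10, 342, 85, 7)
)

# Data frame first, then the column to sort by; desc() reverses the default order
arrange(hotel_bookings, desc(lead_time))
```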
Now it is in the order I needed. I can click on the different pages of results to see additional rows of data, too.
Notice that when I just run `arrange()` without saving my data to a new data frame, it does not alter the existing data frame. Check it out by running `head()` again to find out if the highest lead times are first:
This will be true of all the functions I will be using in this task. If I wanted to create a new data frame that had those changes saved, I would use the assignment operator, <- , as written in the code below to store the arranged data in a data frame named 'hotel_bookings_v2':
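Here is a minimal sketch of the difference between running `arrange()` on its own and saving its result, using a small invented data frame (the `hotel` column name and the values are assumptions):

```r
library(dplyr)

# Invented stand-in for the imported data
hotel_bookings <- data.frame(
  hotel = c("City Hotel", "Resort Hotel", "City Hotel", "Resort Hotel"),
  lead_time = c(10, 342, 85, 7)
)

# This prints a sorted copy but does not change hotel_bookings
arrange(hotel_bookings, desc(lead_time))
head(hotel_bookings)  # still in the original row order

# Saving the result with <- keeps the sorted version under a new name
hotel_bookings_v2 <- arrange(hotel_bookings, desc(lead_time))
head(hotel_bookings_v2)
```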
Run `head()` to check it out:
I can also find the maximum and minimum lead times without sorting the whole dataset with `arrange()`. Try it out using the `max()` and `min()` functions below:
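A minimal sketch on invented values (the `hotel` column name and the numbers are assumptions); the `$` operator pulls the `lead_time` column out as a vector:

```r
# Invented stand-in for the imported data
hotel_bookings <- data.frame(
  hotel = c("City Hotel", "Resort Hotel", "City Hotel", "Resort Hotel"),
  lead_time = c(10, 342, 85, 7)
)

# Largest and smallest lead time, no sorting required
max(hotel_bookings$lead_time)
min(hotel_bookings$lead_time)
```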
Now I want to know the average lead time for bookings, because the stakeholder asks me how early I should run promotions for hotel rooms. I can use the `mean()` function to answer that question, since the average of a set of numbers is its mean:
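A minimal sketch on invented values (the `hotel` column name and the numbers are assumptions, so the result says nothing about the real dataset):

```r
# Invented stand-in for the imported data
hotel_bookings <- data.frame(
  hotel = c("City Hotel", "Resort Hotel", "City Hotel", "Resort Hotel"),
  lead_time = c(10, 342, 85, 7)
)

# Average of the lead_time column
mean(hotel_bookings$lead_time)
```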
I should get the same answer even if I use the v2 data frame that I created with the `arrange()` function. This is because `arrange()` doesn't change the values in the dataset; it just re-arranges the rows.
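This can be checked with a short sketch on invented values (the `hotel` column name and the numbers are assumptions):

```r
library(dplyr)

# Invented stand-in for the imported data
hotel_bookings <- data.frame(
  hotel = c("City Hotel", "Resort Hotel", "City Hotel", "Resort Hotel"),
  lead_time = c(10, 342, 85, 7)
)
hotel_bookings_v2 <- arrange(hotel_bookings, desc(lead_time))

# Same values in a different row order, so the mean is unchanged
mean(hotel_bookings$lead_time)
mean(hotel_bookings_v2$lead_time)
```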
I was able to report to the stakeholder what the average lead time before booking is, but now they want to know the average lead time before booking for just city hotels. They want to focus the promotion they're running by targeting major cities.
I know that my first step will be creating a new dataset that only contains data about city hotels. I can do that using the `filter()` function, and name my new data frame 'hotel_bookings_city':
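A minimal sketch of the filtering step. It assumes the hotel type is stored in a column called `hotel` and that city hotels are labeled "City Hotel"; both are assumptions to check against `colnames()` and the actual values before running this on the real data.

```r
library(dplyr)

# Invented stand-in for the imported data
hotel_bookings <- data.frame(
  hotel = c("City Hotel", "Resort Hotel", "City Hotel", "Resort Hotel"),
  lead_time = c(10, 342, 85, 7)
)

# Keep only the rows where the condition is TRUE
hotel_bookings_city <- filter(hotel_bookings, hotel == "City Hotel")

head(hotel_bookings_city)
```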
Check out the new dataset:
I quickly check what the average lead time for this set of hotels is, just like I did for all hotels before:
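The same `mean()` call works on the filtered data frame. This sketch rebuilds it from invented values (the `hotel` column name, the "City Hotel" label, and the numbers are assumptions):

```r
library(dplyr)

# Invented stand-in for the imported data
hotel_bookings <- data.frame(
  hotel = c("City Hotel", "Resort Hotel", "City Hotel", "Resort Hotel"),
  lead_time = c(10, 342, 85, 7)
)
hotel_bookings_city <- filter(hotel_bookings, hotel == "City Hotel")

# Average lead time for city hotels only
mean(hotel_bookings_city$lead_time)
```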
Now the stakeholder wants to know a lot more about city hotels, including the maximum and minimum lead time. They are also interested in how city hotels differ from resort hotels. I don't want to run each line of code over and over again, so I decide to use the `group_by()` and `summarize()` functions. I can also use the pipe operator to make the code easier to follow. I store the new dataset in a data frame named 'hotel_summary':
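A minimal sketch of the grouped summary on invented values. The grouping column `hotel` and the summary column names (`average_lead_time`, `min_lead_time`, `max_lead_time`) are names I chose for illustration, not necessarily the ones used with the real data.

```r
library(dplyr)

# Invented stand-in for the imported data
hotel_bookings <- data.frame(
  hotel = c("City Hotel", "Resort Hotel", "City Hotel", "Resort Hotel"),
  lead_time = c(10, 342, 85, 7)
)

# The pipe (%>%) passes the result of each step into the next one
hotel_summary <- hotel_bookings %>%
  group_by(hotel) %>%                    # one group per hotel type
  summarize(
    average_lead_time = mean(lead_time), # one summary row per group
    min_lead_time = min(lead_time),
    max_lead_time = max(lead_time)
  )

head(hotel_summary)
```

Because the summary is computed per group, the city and resort figures sit side by side in one small table instead of needing a separate `filter()` and `mean()` for each hotel type.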
Check out the new dataset using `head()` again:
© 2017 Isril Caniago. All rights reserved