Manipulating and Changing Data in R

The Background

I used to work for a hotel booking company. I have been asked to clean a .csv file that was created after querying a database to combine two different tables from different hotels. I have already performed some basic cleaning functions on this data; this task will focus on using functions to conduct basic data manipulation.

Step 1: Load Packages
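
A minimal sketch of this step, assuming the `readr` and `dplyr` packages are already installed. The exact set of packages is my assumption; these two are enough to provide `read_csv()`, `glimpse()`, `arrange()`, `filter()`, `group_by()`, and `summarize()`, which are the functions used in this task.

```r
# Assumption: both packages are installed, e.g. via install.packages("tidyverse").
library(readr)  # provides read_csv()
library(dplyr)  # provides glimpse(), arrange(), filter(), group_by(), summarize()
```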

Step 2: Import Data

You can download this dataset. In the code below, I use the `read_csv()` function to import data from a .csv file called "hotel_bookings.csv" in the project folder and save it as a data frame called `hotel_bookings`:
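
The import can be sketched as follows. The file name and the `hotel_bookings` name come from the description above; the `csv_path` variable and the existence check are additions of mine, so that a wrong working directory produces a readable message rather than an error.

```r
library(readr)

# Assumption: the file sits in the current working directory (the project folder).
csv_path <- "hotel_bookings.csv"

if (file.exists(csv_path)) {
  # read_csv() returns a tibble, which is a kind of data frame
  hotel_bookings <- read_csv(csv_path)
} else {
  message("Could not find ", csv_path, "; check the working directory with getwd().")
}
```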

Step 3: Getting to know the data

I am going to use summary functions to get to know the data. I start with the `head()` function to preview the columns and the first several rows of data.
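
As a rough illustration, here is `head()` on a small stand-in data frame. The values are invented, and the `hotel` column name and its labels are assumptions on my part; with the real data, the imported `hotel_bookings` data frame takes its place.

```r
library(dplyr)

# Stand-in for the imported data: invented values, assumed `hotel` column.
hotel_bookings <- tibble(
  hotel     = c("City Hotel", "Resort Hotel", "City Hotel", "Resort Hotel"),
  lead_time = c(30, 200, 90, 10)
)

# Preview the columns and the first rows (up to six by default)
preview <- head(hotel_bookings)
preview
```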

Now I know this dataset contains information on hotel bookings. Each booking is a row in the dataset, and each column contains information such as what type of hotel was booked, when the booking took place, and how far in advance the booking took place (the 'lead_time' column).

In addition to `head()`, I can use the `str()` and `glimpse()` functions to get a summary of each column in the data, arranged horizontally.
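
A short sketch of both calls, again on an invented stand-in data frame (the `hotel` column name is an assumption). `str()` is part of base R, while `glimpse()` comes from `dplyr`.

```r
library(dplyr)

# Stand-in for the imported data: invented values, assumed `hotel` column.
hotel_bookings <- tibble(
  hotel     = c("City Hotel", "Resort Hotel", "City Hotel", "Resort Hotel"),
  lead_time = c(30, 200, 90, 10)
)

str(hotel_bookings)      # base R: type and first values of each column
glimpse(hotel_bookings)  # dplyr: one column per line, values running to the right
```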

I can see the different column names and some sample values to the right of the colon.

I can also use `colnames()` to get the names of the columns in the dataset. Run the code below to get the column names:
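
For example, on an invented stand-in data frame (the `hotel` column name is an assumption, and the real dataset has many more columns than these two):

```r
library(dplyr)

# Stand-in for the imported data: invented values, assumed `hotel` column.
hotel_bookings <- tibble(
  hotel     = c("City Hotel", "Resort Hotel", "City Hotel", "Resort Hotel"),
  lead_time = c(30, 200, 90, 10)
)

# colnames() returns the column names as a character vector
column_names <- colnames(hotel_bookings)
column_names
```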

Manipulating the data

Now I want to arrange the data from most lead time to least lead time because I want to focus on bookings that were made far in advance. I decide to use the `arrange()` function, passing the data frame first and the column name after the comma. By default `arrange()` sorts in ascending order, so I need to tell it explicitly to sort in descending order by wrapping the column name in `desc()`.
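
A minimal sketch of the descending sort, run on an invented stand-in data frame (the `hotel` column name is an assumption; `lead_time` is the column described earlier):

```r
library(dplyr)

# Stand-in for the imported data: invented values, assumed `hotel` column.
hotel_bookings <- tibble(
  hotel     = c("City Hotel", "Resort Hotel", "City Hotel", "Resort Hotel"),
  lead_time = c(30, 200, 90, 10)
)

# Data frame first, then the sort column; desc() reverses the default ascending order
arrange(hotel_bookings, desc(lead_time))
```

Leaving out `desc()` would sort the same column from the shortest lead time to the longest.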

Now it is in the order I needed. I can click on the different pages of results to see additional rows of data, too.

Notice that when I just run `arrange()` without saving my data to a new data frame, it does not alter the existing data frame. Check it out by running `head()` again to find out if the highest lead times are first:
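
This can be checked on a small invented stand-in data frame (the `hotel` column name is an assumption): the sorted result is printed, but the original keeps its row order.

```r
library(dplyr)

# Stand-in for the imported data: invented values, assumed `hotel` column.
hotel_bookings <- tibble(
  hotel     = c("City Hotel", "Resort Hotel", "City Hotel", "Resort Hotel"),
  lead_time = c(30, 200, 90, 10)
)

# The sorted result is printed but not saved anywhere
arrange(hotel_bookings, desc(lead_time))

# The original data frame is still in its original order
head(hotel_bookings)
```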

This will be true of all the functions I will be using in this task. If I wanted to create a new data frame that had those changes saved, I would use the assignment operator, `<-`, to store the arranged data in a data frame named 'hotel_bookings_v2'.

Run `head()` to check it out:
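
A sketch of the assignment, using an invented stand-in for the imported data (the `hotel` column name is an assumption):

```r
library(dplyr)

# Stand-in for the imported data: invented values, assumed `hotel` column.
hotel_bookings <- tibble(
  hotel     = c("City Hotel", "Resort Hotel", "City Hotel", "Resort Hotel"),
  lead_time = c(30, 200, 90, 10)
)

# Save the sorted rows under a new name; hotel_bookings itself is untouched
hotel_bookings_v2 <- arrange(hotel_bookings, desc(lead_time))
```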

I can also find the maximum and minimum lead times without sorting the whole dataset with `arrange()`, by using the `max()` and `min()` functions on the 'lead_time' column.
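
For instance, on an invented stand-in data frame (the `hotel` column name is an assumption, and `longest_lead` and `shortest_lead` are names I made up for this example). The `$` operator pulls a single column out of the data frame as a vector.

```r
library(dplyr)

# Stand-in for the imported data: invented values, assumed `hotel` column.
hotel_bookings <- tibble(
  hotel     = c("City Hotel", "Resort Hotel", "City Hotel", "Resort Hotel"),
  lead_time = c(30, 200, 90, 10)
)

longest_lead  <- max(hotel_bookings$lead_time)  # largest value in the column
shortest_lead <- min(hotel_bookings$lead_time)  # smallest value in the column
```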

Now, I want to know the average lead time for a booking because the stakeholder asks me how early they should run promotions for hotel rooms. I can use the `mean()` function to answer that question, since the average of a set of numbers is also the mean of that set.
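
A sketch of the calculation on an invented stand-in data frame (the `hotel` column name and the `average_lead` name are assumptions):

```r
library(dplyr)

# Stand-in for the imported data: invented values, assumed `hotel` column.
hotel_bookings <- tibble(
  hotel     = c("City Hotel", "Resort Hotel", "City Hotel", "Resort Hotel"),
  lead_time = c(30, 200, 90, 10)
)

# Average lead time across every booking in the data frame
average_lead <- mean(hotel_bookings$lead_time)
```

If the column contained missing values, `mean()` would return `NA` unless `na.rm = TRUE` is passed.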

I should get the same answer even if I use the v2 dataset that included the `arrange()` function. This is because the `arrange()` function doesn't change the values in the dataset; it just re-arranges them.

I was able to report to the stakeholder what the average lead time before booking is, but now they want to know what the average lead time before booking is for just city hotels. They want to focus the promotion they're running by targeting major cities.

I know that my first step will be creating a new dataset that only contains data about city hotels. I can do that using the `filter()` function, and name my new data frame 'hotel_bookings_city':
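
A sketch of the filter step on an invented stand-in data frame. Both the `hotel` column name and the `"City Hotel"` label are assumptions on my part; the condition has to match however the hotel type is actually recorded in the data.

```r
library(dplyr)

# Stand-in for the imported data: invented values, assumed `hotel` column.
hotel_bookings <- tibble(
  hotel     = c("City Hotel", "Resort Hotel", "City Hotel", "Resort Hotel"),
  lead_time = c(30, 200, 90, 10)
)

# Keep only the rows where the hotel type matches the (assumed) city label
hotel_bookings_city <- filter(hotel_bookings, hotel == "City Hotel")
```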

Check out the new dataset:

I quickly check what the average lead time for this set of hotels is, just like I did for all of the hotels before:
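
On an invented stand-in data frame, the check might look like this (the `hotel` column name, the `"City Hotel"` label, and the `average_city_lead` name are assumptions):

```r
library(dplyr)

# Stand-in for the imported data: invented values, assumed `hotel` column.
hotel_bookings <- tibble(
  hotel     = c("City Hotel", "Resort Hotel", "City Hotel", "Resort Hotel"),
  lead_time = c(30, 200, 90, 10)
)

hotel_bookings_city <- filter(hotel_bookings, hotel == "City Hotel")

# Same mean() call as before, now on the city-only rows
average_city_lead <- mean(hotel_bookings_city$lead_time)
```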

Now, the stakeholder wants to know a lot more information about city hotels, including the maximum and minimum lead time. They are also interested in how city hotels differ from resort hotels. I don't want to run each line of code over and over again, so I decide to use the `group_by()` and `summarize()` functions. I can also use the pipe operator to make the code easier to follow. I store the new dataset in a data frame named 'hotel_summary'.

Check out the new dataset using `head()` again:
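
A sketch of the grouped summary on an invented stand-in data frame. The `hotel` column name and the three summary column names are assumptions chosen for this example.

```r
library(dplyr)

# Stand-in for the imported data: invented values, assumed `hotel` column.
hotel_bookings <- tibble(
  hotel     = c("City Hotel", "Resort Hotel", "City Hotel", "Resort Hotel"),
  lead_time = c(30, 200, 90, 10)
)

# One row per hotel type, with the three statistics side by side
hotel_summary <- hotel_bookings %>%
  group_by(hotel) %>%
  summarize(
    average_lead_time = mean(lead_time),
    min_lead_time     = min(lead_time),
    max_lead_time     = max(lead_time)
  )
```

Grouping first means each statistic is computed once per hotel type, so there is no need to filter and rerun the same lines for city and resort hotels separately.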

ISRIL CANIAGO