The Background
I used to work for a hotel booking company. I have been asked to clean a .csv file that was created by querying a database to combine two tables from different hotels. I will use functions to preview the data's structure, including its columns and rows, and then use basic cleaning functions to prepare the data for analysis.
Step 1: Load Packages
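A minimal sketch of the package-loading step, assuming the `tidyverse` and `skimr` packages (the two named later in this post) are already installed:

```r
# Load the packages used in this walkthrough.
# install.packages(c("tidyverse", "skimr"))  # uncomment on first use
library(tidyverse)  # attaches readr, dplyr, tidyr and friends
library(skimr)      # provides skim_without_charts()
```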
Step 2: Import Data
The data I have been asked to clean is currently an external .csv file. To view and clean it in `R`, I first need to import it. The `readr` package, part of the `tidyverse`, has a number of functions for "reading in" or importing data, including .csv files. You can download this dataset. I use the `read_csv()` function to import the data from a .csv file in the project folder called "hotel_bookings.csv" and save it as a data frame called `bookings_df`.
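The import step might look like the sketch below. The file name comes from the text; the `file.exists()` guard and the `csv_path` variable are my own additions so the snippet does not stop with an error when the file is somewhere else.

```r
library(readr)

# File name taken from the text; assumed to sit in the working (project) folder.
csv_path <- "hotel_bookings.csv"

if (file.exists(csv_path)) {
  # read_csv() guesses column types and returns a tibble
  bookings_df <- read_csv(csv_path)
} else {
  message("Could not find ", csv_path, " in ", getwd())
}
```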
Step 3: Getting to know the data
Before I start cleaning the data, I take some time to explore it. Several functions that I am already familiar with can preview the data, including the `head()` function, which shows the first few rows.
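Here is a small sketch of `head()` in use. So that it runs on its own, it builds a three-row stand-in for `bookings_df` with made-up values; with the real file, `bookings_df` would come from `read_csv()` instead.

```r
library(dplyr)

# Made-up stand-in rows (not real bookings), only so the snippet is runnable.
bookings_df <- tibble(
  hotel       = c("Resort Hotel", "City Hotel", "City Hotel"),
  is_canceled = c(0, 1, 0),
  lead_time   = c(342, 7, 85)
)

head(bookings_df)     # first six rows by default (here, all three)
head(bookings_df, 2)  # or ask for a specific number of rows
```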
I can also summarize or preview the data with the `str()` and `glimpse()` functions, which show each column's type and its first few values.
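A sketch of both calls, again on a made-up stand-in for `bookings_df` so it runs without the file:

```r
library(dplyr)

# Made-up stand-in rows (not real bookings).
bookings_df <- tibble(
  hotel       = c("Resort Hotel", "City Hotel", "City Hotel"),
  is_canceled = c(0, 1, 0),
  lead_time   = c(342, 7, 85)
)

str(bookings_df)      # base R: compact structure of the object
glimpse(bookings_df)  # dplyr: one line per column with its type and values
```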
I can also use `colnames()` to check the names of the columns in the data set.
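For illustration, this is how the call looks on a made-up stand-in for `bookings_df` that has only the three columns named in this post (the real file has more):

```r
library(dplyr)

# Made-up stand-in rows (not real bookings).
bookings_df <- tibble(
  hotel       = c("Resort Hotel", "City Hotel"),
  is_canceled = c(0, 1),
  lead_time   = c(342, 7)
)

# Returns the column names as a character vector
column_names <- colnames(bookings_df)
print(column_names)
```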
Some packages contain more advanced functions for summarizing and exploring data. One example is the `skimr` package, whose `skim_without_charts()` function provides a detailed summary of the data.
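A short sketch of the `skimr` call, assuming the package is installed and using made-up stand-in rows rather than the real file. The `skim_summary` name is only for illustration.

```r
library(dplyr)
library(skimr)

# Made-up stand-in rows (not real bookings).
bookings_df <- tibble(
  hotel       = c("Resort Hotel", "City Hotel", "City Hotel"),
  is_canceled = c(0, 1, 0),
  lead_time   = c(342, 7, 85)
)

# One summary row per column: missing counts, means, quantiles, and so on
skim_summary <- skim_without_charts(bookings_df)
print(skim_summary)
```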
Step 4: Cleaning the data
Now, I am primarily interested in the following variables: 'hotel', 'is_canceled', and 'lead_time'. I create a new data frame with just those columns and call it `trimmed_df`.
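One way to do this is with dplyr's `select()`. The sketch below uses made-up stand-in rows; the extra `adults` column is there only to show that it gets dropped.

```r
library(dplyr)

# Made-up stand-in rows (not real bookings).
bookings_df <- tibble(
  hotel       = c("Resort Hotel", "City Hotel"),
  is_canceled = c(0, 1),
  lead_time   = c(342, 7),
  adults      = c(2, 1)
)

# Keep only the three columns of interest
trimmed_df <- bookings_df %>%
  select(hotel, is_canceled, lead_time)

trimmed_df
```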
Some of the column names aren't very intuitive, so I want to rename them to make them easier to understand. I create the same data frame as above, but rename the variable 'hotel' to 'hotel_type' to make it clear what the data is about. With dplyr's `rename()`, the new variable name goes to the left of the '=' symbol and the existing name goes to the right.
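A sketch of the select-then-rename step with dplyr, on made-up stand-in rows:

```r
library(dplyr)

# Made-up stand-in rows (not real bookings).
bookings_df <- tibble(
  hotel       = c("Resort Hotel", "City Hotel"),
  is_canceled = c(0, 1),
  lead_time   = c(342, 7)
)

trimmed_df <- bookings_df %>%
  select(hotel, is_canceled, lead_time) %>%
  rename(hotel_type = hotel)  # new name on the left, existing name on the right

trimmed_df
```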
Another common task is to either split or combine data in different columns. I can combine the arrival month and year into one column using the `unite()` function.
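A sketch with tidyr's `unite()`. The column names `arrival_date_month` and `arrival_date_year`, the new name `arrival_month_year`, and the sample values are assumptions for illustration; check `colnames(bookings_df)` for the names in the actual file.

```r
library(dplyr)
library(tidyr)

# Made-up stand-in rows; the column names are assumed, not confirmed by the text.
bookings_df <- tibble(
  arrival_date_month = c("July", "August"),
  arrival_date_year  = c(2015, 2016)
)

# Paste month and year into one column, separated by a space.
# By default unite() removes the two input columns.
example_df <- bookings_df %>%
  unite(arrival_month_year, arrival_date_month, arrival_date_year, sep = " ")

example_df
```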
Step 5: Another way of doing things
I can also use the `mutate()` function to make changes to the columns. Here I want a new column that sums the adults, children, and babies on a reservation to give the total number of people.
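A sketch of that `mutate()` call. The column names `adults`, `children`, and `babies` follow the wording above but are assumed, and `guests` is simply the name I picked for the new column.

```r
library(dplyr)

# Made-up stand-in rows (not real bookings); column names are assumed.
bookings_df <- tibble(
  adults   = c(2, 1, 2),
  children = c(0, 2, 1),
  babies   = c(0, 1, 0)
)

# Add a column with the total number of people on each reservation.
# If any of the three values is NA, the sum for that row is NA as well.
example_df <- bookings_df %>%
  mutate(guests = adults + children + babies)

example_df
```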
Now it's time to calculate some summary statistics! I want the total number of canceled bookings and the average lead time for booking. Using the `summarize()` function after the `%>%` symbol, I make a column called 'number_canceled' for the total number of canceled bookings and a column called 'average_lead_time' for the average lead time.
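A sketch of the summary step. It assumes 'is_canceled' is coded as 1 for a canceled booking and 0 otherwise, so that summing it counts cancellations; the rows are made up.

```r
library(dplyr)

# Made-up stand-in rows (not real bookings); assumes is_canceled is 0/1.
bookings_df <- tibble(
  is_canceled = c(0, 1, 0, 1),
  lead_time   = c(342, 7, 85, 30)
)

example_df <- bookings_df %>%
  summarize(
    number_canceled   = sum(is_canceled),  # count of canceled bookings
    average_lead_time = mean(lead_time)    # mean lead time across all bookings
  )

example_df
```

If either column contains missing values in the real data, adding `na.rm = TRUE` inside `sum()` and `mean()` keeps the result from turning into `NA`.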
© 2017 Isril Caniago. All rights reserved