Data cleaning in R

The Background

I used to work for a hotel booking company, where I was asked to clean a .csv file created by querying a database to combine two tables from different hotels. To do that, I need functions that preview the data's structure, including its columns and rows, along with basic cleaning functions that prepare the data for analysis.

Step 1: Load Packages
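
The packages used in the later steps can be loaded as sketched here. This assumes `tidyverse` and `skimr` are already installed; the commented-out `install.packages()` line is only needed once per machine.

```r
# One-time setup, left commented out (assumption: packages not yet installed):
# install.packages(c("tidyverse", "skimr"))

library(tidyverse)  # attaches readr, dplyr, tidyr, and other core packages
library(skimr)      # provides skim_without_charts()
```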

Step 2: Import Data

The data I have been asked to clean is currently an external .csv file. To view and clean it in `R`, I first need to import it. The `readr` package, part of the `tidyverse`, has a number of functions for "reading in" or importing data, including .csv files.

You can download this dataset. I use the `read_csv()` function to import the file "hotel_bookings.csv" from the project folder and save it as a data frame called `bookings_df`.
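
A minimal sketch of the import step follows. So that it runs on its own, it first writes a tiny stand-in file with made-up rows to a temporary folder (the `csv_path` variable and the values are assumptions for illustration); with the real file in the project folder, only the `read_csv()` call is needed.

```r
library(readr)

# Stand-in file with made-up rows so the sketch runs anywhere (assumption);
# the real "hotel_bookings.csv" would sit in the project folder instead.
csv_path <- file.path(tempdir(), "hotel_bookings.csv")
write_csv(
  data.frame(
    hotel       = c("Resort Hotel", "City Hotel"),
    is_canceled = c(0, 1),
    lead_time   = c(342, 7)
  ),
  csv_path
)

# With the real file in place this would be read_csv("hotel_bookings.csv")
bookings_df <- read_csv(csv_path)
```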

Step 3: Getting to know the data

Before I start cleaning the data, I want to take some time to explore it. Several functions that I am already familiar with can preview the data, starting with `head()`, which shows the first few rows.
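
The preview with `head()` might look like this. The small `bookings_df` built here is a toy stand-in with made-up values, used only so the snippet runs without the real file.

```r
library(dplyr)

# Toy stand-in for bookings_df with made-up values (assumption)
bookings_df <- tibble(
  hotel       = c("Resort Hotel", "City Hotel", "City Hotel"),
  is_canceled = c(0, 1, 0),
  lead_time   = c(342, 7, 45)
)

# Show the first rows (six by default; n sets a different count)
head(bookings_df)
head(bookings_df, n = 2)
```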

I can also summarize the data with the `str()` and `glimpse()` functions, which list each column's name, its type, and its first few values.
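
Both calls can be tried on a toy data frame as sketched here; the values are made up and stand in for the real bookings data.

```r
library(dplyr)

# Toy stand-in for bookings_df with made-up values (assumption)
bookings_df <- tibble(
  hotel       = c("Resort Hotel", "City Hotel", "City Hotel"),
  is_canceled = c(0, 1, 0),
  lead_time   = c(342, 7, 45)
)

str(bookings_df)      # base R: compact structure of the object
glimpse(bookings_df)  # dplyr: one line per column, values running across
```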

I can also use `colnames()` to check the names of the columns in the data set.
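
A short sketch of the column-name check, again on a toy data frame with made-up values:

```r
# Toy stand-in for bookings_df with made-up values (assumption)
bookings_df <- data.frame(
  hotel       = c("Resort Hotel", "City Hotel"),
  is_canceled = c(0, 1),
  lead_time   = c(342, 7)
)

# Returns the column names as a character vector
colnames(bookings_df)
```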

Some packages contain more advanced functions for summarizing and exploring the data. One example is the `skimr` package. Its `skim_without_charts()` function provides a detailed summary of the data, column by column.
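
As a sketch, the call looks like this on a toy data frame (made-up values; `skimr` is assumed to be installed):

```r
library(skimr)

# Toy stand-in for bookings_df with made-up values (assumption)
bookings_df <- data.frame(
  hotel       = c("Resort Hotel", "City Hotel", "City Hotel"),
  is_canceled = c(0, 1, 0),
  lead_time   = c(342, 7, 45)
)

# Per-column summary (missing values, basic statistics) without inline histograms
skim_without_charts(bookings_df)
```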

Step 4: Cleaning the data

Now, I am primarily interested in the following variables: 'hotel', 'is_canceled', and 'lead_time'. I create a new data frame with just those columns and call it `trimmed_df`.
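
One way to do this is with `select()` from `dplyr`, sketched here on a toy data frame. The values and the extra `adults` column are made up so that there is something to drop.

```r
library(dplyr)

# Toy stand-in for bookings_df with made-up values (assumption)
bookings_df <- tibble(
  hotel       = c("Resort Hotel", "City Hotel"),
  is_canceled = c(0, 1),
  lead_time   = c(342, 7),
  adults      = c(2, 1)
)

# Keep only the three columns of interest
trimmed_df <- bookings_df %>%
  select(hotel, is_canceled, lead_time)
```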

Some of the column names aren't very intuitive, so I want to rename them to make them easier to understand. I create the same data frame as above, but rename the variable 'hotel' to 'hotel_type' so that it is crystal clear what the data is about.

In `rename()`, the new variable name goes to the left of the '=' symbol and the existing name goes to the right.
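
A sketch of the rename, chained onto the column selection and run on a toy data frame with made-up values:

```r
library(dplyr)

# Toy stand-in for bookings_df with made-up values (assumption)
bookings_df <- tibble(
  hotel       = c("Resort Hotel", "City Hotel"),
  is_canceled = c(0, 1),
  lead_time   = c(342, 7)
)

# Same selection as before, then rename: new_name = old_name
trimmed_df <- bookings_df %>%
  select(hotel, is_canceled, lead_time) %>%
  rename(hotel_type = hotel)
```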

Another common task is to split or combine data in different columns. I can combine the arrival month and year into one column using the `unite()` function.
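
The sketch below shows `unite()` from `tidyr` on a toy data frame. The column names `arrival_date_month` and `arrival_date_year`, the result names `united_df` and `arrival_month_year`, and the values are all assumptions for illustration.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for bookings_df; column names and values are assumptions
bookings_df <- tibble(
  hotel              = c("Resort Hotel", "City Hotel"),
  arrival_date_month = c("July", "August"),
  arrival_date_year  = c(2015, 2016)
)

# Paste month and year into one column, separated by a space;
# by default unite() drops the two source columns
united_df <- bookings_df %>%
  unite(arrival_month_year, arrival_date_month, arrival_date_year, sep = " ")
```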

Step 5: Another way of doing things

I can also use the `mutate()` function to make changes to the columns. I want to create a new column that sums the adults, children, and babies on a reservation to give the total number of people.
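
A sketch of that new column on a toy data frame follows. The names `guests` and `guests_df` and the values are assumptions for illustration.

```r
library(dplyr)

# Toy stand-in for bookings_df with made-up values (assumption)
bookings_df <- tibble(
  adults   = c(2, 2, 1),
  children = c(0, 1, 2),
  babies   = c(0, 1, 0)
)

# Add a column with the total number of people per reservation
guests_df <- bookings_df %>%
  mutate(guests = adults + children + babies)
```

If any of the three columns has a missing value in a row, the sum for that row is `NA`.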

Now it's time to calculate some summary statistics! I want the total number of canceled bookings and the average lead time for booking. Using the `summarize()` function, I make a column called 'number_canceled' for the total number of canceled bookings and a column called 'average_lead_time' for the average lead time.
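
A sketch of the summary on a toy data frame is below. It assumes 'is_canceled' is coded as 1 for a canceled booking and 0 otherwise, so that summing it counts cancellations; the result name `summary_df` and the values are made up.

```r
library(dplyr)

# Toy stand-in for bookings_df with made-up values (assumption)
bookings_df <- tibble(
  is_canceled = c(0, 1, 1, 0),
  lead_time   = c(10, 20, 30, 40)
)

# Collapse the data to a single row of summary statistics
summary_df <- bookings_df %>%
  summarize(
    number_canceled   = sum(is_canceled),  # assumes 0/1 coding
    average_lead_time = mean(lead_time)
  )
```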


ISRIL CANIAGO
