Think Summer: Project 4 — 2023

Submission

Students need to submit the following file by 10:00PM EST through Gradescope inside Brightspace.

  1. A Jupyter notebook (a .ipynb file).

We’ve provided you with a template notebook for you to use. Please carefully read this section to get started.

When you are finished with the project, please make sure to run every cell in the notebook prior to submitting. To do this click Run  Run All Cells. Next, to export your notebook (your .ipynb file), click on File  Download, and download your .ipynb file.

Questions

Question 1

For the show The Handmaid’s Tale (title_id is tt5834204), there are 4 seasons listed in the IMDB database. Find the average rating of each of the four seasons. Hint: Use AVG for find the average, and use GROUP BY the season_number.

Items to submit
  • SQL used to solve this problem. (2 pts)

  • Output from running SQL. (1 pt)

Question 2

Identify the six most popular episodes of the show Grey’s Anatomy (where "popular" denotes a high rating).

Items to submit
  • SQL used to solve this problem. (2 pts)

  • Output from running SQL. (1 pt)

Question 3

Make a dotchart in R showing the results of the previous question. Hint: You can use your work from SQL, and export the results to a dataframe called myDF in R. Then you can use something like:

# use a dbGetQuery here, to import the SQL results to R, and then
myresults <- myDF$rating
names(myresults) <- myDF$primary_title
dotchart(myresults)
Items to submit
  • R used to solve this problem. (2 pts)

  • Output from running R. (1 pt)

Question 4

Make a dotchart showing the total amount of money donated in each of the top 10 states, during the 2000 federal election cycle.

Items to submit
  • R used to solve this problem. (2 pts)

  • Output from running R. (1 pt)

Question 5

Make a dotchart that shows how many movies premiered in each year. Now make another dotchart, which shows the same data (i.e., how many movies premiered each year) but only since the year 2000.

Items to submit
  • R used to solve this problem. (2 pts)

  • Output from running R. (1 pt)

Question 6

Among the three big New York City airports (JFK, LGA, EWR), which of these airports had the worst DepDelay (on average) in 2005? (Can you solve this with 1 line of R, using a tapply (rather than using 3 separate lines of R)? Hint: After you run the tapply, you can index your results using [c("JFK", "LGA", "EWR")] to lookup all 3 airports at once.)

Items to submit
  • R used to solve this problem. (2 pts)

  • Output from running R. (1 pt)

Question 7

LIKE is a very powerful tool. You can read about SQLite’s version of LIKE here. Use LIKE to analyze the primary_title of all IMDB titles: First determine how many titles have Batman anywhere in the title, and then determine how many titles have Superman anywhere in the title? Which one occurs more often?

Items to submit
  • SQL used to solve this problem. (2 pts)

  • Output from running SQL. (1 pt)

Question 8

How much money was donated during the 2000 federal election cycle by people who have PURDUE listed somewhere in their employer name? How much money was donated by people who have MICROSOFT listed somewhere in their employer name? Hint: You might use the grep or the grepl (which is a logical grep) to solve this one.

Items to submit
  • R used to solve this problem. (2 pts)

  • Output from running R. (1 pt)

Question 9

How much money was donated during the 2000 federal election cycle by people from your hometown? (Be sure to match the city and the state.)

Items to submit
  • R used to solve this problem. (2 pts)

  • Output from running R. (1 pt)

Question 10

Create your own interesting question based on the things you have learned this week. What insights can you find?

Items to submit
  • R used to solve this problem. (2 pts)

  • Output from running R. (1 pt)