Making a Connection to the Project Data
# connect to the SQLite database
= 'lahmansbaseballdb.sqlite'
sqlite_file = sqlite3.connect(sqlite_file) conn
When you hear the word “relationship” what is the first thing that comes to mind? Probably not baseball. But a relationship is simply a way to describe how two or more objects are connected. There are many relationships in baseball such as those between teams and managers, players and salaries, even stadiums and concession prices.
For this project, your client would like developed SQL queries that they can use to retrieve data for use on their website without needing Python. They would also like to see example Altair charts.
# connect to the SQLite database
= 'lahmansbaseballdb.sqlite'
sqlite_file = sqlite3.connect(sqlite_file) conn
Write an SQL query to create a new dataframe about baseball players who attended BYU-Idaho. The new table should contain five columns: playerID, schoolID, salary, and the yearID/teamID associated with each salary. Order the table by salary (highest to lowest) and print out the table in your report.
# Define the SQL query to retrieve the data
= '''
query SELECT DISTINCT
p.playerID,
cp.schoolID,
s.salary,
s.yearID,
s.teamID
FROM
people AS p
LEFT JOIN
collegeplaying AS cp ON p.playerID = cp.playerID
LEFT JOIN
salaries AS s ON p.playerID = s.playerID
WHERE
cp.schoolID = 'idbyuid'
ORDER BY
s.salary DESC
'''
# Execute the query and fetch the results into a DataFrame
= pd.read_sql_query(query, conn)
byui_players
# Print the resulting dataframe
byui_players
playerID | schoolID | salary | yearID | teamID | |
---|---|---|---|---|---|
0 | lindsma01 | idbyuid | 4000000.0 | 2014.0 | CHA |
1 | lindsma01 | idbyuid | 3600000.0 | 2012.0 | BAL |
2 | lindsma01 | idbyuid | 2800000.0 | 2011.0 | COL |
3 | lindsma01 | idbyuid | 2300000.0 | 2013.0 | CHA |
4 | lindsma01 | idbyuid | 1625000.0 | 2010.0 | HOU |
5 | stephga01 | idbyuid | 1025000.0 | 2001.0 | SLN |
6 | stephga01 | idbyuid | 900000.0 | 2002.0 | SLN |
7 | stephga01 | idbyuid | 800000.0 | 2003.0 | SLN |
8 | stephga01 | idbyuid | 550000.0 | 2000.0 | SLN |
9 | lindsma01 | idbyuid | 410000.0 | 2009.0 | FLO |
10 | lindsma01 | idbyuid | 395000.0 | 2008.0 | FLO |
11 | lindsma01 | idbyuid | 380000.0 | 2007.0 | FLO |
12 | stephga01 | idbyuid | 215000.0 | 1999.0 | SLN |
13 | stephga01 | idbyuid | 185000.0 | 1998.0 | PHI |
14 | stephga01 | idbyuid | 150000.0 | 1997.0 | PHI |
15 | catetr01 | idbyuid | NaN | NaN | None |
By examining this data, one can learn about the success and earning potential of baseball players who attended BYU-Idaho and which teams they played for during their careers. Additionally, one could analyze salary trends over time or by the team to gain further insights into the baseball industry.
This three-part question requires you to calculate batting average (number of hits divided by the number of at-bats)
Part A. Write an SQL query that provides playerID, yearID, and batting average for players with at least 1 at bat that year. Sort the table from highest batting average to lowest, and then by playerid alphabetically. Show the top 5 results in your report.
# Define the SQL query to retrieve the data
= '''
query SELECT
playerID,
yearID,
CAST(H AS FLOAT) / CAST(AB AS FLOAT) AS Batt_avg
FROM
batting
WHERE
AB >= 1
ORDER BY
Batt_avg DESC,
playerID ASC
LIMIT 5
'''
# Execute the query and fetch the results into a DataFrame
= pd.read_sql_query(query, conn)
batt_avg_1
# Print the resulting dataframe
batt_avg_1
playerID | yearID | Batt_avg | |
---|---|---|---|
0 | aberal01 | 1957 | 1.0 |
1 | abernte02 | 1960 | 1.0 |
2 | abramge01 | 1923 | 1.0 |
3 | acklefr01 | 1964 | 1.0 |
4 | alanirj01 | 2019 | 1.0 |
The first table shows the names of the top five baseball players with the highest batting averages and the year they achieved those averages. The batting average is calculated for players with at least one at-bat that year. The ‘Batt_avg’ column represents the ratio of hits (H) to at-bats (AB), giving us insight into how well each player performed at bat.
Part B. Use the same query as above, but only include players with at least 10 at bats that year. Print the top 5 results.
# Define the SQL query to retrieve the data
= '''
query SELECT
playerID,
yearID,
CAST(H AS FLOAT) / CAST(AB AS FLOAT) AS Batt_avg
FROM
batting
WHERE
AB >= 10
ORDER BY
Batt_avg DESC,
playerID ASC
LIMIT 5
'''
# execute the query and load the results into a DataFrame
= pd.read_sql_query(query, conn)
batt_avg_2
# print the top 5 results
batt_avg_2
playerID | yearID | Batt_avg | |
---|---|---|---|
0 | nymanny01 | 1974 | 0.642857 |
1 | carsoma01 | 2013 | 0.636364 |
2 | altizda01 | 1910 | 0.600000 |
3 | johnsde01 | 1975 | 0.600000 |
4 | silvech01 | 1948 | 0.571429 |
The second table shows the names of the top five baseball players with the highest batting averages and the year they achieved those averages for players with at least one at-bat that year. By only showing results for players with at least ten at-bats, the code eliminates small sample sizes and gives a more accurate estimate of a player’s ability to hit.
Part C. Now calculate the batting average for players over their entire careers (all years combined). Only include players with at least 100 at bats, and print the top 5 results.
# Define the SQL query to retrieve the data
= '''
query SELECT
playerID,
SUM(H) AS total_hits,
SUM(AB) AS total_at_bats,
ROUND(CAST(SUM(H) AS FLOAT) / CAST(SUM(AB) AS FLOAT), 3) AS career_batt_avg
FROM
batting
GROUP BY
playerID
HAVING
SUM(AB) >= 100
ORDER BY
career_batt_avg DESC,
playerID ASC
LIMIT 5
'''
# Execute the query and load the results into a DataFrame
= pd.read_sql_query(query, conn)
batt_avg_3
# Print the top 5 results
print(batt_avg_3.to_markdown())
# Find the player with the highest batting average
= batt_avg_3.iloc[0]['playerID']
highest_batt_avg_player = batt_avg_3.iloc[0]['career_batt_avg']
highest_batt_avg
# Print the player with the highest batting average
print(
f"\nThe player with the highest career batting average is {highest_batt_avg_player}\nwith a career batting average of {highest_batt_avg}.")
| | playerID | total_hits | total_at_bats | career_batt_avg |
|---:|:-----------|-------------:|----------------:|------------------:|
| 0 | cobbty01 | 4189 | 11436 | 0.366 |
| 1 | barnero01 | 860 | 2391 | 0.36 |
| 2 | hornsro01 | 2930 | 8173 | 0.358 |
| 3 | jacksjo01 | 1772 | 4981 | 0.356 |
| 4 | meyerle01 | 513 | 1443 | 0.356 |
The player with the highest career batting average is cobbty01
with a career batting average of 0.366.
The third table shows the names of the top five baseball players with the highest career batting averages and players with at least 100 at-bats. The player with the highest batting average over their career among these five players is Ty Cobb (playerID: “cobbty01”), with a batting average of 0.366.
Pick any two baseball teams and compare them using a metric of your choice (average salary, home runs, number of wins, etc). Write an SQL query to get the data you need, then make a graph in Altair to visualize the comparison. What do you learn?
# Define the SQL query to retrieve the data
= '''
query SELECT
t.yearID AS Year,
SUM(CASE WHEN t.teamID = 'NYA' THEN t.L ELSE 0 END) AS Yankees_Losses,
SUM(CASE WHEN t.teamID = 'BOS' THEN t.L ELSE 0 END) AS RedSox_Losses,
SUM(CASE WHEN t.teamID = 'NYA' THEN t.W ELSE 0 END) AS Yankees_wins,
SUM(CASE WHEN t.teamID = 'BOS' THEN t.W ELSE 0 END) AS RedSox_wins
FROM
teams AS t
WHERE
t.yearID >= 2000 AND t.yearID <= 2021
GROUP BY
t.yearID
ORDER BY
t.yearID ASC;
'''
# Load data into a dataframe
= pd.read_sql_query(query, conn)
df
# Melt the data to make it easier to plot
= pd.melt(df, id_vars=["Year"],
melted_df ="Team_Info", value_name="Value")
var_name
# Split the Team_Info column into separate columns for team and metric
'Team', 'Metric']] = melted_df['Team_Info'].str.split(
melted_df[['_', expand=True)
# Create the chart
= alt.Chart(melted_df).mark_line().encode(
chart ="Year:O",
x=alt.Y("Value:Q", axis=alt.Axis(title="Value")),
y="Team:N",
color="Metric:N",
strokeDash=['Year', 'Team', 'Value']
tooltip={'text': "Yankees vs Red Sox", 'subtitle': "Wins and Losses (2000 - 2021)"}
).properties(title
).resolve_scale(='independent'
color=True)
).configure_axis(grid
# Show the chart
chart
The code compares the New York Yankees and the Boston Red Sox using a metric of wins and losses from the year 2000 to 2021. The data shows that the Yankees have more wins than losses for most years, while the Red Sox have a more inconsistent record with some years having more losses than wins. However, the Red Sox did win the World Series in 2004, 2007, 2013, and 2018.
The Altair chart displays the wins and losses for both teams over time, with the x-axis representing the year and the y-axis representing the value (i.e., wins or losses). The chart also uses different line styles to represent wins and losses and different colors to represent each team. The tooltip shows the year, team, and value when the user hovers over a point on the chart.
Overall, the data and chart show that the Yankees have had more success than the Red Sox over the past two decades in terms of wins and losses, but the Red Sox have had some notable successes with four World Series wins.
# Define the SQL query to retrieve the data
= '''
query SELECT
SUM(CASE WHEN t.teamID = 'NYA' THEN t.W ELSE 0 END) AS Yankees_wins,
SUM(CASE WHEN t.teamID = 'BOS' THEN t.W ELSE 0 END) AS RedSox_wins
FROM
teams AS t
WHERE
t.yearID >= 2000 AND t.yearID <= 2021;
'''
# Load data into a dataframe
= pd.read_sql_query(query, conn)
df
# Calculate the total wins for each team
= df['Yankees_wins'][0]
yankees_total_wins = df['RedSox_wins'][0]
redsox_total_wins
# Print the overall answer
if yankees_total_wins > redsox_total_wins:
print("The New York Yankees have won more games than \nthe Boston Red Sox over the past two decades.")
elif redsox_total_wins > yankees_total_wins:
print("The Boston Red Sox have won more games than \nthe New York Yankees over the past two decades.")
else:
print("The New York Yankees and the Boston Red Sox \nhave won the same number of games over the past two decades.")
The New York Yankees have won more games than
the Boston Red Sox over the past two decades.
I calculated the number of wins over the past two decades to ensure that the New York Yankees are the true winners. We can see that the New York Yankees have had more wins during the last twenty years compared to the Boston Red Sox.
# connect to the SQLite database
= 'lahmansbaseballdb.sqlite'
sqlite_file = sqlite3.connect(sqlite_file)
conn
# Question #1
# Define the SQL query to retrieve the data
= '''
query SELECT DISTINCT
p.playerID,
cp.schoolID,
s.salary,
s.yearID,
s.teamID
FROM
people AS p
JOIN
collegeplaying AS cp ON p.playerID = cp.playerID
JOIN
salaries AS s ON p.playerID = s.playerID
WHERE
cp.schoolID = 'idbyuid'
ORDER BY
s.salary DESC
'''
# Execute the query and fetch the results into a DataFrame
= pd.read_sql_query(query, conn)
byui_players
# Question #2
# Part A.
# Define the SQL query to retrieve the data
= '''
query SELECT
playerID,
yearID,
CAST(H AS FLOAT) / CAST(AB AS FLOAT) AS Batt_avg
FROM
batting
WHERE
AB >= 1
ORDER BY
Batt_avg DESC,
playerID ASC
LIMIT 5
'''
# Execute the query and fetch the results into a DataFrame
= pd.read_sql_query(query, conn)
batt_avg_1
# Part B.
# Define the SQL query to retrieve the data
= '''
query SELECT
playerID,
yearID,
CAST(H AS FLOAT) / CAST(AB AS FLOAT) AS Batt_avg
FROM
batting
WHERE
AB >= 10
ORDER BY
Batt_avg DESC,
playerID ASC
LIMIT 5
'''
# execute the query and load the results into a DataFrame
= pd.read_sql_query(query, conn)
batt_avg_2
# Part C.
# Define the SQL query to retrieve the data
= '''
query SELECT
playerID,
SUM(H) AS total_hits,
SUM(AB) AS total_at_bats,
ROUND(CAST(SUM(H) AS FLOAT) / CAST(SUM(AB) AS FLOAT), 3) AS career_batt_avg
FROM
batting
GROUP BY
playerID
HAVING
SUM(AB) >= 100
ORDER BY
career_batt_avg DESC,
playerID ASC
LIMIT 5
'''
# Execute the query and load the results into a DataFrame
= pd.read_sql_query(query, conn)
batt_avg_3
# Find the player with the highest batting average
= batt_avg_3.iloc[0]['playerID']
highest_batt_avg_player = batt_avg_3.iloc[0]['career_batt_avg']
highest_batt_avg
# Question #3
# Part #1
# Define the SQL query to retrieve the data
= '''
query SELECT
t.yearID AS Year,
SUM(CASE WHEN t.teamID = 'NYA' THEN t.L ELSE 0 END) AS Yankees_Losses,
SUM(CASE WHEN t.teamID = 'BOS' THEN t.L ELSE 0 END) AS RedSox_Losses,
SUM(CASE WHEN t.teamID = 'NYA' THEN t.W ELSE 0 END) AS Yankees_wins,
SUM(CASE WHEN t.teamID = 'BOS' THEN t.W ELSE 0 END) AS RedSox_wins
FROM
teams AS t
WHERE
t.yearID >= 2000 AND t.yearID <= 2021
GROUP BY
t.yearID
ORDER BY
t.yearID ASC;
'''
# Load data into a dataframe
= pd.read_sql_query(query, conn)
df
# Melt the data to make it easier to plot
= pd.melt(df, id_vars=["Year"],
melted_df ="Team_Info", value_name="Value")
var_name
# Split the Team_Info column into separate columns for team and metric
'Team', 'Metric']] = melted_df['Team_Info'].str.split(
melted_df[['_', expand=True)
# Create the chart
= alt.Chart(melted_df).mark_line().encode(
chart ="Year:O",
x=alt.Y("Value:Q", axis=alt.Axis(title="Value")),
y="Team:N",
color="Metric:N",
strokeDash=['Year', 'Team', 'Value']
tooltip={'text': "Yankees vs Red Sox", 'subtitle': "Wins and Losses (2000 - 2021)"}
).properties(title
).resolve_scale(='independent'
color=True)
).configure_axis(grid
# Part #2
# Define the SQL query to retrieve the data
= '''
query SELECT
SUM(CASE WHEN t.teamID = 'NYA' THEN t.W ELSE 0 END) AS Yankees_wins,
SUM(CASE WHEN t.teamID = 'BOS' THEN t.W ELSE 0 END) AS RedSox_wins
FROM
teams AS t
WHERE
t.yearID >= 2000 AND t.yearID <= 2021;
'''
# Load data into a dataframe
= pd.read_sql_query(query, conn)
df
# Calculate the total wins for each team
= df['Yankees_wins'][0]
yankees_total_wins = df['RedSox_wins'][0] redsox_total_wins