Making a Connection to the Project Data
# connect to the SQLite database
sqlite_file = 'lahmansbaseballdb.sqlite'
conn = sqlite3.connect(sqlite_file)When you hear the word “relationship” what is the first thing that comes to mind? Probably not baseball. But a relationship is simply a way to describe how two or more objects are connected. There are many relationships in baseball such as those between teams and managers, players and salaries, even stadiums and concession prices.
For this project, your client would like developed SQL queries that they can use to retrieve data for use on their website without needing Python. They would also like to see example Altair charts.
# connect to the SQLite database
sqlite_file = 'lahmansbaseballdb.sqlite'
conn = sqlite3.connect(sqlite_file)Write an SQL query to create a new dataframe about baseball players who attended BYU-Idaho. The new table should contain five columns: playerID, schoolID, salary, and the yearID/teamID associated with each salary. Order the table by salary (highest to lowest) and print out the table in your report.
# Define the SQL query to retrieve the data
query = '''
SELECT DISTINCT
p.playerID,
cp.schoolID,
s.salary,
s.yearID,
s.teamID
FROM
people AS p
LEFT JOIN
collegeplaying AS cp ON p.playerID = cp.playerID
LEFT JOIN
salaries AS s ON p.playerID = s.playerID
WHERE
cp.schoolID = 'idbyuid'
ORDER BY
s.salary DESC
'''
# Execute the query and fetch the results into a DataFrame
byui_players = pd.read_sql_query(query, conn)
# Print the resulting dataframe
byui_players| playerID | schoolID | salary | yearID | teamID | |
|---|---|---|---|---|---|
| 0 | lindsma01 | idbyuid | 4000000.0 | 2014.0 | CHA |
| 1 | lindsma01 | idbyuid | 3600000.0 | 2012.0 | BAL |
| 2 | lindsma01 | idbyuid | 2800000.0 | 2011.0 | COL |
| 3 | lindsma01 | idbyuid | 2300000.0 | 2013.0 | CHA |
| 4 | lindsma01 | idbyuid | 1625000.0 | 2010.0 | HOU |
| 5 | stephga01 | idbyuid | 1025000.0 | 2001.0 | SLN |
| 6 | stephga01 | idbyuid | 900000.0 | 2002.0 | SLN |
| 7 | stephga01 | idbyuid | 800000.0 | 2003.0 | SLN |
| 8 | stephga01 | idbyuid | 550000.0 | 2000.0 | SLN |
| 9 | lindsma01 | idbyuid | 410000.0 | 2009.0 | FLO |
| 10 | lindsma01 | idbyuid | 395000.0 | 2008.0 | FLO |
| 11 | lindsma01 | idbyuid | 380000.0 | 2007.0 | FLO |
| 12 | stephga01 | idbyuid | 215000.0 | 1999.0 | SLN |
| 13 | stephga01 | idbyuid | 185000.0 | 1998.0 | PHI |
| 14 | stephga01 | idbyuid | 150000.0 | 1997.0 | PHI |
| 15 | catetr01 | idbyuid | NaN | NaN | None |
By examining this data, one can learn about the success and earning potential of baseball players who attended BYU-Idaho and which teams they played for during their careers. Additionally, one could analyze salary trends over time or by the team to gain further insights into the baseball industry.
This three-part question requires you to calculate batting average (number of hits divided by the number of at-bats)
Part A. Write an SQL query that provides playerID, yearID, and batting average for players with at least 1 at bat that year. Sort the table from highest batting average to lowest, and then by playerid alphabetically. Show the top 5 results in your report.
# Define the SQL query to retrieve the data
query = '''
SELECT
playerID,
yearID,
CAST(H AS FLOAT) / CAST(AB AS FLOAT) AS Batt_avg
FROM
batting
WHERE
AB >= 1
ORDER BY
Batt_avg DESC,
playerID ASC
LIMIT 5
'''
# Execute the query and fetch the results into a DataFrame
batt_avg_1 = pd.read_sql_query(query, conn)
# Print the resulting dataframe
batt_avg_1| playerID | yearID | Batt_avg | |
|---|---|---|---|
| 0 | aberal01 | 1957 | 1.0 |
| 1 | abernte02 | 1960 | 1.0 |
| 2 | abramge01 | 1923 | 1.0 |
| 3 | acklefr01 | 1964 | 1.0 |
| 4 | alanirj01 | 2019 | 1.0 |
The first table shows the names of the top five baseball players with the highest batting averages and the year they achieved those averages. The batting average is calculated for players with at least one at-bat that year. The ‘Batt_avg’ column represents the ratio of hits (H) to at-bats (AB), giving us insight into how well each player performed at bat.
Part B. Use the same query as above, but only include players with at least 10 at bats that year. Print the top 5 results.
# Define the SQL query to retrieve the data
query = '''
SELECT
playerID,
yearID,
CAST(H AS FLOAT) / CAST(AB AS FLOAT) AS Batt_avg
FROM
batting
WHERE
AB >= 10
ORDER BY
Batt_avg DESC,
playerID ASC
LIMIT 5
'''
# execute the query and load the results into a DataFrame
batt_avg_2 = pd.read_sql_query(query, conn)
# print the top 5 results
batt_avg_2| playerID | yearID | Batt_avg | |
|---|---|---|---|
| 0 | nymanny01 | 1974 | 0.642857 |
| 1 | carsoma01 | 2013 | 0.636364 |
| 2 | altizda01 | 1910 | 0.600000 |
| 3 | johnsde01 | 1975 | 0.600000 |
| 4 | silvech01 | 1948 | 0.571429 |
The second table shows the names of the top five baseball players with the highest batting averages and the year they achieved those averages for players with at least one at-bat that year. By only showing results for players with at least ten at-bats, the code eliminates small sample sizes and gives a more accurate estimate of a player’s ability to hit.
Part C. Now calculate the batting average for players over their entire careers (all years combined). Only include players with at least 100 at bats, and print the top 5 results.
# Define the SQL query to retrieve the data
query = '''
SELECT
playerID,
SUM(H) AS total_hits,
SUM(AB) AS total_at_bats,
ROUND(CAST(SUM(H) AS FLOAT) / CAST(SUM(AB) AS FLOAT), 3) AS career_batt_avg
FROM
batting
GROUP BY
playerID
HAVING
SUM(AB) >= 100
ORDER BY
career_batt_avg DESC,
playerID ASC
LIMIT 5
'''
# Execute the query and load the results into a DataFrame
batt_avg_3 = pd.read_sql_query(query, conn)
# Print the top 5 results
print(batt_avg_3.to_markdown())
# Find the player with the highest batting average
highest_batt_avg_player = batt_avg_3.iloc[0]['playerID']
highest_batt_avg = batt_avg_3.iloc[0]['career_batt_avg']
# Print the player with the highest batting average
print(
f"\nThe player with the highest career batting average is {highest_batt_avg_player}\nwith a career batting average of {highest_batt_avg}.")| | playerID | total_hits | total_at_bats | career_batt_avg |
|---:|:-----------|-------------:|----------------:|------------------:|
| 0 | cobbty01 | 4189 | 11436 | 0.366 |
| 1 | barnero01 | 860 | 2391 | 0.36 |
| 2 | hornsro01 | 2930 | 8173 | 0.358 |
| 3 | jacksjo01 | 1772 | 4981 | 0.356 |
| 4 | meyerle01 | 513 | 1443 | 0.356 |
The player with the highest career batting average is cobbty01
with a career batting average of 0.366.
The third table shows the names of the top five baseball players with the highest career batting averages and players with at least 100 at-bats. The player with the highest batting average over their career among these five players is Ty Cobb (playerID: “cobbty01”), with a batting average of 0.366.
Pick any two baseball teams and compare them using a metric of your choice (average salary, home runs, number of wins, etc). Write an SQL query to get the data you need, then make a graph in Altair to visualize the comparison. What do you learn?
# Define the SQL query to retrieve the data
query = '''
SELECT
t.yearID AS Year,
SUM(CASE WHEN t.teamID = 'NYA' THEN t.L ELSE 0 END) AS Yankees_Losses,
SUM(CASE WHEN t.teamID = 'BOS' THEN t.L ELSE 0 END) AS RedSox_Losses,
SUM(CASE WHEN t.teamID = 'NYA' THEN t.W ELSE 0 END) AS Yankees_wins,
SUM(CASE WHEN t.teamID = 'BOS' THEN t.W ELSE 0 END) AS RedSox_wins
FROM
teams AS t
WHERE
t.yearID >= 2000 AND t.yearID <= 2021
GROUP BY
t.yearID
ORDER BY
t.yearID ASC;
'''
# Load data into a dataframe
df = pd.read_sql_query(query, conn)
# Melt the data to make it easier to plot
melted_df = pd.melt(df, id_vars=["Year"],
var_name="Team_Info", value_name="Value")
# Split the Team_Info column into separate columns for team and metric
melted_df[['Team', 'Metric']] = melted_df['Team_Info'].str.split(
'_', expand=True)
# Create the chart
chart = alt.Chart(melted_df).mark_line().encode(
x="Year:O",
y=alt.Y("Value:Q", axis=alt.Axis(title="Value")),
color="Team:N",
strokeDash="Metric:N",
tooltip=['Year', 'Team', 'Value']
).properties(title={'text': "Yankees vs Red Sox", 'subtitle': "Wins and Losses (2000 - 2021)"}
).resolve_scale(
color='independent'
).configure_axis(grid=True)
# Show the chart
chartThe code compares the New York Yankees and the Boston Red Sox using a metric of wins and losses from the year 2000 to 2021. The data shows that the Yankees have more wins than losses for most years, while the Red Sox have a more inconsistent record with some years having more losses than wins. However, the Red Sox did win the World Series in 2004, 2007, 2013, and 2018.
The Altair chart displays the wins and losses for both teams over time, with the x-axis representing the year and the y-axis representing the value (i.e., wins or losses). The chart also uses different line styles to represent wins and losses and different colors to represent each team. The tooltip shows the year, team, and value when the user hovers over a point on the chart.
Overall, the data and chart show that the Yankees have had more success than the Red Sox over the past two decades in terms of wins and losses, but the Red Sox have had some notable successes with four World Series wins.
# Define the SQL query to retrieve the data
query = '''
SELECT
SUM(CASE WHEN t.teamID = 'NYA' THEN t.W ELSE 0 END) AS Yankees_wins,
SUM(CASE WHEN t.teamID = 'BOS' THEN t.W ELSE 0 END) AS RedSox_wins
FROM
teams AS t
WHERE
t.yearID >= 2000 AND t.yearID <= 2021;
'''
# Load data into a dataframe
df = pd.read_sql_query(query, conn)
# Calculate the total wins for each team
yankees_total_wins = df['Yankees_wins'][0]
redsox_total_wins = df['RedSox_wins'][0]
# Print the overall answer
if yankees_total_wins > redsox_total_wins:
print("The New York Yankees have won more games than \nthe Boston Red Sox over the past two decades.")
elif redsox_total_wins > yankees_total_wins:
print("The Boston Red Sox have won more games than \nthe New York Yankees over the past two decades.")
else:
print("The New York Yankees and the Boston Red Sox \nhave won the same number of games over the past two decades.")The New York Yankees have won more games than
the Boston Red Sox over the past two decades.
I calculated the number of wins over the past two decades to ensure that the New York Yankees are the true winners. We can see that the New York Yankees have had more wins during the last twenty years compared to the Boston Red Sox.
# connect to the SQLite database
sqlite_file = 'lahmansbaseballdb.sqlite'
conn = sqlite3.connect(sqlite_file)
# Question #1
# Define the SQL query to retrieve the data
query = '''
SELECT DISTINCT
p.playerID,
cp.schoolID,
s.salary,
s.yearID,
s.teamID
FROM
people AS p
JOIN
collegeplaying AS cp ON p.playerID = cp.playerID
JOIN
salaries AS s ON p.playerID = s.playerID
WHERE
cp.schoolID = 'idbyuid'
ORDER BY
s.salary DESC
'''
# Execute the query and fetch the results into a DataFrame
byui_players = pd.read_sql_query(query, conn)
# Question #2
# Part A.
# Define the SQL query to retrieve the data
query = '''
SELECT
playerID,
yearID,
CAST(H AS FLOAT) / CAST(AB AS FLOAT) AS Batt_avg
FROM
batting
WHERE
AB >= 1
ORDER BY
Batt_avg DESC,
playerID ASC
LIMIT 5
'''
# Execute the query and fetch the results into a DataFrame
batt_avg_1 = pd.read_sql_query(query, conn)
# Part B.
# Define the SQL query to retrieve the data
query = '''
SELECT
playerID,
yearID,
CAST(H AS FLOAT) / CAST(AB AS FLOAT) AS Batt_avg
FROM
batting
WHERE
AB >= 10
ORDER BY
Batt_avg DESC,
playerID ASC
LIMIT 5
'''
# execute the query and load the results into a DataFrame
batt_avg_2 = pd.read_sql_query(query, conn)
# Part C.
# Define the SQL query to retrieve the data
query = '''
SELECT
playerID,
SUM(H) AS total_hits,
SUM(AB) AS total_at_bats,
ROUND(CAST(SUM(H) AS FLOAT) / CAST(SUM(AB) AS FLOAT), 3) AS career_batt_avg
FROM
batting
GROUP BY
playerID
HAVING
SUM(AB) >= 100
ORDER BY
career_batt_avg DESC,
playerID ASC
LIMIT 5
'''
# Execute the query and load the results into a DataFrame
batt_avg_3 = pd.read_sql_query(query, conn)
# Find the player with the highest batting average
highest_batt_avg_player = batt_avg_3.iloc[0]['playerID']
highest_batt_avg = batt_avg_3.iloc[0]['career_batt_avg']
# Question #3
# Part #1
# Define the SQL query to retrieve the data
query = '''
SELECT
t.yearID AS Year,
SUM(CASE WHEN t.teamID = 'NYA' THEN t.L ELSE 0 END) AS Yankees_Losses,
SUM(CASE WHEN t.teamID = 'BOS' THEN t.L ELSE 0 END) AS RedSox_Losses,
SUM(CASE WHEN t.teamID = 'NYA' THEN t.W ELSE 0 END) AS Yankees_wins,
SUM(CASE WHEN t.teamID = 'BOS' THEN t.W ELSE 0 END) AS RedSox_wins
FROM
teams AS t
WHERE
t.yearID >= 2000 AND t.yearID <= 2021
GROUP BY
t.yearID
ORDER BY
t.yearID ASC;
'''
# Load data into a dataframe
df = pd.read_sql_query(query, conn)
# Melt the data to make it easier to plot
melted_df = pd.melt(df, id_vars=["Year"],
var_name="Team_Info", value_name="Value")
# Split the Team_Info column into separate columns for team and metric
melted_df[['Team', 'Metric']] = melted_df['Team_Info'].str.split(
'_', expand=True)
# Create the chart
chart = alt.Chart(melted_df).mark_line().encode(
x="Year:O",
y=alt.Y("Value:Q", axis=alt.Axis(title="Value")),
color="Team:N",
strokeDash="Metric:N",
tooltip=['Year', 'Team', 'Value']
).properties(title={'text': "Yankees vs Red Sox", 'subtitle': "Wins and Losses (2000 - 2021)"}
).resolve_scale(
color='independent'
).configure_axis(grid=True)
# Part #2
# Define the SQL query to retrieve the data
query = '''
SELECT
SUM(CASE WHEN t.teamID = 'NYA' THEN t.W ELSE 0 END) AS Yankees_wins,
SUM(CASE WHEN t.teamID = 'BOS' THEN t.W ELSE 0 END) AS RedSox_wins
FROM
teams AS t
WHERE
t.yearID >= 2000 AND t.yearID <= 2021;
'''
# Load data into a dataframe
df = pd.read_sql_query(query, conn)
# Calculate the total wins for each team
yankees_total_wins = df['Yankees_wins'][0]
redsox_total_wins = df['RedSox_wins'][0]