Client Report - [Finding relationships in baseball]

Elevator pitch

When you hear the word “relationship” what is the first thing that comes to mind? Probably not baseball. But a relationship is simply a way to describe how two or more objects are connected. There are many relationships in baseball such as those between teams and managers, players and salaries, even stadiums and concession prices.

For this project, your client would like developed SQL queries that they can use to retrieve data for use on their website without needing Python. They would also like to see example Altair charts.

Making a Connection to the Project Data

# connect to the SQLite database
sqlite_file = 'lahmansbaseballdb.sqlite'
conn = sqlite3.connect(sqlite_file)

GRAND QUESTION 1

Write an SQL query to create a new dataframe about baseball players who attended BYU-Idaho. The new table should contain five columns: playerID, schoolID, salary, and the yearID/teamID associated with each salary. Order the table by salary (highest to lowest) and print out the table in your report.

Baseball Players Who Attended BYU-Idaho Ordered by Salary (Highest to Lowest)

# Define the SQL query to retrieve the data
query = '''
SELECT DISTINCT 
    p.playerID, 
    cp.schoolID, 
    s.salary, 
    s.yearID, 
    s.teamID
FROM 
    people AS p
LEFT JOIN 
    collegeplaying AS cp ON p.playerID = cp.playerID
LEFT JOIN 
    salaries AS s ON p.playerID = s.playerID
WHERE 
    cp.schoolID = 'idbyuid'
ORDER BY 
    s.salary DESC
'''

# Execute the query and fetch the results into a DataFrame
byui_players = pd.read_sql_query(query, conn)

# Print the resulting dataframe
byui_players

	playerID	schoolID	salary	yearID	teamID
0	lindsma01	idbyuid	4000000.0	2014.0	CHA
1	lindsma01	idbyuid	3600000.0	2012.0	BAL
2	lindsma01	idbyuid	2800000.0	2011.0	COL
3	lindsma01	idbyuid	2300000.0	2013.0	CHA
4	lindsma01	idbyuid	1625000.0	2010.0	HOU
5	stephga01	idbyuid	1025000.0	2001.0	SLN
6	stephga01	idbyuid	900000.0	2002.0	SLN
7	stephga01	idbyuid	800000.0	2003.0	SLN
8	stephga01	idbyuid	550000.0	2000.0	SLN
9	lindsma01	idbyuid	410000.0	2009.0	FLO
10	lindsma01	idbyuid	395000.0	2008.0	FLO
11	lindsma01	idbyuid	380000.0	2007.0	FLO
12	stephga01	idbyuid	215000.0	1999.0	SLN
13	stephga01	idbyuid	185000.0	1998.0	PHI
14	stephga01	idbyuid	150000.0	1997.0	PHI
15	catetr01	idbyuid	NaN	NaN	None

By examining this data, one can learn about the success and earning potential of baseball players who attended BYU-Idaho and which teams they played for during their careers. Additionally, one could analyze salary trends over time or by the team to gain further insights into the baseball industry.

GRAND QUESTION 2

This three-part question requires you to calculate batting average (number of hits divided by the number of at-bats)

Part A. Write an SQL query that provides playerID, yearID, and batting average for players with at least 1 at bat that year. Sort the table from highest batting average to lowest, and then by playerid alphabetically. Show the top 5 results in your report.

Top 5 Batting Averages for Players With at Least 1 at Bat

# Define the SQL query to retrieve the data
query = '''
SELECT 
    playerID, 
    yearID, 
    CAST(H AS FLOAT) / CAST(AB AS FLOAT) AS Batt_avg
FROM 
    batting
WHERE 
    AB >= 1
ORDER BY 
    Batt_avg DESC, 
    playerID ASC
LIMIT 5
'''

# Execute the query and fetch the results into a DataFrame
batt_avg_1 = pd.read_sql_query(query, conn)

# Print the resulting dataframe
batt_avg_1

	playerID	yearID	Batt_avg
0	aberal01	1957	1.0
1	abernte02	1960	1.0
2	abramge01	1923	1.0
3	acklefr01	1964	1.0
4	alanirj01	2019	1.0

The first table shows the names of the top five baseball players with the highest batting averages and the year they achieved those averages. The batting average is calculated for players with at least one at-bat that year. The ‘Batt_avg’ column represents the ratio of hits (H) to at-bats (AB), giving us insight into how well each player performed at bat.

Part B. Use the same query as above, but only include players with at least 10 at bats that year. Print the top 5 results.

Top 5 Batting Averages for Players With at Least 10 at Bats

# Define the SQL query to retrieve the data
query = '''
SELECT 
    playerID, 
    yearID, 
    CAST(H AS FLOAT) / CAST(AB AS FLOAT) AS Batt_avg
FROM 
    batting
WHERE 
    AB >= 10
ORDER BY 
    Batt_avg DESC, 
    playerID ASC
LIMIT 5
'''

# execute the query and load the results into a DataFrame
batt_avg_2 = pd.read_sql_query(query, conn)

# print the top 5 results
batt_avg_2

	playerID	yearID	Batt_avg
0	nymanny01	1974	0.642857
1	carsoma01	2013	0.636364
2	altizda01	1910	0.600000
3	johnsde01	1975	0.600000
4	silvech01	1948	0.571429

The second table shows the names of the top five baseball players with the highest batting averages and the year they achieved those averages for players with at least one at-bat that year. By only showing results for players with at least ten at-bats, the code eliminates small sample sizes and gives a more accurate estimate of a player’s ability to hit.

Part C. Now calculate the batting average for players over their entire careers (all years combined). Only include players with at least 100 at bats, and print the top 5 results.

Top 5 Batting Averages for Players With at Least 100 at Bats

# Define the SQL query to retrieve the data
query = '''
SELECT 
    playerID, 
    SUM(H) AS total_hits, 
    SUM(AB) AS total_at_bats, 
    ROUND(CAST(SUM(H) AS FLOAT) / CAST(SUM(AB) AS FLOAT), 3) AS career_batt_avg
FROM 
    batting
GROUP BY 
    playerID
HAVING 
    SUM(AB) >= 100
ORDER BY 
    career_batt_avg DESC, 
    playerID ASC
LIMIT 5
'''

# Execute the query and load the results into a DataFrame
batt_avg_3 = pd.read_sql_query(query, conn)

# Print the top 5 results
print(batt_avg_3.to_markdown())

# Find the player with the highest batting average
highest_batt_avg_player = batt_avg_3.iloc[0]['playerID']
highest_batt_avg = batt_avg_3.iloc[0]['career_batt_avg']

# Print the player with the highest batting average
print(
    f"\nThe player with the highest career batting average is {highest_batt_avg_player}\nwith a career batting average of {highest_batt_avg}.")

|    | playerID   |   total_hits |   total_at_bats |   career_batt_avg |
|---:|:-----------|-------------:|----------------:|------------------:|
|  0 | cobbty01   |         4189 |           11436 |             0.366 |
|  1 | barnero01  |          860 |            2391 |             0.36  |
|  2 | hornsro01  |         2930 |            8173 |             0.358 |
|  3 | jacksjo01  |         1772 |            4981 |             0.356 |
|  4 | meyerle01  |          513 |            1443 |             0.356 |

The player with the highest career batting average is cobbty01
with a career batting average of 0.366.

The third table shows the names of the top five baseball players with the highest career batting averages and players with at least 100 at-bats. The player with the highest batting average over their career among these five players is Ty Cobb (playerID: “cobbty01”), with a batting average of 0.366.

GRAND QUESTION 3

Pick any two baseball teams and compare them using a metric of your choice (average salary, home runs, number of wins, etc). Write an SQL query to get the data you need, then make a graph in Altair to visualize the comparison. What do you learn?

Chart Comparing Wins and Losses of New York Yankees and Boston Red Sox

# Define the SQL query to retrieve the data
query = '''
SELECT 
    t.yearID AS Year, 
    SUM(CASE WHEN t.teamID = 'NYA' THEN t.L ELSE 0 END) AS Yankees_Losses,
    SUM(CASE WHEN t.teamID = 'BOS' THEN t.L ELSE 0 END) AS RedSox_Losses,
    SUM(CASE WHEN t.teamID = 'NYA' THEN t.W ELSE 0 END) AS Yankees_wins,
    SUM(CASE WHEN t.teamID = 'BOS' THEN t.W ELSE 0 END) AS RedSox_wins
FROM 
    teams AS t
WHERE 
    t.yearID >= 2000 AND t.yearID <= 2021
GROUP BY 
    t.yearID
ORDER BY 
    t.yearID ASC;
'''

# Load data into a dataframe
df = pd.read_sql_query(query, conn)

# Melt the data to make it easier to plot
melted_df = pd.melt(df, id_vars=["Year"],
                    var_name="Team_Info", value_name="Value")

# Split the Team_Info column into separate columns for team and metric
melted_df[['Team', 'Metric']] = melted_df['Team_Info'].str.split(
    '_', expand=True)

# Create the chart
chart = alt.Chart(melted_df).mark_line().encode(
    x="Year:O",
    y=alt.Y("Value:Q", axis=alt.Axis(title="Value")),
    color="Team:N",
    strokeDash="Metric:N",
    tooltip=['Year', 'Team', 'Value']
).properties(title={'text': "Yankees vs Red Sox", 'subtitle': "Wins and Losses (2000 - 2021)"}
             ).resolve_scale(
    color='independent'
).configure_axis(grid=True)

# Show the chart
chart

The code compares the New York Yankees and the Boston Red Sox using a metric of wins and losses from the year 2000 to 2021. The data shows that the Yankees have more wins than losses for most years, while the Red Sox have a more inconsistent record with some years having more losses than wins. However, the Red Sox did win the World Series in 2004, 2007, 2013, and 2018.

The Altair chart displays the wins and losses for both teams over time, with the x-axis representing the year and the y-axis representing the value (i.e., wins or losses). The chart also uses different line styles to represent wins and losses and different colors to represent each team. The tooltip shows the year, team, and value when the user hovers over a point on the chart.

Overall, the data and chart show that the Yankees have had more success than the Red Sox over the past two decades in terms of wins and losses, but the Red Sox have had some notable successes with four World Series wins.

Calculating What Team Has Won More Games

# Define the SQL query to retrieve the data
query = '''
SELECT 
    SUM(CASE WHEN t.teamID = 'NYA' THEN t.W ELSE 0 END) AS Yankees_wins,
    SUM(CASE WHEN t.teamID = 'BOS' THEN t.W ELSE 0 END) AS RedSox_wins
FROM 
    teams AS t
WHERE 
    t.yearID >= 2000 AND t.yearID <= 2021;
'''

# Load data into a dataframe
df = pd.read_sql_query(query, conn)

# Calculate the total wins for each team
yankees_total_wins = df['Yankees_wins'][0]
redsox_total_wins = df['RedSox_wins'][0]

# Print the overall answer
if yankees_total_wins > redsox_total_wins:
    print("The New York Yankees have won more games than \nthe Boston Red Sox over the past two decades.")
elif redsox_total_wins > yankees_total_wins:
    print("The Boston Red Sox have won more games than \nthe New York Yankees over the past two decades.")
else:
    print("The New York Yankees and the Boston Red Sox \nhave won the same number of games over the past two decades.")

The New York Yankees have won more games than 
the Boston Red Sox over the past two decades.

I calculated the number of wins over the past two decades to ensure that the New York Yankees are the true winners. We can see that the New York Yankees have had more wins during the last twenty years compared to the Boston Red Sox.

APPENDIX A (Additional Python Code)

Show the code

# connect to the SQLite database
sqlite_file = 'lahmansbaseballdb.sqlite'
conn = sqlite3.connect(sqlite_file)


# Question #1

# Define the SQL query to retrieve the data
query = '''
SELECT DISTINCT 
    p.playerID, 
    cp.schoolID, 
    s.salary, 
    s.yearID, 
    s.teamID
FROM 
    people AS p
JOIN 
    collegeplaying AS cp ON p.playerID = cp.playerID
JOIN 
    salaries AS s ON p.playerID = s.playerID
WHERE 
    cp.schoolID = 'idbyuid'
ORDER BY 
    s.salary DESC
'''

# Execute the query and fetch the results into a DataFrame
byui_players = pd.read_sql_query(query, conn)


# Question #2

# Part A.
# Define the SQL query to retrieve the data
query = '''
SELECT 
    playerID, 
    yearID, 
    CAST(H AS FLOAT) / CAST(AB AS FLOAT) AS Batt_avg
FROM 
    batting
WHERE 
    AB >= 1
ORDER BY 
    Batt_avg DESC, 
    playerID ASC
LIMIT 5
'''

# Execute the query and fetch the results into a DataFrame
batt_avg_1 = pd.read_sql_query(query, conn)

# Part B.
# Define the SQL query to retrieve the data
query = '''
SELECT 
    playerID, 
    yearID, 
    CAST(H AS FLOAT) / CAST(AB AS FLOAT) AS Batt_avg
FROM 
    batting
WHERE 
    AB >= 10
ORDER BY 
    Batt_avg DESC, 
    playerID ASC
LIMIT 5
'''

# execute the query and load the results into a DataFrame
batt_avg_2 = pd.read_sql_query(query, conn)

# Part C.
# Define the SQL query to retrieve the data
query = '''
SELECT 
    playerID, 
    SUM(H) AS total_hits, 
    SUM(AB) AS total_at_bats, 
    ROUND(CAST(SUM(H) AS FLOAT) / CAST(SUM(AB) AS FLOAT), 3) AS career_batt_avg
FROM 
    batting
GROUP BY 
    playerID
HAVING 
    SUM(AB) >= 100
ORDER BY 
    career_batt_avg DESC, 
    playerID ASC
LIMIT 5
'''

# Execute the query and load the results into a DataFrame
batt_avg_3 = pd.read_sql_query(query, conn)

# Find the player with the highest batting average
highest_batt_avg_player = batt_avg_3.iloc[0]['playerID']
highest_batt_avg = batt_avg_3.iloc[0]['career_batt_avg']


# Question #3

# Part #1
# Define the SQL query to retrieve the data
query = '''
SELECT 
    t.yearID AS Year, 
    SUM(CASE WHEN t.teamID = 'NYA' THEN t.L ELSE 0 END) AS Yankees_Losses,
    SUM(CASE WHEN t.teamID = 'BOS' THEN t.L ELSE 0 END) AS RedSox_Losses,
    SUM(CASE WHEN t.teamID = 'NYA' THEN t.W ELSE 0 END) AS Yankees_wins,
    SUM(CASE WHEN t.teamID = 'BOS' THEN t.W ELSE 0 END) AS RedSox_wins
FROM 
    teams AS t
WHERE 
    t.yearID >= 2000 AND t.yearID <= 2021
GROUP BY 
    t.yearID
ORDER BY 
    t.yearID ASC;
'''

# Load data into a dataframe
df = pd.read_sql_query(query, conn)

# Melt the data to make it easier to plot
melted_df = pd.melt(df, id_vars=["Year"],
                    var_name="Team_Info", value_name="Value")

# Split the Team_Info column into separate columns for team and metric
melted_df[['Team', 'Metric']] = melted_df['Team_Info'].str.split(
    '_', expand=True)

# Create the chart
chart = alt.Chart(melted_df).mark_line().encode(
    x="Year:O",
    y=alt.Y("Value:Q", axis=alt.Axis(title="Value")),
    color="Team:N",
    strokeDash="Metric:N",
    tooltip=['Year', 'Team', 'Value']
).properties(title={'text': "Yankees vs Red Sox", 'subtitle': "Wins and Losses (2000 - 2021)"}
             ).resolve_scale(
    color='independent'
).configure_axis(grid=True)


# Part #2
# Define the SQL query to retrieve the data
query = '''
SELECT 
    SUM(CASE WHEN t.teamID = 'NYA' THEN t.W ELSE 0 END) AS Yankees_wins,
    SUM(CASE WHEN t.teamID = 'BOS' THEN t.W ELSE 0 END) AS RedSox_wins
FROM 
    teams AS t
WHERE 
    t.yearID >= 2000 AND t.yearID <= 2021;
'''

# Load data into a dataframe
df = pd.read_sql_query(query, conn)

# Calculate the total wins for each team
yankees_total_wins = df['Yankees_wins'][0]
redsox_total_wins = df['RedSox_wins'][0]