# Load required packages (hint: you need tidycensus, tidyverse, and knitr)
library(tidyverse)
library(tidycensus)
library(scales)
library(RColorBrewer)
library(knitr)
# Set your Census API key
census_api_key(Sys.getenv("CENSUS_API_KEY"))
# Choose your state for analysis - assign it to a variable called my_state
my_state <- "PA"
Lab 1: Census Data Quality for Policy Decisions
Evaluating Data Reliability for Algorithmic Decision-Making
Assignment Overview
Scenario
You are a data analyst for the Pennsylvania Department of Human Services. The department is considering implementing an algorithmic system to identify communities that should receive priority for social service funding and outreach programs. Your supervisor has asked you to evaluate the quality and reliability of available census data to inform this decision.
Drawing on our Week 2 discussion of algorithmic bias, you need to assess not just what the data shows, but how reliable it is and what communities might be affected by data quality issues.
Learning Objectives
- Apply dplyr functions to real census data for policy analysis
- Evaluate data quality using margins of error
- Connect technical analysis to algorithmic decision-making
- Identify potential equity implications of data reliability issues
- Create professional documentation for policy stakeholders
Submission Instructions
Submit by posting your updated portfolio link on Canvas. Your assignment should be accessible at your-portfolio-url/labs/lab_1/
Make sure to update your _quarto.yml navigation to include this assignment under a “Labs” menu.
Part 1: Portfolio Integration
Create this assignment in your portfolio repository under a labs/lab_1/ folder structure. Update your navigation menu to include:
- text: Assignments
  menu:
    - href: labs/lab_1/your_file_name.qmd
      text: "Lab 1: Census Data Exploration"
If the text contains a special character such as a colon, wrap it in double quotes so that Quarto reads it as text.
Setup
State Selection: I have chosen Pennsylvania for this analysis because I grew up here, attended undergrad here, and am now in grad school here.
Part 2: County-Level Resource Assessment
2.1 Data Retrieval
Your Task: Use get_acs() to retrieve county-level data for your chosen state.
Requirements:
- Geography: county level
- Variables: median household income (B19013_001) and total population (B01003_001)
- Year: 2022
- Survey: acs5
- Output format: wide
Hint: Remember to give your variables descriptive names using the variables = c(name = "code") syntax.
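As a small illustration of the naming hint, the sketch below shows how the names in a named character vector become column prefixes in wide output, where an E suffix marks the estimate and an M suffix marks the margin of error (as in total_popE and total_popM in the output further down). The vector expected_cols is only an illustrative helper and is not part of tidycensus.

```r
# Named vector: names are the descriptive labels, values are the ACS variable codes
acs_vars <- c(
  total_pop     = "B01003_001",
  median_income = "B19013_001"
)

# In wide output each label yields two columns: <label>E and <label>M
# (expected_cols is an illustrative helper, not a tidycensus object)
expected_cols <- as.vector(rbind(
  paste0(names(acs_vars), "E"),
  paste0(names(acs_vars), "M")
))
print(expected_cols)
```

Checking the expected column names this way can help catch a typo in a label before it propagates into later mutate() calls.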
# Write your get_acs() code here
medHHin_totalpop <- get_acs(
geography = "county",
variables = c(
total_pop = "B01003_001", # Total population
median_income = "B19013_001" # Median household income
),
state = "PA",
year = 2022,
survey = "acs5",
output = "wide"
)
# Clean the county names to remove state name and "County"
# Hint: use mutate() with str_remove()
county_incomepop <- medHHin_totalpop %>%
mutate(county_name = str_remove(NAME, " County, Pennsylvania"))
# Display the first few rows
glimpse(county_incomepop)
Rows: 67
Columns: 7
$ GEOID <chr> "42001", "42003", "42005", "42007", "42009", "42011", "…
$ NAME <chr> "Adams County, Pennsylvania", "Allegheny County, Pennsy…
$ total_popE <dbl> 104604, 1245310, 65538, 167629, 47613, 428483, 122640, …
$ total_popM <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ median_incomeE <dbl> 78975, 72537, 61011, 67194, 58337, 74617, 59386, 60650,…
$ median_incomeM <dbl> 3334, 869, 2202, 1531, 2606, 1191, 2058, 2167, 1516, 21…
$ county_name <chr> "Adams", "Allegheny", "Armstrong", "Beaver", "Bedford",…
2.2 Data Quality Assessment
Your Task: Calculate margin of error percentages and create reliability categories.
Requirements:
- Calculate MOE percentage: (margin of error / estimate) * 100
- Create reliability categories:
  - High Confidence: MOE < 5%
  - Moderate Confidence: MOE 5-10%
  - Low Confidence: MOE > 10%
- Create a flag for unreliable estimates (MOE > 10%)
Hint: Use mutate() with case_when() for the categories.
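The thresholds above can be tried on a few in-memory rows before running them on the API result. This is a minimal sketch, not the assignment's reference solution: the Allegheny and Forest values are copied from this report's county output, while the "Example" row is a hypothetical county added only to show how a value of exactly 10% is handled. The names toy, toy_rel, moe_pct, and low_flag are illustrative.

```r
library(dplyr)

# Two rows copied from this report's county output plus one hypothetical
# boundary row ("Example") whose MOE is exactly 10% of the estimate
toy <- tibble(
  county_name    = c("Allegheny", "Forest", "Example"),
  median_incomeE = c(72537, 46188, 50000),
  median_incomeM = c(869, 4612, 5000)
)

toy_rel <- toy %>%
  mutate(
    # MOE as a percentage of the estimate
    moe_pct = median_incomeM / median_incomeE * 100,
    # case_when() evaluates conditions in order, so the second branch
    # covers 5% through 10% inclusive
    reliability = case_when(
      moe_pct < 5   ~ "High",
      moe_pct <= 10 ~ "Moderate",
      moe_pct > 10  ~ "Low"
    ),
    # Logical flag for unreliable estimates
    low_flag = moe_pct > 10
  )

print(toy_rel)
```

Writing the middle branch as an inclusive upper bound means a county sitting exactly on 5% or 10% still receives a category instead of NA.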
# Calculate MOE percentage and reliability categories using mutate()
county_incomepop <- county_incomepop %>%
mutate(MOE_per = median_incomeM/median_incomeE * 100)
county_incomepop <- county_incomepop %>%
mutate(
reliability = case_when(
MOE_per < 5 ~ "High",
MOE_per <= 10 ~ "Moderate", # 5% to 10% inclusive, so boundary values are not left as NA
MOE_per > 10 ~ "Low"))
county_incomepop <- county_incomepop %>%
mutate(Low_Flag = if_else(reliability == "Low", 1L, 0L))
# Create a summary showing count of counties in each reliability category
county_incomepop %>%
group_by(reliability) %>%
summarise(count = n(), .groups = "drop")
# A tibble: 2 × 2
reliability count
<chr> <int>
1 High 57
2 Moderate 10
# 57 High
# 10 Moderate
# Hint: use count() and mutate() to add percentages
2.3 High Uncertainty Counties
Your Task: Identify the 5 counties with the highest MOE percentages.
Requirements:
- Sort by MOE percentage (highest first)
- Select the top 5 counties
- Display: county name, median income, margin of error, MOE percentage, reliability category
- Format as a professional table using kable()
Hint: Use arrange(), slice(), and select() functions.
# Create table of top 5 counties by MOE percentage
high5_uncertainty <- county_incomepop %>%
arrange(desc(MOE_per)) %>%
slice(1:5) %>%
select(
NAME,
median_incomeE,
median_incomeM,
MOE_per,
reliability
)
# Format as table with kable() - include appropriate column names and caption
kable(
high5_uncertainty,
col.names = c(
"County",
"Median Income",
"Income Margin of Error",
"Margin Of Error Percentage",
"Reliability Category"
),
caption = "Top 5 Counties with the Highest Margin of Error Percentages"
)
| County | Median Income | Income Margin of Error | Margin Of Error Percentage | Reliability Category |
|---|---|---|---|---|
| Forest County, Pennsylvania | 46188 | 4612 | 9.985278 | Moderate |
| Sullivan County, Pennsylvania | 62910 | 5821 | 9.252901 | Moderate |
| Union County, Pennsylvania | 64914 | 4753 | 7.321995 | Moderate |
| Montour County, Pennsylvania | 72626 | 5146 | 7.085617 | Moderate |
| Elk County, Pennsylvania | 61672 | 4091 | 6.633480 | Moderate |
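Data Quality Commentary: No county exceeds the 10% threshold, but the five least reliable estimates (Forest, Sullivan, Union, Montour, and Elk) have margins of error between 6.6% and 10.0% of median income. Counties with lower data quality and/or smaller populations may be more poorly served by algorithms that treat the median income estimate as exact. In Forest County, for example, the margin of error is $4,612 on an estimate of $46,188, so a funding rule based on an income cutoff could classify the county differently depending on where the true value falls within that range.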
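To make the commentary concrete, the sketch below turns the Forest County margin of error into a confidence interval and a coefficient of variation. It assumes the margin of error is at the 90% confidence level, which is how the Census Bureau publishes ACS margins of error, so the implied standard error is the MOE divided by 1.645. The variable names are illustrative.

```r
# Forest County values taken from the table above
estimate <- 46188
moe_90   <- 4612   # assumed to be a 90% confidence margin of error

# 90% confidence interval: estimate plus or minus the MOE
ci_90 <- c(lower = estimate - moe_90, upper = estimate + moe_90)

# Standard error implied by a 90% MOE (z = 1.645), and the
# coefficient of variation as a percentage of the estimate
se     <- moe_90 / 1.645
cv_pct <- se / estimate * 100

print(ci_90)
print(round(c(se = se, cv_pct = cv_pct), 1))
```

Under that assumption the interval runs from $41,576 to $50,800, a spread of more than $9,000 around the point estimate.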
Part 3: Neighborhood-Level Analysis
3.1 Focus Area Selection
Your Task: Select 2-3 counties from your reliability analysis for detailed tract-level study.
Strategy: Choose counties that represent different reliability levels (e.g., 1 high confidence, 1 moderate, 1 low confidence) to compare how data quality varies.
# Use filter() to select 2-3 counties from your county_reliability data
# Store the selected counties in a variable called selected_counties
selected_counties <- county_incomepop %>%
filter(
county_name %in% c("Allegheny", "Forest", "Philadelphia")
)
# Display the selected counties with their key characteristics
# Show: county name, median income, MOE percentage, reliability category
selected_counties %>%
select(
county_name,
median_incomeE,
MOE_per,
reliability
) %>%
kable(
col.names = c(
"County",
"Median Income",
"MOE Percentage",
"Reliability Category"
),
caption = "Selected Counties for Neighborhood-Level (Tract) Analysis"
)
| County | Median Income | MOE Percentage | Reliability Category |
|---|---|---|---|
| Allegheny | 72537 | 1.198009 | High |
| Forest | 46188 | 9.985278 | Moderate |
| Philadelphia | 57537 | 1.376506 | High |
Comment on the output: The selected counties are Allegheny (PA’s lowest MOE %), Forest (PA’s highest MOE %), and Philadelphia (chosen because we live and study here).
3.2 Tract-Level Demographics
Your Task: Get demographic data for census tracts in your selected counties.
Requirements:
- Geography: tract level
- Variables: white alone (B03002_003), Black/African American (B03002_004), Hispanic/Latino (B03002_012), total population (B03002_001)
- Use the same state and year as before
- Output format: wide
- Challenge: You’ll need county codes, not names. Look at the GEOID patterns in your county data for hints.
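One way to read the GEOID pattern mentioned in the challenge: a county GEOID is the two-digit state FIPS code followed by the three-digit county FIPS code. The sketch below splits three GEOIDs copied from the county output earlier in this report; the names county_geoids, state_fips, and county_fips are illustrative.

```r
# GEOIDs copied from the county output earlier in this report
county_geoids <- c(Adams = "42001", Allegheny = "42003", Armstrong = "42005")

# Characters 1-2 are the state FIPS code, characters 3-5 the county FIPS code
state_fips  <- substr(county_geoids, 1, 2)
county_fips <- substr(county_geoids, 3, 5)

print(state_fips)
print(county_fips)
```

The three-digit codes are kept as character strings so that leading zeros are preserved when they are passed to the county argument of get_acs().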
# Define your race/ethnicity variables with descriptive names
race_vars <- c(
total_pop = "B03002_001",
white = "B03002_003",
black = "B03002_004",
hispanic = "B03002_012"
)
# Use get_acs() to retrieve tract-level data
# Hint: You may need to specify county codes in the county parameter
# Allegheny = 003, Forest = 053, Philadelphia = 101
tract_demographics <- get_acs(
geography = "tract",
state = "PA",
county = c("003", "053", "101"),
variables = race_vars,
year = 2022, # same ACS year used earlier
output = "wide"
)
# Calculate percentage of each group using mutate()
# Create percentages for white, Black, and Hispanic populations
tract_demographics <- tract_demographics %>%
mutate(
pct_white = whiteE / total_popE * 100,
pct_black = blackE / total_popE * 100,
pct_hispanic = hispanicE / total_popE * 100
)
# Add readable tract and county name columns using str_extract() or similar
tract_demographics <- tract_demographics %>%
mutate(
tract_name = str_extract(NAME, "^Census Tract[^;]+"),
county_name = str_extract(NAME, "(?<=; )[A-Za-z ]+(?= County)")
)
3.3 Demographic Analysis
Your Task: Analyze the demographic patterns in your selected areas.
# Find the tract with the highest percentage of Hispanic/Latino residents
# Hint: use arrange() and slice() to get the top tract
tract_mostlatino <- tract_demographics %>%
arrange(desc(pct_hispanic)) %>%
slice(1)
# Calculate average demographics by county using group_by() and summarize()
# Show: number of tracts, average percentage for each racial/ethnic group
county_demographics <- tract_demographics %>%
group_by(county_name) %>%
summarise(
num_tracts = n(),
avg_pct_white = mean(pct_white, na.rm = TRUE),
avg_pct_black = mean(pct_black, na.rm = TRUE),
avg_pct_hispanic = mean(pct_hispanic, na.rm = TRUE),
.groups = "drop"
)
# Create a nicely formatted table of your results using kable()
kable(
county_demographics,
col.names = c(
"County",
"Number of Tracts",
"Avg % White",
"Avg % Black",
"Avg % Hispanic/Latino"
),
caption = "Average Tract-Level Demographics by County"
)
| County | Number of Tracts | Avg % White | Avg % Black | Avg % Hispanic/Latino |
|---|---|---|---|---|
| Allegheny | 394 | 74.45359 | 15.41641 | 2.416422 |
| Forest | 2 | 71.18900 | 13.56075 | 7.379975 |
| Philadelphia | 408 | 35.38937 | 39.19500 | 13.843119 |
Part 4: Comprehensive Data Quality Evaluation
4.1 MOE Analysis for Demographic Variables
Your Task: Examine margins of error for demographic variables to see if some communities have less reliable data.
Requirements:
- Calculate MOE percentages for each demographic variable
- Flag tracts where any demographic variable has MOE > 15%
- Create summary statistics
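The "any variable above 15%" rule can be sketched on a few made-up rows. All values below are hypothetical, not real ACS figures. Tract "C" has a zero estimate, which makes the MOE percentage infinite and therefore trips the flag; small or zero group counts are one plausible reason so many tracts end up flagged. The sketch uses dplyr's if_any() as an alternative to chaining | conditions by hand.

```r
library(dplyr)

# Hypothetical tract rows (not real ACS values); tract "C" has a zero estimate
toy_tracts <- tibble(
  tract     = c("A", "B", "C"),
  whiteE    = c(2000, 1500, 900),
  whiteM    = c(150, 120, 100),
  hispanicE = c(400, 60, 0),
  hispanicM = c(50, 45, 12)
)

flagged <- toy_tracts %>%
  mutate(
    moe_pct_white    = whiteM / whiteE * 100,
    moe_pct_hispanic = hispanicM / hispanicE * 100   # 12 / 0 gives Inf
  ) %>%
  mutate(
    # TRUE when any moe_pct_* column exceeds the 15% threshold
    high_moe = if_any(starts_with("moe_pct_"), ~ .x > 15)
  )

print(flagged)
```

Tract "A" stays below 15% on both variables, while "B" and "C" are flagged because of their Hispanic estimates alone.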
# Calculate MOE percentages for white, Black, and Hispanic variables
# Hint: use the same formula as before (margin/estimate * 100)
tract_demographics <- tract_demographics %>%
mutate(MOE_per_white = whiteM/whiteE *100 ) %>%
mutate(MOE_per_black = blackM/blackE *100) %>%
mutate(MOE_per_hispanic = hispanicM/hispanicE*100)
# Create a flag for tracts with high MOE on any demographic variable
# Use logical operators (| for OR) in an ifelse() statement
tract_demographics <- tract_demographics %>%
mutate(high_moe_flag = ifelse(
MOE_per_white > 15 | MOE_per_black > 15 | MOE_per_hispanic > 15,
"High MOE",
"Acceptable"
))
# Create summary statistics showing how many tracts have data quality issues
tract_demographics %>%
summarise(
total_tracts = n(),
tracts_with_issues = sum(high_moe_flag == "High MOE", na.rm = TRUE),
percent_with_issues = 100 * mean(high_moe_flag == "High MOE", na.rm = TRUE)
)
# A tibble: 1 × 3
total_tracts tracts_with_issues percent_with_issues
<int> <int> <dbl>
1 804 804 100
#100% ...
4.2 Pattern Analysis
Your Task: Investigate whether data quality problems are randomly distributed or concentrated in certain types of communities.
# Group tracts by whether they have high MOE issues
tract_demographics <- tract_demographics %>%
mutate(
high_moe_white = ifelse(MOE_per_white > 15, "High MOE", "Acceptable"),
high_moe_black = ifelse(MOE_per_black > 15, "High MOE", "Acceptable"),
high_moe_hispanic = ifelse(MOE_per_hispanic > 15, "High MOE", "Acceptable")
)
tracts_by_race_moe <- tract_demographics %>%
summarise(
white_high_moe = sum(high_moe_white == "High MOE", na.rm = TRUE),
black_high_moe = sum(high_moe_black == "High MOE", na.rm = TRUE),
hispanic_high_moe = sum(high_moe_hispanic == "High MOE", na.rm = TRUE),
total_tracts = n()
) %>%
mutate(
pct_white_high_moe = 100 * white_high_moe / total_tracts,
pct_black_high_moe = 100 * black_high_moe / total_tracts,
pct_hispanic_high_moe = 100 * hispanic_high_moe / total_tracts
)
# Calculate average characteristics for each group:
# - population size, demographic percentages
#white
white_summary <- tract_demographics %>%
group_by(high_moe_white) %>%
summarise(
n_tracts = n(),
avg_population = mean(total_popE, na.rm = TRUE),
avg_pct_white = mean(pct_white, na.rm = TRUE),
avg_pct_black = mean(pct_black, na.rm = TRUE),
avg_pct_hispanic = mean(pct_hispanic, na.rm = TRUE)
) %>%
mutate(subgroup = "White")
#black
black_summary <- tract_demographics %>%
group_by(high_moe_black) %>%
summarise(
n_tracts = n(),
avg_population = mean(total_popE, na.rm = TRUE),
avg_pct_white = mean(pct_white, na.rm = TRUE),
avg_pct_black = mean(pct_black, na.rm = TRUE),
avg_pct_hispanic = mean(pct_hispanic, na.rm = TRUE)
) %>%
mutate(subgroup = "Black")
#hispanic
hispanic_summary <- tract_demographics %>%
group_by(high_moe_hispanic) %>%
summarise(
n_tracts = n(),
avg_population = mean(total_popE, na.rm = TRUE),
avg_pct_white = mean(pct_white, na.rm = TRUE),
avg_pct_black = mean(pct_black, na.rm = TRUE),
avg_pct_hispanic = mean(pct_hispanic, na.rm = TRUE)
) %>%
mutate(subgroup = "Hispanic")
# Use group_by() and summarize() to create this comparison
# Each subgroup summary stores its flag in a differently named column,
# so combine them into a single moe_flag column after binding
pattern_summary <- bind_rows(white_summary, black_summary, hispanic_summary) %>%
mutate(moe_flag = coalesce(high_moe_white, high_moe_black, high_moe_hispanic)) %>%
select(subgroup, moe_flag, n_tracts, starts_with("avg_")) %>%
arrange(subgroup, desc(n_tracts))
# Create a professional table showing the patterns
kable(pattern_summary, digits = 1,
caption = "Average Tract Characteristics by Racial/Ethnic Subgroup and MOE Flag")
| subgroup | moe_flag | n_tracts | avg_population | avg_pct_white | avg_pct_black | avg_pct_hispanic |
|---|---|---|---|---|---|---|
| Black | High MOE | 795 | 3548.6 | 55.2 | 26.9 | 8.2 |
| Black | Acceptable | 9 | 2705.2 | 19.8 | 68.6 | 5.1 |
| Hispanic | High MOE | 803 | 3534.1 | 54.8 | 27.4 | 8.2 |
| Hispanic | Acceptable | 1 | 7591.0 | 63.0 | 6.1 | 5.3 |
| White | High MOE | 564 | 3406.2 | 42.1 | 36.8 | 10.5 |
| White | Acceptable | 240 | 3851.7 | 83.4 | 6.1 | 2.9 |
Pattern Analysis: Because the ACS is a sample-based estimate (as opposed to the fuller count from the decennial Census) and census tracts are a very granular geography, some substantial margins of error are expected. Even so, I was surprised that all 804 census tracts in this analysis were flagged for having at least one racial/ethnic estimate with an MOE above 15%. Breaking the flag out by group shows that the problem is far more widespread for the Black and Hispanic/Latino estimates: 795 and 803 of the 804 tracts are flagged, respectively (only 1 tract has an Acceptable MOE for its Hispanic/Latino estimate), compared with 564 for the white estimate. Tracts with an Acceptable white MOE average 83.4% white, while flagged tracts average 42.1% white, which suggests that the MOE grows as a group's share of a tract shrinks. This means Black and Latino residents of these 3 counties are represented far less reliably in the data.
Part 5: Policy Recommendations
5.1 Analysis Integration and Professional Summary
Your Task: Write an executive summary that integrates findings from all four analyses.
Executive Summary Requirements:
1. Overall Pattern Identification: What are the systematic patterns across all your analyses?
2. Equity Assessment: Which communities face the greatest risk of algorithmic bias based on your findings?
3. Root Cause Analysis: What underlying factors drive both data quality issues and bias risk?
4. Strategic Recommendations: What should the Department implement to address these systematic issues?
Executive Summary: Across the four analyses, data quality problems are not randomly distributed; they are concentrated in particular places and groups. At the county level, median income estimates are generally reliable: 57 of Pennsylvania's 67 counties have an MOE below 5%, 10 fall between 5% and 10%, and none exceeds 10%. At the tract level the picture reverses. Flagging a tract when any of the three racial/ethnic estimates (white, Black, Hispanic/Latino) had an MOE above 15% captured all 804 tracts in Allegheny, Forest, and Philadelphia counties. Looking at each group separately shows where the uncertainty sits: the Hispanic/Latino estimate is flagged in 803 of 804 tracts (more than 99%) and the Black estimate in 795 (more than 98%), compared with 564 (about 70%) for the white estimate.
Black and Latino communities therefore face the greatest risk of algorithmic bias. The ACS estimates for these groups carry high margins of error in nearly every tract analyzed, so the figures are less likely to reflect the true population. With a less accurate picture of these communities, plans and community action are more likely to overlook them or to proceed without adequate consideration for them. A flag that only reports "any group above 15%" also hides this pattern, so data quality issues specific to these groups may go unnoticed.
The underlying drivers include the data collection method itself. The ACS surveys a sample of households on a continuing basis rather than attempting a complete count as the decennial Census does. Estimates for small populations, whether a small tract or a small group within a tract, rest on few sampled households and so carry larger margins of error. Additionally, groups that have been historically marginalized are less likely to be well covered in these data and may therefore be marginalized further.
Despite these challenges, ACS data offer a significant opportunity to understand change over time, because they are released more frequently than the Census. It is, however, necessary for the Department to address these systematic issues in order to use the data without perpetuating the marginalization of certain groups. The Department can adopt the flagging mechanism used here to set a threshold at which a margin of error is considered significant and record the flag alongside each estimate. When a high MOE is flagged, the Department should either collect further data or ensure that the flagged estimate is not the primary driver of any planning decision.
5.2 Specific Recommendations
Your Task: Create a decision framework for algorithm implementation.
# Create a summary table using your county reliability data
# Include: county name, median income, MOE percentage, reliability category
county_incomepop
# A tibble: 67 × 10
GEOID NAME total_popE total_popM median_incomeE median_incomeM county_name
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr>
1 42001 Adams … 104604 NA 78975 3334 Adams
2 42003 Allegh… 1245310 NA 72537 869 Allegheny
3 42005 Armstr… 65538 NA 61011 2202 Armstrong
4 42007 Beaver… 167629 NA 67194 1531 Beaver
5 42009 Bedfor… 47613 NA 58337 2606 Bedford
6 42011 Berks … 428483 NA 74617 1191 Berks
7 42013 Blair … 122640 NA 59386 2058 Blair
8 42015 Bradfo… 60159 NA 60650 2167 Bradford
9 42017 Bucks … 645163 NA 107826 1516 Bucks
10 42019 Butler… 194562 NA 82932 2164 Butler
# ℹ 57 more rows
# ℹ 3 more variables: MOE_per <dbl>, reliability <chr>, Low_Flag <int>
county_incomepop <- county_incomepop %>%
mutate(
algorithm_recommendation = case_when(
reliability == "High" ~
"Safe for algorithmic decisions",
reliability == "Moderate" ~
"Use with caution – monitor outcomes",
reliability == "Low" ~
"Requires manual review or additional data",
TRUE ~ NA_character_
)
)
county_algorithm <- county_incomepop %>%
select(county_name,median_incomeE,MOE_per,reliability,algorithm_recommendation)
# Add a new column with algorithm recommendations using case_when():
# - High Confidence: "Safe for algorithmic decisions"
# - Moderate Confidence: "Use with caution - monitor outcomes"
# - Low Confidence: "Requires manual review or additional data"
# Format as a professional table with kable()
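kable(
county_algorithm,
digits = 1,
col.names = c(
"County",
"Median Income",
"MOE Percentage",
"Reliability Category",
"Algorithm Recommendation"
),
caption = "County-Level Data Reliability and Algorithm Implementation Recommendations"
)
| County | Median Income | MOE Percentage | Reliability Category | Algorithm Recommendation |
|---|---|---|---|---|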
| Adams | 78975 | 4.2 | High | Safe for algorithmic decisions |
| Allegheny | 72537 | 1.2 | High | Safe for algorithmic decisions |
| Armstrong | 61011 | 3.6 | High | Safe for algorithmic decisions |
| Beaver | 67194 | 2.3 | High | Safe for algorithmic decisions |
| Bedford | 58337 | 4.5 | High | Safe for algorithmic decisions |
| Berks | 74617 | 1.6 | High | Safe for algorithmic decisions |
| Blair | 59386 | 3.5 | High | Safe for algorithmic decisions |
| Bradford | 60650 | 3.6 | High | Safe for algorithmic decisions |
| Bucks | 107826 | 1.4 | High | Safe for algorithmic decisions |
| Butler | 82932 | 2.6 | High | Safe for algorithmic decisions |
| Cambria | 54221 | 3.3 | High | Safe for algorithmic decisions |
| Cameron | 46186 | 5.6 | Moderate | Use with caution – monitor outcomes |
| Carbon | 64538 | 5.3 | Moderate | Use with caution – monitor outcomes |
| Centre | 70087 | 2.8 | High | Safe for algorithmic decisions |
| Chester | 118574 | 1.7 | High | Safe for algorithmic decisions |
| Clarion | 58690 | 4.4 | High | Safe for algorithmic decisions |
| Clearfield | 56982 | 2.8 | High | Safe for algorithmic decisions |
| Clinton | 59011 | 3.9 | High | Safe for algorithmic decisions |
| Columbia | 59457 | 3.8 | High | Safe for algorithmic decisions |
| Crawford | 58734 | 3.9 | High | Safe for algorithmic decisions |
| Cumberland | 82849 | 2.2 | High | Safe for algorithmic decisions |
| Dauphin | 71046 | 2.3 | High | Safe for algorithmic decisions |
| Delaware | 86390 | 1.5 | High | Safe for algorithmic decisions |
| Elk | 61672 | 6.6 | Moderate | Use with caution – monitor outcomes |
| Erie | 59396 | 2.6 | High | Safe for algorithmic decisions |
| Fayette | 55579 | 4.2 | High | Safe for algorithmic decisions |
| Forest | 46188 | 10.0 | Moderate | Use with caution – monitor outcomes |
| Franklin | 71808 | 3.0 | High | Safe for algorithmic decisions |
| Fulton | 63153 | 3.6 | High | Safe for algorithmic decisions |
| Greene | 66283 | 6.4 | Moderate | Use with caution – monitor outcomes |
| Huntingdon | 61300 | 4.7 | High | Safe for algorithmic decisions |
| Indiana | 57170 | 4.6 | High | Safe for algorithmic decisions |
| Jefferson | 56607 | 3.4 | High | Safe for algorithmic decisions |
| Juniata | 61915 | 4.8 | High | Safe for algorithmic decisions |
| Lackawanna | 63739 | 2.6 | High | Safe for algorithmic decisions |
| Lancaster | 81458 | 1.8 | High | Safe for algorithmic decisions |
| Lawrence | 57585 | 3.1 | High | Safe for algorithmic decisions |
| Lebanon | 72532 | 2.7 | High | Safe for algorithmic decisions |
| Lehigh | 74973 | 2.0 | High | Safe for algorithmic decisions |
| Luzerne | 60836 | 2.4 | High | Safe for algorithmic decisions |
| Lycoming | 63437 | 4.4 | High | Safe for algorithmic decisions |
| McKean | 57861 | 4.7 | High | Safe for algorithmic decisions |
| Mercer | 57353 | 3.6 | High | Safe for algorithmic decisions |
| Mifflin | 58012 | 3.4 | High | Safe for algorithmic decisions |
| Monroe | 80656 | 3.2 | High | Safe for algorithmic decisions |
| Montgomery | 107441 | 1.3 | High | Safe for algorithmic decisions |
| Montour | 72626 | 7.1 | Moderate | Use with caution – monitor outcomes |
| Northampton | 82201 | 1.9 | High | Safe for algorithmic decisions |
| Northumberland | 55952 | 2.7 | High | Safe for algorithmic decisions |
| Perry | 76103 | 3.2 | High | Safe for algorithmic decisions |
| Philadelphia | 57537 | 1.4 | High | Safe for algorithmic decisions |
| Pike | 76416 | 4.9 | High | Safe for algorithmic decisions |
| Potter | 56491 | 4.4 | High | Safe for algorithmic decisions |
| Schuylkill | 63574 | 2.4 | High | Safe for algorithmic decisions |
| Snyder | 65914 | 5.6 | Moderate | Use with caution – monitor outcomes |
| Somerset | 57357 | 2.8 | High | Safe for algorithmic decisions |
| Sullivan | 62910 | 9.3 | Moderate | Use with caution – monitor outcomes |
| Susquehanna | 63968 | 3.1 | High | Safe for algorithmic decisions |
| Tioga | 59707 | 3.2 | High | Safe for algorithmic decisions |
| Union | 64914 | 7.3 | Moderate | Use with caution – monitor outcomes |
| Venango | 59278 | 3.4 | High | Safe for algorithmic decisions |
| Warren | 57925 | 5.2 | Moderate | Use with caution – monitor outcomes |
| Washington | 74403 | 2.4 | High | Safe for algorithmic decisions |
| Wayne | 59240 | 4.8 | High | Safe for algorithmic decisions |
| Westmoreland | 69454 | 2.0 | High | Safe for algorithmic decisions |
| Wyoming | 67968 | 3.9 | High | Safe for algorithmic decisions |
| York | 79183 | 1.8 | High | Safe for algorithmic decisions |
Key Recommendations:
Your Task: Use your analysis results to provide specific guidance to the department.
Counties suitable for immediate algorithmic implementation: Most PA counties (57 of 67) have high-confidence data under these parameters, including Allegheny and Philadelphia. These counties’ data are solid at the county level, but as this analysis shows, planners must be aware of groups whose data are unreliable at smaller geographies even though nothing is flagged at the county level.
Counties requiring additional oversight: Cameron, Carbon, Elk, Forest (included in the tract-level analysis), Greene, Montour, Snyder, Sullivan, Union, and Warren counties have moderate-reliability data. The data may be usable, but planners relying on them should be aware of this flag and be prepared to adapt or to carry out further data collection.
Counties needing alternative approaches: PA has no counties with low confidence under these parameters. However, since all 804 tracts analyzed across three PA counties were flagged for high margins of error, planners should be aware of the subgroups within counties that may be overlooked or misrepresented by seemingly high-confidence data.
Questions for Further Investigation
To what extent can algorithmic bias be overcome? How much does confidence truly increase at the county scale compared with tracts, or do county-level estimates continue to miss the same overlooked, marginalized groups?
Technical Notes
Data Sources:
- U.S. Census Bureau, American Community Survey 2018-2022 5-Year Estimates
- Retrieved via tidycensus R package in February 2026
Reproducibility:
- All analysis conducted in R (RStudio version 2025.05.01+513)
- Census API key required for replication
- Complete code and documentation available at: https://mamole27.github.io/PPA_Portfolio/
Methodology Notes: Thresholds and parameters followed explicitly given guidelines in this assignment.
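Limitations: The tract-level analysis covers only three counties (Allegheny, Forest, and Philadelphia), which were chosen for convenience rather than by systematic sampling, so the tract-level findings may not generalize to the rest of the state.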
Submission Checklist
Before submitting your portfolio link on Canvas:
Remember: Submit your portfolio URL on Canvas, not the file itself. Your assignment should be accessible at your-portfolio-url/labs/lab_1/your_file_name.html