Multiple Hypothesis Testing: The average annual rates of lung cancer and patients’ races

Feb 16, 2022

Author: Khoa Nguyen

1. Research Question

At state level, are the average annual rates of lung cancer incidence different for people of a certain race as compared to another race?

2. Overview

This project uses multiple hypothesis tests to investigate the relationship between lung cancer rates and racial or ethnic groups. The tests aim to determine whether lung cancer incidence rates differ significantly across patients of different racial backgrounds. At the same time, the analysis emphasizes that any such differences should not be read as causal, because other socioeconomic factors are at play.

3. Dataset

a. Data Overview

I have taken my data from the CDC Annual State-Level U.S. Chronic Disease Indicators set (link).

This data set is extremely large and gives information on several major diseases and indicators in the US. It contains a mix of sample and census data, as it is compiled from various studies and data sources.

b. Data Cleaning

The original complete data set has 127 thousand rows and 33 columns, with data from several sources including death and birth certificates, legal research, school health profiles, and statewide central cancer registries. Because my research focuses on lung cancer and race at the state level, I work on a specific subset of the original data set, filtered to include only studies on lung cancer incidence. After filtering and cleaning, all remaining data comes from a single source, statewide central cancer registries, which record the data as census data. This was a by-product of the cleaning rather than a deliberate choice. In terms of granularity, each row of the filtered data frame represents a study conducted over 4 years, stratified by race, and gives the average annual age-adjusted rate for that racial group in that state.

4. EDA

Figure 1: The discrepancy between annual incidence rates of different cancers among the 5 racial groups of interest

Figure 1 shows large disparities among the groups for every type of cancer except lung and oral cancers. Of these two, I choose lung cancer: because its points lie closest together, it is not obvious from the plot whether the groups differ, which makes the differences worth testing formally.

Figure 2: The geographic distribution of lung cancer rates among the 5 racial groups of interest

For Figure 2, geographically speaking, there are differences in the states’ cancer rates between the racial groups. In certain regions, some groups show higher cancer rates than others. However, due to the limitations of the graphs, the color codings are not on the same scale, which makes it difficult to inspect this phenomenon visually.

Comments: Looking at the density plot above, I observe that the rates are distributed differently. Most groups are fairly normally distributed, except for the American Indian or Alaska Native group, whose distribution is prominently bimodal. The Asian and Hispanic groups have similar means, as do the White and Black groups. Variances differ vastly from one group to the next, so the homoscedasticity assumption fails and ANOVA is not an appropriate test. This supports the use of multiple hypothesis testing, which is discussed in detail in the next section.
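The unequal-variance observation can also be checked formally rather than by eye. Below is a minimal sketch using Levene's test from SciPy. The three groups and their spreads are synthetic values made up for illustration, not the registry data used in this analysis.

```python
import numpy as np
from scipy import stats

# Synthetic state-level rates for three hypothetical groups (NOT the real data):
# equal sample sizes, deliberately different spreads.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=60, scale=3, size=46)
group_b = rng.normal(loc=35, scale=8, size=46)
group_c = rng.normal(loc=45, scale=20, size=46)

# Levene's test: the null hypothesis is that all groups share the same variance.
stat, p_value = stats.levene(group_a, group_b, group_c)

# A small p-value is evidence against equal variances (homoscedasticity).
unequal_variances = p_value < 0.05
print(f"Levene statistic = {stat:.2f}, p-value = {p_value:.4g}")
```

If the test rejects on the real rates, that would back up the visual impression that ANOVA's equal-variance assumption is not met.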

Within the data filtered by cancer, there are multiple data value types, but I choose the age-adjusted average annual rate. Groups other than White are minorities with smaller populations, so raw counts would skew the results. Moreover, since my question of interest concerns statewide data, I exclude all rows whose location is the United States. It is worth noting that this data frame is missing data for 4 states (Delaware, Illinois, Kentucky, and Nevada), so the analysis does not represent the whole American population.

It is also important to understand that I may be excluding certain groups from my analysis through data cleaning. Specifically, the racial stratification started with 8 groups, but after filtering for lung and bronchus cancers, I noticed that the data footnotes mark some observations as unusable. Removing these points left only 5 groups instead of 8: Hispanic; White, non-Hispanic; Black, non-Hispanic; Asian or Pacific Islander; and American Indian or Alaska Native. The other 3 groups are likely missing because of the way state cancer registries work: they do not capture cancer incidence among people who are not identified as “cancer patients” or who were unable to receive medical treatment, including many indigenous and undocumented people.
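The filtering steps described in this section can be sketched with pandas. This is only an illustration: the tiny data frame, its column names (`LocationDesc`, `DataValueType`, `Stratification1`, `DataValue`), and every value in it are assumptions for demonstration, not the actual CDC file.

```python
import pandas as pd

# Tiny made-up frame; column names and values are assumptions for illustration only.
cdi = pd.DataFrame({
    "LocationDesc": ["Alabama", "Alabama", "United States", "Alaska"],
    "DataValueType": [
        "Average Annual Age-adjusted Rate",
        "Average Annual Number",
        "Average Annual Age-adjusted Rate",
        "Average Annual Age-adjusted Rate",
    ],
    "Stratification1": ["Hispanic"] * 4,
    "DataValue": [30.1, 25.0, 33.4, None],  # None stands in for an unusable observation
})

lung = cdi[
    (cdi["DataValueType"] == "Average Annual Age-adjusted Rate")  # keep age-adjusted rates only
    & (cdi["LocationDesc"] != "United States")                    # drop the national row
].dropna(subset=["DataValue"])                                    # drop unusable observations

print(lung)
```

In this toy frame only the first Alabama row survives all three filters.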

5. Multiple Hypothesis Testing

a. Methods

There are 2 main reasons to use multiple hypothesis testing. First and foremost, the probability of observing at least one significant result just due to chance is:

𝑃 (at least one significant result)

= 1−𝑃 (no significant results)

= 1−(1−.05)¹⁰

≈ 0.40

With 10 tests to consider, there is a 40% chance of observing at least one significant result even if every null hypothesis is true (assuming independent tests); this number would grow if I had more treatments to consider. Secondly, from the EDA, I do not observe homoscedasticity among the average annual rates of the 5 groups, so the equal-variance assumption of ANOVA does not hold. Hence I cannot test the null hypothesis 𝐻0: μ1 = μ2 = μ3 = μ4 = μ5 with ANOVA. Instead, I test the pairwise differences of means using multiple testing.
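The calculation above generalizes to any number of tests. Here is a minimal sketch; the function name is hypothetical, and it assumes the tests are independent and every null hypothesis is true.

```python
def prob_at_least_one_false_positive(alpha: float, n_tests: int) -> float:
    """P(at least one p-value <= alpha), assuming independent tests and all nulls true."""
    return 1 - (1 - alpha) ** n_tests


# How the chance of a spurious "discovery" grows with the number of tests.
for n_tests in (1, 5, 10, 20):
    print(n_tests, round(prob_at_least_one_false_positive(0.05, n_tests), 3))
```

For 10 tests this reproduces the roughly 40% figure, and the probability keeps climbing as more comparisons are added.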

With 5 treatments (groups of race/ethnicity), namely Hispanic; White, non-Hispanic; Black, non-Hispanic; Asian or Pacific Islander; and American Indian or Alaska Native, and 1 outcome, lung cancer, I compare their average annual rates of lung cancer pairwise. This generates 10 hypothesis tests and 10 corresponding P-values.
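To see where the 10 tests come from, the pairs can be enumerated directly. This is a small sketch using the standard library; the label strings are just illustrative spellings of the 5 groups.

```python
from itertools import combinations

groups = [
    "Hispanic",
    "White, non-Hispanic",
    "Black, non-Hispanic",
    "Asian or Pacific Islander",
    "American Indian or Alaska Native",
]

# Every unordered pair of distinct groups: C(5, 2) = 10 comparisons.
pairs = list(combinations(groups, 2))

for first, second in pairs:
    print(f"{first}  vs  {second}")
```

The order produced here (each group against every later group) matches a nested loop over i < j.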

Let R be a random variable representing a population’s Average Annual Rate. The following hypotheses will be tested:

H0: Ri = Rj , with i and j representing 2 different races
H1: Ri ≠ Rj

To calculate each test statistic, I compute the mean and standard error for each group, perform a two-sample t-test on the difference of each pair of means, and obtain the corresponding P-values.

Naively, each of these P-values can be compared against the significance level 0.05 to obtain discoveries, ignoring the multiple testing.

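import numpy as np
from scipy import stats

def obtain_pvals(m, var, n):
    # Two-sided p-value for every pair (i, j), i < j, of the 5 groups
    p_vals = np.array([])
    for i in np.arange(5):
        for j in np.arange(i+1, 5):
            # difference of means over the unpooled standard error
            t = (m[i] - m[j]) / np.sqrt(var[i]/n[i] + var[j]/n[j])
            df = n[i] + n[j] - 2
            p = 2*(1 - stats.t.cdf(abs(t), df=df))
            p_vals = np.append(p_vals, p)
    # drop pairs whose p-value could not be computed
    p_vals = p_vals[~np.isnan(p_vals)]
    return p_vals

p_values = obtain_pvals(race.AvgAnnualRates, race.VarAnnualRates, race.n)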
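As a cross-check on a single pair, SciPy can run a two-sample t-test directly from summary statistics. The sketch below uses `scipy.stats.ttest_ind_from_stats` with `equal_var=False` (Welch's test), which has the same test statistic as `obtain_pvals` but uses Welch-Satterthwaite degrees of freedom instead of n_i + n_j - 2, so the P-values can differ slightly. The means, variances, and sample sizes are hypothetical, not values from this analysis.

```python
import numpy as np
from scipy import stats

# Hypothetical summary statistics for two groups (NOT the values from this analysis).
mean_1, var_1, n_1 = 62.0, 110.0, 46
mean_2, var_2, n_2 = 35.0, 60.0, 44

# Manual statistic, as in obtain_pvals: difference of means over the unpooled standard error.
t_manual = (mean_1 - mean_2) / np.sqrt(var_1 / n_1 + var_2 / n_2)

# Welch's t-test from summary statistics; note it expects standard deviations, not variances.
result = stats.ttest_ind_from_stats(
    mean_1, np.sqrt(var_1), n_1,
    mean_2, np.sqrt(var_2), n_2,
    equal_var=False,
)

print(f"manual t = {t_manual:.3f}, scipy t = {result.statistic:.3f}, p = {result.pvalue:.3g}")
```

Welch's degrees of freedom are the more common choice when variances are unequal, which is the situation observed in the EDA.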

However, as discussed above, this naive approach inflates the chance of at least one false positive across the 10 tests. This calls for a correction that keeps the error rate at or below my desired significance level. Here, 2 correction methods are chosen: the Bonferroni correction and the Benjamini-Hochberg procedure.

With the Bonferroni adjustment, I simply reject any hypothesis with
P-value ≤ 0.05/10 = 0.005.

def bonferroni(p_values, alpha = 0.05):
    n = len(p_values)
    # reject where the p-value is at most alpha divided by the number of tests
    decisions = p_values <= alpha/n
    return decisions

bon = bonferroni(p_values)
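To make the effect of the stricter threshold concrete, here is a small worked sketch on ten made-up p-values. They are for illustration only and are not the p-values from this analysis.

```python
import numpy as np

# Ten made-up p-values (illustration only, not the results of this analysis).
example_p = np.array([0.0001, 0.0004, 0.001, 0.002, 0.004,
                      0.008, 0.012, 0.03, 0.048, 0.6])
alpha = 0.05

# Naive thresholding compares each p-value with alpha itself.
naive_rejections = example_p <= alpha

# Bonferroni compares each p-value with alpha divided by the number of tests.
threshold = alpha / len(example_p)
bonferroni_rejections = example_p <= threshold

print("naive rejections:     ", int(naive_rejections.sum()))
print("Bonferroni rejections:", int(bonferroni_rejections.sum()))
```

With these numbers the naive rule rejects 9 hypotheses, while Bonferroni keeps only the 5 with p-values at or below 0.005.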

On the other hand, to control FDR at level δ = 0.05, I will use the Benjamini-Hochberg Procedure as described below:

1. Order the unadjusted p-values: p1 ≤ p2 ≤ … ≤ p10
2. Find the test with the highest rank j for which:

𝑝j ≤ (𝑗/10) × 0.05

3. Declare the tests of rank 1, 2, …, j as significant

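def benjamini_hochberg(p_values, alpha = 0.05):
    n = len(p_values)
    sorted_p = np.sort(p_values)

    # 0-based ranks whose sorted p-value falls at or under the BH line
    passing = [k for k in range(n) if sorted_p[k] <= (k + 1)*(alpha/n)]
    if not passing:
        # no p-value passes, so nothing is rejected
        return np.zeros(n, dtype=bool)
    threshold = sorted_p[max(passing)]
    decisions = p_values <= threshold
    return decisions

bh = benjamini_hochberg(p_values)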
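The procedure is easier to follow with the ranks and thresholds written out. Below is a step-by-step sketch on ten made-up p-values (illustration only, not the p-values from this analysis); the variable names are my own.

```python
import numpy as np

# Ten made-up p-values (illustration only, not the results of this analysis).
example_p = np.array([0.0001, 0.0004, 0.001, 0.002, 0.004,
                      0.008, 0.012, 0.03, 0.048, 0.6])
alpha = 0.05
m = len(example_p)

# Step 1: order the p-values and attach ranks 1..m.
order = np.argsort(example_p)
sorted_p = example_p[order]
ranks = np.arange(1, m + 1)

# Step 2: find the highest rank whose p-value is at or below (rank / m) * alpha.
bh_line = ranks / m * alpha
below = sorted_p <= bh_line
k = int(ranks[below].max()) if below.any() else 0

# Step 3: reject the hypotheses of rank 1..k (mapped back to the original order).
bh_rejections = np.zeros(m, dtype=bool)
bh_rejections[order[:k]] = True

for r, p, line in zip(ranks, sorted_p, bh_line):
    print(f"rank {r:2d}: p = {p:<7} threshold = {line:.3f}  passes: {p <= line}")
print("highest passing rank:", k)
```

Here the p-value of 0.048 fails its rank-9 threshold of 0.045, so the procedure rejects the 8 smallest p-values. That sits between what naive thresholding at 0.05 and the Bonferroni cut-off of 0.005 would reject on the same numbers.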

b. Results

Before correction, using a significance level of 0.05, most hypotheses are deemed statistically significant; the exception is the difference in mean rates between the Asian and Hispanic populations. This means that, with naive thresholding, I do not have sufficient evidence to say that the average annual lung cancer rate of Asian patients differs from that of Hispanic patients.

The Bonferroni correction ensures that the family-wise Type I error rate, i.e. the probability of at least one false positive (FP), stays at or below α across all 10 hypothesis tests. An FP in this case means incorrectly declaring the difference between the mean rates of 2 given populations significant when, in reality, the 2 means are the same. However, the Bonferroni correction is conservative: it carries a high risk of failing to detect a difference of 2 means that is real. In this context, a certain number of FPs is acceptable, and it is more desirable to control the False Discovery Rate (FDR), so the second correction method of my choice is the Benjamini-Hochberg procedure.

The Benjamini-Hochberg procedure controls the FDR, the expected proportion of FPs among all rejections. In context, I am trying to limit the share of my rejections in which I mistakenly reject the null hypothesis that 2 given means are equal.
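A quick simulation can illustrate what controlling the chance of any false positive means in practice. The sketch below assumes all 10 null hypotheses are true and that the tests are independent, which is a simplification since pairwise comparisons share groups. Under a true null, a p-value is uniformly distributed on [0, 1].

```python
import numpy as np

rng = np.random.default_rng(42)
n_sim, n_tests, alpha = 100_000, 10, 0.05

# Each row is one simulated "study" of 10 tests with every null hypothesis true.
# Independence across the 10 tests is a simplifying assumption.
p = rng.uniform(size=(n_sim, n_tests))

# Fraction of studies with at least one false positive under each rule.
fwer_naive = (p <= alpha).any(axis=1).mean()
fwer_bonferroni = (p <= alpha / n_tests).any(axis=1).mean()

print(f"naive thresholding: {fwer_naive:.3f}")
print(f"Bonferroni:         {fwer_bonferroni:.3f}")
```

The naive rate should land near the 40% computed earlier, while the Bonferroni rate stays just under 5%.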

c. Discussion

Most hypotheses remain significant after the correction procedures, with 2 exceptions. Specifically, the test of the difference in means between the Asian or Pacific Islander and Hispanic groups is not significant under either the Benjamini-Hochberg procedure or the Bonferroni correction. Meanwhile, the test comparing the average annual rates of the Black, non-Hispanic and White, non-Hispanic groups is not significant after the Bonferroni adjustment but survives the Benjamini-Hochberg procedure.

Individual tests point to significant differences between any 2 given races/ethnicities, with the only exception discussed above. Overall, I can say with confidence that the average annual rate of lung cancer incidence differs for patients of different races. However, this does not indicate any causal relationship. In other words, these discrepancies in lung cancer rates cannot be attributed to genetics on this evidence, as there are other, more plausible confounding factors to consider, such as socioeconomic background or access to healthcare.

This dataset is well collected and presented, but not without room for improvement. If there were overlapping data on cancer patients and smokers, I would be able to test the hypothesis that smokers of certain races have a higher risk of getting cancer. On top of that, given socioeconomic data, I could also perform multiple testing with causal inference to take a closer look at the confounding factors behind the relationship between race and lung cancer, and support the conclusions above with better evidence.

6. Conclusion

When looking at the relationship between lung cancer incidence and race, the key finding is that the rates differ between most pairs of races. This is important because there have been remarkable improvements in cancer treatment, specifically for invasive lung cancer, that increase the chance of success and reduce patient mortality. However, it is also necessary to ensure that patients from all backgrounds have equal access to medical care in order to benefit from these treatments. Healthcare providers need to take this into consideration, not to single out patients, but to make cancer screening and treatment more accessible to everyone.

Right now, the analysis focuses on state-level data. However, this does not reflect other social factors such as regional economy, due to the lack of geographic detail. Building on this work, a sensible next direction would be to test a similar hypothesis at the county level, comparing patients from low-income areas with those from higher-income areas.