Data Analysis of Contraceptive use in Indonesia in 1987

Callie Nguyen
Aug 5, 2021

Authors: Khoa Nguyen, Cynthia Lai

Link to Jupyter notebook code: Github

I. Abstract

The Contraceptive Method Choice Dataset compiles data collected from the 1987 National Indonesia Contraceptive Prevalence Survey. This report investigates data modeling methods to predict a woman’s contraceptive usage based on socio-economic attributes such as education level, working status and standard of living.

II. Question of interest

Given these attributes, our data exploration focused on the relationship between socio-economic conditions and contraceptive methods. Specifically, we were interested in the impact of a woman’s education on her choice of family planning method or, in other terms, how accurately we could predict a woman’s choice of contraceptive (no-use, long-term or short-term) based on her education level. Notably, our study found that many other attributes, such as the husband’s education, are also key determinants of contraceptive method.

III. Data Cleaning / Feature Engineering

At first glance, the dataset is fairly straightforward without any missing values.

When we plotted the distribution of the number of children for working versus non-working women, one data point stood out: a woman with 16 children. Even when compared against a broader group of instances with similar attributes, 16 children was still an outlier on the boxplots, so this instance was dropped from the dataset used in the models.
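The boxplot rule behind this kind of decision can be sketched as follows. This is a minimal illustration on made-up values, not the survey data: the column name `num_children` is hypothetical, and the 1.5 × IQR fences are the conventional boxplot whisker rule, which we assume matches the plots described above.

```python
import pandas as pd

# Toy data for illustration only; `num_children` is a hypothetical column name.
df = pd.DataFrame({"num_children": [0, 1, 2, 2, 3, 3, 4, 5, 6, 16]})

q1 = df["num_children"].quantile(0.25)
q3 = df["num_children"].quantile(0.75)
iqr = q3 - q1

# Conventional boxplot fences: values beyond 1.5 * IQR from the quartiles are flagged.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (df["num_children"] < lower) | (df["num_children"] > upper)

outliers = df[is_outlier]
df_clean = df[~is_outlier].reset_index(drop=True)
```

In this toy sample only the value 16 falls outside the fences, so `df_clean` keeps the other nine rows.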

In our EDA, it was also clear from the graphs that encoding nominal variables as numbers was misleading, because it suggested an ordinal relationship between the levels of an attribute that does not exist. Some attributes were not labeled conventionally (i.e. 0 = Yes and 1 = No), so we flipped them for ease of interpretation. Although some attributes such as wife’s education or standard of living are labelled in order (i.e. 1–4 representing lowest to highest), the numerical distance between the labels does not represent the actual distance between the levels. To ensure that these labels are not misinterpreted as quantities in our models, the categorical variables were one-hot encoded. In this binary format, each level gets its own 0/1 feature, so the model can weigh each level independently.
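A minimal sketch of these two steps with pandas, on toy rows. The column names, the `wife_edu` prefix and the assumed original coding of `media_exposure` are illustrative assumptions, not taken from the notebook.

```python
import pandas as pd

# Toy rows; column names and codes are hypothetical stand-ins for the survey fields.
df = pd.DataFrame({
    "wife_education": [1, 2, 3, 4, 4],
    "media_exposure": [0, 0, 1, 0, 1],  # assumed original coding: 0 = yes, 1 = no
})

# Flip the unconventionally coded binary so that 1 means "yes".
df["media_exposure"] = 1 - df["media_exposure"]

# One-hot encode: each education level becomes its own 0/1 column.
encoded = pd.get_dummies(df, columns=["wife_education"], prefix="wife_edu", dtype=int)
```

Each row of `encoded` has exactly one of the four `wife_edu_*` columns set to 1, so no numerical spacing between levels is implied.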

On the same note, since the numerical variables are on different scales, they were normalized so that the models weigh them equally. However, this was done after EDA to keep the plots easy to interpret: a family with 3 children is more understandable than one with -0.11 standard units.
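Converting to standard units can be sketched as below. The values and column names are invented for illustration, and we assume a plain z-score (subtract the mean, divide by the standard deviation); the source does not say whether a library scaler was used instead.

```python
import pandas as pd

# Toy values; `wife_age` and `num_children` are hypothetical column names.
df = pd.DataFrame({
    "wife_age": [24, 45, 43, 42, 36],
    "num_children": [3, 10, 7, 9, 8],
})

numeric_cols = ["wife_age", "num_children"]

# z-score per column, using the population standard deviation (ddof=0).
means = df[numeric_cols].mean()
stds = df[numeric_cols].std(ddof=0)
standardized = (df[numeric_cols] - means) / stds
```

After this step every numeric column has mean 0 and standard deviation 1, which is why it is easier to read plots made before the transformation.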

Similarly, we found during EDA that it was difficult to ascertain what numerical labels meant for qualitative categorical features such as husband occupation. With further research into the context of the study, we replaced the labels of husband occupation with the appropriate categories from the survey (agriculture, manual, sales and professional). Ideally, a more quantitative label, such as average income for these industries, would have supported the model. Nonetheless, changing these labels assisted with interpretation especially while conducting EDA.

IV. EDA

Firstly, we wanted to determine which attributes were most correlated with the wife’s education and contraceptive choice. We created a heat map, using both depth of color and the numerical correlation value, and found that husband occupation, husband education and media exposure were most highly correlated with wife’s education.
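The numbers behind such a heat map come from a correlation matrix. The sketch below builds one on synthetic data (the columns are hypothetical, and `husband_education` is deliberately constructed to track `wife_education` so the toy example has some signal) and ranks features by absolute Pearson correlation.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
wife_edu = rng.integers(1, 5, size=n)  # levels 1..4

# Synthetic data for illustration only.
df = pd.DataFrame({
    "wife_education": wife_edu,
    # Built to follow wife_education closely, purely so the example has signal.
    "husband_education": np.clip(wife_edu + rng.integers(-1, 2, size=n), 1, 4),
    "num_children": rng.integers(0, 9, size=n),
})

corr = df.corr()  # Pearson correlation matrix

# Rank the other features by absolute correlation with wife_education.
ranked = (
    corr["wife_education"]
    .drop("wife_education")
    .abs()
    .sort_values(ascending=False)
)
```

Passing `corr` to a plotting call such as `seaborn.heatmap(corr, annot=True)` would give the colored grid with the numerical values overlaid.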

To gain a sense of our data, we first counted the instances using each type of birth control. As expected, the most common category was non-usage with 629 instances, followed by 511 women using short-term options and 333 women using long-term contraceptives.
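As a quick check, the three counts above can be turned into class shares. The largest share is also the accuracy that always predicting no-use would reach on the full dataset, which is a useful yardstick for the model scores later on.

```python
# Class counts as reported in the text above.
counts = {"no-use": 629, "short-term": 511, "long-term": 333}

total = sum(counts.values())

# Percentage share of each class, rounded to one decimal place.
shares = {method: round(100 * n / total, 1) for method, n in counts.items()}

# Share of the most common class.
majority_share = max(shares.values())
```

The counts sum to 1473 instances, with shares of 42.7% (no-use), 34.7% (short-term) and 22.6% (long-term).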

Our side-by-side bar charts visualize this trend within each standard of living level. Since raw counts partly reflect how many women fall into each standard of living level overall, we also include the proportional distribution of contraceptive choice by standard of living. Following our assumption that cost might be a factor, we used standard of living as an indicator of socio-economic status. Since long-term contraceptive options such as IUDs tend to be more expensive, our hypothesis was that women with higher socio-economic status would prefer long-term contraceptives. Our pivot table analysis suggested that, indeed, more than 60% of the women using long-term methods had a standard of living at the highest level.
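The difference between raw counts and proportional views can be sketched with `pandas.crosstab`. The rows below are invented and the column names are hypothetical; the point is only to show the two normalizations.

```python
import pandas as pd

# Invented rows for illustration; column names are hypothetical.
df = pd.DataFrame({
    "standard_of_living": [1, 1, 2, 3, 3, 4, 4, 4, 4, 4],
    "method": ["no-use", "short-term", "no-use", "short-term", "long-term",
               "long-term", "long-term", "no-use", "short-term", "long-term"],
})

# Within each method, the share coming from each living standard (columns sum to 1).
by_method = pd.crosstab(df["standard_of_living"], df["method"], normalize="columns")

# Within each living standard, the share using each method (rows sum to 1).
by_level = pd.crosstab(df["standard_of_living"], df["method"], normalize="index")
```

A statement such as "more than 60% of long-term users are at the highest level" corresponds to the column-normalized table (`by_method`), while the proportional bar charts per level correspond to the row-normalized one (`by_level`).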

Initially, our assumption was that the contraceptive method would have an impact on the number of children: women who were not using contraceptives would be more likely to want, and to have, more children. Surprisingly, however, we found that women using short-term and long-term contraceptives have more children on average.

We recognized that a woman’s age may affect the contraceptive she uses and wanted to quantify this confounding factor, which is shown in the boxplot below:

Now that there was a clear relationship between socioeconomic status and contraceptive method, we wanted to investigate whether there was a relationship between the wife’s education and socioeconomic status that would improve our prediction of contraceptive method.

Firstly, we wanted to quantify the effect of the wife’s working status. Two working adults in a household are more likely to have a higher level of income, thus affecting contraceptive usage. Our EDA found that a higher proportion of women using long-term contraceptives were working and, similarly, that highly educated women were more likely to be working. This could suggest that highly educated women are more likely to work and thus opt for long-term contraceptives that would not interrupt their careers.

After dividing husband occupation into agriculture, manual, sales and professional, we compared the ratio of working to non-working women with husbands in each of these occupations, as well as the distribution of wife’s education level within each occupation. We assumed that men in agricultural occupations are less likely to have a high salary and household standard of living, whereas men in professional occupations are more likely to have both. Our graphs indicated that while there wasn’t a clear difference in working/non-working ratios between occupations (the higher agriculture ratio may be skewed by its small sample size), wives with higher education levels make up a larger proportion of the households where the husband’s occupation has a higher assumed salary (professional and sales). Conversely, wives with lower education levels tend to have husbands working in agriculture or manual occupations.

We found similar results when comparing husband and wife education levels, so we aggregated their education levels with a mean and compared it against standard of living in a scatter plot. As expected, households with higher aggregate education are more likely to have a high standard of living, which would affect the contraceptive method of choice.

Finally, to reduce the dimensionality of our data and gauge which features may be strong indicators of contraceptive use, we conducted Principal Component Analysis on our original data set (before one-hot encoding) for clarity. The first three principal components appeared to explain a majority of the variance. After plotting the feature weightings of the first three components, we found that the features most likely to provide insight into contraceptive methods in our models are wife age, wife education, husband education and husband occupation.
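A sketch of how explained variance and feature weightings can be read off a fitted PCA in scikit-learn. The data here is synthetic, the column names are hypothetical, and standardizing before PCA is our assumption; the source does not say which preprocessing preceded this step.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 300
wife_edu = rng.integers(1, 5, size=n)

# Synthetic stand-in for the label-encoded (not one-hot encoded) table.
df = pd.DataFrame({
    "wife_age": rng.integers(16, 50, size=n),
    "wife_education": wife_edu,
    "husband_education": np.clip(wife_edu + rng.integers(-1, 2, size=n), 1, 4),
    "num_children": rng.integers(0, 9, size=n),
})

# Put all columns on the same scale so no feature dominates by units alone.
X = StandardScaler().fit_transform(df)
pca = PCA(n_components=3).fit(X)

# Fraction of total variance explained by each of the three components.
explained = pca.explained_variance_ratio_

# Feature weightings (loadings): one row per feature, one column per component.
loadings = pd.DataFrame(pca.components_.T, index=df.columns, columns=["PC1", "PC2", "PC3"])
```

Features with large absolute values in a column of `loadings` contribute most to that component, which is the kind of reading used to shortlist features above.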

V. Modelling

Since our response variable is categorical, we find it appropriate to use Logistic Regression, Decision Tree and Random Forest as classifiers for our models.

Using Logistic Regression to predict contraception from only the numerical features, our base model yields a mediocre 10-fold CV accuracy of 47%. Because there are 3 classes of birth control methods (no-use / short-term / long-term) to classify, this is a multiclass classification problem. We are therefore better off using Multiclass Logistic Regression (MLR) with the one-versus-rest approach, where the multiclass prediction problem is divided into separate binary prediction problems. One thing to note is that, while the training accuracy and the CV accuracy vary across iterations of our logistic regression models, the two stay close to each other within each model, which suggests that the logistic regression models are not overfitting our data. On top of that, we combined MLR with lists of highly correlated or heavily weighted features in an attempt to improve on our base model. In the end, the model using MLR on all features yields the best result among all LR models. The model improvement score is 3%.
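The one-versus-rest setup with 10-fold cross-validation can be sketched with scikit-learn as below. The data is a synthetic 3-class problem from `make_classification`, and wrapping `LogisticRegression` in `OneVsRestClassifier` is one way to get this behavior; we do not know which exact configuration the notebook used.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.multiclass import OneVsRestClassifier

# Synthetic 3-class data standing in for no-use / short-term / long-term.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=0
)

# One binary logistic regression per class; the most confident one wins.
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))

# 10-fold cross-validated accuracy.
cv_scores = cross_val_score(ovr, X, y, cv=10, scoring="accuracy")

# Training accuracy on the full data, for comparison with the CV mean.
train_acc = ovr.fit(X, y).score(X, y)
```

Comparing `train_acc` with `cv_scores.mean()` is the check described above: a small gap suggests the model is not overfitting, regardless of how high the accuracy itself is.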

With Decision Tree, the base model predicting contraception from all given features introduces 2 alarming problems: first, as expected, a fairly low accuracy score of 46%, and second, a high training accuracy of 97%, which is an indicator of overfitting. From what we know about decision trees, the overfitting likely comes from the tree growing too deep, with too many branches, in an attempt to make every leaf pure. The simplest solution is to stop the tree from growing by setting its maximum depth. Since we were already using scikit-learn’s GridSearchCV to pick the best max_depth, we also tuned other hyperparameters such as criterion and min_samples_leaf. As expected, limiting the depth of the tree not only improved the CV accuracy of the models but also took care of the overfitting problem in the base model. Both hyperparameter-tuned models yield better results, with the first one having an only slightly better score of 55.97% since we also tuned its min_samples_leaf parameter. The model improvement score is 9.9%.
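A minimal sketch of this tuning step with `GridSearchCV`. The three hyperparameters are the ones named above, but the candidate values in the grid and the synthetic data are illustrative assumptions, not the notebook's settings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class data for illustration only.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=0
)

# Candidate values are assumptions chosen to keep the example small.
param_grid = {
    "max_depth": [2, 3, 5, 8],
    "criterion": ["gini", "entropy"],
    "min_samples_leaf": [1, 5, 10],
}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    cv=10,                    # 10-fold CV, as in the text
    scoring="accuracy",
    return_train_score=True,  # keep training scores to inspect overfitting
)
search.fit(X, y)

best_params = search.best_params_
best_cv_accuracy = search.best_score_
```

With `return_train_score=True`, `search.cv_results_` also holds `mean_train_score` for every combination, so the gap between training and CV accuracy can be compared across depths.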

Similar to the challenge we faced with Decision Tree, the Random Forest Classifier also yields a low 10-fold CV score of 50% with a high training score of 94%. This is expected because a Random Forest is built from multiple Decision Trees. Our approach stayed the same: we used GridSearchCV to optimize the hyperparameters and train our models. The 2 parameters of interest are the depth of the trees (max_depth) and the number of trees in the forest (n_estimators), chosen to tackle overfitting without sacrificing the speed of the model. Fine-tuning these 2 parameters resulted in better Random Forest models and higher CV scores; other parameters also helped improve scores, but the change was not noticeable. However, the training accuracy is still a little high, indicating that we might have slightly overfit the models. The model improvement score is 5%.
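The same GridSearchCV pattern applies here with max_depth and n_estimators in the grid. The sketch below instead illustrates the overfitting check itself: it compares training and 10-fold CV accuracy for an unrestricted forest and a depth-limited one. The data is synthetic and the specific values (50 trees, depth 3) are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Synthetic 3-class data for illustration only.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=0
)

results = {}
for depth in (None, 3):  # None lets every tree grow until its leaves are pure
    forest = RandomForestClassifier(n_estimators=50, max_depth=depth, random_state=0)
    cv = cross_validate(
        forest, X, y, cv=10, scoring="accuracy", return_train_score=True
    )
    results[depth] = {
        "train": cv["train_score"].mean(),  # mean accuracy on the training folds
        "cv": cv["test_score"].mean(),      # mean accuracy on the held-out folds
    }
```

A large difference between `results[depth]["train"]` and `results[depth]["cv"]` is the overfitting signal described above; limiting the depth typically narrows that gap.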

From this, we obtained 10 different models to test and compare, with accuracy as the metric of success. After compiling a table of all their training accuracy scores and 10-fold CV scores, Hyperparameter Tuned Tree 1 seemed to generalize the best among all models. We decided to move forward with it as our final model because it gave a good CV score without overfitting.

This model achieved a test accuracy of 56.25%. This means that among 1473 women, based on the information we have, we would expect to correctly classify the contraceptive method chosen by roughly 825 of them. This result is not as high as we would like, but it is still the highest score we obtained under sensible assumptions and methods.

VI. Discussion

i. Two most interesting observations we came across:

A noticeable pattern was the strong relationship between the husband’s and the wife’s education, specifically when both are highly educated. According to the heatmap and EDA, their correlation is notably higher than that of any other relationship involving education levels. At the same time, the wife’s education also has the highest correlation with the type of contraceptive chosen. This might point to a more meaningful connection between the household’s education levels and the choice of family planning method.

ii. A feature we thought would be useful, but turned out to be ineffective:

We initially believed that the number of children in each family would be a useful feature with significant weighting in determining contraceptive use. Our hypothesis was that women who want children are more likely to have a higher number of children and would not be using contraceptives. However, our EDA found the opposite result.

This inconsistency with our assumptions is likely due to the survey being a snapshot in time. It better reflects two groups: women who had not yet had their desired number of children, and thus were not using contraception and had fewer children, and women who already had their desired number of children and were thus on long-term or short-term contraceptives to avoid having more.

iii. Limitations & uncertainties of the analysis

The biggest limitation we faced was that the data is heavily focused on the socio-economic aspects of families and lacks information about other factors, such as geographic location or the price of contraception, which might act as confounding variables. Thus, we were unable to build a model that generalizes better, which is reflected in the mediocre accuracy of the final model. Specifically, with Logistic Regression, we trained one of our models on carefully chosen education level and standard of living features under the assumption that they were good predictors of contraceptive method, but this did not yield any improvement. It showed that all features taken together are better indicators than specific attributes, even though those attributes have high correlation and are heavily weighted.

iv. Ethical dilemmas with this data:

When we were looking into the relationship between a household’s socio-economic status and the wife’s education level, taking into account the context of the dataset, we assumed that a highly educated wife tended to marry a highly educated husband with a well-paying job, and vice versa. This is a problematic assumption because it reflects gender-role stereotypes, especially in traditional Asian countries like Indonesia where women are expected to rely on their husbands for financial support.

In a more holistic picture, the socioeconomic attributes might be too shallow to be used as the main indicators of the choice of contraception. Many factors might affect a woman’s choice, including but not limited to her access to healthcare, the affordability of the methods in a given country (Indonesia in this case), or simply her personal preferences or beliefs. These are unknown variables that we, unfortunately, are not well informed about and hence cannot make a good judgement on. Using a household’s financial health alone might lead us to draw biased causal conclusions about low-income people, which in turn deepens healthcare-related stereotypes about poor families. We might address this concern by understanding and acknowledging the limitations of the dataset, and potentially by acquiring more data that specifically covers these missing attributes, to give a more objective understanding of the matter at hand.
