Scraping and classifying articles to analyze Corporate Rebranding

I. Introduction

  • What kinds of companies engage in name rebranding?
  • What factors motivate these rebrandings?

II. Data Collection

1. Dataset exploration

  • Search using the selected keywords
  • Download the full documents in .doc format as a zip file
  • Download the result list for ‘News’
  • Open the result list in Excel and set up VBA as below
  1. Full text: can only download 100 articles per batch
  2. Result list: can only download 250 articles per batch
  3. Can only download the first 1000 results returned by any search
  4. “Please limit your selection to 100 of just the top 1000 results.”
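
The limits above determine how many download batches a search needs. The following is a minimal sketch of that arithmetic, not the workflow actually used: the function name batch_ranges and the figure of 2340 hits are hypothetical, and only the 100, 250, and 1000 limits come from the list above.

```python
def batch_ranges(total_hits, batch_size, cap=1000):
    """Return inclusive 1-based (start, end) ranges for each download batch.

    cap models the limit that only the first 1000 results of a search
    can be downloaded; anything beyond it is unreachable.
    """
    reachable = min(total_hits, cap)
    return [
        (start, min(start + batch_size - 1, reachable))
        for start in range(1, reachable + 1, batch_size)
    ]


# Hypothetical search returning 2340 hits (illustrative number only).
full_text = batch_ranges(total_hits=2340, batch_size=100)    # full-text limit
result_list = batch_ranges(total_hits=2340, batch_size=250)  # result-list limit

print(len(full_text), full_text[0], full_text[-1])  # 10 (1, 100) (901, 1000)
print(len(result_list), result_list[-1])            # 4 (751, 1000)
```

A search with more than 1000 hits therefore needs narrower keywords or date ranges to reach the remaining articles, since no batch can start past result 1000.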

2. Exploratory Data Analysis

III. Data modeling

1. Classification

  • Fuzzy matching
  • Cleaning
  • Feature Selection
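
As an illustration of the fuzzy matching step, here is a minimal sketch using only Python's standard library. The outline does not say what was matched or which tool was used, so this assumes the goal is to link a company name found in an article to a reference list of names. The company names, the suffix list, the 0.85 threshold, and the helper names normalize and best_match are all hypothetical.

```python
import re
from difflib import SequenceMatcher

# Assumed list of legal suffixes to ignore when comparing names.
SUFFIXES = {"inc", "corp", "corporation", "co", "ltd", "llc", "plc"}


def normalize(name):
    """Lowercase, strip punctuation, and drop common legal suffixes."""
    tokens = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    return " ".join(t for t in tokens if t not in SUFFIXES)


def best_match(query, candidates, threshold=0.85):
    """Return (candidate, score) for the closest name, or (None, score)
    when no candidate reaches the similarity threshold."""
    q = normalize(query)
    scored = [
        (SequenceMatcher(None, q, normalize(c)).ratio(), c) for c in candidates
    ]
    score, name = max(scored, default=(0.0, None))
    return (name, score) if score >= threshold else (None, score)


reference = ["ACME Corporation", "Apex Industries Inc"]  # made-up names
print(best_match("Acme Corp.", reference))  # ('ACME Corporation', 1.0)
```

Normalizing before scoring matters more than the choice of similarity measure here: without it, "Corp." versus "Corporation" alone would pull the score of a true match below most reasonable thresholds.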

2. Next steps

  1. Imbalanced dataset: the dataset used is small and has a larger proportion of 1’s than 0’s. A larger dataset with more equal proportions of 1’s and 0’s would help make the classifier more robust. Alternatively, a large imbalanced dataset could be used if the class weights in the models are adjusted. GANs (Generative Adversarial Networks) could also potentially be used to generate additional training data.
  2. Using PCA: instead of plain feature selection, other tools could be used for feature extraction. A deeper study of the textual data could help generate new features. PCA could also be used to condense the large document-term matrices (DTMs) into fewer dimensions that could be fed into ML models.
  3. Better cleaning: the custom tokenizer could be improved and coupled with other cleaning methods that remove things like newline characters, tabs, etc. from the textual data.
  4. Using different models: only the ensemble methods XGBoost and Random Forest were used in this project. Simple multi-layer perceptrons (MLP), support vector machines (SVM), and perhaps even logistic regression models could be examined, although it is likely that the ensemble methods and MLPs perform better for such classification problems.
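
The class-weight idea in item 1 can be made concrete with a small sketch. It uses the common "balanced" heuristic, n_samples / (n_classes * class_count), which is also what scikit-learn applies when class_weight="balanced" is passed to a model such as RandomForestClassifier. The 80/20 label split is invented for illustration and is not the actual distribution of the dataset; the function name balanced_class_weights is likewise hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency:
    n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical imbalanced labels: 80 articles labeled 1, 20 labeled 0.
labels = [1] * 80 + [0] * 20
weights = balanced_class_weights(labels)
print(weights)  # {1: 0.625, 0: 2.5}
```

With these weights, each misclassified 0 costs four times as much as a misclassified 1 during training, which offsets the 4:1 imbalance without discarding or synthesizing any articles.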



