[OC] Entity Treemap from 50,000+ News Articles

Data source:
Articles collected from ~20 major global news outlets, covering 2025 (e.g. BBC, Reuters, NPR, The Guardian, Al Jazeera, France24). The articles were scraped by kosmopulse.com.

Methodology:

  • Extracted named entities (people, places, organizations) using spaCy NLP.
  • Constructed a co-occurrence matrix to detect which entities appear together across articles.
  • Applied hierarchical clustering (Ward linkage) to group related entities.
  • Labeled internal tree nodes with the most frequent entity in each cluster.
  • Final structure exported as a tree and visualized using Plotly Express (Treemap).
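
To make the middle steps concrete, here is a minimal sketch of the co-occurrence and clustering part. It is not the exact code behind the chart: the toy `articles` list stands in for the spaCy output, and treating each row of the co-occurrence matrix as a feature vector for Ward linkage is my assumption about one reasonable way to do it.

```python
from itertools import combinations

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Hypothetical toy input: one set of already-extracted entities per article.
articles = [
    {"Donald Trump", "Ukraine", "NATO"},
    {"Donald Trump", "Ukraine"},
    {"Ukraine", "NATO"},
    {"BBC", "Reuters"},
    {"BBC", "Reuters", "NPR"},
]

entities = sorted(set().union(*articles))
index = {name: i for i, name in enumerate(entities)}

# cooc[i, j] = number of articles that mention both entity i and entity j.
cooc = np.zeros((len(entities), len(entities)), dtype=int)
for ents in articles:
    for a, b in combinations(sorted(ents), 2):
        cooc[index[a], index[b]] += 1
        cooc[index[b], index[a]] += 1

# Assumption: each row is an entity's co-occurrence profile, and Ward
# linkage is run on the Euclidean distances between those profiles.
Z = linkage(cooc.astype(float), method="ward")

# Cut the tree into two flat groups just to inspect the result.
labels = fcluster(Z, t=2, criterion="maxclust")
groups = dict(zip(entities, labels))
```

On real data the matrix would be far larger and probably sparse, so some filtering of rare entities before clustering would likely be needed.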

Tools:
Python, pandas, spaCy, scikit-learn, scipy, plotly, Jupyter

What it shows:
Each box represents an entity (like “Donald Trump” or “Ukraine”). Size reflects how often the entity appeared alongside other entities across the dataset. Boxes are nested according to the clustering, showing which names and topics tend to appear together, and as subtopics of each other, in global media coverage.
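
As a rough illustration of how sizes and labels like these could be derived, here is a small sketch. The toy `articles`, the hand-written `tree`, the tie-breaking rule and the row layout are all hypothetical, and counting "articles where the entity appears with at least one other entity" is only my reading of the size definition above.

```python
from collections import Counter
from itertools import count

# Hypothetical toy input: entities found in each article.
articles = [
    {"Donald Trump", "Ukraine", "NATO"},
    {"Donald Trump", "Ukraine"},
    {"Ukraine", "NATO"},
    {"Ukraine"},  # appears alone, so it adds nothing to any box
    {"BBC", "Reuters"},
]

# Box size: number of articles where the entity appears with another entity.
size = Counter()
for ents in articles:
    if len(ents) > 1:
        size.update(ents)

# Hypothetical cluster tree: nested lists are clusters, strings are entities.
tree = [["Ukraine", ["Donald Trump", "NATO"]], ["BBC", "Reuters"]]


def leaves(node):
    """Return every entity name under a node."""
    if isinstance(node, str):
        return [node]
    return [leaf for child in node for leaf in leaves(child)]


rows = []
ids = count()


def walk(node, parent_id=""):
    """Emit one row per node; clusters take their most frequent member's name."""
    node_id = f"n{next(ids)}"
    members = leaves(node)
    # Ties are broken alphabetically (an arbitrary choice for this sketch).
    label = max(sorted(members), key=lambda e: size[e])
    value = sum(size[e] for e in members)
    rows.append({"id": node_id, "parent": parent_id, "label": label, "value": value})
    if not isinstance(node, str):
        for child in node:
            walk(child, node_id)


walk(tree)
```

Rows in this shape map onto the `ids`, `names`, `parents` and `values` arguments of `plotly.express.treemap`. Because each cluster's value here is the total of its children, `branchvalues="total"` would be the matching setting.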

For the original high-resolution PDF (width=3000, height=2000), see https://www.kosmopulse.com/post/we-ve-added-5-new-news-sources-and-a-curious-visualization-to-match

I also created a 60s video version of this exploration if you're curious: https://youtu.be/3H5bcNKXihM

Posted by Serious-Parking-2625


2 Comments

  1. Serious-Parking-2625 on

    **Data source**: News articles scraped from ~20 global news outlets (2025), including BBC, Reuters, NPR, The Guardian, Al Jazeera, and others. Extracted by [kosmopulse.com](http://kosmopulse.com).

    **Method**:

    – Named Entity Recognition (spaCy) to extract people, places, organizations from article text

    – Co-occurrence matrix of entity pairs

    – Hierarchical clustering (Ward linkage)

    – Final visualization via Plotly Express (Treemap/Sunburst)

    **Tools**:

    – Python (pandas, spaCy, sklearn, scipy, plotly)

    – Jupyter + Colab for preprocessing and clustering

    **Visualization**:

    Each box represents an entity (like “Donald Trump” or “Ukraine”). Size reflects how often the entity appeared alongside other entities across the dataset. Boxes are nested according to the clustering, showing which names and topics tend to appear together, and as subtopics of each other, in global media coverage.

    For the original high-resolution PDF (width=3000, height=2000), see [https://www.kosmopulse.com/post/we-ve-added-5-new-news-sources-and-a-curious-visualization-to-match](https://www.kosmopulse.com/post/we-ve-added-5-new-news-sources-and-a-curious-visualization-to-match)

    I also created a 60s video version of this exploration if you’re curious: [https://youtu.be/3H5bcNKXihM](https://youtu.be/3H5bcNKXihM)