[OC] Entity Treemap from 50,000+ News Articles

Data source:
Articles from 2025, collected from ~20 major global news outlets (e.g. BBC, Reuters, NPR, The Guardian, Al Jazeera, France24) and scraped by kosmopulse.com.

Methodology:

Extracted named entities (people, places, organizations) using spaCy NLP.
Constructed a co-occurrence matrix to detect which entities appear together across articles.
Applied hierarchical clustering (Ward linkage) to group related entities.
Labeled internal tree nodes with the most frequent entity in each cluster.
Exported the final structure as a tree and visualized it with Plotly Express (treemap).
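
For anyone curious about the co-occurrence step, here is a minimal sketch of how such counts could be built. It is not the exact notebook code. The `articles` list and the `cooccurrence` helper are made up for illustration, and it assumes the entities were already pulled out per article (for example from spaCy's `doc.ents`).

```python
from collections import Counter
from itertools import combinations

# Hypothetical input: one list of entity strings per article,
# e.g. gathered from spaCy's `doc.ents` in an earlier step.
articles = [
    ["Donald Trump", "Ukraine", "NATO"],
    ["Ukraine", "NATO", "Ukraine"],
    ["Donald Trump", "Federal Reserve"],
]


def cooccurrence(articles):
    """Count, for each unordered entity pair, the articles mentioning both."""
    pair_counts = Counter()
    for entities in articles:
        # De-duplicate so a repeated mention counts once per article;
        # sort so each pair always has the same key order.
        unique = sorted(set(entities))
        pair_counts.update(combinations(unique, 2))
    return pair_counts


pairs = cooccurrence(articles)
print(pairs.most_common(3))
```

Counting each pair once per article keeps long articles from dominating. Whether the original pipeline counts per article or per mention is not stated, so treat that as an assumption.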

Tools:
Python, pandas, spaCy, scikit-learn, scipy, plotly, Jupyter

What it shows:
Each box represents an entity (like “Donald Trump” or “Ukraine”). Size reflects how often it appeared alongside other entities across the dataset. Boxes are nested according to the clustering, showing which names and topics tend to appear together, and as subtopics of one another, in global media coverage.
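
A rough sketch of how the sizing and nesting could be fed to Plotly Express, assuming a flat two-level clustering rather than the full tree. The `clusters` and `weights` values and the helper names are hypothetical. Each cluster is labeled with its most frequent member, as described in the methodology.

```python
# Hypothetical clustering result: entity -> cluster id.
clusters = {"Donald Trump": 1, "White House": 1, "Ukraine": 2, "NATO": 2}
# Hypothetical box sizes: how often each entity co-occurred with others.
weights = {"Donald Trump": 12, "White House": 5, "Ukraine": 9, "NATO": 7}


def treemap_rows(clusters, weights):
    """Build one row per entity, tagged with its cluster's label."""
    members = {}
    for entity, cid in clusters.items():
        members.setdefault(cid, []).append(entity)
    rows = []
    for cid, ents in sorted(members.items()):
        # Label the cluster with its most frequent entity.
        label = max(ents, key=lambda e: weights[e])
        for e in ents:
            rows.append({"cluster": label, "entity": e, "weight": weights[e]})
    return rows


def make_figure(rows):
    """Draw the nested boxes; imports are lazy so the table logic runs alone."""
    import pandas as pd
    import plotly.express as px

    df = pd.DataFrame(rows)
    # `path` gives the nesting (cluster, then entity); `values` sets box size.
    return px.treemap(df, path=["cluster", "entity"], values="weight")


rows = treemap_rows(clusters, weights)
```

Calling `make_figure(rows).show()` should render the nested boxes. Exporting at a fixed size, e.g. `fig.write_image("treemap.pdf", width=3000, height=2000)`, additionally needs the kaleido package installed.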

For the original high-resolution PDF (width=3000, height=2000), check out https://www.kosmopulse.com/post/we-ve-added-5-new-news-sources-and-a-curious-visualization-to-match

I also created a 60s video version of this exploration, if you're curious: https://youtu.be/3H5bcNKXihM

Posted by Serious-Parking-2625

Serious-Parking-2625 on May 24, 2025 3:04 am

**Data source**: News articles scraped from ~20 global news outlets (2025), including BBC, Reuters, NPR, The Guardian, Al Jazeera, and others. Extracted by [kosmopulse.com](http://kosmopulse.com).

**Method**:

– Named Entity Recognition (spaCy) to extract people, places, organizations from article text

– Co-occurrence matrix of entity pairs

– Hierarchical clustering (Ward linkage)

– Final visualization via Plotly Express (Treemap/Sunburst)
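
To make the clustering step concrete, here is a small sketch with SciPy's Ward linkage on a toy matrix. The numbers are invented. It also assumes the diagonal holds each entity's own article count, so that entities appearing together get similar rows. That is a guess about the setup, not a confirmed detail of the original notebook.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

entities = ["Donald Trump", "White House", "Ukraine", "NATO"]

# Hypothetical symmetric co-occurrence counts (rows/columns follow `entities`).
# Assumption: the diagonal is the number of articles mentioning the entity.
cooc = np.array([
    [10, 9, 1, 1],
    [9, 10, 1, 0],
    [1, 1, 10, 8],
    [1, 0, 8, 10],
], dtype=float)

# Each row is treated as an observation vector; Ward linkage merges the
# clusters whose union adds the least within-cluster variance.
Z = linkage(cooc, method="ward")

# Cut the tree into two flat clusters for a quick look.
labels = fcluster(Z, t=2, criterion="maxclust")
clusters = dict(zip(entities, labels))
print(clusters)
```

The full linkage matrix `Z` is what encodes the nested tree. Cutting it with `fcluster` is only a convenient way to inspect one level.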

**Tools**:

– Python (pandas, spaCy, sklearn, scipy, plotly)

– Jupyter + Colab for preprocessing and clustering

**Visualization**:

Each box represents an entity (like “Donald Trump” or “Ukraine”). Size reflects how often it appeared alongside other entities across the dataset. Boxes are nested according to the clustering, showing which names and topics tend to appear together, and as subtopics of one another, in global media coverage.

For the original high-resolution PDF (width=3000, height=2000), check out [https://www.kosmopulse.com/post/we-ve-added-5-new-news-sources-and-a-curious-visualization-to-match](https://www.kosmopulse.com/post/we-ve-added-5-new-news-sources-and-a-curious-visualization-to-match)

I also created a 60s video version of this exploration, if you’re curious: [https://youtu.be/3H5bcNKXihM](https://youtu.be/3H5bcNKXihM)

Mr-Fister-the-3rd on May 24, 2025 3:09 am

*it was not still high res

Tags

[OC] Treemap of 50,000+ news articles clustered by named entities — shows how global topics interconnect. (Hope Its still High-res 😅)
