Algorithm to implement a word cloud like Wordle [layout]


Here's a really nice javascript one from Jason Davies that uses d3. You can even use webfonts with it.





My Questions

  • Is there an algorithm available that does what Wordle does?
  • If no, what are some alternatives that produces similar kinds of output?

Why I'm asking

  • just curious
  • want to learn

Tag Cloud in C#

Building a tag cloud is, as I see it, a two part process:

First, you need to split and count your tokens. Depending on how the document is structured, as well as the language it is written in, this could be as easy as counting the space-separated words. However, this is a very naive approach, as words like the, of, a, etc... will have the biggest word-count and are not very useful as tags. I would suggest implementing some sort of word black list, in order to exclude the most common and meaningless tags.

Once you have the result in a (tag, count) way, you could use something similar to the following code:

(Searches is a list of SearchRecordEntity, SearchRecordEntity holds the tag and its count, SearchTagElement is a subclass of SearchRecordEntity that has the TagCategory attribute,and ProcessedTags is a List of SearchTagElements which holds the result)

double max = Searches.Max(x => (double)x.Count);
List<SearchTagElement> processedTags = new List<SearchTagElement>();

foreach (SearchRecordEntity sd in Searches)
    var element = new SearchTagElement();                    

    double count = (double)sd.Count;
    double percent = (count / max) * 100;                    

    if (percent < 20)
        element.TagCategory = "smallestTag";
    else if (percent < 40)
        element.TagCategory = "smallTag";
    else if (percent < 60)
        element.TagCategory = "mediumTag";
    else if (percent < 80)
        element.TagCategory = "largeTag";
        element.TagCategory = "largestTag";


It's not great, but there is an open-source project (alas, in PHP) that does word clouds over time. The example uses presidential speeches.

label nodes outside with minimum overlap with other nodes/edges in networkx

I have previously attempted something similar with the main idea being to keep out of the way of the edges mostly.

Assuming that the edges are straight lines, there are two simple and similar ways to achieve this:

  1. On the basis of the angles that the edges of a node's neighbourhood are making with respect to the node itself.

  2. On the basis of the centroid of the neighbourhood's nodes.

So, find the angles that the edges departing from a node towards its neighbourhood form and try to position the label AWAY from the majority of the edges; OR estimate the centroid of a node's neighbourhood and position the label along the opposite direction.

The first solution can be a little bit problematic, primarily because of the way that the atan2 function operates (which essentially determines the edge angles) but it does provide some flexibility in terms of positioning the label.

The second solution is the simplest and works as follows:

import networkx as nx
import matplotlib.pyplot as plt

#Build the graph
#Please note, the code here is as per the original post
G = nx.complete_graph(5)
mapping = {0:'aaaaaaa',1:'bbbbbbb',2:'ccccccc', 3:'dddddddd', 4:'eeeeeeeee'}
G = nx.relabel_nodes(G,mapping)

plt.figure(figsize=(10,10), facecolor="w", frameon=False)
#Get a graph layout
pos = nx.graphviz_layout(G, prog="fdp") #calculate position (x,y) coordinates
#Here is an alternative layout, please see below.
#pos = nx.layout.spring_layout(G)
nx.draw_networkx_edges(G,pos, width=2,edge_color='r')
#Show the original position of the labels using a Green colour.

#Please note, the code below uses the original idea of re-calculating a dictionary of adjusted label positions per node.
label_ratio = 1.0/8.0
pos_labels = {} 
#For each node in the Graph
for aNode in G.nodes():
    #Get the node's position from the layout
    x,y = pos[aNode]
    #Get the node's neighbourhood
    N = G[aNode]
    #Find the centroid of the neighbourhood. The centroid is the average of the Neighbourhood's node's x and y coordinates respectively.
    #Please note: This could be optimised further
    cx = sum(map(lambda x:pos[x][0], N)) / len(pos)
    cy = sum(map(lambda x:pos[x][1], N)) / len(pos)
    #Get the centroid's 'direction' or 'slope'. That is, the direction TOWARDS the centroid FROM aNode.
    slopeY = (y-cy)
    slopeX = (x-cx)
    #Position the label at some distance along this line. Here, the label is positioned at about 1/8th of the distance.
    pos_labels[aNode] = (x+slopeX*label_ratio, y+slopeY*label_ratio)

#Finally, redraw the labels at their new position.
#Show the figure

This works, mostly, for nodes that are largely in the periphery of the Graph but is challenging for nodes that are positioned towards the centre of the graph because the centroid will not provide a reliable direction that avoids the majority of the edges.

Here is the output for graphviz's fdp layout...

...and here is the output for networkx' spring layout.

Please note the proximity of the green and black coloured labels on the second figure. Essentially, the centroid of ddddddd's neighbourhood is relatively close to the node's actual position.

For a more complex solution, you might want to check more complex algorithms such as the one that is used by Wordle in order to adapt the initial position of the label if it intersects an edge.

Hope this helps.