This package implements a modularity maximizer for graph clustering based on
the recommendations of Noack, A., Rotta, R.: Multi-level algorithms for
modularity clustering. Technical Report 0812.4073v2, arXiv (2008).
The clustering algorithm is launched as follows:
java -jar clustering.jar options files
The simplest use consists in providing as a command line option the name of a file that contains the edges of the graph to cluster. This is done as follows:
java -jar clustering.jar GraphFile
The file must contain one edge per line given by two vertex identifiers (any string without a blank) and one optional weight, all separated by blanks. For instance:
toto titi 2.4
Missing weights are replaced by 1. Graphs are considered as undirected. A warning is issued if the original graph is directed and if it has incompatible arcs, that is at least two arcs of the form A -> B and B -> A with different weights.
If the clustering program is called as specified, the final partition will be printed on the standard output. Each line of the output consists in a vertex identifier followed by a class number. Classes are numbered starting from zero. Please note that the printing order is deterministic but not done in any simple order with respect to vertex identifiers.
Alternatively, the partition can be saved into a file if the program is called with two parameters, that is like this:
java -jar clustering.jar GraphFile PartitionFile
The program offers several options specified via command line parameters:
- "-reduction val" specifies the relative reduction in vertex numbers that triggers taking a snapshot of the coarsened graph. The default is 0.25. A value of 1 (or above) turns off the multilevel refinement.
- "-priorizer val" specifies the criterion used by the coarsening to rank fusion candidates. The default is the significance criterion as recommended by Noack and Rotta's paper.
- "-verbose N" specifies a verbosity level. High values will produce a lot of output and slow down very significantly the program. Defaults to 0 (no spurious output).
- "-debug N" specifies a debugging level. Higher values turn on consistency tests of the program. This slows down very significantly the processing. Defaults to 0 (no tests).
- "-graph file" provides an alternative way of specifying the graph file.
- "-part file" provides an alternative way of specifying the partition file.
- "-mod file" provides a filename to save the modularity of the obtained partition
The program can do Monte Carlo based analysis of the significance of the obtained partition in terms of modularity value. This is done by generating random graphs similar to the one under study (based on the configuration model) and by computing a maximal modularity clustering of those random graphs. Modularities obtained this way are saved in the modularities file for further analysis. The related options are:
- "-random N" specifies the number of random graphs to build.
- "-rgg val" specifies the random graph generator. The default is the switching technique while the matching technique is also usable.
- "-thread N" specifies the number of threads to use to perform the significance analysis. Defaults to the number of cpus (cores) reported by the system.
Modularities are saved in the file specified by the -mod option. Each line contains a modularity value followed by a tabulation and by either "Original" for the original graph or "Random" for a random graph.
The program can also perform recursive sub-partitioning of the clusters found in the optimal partition. This behavior is triggered by the -recursive option. When activated, the program does a maximal modularity partitioning of each top clusters, considering the corresponding subgraph in total isolation (i.e., discarding outside connections, in particular when calculating the degree of the vertices). This process is conducted recursively until some limit is reached. The exact behavior of the program is controlled via the following options:
- "-termination val" chooses the termination algorithm between 'significance' and 'size'. The second solution ('size') uses only the size of the cluster to decided whether to continue the splitting process. The minimal size is given by the "-minsize" parameter. The first solution ('significance', default choice) uses both the size as a prior test for splitting and the significance of the obtained clustering as a posterior acceptance test: using a Monte Carlo simulation, the modularity of the clustering is compared to maximal modularities obtained on similar random graphs. The clustering is accepted only if it has a higher modularity of the ones obtained on random graphs. Note that in both algorithms, is the maximal modularity clustering consists in only one cluster, the recursion is obviously terminated.
- "-minsize N" specifies the minimal cluster size below which to sub-clustering is attempted (defaults to 4)
- "-recrandom N" specifies the number of random graphs to generate in the 'significance' algorithm (defaults to 50)
When instructed to perform recursive sub-partitioning, the program outputs a partition description that is quite similar to the one used for the top level clustering only. More precisely, additional columns are added to the file, one per level of sub-partitioning. Each new cluster is assigned a globally unique id, starting with 0 in the top level. When a cluster cannot be split according to termination criteria, it retains its id in all the subsequent levels.