ParaMiner is a generic, parallel algorithm for mining closed frequent patterns in transactional datasets.

The generic base of ParaMiner can be instanciated to solve different mining task. This version of the code comes with 3 instances of ParaMiner:

See Section 2 for more details about ParaMiner existing instances.

1 Quick start

1.2 Extract the archive

tar xzvf paraminer-1.0.tgz

1.3 Build ParaMiner

cd paraminer
./configure 
make
make install 

Default install directory is /usr/local. You can specify a different install directory with –prefix:

./configure --prefix=<path/to/install_dir>

If you do not wish to install, ParaMiner executable files can be found in the <paraminer>/src/ directory.

1.4 Run ParaMiner

You can now run a built-in instance to solve a standard pattern mining problem (see Section 2) or you can start writing your own instance of ParaMiner (see Section 3).

For example: the fim instance can be used to mine closed frequent itemset in a transactional dataset.

paraminer_itemsets data/fim/mushroom.dat 1500 -t 4

Will output all the closed frequent itemsets occurring in mushroom dataset with an absolute frequency of at least 1500. The -t switch specifies the number of threads. See paraminer_itemsets -h for further details.

2 Built-in instances of ParaMiner

In order to solve a pattern mining problem, you need the adequate instance of ParaMiner.

This version of ParaMiner comes with 3 instances:

  • ParaMinerFIM to mine frequent closed itemsets,
  • ParaMinerCRG to mine closed relational graphs, and
  • ParaMinerGRI to mine graduals patterns.

2.1 ParaMinerFIM for closed frequent itemset mining

Extract closed frequent itemset from a transactional dataset.

2.1.1 Input file format

The input file format is the standard ASCII format of the FIMI dataset repository:

  • The whole dataset is described by a single ASCII file;
  • each line is a transaction;
  • each transaction contains distinct items;
  • transactions must be ordered;
  • last line must be empty.

e.g. test.dat

1 2 4 6
1 2 3 5 7
2 3

is a valid dataset.

2.1.2 Running the ParaMinerFIM

You can mine closed frequent itemsets occurring in this dataset by executing the following command:

./paraminer_itemsets test.dat 2

Alternatively, if you have a multi-core/parallel computer with 8 cores, you can exploit them by executing the following command:

paraminer_itemsets test.dat 2 -t 8

2.1.3 Output format

  • each line is a frequent closed itemset;
  • frequency is stored at the end of the line into brackets.

e.g.

./paraminer_itemsets test.dat 2 

will generate the following output on the standard output:

2 (3)
1 2 (2)
3 2 (2)

The results can be stored by redirecting the standard output into a file:

./paraminer_itemsets test.dat 2 -t 1 > results.out

2.2 ParaMinerCRG for closed frequent connected relational graphs mining

Extract connected relational graphs from relational graphs datasets. Relational graphs (graphs with distinct labels)

2.2.1 Input format

A graph dataset is a directory containing a collection of ASCII files. Each ASCII file is the description of one graph from the dataset.

The files must have the following format:

  • the first line is the number of distinct vertexes labels in the graph dataset;
  • each following line is a triplet <vertex id> <vertex id> <edge value> describing one edge of the graph where:

<vertex id> are integer identifiers for the two vertexes of the edge. <edge value> is any real number. Edges with a value bellow the edge threshold (mandatory argument of ParaMinerCRG) are disregarded. This is typically used to simplify the overly complex graphs before the mining process. If unnecessary, use <edge value> = 1 for every edge.

For example:

10 
1 2 1 
2 3 1 
3 4 1

Describes the following relational graph:

(2) -- (1)
 |
 |
(3) -- (4)

An graph dataset example can be found in <paraminer_directory>/data/crg/test.

2.2.2 Running ParaMinerCRG

You can mine closed connected relational graphs occurring in the example graph dataset by executing the following command:

./paraminer_cgraphs  <paraminer_directory>/data/crg/test 1 1

2.2.3 Output format

  • Each line is a list of edges that representing a connected subgraph that is frequent in the dataset.
  • The line ends with the frequency of the graph.

For example, mining the example dataset will generate the following outpout.

( 1, 2 ) (2)
( 3, 4 ) (2)
( 1, 2 ) ( 1, 4 ) ( 2, 2 ) ( 3, 4 ) (1)
( 1, 2 ) ( 3, 4 ) ( 2, 3 ) (1)
4 patterns mined

2.3 ParaMinerGRI for gradual pattern mining

See [ 7 ] for more information about gradual patterns.

3 Creating a new instance of ParaMiner

This section describe how to create your own instance of ParaMiner. You need to create a new instance if you want to mine a type of patterns that is not supported by any ParaMiner built-in instance.

For example let's say we want to mine periodic patterns, which is not supported by default in ParaMiner.

First start by creating a paraminer_local_periodic.cpp file which will contain an implementation of the following C++ functions:

3.1 A selection criterion

In a function called membership_oracle(). The selection criterion to distingish candidate patterns from patterns.

It takes as an argument a closed pattern P and a possible augmentation element e. It must return a non-null value if and only if the candidate pattern P U {e} is a pattern.

For example for our closed dark pattern mining problem, it can be as simple as:

bool membership_oracle(P, e){
  return is_a_periodic_pattern(P U {e}); 
}

3.2 A closure operator

In a function called clo()

The closure operator can be used to limit the redundancy in the resulting set of Patterns. Takes a pattern as an argument, and returns a closed pattern. The identity function is a valid closure operator.

This function as to be a valid closure operator

clo(P){
  return P;
}

It is worth noting that ParaMiner's efficiency relies on closed pattern. Therefore defining a closure operator according to the problem definition is usually a good idea. Many example of closure operators have been proposed in [ 2 ]. If your problem satisfies some properties a default closure operator (better than the identity) can be used. A section is dedicated to this in [ 1 ].

3.3 A main function

The main function is here to achieves three goals:

  1. Parse the command line arguments
  2. Load and pre-process the dataset
  3. Invoque the clogen() routine to start the exploration.

3.3.1 Parsing the command line arguments

You must start your main function by calling the parse_clogen_arguments(argc, argv) function. It will capture the arguments used by ParaMiner remove them from argv and decrease argc.

3.3.2 Loading the dataset

The dataset must be loaded into a table called tt which is of type TransactionTable.

If your dataset is stored as described in 2.1.1, you can use the built-in function read_transaction_table() It takes two argument, the filename and the transaction table.

So far our clogen_local_dark.cpp file looks like this:

int main(int argc, char **argv){

load_transaction_table (&tt, argv[1])

...

}

3.3.3 Invoking the search space exploration

Once your dataset is loaded into tt, you must call the clogen() main routine with empty_set as an argument if you want to start the exploration from the emptyset.

4 Bugs and bug reports

Repport bugs and/or comments at: FirstName.LastName@cs.kuleuven.be

My FirstName is Benjamin My LastName is Negrevergne

5 Publications

5.1 Main publication:

(If you use ParaMiner for your your research, please cite this publication.)

[ 1 ] ParaMiner: A generic pattern mining algorithm for multi-core architectures [to appear] Benjamin Negrevergne · Alexandre Termier · Marie-Christine Rousset and Jean-François Méhaut DAMI/DMKD

5.2 Other important reads

[ 2 ] Arimura, H., & Uno, T. (2005). A polynomial space and polynomial delay algorithm for enumeration of maximal motifs in a sequence. Algorithms and Computation, 724-737.

[ 3 ] Boley, M., Horváth, T., Poigné, A., & Wrobel, S. (2010). Listing closed sets of strongly accessible set systems with applications to data mining. Theoretical computer science, 411(3), 691-700.

[ 4 ] Benjamin Negrevergne. A Generic and Parallel Pattern Mining Algorithm for Multi-Core Architectures. PhD thesis, Grenoble University, 2011.

[ 5 ] Uno, T., Kiyomi, M., & Arimura, H. (2004, November). LCM ver. 2: Efficient mining algorithms for frequent/closed/maximal itemsets. In Proceedings of the IEEE ICDM Workshop on Frequent Itemset Mining Implementations (FIMI 04).

[ 6 ] Negrevergne, B., Termier, A., Méhaut, J., & Uno, T. (2010, June). Discovering closed frequent itemsets on multicore: Parallelizing computations and optimizing memory accesses. In High Performance Computing and Simulation (HPCS), 2010 International Conference on (pp. 521-528). IEEE.

5.3 Gradual itemset mining

[ 7 ] Anne Laurent, Benjamin Négrevergne, Nicolas Sicard, and Alexandre Termier. Pgp-mc: Towards a multicore parallel approach for mining gradual patterns. In DASFAA, pages 78-84, 2010.

6 Authors and license

Authors:

  • Benjamin Negrevergne
  • Alexandre Termier

It was developped at Grenoble University / LIG.

License: ParaMiner is distributed under the LGPLv3 See LICENSE file in source directory for more informations.

Date: 2013-04-21 Sun

Validate