ParaMiner
Efficient Pattern Mining
ParaMiner is a generic, parallel algorithm for mining closed frequent patterns in transactional datasets.
The generic base of ParaMiner can be instanciated to solve different mining task. This version of the code comes with 3 instances of ParaMiner:
- ParaMinerFIM to mine frequent closed itemsets,
- ParaMinerCRG to mine closed relational graphs, and
- ParaMinerGRI to mine graduals patterns.
See Section 2 for more details about ParaMiner existing instances.
1 Quick start
1.1 Download ParaMiner
1.2 Extract the archive
tar xzvf paraminer-1.0.tgz
1.3 Build ParaMiner
cd paraminer ./configure make make install
Default install directory is /usr/local. You can specify a different install directory with –prefix:
./configure --prefix=<path/to/install_dir>
If you do not wish to install, ParaMiner executable files can be found in the <paraminer>/src/ directory.
1.4 Run ParaMiner
You can now run a built-in instance to solve a standard pattern mining problem (see Section 2) or you can start writing your own instance of ParaMiner (see Section 3).
For example: the fim instance can be used to mine closed frequent itemset in a transactional dataset.
paraminer_itemsets data/fim/mushroom.dat 1500 -t 4
Will output all the closed frequent itemsets occurring in mushroom dataset with an absolute frequency of at least 1500. The -t switch specifies the number of threads. See paraminer_itemsets -h for further details.
2 Built-in instances of ParaMiner
In order to solve a pattern mining problem, you need the adequate instance of ParaMiner.
This version of ParaMiner comes with 3 instances:
- ParaMinerFIM to mine frequent closed itemsets,
- ParaMinerCRG to mine closed relational graphs, and
- ParaMinerGRI to mine graduals patterns.
2.1 ParaMinerFIM for closed frequent itemset mining
2.1.1 Input file format
The input file format is the standard ASCII format of the FIMI dataset repository:
- The whole dataset is described by a single ASCII file;
- each line is a transaction;
- each transaction contains distinct items;
- transactions must be ordered;
- last line must be empty.
e.g. test.dat
1 2 4 6 1 2 3 5 7 2 3
is a valid dataset.
2.1.2 Running the ParaMinerFIM
You can mine closed frequent itemsets occurring in this dataset by executing the following command:
./paraminer_itemsets test.dat 2
Alternatively, if you have a multi-core/parallel computer with 8 cores, you can exploit them by executing the following command:
paraminer_itemsets test.dat 2 -t 8
2.1.3 Output format
- each line is a frequent closed itemset;
- frequency is stored at the end of the line into brackets.
e.g.
./paraminer_itemsets test.dat 2
will generate the following output on the standard output:
2 (3) 1 2 (2) 3 2 (2)
The results can be stored by redirecting the standard output into a file:
./paraminer_itemsets test.dat 2 -t 1 > results.out
2.2 ParaMinerCRG for closed frequent connected relational graphs mining
Extract connected relational graphs from relational graphs datasets. Relational graphs (graphs with distinct labels)
2.2.1 Input format
A graph dataset is a directory containing a collection of ASCII files. Each ASCII file is the description of one graph from the dataset.
The files must have the following format:
- the first line is the number of distinct vertexes labels in the graph dataset;
- each following line is a triplet <vertex id> <vertex id> <edge value> describing one edge of the graph where:
<vertex id> are integer identifiers for the two vertexes of the edge. <edge value> is any real number. Edges with a value bellow the edge threshold (mandatory argument of ParaMinerCRG) are disregarded. This is typically used to simplify the overly complex graphs before the mining process. If unnecessary, use <edge value> = 1 for every edge.
For example:
10 1 2 1 2 3 1 3 4 1
Describes the following relational graph:
(2) -- (1) | | (3) -- (4)
An graph dataset example can be found in <paraminer_directory>/data/crg/test.
2.2.2 Running ParaMinerCRG
You can mine closed connected relational graphs occurring in the example graph dataset by executing the following command:
./paraminer_cgraphs <paraminer_directory>/data/crg/test 1 1
2.2.3 Output format
- Each line is a list of edges that representing a connected subgraph that is frequent in the dataset.
- The line ends with the frequency of the graph.
For example, mining the example dataset will generate the following outpout.
( 1, 2 ) (2) ( 3, 4 ) (2) ( 1, 2 ) ( 1, 4 ) ( 2, 2 ) ( 3, 4 ) (1) ( 1, 2 ) ( 3, 4 ) ( 2, 3 ) (1) 4 patterns mined
3 Creating a new instance of ParaMiner
This section describe how to create your own instance of ParaMiner. You need to create a new instance if you want to mine a type of patterns that is not supported by any ParaMiner built-in instance.
For example let's say we want to mine periodic patterns, which is not supported by default in ParaMiner.
First start by creating a paraminer_local_periodic.cpp file which will contain an implementation of the following C++ functions:
3.1 A selection criterion
In a function called membership_oracle(). The selection criterion to distingish candidate patterns from patterns.
It takes as an argument a closed pattern P and a possible augmentation element e. It must return a non-null value if and only if the candidate pattern P U {e} is a pattern.
For example for our closed dark pattern mining problem, it can be as simple as:
bool membership_oracle(P, e){ return is_a_periodic_pattern(P U {e}); }
3.2 A closure operator
In a function called clo()
The closure operator can be used to limit the redundancy in the resulting set of Patterns. Takes a pattern as an argument, and returns a closed pattern. The identity function is a valid closure operator.
This function as to be a valid closure operator
clo(P){ return P; }
It is worth noting that ParaMiner's efficiency relies on closed pattern. Therefore defining a closure operator according to the problem definition is usually a good idea. Many example of closure operators have been proposed in [ 2 ]. If your problem satisfies some properties a default closure operator (better than the identity) can be used. A section is dedicated to this in [ 1 ].
3.3 A main function
The main function is here to achieves three goals:
- Parse the command line arguments
- Load and pre-process the dataset
- Invoque the clogen() routine to start the exploration.
3.3.1 Parsing the command line arguments
You must start your main function by calling the parse_clogen_arguments(argc, argv) function. It will capture the arguments used by ParaMiner remove them from argv and decrease argc.
3.3.2 Loading the dataset
The dataset must be loaded into a table called tt which is of type TransactionTable.
If your dataset is stored as described in 2.1.1, you can use the built-in function read_transaction_table() It takes two argument, the filename and the transaction table.
So far our clogen_local_dark.cpp file looks like this:
int main(int argc, char **argv){ load_transaction_table (&tt, argv[1]) ... }
3.3.3 Invoking the search space exploration
Once your dataset is loaded into tt, you must call the clogen() main routine with empty_set as an argument if you want to start the exploration from the emptyset.
4 Bugs and bug reports
Repport bugs and/or comments at: FirstName.LastName@cs.kuleuven.be
My FirstName is Benjamin My LastName is Negrevergne
5 Publications
5.1 Main publication:
(If you use ParaMiner for your your research, please cite this publication.)
[ 1 ] ParaMiner: A generic pattern mining algorithm for multi-core architectures [to appear] Benjamin Negrevergne · Alexandre Termier · Marie-Christine Rousset and Jean-François Méhaut DAMI/DMKD
5.2 Other important reads
[ 2 ] Arimura, H., & Uno, T. (2005). A polynomial space and polynomial delay algorithm for enumeration of maximal motifs in a sequence. Algorithms and Computation, 724-737.
[ 3 ] Boley, M., Horváth, T., Poigné, A., & Wrobel, S. (2010). Listing closed sets of strongly accessible set systems with applications to data mining. Theoretical computer science, 411(3), 691-700.
[ 4 ] Benjamin Negrevergne. A Generic and Parallel Pattern Mining Algorithm for Multi-Core Architectures. PhD thesis, Grenoble University, 2011.
[ 5 ] Uno, T., Kiyomi, M., & Arimura, H. (2004, November). LCM ver. 2: Efficient mining algorithms for frequent/closed/maximal itemsets. In Proceedings of the IEEE ICDM Workshop on Frequent Itemset Mining Implementations (FIMI 04).
[ 6 ] Negrevergne, B., Termier, A., Méhaut, J., & Uno, T. (2010, June). Discovering closed frequent itemsets on multicore: Parallelizing computations and optimizing memory accesses. In High Performance Computing and Simulation (HPCS), 2010 International Conference on (pp. 521-528). IEEE.
5.3 Gradual itemset mining
[ 7 ] Anne Laurent, Benjamin Négrevergne, Nicolas Sicard, and Alexandre Termier. Pgp-mc: Towards a multicore parallel approach for mining gradual patterns. In DASFAA, pages 78-84, 2010.
6 Authors and license
Authors:
- Benjamin Negrevergne
- Alexandre Termier
It was developped at Grenoble University / LIG.
License: ParaMiner is distributed under the LGPLv3 See LICENSE file in source directory for more informations.