SDC, Data Science and Knowledge

Data Mining

Data Mining group


The Data Mining group focuses on machine learning and knowledge extraction from complex data (e.g. images, databases, etc.). Our research has two aims: on the one hand, designing and implementing knowledge extraction methods and, on the other hand, applying those methods to analyse databases and digital images. Our approaches belong to the machine learning area, in particular to clustering and relational data mining. Our main application domains are remote sensing and medical images, biochemical data, and customer relationship management.

Data mining is an important step in the process that leads from data to knowledge. For example, understanding the processes and development of more or less anthropised systems at various spatial and temporal scales (urbanisation pressure on land, biodiversity loss, etc.) from satellite or other data has become a major component of fields such as environmental studies and urban planning. However, current analysis techniques are less and less able to cope with the avalanche of heterogeneous data, which are often incomplete or inaccurate and increasingly supplied as continuous streams.

Moreover, while the characteristics of mining methods are generally well known and understood by the statistician or computer scientist, this is rarely true of the end user. It is therefore often necessary to try several algorithms with different parameters to determine which best suits the question. The user must also take into account the non-deterministic behaviour of many unsupervised classification methods. In addition, the variable quality of raw or preprocessed data, the robustness of learning methods to noise, and the sensitivity of results to changes in data acquisition or construction methods and parameters must all be considered in order to suggest more appropriate strategies for data cleaning and preprocessing. Finally, because the data are supplied continuously, a dynamic dimension is added, along with the need for incremental learning in a changing environment.

There is currently no surefire way to choose the best method and its parameters, as this choice is strongly tied to the application domain, to the a priori knowledge about it, and to the data to be processed. One increasingly proposed approach to circumvent this problem is based on the intuition that the methods are complementary, or can at least corroborate one another. Mechanisms for comparing and unifying results from different methods and data can thus be used to provide the user with a relevant summary. A promising avenue in this area is based on collaboration between different methods.

Nevertheless, we learn all the better when what we study relates to what we already know and when the objective of the task is known and understood: it is not desirable for data interpretation to be carried out by someone unfamiliar with the topic. The interpretation process therefore often requires a thematic expert, which is unfortunately very time-consuming. Reducing this cost by injecting expert knowledge directly into the process requires modelling and formalising real-world classes and objects, defining their possible representations in the data space, and finally studying and building mechanisms for extracting and labelling these objects with respect to this knowledge.
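As a rough illustration of what comparing and unifying the results of several methods can look like, the sketch below builds a co-association matrix from a few partitions of the same objects and groups the objects that most partitions agree on. This is a generic consensus technique chosen for illustration, not the group's own collaborative method; the function names, the 0.5 threshold and the toy partitions are all assumptions.

```python
def co_association(partitions):
    """Return the fraction of partitions in which each pair of objects shares a cluster."""
    n = len(partitions[0])
    counts = [[0] * n for _ in range(n)]
    for labels in partitions:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    m = len(partitions)
    return [[c / m for c in row] for row in counts]


def consensus(partitions, threshold=0.5):
    """Group objects linked by pairs co-clustered in more than `threshold` of the partitions."""
    agree = co_association(partitions)
    n = len(agree)
    labels = [-1] * n          # -1 marks an object not yet assigned
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:           # depth-first walk over pairs that agree often enough
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and agree[i][j] > threshold:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels


# Three hypothetical partitions of the same six objects, e.g. produced by
# different algorithms or by one algorithm with different parameters.
# Cluster identifiers are arbitrary and need not match across partitions.
partitions = [
    [0, 0, 0, 1, 1, 1],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 2, 2, 1],
]
print(consensus(partitions))
```

Here the partitions disagree about the third object and about the number of clusters, yet a majority vote over pairs still recovers two groups of three objects.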

Most of this work has been carried out in collaboration with the Laboratoire Image, Ville, Environnement and has been validated in the remote sensing domain. The main application domain of our methods is therefore the clustering of remote sensing images, and more generally the clustering of images.

Our research is organised around the following themes:

  • FODOMUST: Multistrategy data mining
  • FODOREL: Relational data mining
  • FODOST: Structural data mining
  • FODOGECO: Data mining and knowledge management


Operations

FODOMUST: Multistrategy data mining

Our work on multistrategy data mining follows three axes:

  • collaborative clustering methods: the idea is to improve the result of a single clustering of some data by using an ensemble of clustering methods and making them collaborate. Keywords: data mining, collaborative clustering, unsupervised classification, knowledge extraction, complex data
  • evolutionary approaches for feature weighting in K-means-based algorithms: in the Maclaw method (Modular Approach for Clustering with Local Attribute Weighting), each population searches only for the weights and the centre of one cluster. In our methods, all the populations work cooperatively: together they search for the partition, built from the local solutions proposed by individuals, that minimises the LKM cost function. Each individual in a population is evaluated with respect to individuals from the other populations: the better the best partition that can be built using its solution, the better the individual's evaluation.
  • integration of background knowledge in clustering: either directly in Samarah, or in classical techniques such as K-means (Germain Forestier's PhD thesis)
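To make the second axis more concrete, here is a minimal sketch of how a partition could be assembled from one local solution per cluster and then scored. The exact LKM cost function is not given on this page, so the sketch assumes a common form of locally weighted K-means cost (squared differences scaled by per-cluster attribute weights raised to a power `beta`); the function names, `beta` and the toy data are hypothetical, and the evolutionary search itself is not shown.

```python
def weighted_sq_dist(x, center, weights, beta=2.0):
    """Squared distance where each attribute is scaled by a cluster-specific weight."""
    return sum((w ** beta) * (xi - ci) ** 2
               for xi, ci, w in zip(x, center, weights))


def partition_cost(points, local_solutions, beta=2.0):
    """Assign each point to its closest cluster and sum the weighted distances.

    local_solutions holds one (center, weights) pair per cluster, i.e. one
    individual taken from each population.
    """
    labels, cost = [], 0.0
    for x in points:
        dists = [weighted_sq_dist(x, c, w, beta) for c, w in local_solutions]
        k = min(range(len(dists)), key=dists.__getitem__)
        labels.append(k)
        cost += dists[k]
    return labels, cost


# Toy data: the first attribute separates two groups, the second is noise.
points = [(0.0, 0.0), (0.0, 10.0), (5.0, 0.0), (5.0, 10.0)]

# Two candidate combinations of local solutions (assumed values).
ignore_noise = [((0.0, 5.0), (1.0, 0.0)), ((5.0, 5.0), (1.0, 0.0))]
uniform = [((0.0, 5.0), (0.5, 0.5)), ((5.0, 5.0), (0.5, 0.5))]

print(partition_cost(points, ignore_noise))
print(partition_cost(points, uniform))
```

Both combinations yield the same partition here, but the one that gives no weight to the noisy attribute has the lower cost, so under this assumed evaluation scheme its individuals would be rated better.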

All these aspects are implemented in a common platform, Samarah (Multi-agent learning system for the automatic refinement of hierarchies).


In parallel, we work on the automatic structuring of sets of images and video sequences.

FODOREL: Relational data mining

Relational data mining deals with knowledge extraction from relational databases, and more generally with inductive learning from data that cannot naturally be represented as a single attribute-value table, e.g. chemical reactions.

Our application areas include:

  • chemistry
  • water quality
  • customer relationship management (CRM)
  • geography

Our research topics are:

  • rule discovery
  • naive Bayesian classifiers
  • ROC-based optimisation
  • propositionalisation and, more generally, problem representation and data preparation
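As an illustration of the last topic, propositionalisation turns relational data into a single attribute-value table, typically by aggregating the rows of a related table. The sketch below flattens a one-to-many molecule/atom relation into one feature row per molecule. The schema, the feature names and the values are hypothetical and only meant to show the principle.

```python
def propositionalise(molecule_ids, atom_rows):
    """Flatten a one-to-many molecule/atom relation into one feature row per molecule."""
    table = {m: {"n_atoms": 0, "n_carbon": 0, "max_charge": None}
             for m in molecule_ids}
    for mol_id, element, charge in atom_rows:
        row = table[mol_id]
        row["n_atoms"] += 1                      # count aggregate
        if element == "C":
            row["n_carbon"] += 1                 # conditional count aggregate
        if row["max_charge"] is None or charge > row["max_charge"]:
            row["max_charge"] = charge           # max aggregate
    return table


# Hypothetical relational data: (molecule id, element, partial charge).
molecule_ids = ["m1", "m2"]
atom_rows = [
    ("m1", "C", -0.1),
    ("m1", "O", -0.4),
    ("m1", "C", 0.2),
    ("m2", "N", 0.3),
]

for mol_id, features in propositionalise(molecule_ids, atom_rows).items():
    print(mol_id, features)
```

The resulting table can be fed to any attribute-value learner, at the price of losing whatever structure the chosen aggregates do not capture.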

FODOST: Structural data mining

Structural data mining concerns knowledge extraction from complex data structured along spatial, semantic, or temporal dimensions. Our aim is to make use of the structure between objects when clustering them.

We work with multisource, multiview, multiresolution and multitemporal data, mainly in the remote sensing domain.

FODOGECO: Data mining and knowledge management (in collaboration with the Knowledge Engineering Group)

We are concerned here with the semantic interpretation of high-resolution satellite images. To enable the identification of high-level structured objects (house, street…), it is necessary to merge the classifications of the regions obtained from image analysis with inferences based on a geographical ontology. This ontology needs to describe not only the urban objects, but also their qualitative and quantitative spatial relationships.
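The following toy sketch shows one way such a merge could work: each region carries the class assigned by image classification, and a small dictionary standing in for the ontology describes concepts through a quantitative constraint (area) and a qualitative one (adjacency). Every concept definition, threshold, class name and field name here is hypothetical; a real ontology and reasoner would be far richer.

```python
from dataclasses import dataclass, field


@dataclass
class Region:
    name: str
    label: str                       # class given by the image classification
    area_m2: float
    adjacent_to: set = field(default_factory=set)


# Hypothetical concept definitions standing in for a geographical ontology.
ONTOLOGY = {
    "house": {"label": "roof", "area_m2": (50.0, 400.0),
              "adjacent_labels": {"asphalt", "vegetation"}},
    "street": {"label": "asphalt", "area_m2": (200.0, float("inf")),
               "adjacent_labels": {"roof", "vegetation"}},
}


def interpret(regions, ontology):
    """Map each region to the first concept whose constraints it satisfies, else None."""
    by_name = {r.name: r for r in regions}
    result = {}
    for r in regions:
        neighbour_labels = {by_name[n].label for n in r.adjacent_to}
        result[r.name] = None
        for concept, spec in ontology.items():
            low, high = spec["area_m2"]
            if (r.label == spec["label"]
                    and low <= r.area_m2 <= high
                    and neighbour_labels & spec["adjacent_labels"]):
                result[r.name] = concept
                break
    return result


regions = [
    Region("r1", "roof", 120.0, {"r2"}),
    Region("r2", "asphalt", 900.0, {"r1", "r3"}),
    Region("r3", "roof", 5000.0, {"r2"}),   # too large to be a house
]
print(interpret(regions, ONTOLOGY))
```

The third region has the right spectral class and the right neighbour but violates the area constraint, which is the kind of case where knowledge about the objects changes the interpretation given by classification alone.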

Main projects and collaborations

  • Ongoing projects
    • AIGE FOR BANK (2020-2021) : Artificial intelligence, governance and Ethics (CNRS 2020 call for projects "Enjeux scientifiques et sociaux de l’intelligence artificielle", on the scientific and social challenges of artificial intelligence)
      • The AIGE FOR BANK project aims to formalise recommendations developed jointly by researchers in law and in computer science, proposing a method for the banking sector that makes it possible to design AI systems that are intrinsically ethical and responsible. The objective is thus to foster the social acceptability of AI use in a sector where it is expanding rapidly and has strong societal consequences, by raising developers' awareness of ethics by design. Building on a specific use case (facial recognition in banking), the dialogue between computer scientists and legal scholars that will develop over the course of numerous meetings is essential to make the recommendations set out in the white paper operational; their possible generalisation can then be discussed.
    • POPLAB (2020-2021) : Innovative platform for education (Appel à Manifestation d’Intérêt Economie numérique, a call for expressions of interest on the digital economy, Région Grand Est)
      • PopLab is a platform that lets teachers Prepare, Organise and Share knowledge (Préparer, Organiser et Partager) and offer pupils engaging lessons with designer-quality rendering. This second phase of the project aims to build a version integrated into school IT environments, to create advanced teacher/pupil exchange features, and to set up a machine learning system and the data collection needed for a future artificial intelligence project.
    • HIATUS (2019-2023) : Historical Image Analysis for Territory evolUtion Stories (PRCE)
      • Archival aerial images offer a unique view of nearly 100 years of territorial change, with the possibility of reconstructing 3D information. They are now available in many countries. These long time series are highly heterogeneous spatially, spectrally and temporally, and their current georeferencing is only approximate. The project aims to develop new methods to (1) georeference the images accurately and produce dense time series of orthoimages and Digital Surface Models at large scale, (2) extract semantic information on land cover and its evolution in different settings (urban, forest, agriculture), and (3) promote the tools developed and the project outputs through web services built on existing platforms.
      • Consortium : LaSTIG, LETG, DYNAFOR, LIVE, ICUBE, Kermap / Coordinator : Arnaud Le Bris


  • Past projects
    • COCLICO : COllaboration, CLassification, Incrémentalité et COnnaissances (Collaboration, Classification, Incrementality and Knowledge) (November 2012 - October 2016)
      • Objective : Coclico is an ANR research project that aims to study and propose a generic method for innovative multi-scale analysis of large volumes of spatio-temporal data (e.g., remote sensing images) supplied as a stream of highly variable quality. It implements a multistrategy approach in which incremental collaboration between different data mining methods is guided by knowledge of both the thematic field (geosciences, geography), formalised in ontologies, and the analysis domain (knowledge of the methods), while guaranteeing a final quality objective that takes into account the quality of both the data and the knowledge.
      • Web site : http://icube-coclico.unistra.fr
    • Reframe : Rethinking the Essence, Flexibility and Reusability of Advanced Model Exploitation (2013 - 2015)
      • Objective : The overall objective is the development of generic techniques, methodologies and paradigms for extracting and reusing knowledge (in the form of models, features, rules, ranking information or probabilities) which can be reframed across different operating contexts. As a result, problems such as the example above will be handled as follows: (i) the changes in operating context, such as location and resolution, are identified and formalised; (ii) a more flexible and generic data representation is then used, where inputs and outputs are defined as elements in a hierarchy rather than as single numerical or nominal attributes; (iii) models are built as more versatile and generic entities, working with different locations and resolutions, taking different granularities for inputs and outputs, and producing additional information such as distributions, rankings, reliabilities, etc.; (iv) these models are adapted and integrated through the use of proper reframing operators (aggregation and disaggregation techniques in this particular case); and (v) more sophisticated and powerful performance metrics, evaluating the quality of knowledge in a wide range of operating contexts are used to assess at which resolutions and locations the model is optimal and which learning resolution (e.g. days or weeks) can make the model more flexible and powerful for reframing and deployment in other operating contexts.
      • Web site : http://reframe-d2k.org
    • FOSTER : Spatio-temporal data mining, application to the understanding and monitoring of erosion (January 2011 - September 2014)
      • Objective : The FOSTER project aims to build, from the available data, dynamic models to support monitoring and studying the evolution of the environment, specifically erosion. The data consist of satellite images, symbolic and numerical spatio-temporal data, and background knowledge. These data are heterogeneous, multi-scale and noisy, and have missing values. They are also large: for instance, a satellite image and the associated Digital Elevation Model (DEM) require 26 Gb.
        The environmental processes are complex and we would like to learn several models, e.g. to detect sharp changes as well as to monitor slowly evolving phenomena. Studying dynamic systems evolving in space and time raises questions about which spatial relations and temporal evolutions to consider. The areas studied in this project are located in New Caledonia (travel there is possible) and the French Alps.
      • Web site : http://foster.univ-nc.nc
    • CNES (French National Centre for Space Studies)
      • ORFEO GT3 study (2010-2011 and 2011-2012): modelling objects of interest in remote sensing images and their spatial relationships, for knowledge extraction guided by this information
      • PhD grant with Thalès (2009-2012): clustering of temporal sequences of heterogeneous satellite images
    • Roche pharmaceutical company (2010-2012): analysis of images to extract knowledge about the efficiency of drugs
    • DAHLIA (2010-2013): PhD grant with Christophe Collet (MIV team) on monitoring daily activities of elderly people at home
    • GeOpenSim (2007-2011): learning the classes and the evolution rules of urban areas.
    • RBS (2007-2010): automatic structuring of sets of images and video sequences
    • ECOSGIL (2005-2008): extraction of spatial knowledge for integrated management of the coastal zone
    • FoDoMuSt (2004-2008): multistrategy data mining on remote sensing images
    • CNES (2007-2008): interactive interpretation of remote sensing images