SDC, Data Science and Knowledge

Clément Charnay


Former PhD student in the SDC team (formerly BFO team) of the ICube laboratory of the University of Strasbourg, from October 2012 to June 2016.


ICube Laboratory
Télécom Physique Strasbourg
300 bd Sébastien Brant - CS 10413
F - 67412 Illkirch cedex
Office: C331
Phone: +33 (0) 3 68 85 45 78
Email: charnay (at) unistra (dot) fr


PhD Thesis

Title: Enhancing Supervised Learning with Complex Aggregate Features and Context Sensitivity

Promotor: Nicolas Lachiche (Tenured Senior Associate Professor, ICube-SDC)

Co-advisor: Agnès Braud (Tenured Associate Professor, ICube-SDC)

Funding: Grant from the French Ministry of Higher Education and Research

Defended: June 30th, 2016

Overview: This PhD thesis focuses on two strengths of the Data Mining theme of the SDC team of the ICube laboratory: relational data mining on the one hand, and cost-sensitive learning on the other. Both topics are currently studied as part of the European project REFRAME, in collaboration with the University of Bristol and the Polytechnic University of Valencia.

Relational data mining is a subfield of data mining in which data is not represented according to the classic attribute-value model, where every row of a single table represents one training instance with its properties, including the attribute to predict. Instead, data is spread over several tables linked by foreign keys, each table representing one kind of object in the problem. One table, called the main table, contains the training instances (for instance, molecules) together with the attribute to learn, and the other tables (for instance, a table of the atoms constituting the molecules) contain the secondary objects linked to the main ones. We intend to take the properties of these secondary objects into account when learning on the main objects. One way to do so, in which we are particularly interested, is the use of complex aggregates. A complex aggregate applies an aggregate function to the secondary objects that are linked to a given main object and that meet a certain condition. More intuitively, it summarizes the secondary table in a single value per main object. Two examples of such aggregates are the number of carbon atoms in a molecule and the average charge of the oxygen atoms of a molecule. However, the number of possible combinations of aggregate condition and aggregate function makes the exhaustive generation of all complex aggregates intractable. One of the goals of the PhD thesis is to propose a heuristic that explores the space of complex aggregates and incrementally generates those that are relevant to the problem at hand.
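The two example aggregates mentioned above can be illustrated with a minimal Python sketch. The table contents, the column names (`mol_id`, `element`, `charge`) and the helper `complex_aggregate` are hypothetical and chosen only for illustration; they are not taken from the thesis, and the sketch does not implement the search heuristic itself.

```python
from statistics import mean

# Hypothetical main table: one row per training instance (molecule).
molecules = [
    {"mol_id": 1, "label": "active"},
    {"mol_id": 2, "label": "inactive"},
]

# Hypothetical secondary table: atoms, linked to molecules by the foreign key mol_id.
atoms = [
    {"mol_id": 1, "element": "C", "charge": -0.1},
    {"mol_id": 1, "element": "O", "charge": -0.4},
    {"mol_id": 1, "element": "O", "charge": -0.2},
    {"mol_id": 2, "element": "C", "charge": 0.1},
    {"mol_id": 2, "element": "C", "charge": 0.0},
    {"mol_id": 2, "element": "H", "charge": 0.05},
]


def complex_aggregate(mol_id, condition, function):
    """Apply `function` to the atoms of molecule `mol_id` that satisfy `condition`."""
    selected = [a for a in atoms if a["mol_id"] == mol_id and condition(a)]
    return function(selected)


def count(rows):
    return len(rows)


def avg_charge(rows):
    # The average is undefined when no atom meets the condition.
    return mean(r["charge"] for r in rows) if rows else None


# Each complex aggregate becomes one new feature of the main table.
for m in molecules:
    m["n_carbon"] = complex_aggregate(
        m["mol_id"], lambda a: a["element"] == "C", count
    )
    m["avg_o_charge"] = complex_aggregate(
        m["mol_id"], lambda a: a["element"] == "O", avg_charge
    )
```

Each pair of condition and function yields one candidate feature, so even this toy schema already admits many aggregates (one per element, per numeric column, per function), which gives a sense of why exhaustive enumeration does not scale.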

The other domain on which this PhD thesis focuses is multi-class cost-sensitive learning. In this kind of problem, the attribute to learn can take more than two values, unlike the binary problems for which many learning algorithms are designed. Moreover, not all classification errors have the same cost. This is to be expected in a medical domain, for example, where diagnosing a disease in a healthy patient does not have the same impact as failing to diagnose it in a sick patient. In this framework, we are particularly interested in binarization approaches, which reduce a multi-class problem to several binary problems. More specifically, we consider the case where the binarization uses scorers, whose scores are used to set decision thresholds between the two classes of each binary subproblem.
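As a rough illustration of thresholds on scorer outputs, the sketch below assumes a one-vs-rest binarization, which is only one possible scheme and is not necessarily the one studied in the thesis. The class names, scores, costs, and the functions `cost_threshold` and `one_vs_rest_predict` are hypothetical. The threshold formula is the standard one for a binary problem with calibrated probability scores and zero cost for correct decisions.

```python
def cost_threshold(cost_fp, cost_fn):
    """Cost-minimizing threshold on a calibrated probability for a binary problem.

    Predicting positive is cheaper in expectation when
    p * cost_fn >= (1 - p) * cost_fp, i.e. p >= cost_fp / (cost_fp + cost_fn).
    """
    return cost_fp / (cost_fp + cost_fn)


def one_vs_rest_predict(scores, thresholds):
    """Pick the class whose 'class vs rest' score exceeds its threshold by the largest margin."""
    return max(scores, key=lambda c: scores[c] - thresholds[c])


# Hypothetical scores from three binary scorers for one patient.
scores = {"healthy": 0.55, "disease_a": 0.40, "disease_b": 0.05}

# Uniform thresholds: every binary subproblem treats both error types alike.
uniform = {c: 0.5 for c in scores}

# Cost-aware thresholds: missing a disease is assumed to cost 4 times a false alarm.
cost_aware = {
    "healthy": 0.5,
    "disease_a": cost_threshold(cost_fp=1, cost_fn=4),
    "disease_b": cost_threshold(cost_fp=1, cost_fn=4),
}

print(one_vs_rest_predict(scores, uniform))
print(one_vs_rest_predict(scores, cost_aware))
```

With uniform thresholds the patient is classified as `healthy`; lowering the thresholds of the disease classes to reflect the higher cost of a missed diagnosis changes the decision to `disease_a`, although the scores themselves are unchanged.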


Teaching assistant at the UFR Mathématiques-Informatique (department of Mathematics and Computer Science) and at the IUT Robert Schuman (University Institute of Technology) of the University of Strasbourg.


  • IUT Computer Science S1: Databases and SQL (10h TD/28h TP)
  • IUT Computer Science S1: Introduction to Algorithmics and Programming (26h TP)


  • IUT Computer Science S1: Databases and SQL (10h TD/28h TP)
  • IUT Computer Science S1: Data Structures and Fundamental Algorithms (14h TD/14h TP)


  • L3/S6P Mathematics: Object-Oriented Programming (18h TD/12h TP)
  • L3/S5P Computer Science: Databases 2 (22h TP)
  • L3/S5P Computer Science: Operating Systems Basics (12h TP)