DataScience Lab
Table of Contents
- News / info
- (Tentative) planning for the year
- Assignment 1
- Assignment 2
- Assignment 3
- HowTo
- FAQ
- Can I develop approach X (a method that has not been discussed in class)?
- Is it mandatory to use the dataset or the metric specified by the professors?
- Do I have to use Git? Can I use Jupyter Notebook instead?
- I don't have enough computing power.
- How do I use scp to copy files to the lamsade servers?
- Is there a way to simplify the process of logging in and copying files using ssh/scp?
News / info
- /!\ Preliminary presentations will be 5 minutes only. You will be stopped mid-sentence at the end of the 5 minutes. Upload your slides (as a PDF) to your git repository before the class, and name the file "slides.pdf".
(Tentative) planning for the year
Note: A1 = assignment 1, Ax = assignment x.
Date | Description |
---|---|
September, 18 | Class intro + Intro A1 |
September, 25 | Group sessions |
October, 2 | Preliminary presentations A1 |
October, 8 | Deadline A1 23h59 |
October, 9 | Final presentations A1 + Intro A2 |
October, 16 | Alexandre's presentation on PR + group sessions |
October, 23 | NO CLASS |
October, 30 | NO CLASS |
November, 06 | Preliminary presentations A2 |
November, 12 | Deadline A2 23h59 |
November, 13 | Final presentations A2 + Intro A3 |
November, 20 | Lucas' presentation + group session |
November, 27 | NO CLASS |
December, 04 | NO CLASS |
December, 10 | Deadline A3 23h59 |
December, 11 | Preliminary + final presentation A3 |
Assignment 1
Links
Refs
- Recommender Systems: The Textbook, by Charu C. Aggarwal (read the section about MF; available at the library)
- For PCA, non-linear PCA, kernel PCA, etc., see Generalized Principal Component Analysis, by René Vidal, Yi Ma, and S. Shankar Sastry
- Deep Matrix Factorization, by Xue et al.
Assignment 2
Links
Refs
- f-GAN: Training Generative Neural Samplers using Variational Divergence Minimization: https://arxiv.org/abs/1606.00709
- Wasserstein GAN: https://arxiv.org/abs/1701.07875
- Discriminator Rejection Sampling: https://arxiv.org/abs/1810.06758
- Metropolis-Hastings Generative Adversarial Network: https://proceedings.mlr.press/v97/turner19a.html
- Latent reweighting, an almost free improvement for GANs: https://ieeexplore.ieee.org/document/9706934
- Discriminator optimal transport: https://proceedings.neurips.cc/paper_files/paper/2019/hash/8abfe8ac9ec214d68541fcb888c0b4c3-Abstract.html
- Refining Deep Generative Models via Discriminator Gradient Flow: https://arxiv.org/abs/2012.00780
- Your GAN is Secretly an Energy-based Model and You Should use Discriminator Driven Latent Sampling: https://arxiv.org/abs/2003.06060
- MMGAN: Generative Adversarial Networks for Multi-Modal Distributions: https://arxiv.org/abs/1911.06663
- Gaussian Mixture Generative Adversarial Networks for Diverse Datasets, and the Unsupervised Clustering of Images: https://arxiv.org/abs/1808.10356
- python packages used to evaluate PR / FID
Assignment 3
HowTo
Group sessions
Class presentations
For each assignment, each group is expected to give exactly one presentation (either a preliminary presentation or a final presentation).
- WARNING1: Timing will be extremely strict (i.e. you will be interrupted in the middle of your sentence).
- WARNING2: Focus on the novelty of your work (and not on what has been presented during class)
Preliminary presentations
- 5 minutes (~ 5 slides)
- Briefly & clearly state the problem you are working on (one slide).
- Present and compare approaches you are considering for solving the problem
- Describe what you have implemented (briefly)
- Discuss possible experiments and evaluation metrics.
- Present preliminary results if you have any.
Final presentations
- 5 minutes (~ 5 slides)
- Briefly & clearly state the problem you are working on (one slide).
- Present and compare approaches you have studied during this assignment.
- Describe what you have implemented (briefly)
- Discuss the evaluation metrics you have used.
- Show experimental results and discuss these results.
Reports
- 1 front page with the student names, the name of the team, and optionally the project title
- 5 extra pages max. (ref not included, figures included),
- pdf file named report.pdf
- has to be available on the git repository by the deadline (NO EMAIL!)
Reports should contain:
- a detailed list of what you have implemented, together with the name of the file in your repository containing the corresponding source code. If you have used external libraries to do something important, please mention it;
- a list of the experiments conducted, with a conclusion;
- anything interesting that you have learned from working on the assignment.
They should not contain:
- detailed description of the principles of the techniques seen in class;
- extensive code listings (brief pseudocode is OK).
FAQ
Can I develop approach X (a method that has not been discussed in class)?
You are very much encouraged to study & implement something that we have not discussed in class, as long as it is a solution to the problem we're trying to solve.
Typically, it is a good idea to compare an approach that we have discussed in class with one that we have not, so that your experience can benefit other students (and so that we can have new ideas for next year).
Is it mandatory to use the dataset or the metric specified by the professors?
If you can, you should. It's better if you run at least one experiment that is comparable with the experiments of the other groups working on the same problem.
However, comparative experiments are not always very insightful, so you are also encouraged to conduct other types of experiments using different datasets or different metrics to better understand how your approach behaves. Be creative.
Last year, one group made a random dataset generator so they could plot the performance of their algorithm w.r.t. the size of the dataset. From that plot, they concluded that their approach could never scale to any realistic dataset :). That's just an example, but it was good work, and it turned out that generating realistic random matrices was also an interesting problem.
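As a rough illustration of this kind of experiment, here is a minimal Python sketch that times an algorithm on random datasets of increasing size. Everything in it is hypothetical: the function names, the uniformly random ratings (which are not realistic) and the placeholder algorithm (mean rating per item) are assumptions, to be replaced by your own generator and method.

```python
import random
import time


def random_ratings(n_users, n_items, density, seed=0):
    """Return (user, item, rating) triples; each cell is kept with probability `density`."""
    rng = random.Random(seed)
    return [
        (user, item, rng.randint(1, 5))
        for user in range(n_users)
        for item in range(n_items)
        if rng.random() < density
    ]


def item_means(ratings):
    """Placeholder algorithm (an assumption): mean rating per item."""
    totals = {}
    for _, item, rating in ratings:
        total, count = totals.get(item, (0, 0))
        totals[item] = (total + rating, count + 1)
    return {item: total / count for item, (total, count) in totals.items()}


def time_vs_size(sizes, density=0.1):
    """Measure the wall-clock time of the placeholder algorithm for each dataset size."""
    results = []
    for n in sizes:
        data = random_ratings(n, n, density)
        start = time.perf_counter()
        item_means(data)
        results.append((n, len(data), time.perf_counter() - start))
    return results


for n, n_ratings, seconds in time_vs_size([50, 100, 200]):
    print(f"{n} users x {n} items, {n_ratings} ratings: {seconds:.4f} s")
```

You can then plot the measured times against the dataset sizes to see how your approach scales.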
Do I have to use Git? Can I use Jupyter Notebook instead?
Git and Jupyter Notebook are two very different tools. Yes, you have to use Git. You can also use Jupyter Notebook if you want.
Git is a tool to manage a source code repository. It is used to version your code (keep track of the changes) and collaborate with other developers (merge multiple concurrent versions of the code). You have to use it, because this is how I am going to access your code/report at the end of the project. You also have to use it because it's critical for you to know how to use it if you ever want to collaborate with someone, or handle a code base that contains more than a few lines of code.
Jupyter Notebook is an interactive browser-based code editor. It can be used to run a few lines of Python code in your browser, but it is not so convenient when you have a large code base, or when you want to run your code on a remote server or non-interactively. You can use it if you want, but I will not check it unless you explicitly refer to it in your report.
In case you want to use it, I recommend that you first write a Python module with all the important functions inside (see Python Modules). Then, you can import this module in your Jupyter Notebook and call the functions from there. This way, you can also write a simple non-interactive script so that you can run your program on a remote server.
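A minimal sketch of this layout, assuming a hypothetical file named experiment.py (the function names and the toy RMSE computation on synthetic ratings are placeholders for your own code):

```python
"""experiment.py: hypothetical module holding the important functions."""
import argparse
import math
import random


def make_ratings(n_ratings, seed=0):
    """Generate (truth, predicted) rating lists; stands in for real data loading."""
    rng = random.Random(seed)
    truth = [rng.uniform(1.0, 5.0) for _ in range(n_ratings)]
    predicted = [t + rng.gauss(0.0, 0.5) for t in truth]
    return truth, predicted


def rmse(truth, predicted):
    """Root mean squared error between two equally long, non-empty sequences."""
    if len(truth) != len(predicted) or not truth:
        raise ValueError("inputs must be non-empty and of equal length")
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(truth, predicted)) / len(truth))


def main(argv=None):
    """Non-interactive entry point; reads its options from the command line."""
    parser = argparse.ArgumentParser(description="Toy evaluation run")
    parser.add_argument("--n-ratings", type=int, default=100)
    parser.add_argument("--seed", type=int, default=0)
    args = parser.parse_args(argv)
    truth, predicted = make_ratings(args.n_ratings, args.seed)
    score = rmse(truth, predicted)
    print(f"RMSE on {args.n_ratings} synthetic ratings: {score:.3f}")
    return score


# Runs only when the file is executed as a script, not when it is imported
if __name__ == "__main__":
    main()
```

From a notebook you would then write `import experiment` and call `experiment.rmse(...)`; on a server you would run `python experiment.py --n-ratings 1000`.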
I don't have enough computing power.
You can either use Google Colab (Jupyter Notebooks hosted at Google) or access the GPU servers hosted at Lamsade (the computer science lab at Dauphine) using ssh. Your account has already been created; you just need your private key. Send me an email if you want it.
Then open a terminal and type:
    chmod 600 /path/to/private/key/id_rsa_<username>
    ssh <username>@ssh.lamsade.dauphine.fr -p 5022 -i /path/to/private/key/id_rsa_<username>
You need to replace `<username>` with your own username.
Then you can choose one of these machines:
- Ourasi: 20 cores / 40 threads / 32GiB RAM / 2x NVIDIA A6000
- Kaisertrot: 20 cores / 40 threads / 32GiB RAM / 2x NVIDIA A6000
- Boldeagle: 8 cores / 16 threads / 32GiB RAM / 2x Nvidia GTX 1080 Ti
- Readycash: 8 cores / 16 threads / 32GiB RAM / 2x Nvidia GTX 1080 Ti
These are shared resources, so please do not use more than one GPU at a time!
You can see who else is using the CPUs/GPUs with `htop`, `nvtop`, or `nvidia-smi`.
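One common way to make sure a Python program only sees one GPU is to set the CUDA_VISIBLE_DEVICES environment variable before CUDA is initialized. A minimal sketch, assuming your framework is CUDA-based (e.g. PyTorch or TensorFlow); the helper name and the GPU index 0 are illustrative, so pick a GPU that nvidia-smi shows as free:

```python
import os


def restrict_to_gpu(index):
    """Expose a single GPU to CUDA libraries initialized later in this process."""
    # Must run before the framework initializes CUDA (safest: before importing it)
    os.environ["CUDA_VISIBLE_DEVICES"] = str(index)


restrict_to_gpu(0)  # 0 is an arbitrary choice, check nvidia-smi first
print(os.environ["CUDA_VISIBLE_DEVICES"])
```

The same variable can also be set from the shell when launching a script, e.g. `CUDA_VISIBLE_DEVICES=0 python train.py` (train.py being a hypothetical script name).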
You can also transfer files from your computer to the servers using `scp` (see `man scp`).
How do I use scp to copy files to the lamsade servers?
To copy the local file `test.py` to your home directory on the lamsade servers:
    scp -i idfile -P 5022 test.py username@ssh.lamsade.dauphine.fr:.
It also works the other way around:
    scp -i idfile -P 5022 username@ssh.lamsade.dauphine.fr:test.py .
Notice the `.` at the end.
Explanations:
- `-i idfile`: you need to specify your private key for the authentication to succeed; do it with `-i`. See `man scp`.
- `-P 5022`: the ssh server at lamsade doesn't run on the standard ssh port (22) for security reasons, so you need to specify the actual port. For scp, you can do it with `-P` (notice the capital P). See `man scp`.
- `:.`: the path specification (the part after the colon `:`) is a standard unix path, so if it starts with a `/` it's an absolute path (i.e. relative to the root of the filesystem `/`), otherwise it is relative to the current directory, which in this case is the directory in which you end up when you log in using ssh (your home directory). Recall that on unix `.` always refers to the current directory (and `..` to the parent directory, hence the command `cd ..`).
- `ssh.lamsade.dauphine.fr`: the remote server specification should always be a valid DNS name (or an IP address). In this case, it refers to the ssh server of the subdomain lamsade of the dauphine domain of the fr DNS zone.
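If you want to script these copies (e.g. to fetch results after each run), the pieces explained above can be assembled in Python. This is only a sketch: the helper names `scp_command` and `remote` are made up for illustration, and `idfile` and `username` are the same placeholders as in the commands above.

```python
import shlex


def scp_command(source, destination, identity_file, port=5022):
    """Build the scp argument list: -i selects the private key, -P the port."""
    return ["scp", "-i", identity_file, "-P", str(port), source, destination]


def remote(path, user="username", host="ssh.lamsade.dauphine.fr"):
    """Format a remote location as user@host:path (relative paths start at the home directory)."""
    return f"{user}@{host}:{path}"


upload = scp_command("test.py", remote("."), "idfile")
download = scp_command(remote("test.py"), ".", "idfile")

# Print the commands instead of running them
print(shlex.join(upload))
print(shlex.join(download))
```

To actually run a command, pass the list to `subprocess.run(upload, check=True)`.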
Is there a way to simplify the process of logging in and copying files using ssh/scp?
Yes, you can configure your ssh client to remember all the important information (key, username, port, etc.), but the exact way to do it depends on the ssh client you are using.
If you are running unix locally, your ssh client is the program that is executed when you type the command `ssh`. It is configured using various config files that are located in the `.ssh` directory in your home directory.
You can start by copying your private key into your (local) `.ssh` directory. Ssh will find it and try it automatically when you log in, provided it has one of the default names (such as `id_rsa`); for a key with another name, add an `IdentityFile` line to the config file described next.
You can also specify the port and the complete DNS name in a file called `config` in your `.ssh` directory.
Mine contains this:
    Host lamsade
        Hostname ssh.lamsade.dauphine.fr
        User bnegrevergne
        Port 5022
so I can just type `ssh lamsade` and `scp file lamsade:.`
If you are using another ssh client, I am sure you can do this as well, but I don't know how. If you find out, tell me how, and I'll put it here for the others.