  1. CS858 F19 Paper #19 Reviews and Comments
  2. ===========================================================================
  3. Paper #19 With Great Training Comes Great Vulnerability: Practical Attacks
  4. against Transfer Learning
  5.  
  6.  
  7. Review #19A
  8. ===========================================================================
  9. * Reviewer: Laura Graves <laura.graves@uwaterloo.ca>
  10. * Updated: 18 Oct 2019 2:27:02pm EDT
  11.  
  12. Paper summary
  13. -------------
  14. The authors show that models trained using transfer learning from a previously trained model such as InceptionV3 are vulnerable to black-box attacks crafted against the teacher model. They show that multiple common teacher models lead to vulnerable student models, and they test their findings experimentally.
  15.  
  16. Transfer learning is when you take a fully trained model and replace the last $k$ layers with new layers and retrain to work on your own chosen task with new data. The intuition is that the low and intermediate layers learn to distinguish and separate the important structures in the features and the final layers combine those in a semantically important way, so by replacing the final layers only we can take advantage of the highly trained early and middle layers to get an effective model with barely any training data. By crafting adversarial examples to target a hidden layer instead of an output layer, the authors show that they can leverage the similarity in the early and middle layers to effectively craft adversarial examples.
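To make the mechanism concrete, here is a minimal, hypothetical PyTorch sketch of this kind of hidden-layer mimicry attack. The use of torchvision's VGG16 as the stand-in teacher, the cut at the end of its convolutional block, and the plain squared-error penalty standing in for the paper's DSSIM budget are my own simplifying assumptions, not the authors' exact setup (per Review #19E, the authors solve their own optimization problem with an Adadelta optimizer).

    import torch
    import torchvision.models as models

    # Hypothetical setup: VGG16 stands in for the publicly known teacher, and the
    # end of its convolutional block stands in for the cut layer K shared with the
    # student. Inputs are assumed to be image tensors scaled to [0, 1].
    teacher = models.vgg16(weights="IMAGENET1K_V1").eval()   # torchvision >= 0.13
    for p in teacher.parameters():
        p.requires_grad_(False)
    layers_up_to_K = teacher.features

    def craft_adversarial(source, target, steps=2000, budget_weight=100.0):
        """Perturb `source` so its layer-K activations mimic those of `target`."""
        with torch.no_grad():
            target_repr = layers_up_to_K(target)
        x = source.clone().requires_grad_(True)
        opt = torch.optim.Adadelta([x], lr=1.0)
        for _ in range(steps):
            opt.zero_grad()
            mimic_loss = (layers_up_to_K(x) - target_repr).pow(2).sum()
            # crude stand-in for the DSSIM perturbation budget used in the paper
            budget_loss = budget_weight * (x - source).pow(2).sum()
            (mimic_loss + budget_loss).backward()
            opt.step()
            x.data.clamp_(0.0, 1.0)   # keep pixel values in a valid range
        return x.detach()

If the student keeps the copied layers frozen (the deep-layer feature extractor case), an input whose layer-K representation matches the target's will be classified like the target regardless of how the new final layers were retrained, which is exactly the similarity described above.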
  17.  
  18. Further, they show that while these attacks can be defended against, there's a significant performance cost to doing so. This is a further illustration of the security/performance tradeoff we've seen many times so far.
  19.  
  20. Problems in Understanding
  21. -------------------------
  22. I think everything in this paper was communicated very clearly!
  23.  
  24. Strengths
  25. ---------
  26. I really liked a lot of this paper. Here are my highlights:
  27.  
  28. First, the black box attack is effective with \textbf{no other queries}! I think this is a huge plus to this paper that they don't draw a lot of attention to. Instead of having to create a substitute model and rely on transferable adversarial examples, we can get it right the first and only time we do a query.
  29.  
  30. Second, the use of DSSIM instead of an $\ell_p$ norm for image difference is a great benefit. I'm sure we don't need to hear more about how poor $\ell_p$ norms are for a similarity metric, and it's great that they used this instead. It feels like they had a really good attention to detail in this paper - there were no obvious bad choices in metrics, model selection, attack methods, etc.
  31.  
  32. Third, the authors display a really interesting chain of attacks and complications. The attack is most effective around the layer that the model takes over from, so use that layer. If you don't know that layer, that's fine, here's a method for finding it. If you don't know the teacher model, that's fine, here's a method for finding it. If they're using dropout as a defense, that's fine, we can add our own dropout layer and attack it anyway. Reading the paper felt like a series of back-and-forth points and counterpoints that was compelling and interesting.
  33.  
  34. Another strength that I felt compelled to mention is that in section 4.4 the authors present an attack method they evaluated that completely failed. I really like this - it's great to see "by the way, we tried this and it didn't work, so don't bother".
  35.  
  36. Weaknesses
  37. ----------
  38. I don't have many complaints at all here! My one nitpick is in section 4.3 when the authors discuss the deep-layer feature extractor for the Iris task. They claim nearly a 100% success rate because the model is so sensitive to small variations that anything out of the norm gets classified as "unknown". I'm not sure I can really call this a success. A misclassification, at least a desirable one, would be to add perturbations such that it's "successfully" classified as a different class - in my view, an "unknown" classification isn't a good target, certainly not enough to call it an attack with 100% success. In another task this may be valid - maybe perturbing an audio recording so that voice recognition systems don't recognize it is worthwhile - but here it's not.
  39.  
  40. Second, I'm not sure about the motivation here. I think there's a natural tradeoff between the need for privacy and the budget the involved party is willing to put into training - is it realistic to think that a party with a privacy-sensitive dataset is using an off-the-shelf teacher model instead of training something specific to their task that's focused on privacy? Maybe I'm just not imaginative enough, but it seems like there's a pretty loosely defined real-life threat here.
  41.  
  42. Opportunities for future research
  43. ---------------------------------
  44. I think some more evaluation of different defense techniques could be valuable. It's nice that the authors analyzed a couple different tactics, but surely there's more to look at here!
  45.  
  46.  
  47.  
  48. Review #19B
  49. ===========================================================================
  50. * Reviewer: Thomas Humphries <t3humphr@uwaterloo.ca>
  51. * Updated: 18 Oct 2019 4:58:10pm EDT
  52.  
  53. Paper summary
  54. -------------
  55. The authors introduce a new method of generating adversarial examples for models that were trained using transfer learning. They take advantage of commonly available white box access to the teacher models to create adversarial examples that transfer well to student models. This technique allows the adversary to attack a student model with only black box access and as little as one query. A fingerprinting technique is proposed to help identify the teacher model in situations where this relationship might not be public. Experimental evaluations show under what circumstances the attack is most successful and defences are proposed for those cases. The most successful defense involves adding a term to the objective function that penalizes student models that are too similar to the teacher.
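As a rough illustration of that penalty term, here is a minimal PyTorch-style sketch of one plausible reading of the neuron-distance idea. The hinge form, the margin and weight values, and the helper names are my own assumptions for illustration, not the paper's exact objective.

    import torch
    import torch.nn.functional as F

    def training_step(student_features, student_head, teacher_features,
                      x, y, optimizer, margin=10.0, lam=0.1):
        """One student training step with a penalty for staying too close to
        the teacher's layer-K representation (hypothetical formulation)."""
        optimizer.zero_grad()
        s_repr = student_features(x).flatten(1)        # student's layer-K output
        with torch.no_grad():
            t_repr = teacher_features(x).flatten(1)    # frozen teacher's layer-K output
        task_loss = F.cross_entropy(student_head(s_repr), y)
        distance = (s_repr - t_repr).norm(dim=1).mean()
        # hinge: only representations that stay within `margin` of the teacher are penalized
        loss = task_loss + lam * F.relu(margin - distance)
        loss.backward()
        optimizer.step()
        return loss.item()

The tension discussed later in the reviews is visible here: increasing `margin` or `lam` pushes the student further from the teacher, but at some point the task loss, and hence student accuracy, starts to suffer.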
  56.  
  57. Strengths
  58. ---------
  59. Overall I thought this paper made very realistic assumptions. Any assumption of the adversary’s knowledge was supported by examples of real-world scenarios or justified with strong arguments.
  60.  
  61. I really liked that they used something other than the Lp norms to measure human perceptibility. The DSSIM metric seems to be a solid step in the right direction.
  62.  
  63. The attack is very intuitive and easily adapts to both targeted and untargeted examples.
  64.  
  65. The authors evaluated the weaknesses of both the attack and defense mechanisms very thoroughly. They also addressed the anomalies in their experimental findings and explored reasoning for them.
  66.  
  67. Weaknesses
  68. ----------
  69. Overall I found this paper to be very well written but a few minor weaknesses were:
  70.  
  71. They evaluated only one technique when assessing the transferability of attacks crafted on teacher models. The question of transferability seems too complicated to be evaluated using just one approach.
  72.  
  73. No averaging over runs and no cross-validation techniques are mentioned in the experimentation.
  74.  
  75. This research is again restricted to the image classification domain.
  76.  
  77. Opportunities for future research
  78. ---------------------------------
  79. It would be interesting to evaluate transfer learning under other types of attacks such as membership inference attacks. Under the assumptions of prior knowledge provided by the authors, one could evaluate the difference in response between the teacher and student on a given sample. In theory a student model should respond more closely to the teacher on inputs not in the training set but show a more distinct response on inputs it was trained on.
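A rough sketch of how one might test that hypothesis follows. This is purely the reviewer's proposed direction, not anything evaluated in the paper, and since teacher and student generally have different output classes, the sketch compares their shared internal (layer-K) representations instead of their outputs; that comparison is only informative when the student has fine-tuned those layers (for a pure deep-layer feature extractor they are identical by construction).

    import torch

    def representation_shift(teacher_features, student_features, x):
        """How far the student's layer-K representation has drifted from the
        teacher's on input x (a hypothetical membership signal)."""
        with torch.no_grad():
            t_repr = teacher_features(x).flatten(1)
            s_repr = student_features(x).flatten(1)
        return (s_repr - t_repr).norm(dim=1).mean().item()

    def guess_membership(teacher_features, student_features, x, threshold):
        # Larger drift on x than on typical held-out inputs is taken as (weak)
        # evidence that x was part of the student's training set.
        return representation_shift(teacher_features, student_features, x) > threshold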
  80.  
  81.  
  82.  
  83. Review #19C
  84. ===========================================================================
  85. * Reviewer: Sung-Shine Lee <s469lee@uwaterloo.ca>
  86. * Updated: 20 Oct 2019 1:03:26pm EDT
  87.  
  88. Paper summary
  89. -------------
  90. The paper presents a type of attack that targets student models trained on top of a teacher model using transfer learning. The attack assumes white-box access to the teacher and uses that information to generate adversarial examples for the black-box student model. The attack perturbs an image of a specific class so that it mimics the neuron output at the K-th layer (where the student starts to learn) for another class, under a perturbation budget. Furthermore, the authors present a method to identify the specific teacher used to train a student from a set of teachers. Finally, defenses such as random dropout, neuron distance injection, and ensembles of models are discussed.
  91.  
  92. Problems in Understanding
  93. -------------------------
  94. The selection of the perturbation budget P is unclear to me: what does it mean that P = 0.003 is a safe threshold? What is the definition of "safe" here?
  95.  
  96. Strengths
  97. ---------
  98. - The authors tested transfer learning both between similar tasks and between very different tasks.
  99. - The paper revisits the attack model when describing the experiments and makes it very clear why the conducted experiments stay within the attack model stated previously.
  100. - The attack could be launched on randomly selected source-target pairs and demonstrated effectiveness on methods that transfer learning services recommend to customers.
  101. - The use of DSSIM as a perturbation metric instead of L2 is very interesting.
  102.  
  103. Weaknesses
  104. ----------
  105. - The transfer learning in the flower case: as it was fully retrained, I didn't see the rationale for transfer being beneficial; it should be compared with the baseline (training from scratch).
  106. - The results for the transfer are not really surprising: the more alike the tasks are and the fewer layers are retrained, the more effective a misclassification attack crafted on the teacher will be against the student. In a sense, the student is not really black-box: only the retrained layers are black-box.
  107. - The intuition behind injecting neuron distance while finding a new local optimum is not presented, making it harder to understand.
  108.  
  109. Opportunities for future research
  110. ---------------------------------
  111. - The student models in the paper retrain the last couple of layers without changing the number of layers. It would be interesting to see how the results are affected when different numbers of layers are added after the teacher's model.
  112. - While not directly related, attacks on transfer learning seem interesting in general: what happens to a membership inference attack when it is applied to a student model?
  113.  
  114.  
  115.  
  116. Review #19D
  117. ===========================================================================
  118. * Reviewer: Rasoul Akhavan Mahdavi <r5akhava@uwaterloo.ca>
  119. * Updated: 20 Oct 2019 2:01:08pm EDT
  120.  
  121. Paper summary
  122. -------------
  123. Transfer learning is becoming mainstream among ML platforms and consumers with limited computing power, but the nature of this method introduces new opportunities for adversaries to make student models misclassify. The paper starts by giving an introduction to transfer learning and the different methods that exist. Then attacks on transfer learning are described, with reasonable assumptions that fit the transfer learning setup and actually exploit the specific vulnerabilities that arise from the use of this method. The misclassification attacks are formulated as an optimization problem and experiments are conducted on 4 different datasets that have been learned with the different transfer learning methods. The experiments show the effectiveness of the attacks on 2 of the 3 learning methods. Assumptions about knowledge of the learning method are also relaxed in the next subsection to achieve other attack scenarios as well. Finally, defenses are discussed and evaluated that achieve partial success in defending against the attacks.
  124.  
  125. Strengths
  126. ---------
  127. - The paper was very well written and I easily and enjoyably read through it. It had a very good pace, making assumptions and relaxing them one by one in upcoming sections, resulting in a very logical and understandable flow for the paper.
  128. - The assumptions under which the claims hold are very well specified. No exaggeration is made of the claims, and many of the flaws I could think of at any point were addressed somewhere later in the paper.
  129. - A very well-picked problem, extracted from a real-life scenario.
  130.  
  131. Weaknesses
  132. ----------
  133. - What the paper calls a targeted attack isn't actually a targeted attack: here the target is one specific image, not a specific class. While in the eyes of the attacker this is a tighter constraint, from the viewpoint of defense the methods are only defending against a much harder attack than a standard targeted attack, so the attacker might still succeed against the defenses with a more relaxed constraint.
  134.  
  135. - No comment is made on how to craft the fingerprinting image, how hard it is, or the assumptions behind it, which are probably white-box access to the teacher models. The trivial approach might be brute force, which might not be that efficient.
  136.  
  137. - Only adversarial misclassification is considered a threat. Many other interesting scenarios can be considered dangerous, for example membership inference of the data used to train the student.
  138.  
  139. - The benefit of Injecting Neuron Distance relative to Full Retraining is vague. Effort has to be put into solving the optimization to inject neuron distance, but why not put that same effort into full retraining instead? As the results show, full retraining is not only robust against the attacks, it also improves accuracy by definition.
  140.  
  141. Opportunities for future research
  142. ---------------------------------
  143. - Instead of choosing one picture as the target, all pictures of that class could be chosen as targets, and instead of trying to exactly mimic the hidden layer's behavior, we could try to generate a sample that is indistinguishable at the specific layer using a GAN.
  144. - Can we design other types of attacks on the concept of transfer learning, such as membership inference? Gaining information about the members of the student's training dataset is interesting, especially if the training dataset is known (for example, to the creator of the teacher model), and it can pose a threat to users of student models.
  145.  
  146.  
  147.  
  148. Review #19E
  149. ===========================================================================
  150. * Reviewer: Nils Hendrik Lukas <nlukas@uwaterloo.ca>
  151. * Updated: 20 Oct 2019 9:11:11pm EDT
  152.  
  153. Paper summary
  154. -------------
  155. The paper proposes a new attack against student models learned through transfer learning from a publicly known teacher model. The authors present a method to enhance adversarial example generation for a student queryable only through black-box access, by verifying whether the teacher is fooled by the crafted example. They distinguish between three types of transfer learning and show that the model with the highest similarity to the teacher is most vulnerable to attacks. Their defenses either add randomness to the prediction via dropout or try to maximize the distance of the student model from the teacher (the neuron-distance technique).
  156.  
  157. Problems in Understanding
  158. -------------------------
  159. -
  160.  
  161. Strengths
  162. ---------
  163. + Many novel ideas, e.g. using DSSIM instead of the L2 norm to generate adversarial examples, checking which teacher is associated with a student model, giving graphs on vulnerability per layer
  164.  
  165. + Very well written, good introduction that covers use-case and basic knowledge required to understand the rest
  166.  
  167. + Many defenses were evaluated and their strengths and weaknesses mentioned properly
  168.  
  169. + Highly practical, since the authors evaluate their idea on public infrastructure (google cloud etc.) and used state-of-the-art models
  170.  
  171. Weaknesses
  172. ----------
  173. - The authors could have given the L2 norms for the DSSIM budget usage to relate them to other works in the field
  174.  
  175. - The defenses did not work most of the time, e.g. against non-targeted adversarial examples. Also, confidence intervals are missing, and for dropout we have a rather large drop in accuracy.
  176.  
  177. - More explanation on the generation of the adversarial examples could be given. The authors solve their own optimization problem with an Adadelta optimizer, but do they subsequently select the best examples or have any post-processing? Are the source images randomly chosen or somehow selected?
  178.  
  179. - Too little on the fingerprinting approach. I would have liked to see a better evaluation instead of just settling on the easiest case.
  180.  
  181. Opportunities for future research
  182. ---------------------------------
  183. Do a survey on how many models in the wild are actually linkable to one of a set of public teacher models and inform the host of their vulnerability.
  184.  
  185. Find a better approach to link models than the current fingerprinting approach. Theirs seems to work only in the simplest case, where the teacher model is not modified at all.
  186.  
  187.  
  188.  
  189. Review #19F
  190. ===========================================================================
  191. * Reviewer: John Abraham Premkumar <jpremkum@uwaterloo.ca>
  192. * Updated: 20 Oct 2019 9:14:50pm EDT
  193.  
  194. Paper summary
  195. -------------
  196. In this paper, Wang et al. hypothesize that the use of the teacher-student model of learning could expose certain vulnerabilities, and that these could be exploited to perform misclassification attacks against student models.
  197.  
  198. They detail their attack, which uses a targeted misclassification at a certain layer (usually the last frozen layer) of the teacher to induce a similar misclassification in the student. They also show that they can perform these attacks when the teacher is not known, by performing teacher fingerprinting.
  199.  
  200. In their experiments, their attacks are more effective against deep layer transfer learning methods, as these preserve more similarities between the teacher and student.
  201.  
  202. The authors go on to try and develop defenses against their own attacks, exploring two methods that use dropout and injection of neuron distances respectively, of which the latter is the one they recommend after experimentation. They mention that ensemble defenses could also be used but need more work. They conclude by summarizing their paper.
  203.  
  204. Strengths
  205. ---------
  206. 1: They test the attacks on multiple popular deep learning platforms, including Google Cloud ML, Microsoft Cognitive Toolkit and PyTorch.
  207.  
  208. 2: They choose to use another metric instead of Lp distance, called DSSIM, which (according to the authors) is an objective assessment metric for image quality that closely matches what a human would perceive as the quality of an image.
  209.  
  210. 3: They cover all three transfer learning methods in their experiments through choice of student models
  211.  
  212. Weaknesses
  213. ----------
  214. 1: The non-targeted attacks still require them to have target images and perform multiple targeted attacks
  215.  
  216. 2: They use only image classification tasks in their tests
  217.  
  218. 3: Their attack is not effective for all transfer methods
  219.  
  220. 4: The attack against students with an unknown teacher assumes that they know all possible teacher models being used
  221.  
  222. Opportunities for future research
  223. ---------------------------------
  224. Investigate whether adversarially training the teacher model (or the student models individually) would mitigate these attacks.
  225.  
  226.  
  227.  
  228. Review #19G
  229. ===========================================================================
  230. * Reviewer: Lucas Napoleão Coelho <lcoelho@uwaterloo.ca>
  231. * Updated: 20 Oct 2019 9:32:56pm EDT
  232.  
  233. Paper summary
  234. -------------
  235. The paper proposes an adversarial example attack targeting models trained by transfer learning, where the first N layers of the final model are taken from a pre-trained model. The attacker crafts adversarial examples by adding minimal perturbation to a source image, effectively approximating a target image's representation at the last layer the student shares with the teacher's model. The attack is most effective when all but the last layer of the student model is the same as the teacher's model, but it can also be applied to other cases. The authors propose a couple of possible defences, but neither of them is entirely effective.
  236.  
  237. Strengths
  238. ---------
  239. The idea of the attack is novel and simple, and it obtains great results.
  240.  
  241. The authors explore a different similarity metric as a perturbation budget. The replacement is an excellent initiative, as we have seen that the $l_p$ norm is inconsistent: it is possible to have both humanly similar and dissimilar results within the same $l_p$ bound.
  242.  
  243. Although its results are not impressive, the second proposed defence is very interesting and I believe it may lead to more interesting future work.
  244.  
  245. Weaknesses
  246. ----------
  247. I don't like the use of non-targeted nomenclature in the paper. It creates confusion since the attack requires a target by nature.
  248.  
  249. The proposed defences are weak. The authors themselves recognize how bad the dropout defence is. The second defence is far more promising; however, it is still mostly ineffective against untargeted attacks.
  250.  
  251. The fingerprinting attack proposed is also hardly effective in practice as it requires white-box access to the defender's network. If that is not the case, the defender can identify a fingerprinting query and mask its results.
  252.  
  253. I assume that all the teacher models used are vulnerable to adversarial examples themselves. The authors do not discuss the consequences of, for instance, adversarially training the teacher model.
  254.  
  255. Opportunities for future research
  256. ---------------------------------
  257. While this work presents an intrinsically targeted adversarial example attack, I believe it would be possible to adapt it into a genuinely non-targeted attack by "stealing" the last layers of the target model.
  258.  
  259. Evaluate the effect of performing adversarial training to the teacher model.
  260.  
  261. The fact that the second proposed defence was able to find a better local optimum makes me think that similar techniques could be used to improve the accuracy of neural networks in general.
  262.  
  263.  
  264.  
  265. Review #19H
  266. ===========================================================================
  267. * Reviewer: Iman Akbari <iakbariazirani@uwaterloo.ca>
  268. * Updated: 20 Oct 2019 10:16:59pm EDT
  269.  
  270. Paper summary
  271. -------------
  272. The paper mostly revolves around the vulnerabilities of “transfer learning”, the use of highly generalized public teacher models as starting points that are tuned for a specific application by limited training on private datasets. The idea is that the white-box nature of the teacher models can be exploited to perform misclassification attacks on student models, even though the student is black-box to the attacker.
  273.  
  274. The authors use the white-box access to the initial layers of the teacher model and a very small set of “target images” i.e. inputs from other classes to generate perturbations that would make the output of the first K layers similar to that of the target inputs. Hence, the following layers would also receive these “wrong” inputs and continue to misclassify the input. This attack is simple but quite effective as evaluated on VGG-16 and ResNet50 with student models trained on different datasets e.g. CASIA IRIS, GTSRB stop signs, VGG flowers, etc.
  275.  
  276. After evaluating the attack success for targeted and non-targeted attacks using different numbers of layers and different perturbation budgets, the authors provide a method for fingerprinting teacher models and identifying the model used to train the student. Then multiple defence proposals are provided, most notably modifying the “frozen” layers of the student via an optimization problem so that the impact on accuracy is very small. The defence seems to be pretty effective, as showcased in Figure 8, for impeding targeted attacks without a significant blow to model accuracy, yet it doesn’t seem to work as well against non-targeted attacks. Dropout and ensemble learning are also proposed as potential mitigations.
  277.  
  278. Problems in Understanding
  279. -------------------------
  280. Section 1: “Lack of diversity has amplified the power of targeted attacks in other contexts, i.e. increasing the impact of targeted attacks on network hubs” -> More explanation would have been nice
  281.  
  282. It is a bit unusual to me that the adversarial example would very often be misclassified as a class not present in the set of target images used in the attack.
  283.  
  284. Strengths
  285. ---------
  286. - Important subject, as tuning a well-known generalized public model for a specific use has been increasing over the years and has become standard practice
  287. - Strong attack, as it requires very few queries to the victim (student) model and can pose a significant threat to ordinary defences.
  288. - The attack is quite simple, yet very effective according to the evaluations
  289. - Good general flow and structure
  290.  
  291. Weaknesses
  292. ----------
  293. - Dropout is presented as a defence but doesn’t seem to have very good results, given that the classification accuracy and the adversarial attack success rate seem to drop together. Also, a ten percent drop in accuracy matters much more for the overall model than it does to the attacker.
  294. - The use of ensemble learning is not evaluated as a defence strategy
  295. - Transferability of teacher model’s adversarial examples could have been compared against the proposed method (in the non-targeted case)
  296.  
  297. Opportunities for future research
  298. ---------------------------------
  299. # Other student-teacher arrangements
  300.  
  301. The attack assumes the student-teacher model arrangement where the student training initializes the weights to the teacher’s weights. This is different for example from the set-up in Papernot et al’s PATE student-model teacher-model dynamics, where the teacher is simply used to label a public dataset. It would be interesting to explore the vulnerabilities of using public generalized models as teachers in that kind of arrangement too.
  302.  
  303. # Ensemble defence weaknesses
  304.  
  305. In case an ensemble of orthogonal models is used as a defence, can we still develop ways to generate adversarial examples? After all, each model in the ensemble is a student model. Furthermore, detecting such an ensemble, and the teachers used to create it, in a black-box setting is worth studying.
  306.  
  307. # Generalizing to outside the transfer learning domain
  308.  
  309. Since the initial layers capture basic heuristics in the image (curves, certain areas, etc.), it would not be very strange if independently trained models had similar initial layers. Hence, an attacker might even be able to use a locally trained model instead of a teacher to craft adversarial examples against an arbitrary black-box model (not necessarily trained with transfer learning) using the method in Section 3.
  310.  
  311.  
  312.  
  313. Review #19I
  314. ===========================================================================
  315. * Reviewer: Viet Hung Pham <hvpham@uwaterloo.ca>
  316. * Updated: 20 Oct 2019 10:22:33pm EDT
  317.  
  318. Paper summary
  319. -------------
  320. The paper introduces an adversarial example attack targeting models which have been trained using transfer learning. The attack uses a target image to estimate the activation at layer K, which is common between the student model and the teacher model. By optimizing the perturbation on an original image such that the activation at layer K matches that of the target image, the attacker makes the student predict the perturbed image with the same label as the target image. The paper also proposes various defenses that mitigate the effect of targeted attacks, but the success rates of untargeted attacks remain very high.
  321.  
  322. Problems in Understanding
  323. -------------------------
  324. The paper is pretty easy to read.
  325.  
  326. Strengths
  327. ---------
  328. + The idea of using a target image as a way to estimate the activation of the hidden layer K is effective (given that the student model uses the first K layers), as it removes the need to query the model many times to estimate the newly trained layers after layer K in the student model.
  329.  
  330. + It is interesting to see that a much more complex distance metric (DSSIM) could be used instead of the L2 distance.
  331.  
  332. + The paper presents not only the attack and several defenses but also tries some trivial attacks to show that they would not have worked.
  333.  
  334. + It is good that the paper also introduces a way to identify the candidate teacher model, as it helps mitigate one of the weaknesses of the attack technique, namely that there are many candidate teacher models available.
  335.  
  336. Weaknesses
  337. ----------
  338. - The paper mentions that the attack needs only one query to the student model with the generated adversarial example and no additional queries to prepare the attack. However, this is not quite true, because the attacker needs to make 2 queries with the original source and target images to make sure they are correctly classified by the student model. For example, if the student model misclassified the target image in Figure 2 as a cat, and the attacker then perturbed the source image to match the hidden layer K output, the perturbed source image would be classified as a cat, which in this case is correct and has no adversarial behavior. If this is the case, we can have a detector monitor layer K and flag queries with very similar layer K activations (see the sketch after this list).
  339.  
  340. - The paper compares the DSSIM distance with the L2 distance, which does not quite make sense since they are completely different metrics. DSSIM is a little more complex than the L2 distance; the paper could report whether it is more difficult to optimize with DSSIM compared to L2.
  341.  
  342. - Overall, the success of the attack is very dependent on the assumption that the student model uses the Deep-layer Feature Extractor approach (i.e., uses most of the teacher model). The less the student depends on the teacher model, the more the attack breaks down. And from Table 1, most of the models that perform a different task would be able to use the Full Model Fine-tuning technique without much accuracy penalty. The paper should report the attack success rate for all student models when using the fine-tuning approach.
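To illustrate the detector idea raised in the first weakness above, here is a minimal sketch. It is purely this review's suggestion, not a defense from the paper; the class name, thresholds, and history size are made up for illustration. The deployed student remembers the layer-K activations of recent queries and flags a new query whose activation nearly duplicates an earlier one even though the raw inputs clearly differ.

    import torch
    from collections import deque

    class LayerKDetector:
        """Hypothetical monitor sitting on the student's shared (frozen) layers."""
        def __init__(self, layers_up_to_K, act_threshold, input_threshold, history=1000):
            self.features = layers_up_to_K
            self.act_threshold = act_threshold       # "activations suspiciously close"
            self.input_threshold = input_threshold   # "...but the raw images clearly differ"
            self.seen = deque(maxlen=history)        # recent (activation, input) pairs

        def check(self, x):
            with torch.no_grad():
                act = self.features(x).flatten()
            flat_x = x.flatten()
            suspicious = any(
                torch.dist(act, prev_act).item() < self.act_threshold
                and torch.dist(flat_x, prev_x).item() > self.input_threshold
                for prev_act, prev_x in self.seen
            )
            self.seen.append((act, flat_x))
            return suspicious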
  343.  
  344. Opportunities for future research
  345. ---------------------------------
  346. + It would be interesting to use active learning to increase the size of the student's training data, so we could use the teacher as an oracle to retrain the first K layers. This way the weights of the first K layers would be different, which might give better protection against untargeted attacks.
  347.  
  348. + Model distillation could also be a defense against this attack, as it would consolidate the weights of the first K layers and break the attack.
  349.  
  350.  
  351.  
  352. Review #19J
  353. ===========================================================================
  354. * Reviewer: Vineel Nagisetty <vnagisetty@uwaterloo.ca>
  355. * Updated: 20 Oct 2019 10:37:46pm EDT
  356.  
  357. Paper summary
  358. -------------
  359. In this paper, Wang et al. investigate the effects of adversarial attacks in the transfer learning setting - where a student model inherits most of its parameters from a well known teacher model. The authors show that these student models are highly vulnerable to adversarial attacks by an adversary with knowledge of the teacher model and the transfer learning process. Finally, they devise countermeasures and evaluate their efficacy on these attacks.
  360.  
  361. Problems in Understanding
  362. -------------------------
  363. In the $\textit{Injecting Neuron Distances}$ (6.2), does the process require first completing the training normally and then retraining the Kth layer? Why not just shift the weights of that layer (with mean 0) and then train the student layers?
  364.  
  365.  
  366. What are the features DSSIM uses in measuring image similarity? Is it more costly than using $L_p$ norm?
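For background (general facts about the metric, not details taken from the paper): SSIM compares local luminance, contrast, and structure statistics over sliding windows, and DSSIM is commonly defined as (1 - SSIM) / 2. Being a windowed computation, it costs somewhat more than a plain $L_p$ norm, though both are cheap next to a forward/backward pass through the network. A minimal sketch using scikit-image (my choice of library, not the authors'):

    import numpy as np
    from skimage.metrics import structural_similarity

    def dssim(img_a: np.ndarray, img_b: np.ndarray) -> float:
        """Structural dissimilarity of two float images in [0, 1], channels last."""
        # channel_axis requires scikit-image >= 0.19
        ssim = structural_similarity(img_a, img_b, channel_axis=-1, data_range=1.0)
        return (1.0 - ssim) / 2.0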
  367.  
  368. Strengths
  369. ---------
  370. 1. The designed attack is intuitive, novel and very applicable to practical settings. I like how the authors applied them to real MLaaS provider models and showed their results.
  371.  
  372. 2. The fingerprinting technique to find the teacher model is ingenious.
  373.  
  374. 3. Using DSSIM instead of $L_2$ norm seems great. The images from Figure 3 show how using DSSIM for imperceptibility is better for the given budget.
  375.  
  376. 4. Applying perturbations to different layers of the teacher model (Figure 5) is a great way to find the Kth layer.
  377.  
  378. Weaknesses
  379. ----------
  380. 1. The defenses are not able to prevent non-targeted attacks, often leaving classification accuracy lower than the success rate of these attacks.
  381.  
  382. 2. The information on the practical attacks in Section 5.2 could have been presented better.
  383.  
  384. 3. Not much information is given as to the computation tradeoff using DSSIM versus $L_2$ norm. It would be useful to know if optimizing for DSSIM for a given budget requires similar resources as the equivalent amount of $L_2$ norm.
  385.  
  386. Opportunities for future research
  387. ---------------------------------
  388. A natural extension of this project would be to defend against transfer learning attacks. One idea would be to include multiple teachers and add a dense layer that combines the outputs of layer N-K from all teachers into the student model. The input to the student model would be sent to all the teacher models (their first N-K layers), then aggregated in the dense layer before prediction. An attack that assumes one teacher model should, in theory, not work on this (see the sketch below).
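Here is a toy sketch of that multi-teacher arrangement (my own construction to illustrate the reviewer's idea, not something the paper evaluates). It assumes each teacher exposes a frozen feature extractor and only the dense head is trained on the student's data.

    import torch
    import torch.nn as nn

    class MultiTeacherStudent(nn.Module):
        """Student whose features come from several frozen teachers at once."""
        def __init__(self, teacher_feature_extractors, feature_dims, num_classes):
            super().__init__()
            self.teachers = nn.ModuleList(teacher_feature_extractors)
            for t in self.teachers:
                for p in t.parameters():
                    p.requires_grad = False          # keep all teacher layers frozen
            self.head = nn.Sequential(
                nn.Linear(sum(feature_dims), 256),
                nn.ReLU(),
                nn.Linear(256, num_classes),
            )

        def forward(self, x):
            feats = [t(x).flatten(1) for t in self.teachers]
            return self.head(torch.cat(feats, dim=1))

An attacker who mimics the hidden-layer activations of a single known teacher now controls only part of the concatenated feature vector, which is the intuition behind the suggestion above.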
  389.  
  390.  
  391.  
  392. Review #19K
  393. ===========================================================================
  394. * Reviewer: Andre Kassis <akassis@uwaterloo.ca>
  395. * Updated: 20 Oct 2019 11:08:46pm EDT
  396.  
  397. Paper summary
  398. -------------
  399. The paper studies deep learning models learned via transfer learning and their vulnerabilities to adversarial examples. The paper argues that the centralization of model training, which allows models to be trained from publicly known teachers, results in vulnerabilities to misclassification attacks that leverage the public knowledge of the teacher.
  400.  
  401. First, the authors explain the transfer learning approach, where a student model is learned from a teacher model by freezing the weights of the first K layers of the pre-trained teacher and learning the weights of the remaining layers based on the student’s task. Afterward, they point out that this learning scheme results in highly vulnerable students, as they share the first K layers with publicly known models. Hence, an attacker can craft an adversarial example that resembles an input belonging to a different class in its Kth-layer representation, which is therefore likely to be misclassified by the student. It is essential to note that no queries to the student are required, which makes the attack highly effective.
  402.  
  403. In the following sections, the authors validate their hypothesis experimentally on models that solve tasks from the image recognition domain. The attack achieves impressive success rates, especially against students that only retrain the last layer. However, models that share fewer layers with the teacher are less vulnerable, and models that fine-tune the weights learned by the teacher in all layers are shown to be resistant to the attack.
  404.  
  405. Unfortunately, one major limitation is that the attack requires knowing the teacher model used to train the student. The authors rely on the fact that currently, there is only a small number of publicly available teacher models and leverage this fact to introduce a fingerprinting technique that enables the attacker to identify the teacher of the target model efficiently.
  406.  
  407. In the following sections, the paper suggests possible defenses and points out their strengths and weaknesses, including a novel defense named “injecting neuron distances” that, given a layer K, attempts to make the student’s internal representation deviate from that of the teacher for all inputs which provides significant benefits compared to other schemes.
  408.  
  409. Finally, the authors summarize their findings and conclude that further research is required to fully grasp the potential of the attacks they introduced and the possible defenses against them.
  410.  
  411. Problems in Understanding
  412. -------------------------
  413. The paper is very well-written and easy to understand.
  414.  
  415. Strengths
  416. ---------
  417. - A large number of experiments – To demonstrate the practicality of their attacks and support their hypothesis, the authors mount a large number of successful attacks against models trained on various datasets from different domains, trained on different (and commonly used) platforms, for several tasks.
  418.  
  419. - Strong motivation and intelligent attack strategies – The authors make a profoundly important observation concerning models trained via transfer learning. They explain how this training method might cause all the derived models to inherit vulnerabilities from the teacher model, and they demonstrate how this phenomenon can be exploited to devise highly successful attacks against the student models with minimal access to these targets and without the targets being aware of the attacks.
  420.  
  421. - DSSIM as a domain-specific distance metric – The paper introduces DSSIM as a distance metric for limiting the perturbation budget when crafting adversarial examples. As opposed to numerous other works in the field, where the perturbation is bounded by the L2 norm, a method that does not always guarantee that the resulting images indeed resemble the initial inputs, the paper uses a domain-specific metric, and the authors even manually examine the resulting images to ensure their attack indeed generates high-quality adversarial examples that meet the requirements. This ensures the generated samples are indeed adversarial and that the attacks are successful.
  422.  
  423. Weaknesses
  424. ----------
  425. - Relies on minimizing the distance to specific samples – The attack introduced in the paper crafts an adversarial example from an input x, given an input y from a different class, by reducing the gap between their internal representations at a given layer K while limiting the perturbation budget. This attack, however, may not always be practical, as it relies heavily on the availability of a suitable sample from the target class. More specifically, the paper fails to study the capabilities of such attacks in other domains, like malware detection, where stricter conditions must be met: if the attacker wants to craft an adversarial example from a malware sample that the model will classify as benign, they need a target sample that resembles the malicious input closely enough to enable crafting the example while preserving the malicious nature of the input.
  426.  
  427. - The attack is weak against specific transfer learning approaches – For instance, the results show that while the attack might be highly effective against students trained with the deep-layer feature extractor approach, it exhibits lower success rates against models trained with mid-layer feature extractor approach and is almost ineffective against the full model fine-tuning technique. The paper does not suggest more sophisticated attacks that perform well against these training approaches.
  428.  
  429. - Fingerprinting teacher models has many limitations – First, the method assumes that the students are trained with the deep-layer feature extractor approach only. Second, it is assumed that there is only a limited number of teacher models to consider, which is unlikely to remain the case for long. Lastly, the process itself requires knowing the output probabilities of the student models, which is not always possible since many models only output the final labels. All these limitations render the attack as a whole impractical.
  430.  
  431. Opportunities for future research
  432. ---------------------------------
  433. - Explore different domains – As pointed out in the weaknesses section, it is unclear whether the attack can generalize to other fields with stricter constraints. It is, therefore, vital to explore different domains and test the attack performance on models for other classification tasks.
  434.  
  435. - How to attack different transfer learning approaches – The attack introduced in the paper fails against the mid-layer feature extraction and full model fine-tuning approaches. It is crucial to examine these approaches and understand how they could be attacked and what defenses exist. One possible attack would be to train shadow models of the student models, learned via transfer learning from the same teachers, and to craft adversarial examples against them. The student and the shadow would share many parameters and have very similar decision boundaries, so such an attack is likely to be successful.
  436.  
  437.  
  438.  
  439. Review #19L
  440. ===========================================================================
  441. * Reviewer: Karthik Ramesh <k6ramesh@uwaterloo.ca>
  442. * Updated: 20 Oct 2019 11:21:07pm EDT
  443.  
  444. Paper summary
  445. -------------
  446. This paper hypothesizes that relying on transfer learning to train neural networks faster and with better accuracy tends to make the newly trained models vulnerable to misclassification attacks. This can be considered an attack that extends the previous papers we have read on adversarial attacks. Their attack relies on the fact that when some of the layers of the “teacher” model are frozen to transfer its knowledge to the “student”, a misclassification achieved with respect to those frozen layers of the “teacher” will also transfer to the “student” model. They then proceed to evaluate their hypothesis in the visual domain using a variety of public datasets and models, while having white-box access to the “teacher” and black-box access to the “student” model. They also explore defences against this method of attack.
  447.  
  448. Strengths
  449. ---------
  450. 1. They finally explore a distance metric other than L_{p} norm
  451. 2. Fingerprinting without modification of the “teacher” model
  452. 3. They experiment with real-world ML services and show the success of their attacks as well as test their methodology against a variety of datasets and architectures
  453. 4. The attack approach is novel in the machine learning space.
  454.  
  455. Weaknesses
  456. ----------
  457. 1. They don’t explore using the DSSIM metric as a replacement for D(.) in Equation 1 as well.
  458. 2. No mention is made of how adversarial retraining of the student model might affect this type of attack. For example, adversarial retraining against the teacher model’s dataset might give robustness over a sufficient part of the input space when the tasks are similar. Also, were the initial models adversarially trained?
  459. 3. Not many details are given about the workings of the non-targeted attack. For example, over the 2000 iterations, does the source image move towards all 5 of the target images at some point, or just towards fewer than 5?
  460.  
  461. Opportunities for future research
  462. ---------------------------------
  463. 1. We should explore whether the non-targeted adversarial attacks can really be considered attacks, or whether the model is simply not good enough, or whether this is an inherent deficiency of machine learning.
  464. 2. How would adversarial retraining using DSSIM-bounded examples impact classification accuracy? How about adversarial retraining using images obtained from both the DSSIM and FGSM methods?
  465.  
  466.  
  467.  
  468. Review #19M
  469. ===========================================================================
  470. * Reviewer: Matthew David Rafuse <mrafuse@uwaterloo.ca>
  471. * Updated: 20 Oct 2019 11:21:08pm EDT
  472.  
  473. Paper summary
  474. -------------
  475. In this paper, the authors develop a black-box attack on student models trained in a transfer learning scenario, exploiting general knowledge of the teacher to generate misclassification attacks. The paper develops a method of matching student models to their teachers using the Gini coefficient, and then developing effective misclassification attacks (both targeted and non-targeted). The authors then evaluate and test a number of proposed defenses, discovering one that works relatively well by obscuring the link between the student model and the teacher.
  476.  
  477. Strengths
  478. ---------
  479. 1. I was *really* impressed with their key insight - that if the inputs to a model at layer $k$ are the same for both the target and the adversarial example, any fine-tuning of the later layers will not change the classification. It's something I never would have realized myself, but it's completely obvious in hindsight. It turns out this is fairly difficult to exploit in practice (as is shown by their mixed success with Mid-layer Feature Extractor examples).
  480.  
  481. 2. The use of DSSIM as a similarity metric instead of the $L_2$, $L_{\infty}$ or other such norm is much better. One of the early complaints about adversarial example research was that these norms don't really generate images that are indistinguishable from the originals - the DSSIM variants were much more convincing. Figure 11 in particular did a good job of illustrating this.
  482.  
  483. 3. The fact that the vulnerability was disclosed to ML providers before publishing is admirable, and I haven't seen it in any of the papers we've covered yet, despite many papers talking about MLaaS providers. It's possible they do that and don't mention it, I suppose.
  484.  
  485. 4. This paper was so easy to read. It was really well written, well structured, A+ all around. A real joy to review.
  486.  
  487. Weaknesses
  488. ----------
  489. 1. I'm disappointed they decided to punt on ensemble defenses. This seems like it would completely and utterly nullify the attack - simple majority vote by three student models trained on varying teacher models and the issue is gone.
  490.  
  491. 2. The Neuron Distance defense is computationally expensive and isn't perfectly effective. One of the main advantages of transfer learning is saving computation time - we are sacrificing this for better robustness, and the model is still fairly effective to attack - a 30% success rate isn't anything to sneeze at. The main draw comes from the obfuscation of the teacher model, but security through obscurity is poor practice and a determined attacker can just try all major teacher models. Neuron Distance is also relatively ineffective against non-targeted misclassification attacks.
  492.  
  493. 3. Limited to the image recognition domain.
  494.  
  495. Opportunities for future research
  496. ---------------------------------
  497. Ensemble defenses! This seems like a prime candidate for a defense, and ensemble stuff is pretty cool in my book.
  498.  
  499.  
  500.  
  501. Review #19N
  502. ===========================================================================
  503. * Reviewer: Tosca Lechner <tlechner@uwaterloo.ca>
  504. * Updated: 20 Oct 2019 11:44:13pm EDT
  505.  
  506. Paper summary
  507. -------------
  508. In the paper the authors discuss vulnerability to targeted and non-targeted evasion attacks for transfer learning. They assume a student-teacher setting in which the attacker has white-box access to the teacher. They evaluate different transfer learning strategies (e.g., retraining the last layer, retraining the last k layers, fine-tuning with no fixed layers) on different datasets (depending on which strategy makes sense for which pairs of datasets). They propose a new attack in which the attacker generates a sample by perturbing a source sample with fixed rules, such that the change is imperceptible to a human and the sample closely resembles the representation of a target instance at the layer where the student takes over. They show that this attack works well if the tasks are closely related and the training strategy matches the attack. They also show that their attacks work well in practice when the teacher model is not known but is known to come from a set of candidates: they can identify the correct candidate and then use their white-box attack. They propose several defenses: adding dropout, and adding an additional training constraint that teacher and student have dissimilar representations in each layer. They show mediocre results for their defenses.
  509.  
  510. Strengths
  511. ---------
  512. They evaluated different types of transfer learning for each of which they provided example datasets to evaluate. They showed that the more related the student and teacher tasks are, the better the attacks work. This is an insight they would not have gotten when only evaluating on one particular strategy.
  513.  
  514. Well written. I liked how they presented their ideas and found it easy to follow.
  515.  
  516. They tested against transferable adversarial examples and showed that their attack works without adversarial examples being transferable. This is an interesting fact that makes their result more interesting/relevant.
  517.  
  518. They provided defenses against their attack
  519.  
  520. They showed that their attack works against practical teacher-models from providers like Google and Microsoft.
  521.  
  522. Weaknesses
  523. ----------
  524. Only restricted to the image domain. In particular, the perturbation metric is chosen for the image domain. It is therefore unclear how well this would generalize to other domains.
  525.  
  526. White-box access to the teacher model is assumed. This, however, is well argued in the text as something that would hold in most applications. They also make the case that model-stealing attacks work well with sufficiently many black-box queries to gain such white-box access.
  527.  
  528. Their defenses caused a considerable drop in accuracy. Depending on the use case, this might be a problem.
  529.  
  530. Their attacks did not work well for all datasets. It depended on the training strategy used.
  531.  
  532. They did not design new attacks for their proposed defenses, making the defense-part of the paper weaker
  533.  
  534. Presentation: it would have been nice to see the results summarized in a table.
  535.  
  536. Opportunities for future research
  537. ---------------------------------
  538. Find new attacks against their proposed defenses.
  539.  
  540. Come up with better defenses that do not decrease accuracy as much.
  541.  
  542.  
  543.  
  544. Review #19O
  545. ===========================================================================
  546. * Reviewer: Nivasini Ananthakrishnan <nanantha@uwaterloo.ca>
  547. * Updated: 20 Oct 2019 11:56:05pm EDT
  548.  
  549. Paper summary
  550. -------------
  551. The paper describes how to use the fact that a model is built using transfer learning to craft adversarial examples. The paper leverages the transparency of the teacher model to launch attacks on student models to which we only have black-box access. The attack is described assuming knowledge of the teacher model and of the number of layers frozen. Later this assumption is relaxed. The paper discusses how to learn the number of frozen layers and which teacher model is used, given a small pool of teacher models.
  552.  
  553. Strengths
  554. ---------
  555. The attack model is very realistic. This is evident from the evaluations performed on actual popular teacher models for transfer learning. There is discussion about how to relax some of the assumptions in the attack model. The paper also provides defenses for its own attacks.
  556.  
  557. Weaknesses
  558. ----------
  559. Many of the methods described in the paper are not as successful when transfer learning is done by freezing fewer layers. Success rates of adversarial examples for Mid-layer Feature Extractor and Full Model Fine-tuning are low. The fingerprinting method is also only successful for Deep-layer feature extractor.
  560.  
  561. Opportunities for future research
  562. ---------------------------------
  563. We could combine the approach introduced in this paper with the approach of black-box attacks using substitute models. The substitute model could be built by constraining the first K layers to be the same as the teacher model. This restricts the class of substitute models. So the substitute model could be trained with fewer queries.
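A rough sketch of that constrained substitute model follows (this review's suggestion, not the paper's method; the helper names and training details are made up for illustration): the substitute reuses the teacher's first K layers verbatim and fits only a small head on labels obtained from a few black-box queries to the victim student.

    import torch
    import torch.nn as nn

    def build_substitute(teacher_features, feature_dim, num_classes):
        for p in teacher_features.parameters():
            p.requires_grad = False                   # first K layers copied from the teacher
        head = nn.Linear(feature_dim, num_classes)    # only this part must be learned
        return nn.Sequential(teacher_features, nn.Flatten(), head)

    def fit_head(substitute, queries, victim_labels, epochs=20, lr=1e-3):
        """`queries` and `victim_labels` come from a small number of black-box queries."""
        opt = torch.optim.Adam([p for p in substitute.parameters() if p.requires_grad], lr=lr)
        loss_fn = nn.CrossEntropyLoss()
        for _ in range(epochs):
            opt.zero_grad()
            loss = loss_fn(substitute(queries), victim_labels)
            loss.backward()
            opt.step()
        return substitute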
  564.  
  565.  
  566.  
  567. Review #19P
  568. ===========================================================================
  569. * Reviewer: Lindsey Tulloch <ltulloch@uwaterloo.ca>
  570. * Updated: 20 Oct 2019 11:56:26pm EDT
  571.  
  572. Paper summary
  573. -------------
  574. A few of the large tech companies (Google, Microsoft, Facebook, etc.) have begun sharing highly tuned and complex centralized models with the general public, which can be further customized for business or individual utility. From a security standpoint, this type of centralized service offering, with few models and required transfer learning, has some inherent vulnerability which the authors of this paper investigate. The authors find that having white-box teacher models increases the vulnerability and generalizability (to student models) of misclassification attacks. They show that a fingerprinting scheme can confirm which teacher model was used to train a black-box student model, and they offer some mechanisms for defence.
  575.  
  576. Problems in Understanding
  577. -------------------------
  578. I am curious about why the Iris recognition model is highly sensitive to input noise. In section 4.3 the authors say that "A detailed analysis shows that" ... the 100% non-targeted attack success rate ... "is because the Iris recognition is highly sensitive to input noise", but that is where the detailed analysis ends. I suspect that because the model is 50 layers deep it is able to process and consider a higher number of pixels/features from an image - which seems consistent with other experiments on the Iris dataset - but I'd like a bit more clarity on this.
  579.  
  580. Strengths
  581. ---------
  582. The authors show attacks that are highly relevant and damaging to both large companies offering centralized models for widespread use, and individuals and companies that utilize them. They claim to have disclosed their findings before publishing them which was responsible.
  583.  
  584. The concepts and background knowledge needed to understand the paper were well explained, the organization was logical, and it was fairly easy to understand.
  585.  
  586. Examples and graphs were relevant and aided in overall understanding of the paper.
  587.  
  588. Weaknesses
  589. ----------
  590. The DSSIM was an interesting approach but there wasn't much discussion as to how it compares to the $L_{2}$ norm outside of whether or not changes are detectable. Presumably if the querying process was automated (after training of course) and was never looked at by humans, visible changes to the images wouldn't be that important. The argument could be made that ensuring the image is not visibly different is inexpensive and produces a more confounding attack--so it's obviously all around better, but I didn't see this argument being made and would have liked a bit more analysis.
  591.  
  592. I would have liked to see some fingerprint example images or a visualization, as it was a bit difficult to envision how the process worked.
  593.  
  594. The related work section should have appeared earlier as it did not add much that was not already discussed.