Angelos Katharopoulos

Publications

Self Supervision Does Not Help Natural Language Supervision at Scale

F. Weers, V. Shankar, A. Katharopoulos, Y. Yang, T. Gunter

CVPR, 2023

Self supervision and natural language supervision have emerged as two exciting ways to train general purpose image encoders which excel at a variety of downstream tasks. Recent works such as M3AE [31] and SLIP [64] have suggested that these approaches can be effectively combined, but most notably their results use small (<20M examples) pre-training datasets and don’t effectively reflect the large-scale regime (>100M samples) that is commonly used for these approaches. Here we investigate whether a similar approach can be effective when trained with a much larger amount of data. We find that a combination of two state of the art approaches: masked auto-encoders, MAE [38] and contrastive language image pre-training, CLIP [69] provides a benefit over CLIP when trained on a corpus of 11.3M image-text pairs, but little to no benefit (as evaluated on a suite of common vision tasks) over CLIP when trained on a large corpus of 1.4B images. Our work provides some much needed clarity into the effectiveness (or lack thereof) of self supervision for large-scale image-text training.

@inproceedings{weers2023mae_clip,
    title={Self Supervision Does Not Help Natural Language Supervision at Scale},
    author={Weers, F. and Shankar, V. and Katharopoulos, A. and Yang, Y. and Gunter, T.},
    booktitle={{Proceedings IEEE Conf. on Computer Vision and Pattern Recognition (CVPR)}},
    year={2023},
    url={https://arxiv.org/pdf/2301.07836}
}

Neural Parts: Learning Expressive 3D Shape Abstractions with Invertible Neural Networks

D. Paschalidou, A. Katharopoulos, A. Geiger, S. Fidler

CVPR, 2021

Impressive progress in 3D shape extraction led to representations that can capture object geometries with high fidelity. In parallel, primitive-based methods seek to represent objects as semantically consistent part arrangements. However, due to the simplicity of existing primitive representations, these methods fail to accurately reconstruct 3D shapes using a small number of primitives/parts. We address the trade-off between reconstruction quality and number of parts with Neural Parts, a novel 3D primitive representation that defines primitives using an Invertible Neural Network (INN) which implements homeomorphic mappings between a sphere and the target object. The INN allows us to compute the inverse mapping of the homeomorphism, which in turn, enables the efficient computation of both the implicit surface function of a primitive and its mesh, without any additional post-processing. Our model learns to parse 3D objects into semantically consistent part arrangements without any part-level supervision. Evaluations on ShapeNet, D-FAUST and FreiHAND demonstrate that our primitives can capture complex geometries and thus simultaneously achieve geometrically accurate as well as interpretable reconstructions using an order of magnitude fewer primitives than state-of-the-art shape abstraction methods.

@inproceedings{paschalidou2021nps,
    title={Neural Parts: Learning Expressive 3D Shape Abstractions with Invertible Neural Networks},
    author={Paschalidou, D. and Katharopoulos, A. and Geiger, A. and Fidler, S.},
    booktitle={{Proceedings IEEE Conf. on Computer Vision and Pattern Recognition (CVPR)}},
    year={2021},
    url={https://arxiv.org/pdf/2103.10429}
}

Fast Transformers with Clustered Attention

A. Vyas, A. Katharopoulos, F. Fleuret

NeurIPS, 2020

Transformers have been proven a successful model for a variety of tasks in sequence modeling. However, computing the attention matrix, which is their key component, has quadratic complexity with respect to the sequence length, thus making them prohibitively expensive for large sequences. To address this, we propose clustered attention, which instead of computing the attention for every query, groups queries into clusters and computes attention just for the centroids. To further improve this approximation, we use the computed clusters to identify the keys with the highest attention per query and compute the exact key/query dot products. This results in a model with linear complexity with respect to the sequence length for a fixed number of clusters. We evaluate our approach on two automatic speech recognition datasets and show that our model consistently outperforms vanilla transformers for a given computational budget. Finally, we demonstrate that our model can approximate arbitrarily complex attention distributions with a minimal number of clusters by approximating a pretrained BERT model on GLUE and SQuAD benchmarks with only 25 clusters and no loss in performance.
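
The basic idea in the abstract above (one attention computation per cluster centroid, shared by every query in that cluster) can be sketched as follows. This is a minimal illustration and not the authors' implementation: the cluster assignment is taken as a given input, the function names are hypothetical, and the refinement with the highest-attention keys per query is not shown.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(q, K, V):
    """Exact scaled dot-product attention output for a single query."""
    scale = 1.0 / math.sqrt(len(q))
    weights = softmax([scale * sum(a * b for a, b in zip(q, k)) for k in K])
    return [sum(w * v[d] for w, v in zip(weights, V)) for d in range(len(V[0]))]


def full_attention(Q, K, V):
    """Reference: one attention computation per query."""
    return [attend(q, K, V) for q in Q]


def clustered_attention(Q, K, V, assignment, num_clusters):
    """One attention computation per cluster centroid.

    `assignment[i]` is the (assumed, precomputed) cluster index of query i.
    """
    dim = len(Q[0])
    sums = [[0.0] * dim for _ in range(num_clusters)]
    counts = [0] * num_clusters
    for q, c in zip(Q, assignment):
        counts[c] += 1
        for d in range(dim):
            sums[c][d] += q[d]

    centroid_out = {}
    for c in range(num_clusters):
        if counts[c]:
            centroid = [s / counts[c] for s in sums[c]]
            centroid_out[c] = attend(centroid, K, V)

    # Every query reuses the output computed for its cluster's centroid.
    return [list(centroid_out[c]) for c in assignment]


if __name__ == "__main__":
    Q = [[1.0, 0.0], [1.0, 0.0], [0.0, 2.0], [0.0, 2.0]]
    K = [[0.5, 1.0], [1.5, -1.0], [0.0, 0.3]]
    V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    print(clustered_attention(Q, K, V, [0, 0, 1, 1], 2))
```

When all queries in a cluster coincide with their centroid, as in the usage example, the result matches full attention; otherwise it is an approximation whose quality depends on how tight the clusters are.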

@inproceedings{vyas_et_al_2020,
    author={Vyas, A. and Katharopoulos, A. and Fleuret, F.},
    title={Fast Transformers with Clustered Attention},
    booktitle={Proceedings of the International Conference on Neural Information Processing Systems (NeurIPS)},
    year={2020}
}

Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention

A. Katharopoulos, A. Vyas, N. Pappas, F. Fleuret

ICML, 2020

Transformers achieve remarkable performance in several tasks but due to their quadratic complexity, with respect to the input’s length, they are prohibitively slow for very long sequences. To address this limitation, we express the self-attention as a linear dot-product of kernel feature maps and make use of the associativity property of matrix products to reduce the complexity from O(N²) to O(N), where N is the sequence length. We show that this formulation permits an iterative implementation that dramatically accelerates autoregressive transformers and reveals their relationship to recurrent neural networks. Our linear transformers achieve similar performance to vanilla transformers and they are up to 4000x faster on autoregressive prediction of very long sequences.
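
The associativity argument in the abstract above can be checked numerically with a small sketch. It assumes the feature map elu(x) + 1 and uses hypothetical function names. It compares a direct O(N²) causal computation against a recurrent O(N) version that only carries two running sums, and it is an illustration rather than the released implementation.

```python
import math
import random


def feature_map(x):
    # elu(x) + 1, a strictly positive feature map (assumed here).
    return [xi + 1.0 if xi > 0 else math.exp(xi) for xi in x]


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def quadratic_causal_attention(Q, K, V):
    """Reference O(N^2): each query explicitly visits all earlier keys."""
    out = []
    for i, q in enumerate(Q):
        fq = feature_map(q)
        weights = [dot(fq, feature_map(K[j])) for j in range(i + 1)]
        norm = sum(weights)
        out.append([sum(w * V[j][d] for j, w in enumerate(weights)) / norm
                    for d in range(len(V[0]))])
    return out


def linear_causal_attention(Q, K, V):
    """O(N) recurrent form: S accumulates phi(k) v^T, z accumulates phi(k)."""
    dk, dv = len(K[0]), len(V[0])
    S = [[0.0] * dv for _ in range(dk)]
    z = [0.0] * dk
    out = []
    for q, k, v in zip(Q, K, V):
        fq, fk = feature_map(q), feature_map(k)
        for a in range(dk):
            z[a] += fk[a]
            for b in range(dv):
                S[a][b] += fk[a] * v[b]
        norm = dot(fq, z)
        out.append([sum(fq[a] * S[a][b] for a in range(dk)) / norm
                    for b in range(dv)])
    return out


def max_abs_diff(A, B):
    return max(abs(x - y) for ra, rb in zip(A, B) for x, y in zip(ra, rb))


def random_matrix(rng, n, d):
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]


if __name__ == "__main__":
    rng = random.Random(0)
    Q, K, V = (random_matrix(rng, 16, 4) for _ in range(3))
    print(max_abs_diff(quadratic_causal_attention(Q, K, V),
                       linear_causal_attention(Q, K, V)))
```

The state `(S, z)` has a fixed size that does not depend on the sequence length, which is what allows autoregressive generation to proceed one step at a time like a recurrent network.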

@inproceedings{katharopoulos2020lin,
    title={Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention},
    author={Katharopoulos, A. and Vyas, A. and Pappas, N. and Fleuret, F.},
    booktitle={Proceedings of the International Conference on Machine Learning (ICML)},
    year={2020},
    url={https://arxiv.org/pdf/2006.16236.pdf}
}

Processing Megapixel Images with Deep Attention-Sampling Models

A. Katharopoulos, F. Fleuret

ICML, 2019

Existing deep architectures cannot operate on very large signals such as megapixel images due to computational and memory constraints. To tackle this limitation, we propose a fully differentiable end-to-end trainable model that samples and processes only a fraction of the full resolution input image. The locations to process are sampled from an attention distribution computed from a low resolution view of the input. We refer to our method as attention sampling and it can process images of several megapixels with a standard single GPU setup. We show that sampling from the attention distribution results in an unbiased estimator of the full model with minimal variance, and we derive an unbiased estimator of the gradient that we use to train our model end-to-end with a normal SGD procedure. This new method is evaluated on three classification tasks, where we show that it reduces the computation and memory footprint by an order of magnitude for the same accuracy as classical architectures. We also show the consistency of the sampling that indeed focuses on informative parts of the input images.

@inproceedings{katharopoulos2019ats,
    title={Processing Megapixel Images with Deep Attention-Sampling Models},
    author={Katharopoulos, A. and Fleuret, F.},
    booktitle={Proceedings of the International Conference on Machine Learning (ICML)},
    year={2019},
    type={Short oral},
    url={https://arxiv.org/pdf/1905.03711.pdf}
}

Not All Samples Are Created Equal: Deep Learning with Importance Sampling

A. Katharopoulos, F. Fleuret

ICML, 2018

Deep neural network training spends most of the computation on examples that are properly handled, and could be ignored. We propose to mitigate this phenomenon with a principled importance sampling scheme that focuses computation on “informative” examples, and reduces the variance of the stochastic gradients during training. Our contribution is twofold: first, we derive a tractable upper bound to the per-sample gradient norm, and second we derive an estimator of the variance reduction achieved with importance sampling, which enables us to switch it on when it will result in an actual speedup. The resulting scheme can be used by changing a few lines of code in a standard SGD procedure, and we demonstrate experimentally, on image classification, CNN fine-tuning, and RNN training, that for a fixed wall-clock time budget, it provides a reduction of the train losses of up to an order of magnitude and a relative improvement of test errors between 5% and 17%.
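
The variance-reduction claim in the abstract above can be illustrated with a toy calculation. The sketch uses scalar "gradients" and hypothetical per-sample scores standing in for the upper bound on the gradient norm. All names and numbers are assumptions chosen for illustration and do not come from the paper or its code.

```python
import random


def sampling_distribution(scores):
    """Sample each example with probability proportional to its score."""
    total = sum(scores)
    return [s / total for s in scores]


def expected_estimate(grads, probs):
    """Expectation of the reweighted single-sample estimate g_i / (N p_i)."""
    n = len(grads)
    return sum(p * (g / (n * p)) for g, p in zip(grads, probs))


def estimator_variance(grads, probs):
    """Variance of the reweighted single-sample estimate around the mean."""
    n = len(grads)
    mean = sum(grads) / n
    return sum(p * (g / (n * p) - mean) ** 2 for g, p in zip(grads, probs))


def draw_weighted_gradient(grads, probs, rng):
    """Draw one example and return its importance-weighted gradient."""
    n = len(grads)
    i = rng.choices(range(n), weights=probs)[0]
    return grads[i] / (n * probs[i])


if __name__ == "__main__":
    # Toy scalar gradients: most examples contribute almost nothing.
    grads = [0.01, 0.02, 0.01, 2.0, 0.03, 1.5]
    # Hypothetical upper bounds on the gradient magnitudes.
    scores = [0.02, 0.03, 0.02, 2.2, 0.05, 1.6]

    uniform = [1.0 / len(grads)] * len(grads)
    importance = sampling_distribution(scores)

    print("uniform variance:   ", estimator_variance(grads, uniform))
    print("importance variance:", estimator_variance(grads, importance))
    print("one draw:", draw_weighted_gradient(grads, importance, random.Random(0)))
```

Dividing by `N * p_i` keeps the estimate unbiased for any distribution with non-zero probabilities, so the choice of scores affects only the variance and not the expected gradient.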

@inproceedings{katharopoulos2018is,
    title={Not All Samples Are Created Equal: Deep Learning with Importance Sampling},
    author={Katharopoulos, A. and Fleuret, F.},
    booktitle={Proceedings of the International Conference on Machine Learning (ICML)},
    year={2018},
    type={Short oral},
    url={https://arxiv.org/pdf/1803.00942.pdf}
}

Learning local feature aggregation functions with backpropagation

D. Paschalidou, A. Katharopoulos, C. Diou, A. Delopoulos

EUSIPCO, 2017

This paper introduces a family of local feature aggregation functions and a novel method to estimate their parameters, such that they generate optimal representations for classification (or any task that can be expressed as a cost function minimization problem). To achieve that, we compose the local feature aggregation function with the classifier cost function and we backpropagate the gradient of this cost function in order to update the local feature aggregation function parameters. Experiments on synthetic datasets indicate that our method discovers parameters that model the class-relevant information in addition to the local feature space. Further experiments on a variety of motion and visual descriptors, both on image and video datasets, show that our method outperforms other state-of-the-art local feature aggregation functions, such as Bag of Words, Fisher Vectors and VLAD, by a large margin.

@inproceedings{katharopoulos2017learning,
    title = {Learning local feature aggregation functions with backpropagation},
    author = {Paschalidou, Despoina and Katharopoulos, Angelos and Diou, Christos and Delopoulos, Anastasios},
    booktitle = {Proceedings of the European Signal Processing Conference (EUSIPCO)},
    publisher = {IEEE},
    month = aug,
    year = {2017},
    url = {http://ieeexplore.ieee.org/Abstract/document/8081307/}
}

Fast Supervised LDA for Discovering Micro-Events in Large-Scale Video Datasets

A. Katharopoulos, D. Paschalidou, C. Diou, A. Delopoulos

ACM MM, 2016

This paper introduces fsLDA, a fast variational inference method for supervised LDA, which overcomes the computational limitations of the original supervised LDA and enables its application in large-scale video datasets. In addition to its scalability, our method also overcomes the drawbacks of standard, unsupervised LDA for video, including its focus on dominant but often irrelevant video information (e.g. background, camera motion). As a result, experiments in the UCF11 and UCF101 datasets show that our method consistently outperforms unsupervised LDA in every metric. Furthermore, analysis shows that class-relevant topics of fsLDA lead to sparse video representations and encapsulate high-level information corresponding to parts of video events, which we denote 'micro-events'.

@inproceedings{katharopoulos2016fast,
    title = {Fast Supervised LDA for Discovering Micro-Events in Large-Scale Video Datasets},
    author = {Katharopoulos, Angelos and Paschalidou, Despoina and Diou, Christos and Delopoulos, Anastasios},
    booktitle = {Proceedings of the 2016 ACM on Multimedia Conference},
    pages = {332--336},
    month = oct,
    year = {2016},
    url = {http://dl.acm.org/citation.cfm?id=2967237},
    month_numeric = {10}
}
