* denotes equal contribution
Fundamental
Fast and Accurate Factual Inconsistency Detection Over Long Documents
Barrett M. Lattimer, Patrick H. Chen, Xinyuan Zhang, and Yi Yang
In Conference on Empirical Methods in Natural Language Processing
(EMNLP),
2023
Generative AI models exhibit remarkable potential; however, hallucinations across various tasks present a significant challenge, particularly for longer inputs that current approaches struggle to address effectively. We introduce SCALE (Source Chunking Approach for Large-scale inconsistency Evaluation), a task-agnostic model for detecting factual inconsistencies using a novel chunking strategy. Specifically, SCALE is a natural language inference (NLI)-based model that uses large text chunks to condition over long texts. This approach achieves state-of-the-art performance in factual inconsistency detection for diverse tasks and long inputs. Additionally, we leverage the chunking mechanism and employ a novel algorithm to explain SCALE’s decisions through relevant source sentence retrieval. Our evaluations reveal that SCALE outperforms existing methods on both standard benchmarks and ScreenEval, a new long-form dialogue dataset we constructed. Moreover, SCALE surpasses competitive systems in efficiency and model explanation evaluations. We have released our code and data publicly on GitHub.
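To make the chunking idea concrete, here is a minimal sketch of chunk-conditioned NLI scoring. It is not the released SCALE implementation: the NLI checkpoint name, the chunk size, and the max-aggregation rule are assumptions made for illustration.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Placeholder NLI checkpoint; SCALE's actual backbone and settings may differ.
MODEL = "microsoft/deberta-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL).eval()

def chunked_consistency_score(source_sentences, claim, chunk_size=20):
    """Score a claim against a long source by conditioning on large chunks."""
    best = 0.0
    for i in range(0, len(source_sentences), chunk_size):
        chunk = " ".join(source_sentences[i:i + chunk_size])
        inputs = tokenizer(chunk, claim, truncation=True, return_tensors="pt")
        with torch.no_grad():
            probs = torch.softmax(model(**inputs).logits, dim=-1)
        best = max(best, probs[0, -1].item())  # entailment probability for this checkpoint
    return best  # high score: at least one large chunk supports the claim

Taking the maximum over chunks mirrors the intuition that a claim is consistent when some sufficiently large source chunk entails it.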
Sign-OPT: A Query-Efficient Hard-label Adversarial Attack
Minhao Cheng*, Simranjit Singh*, Patrick H. Chen, Pin-Yu Chen, Sijia Liu, and Cho-Jui Hsieh
In International Conference on Learning Representations
(ICLR),
2020
We study the most practical problem setup for evaluating the adversarial robustness of a machine learning system with limited access: the hard-label black-box attack setting for generating adversarial examples, where limited model queries are allowed and only the decision is provided for a queried input. Several algorithms have been proposed for this problem, but they typically require a huge number (>20,000) of queries to attack a single example. Among them, one of the state-of-the-art approaches (Cheng et al., 2019) showed that a hard-label attack can be modeled as an optimization problem whose objective function can be evaluated by binary search with additional model queries, so that a zeroth-order optimization algorithm can be applied. In this paper, we adopt the same optimization formulation but propose to directly estimate the sign of the gradient in any direction instead of the gradient itself, which requires only a single query. Using this single-query oracle for retrieving the sign of the directional derivative, we develop a novel query-efficient Sign-OPT approach for hard-label black-box attacks. We provide a convergence analysis of the new algorithm and conduct experiments on several models on MNIST, CIFAR-10 and ImageNet. We find that the Sign-OPT attack consistently requires 5× to 10× fewer queries compared to current state-of-the-art approaches, and usually converges to an adversarial example with smaller perturbation.
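For intuition, a minimal sketch of the single-query sign oracle and the resulting gradient estimate is below. Here is_adversarial is a hypothetical hard-label query to the victim model, and g_theta denotes the current distance to the decision boundary along direction theta; step sizes are illustrative.

import numpy as np

def sign_of_directional_derivative(x0, theta, g_theta, u, is_adversarial, eps=1e-3):
    """Return the sign of the directional derivative of g at theta in direction u
    using one model query: if moving slightly toward u still crosses the boundary
    at the same radius, g decreased, so the derivative is negative."""
    new_theta = theta + eps * u
    new_theta = new_theta / np.linalg.norm(new_theta)
    return -1.0 if is_adversarial(x0 + g_theta * new_theta) else 1.0

def estimate_gradient_sign(x0, theta, g_theta, is_adversarial, num_dirs=200):
    """Average signed random directions to approximate the gradient of g(theta)."""
    grad = np.zeros_like(theta)
    for _ in range(num_dirs):
        u = np.random.randn(*theta.shape)
        u /= np.linalg.norm(u)
        grad += sign_of_directional_derivative(x0, theta, g_theta, u, is_adversarial) * u
    return grad / num_dirs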
Uncertainty
Overcoming Catastrophic Forgetting by Bayesian Generative Regularization
Patrick H. Chen, Wei Wei, Cho-Jui Hsieh, and Bo Dai
In International Conference on Machine Learning
(ICML),
2021
In this paper, we propose a new method to overcome catastrophic forgetting by adding generative regularization to the Bayesian inference framework. Bayesian methods provide a general framework for continual learning. We further construct a generative regularization term for any given classification model by leveraging energy-based models and Langevin dynamics sampling to enrich the features learned in each task. By combining the discriminative and generative losses, we empirically show that the proposed method outperforms state-of-the-art methods on a variety of tasks, avoiding catastrophic forgetting in continual learning. In particular, the proposed method outperforms baseline methods by over 15% on the Fashion-MNIST dataset and 10% on the CUB dataset.
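As a rough illustration of the generative ingredient, the sketch below treats a classifier's logits as an energy model, draws samples with Langevin dynamics, and forms a contrastive regularization term. The sampler length, step sizes, and initialization are illustrative assumptions, not the paper's settings.

import torch

def langevin_sample(model, x_init, steps=20, step_size=1.0, noise=0.01):
    """Approximate samples from p(x) ∝ exp(logsumexp(f(x))) via Langevin dynamics."""
    x = x_init.clone().detach().requires_grad_(True)
    for _ in range(steps):
        energy = -torch.logsumexp(model(x), dim=-1).sum()
        grad, = torch.autograd.grad(energy, x)
        x = (x - step_size * grad + noise * torch.randn_like(x)).detach().requires_grad_(True)
    return x.detach()

def generative_regularizer(model, x_real):
    """Contrastive energy term: push energy lower on data than on model samples;
    added to the usual discriminative (cross-entropy) loss."""
    x_fake = langevin_sample(model, torch.rand_like(x_real))
    e_real = -torch.logsumexp(model(x_real), dim=-1).mean()
    e_fake = -torch.logsumexp(model(x_fake), dim=-1).mean()
    return e_real - e_fake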
Efficiency
FINGER: Fast Inference for Graph-based Approximate Nearest Neighbor Search
Patrick H. Chen, Wei-Cheng Chang, Jyun-Yu Jiang, Hsiang-Fu Yu, and Cho-Jui Hsieh
To Appear In International World Wide Web Conference
(WWW),
2023
Approximate K-Nearest Neighbor Search (AKNNS) has become ubiquitous in modern applications, for example as the fast search procedure behind two-tower deep learning models. Graph-based methods for AKNNS in particular have received great attention due to their superior performance. These methods rely on greedy graph search to traverse the data points, stored as embedding vectors in a database. Under this greedy search scheme, we make a key observation: many distance computations do not influence the search updates, so these computations can be approximated without hurting performance. As a result, we propose FINGER, a fast inference method for efficient graph search in AKNNS. FINGER approximates the distance function by estimating angles between neighboring residual vectors. The approximated distance can be used to bypass unnecessary computations and yield faster searches. Empirically, when speeding up inference for HNSW, one of the most popular graph-based AKNNS methods, FINGER significantly outperforms existing acceleration approaches and conventional libraries by 20% to 60% across different benchmark datasets.
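The underlying identity is the law of cosines on residual vectors: for a query q, the current graph node c, and a neighbor d, ||q - d||^2 = ||q - c||^2 + ||d - c||^2 - 2(q - c)·(d - c), so only the angle between the residuals q - c and d - c needs to be estimated. A toy sketch follows; the SimHash-style angle estimate is a stand-in for FINGER's learned low-rank projection, not its actual estimator.

import numpy as np

def residual_cosine(q_res, d_sig, proj):
    """Cheaply estimate cos(angle) between residuals from sign sketches.
    d_sig = np.sign(proj @ (d - c)) is precomputed per edge; proj is a small
    random projection (a stand-in for FINGER's learned projections)."""
    q_sig = np.sign(proj @ q_res)
    hamming = np.mean(q_sig != d_sig)
    return np.cos(np.pi * hamming)

def approx_sq_dist(q_res_norm_sq, d_res_norm_sq, cos_angle):
    """||q - d||^2 via the law of cosines on residuals; squared norms are
    precomputed, so no full-dimensional inner product is evaluated."""
    return q_res_norm_sq + d_res_norm_sq \
        - 2.0 * np.sqrt(q_res_norm_sq * d_res_norm_sq) * cos_angle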
ELIAS: End-to-end Learning to Index and Search in Large Output Spaces
Nilesh Gupta, Patrick H. Chen, Hsiang-Fu Yu, Cho-Jui Hsieh, and Inderjit S. Dhillon
In Advances in Neural Information Processing Systems
(NeurIPS),
2022
Extreme multi-label classification (XMC) is a popular framework for solving many real-world problems that require accurate prediction from a very large number of potential output choices. A popular approach for dealing with the large label space is to arrange the labels into a shallow tree-based index and then learn an ML model to efficiently search this index via beam search. Existing methods initialize the tree index by clustering the label space into a few mutually exclusive clusters based on pre-defined features and keep it fixed throughout the training procedure. This approach results in a sub-optimal indexing structure over the label space and limits the search performance to the quality of choices made during the initialization of the index. In this paper, we propose a novel method ELIAS which relaxes the tree-based index to a specialized weighted graph-based index which is learned end-to-end with the final task objective. More specifically, ELIAS models the discrete cluster-to-label assignments in the existing tree-based index as soft learnable parameters that are learned jointly with the rest of the ML model. ELIAS achieves state-of-the-art performance on several large-scale extreme classification benchmarks with millions of labels. In particular, ELIAS can be up to 2.5% better at precision@1 and up to 4% better at recall@100 than existing XMC methods. A PyTorch implementation of ELIAS along with other resources is available at https://github.com/nilesh2797/ELIAS.
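A toy sketch of the central relaxation, with the discrete cluster-to-label edges replaced by learnable soft assignments, is below. The sizes are deliberately tiny and the assignment matrix is kept dense for clarity; the actual ELIAS implementation keeps it sparse (see the linked repository).

import torch
import torch.nn as nn

class SoftLabelIndex(nn.Module):
    """Toy ELIAS-style index: cluster scores and cluster-to-label edges are both learned."""
    def __init__(self, dim=64, num_clusters=256, num_labels=10_000):
        super().__init__()
        self.cluster_scorer = nn.Linear(dim, num_clusters)    # query -> cluster scores
        self.label_embed = nn.Parameter(torch.randn(num_labels, dim) * 0.02)
        # Soft, learnable cluster-to-label assignments (dense here; sparse in practice).
        self.assign_logits = nn.Parameter(torch.zeros(num_clusters, num_labels))

    def forward(self, x, beam=16):
        cluster_probs = torch.softmax(self.cluster_scorer(x), dim=-1)
        top_p, top_c = cluster_probs.topk(beam, dim=-1)        # beam search over clusters
        assign = torch.sigmoid(self.assign_logits[top_c])      # (batch, beam, labels)
        label_scores = torch.sigmoid(x @ self.label_embed.T)   # (batch, labels)
        # A label's score sums the soft paths through the shortlisted clusters.
        return torch.einsum('bk,bkl,bl->bl', top_p, assign, label_scores)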
DRONE: Data-aware Low-rank Compression for Large NLP Models
Patrick H. Chen, Hsiang-Fu Yu, Inderjit S. Dhillon, and Cho-Jui Hsieh
In Advances in Neural Information Processing Systems
(NeurIPS),
2021
The representations learned by large-scale NLP models such as BERT have been widely used in various tasks. However, the increasing size of pre-trained models also brings efficiency challenges, including inference speed and model size, when deploying models on mobile devices. Specifically, most operations in BERT consist of matrix multiplications. These matrices are not low-rank, so canonical matrix decompositions do not lead to efficient approximations. In this paper, we observe that the learned representation of each layer lies in a low-dimensional space. Based on this observation, we propose DRONE (data-aware low-rank compression), a provably optimal low-rank decomposition of weight matrices, which has a simple closed-form solution that can be efficiently computed. DRONE can be applied to both the fully-connected and self-attention layers appearing in the BERT model. In addition to compressing standard models, our method can also be used on distilled BERT models to further improve the compression rate. Experimental results show that DRONE improves both model size and inference speed with limited loss in accuracy. Specifically, DRONE alone achieves a 1.92x speedup on the MRPC task with only 1.5% loss in accuracy, and when DRONE is combined with distillation, it further achieves over a 12.3x speedup on various natural language inference tasks.
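The distinguishing point is that the factorization minimizes the error on actual layer inputs X rather than on the weight matrix W itself. Below is a minimal numpy sketch of one such data-aware rank-k factorization; it is a simplified variant built from a truncated SVD of XW, not DRONE's exact closed form.

import numpy as np

def data_aware_low_rank(X, W, k):
    """Minimize ||X W - X W1 W2||_F instead of ||W - W1 W2||_F.
    X: (n, d) sampled layer inputs, W: (d, m) weight matrix, k: target rank."""
    M = X @ W                                    # what the layer actually produces on data
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    Uk = U[:, :k] * s[:k]                        # best rank-k approximation of M is Uk @ Vt[:k]
    W1 = np.linalg.pinv(X) @ Uk                  # (d, k)
    W2 = Vt[:k]                                  # (k, m)
    return W1, W2                                # at inference, replace x @ W with (x @ W1) @ W2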
Clustering and Constructing User Coresets to Accelerate Large-scale Top-𝐾 Recommender Systems
Jyun-Yu Jiang*, Patrick H. Chen*, Cho-Jui Hsieh, and Wei Wang
In International World Wide Web Conference
(WWW),
2020
Top-K recommender systems aim to generate a few but satisfactory personalized recommendations for various practical applications, such as item recommendation for e-commerce and link prediction for social networks. However, the numbers of users and items can be enormous, leading to myriad potential recommendations as well as a bottleneck in evaluating and ranking all possibilities. Existing Maximum Inner Product Search (MIPS) based methods treat the item ranking problem for each user independently, and the relationship between users has not been explored. In this paper, we propose a novel model for clustering and navigating for top-K recommenders (CANTOR) to expedite the computation of top-K recommendations based on latent factor models. A clustering-based framework is first presented to leverage user relationships to partition users into affinity groups, each of which contains users with similar preferences. CANTOR then derives a coreset of representative vectors for each affinity group by constructing a set cover with a theoretically guaranteed difference to the user latent vectors. Using these representative vectors in the coreset, approximate nearest neighbor search is applied to obtain a small set of candidate items for each affinity group, which is then used when computing recommendations for each user in the group. This approach can significantly reduce the computation without compromising the quality of the recommendations. Extensive experiments are conducted on six publicly available large-scale real-world datasets for item recommendation and personalized link prediction. The experimental results demonstrate that CANTOR significantly speeds up matrix factorization models with high precision. For instance, CANTOR achieves a 355.1x speedup for inferring recommendations in a million-user network with 99.5% precision@1 relative to the original system, while the state-of-the-art method only obtains a 93.7x speedup with 99.0% precision@1.
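A rough sketch of the candidate-generation pipeline is below. For brevity, k-means stands in for the affinity grouping, a single centroid stands in for each group's coreset of representatives, and a brute-force shortlist stands in for approximate nearest neighbor search; all sizes are illustrative.

import numpy as np
from sklearn.cluster import KMeans

def topk_with_user_clusters(U, V, k=10, n_groups=100, shortlist=500):
    """U: (num_users, d) user factors, V: (num_items, d) item factors.
    Returns an (unordered) top-k item set per user."""
    groups = KMeans(n_clusters=n_groups, n_init=10).fit(U)
    results = np.empty((U.shape[0], k), dtype=int)
    for g in range(n_groups):
        members = np.where(groups.labels_ == g)[0]
        # Candidate items for the whole group, found once from its representative.
        centroid_scores = V @ groups.cluster_centers_[g]
        candidates = np.argpartition(-centroid_scores, shortlist)[:shortlist]
        # Exact ranking only over the small candidate set, per user.
        scores = U[members] @ V[candidates].T
        top = np.argpartition(-scores, k, axis=1)[:, :k]
        results[members] = candidates[top]
    return results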
MulCode: A Multiplicative Multi-way Model for Compressing Neural Language Model
Yukun Ma*, Patrick H. Chen*, and Cho-Jui Hsieh
In Conference on Empirical Methods in Natural Language Processing
(EMNLP),
2019
It is challenging to deploy deep neural nets on memory-constrained devices due to the explosion in the number of parameters. In particular, the input embedding layer and softmax layer usually dominate the memory usage of an RNN-based language model. For example, the input embedding and softmax matrices on the IWSLT-2014 German-to-English dataset account for more than 80% of the total model parameters. To compress these embedding layers, we propose MulCode, a novel multi-way multiplicative neural compressor. MulCode learns an adaptively created matrix and its multiplicative compositions. Together with a prior-weighted loss, MulCode is more effective than state-of-the-art compression methods. On the IWSLT-2014 machine translation dataset, MulCode achieves a 17x compression rate for the embedding and softmax matrices, and when combined with quantization techniques, our method achieves a 41.38x compression rate with very little loss in performance.
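In spirit, each embedding row is rebuilt on the fly from a few small codebooks instead of being stored explicitly. The toy sketch below uses fixed random code assignments and a plain multiplicative composition; the adaptively learned codes and the prior-weighted loss from the paper are omitted.

import torch
import torch.nn as nn

class CompositionalEmbedding(nn.Module):
    """Toy multiplicative-code embedding: vocab*dim floats shrink to
    vocab*n_books small indices plus n_books*book_size*dim codebook floats."""
    def __init__(self, vocab=30_000, dim=512, n_books=8, book_size=64):
        super().__init__()
        self.register_buffer("codes", torch.randint(0, book_size, (vocab, n_books)))
        self.books = nn.Parameter(torch.randn(n_books, book_size, dim) * 0.02)

    def forward(self, token_ids):
        idx = self.codes[token_ids]                          # (..., n_books)
        arange = torch.arange(self.books.size(0), device=idx.device)
        vecs = self.books[arange, idx]                       # (..., n_books, dim)
        return vecs.prod(dim=-2)                             # multiplicative composition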
Efficient Contextual Representation Learning With Continuous Outputs
Liunian Harold Li, Patrick H. Chen, Cho-Jui Hsieh, and Kai-Wei Chang
In Transactions of the Association for Computational Linguistics
(TACL),
2019
Contextual representation models have achieved great success in improving various downstream tasks. However, these language-model-based encoders are difficult to train due to their large parameter sizes and high computational complexity. By carefully examining the training procedure, we find that the softmax layer (the output layer) causes significant inefficiency due to the large vocabulary size. Therefore, we redesign the learning objective and propose an efficient framework for training contextual representation models. Specifically, the proposed approach bypasses the softmax layer by performing language modeling with dimension reduction, and allows the models to leverage pre-trained word embeddings. Our framework reduces the time spent on the output layer to a negligible level, eliminates almost all of the trainable parameters of the softmax layer, and performs language modeling without truncating the vocabulary. When applied to ELMo, our method achieves a 4x speedup and eliminates 80% of the trainable parameters while achieving competitive performance on downstream tasks.
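The essential change is to replace the per-token softmax cross-entropy over the vocabulary with a regression loss against fixed pre-trained output embeddings. Below is a minimal sketch using a cosine objective; the paper's actual continuous-output objective may differ, and proj is a small trainable projection that replaces the softmax layer.

import torch
import torch.nn.functional as F

def continuous_output_loss(hidden, target_ids, pretrained_emb, proj):
    """hidden: (batch, d_model) contextual states; pretrained_emb: frozen (V, d_emb)
    word embeddings; proj: trainable d_model -> d_emb linear layer."""
    pred = proj(hidden)                              # predict an embedding, not a distribution
    target = pretrained_emb[target_ids].detach()     # no gradient into the embedding table
    return 1.0 - F.cosine_similarity(pred, target, dim=-1).mean()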
Learning to Screen for Fast Softmax Inference on Large Vocabulary Neural Networks
Patrick H. Chen, Si Si, Sanjiv Kumar, Yang Li, and Cho-Jui Hsieh
In International Conference on Learning Representations
(ICLR),
2019
Neural language models have been widely used in various NLP tasks, including machine translation, next-word prediction and conversational agents. However, it is challenging to deploy these models on mobile devices due to their slow prediction speed, where the bottleneck is computing the top candidates in the softmax layer. In this paper, we introduce a novel softmax layer approximation algorithm that exploits the clustering structure of context vectors. Our algorithm uses a light-weight screening model to predict a much smaller set of candidate words based on the given context, and then conducts an exact softmax only within that subset. Training such a procedure end-to-end is challenging, as traditional clustering methods are discrete and non-differentiable and thus cannot be used with back-propagation during training. Using the Gumbel softmax, we are able to train the screening model end-to-end on the training set to exploit the data distribution. The algorithm achieves an order of magnitude faster inference than the original softmax layer for predicting top-k words in various tasks such as beam search in machine translation or next-word prediction. For example, for a German-to-English machine translation task with a vocabulary of around 25K words, we achieve a 20.4x speedup with 98.9% precision@1 and 99.3% precision@5 relative to the original softmax layer prediction, while the state-of-the-art method (Zhang et al., 2018) only achieves a 6.7x speedup with 98.7% precision@1 and 98.1% precision@5 on the same task.
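A condensed sketch of the screening path is below: a light-weight model picks a cluster, and the exact softmax is evaluated only over that cluster's candidate words. The candidate lists here are placeholders, and the Gumbel-softmax call only illustrates how the discrete cluster choice is kept differentiable during training.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ScreenedSoftmax(nn.Module):
    def __init__(self, d_model=512, vocab=25_000, n_clusters=128):
        super().__init__()
        self.screener = nn.Linear(d_model, n_clusters)  # light-weight screening model
        self.out = nn.Linear(d_model, vocab)            # original softmax layer
        # candidate_words[c]: word ids kept for cluster c (placeholders here;
        # learned jointly with the screener in the paper).
        self.candidate_words = [torch.arange(vocab) for _ in range(n_clusters)]

    def screen(self, h, tau=1.0):
        # Gumbel-softmax keeps the discrete cluster choice differentiable in training.
        return F.gumbel_softmax(self.screener(h), tau=tau, hard=True)

    @torch.no_grad()
    def predict_topk(self, h, k=5):
        clusters = self.screener(h).argmax(dim=-1)
        out = []
        for hi, ci in zip(h, clusters.tolist()):
            cand = self.candidate_words[ci]
            logits = self.out.weight[cand] @ hi + self.out.bias[cand]
            out.append(cand[logits.topk(k).indices])    # exact softmax over a small subset
        return out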
GroupReduce: Block-Wise Low-Rank Approximation for Neural Language Model Shrinking
Patrick H. Chen, Si Si, Yang Li, Ciprian Chelba, and Cho-Jui Hsieh
In Advances in Neural Information Processing Systems
(NeurIPS),
2018
Model compression is essential for serving large deep neural nets on devices with limited resources or for applications that require real-time responses. As a case study, a neural language model usually consists of one or more recurrent layers sandwiched between an embedding layer used for representing input tokens and a softmax layer for generating output tokens. For problems with a very large vocabulary size, the embedding and softmax matrices can account for more than half of the model size. For instance, the bigLSTM model achieves great performance on the One-Billion-Word (OBW) dataset with a vocabulary of around 800k words, and its word embedding and softmax matrices use more than 6GB of space and are responsible for over 90% of the model parameters. In this paper, we propose GroupReduce, a novel compression method for neural language models based on vocabulary-partition (block) low-rank matrix approximation and the inherent frequency distribution of tokens (the power-law distribution of words). Experimental results show our method significantly outperforms traditional compression methods such as low-rank approximation and pruning. On the OBW dataset, our method achieves a 6.6x compression rate for the embedding and softmax matrices, and when combined with quantization, a 26x compression rate, which translates to 12.8x compression for the entire model with very little degradation in perplexity.
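A bare-bones sketch of block-wise, frequency-aware low-rank compression is below: frequent tokens get a higher-rank block than rare ones. The block count and per-block ranks are illustrative assumptions, and the paper additionally refines the blocks and ranks iteratively.

import numpy as np

def groupwise_low_rank(E, word_freq, n_blocks=4, ranks=(64, 32, 16, 8)):
    """E: (vocab, dim) embedding or softmax matrix; word_freq: (vocab,) token counts."""
    order = np.argsort(-word_freq)                      # most frequent tokens first
    factors = []
    for ids, r in zip(np.array_split(order, n_blocks), ranks):
        U, s, Vt = np.linalg.svd(E[ids], full_matrices=False)
        factors.append((ids, U[:, :r] * s[:r], Vt[:r]))  # two small matrices per block
    return factors

def reconstruct(factors, vocab, dim):
    E_hat = np.zeros((vocab, dim))
    for ids, A, B in factors:
        E_hat[ids] = A @ B                               # rank-limited rows of this block
    return E_hat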