Submitted by olmec-akeru t3_z6p4yv in MachineLearning

…and why?

So the science has moved on quite considerably since the linear methods of PCA and others; about 5±1 years back we had t-SNE and later on VAEs then UMAP. I appreciate that each of these methods is taking a subtly different (ok ok ok, sometimes it's not that subtle) view of the problem, but I wonder what approaches are SOTA now?

Where to now?

150

Comments

Deep-Station-1746 t1_iy2k52e wrote

It fully depends on assumptions. Assumptions about the data, and the model. Without assumptions, you can't do dim reduction.

So, what are you willing to assume? e.g. Assuming you have vast quantities of text data, arguably the current best dim reducers are generative transformers.

76

koiRitwikHai t1_iy48qvn wrote

Can you tell us what the assumptions are for PCA, t-SNE, VAE, and UMAP?

9

SleekEagle t1_iy4k5m4 wrote

It's been a while since I looked at t-SNE and UMAP, but the assumption for PCA is that the data lives near an affine subspace, and for a VAE that the data is well modeled by the distribution whose parameters you are finding. Those are my thoughts, but I'm sure there are other considerations I'd love to hear other people chime in with!
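As a rough illustration of the PCA assumption, here's a minimal sketch (synthetic data only, all names made up) where the data really does live near a 2-d affine subspace of R^10, so two components capture almost all of the variance:

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)

    # Synthetic data near a 2-d affine subspace of R^10: offset + latent @ basis + noise
    basis = rng.normal(size=(2, 10))    # spans the subspace
    latent = rng.normal(size=(500, 2))  # 2-d latent coordinates
    offset = rng.normal(size=10)        # affine shift
    X = offset + latent @ basis + 0.01 * rng.normal(size=(500, 10))

    pca = PCA(n_components=2).fit(X)
    print(pca.explained_variance_ratio_.sum())  # ~1.0: two components explain nearly everything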

6

manojs t1_iy4jrt6 wrote

If an expert on the topic can respond to this, that would be awesome.

2

olmec-akeru OP t1_iy2pewq wrote

Awesome answer: for a set of assumptions, what would you use?

I've seen some novel approaches on arXiv for categorical variables, but can't seem to shake the older deep-learning methods for continuous variables.

1

NonOptimized t1_iy2qybl wrote

>Awesome answer: for a set of assumptions, what would you use?
>
>I've seen some novel approaches on arXiv for categorical variables, but can't seem to shake the older deep-learning methods for continuous variables.

Could you link some of these novel approaches? I'm quite curious what they could be.

3

BrisklyBrusque t1_iy3s0ha wrote

Cool links. I’ll add “entity embeddings” into the mix. Entity embeddings reimagine a categorical variable as a continuous-valued vector and allow us to skip one-hot encoding.
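A minimal sketch of the idea in PyTorch (the feature, its 10 levels, and the 4-d embedding size are all made up for illustration):

    import torch
    import torch.nn as nn

    n_categories, embedding_dim = 10, 4

    # A learned lookup table: each category id maps to a dense 4-d vector,
    # trained jointly with the rest of the model instead of one-hot encoding.
    embedding = nn.Embedding(n_categories, embedding_dim)

    category_ids = torch.tensor([0, 3, 3, 7])  # a batch of categorical values
    dense_vectors = embedding(category_ids)    # shape: (4, 4)
    print(dense_vectors.shape)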

6

olmec-akeru OP t1_iy7a1yc wrote

I fear that the location in the domain creates a false relationship between categories that happen to land close together on that domain

i.e. if you encode at 0.1, 0.2, …, 0.9 you're saying that the category encoded to 0.2 is more similar to 0.1 and 0.3 than it is to 0.9. This may not be true.

1

BrisklyBrusque t1_iy8wfoa wrote

I freely admit I haven’t looked into the math. But my understanding was the embeddings are a learned representation. They are not arbitrary; instead they aim to put categories close to one another on a continuous scale only in those situations where it is justified.

1

new_name_who_dis_ t1_iy3jblq wrote

I’d say that PCA is the most useful method still. The fact that it’s very quick and is a linear transform makes it very easy to use and interpretable

59

cptsanderzz t1_iy4ni89 wrote

How does it make it more interpretable? I feel like wrapping 12 different features into 2 loses a lot of explainability. I'm just learning about the trade-offs of dimensionality reduction, so this is a genuine question.

10

ZombieRickyB t1_iy4ox9u wrote

PCA doesn't just give a visual embedding, it gives an explicit coordinate system. (1,0) in a 2d dimensionality reduction example naturally corresponds to a particular unit vector in the initial space. If you know what your coordinates mean in that space, that gives guidance. Those unit vectors are generalized eigenvectors in the usual sense

A nice example of this: if you have a bunch of, say, black and white images of faces, vectorize them, perform PCA, take one of the eigenvectors, turn it back into an image, and display it, you get something resembling a face, at least for the first few dimensions. By construction, these vectors are orthogonal, so there's some mechanism of variation that turns out to be a little interpretable.
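A rough sketch of that recipe with scikit-learn's bundled LFW faces (assuming the dataset download is acceptable; the parameters are arbitrary):

    import matplotlib.pyplot as plt
    from sklearn.datasets import fetch_lfw_people
    from sklearn.decomposition import PCA

    faces = fetch_lfw_people(min_faces_per_person=50)  # grayscale face images
    X = faces.data                                     # each row is a vectorized image
    h, w = faces.images.shape[1:]

    pca = PCA(n_components=50).fit(X)

    # Reshape the first eigenvector back into image form: an "eigenface"
    plt.imshow(pca.components_[0].reshape(h, w), cmap="gray")
    plt.title("First principal component, displayed as an image")
    plt.show()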

24

new_name_who_dis_ t1_iy4ol7g wrote

It depends on what data you're working with and what you're trying to do. For example for me, I've worked a lot with 3d datasets of meshes of faces and bodies that are in correspondence. And I actually used autoencoders to compress them to the same dimensions as PCA and compared the two.

Basically with the network I'd get less error on reconstruction (especially at lower dimensions). However, the beauty of the PCA reduction was that one dimension was responsible for the size of the nose on the face, another was responsible for how wide or tall the head is, etc.

And you don't get such nice properties from a fancy VAE latent space. Well, you can get a nice disentangled latent space, but it doesn't usually happen for free: you often need to add even more complexity to get something that nice and disentangled. With PCA, it's there by design.

7

olmec-akeru OP t1_iy7ai6s wrote

>beauty of the PCA reduction was that one dimension was responsible for the size of the nose

I don't think this always holds true. You're just lucky that your dataset contains variation confined in such a way that the eigenvectors map this variance onto a visual feature. There is no mathematical property of PCA that makes your statement true.

There have been some attempts to formalise something like what you have described. The closest I've seen is the beta-VAE: https://lilianweng.github.io/posts/2018-08-12-vae/

2

new_name_who_dis_ t1_iy84a83 wrote

It’s not really luck. There is variation in sizes of noses (it’s one of the most varied features of the face) and so that variance is guaranteed to be represented in the eigenvectors.

And beta-VAEs are one of the possible things you can try to get a disentangled latent space yes, although they don’t really work that well in my experience.

1

olmec-akeru OP t1_iy8ajq0 wrote

> the beauty of the PCA reduction was that one dimension was responsible for the size of the nose

You posit that an eigenvector will represent the nose when there are meaningful variations of scale, rotation, and position?

This is very different to saying all variance will be explained across the full set of eigenvectors (which very much is true).

1

new_name_who_dis_ t1_iy8b0jr wrote

It was just an example. Sure not all sizes of nose are found along the same eigenvector.

1

vikigenius t1_iy4tb05 wrote

That's not PCA's fault. No matter what kind of technique you use, if you try to represent a 768d vector in a 2d plane by using dimensionality reduction, you will lose lots of explainability.

The real question is what properties you care to preserve when doing dimensionality reduction, and validating that your technique preserves as much of them as possible.

7

olmec-akeru OP t1_iy7a8fe wrote

So this may not be true: the surface of a Riemannian manifold is infinite, so you can encode infinite knowledge onto its surface. From there the diffeomorphic property allows one to traverse the surface and generate explainable, differentiable vectors.

2

vikigenius t1_iy7kxuu wrote

Huh? Diffeomorphisms are dimensionality preserving. You can't have a diffeomorphism from R^n to R^2 unless n = 2; that's the only way your differential mapping is bijective.

So I am not sure how the diffeomorphisms guarantee that you can have lossless dimensionality reduction.

What can happen is that your data inherently lies on a lower dimensional manifold. For instance, if you have a subset of R^n that has an inherent dimensionality of just 2, then you can trivially represent it in 2 dimensions. For example, if you have a 3d space where the 3rd dimension is an exact linear combination of the 1st and 2nd, then its inherent dimensionality is 2 and you can obviously losslessly reduce it to 2d.
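A tiny sketch of that degenerate case (made-up data whose third column is an exact linear combination of the first two):

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    a, b = rng.normal(size=200), rng.normal(size=200)
    X = np.column_stack([a, b, 2 * a - 3 * b])  # 3 columns, inherent dimensionality 2

    pca = PCA(n_components=2).fit(X)
    X_restored = pca.inverse_transform(pca.transform(X))
    print(np.allclose(X, X_restored))           # True: reducing to 2d loses nothing here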

But most definitely not all datasets have an inherent dimensionality of 2.

1

gooblywooblygoobly t1_iydf4z2 wrote

A super trivial example is that a (hyper)plane is a Riemannian manifold. Since we know that PCA is lossy, and PCA projects to a (hyper)plane, it can't be that projecting to manifolds is enough to perfectly preserve information.

1

Dylan_TMB t1_iy6ie5a wrote

Compared to non-linear methods like kernel PCA and t-SNE it is super interpretable. Also, I don't know of a dimensionality reduction technique that doesn't squeeze features into fewer features; that's kind of the point, no?

1

olmec-akeru OP t1_iy7amgl wrote

I'm not sure: if you think about t-SNE, it's trying to minimise some form of the Kullback–Leibler divergence. That means it's trying to group similar observations together in the embedding space. That's quite different to "more features into fewer features".

1

Dylan_TMB t1_iy7brke wrote

I would disagree. t-SNE is taking points in a higher dimensional space and attempting to find a transformation that places them in a lower dimensional embedding space while preserving the similarities from the original space. In the end each point has its original vector (more features) mapped to a lower dimensional vector (fewer features). The mapping is non-linear, but that is still what the operation produces.

1

olmec-akeru OP t1_iy7i546 wrote

Heya! Appreciate the discourse, it's awesome!

As a starting point, I've shared the rough description of the t-SNE algorithm from Wikipedia:

> The t-SNE algorithm comprises two main stages. First, t-SNE constructs a probability distribution over pairs of high-dimensional objects in such a way that similar objects are assigned a higher probability while dissimilar points are assigned a lower probability. Second, t-SNE defines a similar probability distribution over the points in the low-dimensional map, and it minimizes the Kullback–Leibler divergence (KL divergence) between the two distributions with respect to the locations of the points in the map. While the original algorithm uses the Euclidean distance between objects as the base of its similarity metric, this can be changed as appropriate.

So the algorithm is definitely trying to minimise the KL divergence. In trying to minimise the KLD between the two distributions it is trying to find a mapping such that dissimilar points are further apart in the embedding space.
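For what it's worth, scikit-learn exposes the final value of that objective, so both views show up in a minimal sketch (random data, arbitrary parameters): each point gets mapped to fewer features, and the quantity being minimised is the KL divergence.

    import numpy as np
    from sklearn.manifold import TSNE

    X = np.random.default_rng(0).normal(size=(300, 50))

    tsne = TSNE(n_components=2, perplexity=30, random_state=0)
    X_2d = tsne.fit_transform(X)     # each 50-d point mapped to a 2-d point

    print(X_2d.shape)                # (300, 2): fewer features per observation
    print(tsne.kl_divergence_)       # the KL divergence the optimiser ended up with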

2

imyourzer0 t1_iy35ydo wrote

I don't know why people worry so much about the state of the art. Sometimes the right tool just happens to have existed for a while. In a lot of cases PCA works just fine, or at least to the point where something much more current won't give you a much better answer. Like another commenter has already said, depending on the assumptions you can or are willing to make, the best choice needn't be at the bleeding edge.

38

olmec-akeru OP t1_iy362iv wrote

Cool, I get this—but I think it's important not to keep one's head in the sand. There are new techniques and it's important to grok them.

23

imyourzer0 t1_iy5unj3 wrote

I certainly wouldn’t advise anyone to ignore new methods. That’s a point well taken. I’m only saying that when you have a working method, and its assumptions about your data can be validated (either logically or with tests), you don’t need to start looking for SOTA methods.

Really, the goal is solving a problem, and to do that, you want to find the most appropriate method—not just the newest method. It’s one thing to “keep up with the Joneses” so that when problems arise you know as many of the available tools as possible, but picking an algorithm usually doesn’t depend on whether that algorithm is new.

3

ktpr t1_iy3v2i1 wrote

Unfortunately, deeply understanding your problem and its relationship to prior algorithms is a lot more work than just telling someone that you applied a SOTA algorithm and got decent results.

edit - an apostrophe

15

olmec-akeru OP t1_iy7aszg wrote

Completely correct.

The corollary remains true though: applying the correct algorithm is a function of knowing the set of available algorithms. The newness of the algorithm isn't a ranking feature.

2

resented_ape t1_iy3xso2 wrote

The t-SNE paper came out in 2008, so things are not moving as quickly as OP suggests IMO.

For the t-SNE/UMAP family of algorithms, which can be boiled down to attempting to embed the k-nearest neighbors graph, Dmitry Kobak has published several papers of interest. I would point you to Attraction-Repulsion Spectrum in Neighbor Embeddings as a good place to start.

17

olmec-akeru OP t1_iy7augu wrote

Thank you, thank you, thank you! Appreciate this helpful link!!

1

IntelligenXia t1_iy3wjh4 wrote

UMAP !

https://umap-learn.readthedocs.io/en/latest/

As simple as

mapper2 = umap.UMAP().fit(data)
14

daking999 t1_iy4innd wrote

Good for visualization maybe? Not good for further downstream analysis.

5

RetroPenguin_ t1_iy4mjz5 wrote

Yeah, can't be a preprocessing step for clustering etc. But nice for qualitative visualizations, I suppose.

2

brucebay t1_iy53ib4 wrote

One neat feature of UMAP is that you can map new data into the same space later. None of the algorithms in scikit-learn does that, if I remember correctly.

The problem with UMAP is that you cannot give it a custom distance metric in the original space and ask it to preserve the proximity of the original points.
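A minimal sketch of that fit-then-transform pattern with umap-learn (random data, purely illustrative):

    import numpy as np
    import umap

    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(500, 20))
    X_new = rng.normal(size=(100, 20))

    reducer = umap.UMAP(n_components=2).fit(X_train)
    new_embedding = reducer.transform(X_new)  # new points mapped into the same 2-d space
    print(new_embedding.shape)                # (100, 2)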

1

Atom_101 t1_iy2pta1 wrote

Unsupervised contrastive learning is sort of like dimensionality reduction. So something like CLIP Huge can be considered SOTA, I guess.

11

olmec-akeru OP t1_iy363p5 wrote

Thanks!

3

Tgs91 t1_iy491cv wrote

Important piece of this question is whether you want a lossy dim reduction or lossless. With something like PCA you can reconstruct the original dimension. With a deep learning based method, there is some degree of information loss (which could be a good thing since it's supervised information loss / choosing important information to retain). If you want to be able to reconstruct the original inputs, you'd need to also build a generator model that is supervised to reconstruct the original. You can get very high quality reconstructions, but it won't be an exact match of the original.
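A small sketch of reconstructing through PCA (random data, so the exact numbers mean nothing): the reconstruction is essentially exact only when all components are kept, and gets lossier as components are dropped.

    import numpy as np
    from sklearn.decomposition import PCA

    X = np.random.default_rng(0).normal(size=(200, 30))

    for k in (30, 10, 2):
        pca = PCA(n_components=k).fit(X)
        X_hat = pca.inverse_transform(pca.transform(X))
        print(k, np.mean((X - X_hat) ** 2))  # ~0 only when all 30 components are kept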

3

radarsat1 t1_iy4mskn wrote

? if you remove dimensions after PCA, you cannot reconstruct the original data ... well, you get something very blurry in any case.

3

itsyourboiirow t1_iy5aa1i wrote

Correct. But you don't necessarily have to discard the extra dimensions to do PCA.

1

radarsat1 t1_iy5e7jo wrote

but the question is about dimensionality reduction

1

JustOneAvailableName t1_iy7bqul wrote

PCA is also lossy

> You can get very high quality reconstructions, but it won't be an exact match of the original.

That's with a GAN (among others); a VAE, for example, is fuzzy.

1

BrisklyBrusque t1_iy3slot wrote

Well sometimes dimension reduction is used to maintain the most important aspects of the data in as few vectors as possible, particularly when we want to visualize high-dimensional data or escape the curse of dimensionality. Other times dimension reduction is more of a regularization technique. Think of self-organizing maps, RBMs, autoencoders, and other neural nets that learn a representation of the data, which can then be passed to another neural net as the new training sample.

So dimension reduction is itself a technique with many distinct applications.

9

trutheality t1_iy49f5f wrote

For dimensionality reduction to feed into other computational tasks like classification/clustering, the SOTA is to use data-specific NNs: autoencoders, [thing]2vec, transformer embeddings, graph embeddings, stuff like that.

UMAP is nice for visualization but you should pretty much never do any downstream computational analysis (like clustering) on the output.

That said, there are plenty of cases where PCA does a good enough job and anything more complicated isn't worth the loss of interpretability and theoretical elegance.

5

ProdigyManlet t1_iy5xbam wrote

Why's UMAP not good for downstream clustering?

2

ZombieRickyB t1_iy60acn wrote

There is a large amount of distortion done in the last step of UMAP, namely the whole cross-entropy optimization. The output is designed to look pretty and separate things, at the risk of distorting distances and the underlying "space" in the process. Clustering quality risks being questionable.

3

backtorealite t1_iy6g1c8 wrote

Does this hold true for tsne? Can tsne be used for downstream clustering?

1

ZombieRickyB t1_iy6md04 wrote

Yep, same idea. You're optimizing a function; there are zero guarantees on any particular distance-based distortion. You can use it for visualization but little else.

1

backtorealite t1_iy6uivr wrote

But saying it’s good for visualization is equivalent to saying it’s good for decomposing data into a 2D framework. So either it’s completely useless and shouldn’t be used for visualization or has some utility in downstream analysis. Doesn’t really make sense to say both. And we all know it’s not completely useless so I think it’s a bit unfair to say it should only ever be used for visualization.

2

ZombieRickyB t1_iy70dqa wrote

Being good for visualization does not mean it's good for working in. The non-isometric nature is the killer. Your space is fundamentally distorted, so what you input is not necessarily reflected in the visualizations.

Here is a stackexchange thread discussing the main issue (same holds for UMAP) but I'll highlight a different example: https://stats.stackexchange.com/questions/263539/clustering-on-the-output-of-t-sne

Probably the most famous example of why you really can't assume meaning in general comes from the stereographic projection. This is a diffeomorphism between the sphere without a point and the plane. Points close by on the punctured sphere, specifically around the puncture, absolutely need not be close on the plane; they'll actually be quite far apart. The minute you start working with diffeomorphisms of arbitrarily high conformal distortion, your results get worse. Is there data that is reasonably "sphere-valued"? Absolutely, I see it a bunch. Any attempt to flatten it will destroy the geometry in some fashion. This is just one example.

You have two things with these setups that put you at risk: you're taking a minimum of some sort, which never needs to be a smooth mapping, and generally a log of something appears somewhere, which fundamentally changes geometry. It's there, it's just usually hidden because someone at some point is assuming something is approximately Gaussian. That's why the energies look the way they do.
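A quick numerical sketch of that stereographic-projection point (projecting the unit sphere, minus its north pole, onto the plane z = 0):

    import numpy as np

    def stereographic(p):
        """Project a point on the unit sphere (minus the north pole) onto the plane z = 0."""
        x, y, z = p
        return np.array([x / (1 - z), y / (1 - z)])

    # Two points very close together near the puncture (the north pole), on opposite sides
    eps = 1e-3
    p1 = np.array([np.sin(eps), 0.0, np.cos(eps)])
    p2 = np.array([-np.sin(eps), 0.0, np.cos(eps)])

    print(np.linalg.norm(p1 - p2))                                # ~0.002 apart on the sphere
    print(np.linalg.norm(stereographic(p1) - stereographic(p2)))  # ~4000 apart in the plane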

2

olmec-akeru OP t1_iy7bv3u wrote

You can resolve the isometric constraint by using a local distance metric dependent on the local curvature: hint, look at the Riemann curvature tensor.

1

backtorealite t1_iy7z5lu wrote

But I feel like this is equivalent to a statistician telling me not to trust my XGBoost model with 99% accuracy while being fine with my linear model with 80% accuracy. If it works, it works. Unrealistic data transformations happen in all types of models, and as long as you aren't just selecting the prettiest picture that you arrived at by chance, I see no problem with relying on an unsupervised transformation that may consist of some unrealistic transformations if it is still fundamentally highly effective at getting what you want. If I know my data has interaction and non-linear effects but don't know which variables will have such effects, it seems like a UMAP or t-SNE transformation to two dimensions is a perfectly reasonable option and preferable to PCA in that situation. I feel like the problems you describe are mostly solved by just adjusting the parameters and making sure the clusters you find are robust to those alterations.

1

ZombieRickyB t1_iy94v38 wrote

I mean, yeah, you're not wrong. If it works for you, it works for you. It's problem-space dependent, and there's virtually no research suggesting how much, if at all, things will be distorted in the process for given scenarios. For my work, I need theoretical bounds on conformal/isometric distortion; the distances involved are infinitely more important than classification accuracy. I work in settings where near-perfect classification accuracy is absolutely not expected, so clean separation of clusters would just call the results into question.

There have been a number of cases, both theoretical and in practice, where t-SNE and UMAP give results with questionable reliability. I'm sure I could get an example in my space with little effort as well, and I'd rather go through some nonlinear transforms I know well in terms of how they work than spend a bunch of time tuning optimization hyperparameters that could take forever.

1

trutheality t1_iy953rr wrote

It's "good" for visualization in the sense that it can give you something to look at, but it's not really good for visualization. You can't even guarantee that the nearest neighbor of a point in the projection is its nearest neighbor in the input space.

This paper demonstrates that you can make the output look like anything you want and still minimize the UMAP/tSNE objectives: https://www.biorxiv.org/content/10.1101/2021.08.25.457696v3

1

resented_ape t1_iybprpj wrote

FWIW, I don't think that is what the paper demonstrates. The Picasso method the authors introduce uses a totally different cost function based on distance reconstruction. For a specific set of metrics the authors are interested in, they say that Picasso produces results comparable with UMAP and t-SNE. But it's not the UMAP or t-SNE objective.

With the scRNAseq dataset in the python notebook at the github page for picasso, I found that for the metrics one might usually be interested in with UMAP and t-SNE, e.g. neighborhood preservation (what proportion of the k-nearest neighbors in the input and output space are preserved) or Spearman rank correlation (or triplet ordering preservation) of input vs output distances, Picasso did (quite a bit) worse than UMAP and t-SNE.

This might not be relevant for downstream scRNAseq workflows -- I will take the authors' word on that. At any rate, on my machine Picasso runs very slowly, and I found its own output to be visually unsatisfactory with the default settings on other datasets that I tried it with (e.g. MNIST), so I have been unable to generate a similar analysis for a wide range of datasets. So take that for what it's worth.
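For reference, a bare-bones version of that neighborhood-preservation metric (my own sketch, not the notebook's code), assuming scikit-learn's NearestNeighbors is acceptable:

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def neighborhood_preservation(X_high, X_low, k=15):
        """Mean fraction of each point's k nearest neighbors shared between the two spaces."""
        nn_high = NearestNeighbors(n_neighbors=k).fit(X_high)
        nn_low = NearestNeighbors(n_neighbors=k).fit(X_low)
        idx_high = nn_high.kneighbors(return_distance=False)  # query points excluded from their own neighbors
        idx_low = nn_low.kneighbors(return_distance=False)
        overlaps = [len(set(a) & set(b)) / k for a, b in zip(idx_high, idx_low)]
        return float(np.mean(overlaps))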

1

proMatrixMultiplier t1_iy6hobk wrote

Does this point hold true for parametric umap as well?

1

ZombieRickyB t1_iy6qf71 wrote

No reason it wouldn't. You might be able to prove some bounds on the amount of distortion, but the fundamental issue is still present.

1

olmec-akeru OP t1_iy7bppp wrote

Said differently: it's not an embedding on a continuous manifold? The construction of simplexes is such that infinite curvature could exist?

1

ZombieRickyB t1_iy957ed wrote

I mean, I imagine it is an embedding in the rigorous sense in some cases. Can't really prove it, though. The issue isn't really about curvature so much as distances. Can't flatten an object homeomorphic (not necessarily diffeomorphic) to a sphere without introducing mega distortion in the metric space sense.

1

trutheality t1_iy92ygu wrote

As u/ZombieRickyB said, the short answer is that it distorts distances to the point that you can't rely on them in downstream clustering.

There are two papers that do a really good deep dive into it:

This one: https://www.biorxiv.org/content/10.1101/2021.08.25.457696v1 where they both show that the distances pretty much have to be distorted and that the minimizer of the objective is such that you can make the output look pretty much like anything while minimizing.

And this one: https://jmlr.org/papers/volume22/20-1061/20-1061.pdf that studies which aspects of the objective functions of these methods affect the structure at different scales.

2

jamesxli t1_iy6t06x wrote

The output of t-SNE/UMAP is actually good for downstream analysis, and they have been widely used for cluster discovery in very high dimensional data (>10K dim). t-SNE/UMAP are thus often just referred to as clustering algorithms!

2

trutheality t1_iy943od wrote

They've been used for that, but they really really shouldn't.

1

i-heart-turtles t1_iy4sitd wrote

I see many people in this thread dislike UMAP for downstream tasks, but I don't see any good justification. I think UMAP is still regarded by many academic researchers as a state of the art for top-down manifold learning, for deriving topologically correct embeddings with small local distortion.

For bottom-up manifold learning, I think the state-of-the-art methods are based on local tangent space alignment. Here is one recent paper, Low Distortion Laplacian Eigenmaps: https://www.jmlr.org/papers/v22/21-0131.html

In real life, the big issue for nonlinear dl is usually noise.

4

olmec-akeru OP t1_iy7c0o8 wrote

Thanks for the paper! It's been linked to previously in this post—Figure 1 is gorgeous.

1

ZombieRickyB t1_iy45jln wrote

I get the question but at the same time I don't. It really depends on what your goal is.

Case in point: I can use little more than PCA and maybe a bit of diffusion maps. There are fundamental properties that make me need these. Can other methods separate better? Sure! There are some really neat pictures you can make. But, to get those pictures, there are changes that need to be made for my data that make things a bit unworkable unless I can invert them, and generally, I can't. This doesn't matter to other people, but it's everything to me.

State of the art is what you make of it. For me? PCA still is, as it will be for many others. Doesn't matter if you can separate better, that's like the least of my interests. That being said, what is it for you?

Note: I hate quantitative classification benchmarks in general so any take I have related to "SOTA" will always be biased.

2

sin_aim t1_iy5faaj wrote

For more difficult scientific problems try diffusion maps.

2

solresol t1_iy5qxoy wrote

I rather like isolation kernel methods: https://arxiv.org/pdf/2109.14198.pdf

The idea is that you take a random subset of points, and then work out for each point in your dataset which one of those random points it is closest to.

Repeat that process some large number of times. Points that regularly get mapped to the same exemplar are obviously close to each other.

For some tasks that's enough. Otherwise, you can feed that output into something else (e.g. t-SNE) and get much better results than if you try to reduce dimensionality directly.
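A rough sketch of that mechanism (not the exact isolation-kernel construction from the paper, just the repeated random-exemplar idea described above):

    import numpy as np

    def exemplar_features(X, n_repeats=100, n_exemplars=16, seed=0):
        """For each repeat, pick random exemplar points and record which exemplar
        each point in X is closest to, as a one-hot block; stack the blocks."""
        rng = np.random.default_rng(seed)
        n = X.shape[0]
        blocks = []
        for _ in range(n_repeats):
            exemplars = X[rng.choice(n, size=n_exemplars, replace=False)]
            dists = np.linalg.norm(X[:, None, :] - exemplars[None, :, :], axis=-1)
            blocks.append(np.eye(n_exemplars)[dists.argmin(axis=1)])  # one-hot of nearest exemplar
        # Points that regularly share an exemplar end up with similar feature vectors;
        # the result can then be fed into something else, e.g. t-SNE.
        return np.hstack(blocks)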

2

Far-Butterscotch-436 t1_iy3vwok wrote

It's more like a toolkit than one method that works better than the rest. Don't forget to add an autoencoder to your list.

1

bigbadlamer t1_iy4dg1l wrote

I work with stock data for the most part, where PCA is still the way to go. Curious if people know if UMAP/t-SNE or other mentioned techniques are used successfully for this subject area.

1

LearnyMcLearnFace t1_iy56wl1 wrote

It's not precisely dimensionality reduction, but one common method is to convert data points into embedding vectors taken from some layer of a neural network that's trained on data points like yours for <waves hands> some purpose.
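A hedged sketch of that pattern with a pretrained torchvision classifier (the choice of ResNet-18 and of the penultimate layer is arbitrary, and the weights argument assumes a recent torchvision):

    import torch
    from torchvision import models

    # Pretrained classifier with the final classification layer chopped off,
    # used as a feature extractor / de facto dimensionality reducer.
    resnet = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
    feature_extractor = torch.nn.Sequential(*list(resnet.children())[:-1])
    feature_extractor.eval()

    images = torch.randn(8, 3, 224, 224)  # stand-in batch of images
    with torch.no_grad():
        embeddings = feature_extractor(images).flatten(1)
    print(embeddings.shape)               # (8, 512)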

0

bmrheijligers t1_iy63d4a wrote

How about PCA and dimensionality reduction on binary data? Today it has come to my attention that doing PCA on binary data might lose some of the trustworthiness I have come to associate with PCA on continuous data. For reference, see this discussion: pca on binary data - stack exchange

0

flat5 t1_iy4fqcf wrote

What is the best kitchen tool? What is the best vehicle?

Are these questions kind of dumb?

−2