jakderrida

jakderrida t1_jdf8zip wrote

Whoever is downvoting you just doesn't get it.

My joke was that "structural" was so meaningless that it's obviously a backronym solely in service of my pun.

/r/VictorMollo's joke is that we should all just go off the deep end and double down on blatantly obvious backronyms.

Notice he used the word "Widget" instead of freaking "Weighted"? He obviously chose to Taylor it that way because he appreciates my puns.

3

jakderrida t1_jdb95pw wrote

How about SPIT, or Sparse Parameter Iso-FLOP Transformations?

Or would SPLIT (Sparse Performance-focused Lightweight Iso-FLOP Transformations) work? Or let's choose whatever's SAFIST, or Sparse Accuracy-focused FLOP-Isometric Structural Transformations?

Who cares that I obviously had to shoehorn "Structural" in there just to get my pun across?

17

jakderrida t1_jcotnis wrote

The basis of this rule of thumb is that having too few observations relative to the number of predictor variables can lead to unstable estimates of the model parameters, making it difficult to generalize to new data. With too few observations, the model may end up fitting the noise in the data rather than the underlying signal, which is overfitting.
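
A minimal sketch of that failure mode (toy data and sklearn's LinearRegression; none of this is from the original comment): with more predictors than observations, the model fits the training data almost perfectly but generalizes poorly.

```python
# Toy illustration: 20 observations, 50 predictors.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n_train, n_test, n_features = 20, 200, 50

X_train = rng.normal(size=(n_train, n_features))
X_test = rng.normal(size=(n_test, n_features))
# The true signal depends only on the first feature; everything else is noise.
y_train = X_train[:, 0] + rng.normal(scale=0.5, size=n_train)
y_test = X_test[:, 0] + rng.normal(scale=0.5, size=n_test)

model = LinearRegression().fit(X_train, y_train)
print("train R^2:", model.score(X_train, y_train))   # ~1.0: the model fit the noise
print("test  R^2:", model.score(X_test, y_test))     # far lower: poor generalization
```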

1

jakderrida t1_j30u1jq wrote

Reply to comment by hysse in [D] Simple Questions Thread by AutoModerator

To be clear, ChatGPT provided the answer.

Here's how ChatGPT answered your next question...

It is generally true that a Rust backend can provide better performance than a pure Python implementation, especially for a high-throughput task like tokenization. This is because Rust is a systems programming language that is designed to be fast and efficient. It can also make use of low-level optimization techniques that are not possible in Python, such as manual memory management.

That being said, it is still possible to write a fast and efficient tokenizer in Python. There are several ways to do this:

  1. Use optimized libraries and runtimes: Libraries such as NumPy are heavily optimized, and alternative interpreters such as PyPy can speed up pure-Python code. Using these can help to improve the performance of your tokenizer.

  2. Use Cython or Numba: Cython and Numba are tools that can be used to optimize Python code by compiling it to native machine code. This can provide a significant performance boost over pure Python code.

  3. Use multithreading or multiprocessing: If your tokenizer is CPU-bound, you can use multithreading or multiprocessing to parallelize the work and make use of multiple CPU cores.

  4. Use a just-in-time (JIT) compiler: JIT compilers, such as Numba, can analyze your Python code at runtime and generate optimized machine code on the fly. This can provide a significant performance boost over pure Python code.

Overall, there are many ways to optimize the performance of a Python-based tokenizer. It is worth considering using a combination of these techniques to achieve the best results.
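
As a rough illustration of point 3 above, here is a toy whitespace "tokenizer" (a placeholder for a real one) parallelized across CPU cores with multiprocessing:

```python
from multiprocessing import Pool

def tokenize(text):
    # Stand-in tokenizer; a real one would handle punctuation, casing, subwords, etc.
    return text.lower().split()

if __name__ == "__main__":
    docs = ["The quick brown fox", "jumps over the lazy dog"] * 10_000
    with Pool() as pool:                                   # one worker per CPU core by default
        tokenized = pool.map(tokenize, docs, chunksize=1_000)
    print(len(tokenized), tokenized[0])
```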

1

jakderrida t1_j300s6g wrote

Quantization-aware training: PyTorch provides a set of APIs for performing quantization-aware training, which allows you to train a model with quantization in mind and can often result in higher-quality quantized models. You can find more information about quantization-aware training in the PyTorch documentation (https://pytorch.org/docs/stable/quantization.html#quantization-aware-training).

Post-training static quantization: PyTorch also provides APIs for performing post-training static quantization, which involves quantizing a model that has already been trained. You can find more information about post-training static quantization in the PyTorch documentation (https://pytorch.org/docs/stable/quantization.html#post-training-static-quantization).

Dynamic quantization: PyTorch also supports dynamic quantization, which allows you to quantize a model at runtime. This can be useful for applications where the model needs to be deployed on devices with limited memory or computational resources. You can find more information about dynamic quantization in the PyTorch documentation (https://pytorch.org/docs/stable/quantization.html#dynamic-quantization).
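
For instance, a minimal sketch of dynamic quantization on a toy model (illustrative only, not taken from the linked docs):

```python
import torch
import torch.nn as nn

# Toy model; quantize its Linear layers to int8 weights at runtime.
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))

quantized_model = torch.quantization.quantize_dynamic(
    model,             # the (already trained) float model
    {nn.Linear},       # which layer types to quantize
    dtype=torch.qint8  # 8-bit integer weights
)

x = torch.randn(1, 128)
print(quantized_model(x).shape)   # torch.Size([1, 10])
```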

2

jakderrida t1_j300a3x wrote

Sociology: A team of researchers used machine learning to analyze social media data and predict the likelihood of an individual becoming homeless (https://www.nature.com/articles/s42256-019-0106-5).

Psychology: A group of psychologists used machine learning to predict the likelihood of a person developing depression based on their social media posts (https://www.nature.com/articles/s41562-017-0214-1).

Political science: Researchers used machine learning to analyze political text data and predict the likelihood of conflict in different regions of the world (https://www.pnas.org/content/115/41/10302).

12

jakderrida t1_j2zybhc wrote

There are a few ways to determine when to stop training a natural language understanding (NLU) model:

Monitoring the performance on a validation set: One approach is to monitor the performance of the model on a validation set during training and stop training when the performance on the validation set stops improving or starts to degrade. This can help to prevent overfitting and ensure that the model generalizes well to new data.

Using early stopping: Another approach is to use early stopping, which involves setting a maximum number of epochs and stopping training when the performance on the validation set has not improved for a certain number of epochs. This can help to prevent overfitting by stopping training when the model is no longer making progress.

Using human evaluation: If you have access to human annotators, you can also use human evaluation to determine when the model is ready for production. You can use a subset of your data as a test set and have the annotators evaluate the model's performance on this test set. When the model's performance meets your desired accuracy threshold, you can consider it ready for production.

Ultimately, the best way to determine when a model is production-ready will depend on the specific requirements of your application and the resources available to you. It may be helpful to experiment with different approaches and see which one works best for your particular case.
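
A minimal sketch combining the first two points, with a toy PyTorch model and synthetic data (purely illustrative): monitor validation loss each epoch and stop when it hasn't improved for `patience` epochs.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
X_train, y_train = torch.randn(512, 16), torch.randint(0, 2, (512,))
X_val, y_val = torch.randn(128, 16), torch.randint(0, 2, (128,))

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

best_val, patience, bad_epochs = float("inf"), 3, 0
for epoch in range(100):
    model.train()
    optimizer.zero_grad()
    loss_fn(model(X_train), y_train).backward()
    optimizer.step()

    model.eval()
    with torch.no_grad():
        val_loss = loss_fn(model(X_val), y_val).item()   # monitor the validation set

    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
        torch.save(model.state_dict(), "best.pt")        # keep the best checkpoint
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            print(f"Early stopping at epoch {epoch}, best val loss {best_val:.4f}")
            break
```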

1

jakderrida t1_j2zy3s2 wrote

I would recommend considering the following strategies to handle imbalanced labels in your dataset:

Oversampling: You can oversample the minority classes by generating synthetic examples or by sampling with replacement from the minority classes. This can help to balance the class distribution and improve the model's performance on the minority classes.

Undersampling: You can undersample the majority classes by randomly sampling a smaller number of examples from the majority classes. This can help to balance the class distribution and prevent the model from being biased towards the majority classes.

Weighted loss: You can assign higher weights to the minority classes in the loss function to give them more influence on the model's learning. This can help to balance the class distribution and improve the model's performance on the minority classes.

Class-specific metrics: You can use metrics that are specifically designed to evaluate the model's performance on imbalanced datasets, such as the F1 score or the AUC (Area Under the Curve) of a precision-recall curve.

In your particular case, you may want to consider oversampling or using weighted loss, since you have only one example for some of the minority classes. It may also be helpful to combine these strategies to achieve the best results.
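
For example, a minimal sketch of the weighted-loss option in PyTorch (toy labels, purely illustrative): weight each class inversely to its frequency in the training labels.

```python
import torch
import torch.nn as nn

labels = torch.tensor([0] * 90 + [1] * 9 + [2] * 1)   # 90 / 9 / 1 examples across three classes
counts = torch.bincount(labels).float()
weights = counts.sum() / (len(counts) * counts)       # rarer class -> larger weight

loss_fn = nn.CrossEntropyLoss(weight=weights)         # minority mistakes cost more
logits = torch.randn(len(labels), len(counts))        # stand-in for model outputs
print(loss_fn(logits, labels))
```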

1

jakderrida t1_j2zxul1 wrote

Reply to comment by hysse in [D] Simple Questions Thread by AutoModerator

The Hugging Face tokenizers library is a popular tool for training a tokenizer and is relatively easy to use. It sits alongside the Transformers library, which integrates with PyTorch, and that ecosystem provides a wide range of pre-trained models and tools for natural language processing tasks.

In terms of efficiency, the Hugging Face library should be sufficient for most use cases. However, if you need to train a very large model or you want to optimize the training process for maximum efficiency, you may want to consider working with a lower-level framework like PyTorch or TensorFlow directly.

Other natural language processing libraries like NLTK (Natural Language Toolkit) and torchtext are also useful for a variety of tasks, such as text preprocessing, part-of-speech tagging, and language modeling. NLTK is a general-purpose library that provides a wide range of tools for working with human language data, while torchtext is a PyTorch library that provides tools for preprocessing and working with text data in PyTorch.
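
A minimal sketch of training a BPE tokenizer with the Hugging Face tokenizers library (the corpus file path is a placeholder for your own text file):

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

trainer = BpeTrainer(vocab_size=8000, special_tokens=["[UNK]", "[PAD]"])
tokenizer.train(files=["corpus.txt"], trainer=trainer)   # "corpus.txt" is a placeholder

print(tokenizer.encode("Training a tokenizer is straightforward.").tokens)
```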

3

jakderrida t1_j2zxpxe wrote

The batch size, learning rate, and number of epochs can all affect the model's performance on a smaller dataset. Here are some general guidelines that you can use as a starting point:

Batch size: A smaller batch size is often more appropriate for smaller datasets because it gives the model more frequent (and noisier) weight updates per epoch, which can act as a mild regularizer. For example, a batch size of 32 or 64 is a good starting point for a smaller dataset.

Learning rate: The learning rate determines how quickly the model updates its weights. A higher learning rate can allow the model to make rapid progress at the beginning of training, but it can also make training unstable or cause the model to overshoot good solutions. A lower learning rate makes progress slower, but it can help the model generalize better to new data. A learning rate in the range of 0.001 to 0.01 is a good starting point for a smaller dataset.

Number of epochs: The number of epochs is the number of times the model sees the entire dataset during training. A smaller dataset may require fewer epochs to prevent overfitting. For example, you may want to start with a small number of epochs (e.g., 10 or 20) and increase it if the model's performance on the validation set is still improving.

Keep in mind that these are just general guidelines, and the optimal batch size, learning rate, and number of epochs will depend on the specific characteristics of your dataset and model. It may be helpful to experiment with different combinations of these hyperparameters to find the best settings for your particular case.
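
A minimal sketch wiring those starting points together in PyTorch (synthetic data, purely illustrative): batch size 32, learning rate 1e-3, 20 epochs.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1000, 16), torch.randint(0, 2, (1000,)))
loader = DataLoader(dataset, batch_size=32, shuffle=True)        # smaller batch size

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)        # LR in the 1e-3 to 1e-2 range
loss_fn = nn.CrossEntropyLoss()

for epoch in range(20):                                          # modest number of epochs
    for X, y in loader:
        optimizer.zero_grad()
        loss_fn(model(X), y).backward()
        optimizer.step()
```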

3

jakderrida t1_j0e4mu6 wrote

>It would make something that looked like a generic cousin.

That sounds incredibly impressive, if they can. If we could train a model on enough pairs of DNA and mugshot-style photos to generate a likeness from new DNA samples, I'd think it would give us a much better idea of what ancient peoples, whose DNA we already have, actually looked like. Every time I see some computer-rendered depiction, I wonder how close they'd get with my DNA.

2

jakderrida t1_j02bzq2 wrote

>It must be a pretty hard problem.

Not particularly. The only hurdle is the database. I collected all the Seeking Alpha articles and tags very easily, then organized the data and built the model on Colab with astonishing success.

An alternative would be to take literature from great writers (James Joyce, Emily Brontë, etc.), divide it into paragraphs, drop the paragraphs that are too short, and tag the rest as 1; then take awful writing (Twilight, Ann Coulter, Mein Kampf, etc.), do the same, and tag those as 0, before training a model to separate the two.
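
A rough sketch of that setup (placeholder paragraphs, and a simple TF-IDF + logistic regression baseline rather than the NLP model I actually used):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Placeholder paragraphs; in practice these would be scraped, split, and length-filtered.
good_paragraphs = ["an example paragraph from a great writer",
                   "another carefully written paragraph"]        # tagged 1
bad_paragraphs = ["an example paragraph of awful writing",
                  "another badly written paragraph"]             # tagged 0

texts = good_paragraphs + bad_paragraphs
labels = [1] * len(good_paragraphs) + [0] * len(bad_paragraphs)

clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(texts, labels)
print(clf.predict(["some new paragraph to score"]))
```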

2

jakderrida t1_j02apso wrote

Well, for one, flipping the script already occurs. When I was an electrician, a manager overheard me claim that a device measures resistance in the circuit. He proclaimed that it measures the continuity of the charge going through it. I repeatedly told him it's the same thing, with no success.

If it measures the probability that a paper has many citations, then one minus that probability is the probability that it has few citations.

Now, if what you're looking for is something like short stories, the hurdle to cross would be finding pretagged data that you'd consider a reliable measure of "interesting/engaging", which could then be converted into mutually exclusive dummy variables for the NLP model to train on. The reason I mentioned published research and citations is only because that data is massive, well-defined, and feasible to collect as metrics with associated texts.

Just to ensure you don't waste your time with any dreams of building the database without outside sources, I want you to realize that deep learning/neural network methods tend to produce terrible results unless the training data is pretty massive. Even the 50,000 tagged articles I used from Seeking Alpha would be considered somewhat frivolous of me by most in the ML community. Not because they're jerks or anything, but because that's just how NNs work.

2

jakderrida t1_j025iqs wrote

Problem with that is that using engagement or clicks will just give you an inferior version of Facebook's formula for turning retirees into conspiracy theorists.

On the other hand, I think you could make one, perhaps by scraping the abstracts of published research and differentiating between those that later received extraordinary numbers of citations and those that didn't. I actually trained NLP models on Seeking Alpha's author-tagged articles (Bullish or Bearish on the stocks the articles pertained to), and while I started out just hoping to beat a coin toss, the results surged to over 90% accuracy.

1

jakderrida t1_izjoa2g wrote

I recall an AMA in which a blind redditor answered blindness-related questions. When asked whether and how he became blind, he quite impressively gave an unemotional account of being among several family members his father shot in the head while they slept, before the father fled town. I'm admittedly not a parent, but I simply can't grasp why someone would murder their whole family before fleeing.

18