
PySpark sentiment analysis invalid output

I am trying to perform sentiment analysis for a use case. Most of the time, it is giving correct results, but in some cases, even positive comments are being marked as negative. How can I fix my code to achieve better accuracy?

My code

from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType
from transformers import pipeline
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# NLTK's stopword list and tokenizer data must be downloaded once, e.g.:
# nltk.download('stopwords'); nltk.download('punkt')

# Define the filter_stopwords function
def filter_stopwords(sentence):
    stop_words = set(stopwords.words('english'))
    word_tokens = word_tokenize(sentence)
    filtered_sentence = [w for w in word_tokens if w not in stop_words]
    return " ".join(filtered_sentence)

# Initialize the sentiment analysis pipeline with a different model
sentiment_pipeline = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")

# Define a function to get sentiment using the pipeline
def get_sentiment(text):
    filtered_text = filter_stopwords(text)
    result = sentiment_pipeline(filtered_text)[0]
    return result['label'].lower()  # returns 'positive' or 'negative'

# Register the function as a UDF
sentiment_udf = udf(get_sentiment, StringType())

# df = df.withColumn("sentiment", sentiment_udf(col("text_column")))

Input data

  1. Didn't get it right the first 2 times but when it was fixed it was fixed well.

  2. The Response time was grate -- note: "grate" looks like a spelling mistake (for "great"), but the review is positive.

  3. The initial agent contact could not resolve my issue but escalated it quickly to someone who could.

For these inputs I am expecting all of them to be classified as positive, but instead I am getting negative.

asked Oct 15 '25 by sande
1 Answer

Your code is perfectly fine, but the issue lies with how the model interprets your inputs, not with the implementation itself. I ran your inputs directly on the Hugging Face model page for distilbert-base-uncased-finetuned-sst-2-english.

The first example ("Didn't get it right the first 2 times but when it was fixed it was fixed well") returns positive as expected. The other two examples, however, return negative, confirming that the issue isn’t your code but how the model processes those inputs. These models are trained on general datasets and are likely to misinterpret domain-specific or nuanced inputs, especially ones with mixed sentiment signals or spelling mistakes like "grate" in your example. Phrases like "could not resolve my issue" in your third input might outweigh the positive sentiment in "escalated it quickly to someone who could."
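If you want to reproduce this check locally rather than on the model page, a minimal sketch with the same checkpoint looks like this (the labels noted in the comment are the results described above):

from transformers import pipeline

# Same checkpoint as in the question.
sentiment_pipeline = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")

examples = [
    "Didn't get it right the first 2 times but when it was fixed it was fixed well.",
    "The Response time was grate",
    "The initial agent contact could not resolve my issue but escalated it quickly to someone who could.",
]

for text in examples:
    result = sentiment_pipeline(text)[0]
    # Prints the predicted label and confidence; here the first example
    # comes back POSITIVE and the other two NEGATIVE.
    print(result["label"], round(result["score"], 3), "-", text)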

I suggest you explore the following:

  1. Test your outlier examples on different sentiment analysis models directly on Hugging Face. You might find models that are more aligned with the nature of your data (see the model-swap sketch after this list).
  2. Preprocess your data to fix spelling mistakes and/or rephrase ambiguous or mixed-sentiment phrases (a normalization sketch follows as well).
  3. Finally, if you are still not satisfied with the results and can't tolerate the marginal error, annotate your data and use it to fine-tune a sentiment analysis model.
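For the first suggestion, swapping checkpoints is a one-line change. The sketch below uses cardiffnlp/twitter-roberta-base-sentiment-latest as one candidate; that checkpoint is my example, not something from the question, and any Hugging Face model tagged for sentiment analysis plugs in the same way. Be aware that label schemes differ between models, so the .lower() normalization in your UDF may need adjusting:

from transformers import pipeline

# Example alternative checkpoint -- substitute any sentiment model from
# the Hugging Face hub that better matches your domain.
alt_pipeline = pipeline("sentiment-analysis", model="cardiffnlp/twitter-roberta-base-sentiment-latest")

# This checkpoint uses a three-way negative/neutral/positive scheme, so a
# 'neutral' prediction needs handling downstream.
print(alt_pipeline("The Response time was grate")[0])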
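For the second suggestion, even a lightweight normalization pass before the pipeline call can help. The corrections map below is purely illustrative; in practice you would build it from misspellings observed in your data, or swap in a proper spell-checking library:

import re

# Illustrative corrections map -- extend with misspellings from your data.
CORRECTIONS = {"grate": "great"}

def normalize(text):
    # Split into alternating word / non-word runs so punctuation and
    # spacing are preserved, then replace known misspellings.
    parts = re.findall(r"\w+|\W+", text)
    return "".join(CORRECTIONS.get(p.lower(), p) for p in parts)

print(normalize("The Response time was grate"))  # The Response time was great

Calling normalize() inside get_sentiment(), before the stopword filtering, keeps the UDF interface unchanged.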
answered Oct 19 '25 by Parman M. Alizadeh


