
How to apply a function to each row of a column in Python [closed]

I have code like this:

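import os
import numpy as np
from tqdm import tqdm
# Assumed import paths (TensorFlow 2.x Keras); the original post does not show its imports.
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Flatten, Dense

imdb_dir = '/home/yan/PycharmProjects/NLTK_ML/aclImdb'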
train_dir = os.path.join(imdb_dir, 'train') 
labels = []
texts = []
for label_type in ['neg', 'pos']:

    dir_name = os.path.join(train_dir, label_type)
    for fname in tqdm(os.listdir(dir_name)):

        if fname[-4:] == '.txt':
            # Read the text file and put it in the list
            f = open(os.path.join(dir_name, fname))
            texts.append(f.read())
            f.close()
            # Attach the corresponding label
            if label_type == 'neg':
                labels.append(0)
            else:
                labels.append(1)

max_words = 10000 
tokenizer = Tokenizer(num_words=max_words) 
tokenizer.fit_on_texts(texts) 
sequences = tokenizer.texts_to_sequences(texts) 

word_index = tokenizer.word_index
maxlen = 100 # Make all sequences 100 words long
data = pad_sequences(sequences, maxlen=maxlen)

labels = np.asarray(labels)

indices = np.arange(data.shape[0])
np.random.shuffle(indices)
data = data[indices]
labels = labels[indices]

training_samples = 20000 
validation_samples = 5000

x_train = data[:training_samples]
y_train = labels[:training_samples]
x_val = data[training_samples: training_samples + validation_samples]
y_val = labels[training_samples: training_samples + validation_samples]

glove_dir = '/home/yan/PycharmProjects/NLTK_ML' # This is the folder with the dataset


embeddings_index = {}
f = open(os.path.join(glove_dir, 'glove.6B.100d.txt')) 

for line in tqdm(f):
    values = line.split()
    word = values[0] # The first value is the word, the rest are the values of the embedding
    embedding = np.asarray(values[1:], dtype='float32') 
    embeddings_index[word] = embedding 
f.close()

all_embs = np.stack(list(embeddings_index.values()))  # np.stack needs a sequence, not a dict view
emb_mean = all_embs.mean() # Calculate mean
emb_std = all_embs.std() # Calculate standard deviation
emb_mean,emb_std

embedding_dim = 100

word_index = tokenizer.word_index
nb_words = min(max_words, len(word_index)) 

embedding_matrix = np.random.normal(emb_mean, emb_std, (nb_words, embedding_dim))

for word, i in word_index.items():
    if i >= max_words:
        continue
    embedding_vector = embeddings_index.get(word)

    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

model = Sequential()
model.add(Embedding(max_words, embedding_dim, input_length=maxlen, weights = [embedding_matrix], trainable = False))
model.add(Flatten())
model.add(Dense(32, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['acc'])

history = model.fit(x_train, y_train,
                    epochs=10,
                    batch_size=32,
                    validation_data=(x_val, y_val))

def getPositivity(my_text):
    seq = tokenizer.texts_to_sequences([my_text])
    seq = pad_sequences(seq, maxlen=maxlen)
    prediction = model.predict(seq)
    return prediction

df_ticker['prediction'] = df_ticker['text'].apply(lambda text:getPositivity(text))
# print(df_ticker)
df_ticker.to_csv('NLP_ML.csv', index=False)

It gives me [[0.45654]]. I have a dataframe:

                          dt       id              text  compare   timestamp
3  2021-04-12 03:17:37+00:00  gu7tiax     riot 60$ call        1  2021-04-12
4  2021-04-12 13:15:04+00:00  gu91gf2          vix🚀🚀🚀        0  2021-04-12
5  2021-04-12 14:22:04+00:00  gu99dqg  tsal to the moon        0  2021-04-12

I want to apply this function to each row of the text column. I tried it as shown above, but the output contains only the last row, and I want all the rows:

                          dt       id              text  compare   timestamp      prediction
5  2021-04-12 14:22:04+00:00  gu99dqg  tsal to the moon        0  2021-04-12  [[0.29380253]]

I think the issue is in the last few lines of code, where I apply the getPositivity function. I tried to apply the movie-review NLTK model to the new dataframe. That is, I tried to apply machine learning to the whole new dataframe, but it gives me only one line, when I want to output all the lines with the positivity score. I appreciate your help.

YanRemes asked Feb 02 '26 19:02



1 Answer

I tried to run the code, but as I don't have the necessary tokenizer packages installed, I couldn't get it to run. Instead, I ran the simpler function below:

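# df_ticker is the asker's existing pandas DataFrame with a 'text' column.
# No tokenizer import is needed for this stand-in function.
my_text = 'any future for evs?'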

def getPositivity(my_text):
    prediction = my_text + "??"
    return prediction

df_ticker['prediction'] = df_ticker['text'].apply(getPositivity)
df_ticker

and got the output as per the attached image. I believe the issue lies in the tokenizer side, and not in the apply side. Could you try running my code and sharing your output?

(image: dataframe)
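
To check the apply side independently of Keras, here is a minimal sketch with a stand-in for the model. `fake_predict` is hypothetical (it is not part of any library) and only mimics the `(1, 1)` array shape that the `[[0.45654]]` output in the question suggests `model.predict` returns for a single input. The sample rows are copied from the question; `get_positivity` is an illustrative variant of the asker's function that unpacks the single value.

```python
import numpy as np
import pandas as pd

# Sample rows copied from the question's dataframe.
df_ticker = pd.DataFrame(
    {
        "id": ["gu7tiax", "gu91gf2", "gu99dqg"],
        "text": ["riot 60$ call", "vix🚀🚀🚀", "tsal to the moon"],
    },
    index=[3, 4, 5],
)


def fake_predict(text):
    """Hypothetical stand-in for model.predict: returns a (1, 1) array."""
    score = min(len(text) / 100.0, 1.0)  # arbitrary value, not a real sentiment score
    return np.array([[score]], dtype="float32")


def get_positivity(text):
    prediction = fake_predict(text)
    # Unpack the single value so the column holds floats, not nested arrays.
    return float(prediction[0][0])


# apply calls get_positivity once per row of the 'text' column.
df_ticker["prediction"] = df_ticker["text"].apply(get_positivity)
print(df_ticker)
```

If every row gets a value here but not with the real model, the cause is more likely in how df_ticker is built than in apply itself.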

Alternatively, the warning message you shared in the comments suggests you have a line like this somewhere:

df_ticker = df[0:5]

(or any other type of restriction on a dataframe)

If this is the case, pandas can complain when you add a column to a slice of another dataframe (typically with a SettingWithCopyWarning), because it cannot tell whether the slice is a view or a copy.

To amend this, you would replace the example I just mentioned with:

df_ticker = df[0:5].copy()  # copy the slice itself so it no longer refers back to df

etc.
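
A small sketch of that situation, assuming `df` is the full dataframe and `df_ticker` is a slice of it. The data here is made up for illustration. One common way to make the slice independent is to call `.copy()` on it, so that adding a column affects only `df_ticker`.

```python
import pandas as pd

# Made-up data standing in for the full dataframe.
df = pd.DataFrame({"text": ["a", "b", "c", "d", "e", "f"]})

# Copy the slice so df_ticker owns its data.
df_ticker = df[0:5].copy()

# Adding a column now modifies only the copy.
df_ticker["prediction"] = df_ticker["text"].apply(lambda t: t + "??")

print(df_ticker)
print(df.columns.tolist())
```

The original `df` keeps only its `text` column, while `df_ticker` has one prediction per row.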

Gregory

GPRSmith answered Feb 05 '26 08:02



