I am trying to translate from English to Arabic using Fairseq, but the interactive.py script only translates text fragments on the fly. I need it to read an input text file and write the translations to an output text file. I looked at this GitHub issue - https://github.com/pytorch/fairseq/issues/858 - but it doesn't clearly explain how to do this in general. Any suggestions?
fairseq-interactive can read lines from a file with the --input parameter, and it outputs translations to standard output.
So let's say I have this input text file source.txt (where every sentence to translate is on a separate line):
Hello world!
My name is John
You can run:
fairseq-interactive --input=source.txt [all-your-fairseq-parameters] > target.txt
Here, > target.txt means "put all (standard) output generated by fairseq-interactive in the target.txt file". The file will be created if it doesn't exist yet, and overwritten if it does.
With an English to French model it would generate a file target.txt that looks something like this (actual output may vary depending on your model, configuration and Fairseq version):
S-0 Hello world!
W-0 0.080 seconds
H-0 -0.43813419342041016 Bonj@@ our le monde !
D-0 -0.43813419342041016 Bonjour le monde !
P-0 -0.1532 -1.7157 -0.0805 -0.0838 -0.1575
S-1 My name is John
W-1 0.080 seconds
H-1 -0.3272092938423157 Je m' appelle John .
D-1 -0.3272092938423157 Je m'appelle John.
P-1 -0.3580 -0.2207 -0.0398 -0.1649 -1.0216 -0.1583
To keep only the translations (lines starting with D-), you have to filter the content of this file. The fields on each line are separated by tabs, so the translation is the third field. You could use this command, for example:
grep -P "^D-[0-9]+" target.txt | cut -f3 > only_translations.txt
You can also merge everything into a single command:
fairseq-interactive --input=source.txt [all-your-fairseq-parameters] | grep -P "^D-[0-9]+" | cut -f3 > target.txt
(The exact command will depend on the actual structure of the output.)
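If you would rather do the filtering in Python than with grep and cut, here is a minimal sketch. It assumes the layout shown above, with tab-separated fields (D-<id>, score, text); the function name extract_translations and the sample lines are only illustrative and are not part of Fairseq.

```python
import re
from typing import Iterable, List

# Matches lines such as "D-12<TAB>-0.43<TAB>Bonjour le monde !".
# Assumption: fields are tab-separated, as in the sample output above.
_D_LINE = re.compile(r"^D-(\d+)\t[^\t]*\t(.*)$")


def extract_translations(lines: Iterable[str]) -> List[str]:
    """Return the text of the D- lines, ordered by sentence id."""
    found = {}
    for line in lines:
        match = _D_LINE.match(line.rstrip("\n"))
        if match:
            # Keep the first hypothesis seen for each sentence id.
            found.setdefault(int(match.group(1)), match.group(2))
    return [found[i] for i in sorted(found)]


# Small demonstration on lines shaped like the sample output.
sample = [
    "S-0\tHello world!\n",
    "H-0\t-0.438\tBonj@@ our le monde !\n",
    "D-0\t-0.438\tBonjour le monde !\n",
    "S-1\tMy name is John\n",
    "D-1\t-0.327\tJe m'appelle John.\n",
]
for translation in extract_translations(sample):
    print(translation)
```

To use it on a real file, pass an open file object instead of the sample list, for example extract_translations(open("target.txt", encoding="utf-8")).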
Finally, note that you can use --input=- to read input from standard input.
I found fairseq-interactive a bit slow. If you just want input and output files with a pretrained Fairseq model, there is another possible solution (though I'm not sure it will be faster).
Basically, you can load the model in Python and call its translate method:
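from fairseq.models.transformer import TransformerModel

# Load the trained model; the paths below are examples, adjust them to your setup.
trans = TransformerModel.from_pretrained(
    'models/',                             # directory containing the checkpoint
    checkpoint_file='checkpoint_best.pt',  # checkpoint file inside that directory
    data_name_or_path='bin/',              # binarized data directory (with the dictionaries)
    is_gpu=True
).cuda()  # move the model to the GPU

inputs = "Di-mairt Clodh-bhualadh a cheud leabhair,"
print(trans.translate(inputs))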
Following this idea, you can read the input file, translate each line, and write the results to an output file. There may be a better way to translate a file directly, though.
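A rough sketch of that file-to-file loop could look like the following. The helper translate_file and its batch_size parameter are hypothetical names, not part of Fairseq. It takes any callable that maps a list of sentences to a list of translations, and assumes every line of the input file is one non-empty sentence. I believe trans.translate also accepts a list of sentences and returns a list, but check that against your Fairseq version.

```python
from typing import Callable, List


def translate_file(
    src_path: str,
    tgt_path: str,
    translate_batch: Callable[[List[str]], List[str]],
    batch_size: int = 32,
) -> int:
    """Translate src_path line by line into tgt_path.

    Returns the number of lines written.
    """
    # Read all source sentences, one per line.
    with open(src_path, encoding="utf-8") as src:
        sentences = [line.rstrip("\n") for line in src]

    written = 0
    with open(tgt_path, "w", encoding="utf-8") as tgt:
        # Send the sentences to the model in small batches.
        for start in range(0, len(sentences), batch_size):
            batch = sentences[start:start + batch_size]
            for translation in translate_batch(batch):
                tgt.write(translation + "\n")
                written += 1
    return written


# Hypothetical usage with the model loaded above (assumes translate
# accepts a list of sentences in your Fairseq version):
# translate_file("source.txt", "target.txt", trans.translate)
```

Passing the translation function as an argument keeps the file handling separate from the model, so you can try the loop with a dummy function before loading a large checkpoint.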