With docx files, I get application/x-tika-ooxml, but I should get application/vnd.openxmlformats-officedocument.wordprocessingml.document instead.
Here is my method:
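import java.io.IOException;
import java.io.InputStream;
import org.apache.tika.config.TikaConfig;
import org.apache.tika.exception.TikaException;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;

public String retrieveMimeType(InputStream stream) throws IOException, TikaException {
    TikaConfig tikaConfig = new TikaConfig();
    // try-with-resources closes the TikaInputStream even if detection throws;
    // the original never assigned tikaStream, so its finally block closed nothing
    try (TikaInputStream tikaStream = TikaInputStream.get(stream)) {
        MediaType mediaType = tikaConfig.getDetector().detect(tikaStream, new Metadata());
        return mediaType.toString();
    }
}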
And my dependencies:
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-core</artifactId>
    <version>2.1.0</version>
</dependency>
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers</artifactId>
    <version>1.27</version>
</dependency>
I use tika-core together with tika-parsers to get the right MIME type, but it still gives me the wrong one.
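Update your Tika modules. tika-core and the other Tika modules should always be on the same version. You are mixing tika-core 2.1.0 with tika-parsers 1.27; in Tika 2.x the parsers are published as tika-parsers-standard-package: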
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-core</artifactId>
    <version>2.1.0</version>
</dependency>
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers-standard-package</artifactId>
    <version>2.1.0</version>
</dependency>
The newer Microsoft document formats (docx, xlsx, ...) are just zip archives from the outside. Older Tika versions will not look into them by default, which is why, depending on the version, they detect them as either application/zip or application/x-tika-ooxml. You can read more about this here.
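To make the "just zip archives" point concrete, here is a minimal sketch using only the JDK. It is not how Tika implements its detection; the class OoxmlSniffer, its methods, and the entry-name heuristic (word/ for docx, xl/ for xlsx) are assumptions made up for illustration. It only shows that the first bytes say "zip" and that you have to read entries inside the archive to say anything more specific.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper for illustration; not part of Tika.
final class OoxmlSniffer {

    /** True if the data starts with the ZIP local-file-header signature "PK\003\004". */
    static boolean looksLikeZip(byte[] data) {
        return data.length >= 4
                && data[0] == 'P' && data[1] == 'K' && data[2] == 3 && data[3] == 4;
    }

    /** Guesses a media type from entry names; falls back to application/zip. */
    static String sniff(byte[] data) {
        if (!looksLikeZip(data)) {
            return "application/octet-stream";
        }
        // The magic bytes only say "zip"; the entries have to be read to learn more.
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(data))) {
            for (ZipEntry e = zip.getNextEntry(); e != null; e = zip.getNextEntry()) {
                String name = e.getName();
                if (name.startsWith("word/")) {
                    return "application/vnd.openxmlformats-officedocument.wordprocessingml.document";
                }
                if (name.startsWith("xl/")) {
                    return "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet";
                }
            }
            return "application/zip";
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    /** Builds a tiny in-memory ZIP with the given entry names (contents are placeholders). */
    static byte[] zipOf(String... entryNames) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(out)) {
            for (String name : entryNames) {
                zip.putNextEntry(new ZipEntry(name));
                zip.write("<x/>".getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] docxLike = zipOf("[Content_Types].xml", "word/document.xml");
        byte[] plainZip = zipOf("notes.txt");
        System.out.println(sniff(docxLike));
        System.out.println(sniff(plainZip));
    }
}
```

Both sample archives start with the same four bytes, so a detector that stops at the magic number cannot tell them apart. Reading the entries is what costs the extra time.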
Analyzing the archives, however, can hurt performance. Depending on your use case, you can avoid this by determining the MIME type from the file name (see below) or by relying on a MIME type you already have, such as the Content-Type header.
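// fileName is the original name of the file
final Metadata metadata = new Metadata();
metadata.add(TikaCoreProperties.RESOURCE_NAME_KEY, fileName);
// detector is a Tika Detector, for example tikaConfig.getDetector()
MediaType mediaType = detector.detect(stream, metadata);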
In an HTTP request, the file name might also be in the Content-Disposition header.
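As a rough sketch of that idea, the following JDK-only example pulls the file name out of a Content-Disposition value and looks up a MIME type by extension. Everything here is an assumption for illustration: the class UploadNameHints, its methods, and the three-entry extension table are hypothetical, and the regular expression is deliberately simplified (it ignores the RFC 5987 filename* form and escaped quotes).

```java
import java.util.Locale;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of Tika or any HTTP library.
final class UploadNameHints {

    // Matches filename="report.docx" or filename=report.docx (simplified).
    private static final Pattern FILENAME =
            Pattern.compile("(?i)\\bfilename\\s*=\\s*(?:\"([^\"]*)\"|([^;\\s]+))");

    // Deliberately tiny extension table; a real one would be much larger.
    private static final Map<String, String> BY_EXTENSION = Map.of(
            "docx", "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
            "xlsx", "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
            "pdf", "application/pdf");

    /** Extracts the filename parameter from a Content-Disposition value, if present. */
    static Optional<String> fileName(String contentDisposition) {
        if (contentDisposition == null) {
            return Optional.empty();
        }
        Matcher m = FILENAME.matcher(contentDisposition);
        if (!m.find()) {
            return Optional.empty();
        }
        String name = m.group(1) != null ? m.group(1) : m.group(2);
        return name.isEmpty() ? Optional.empty() : Optional.of(name);
    }

    /** Looks up a MIME type by file extension, case-insensitively. */
    static Optional<String> mimeTypeByName(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return Optional.empty();
        }
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return Optional.ofNullable(BY_EXTENSION.get(ext));
    }

    public static void main(String[] args) {
        String header = "attachment; filename=\"report.docx\"";
        fileName(header)
                .flatMap(UploadNameHints::mimeTypeByName)
                .ifPresent(System.out::println);
    }
}
```

With Tika itself you would not need the extension table: the extracted name can go into the Metadata under TikaCoreProperties.RESOURCE_NAME_KEY as in the snippet above. Keep in mind that a client-supplied name is only a hint and can be wrong or forged.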