Separating ASCII text from binary content in a file

I have a file that contains both ASCII text and binary content. I would like to extract the text without having to parse the binary content, which is 180 MB. Can I simply extract the text for further manipulation, and what would be the best way of going about it?

The ASCII is at the very beginning of the file.

Ankur asked Feb 02 '26 14:02


2 Answers

There are four libraries for reading FITS files in Java:

nom.tam.fits classes

A Java FITS library has been developed which provides efficient -- at least for Java -- I/O for FITS images and binary tables. The Java libraries support all basic FITS formats and gzip compressed files. Support for access to data subsets is included and the HIERARCH convention may be used.

eap.fits

Includes an applet and application for viewing and editing FITS files. Also includes a general purpose package for reading and writing FITS data. It can read PGP encrypted files if the optional PGP jar file is available.

jfits

The jfits library supports FITS images and ASCII and binary tables. In-line modification of keywords and data is supported.

STIL

A pure Java general-purpose table I/O library which can read and write FITS binary tables amongst other table formats. It is efficient and can provide fast sequential or random read access to FITS tables much larger than physical memory. There is no support for FITS images.

OscarRyz answered Feb 04 '26 05:02


I am not aware of any Java classes that will read the ASCII characters and ignore the rest, but the easiest thing I can come up with here is to use the strings utility (assuming you are on a Unix-based system).

SYNOPSIS
strings [ - ] [ -a ] [ -o ] [ -t format ] [ -number ] [ -n number ] [--] [file ...]

DESCRIPTION
Strings looks for ASCII strings in a binary file or standard input. Strings is useful for identifying random object files and many other things. A string is any sequence of 4 (the default) or more printing characters ending with a newline or a null. Unless the - flag is given, strings looks in all sections of the object files except the (__TEXT,__text) section. If no files are specified standard input is read.

You could then redirect the output to another file and do whatever you want with it.
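
If the strings utility is not available, something similar can be approximated in Java. The following is only a rough sketch, not a reimplementation of strings: the class name PrintableRuns and the method extract are made up for illustration, and it is simplified in that it keeps any run of at least minLen printable ASCII bytes without requiring a terminating newline or null.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, loosely modelled on what strings does.
public class PrintableRuns {

    // Collects every run of at least minLen printable ASCII bytes (0x20..0x7E).
    // Simplification: a run counts no matter which byte ends it.
    static List<String> extract(InputStream in, int minLen) throws IOException {
        List<String> found = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        int b;
        while ((b = in.read()) != -1) {
            if (b >= 0x20 && b <= 0x7E) {
                run.append((char) b);
            } else {
                // A non-printable byte ends the current run.
                if (run.length() >= minLen) {
                    found.add(run.toString());
                }
                run.setLength(0);
            }
        }
        // Keep a run that reaches the end of the input.
        if (run.length() >= minLen) {
            found.add(run.toString());
        }
        return found;
    }

    public static void main(String[] args) throws IOException {
        // Small in-memory sample standing in for a real file.
        byte[] sample = {'S', 'I', 'M', 'P', 'L', 'E', 0, 1, 'a', 'b', 0, 'E', 'N', 'D', '!'};
        System.out.println(extract(new ByteArrayInputStream(sample), 4)); // prints [SIMPLE, END!]
    }
}
```

For a 180 MB file you would want to pass a BufferedInputStream over the file rather than an in-memory array. Note that binary data will often contain short printable runs by accident, so expect some noise in the result.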

Edit: with the additional information that all the ASCII comes at the beginning, it would be a little easier to extract the text programmatically; still, using strings is faster than writing code.
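
For completeness, here is roughly what the programmatic route could look like when the text sits at the very start of the file. This is a minimal sketch under assumptions the question does not confirm: that the text is plain 7-bit ASCII (printable characters plus tab, CR and LF) and that the first byte of the binary part falls outside that set. The names LeadingAscii and readLeadingAscii are hypothetical.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper: returns the ASCII text at the start of a stream.
public class LeadingAscii {

    // Assumed definition of "text": printable ASCII plus tab, LF and CR.
    static boolean isText(int b) {
        return (b >= 0x20 && b <= 0x7E) || b == '\t' || b == '\n' || b == '\r';
    }

    // Reads until the first non-text byte (or end of stream) and returns
    // everything before it. Nothing after that byte is requested.
    static String readLeadingAscii(InputStream in) throws IOException {
        StringBuilder text = new StringBuilder();
        int b;
        while ((b = in.read()) != -1 && isText(b)) {
            text.append((char) b);
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the path to the mixed text/binary file.
        try (InputStream in = new BufferedInputStream(Files.newInputStream(Paths.get(args[0])))) {
            System.out.print(readLeadingAscii(in));
        }
    }
}
```

Because reading stops at the first non-text byte, the bulk of the binary content is never processed; the buffered stream may read a little past the boundary, but only by one buffer. If the binary section happens to begin with bytes that look like text, this will overshoot, so a known header length or terminator is more reliable when the format provides one.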

danben answered Feb 04 '26 07:02