
Understanding why a Java pattern matches incorrect text inside buffered .java file?

Tags: java, regex, buffer

Full disclosure: still very new to Java.

I'm contributing to the open source xmage project and we agreed to remove the copyright header text from all files in the project in favour of a single LICENSE.txt file inside the project root.

Because my IDE has trouble applying regex patterns across multiple files, I decided to write a script. Here it is:

Note: updated and working script (won't overwrite hidden files or symbolic links)

/*
  Remove the copyright header from all files inside the project.
*/
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.regex.Pattern;

public class RemoveHeaders {

  private static String readEntireFile(String filePath) {
    String content = "";

    try {
      content = new String(Files.readAllBytes(Paths.get(filePath)));
    }
    catch (IOException e) {
      e.printStackTrace();
    }

    return content;
  }

  private static void saveFileToDisk(String filePath, String content) {
    Path path = Paths.get(filePath);

    try {
      // Check before opening: constructing a FileWriter truncates the file immediately.
      if (!Files.isWritable(path) || Files.isSymbolicLink(path) || Files.isHidden(path)) {
        return;
      }
      // try-with-resources closes the writer on exit, which also flushes it.
      try (FileWriter writer = new FileWriter(path.toFile())) {
        writer.write(content);
      }
    } catch (IOException e) {
      e.printStackTrace();
    }
  }

  private static String removeMatchingText(String content, Pattern pattern) {
    return pattern.matcher(content).replaceAll("");
  }

  public static void recursivelyGetFilesAndRemoveHeaders(String path) {
    Pattern copyrightHeader = Pattern.compile("(?i)/\\*(?:\r?\n|\r) ?\\*.*?Copyright[\\S\\s]*?\\*/");
    File currentDirectory = new File(path);
    File[] files = currentDirectory.listFiles();

    if (files == null) {
        return;
    }

    for (File file : files) {
      if (file.isDirectory()) {
        recursivelyGetFilesAndRemoveHeaders(file.getAbsolutePath());
      } else {
        String filePath = file.getAbsolutePath();
        String fileContents = readEntireFile(filePath);
        String updatedContents = removeMatchingText(fileContents, copyrightHeader);
        // Compare by value; != only compares object references.
        if (!fileContents.equals(updatedContents)) {
          saveFileToDisk(filePath, updatedContents);
        }
      }
    }
  }

  public static void main(String[] args) {
    String rootPath = System.getProperty("user.dir");
    recursivelyGetFilesAndRemoveHeaders(rootPath);
  }
}

For clarity and easy reference this is the (updated) regex being used:

"(?i)/\\*(?:\r?\n|\r) ?\\*.*?Copyright[\\S\\s]*?\\*/"

Here is where things get weird: I tested the pattern on this file and to my surprise, not only was the copyright header comment removed, but the pattern matched line 1428 and removed everything up until line 1554.

I thought it might be a buffering problem and so rewrote the readEntireFile function using FileInputStream, making sure to close the stream before returning the result of reading the file - but this produced the same result.

My thinking now is that it is some quirk of the regex handler in Java? Using the exact same pattern in JavaScript on the same file results in the expected match (only the copyright header).
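
One way to separate the regex from the file I/O is to list exactly which spans the pattern matches on an in-memory string. The following is only a sketch: the class name HeaderPatternDemo, the helper matchedSpans and the sample text are made up for illustration, and the sample is not the real file, so it does not reproduce the over-match described above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HeaderPatternDemo {

  // Same pattern as in the question.
  static final Pattern COPYRIGHT_HEADER =
      Pattern.compile("(?i)/\\*(?:\r?\n|\r) ?\\*.*?Copyright[\\S\\s]*?\\*/");

  // Returns every matched span as "start-end" so an unexpected match is easy to spot.
  static List<String> matchedSpans(String content) {
    List<String> spans = new ArrayList<>();
    Matcher matcher = COPYRIGHT_HEADER.matcher(content);
    while (matcher.find()) {
      spans.add(matcher.start() + "-" + matcher.end());
    }
    return spans;
  }

  public static void main(String[] args) {
    // Hypothetical sample input: a header comment followed by a Javadoc comment.
    String sample = "/*\n * Copyright 2010 Example\n */\npackage demo;\n\n"
        + "/**\n * Javadoc\n */\nclass A {}\n";

    System.out.println("matched spans: " + matchedSpans(sample));
    System.out.println(COPYRIGHT_HEADER.matcher(sample).replaceAll(""));
  }
}
```

If the spans reported for the real file cover only the header, the extra removal is more likely to come from how the file is written back than from the pattern itself.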

Just in case here's the system and java details:

openjdk version "1.8.0_171"
OpenJDK Runtime Environment (build 1.8.0_171-8u171-b11-0ubuntu0.16.04.1-b11)
OpenJDK 64-Bit Server VM (build 25.171-b11, mixed mode)

Thanks for taking the time to read through this - been working on this all day and am totally stuck and a little confused. Cheers!

GrayedFox asked Dec 01 '25 06:12


1 Answer

You forgot to flush the output.

Adding this in the saveFileToDisk method will do:

writer = new FileWriter(file);
writer.write(content);
writer.flush();

(For me, it did not remove the copyright header, but at least the unwanted extra removal no longer happened. The former might be related to one of the reasons Wiktor already mentioned in the comments.)

Better yet, since Java 7 the try-with-resources statement is the encouraged way to do this:

try (FileWriter writer = new FileWriter(file)) {
    writer.write(content);
}

This statement will automatically close the FileWriter, which implies flushing.
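
As a self-contained illustration, here is a minimal sketch that writes a temporary file through try-with-resources and reads it back. The class name TryWithResourcesDemo, the helper save and the temp-file prefix are assumptions chosen for this example, not part of the original script.

```java
import java.io.FileWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class TryWithResourcesDemo {

  // Writes content to the file; the writer is closed (and thereby flushed)
  // when the try block exits, even if write() throws.
  static void save(Path path, String content) throws IOException {
    try (FileWriter writer = new FileWriter(path.toFile())) {
      writer.write(content);
    }
  }

  public static void main(String[] args) throws IOException {
    Path tmp = Files.createTempFile("flush-demo", ".txt");
    try {
      String content = "line one\nline two\n";
      save(tmp, content);

      // Both FileWriter and this String constructor use the platform default charset.
      String readBack = new String(Files.readAllBytes(tmp));
      System.out.println("complete content on disk: " + content.equals(readBack));
    } finally {
      Files.deleteIfExists(tmp);
    }
  }
}
```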


For explanation:

Streaming output in Java (both OutputStream and Writer) is, in concept, buffered. This means that any implementation class is allowed to buffer its output internally without explicitly documenting it.

The write methods therefore do not necessarily write to the underlying resource immediately. However, both base classes offer a flush method that does exactly that: it flushes the internal buffer to the underlying resource.
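
The effect can be observed without touching the file system. The sketch below wraps a StringWriter in a BufferedWriter (both chosen here purely for illustration, as are the class and method names) and inspects the underlying writer before and after flush. It assumes the text is shorter than BufferedWriter's internal buffer, otherwise part of it would be written through earlier.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.StringWriter;

public class FlushDemo {

  // Returns what the underlying StringWriter holds before and after flush().
  // Assumes text is short enough to fit in BufferedWriter's internal buffer.
  static String[] beforeAndAfterFlush(String text) throws IOException {
    StringWriter underlying = new StringWriter();
    BufferedWriter buffered = new BufferedWriter(underlying);

    buffered.write(text);                  // stays in the internal buffer for now
    String before = underlying.toString();

    buffered.flush();                      // pushes the buffer to the underlying writer
    String after = underlying.toString();

    return new String[] { before, after };
  }

  public static void main(String[] args) throws IOException {
    String[] result = beforeAndAfterFlush("hello");
    System.out.println("before flush: \"" + result[0] + "\"");
    System.out.println("after flush:  \"" + result[1] + "\"");
  }
}
```

With a file as the underlying resource the same thing happens: whatever is still sitting in the buffer when the program ends without a flush or close never reaches the disk.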

Izruo answered Dec 03 '25 22:12


