Exploring More Efficient File Reading and Writing in Java

2024-04-02 19:04:59 独家记忆


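Background

Recently I have been digging into why Kafka is so fast, and what the secret behind that speed is.

Driven by curiosity, I peeled it back layer by layer, like an onion, to uncover the real reasons Kafka can outperform other message queues. Having done so, I have to say: Kafka is the GOAT.

I learned how it uses sequential disk writes

I discovered its use of sparse indexes

I came to appreciate the power of its zero-copy technique

I caught the scent of its masterstroke, mmap (memory-mapped files)

......

If mmap is so remarkable, could it also deliver very fast file compression?

Curious again, I decided to test this in practice (otherwise our knowledge stays purely theoretical).

Coding is our core skill, and curiosity is a tool we should never lose.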

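A BA once told me about his experience: after moving from DEV to BA, his coding grew rusty, so he forced himself to pick up one small requirement every iteration to stay sharp.

A senior colleague once told me: even if you grow into an architect or beyond, you must not lose the skill of coding. Otherwise you will find yourself in the awkward position of being called a "requirements translator".

......

This is not chicken soup for the soul; it is heartfelt advice. I have come to understand deeply that coding is a job where you keep learning for as long as you live.

Watching many excellent colleagues leave, and talking with them, has only deepened this feeling.

So remember: once you learn something, make the effort to apply it at least once, so that it sticks. And when learning, do not settle for a superficial understanding: besides knowing what a technique does, know why it is done that way.

......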

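Scenario Analysis

Scenario 1: compressing a single small file

1. Source file: 63.7M, CSV, single file
2. Techniques compared: the version circulating online, buffers, channels, mmap
3. Results: listed under each approach

Approach 1: the version circulating online (an urban legend that turns out to be a rose with thorns)

Xiao Wang has just joined the team. One day he is suddenly asked to compress a file, something he has never done before. What to do? A web search turns up the following method.

Result (alarmingly slow)

zipMethod=withoutBuffer

costTime=327000ms

The code: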

public void zipFileWithoutBuffer(String outFile){
    long beginTime = System.currentTimeMillis();
    File zipFile = new File(outFile);
    File inputFile = new File(INPUT_FILE);
    try(ZipOutputStream zipOutputStream = new ZipOutputStream(new FileOutputStream(zipFile))) {
        try (InputStream inputStream = new FileInputStream(inputFile)){
            zipOutputStream.putNextEntry(new ZipEntry(inputFile.getName()));
            int temp;
            while ((temp = inputStream.read()) != -1){
                zipOutputStream.write(temp);
            }
        }
        printResult(beginTime,"withoutBuffer");
    } catch (Exception e) {
        e.printStackTrace();
        System.out.println("error" + e.getMessage());
    } 
}
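
The methods in this article refer to INPUT_FILE and printResult, neither of which is shown. A minimal sketch of what they might look like follows; the class name ZipBenchmarkSupport, the path "data/input.csv", and the helper formatResult are all assumptions made for illustration, not the original code.

```java
// Hypothetical scaffolding: the article does not show these members.
final class ZipBenchmarkSupport {
    // Assumed location of the CSV under test; adjust to your environment.
    static final String INPUT_FILE = "data/input.csv";

    // Builds the two lines seen in the article's output (zipMethod=..., costTime=...ms).
    static String formatResult(long beginTime, long endTime, String method) {
        return "zipMethod=" + method + System.lineSeparator()
                + "costTime=" + (endTime - beginTime) + "ms";
    }

    // Prints the time elapsed since beginTime for the given method name.
    static void printResult(long beginTime, String method) {
        System.out.println(formatResult(beginTime, System.currentTimeMillis(), method));
    }
}
```

Note that a single wall-clock measurement like this includes JVM warm-up and disk cache effects, so it is worth running each method several times before comparing numbers.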

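Approach 2: using buffers

Xiao Wang was pleased: he committed the code and moved the requirement to "ready for acceptance".

Xiao Hua, the team's senior engineer, was baffled when she reviewed the code: "Did you find this online? It will be very slow. Look into it some more."

So Xiao Wang tried a different approach, using buffered streams such as BufferedOutputStream.

Result (much faster)

zipMethod=withBuffer

costTime=5170ms

The code: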

public void zipFileWithBuffer(String outFile){
    long beginTime = System.currentTimeMillis();
    File zipFile = new File(outFile);
    File inputFile = new File(INPUT_FILE);
    try(ZipOutputStream zipOutputStream = new ZipOutputStream(new FileOutputStream(zipFile));
        BufferedOutputStream bufferedOutputStream = new BufferedOutputStream(zipOutputStream)) {
        try (BufferedInputStream bufferedInputStream = new BufferedInputStream(new FileInputStream(inputFile))){
            zipOutputStream.putNextEntry(new ZipEntry(inputFile.getName()));
            int temp;
            while ((temp = bufferedInputStream.read()) != -1){
                bufferedOutputStream.write(temp);
            }
        }
        printResult(beginTime,"withBuffer");
    } catch (Exception e) {
        e.printStackTrace();
        System.out.println("error" + e.getMessage());
    } 
}
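
A common variation on the buffered approach, which was not part of the measurements above, is to copy in byte[] chunks instead of one byte at a time. The sketch below assumes an 8 KB chunk (the same size as the default BufferedInputStream buffer); the class and method names are made up for illustration.

```java
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

final class ChunkedZip {
    // Assumed chunk size; tune it for your workload.
    private static final int CHUNK_SIZE = 8 * 1024;

    static void zipFileWithByteArray(File inputFile, File zipFile) throws IOException {
        try (ZipOutputStream zipOutputStream = new ZipOutputStream(
                     new BufferedOutputStream(new FileOutputStream(zipFile)));
             InputStream inputStream = new FileInputStream(inputFile)) {
            zipOutputStream.putNextEntry(new ZipEntry(inputFile.getName()));
            byte[] chunk = new byte[CHUNK_SIZE];
            int bytesRead;
            // Each read fills up to CHUNK_SIZE bytes, so far fewer calls are made.
            while ((bytesRead = inputStream.read(chunk)) != -1) {
                zipOutputStream.write(chunk, 0, bytesRead);
            }
            zipOutputStream.closeEntry();
        }
    }
}
```

Because the array itself acts as the buffer, the input side needs no BufferedInputStream here.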

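Approach 3: using channels

Nervously, Xiao Wang called everyone together for another code review.

Xiao Hua: "The speed requirement isn't that strict. This is good enough; go ahead and commit."

However, while studying Kafka recently I had come across NIO, and I knew it offers a facility called Channel.

Result (very fast)

zipMethod=withChannel

costTime=1642ms

The code: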

public void zipFileWithChannel(String outFile){
    long beginTime = System.currentTimeMillis();
    File zipFile = new File(outFile);
    File inputFile = new File(INPUT_FILE);
    try(ZipOutputStream zipOutputStream = new ZipOutputStream(new FileOutputStream(zipFile));
        WritableByteChannel writableByteChannel = Channels.newChannel(zipOutputStream)) {
        try (FileChannel fileChannel = new FileInputStream(inputFile).getChannel()){
            zipOutputStream.putNextEntry(new ZipEntry(inputFile.getName()));
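            // transferTo may move fewer bytes than requested, so loop until the whole file is sent.
            long size = fileChannel.size();
            long position = 0;
            while (position < size) {
                position += fileChannel.transferTo(position, size - position, writableByteChannel);
            }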
        }
        printResult(beginTime,"withChannel");
    } catch (Exception e) {
        e.printStackTrace();
        System.out.println("error" + e.getMessage());
    } 
}

Approach 4: using mmap

While studying Kafka I learned that, besides Channel, NIO offers another facility: mmap (memory-mapped files).

Result (very fast)

zipMethod=withMmp

costTime=1554ms

The code:

public void zipFileWithMmp(String outFile){
    long beginTime = System.currentTimeMillis();
    File zipFile = new File(outFile);
    File inputFile = new File(INPUT_FILE);
    try(ZipOutputStream zipOutputStream = new ZipOutputStream(new FileOutputStream(zipFile));
        WritableByteChannel writableByteChannel = Channels.newChannel(zipOutputStream);
        // Opening the source channel here ensures the file handle is closed afterwards.
        FileChannel fileChannel = new RandomAccessFile(INPUT_FILE,"r").getChannel()) {
        zipOutputStream.putNextEntry(new ZipEntry(inputFile.getName()));
        // A single mapping cannot exceed Integer.MAX_VALUE bytes (about 2 GB).
        MappedByteBuffer mappedByteBuffer = fileChannel
                .map(FileChannel.MapMode.READ_ONLY,0,inputFile.length());
        writableByteChannel.write(mappedByteBuffer);
        printResult(beginTime,"withMmp");
    } catch (Exception e) {
        e.printStackTrace();
        System.out.println("error" + e.getMessage());
    } 
}
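
FileChannel.map cannot map more than Integer.MAX_VALUE bytes in one call, so the method above only works for files under about 2 GB. One possible way around this, sketched here under that assumption and not measured in this article, is to map the file in fixed-size windows. The class name, method name, and windowSize parameter are hypothetical.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

final class WindowedMmapZip {
    static void zipWithMappedWindows(File inputFile, File zipFile, long windowSize) throws IOException {
        if (windowSize <= 0 || windowSize > Integer.MAX_VALUE) {
            throw new IllegalArgumentException("windowSize must be in 1..Integer.MAX_VALUE");
        }
        try (ZipOutputStream zipOutputStream = new ZipOutputStream(new FileOutputStream(zipFile));
             WritableByteChannel target = Channels.newChannel(zipOutputStream);
             FileChannel source = new RandomAccessFile(inputFile, "r").getChannel()) {
            zipOutputStream.putNextEntry(new ZipEntry(inputFile.getName()));
            long size = source.size();
            // Map one window at a time and push it through the zip stream.
            for (long position = 0; position < size; position += windowSize) {
                long length = Math.min(windowSize, size - position);
                MappedByteBuffer window = source.map(FileChannel.MapMode.READ_ONLY, position, length);
                while (window.hasRemaining()) {
                    target.write(window);
                }
            }
            zipOutputStream.closeEntry();
        }
    }
}
```

A call such as zipWithMappedWindows(new File("big.csv"), new File("big.zip"), 256L * 1024 * 1024) would map 256 MB at a time; the file names and window size in that call are only examples.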

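Scenario 2: compressing a single large file

1. Source file: 585M, CSV, single file
2. Techniques compared: buffers, channels, mmap
3. Results:

Buffers	Channels	mmap
costTime=46034ms	costTime=11885ms	costTime=10810ms

Scenario 3: compressing multiple large files

1. Source files: 585M each, CSV, 5 files
2. Techniques compared: buffers, channels, mmap
3. Results:

Buffers	Channels	mmap
costTime=173122ms	costTime=53982ms	costTime=50543ms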
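
The code for the multi-file scenario is not shown. A minimal sketch of how the channel approach might extend to several files follows, assuming all files go into a single archive and have distinct names; MultiFileZip and zipFilesWithChannel are hypothetical names.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

final class MultiFileZip {
    static void zipFilesWithChannel(List<File> inputFiles, File zipFile) throws IOException {
        try (ZipOutputStream zipOutputStream = new ZipOutputStream(new FileOutputStream(zipFile));
             WritableByteChannel target = Channels.newChannel(zipOutputStream)) {
            for (File inputFile : inputFiles) {
                try (FileChannel source = new FileInputStream(inputFile).getChannel()) {
                    // One zip entry per input file, named after the file.
                    zipOutputStream.putNextEntry(new ZipEntry(inputFile.getName()));
                    long size = source.size();
                    long position = 0;
                    // transferTo may move fewer bytes than requested, so loop.
                    while (position < size) {
                        position += source.transferTo(position, size - position, target);
                    }
                    zipOutputStream.closeEntry();
                }
            }
        }
    }
}
```

The entries are written one after another because a ZipOutputStream can only have one open entry at a time.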

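Analysis and Conclusions

1. Comparison table

Scenario	Online version	Buffers	Channels	mmap
Scenario 1: single small file (63.7M)	327000ms	5170ms	1642ms	1554ms
Scenario 2: single large file (585M)	--	46034ms	11885ms	10810ms
Scenario 3: multiple large files (5 x 585M)	--	173122ms	53982ms	50543ms
Scenario 4: single 100K file	--	28ms	26ms	24ms
Scenario 5: single 5K file	--	18ms	20ms	23ms
Scenario 6: single 1K file	--	15ms	21ms	24ms

Conclusions:

1) The version circulating online is not usable; it is by far the slowest.
2) Buffers perform tolerably, but they fall well behind the two NIO techniques (channels and mmap). The gap is most visible for files of around 500M, in both the single-file and multi-file cases, where buffers take about 3 to 4 times as long.
3) Channels and mmap perform about the same: there is no meaningful difference for small files, and only a few seconds for large ones.
4) For files larger than 10K, channels or mmap are recommended.
5) For files smaller than 10K, buffers are recommended (they outperformed both NIO techniques in these tests).
6) If your team relies on a third-party API for compression, check its source to see whether it uses NIO. If it does not, consider switching to the approaches in this article.

More generally, NIO is worth trying for any file operation. Measure it in your own case; in theory the gains should be substantial.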
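
Conclusions 4) and 5) amount to a size-based rule, which could be sketched as below. The 10K threshold comes from the conclusions above; the class, enum, and method names are hypothetical, and sending a file of exactly 10K to the channel approach is an assumption, since that boundary was not measured.

```java
final class ZipStrategyChooser {
    // Threshold taken from the conclusions above: 10K.
    static final long SMALL_FILE_THRESHOLD_BYTES = 10 * 1024;

    enum Strategy { BUFFER, CHANNEL }

    // Picks buffers for files under the threshold, channels (or mmap) otherwise.
    static Strategy choose(long fileSizeBytes) {
        return fileSizeBytes < SMALL_FILE_THRESHOLD_BYTES ? Strategy.BUFFER : Strategy.CHANNEL;
    }
}
```

For example, choose(new File(path).length()) would select a strategy for a file at a given path before compressing it.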

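Behind the Scenes

1. The version circulating online

The read method of FileInputStream is as follows:

public int read() throws IOException {
    return read0();
}

private native int read0() throws IOException;

This calls a native method that interacts with the operating system to read data from disk. The method is invoked once for every byte read, and each such interaction is expensive, so for a large file the overhead is enormous.

2. Using buffers

The read method of BufferedInputStream is as follows: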

public synchronized int read() throws IOException {
    if (pos >= count) {
        fill();
        if (pos >= count)
            return -1;
    }
    return getBufIfOpen()[pos++] & 0xff;
}

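Although this still returns one byte at a time, it does not go to the underlying system on every call. Instead, a single underlying call reads up to buf.length bytes into the buf array, and subsequent reads are served one byte at a time from buf, which cuts the overhead of frequent low-level calls.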
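
To make the difference concrete, the sketch below counts how many read calls reach the underlying stream with and without a BufferedInputStream in between. It uses an in-memory ByteArrayInputStream as a stand-in for the file, so it only counts calls and says nothing about real disk timings; ReadCallCounter and CountingInputStream are names invented for this example.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

final class ReadCallCounter {
    // Wraps a stream and counts every read call that reaches it.
    static final class CountingInputStream extends FilterInputStream {
        long calls;

        CountingInputStream(InputStream in) { super(in); }

        @Override public int read() throws IOException {
            calls++;
            return super.read();
        }

        @Override public int read(byte[] b, int off, int len) throws IOException {
            calls++;
            return super.read(b, off, len);
        }
    }

    // Reads byte by byte straight from the source, as in approach 1.
    static long countUnbuffered(byte[] data) throws IOException {
        CountingInputStream source = new CountingInputStream(new ByteArrayInputStream(data));
        while (source.read() != -1) { /* discard */ }
        return source.calls;
    }

    // Reads byte by byte through a BufferedInputStream, as in approach 2.
    static long countBuffered(byte[] data) throws IOException {
        CountingInputStream source = new CountingInputStream(new ByteArrayInputStream(data));
        try (InputStream buffered = new BufferedInputStream(source)) {
            while (buffered.read() != -1) { /* discard */ }
        }
        return source.calls;
    }
}
```

Without a buffer, the source sees one call per byte plus the final end-of-stream call; with the default buffer, it sees roughly one call per 8 KB of data.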

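3. Using channels

When copying large files, FileChannel is said to be nearly one third faster than BufferedInputStream/BufferedOutputStream, which shows its speed advantage. The structure of NIO channels is closer to the way the operating system itself performs I/O, so they are noticeably faster than traditional IO.

The operating system can transfer bytes directly from the file system cache to the target Channel without an actual copy step (copy here means moving data from kernel space to user space). That direct path is available when the target is itself a file or socket channel; when the target wraps a ZipOutputStream, as in the methods above, the bytes still pass through the compressor in user space.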
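
For comparison, here is a plain file-to-file copy, the case where transferTo can hand the work to the operating system. It is a minimal sketch with hypothetical names (ChannelCopy, copy), not code from the measurements above.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class ChannelCopy {
    // Copies sourcePath to targetPath and returns the number of bytes copied.
    static long copy(Path sourcePath, Path targetPath) throws IOException {
        try (FileChannel source = FileChannel.open(sourcePath, StandardOpenOption.READ);
             FileChannel target = FileChannel.open(targetPath, StandardOpenOption.CREATE,
                     StandardOpenOption.WRITE, StandardOpenOption.TRUNCATE_EXISTING)) {
            long size = source.size();
            long position = 0;
            // transferTo may move fewer bytes than requested, so loop until done.
            while (position < size) {
                position += source.transferTo(position, size - position, target);
            }
            return position;
        }
    }
}
```

Whether the transfer actually avoids user space depends on the platform and file system, so it is worth measuring on the target machine.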