How do I manage my data?

Determining Usage

Each user is assigned a disk space quota, or allotment, when their account is created to ensure that there is sufficient disk space for all our users' needs. If you need additional space, please submit your request to manager@stat.berkeley.edu.

The commands below show how much disk space you are consuming. For additional information, consult the respective man pages.

bigfiles        Recursively searches a directory for big files.
du -h           Recursive summary of disk use. Gives disk use for each directory and all its subdirectories.
ls -l           Lists the size of all files in the current directory.
find ~ -ls      Most information: lists the size of all files in all directories and subdirectories.
quota           Reports your disk quota and current disk use.
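
As a rough sketch of how these commands can be combined, the function below lists the ten largest items directly under a directory. The name 'largest_items' is hypothetical and is not a command installed on our systems. Because of how the shell expands '*', hidden files (names starting with a dot) are not included.

```shell
# largest_items is a hypothetical helper, not a system command.
# Usage: largest_items [directory]   (defaults to your home directory)
largest_items() {
    dir="${1:-$HOME}"
    # du -sk: one summary line per item, with sizes in kilobytes.
    # sort -rn: largest first; head -n 10: keep the top ten.
    du -sk "$dir"/* 2>/dev/null | sort -rn | head -n 10
}
```

Running 'largest_items' with no argument is a quick way to see which directories to look at first if 'quota' reports that you are near your limit.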

 

Temporary Disk Space

There are several places where users can temporarily store data. These areas are often larger than home directories, but they are not backed up and may be removed without notice.

/tmp

Files put in the /tmp directory are accessible only on the machine on which they were created and are automatically wiped every time the computer is rebooted. Files may also be deleted with little or no warning if resources become scarce. However, if you need a large amount of disk space for a short time, /tmp provides a solution that does not require any staff intervention. Remember that there is no guarantee that files stored in /tmp are safe. Do not use /tmp for data that is difficult or expensive to re-create. No special permissions are required to use /tmp. To reference your files in /tmp, use '/tmp' as the prefix to the name of the file, for example '/tmp/myfile'.

The amount of storage a user can take up is limited only by the physical size of the partition. However, if /tmp is full, editors, compilers, and many other programs will fail or behave erratically. To find out how much space is available in /tmp on your system, type 'df -k /tmp'. Do not use /tmp if less than 30% of the space is available. Remove files when they are no longer needed.
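
The 30% guideline can be checked with a short script. The following is a minimal sketch, not a tool we provide: the function name 'tmp_free_percent' is hypothetical, and it assumes the POSIX output format of 'df -P', in which the fifth column is the percentage of space in use.

```shell
# tmp_free_percent is a hypothetical helper name.
# Prints the approximate percentage of free space on the file system
# that holds the given directory (default: /tmp).
tmp_free_percent() {
    # df -P prints one line per file system; column 5 is percent used, e.g. "42%".
    df -Pk "${1:-/tmp}" | awk 'NR == 2 { sub(/%/, "", $5); print 100 - $5 }'
}

free=$(tmp_free_percent /tmp)
if [ "$free" -lt 30 ]; then
    echo "Only ${free}% of /tmp is free; store large files elsewhere."
else
    echo "${free}% of /tmp is free."
fi
```

The figure is approximate, since 'df' rounds the percentage and may count space reserved for the system as used.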

/var/tmp, /Users/Shared

The /var/tmp directory functions similarly to /tmp; however, files are not automatically removed when the machine is rebooted. This directory is erased whenever the workstation needs to be reinstalled or reconfigured. Otherwise, the same policies that apply to /tmp apply to /var/tmp.

The /Users/Shared directory functions identically to /var/tmp, except it is only found on our Macintosh computers.

/var/tmp/scratch

The /var/tmp/scratch directory exists on some workstations that have secondary disks. This directory functions similarly to /var/tmp; however, it is not erased when the computer is reinstalled or reconfigured.

/scratch

Directories under /scratch exist on the file server and can be accessed from every machine. Unlike users' home directories, they are not backed up, but they can usually accommodate larger data files. If space becomes limited, we may automatically compress files or (time permitting) ask users to either remove or archive files that are no longer needed in order to make room for other users. Files that are not actively being used should be compressed if possible.
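
One way to compress files that are not actively being used is sketched below. The function name 'compress_idle' and the default of 30 days are assumptions for illustration, not a policy or an installed command. The sketch relies on each file's last-access time, which may not be kept up to date on every file system.

```shell
# compress_idle is a hypothetical helper; the 30-day default is an assumption.
# Usage: compress_idle directory [days]
compress_idle() {
    dir="$1"
    days="${2:-30}"
    # -type f: regular files only.
    # -atime +N: last accessed more than N days ago.
    # ! -name '*.gz': skip files that are already gzip-compressed.
    find "$dir" -type f -atime +"$days" ! -name '*.gz' -exec gzip {} +
}
```

For example, 'compress_idle mydir 60' would compress files under 'mydir' (a placeholder name) that have not been read in the last 60 days.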

Send email to manager@stat.berkeley.edu to request space under /scratch.

Compression

Infrequently accessed files may be compressed to save disk space.

gunzip file.gz                         Uncompresses file.gz to file
gzip file                              Compresses file to file.gz
tar xzf file.tar.gz                    Uncompresses file.tar.gz and extracts its contents
tar czf file.tar.gz file1 [file2...]   Compresses one or more files into file.tar.gz
unzip file.zip                         Uncompresses file.zip
zip file.zip file1 [file2...]          Compresses one or more files to file.zip
uncompress x y ...                     Uncompresses x.Z, y.Z, ... to x, y, ...
compress x y ...                       Compresses x, y, ... to x.Z, y.Z, ...
zcat x.Z y.Z ...                       Prints the uncompressed contents of the file(s) to the terminal

See the UNIX manual pages for the above programs by using the 'man' command, for example 'man gzip'.
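
As an illustration of the tar commands above, the following sketch archives a directory, checks the archive, and restores it. It works in a throwaway directory created with 'mktemp', and the names 'project' and 'results.txt' are made up for the example.

```shell
# Work in a throwaway directory so none of your own files are touched.
work=$(mktemp -d)
cd "$work" || exit 1
mkdir project
echo "some results" > project/results.txt

tar czf project.tar.gz project     # bundle and compress the directory
tar tzf project.tar.gz             # list the archive contents without extracting
rm -rf project                     # remove the original only after checking the listing
tar xzf project.tar.gz             # restore project/ from the archive
cat project/results.txt
```

Listing the archive with 'tar tzf' before deleting the originals is a cheap check that the archive was written correctly.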

Deleting Files

To remove files and directories, type:

rm file         Removes 'file'. Asks for confirmation first if you do not have write permission on it.
rm -f file      Removes 'file' without asking, if you have write permission in the directory containing it.
rmdir dir       Removes the empty directory 'dir'.
rm -rf dir      Recursively removes 'dir', including every file and subdirectory. Use with caution.

Some applications leave behind files that may be removed without adversely affecting the program.

  • Web browser cache files can be removed. You can clear your browser disk cache, as well as instruct the browser to set aside less disk space to use for its cache, in the program's Preference window, usually under 'Advanced'.
  • PostScript, .aux, and .log files produced by LaTeX can be recreated as necessary from the corresponding .tex file.
  • Compiled object files, often with the suffix .o, produced by C, C++, or Fortran compilers, can usually be deleted. If need be, they can be recreated from the original programs that produced them.
  • Compilers tend to create many large binary files, such as '.o' and '.out' files. '.out' files in particular can be quite large. If you have such files that have been unused for several days and that you don't intend to use for several more days, they should be removed. (They can easily be recreated if you need them; see 'help learn_fortran' for more discussion.) The size of '.out' files that you do need can be reduced somewhat by stripping them. Type:

    strip a.out     Strips an already existing 'a.out'. 
    f77 -s ...      Creates a pre-stripped '.out' file when using f77. 
    cc -s ...       Similarly for cc.
  • When programs crash, they sometimes report 'Core dumped' indicating that a large file called 'core' has been created in the program's current working directory. A user may disable core dumps by adding "ulimit -c 0" to ~/.bashrc.
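
To find cleanup candidates like those in the list above, a 'find' command can be used. The following is a sketch only: the function name 'list_regenerable', the file name patterns, and the 7-day threshold are assumptions, and a pattern such as '*.log' may also match files you want to keep. The function prints candidates and deletes nothing.

```shell
# list_regenerable is a hypothetical helper; the patterns and the 7-day
# threshold are assumptions, so adjust them to your own work.
# It only lists matching files; it does not delete anything.
list_regenerable() {
    dir="${1:-$HOME}"
    # -mtime +7: last modified more than 7 days ago.
    find "$dir" -type f -mtime +7 \
        \( -name '*.o' -o -name '*.aux' -o -name '*.log' \
           -o -name 'a.out' -o -name 'core' \) \
        -exec ls -l {} +
}
```

Review the listing before removing anything with 'rm', and check that the source file (.tex, .c, .f, and so on) needed to regenerate each file still exists.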

Best Practices

The Library and Research IT have a Research Data Management (RDM) program that "helps researchers navigate the complex landscape of managing data before, during, and after their research."