
Wednesday, October 2, 2013

Git -- Cheat Sheet

Here is a cheat sheet for git.

1. Create
git init                    # create a local repository
git clone <url>             # clone a repository from <url>

2. Commit
git commit -m "commit message"

3. Browse
git log                     # history of changes
git status                  # files changed in the working directory
git diff                    # diff between the working directory and the index
git diff HEAD               # diff between the working directory and the most recent commit
git diff --cached           # diff between the index and the most recent commit
git show <object>           # show an object
gitk                        # git repository browser (GUI)

4. Stage
git add <file>              # add a file to the index
git reset HEAD <file>       # unstage a staged file

5. Undo
git commit -a --amend       # fix the last commit
git reset --hard <commit>   # discard all changes and reset to the commit
git revert HEAD             # revert the last commit
git revert <commit>         # revert a specific commit
git checkout -- <file>      # discard changes to a modified file

6. Branch
git branch <new_branch>     # create a branch named new_branch based on HEAD
git branch -d <old_branch>  # delete the branch named old_branch
git checkout <branch>       # switch to the branch
git checkout -b <branch>    # create a new branch, then switch to it
git merge <branch>          # merge the specified branch into HEAD

7. Update
git fetch                   # download the latest changes from origin
git pull                    # fetch from origin and integrate (merge)

8. Publish
git push                    # update origin
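To see how these commands fit together, here is a hypothetical end-to-end session (the repository name, file name, and identity are made up; it assumes git is installed and runs in a throwaway temp directory):

```shell
set -e
cd "$(mktemp -d)"
git init -q repo && cd repo                 # 1. create a local repository
git config user.email "you@example.com"     # identity needed for commits
git config user.name  "Your Name"
echo "hello" > readme.txt
git add readme.txt                          # 4. stage the new file
git commit -q -m "first commit"             # 2. record the snapshot
git checkout -q -b feature                  # 6. create a branch and switch to it
echo "more" >> readme.txt
git commit -q -am "update readme"           # stage tracked changes and commit
git checkout -q -                           # switch back to the previous branch
git merge -q feature                        # 6. merge feature into HEAD
git log --oneline                           # 3. two commits in the history
```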

Monday, September 30, 2013

Git -- basics

Git is a distributed version control system (DVCS) designed to handle everything from small to very large projects. A DVCS has many advantages over a traditional centralized VCS, because users have the entire history of the project on their local disks (repository). Two of them are:
  • It allows users to work productively without network connection, and
  • it makes most operations much faster, since most operations are local.
Git employs snapshots instead of file diffs to track files. Every time we make a commit, Git basically takes a picture of the working directory at that moment and stores that snapshot. In addition to the working directory, Git has two main data structures:
  • index (also called staging area or cache) -- a mutable file that caches information about the working directory, and 
  • object database -- an immutable, append-only storage that stores files and meta-data for our project.
The object database contains four types of objects:
  • blob (binary large object) -- each file that we add to the repository is turned into a blob object.
  • tree -- each directory is turned into a tree object.
  • commit -- a commit is a snapshot of the working directory at a point in time.
  • tag -- a container that holds a reference to another object.
Each object is identified by the SHA-1 hash of its contents. In general, git stores each object in a directory matching the first two characters of its hash; the file name used is the rest of the hash for that object. The commands git cat-file and git show allow us to view the contents of objects.
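A small sketch of the object store (the file name greeting.txt is made up; assumes git is installed and runs in a throwaway temp directory):

```shell
set -e
cd "$(mktemp -d)"
git init -q
git config user.email "you@example.com"
git config user.name  "Your Name"
echo "hello" > greeting.txt
git add greeting.txt
git commit -q -m "add greeting"
# each object lives at .git/objects/<first 2 hash chars>/<remaining 38 chars>
find .git/objects -type f                 # the blob, tree, and commit objects
blob=$(git rev-parse HEAD:greeting.txt)   # SHA-1 of the blob for greeting.txt
git cat-file -t "$blob"                   # object type: blob
git cat-file -p "$blob"                   # object content: hello
```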

The index (staging area) serves as a bridge between the working directory and the object database. Information stored in the index will go into the object database at our next commit. The following figure shows the normal work flow among them.

 working            index              object
 directory          (staging area)     database

     <----------------checkout-----------------------|
     |------add------->
                       |------------commit---------->


Each file in our working directory can be in one of two states: tracked or untracked. Tracked files are those that are in the last snapshot or in the index (staging area); they can be further classified as unmodified, modified, or staged. A staged file is one that has been modified or created and then added to the index. Untracked files are everything else. The lifecycle of files in the working directory is illustrated in the following figure:

 untracked       unmodified       modified       staged

     |----------------------add----------------------->
     <---------------------reset----------------------|

                      |---edit--->
                                    |------add-------->
                      <----------commit---------------|


The command git ls-files lists tracked files, whereas git ls-files -o (option -o) lists untracked files.
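The lifecycle above can be observed directly with git status --short (the file name note.txt is made up; assumes git is installed and runs in a throwaway temp directory):

```shell
set -e
cd "$(mktemp -d)"
git init -q
git config user.email "you@example.com"
git config user.name  "Your Name"
echo "one" > note.txt
git status --short note.txt    # "?? note.txt"  -> untracked
git add note.txt
git status --short note.txt    # "A  note.txt"  -> staged
git commit -q -m "add note"
git status --short note.txt    # (no output)    -> tracked and unmodified
echo "two" >> note.txt
git status --short note.txt    # " M note.txt"  -> modified
git ls-files                   # tracked files: note.txt
git ls-files -o                # untracked files: none here
```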

Sunday, August 25, 2013

Automated Remote Backups with Rdiff-backup

One subtle feature of rdiff-backup is that it allows users to make remote backups over the Internet using SSH, which makes remote backups very secure, since the transferred data is encrypted.

One problem is that SSH requires a password for logging in, which is not convenient if we want to run rdiff-backup as a cron job. Here we show how to initiate rdiff-backup from a central backup server and pull data from a farm of hosts to be backed up. For security reasons, the central server uses a non-root user account (rdiffbk) to perform backups, whereas the root account is used on each host being backed up. Though root accounts are used on the hosts being backed up, they are protected by SSH public-key authentication with the forced-commands-only option.

For convenience, I'll call the central backup server canine and the three hosts to be backed up beagle, shepherd and terrier. For brevity, only the work on canine and beagle will be shown.

Here is the procedure for backup server canine:
  1. generate one passphrase-free SSH key pair for each host being backed up,
  2. move corresponding ssh key to each host,
  3. create SSH configuration file, and
  4. create a cron job file
Step 1:  generate one passphrase-free SSH key pair for each host being backed up

To generate RSA type pair for host beagle, we issue

ssh-keygen -t rsa -f id_beagle-backup


where the private key will be saved in the file id_beagle-backup and the public key in id_beagle-backup.pub.

Step 2: move corresponding ssh key to each host

To move id_beagle-backup.pub to host beagle, we may use any preferred method (for example, ftp, sftp, or ssh-copy-id), since the public key is not sensitive. The other hosts can be handled similarly.

Step 3: create SSH configuration file

To define how to connect to host beagle with backup key, we place the following lines into file ~rdiffbk/.ssh/config. Other hosts need to be configured similarly.

host beagle-backup
    hostname beagle
    user root
    identityfile ~rdiffbk/.ssh/id_beagle-backup
    protocol 2


Step 4: create a cron job file

The following cron job file automates the remote backups daily at 2:00 am, 2:10 am, and 2:20 am, respectively.

0 2 * * * rdiff-backup beagle-backup::/remote_dir beagle/remote_dir
10 2 * * * rdiff-backup shepherd-backup::/remote_dir shepherd/remote_dir
20 2 * * * rdiff-backup terrier-backup::/remote_dir terrier/remote_dir




By default, rdiff-backup uses SSH to pipe the remote data. Therefore, both an SSH server and rdiff-backup are required on the hosts to be backed up.
What is left on host beagle and the others (shepherd, terrier) is simply to give canine permission to access the host (through SSH) and run rdiff-backup. This can be done in the following two steps:


Step I: create an authorized-keys file for root account

To enable SSH public-key authentication for the root account, we need to create the file /root/.ssh/authorized_keys, which contains the public key for user rdiffbk@canine, the forced command, and other options. The public key (id_beagle-backup.pub) should already be available on beagle once Step 2 is done. A sample authorized_keys file is as follows:

command="rdiff-backup --server --restrict-read-only /",from="canine",no-port-forwarding,no-X11-forwarding,no-pty ssh-rsa AAAAB3.... rdiffbk@canine

Here, for security reasons, the rdiff-backup server is restricted to read-only access, and we disable port forwarding, X11 forwarding and pty allocation. See here for more details.

Step II: configure SSH server for root access

As we saw here, this can be done by putting the following line in the SSH server configuration file (sshd_config):

PermitRootLogin forced-commands-only

Sunday, July 28, 2013

Rdiff-backup by Examples

Rdiff-backup is a Python script that backs up one directory to another. Some features of rdiff-backup, as claimed on its official site, are: easy to use, creating mirrors, keeping increments, and preserving all information. Here are some examples of its usage.

1. simple backing up (backup local directory foo to local directory bar):

rdiff-backup foo bar

2. simple remote backing up (backup local directory /some/local_dir to directory /whatever/remote_dir on machine hostname.net):

rdiff-backup /some/local_dir hostname.net::/whatever/remote_dir

SSH will be used to open the necessary pipe for the remote backup.

3. simple restoring from previous backup (restore from bar/dir to foo/dir):

cp -a bar/dir foo/dir

4. simple restoring from the latest remote backup (restore from hostname.net::/whatever/remote_dir to local directory /some/local_dir):

rdiff-backup -r now hostname.net::/whatever/remote_dir /some/local_dir

5. restoring from a certain version of a remote backup (restore from the backup made 15 days ago):

rdiff-backup -r 15D hostname.net::/whatever/remote_dir /some/local_dir

6. restoring from an increment file (restore file pg.py to its version dated 2011-11-30T00:28:38+08:00):

rdiff-backup hostname.net::/remote-dir/rdiff-backup-data/increments/pg.py.2011-11-30T00:28:38+08:00.diff.gz  /local_dir/pg.py


Monday, July 8, 2013

How to Mount LVM partitions/disks

The logical volume manager (LVM) is suitable for many occasions, e.g., managing large disk farms or easily re-sizing disk partitions on small systems. The following quote from the Wikipedia LVM page best describes its common uses:

One can think of LVM as a thin software layer on top of the hard disks and partitions, which creates an illusion of continuity and ease-of-use for managing hard-drive replacement, repartitioning, and backup.
Here are steps on how to mount LVM partitions/disks:

1. scan all disks for volume groups:

vgscan


2. scan all disks for logical volumes:

lvscan

The output consists of one line for each logical volume indicating if it is active and its size.

3. change the availability of the logical volume (if it is inactive):


lvchange -a y /dev/vg_name/lv_name

where vg_name is the name of the volume group found by vgscan and lv_name is the name of the logical volume found by lvscan. You may use vgchange to change the availability of all logical volumes in a specified volume group.

4. mount the logical volume:

mount /dev/vg_name/lv_name /mount/point

where /mount/point is the mount point for the logical volume.

Monday, July 1, 2013

Access UFS File System under Linux

The Unix file system (UFS) is widely used in many Unix systems, for example, FreeBSD, OpenBSD, and HP-UX. There are times when we need to access UFS under Linux. The following command mounts a UFS2 file system read-only (ro) under Linux:

mount -t ufs -o ufstype=ufs2,ro /dev/sdXY /mnt/path

Write support for UFS is not compiled into Linux kernels by default. One needs to configure and compile the kernel appropriately to enable write support.


Wednesday, May 1, 2013

Hard and Symbolic links

A hard link is an entry in a directory file that associates a name with an (existing) file on a file system, which allows a file to appear in multiple paths.

Unix/Linux systems do not allow hard links to directories, since that could create endless cycles. Hard links are limited to files on the same volume, because the name-to-file association in each hard link goes through an inode. Most file systems that support hard links use a link count to keep track of the total number of links pointing to the inode (file). To find all the files that refer to the same file as NAME, we may use the command find with the option '-samefile NAME' or '-inum INODE', where INODE is the inode number of NAME. The command ls with the option '-il' shows the link count and inode number for files.
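A quick demonstration of these points (the names original.txt and alias.txt are made up; `stat -c` assumes GNU coreutils; run in a throwaway temp directory):

```shell
set -e
cd "$(mktemp -d)"
echo "data" > original.txt
ln original.txt alias.txt        # create a hard link: same inode, new name
ls -il original.txt alias.txt    # identical inode numbers, link count 2
find . -samefile original.txt    # lists both ./original.txt and ./alias.txt
stat -c '%h %i' original.txt     # link count and inode number
```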


A symbolic link is a special type of file that contains a text string which is interpreted by the operating system as a path to another file/directory. The other file/directory is usually called the "target". A symbolic link is another file that exists independently of its target, i.e., they are two files/directories indexed by two different inodes, as opposed to hard links. Symbolic links are different from hard links in that:
  • a symbolic link may point to a directory, and
  • a symbolic link may point to a directory/file in different volume

There is one issue with symbolic links. Removing a symbolic link leaves its target unaffected. However, a symbolic link is not automatically updated when its target is moved, renamed, or deleted: the link continues to exist and to point to the original target location, which is no longer valid. Such a link is called a broken (or dangling) link.
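A short sketch of how a link becomes broken (the names target.txt and link.txt are made up; run in a throwaway temp directory):

```shell
set -e
cd "$(mktemp -d)"
echo "hi" > target.txt
ln -s target.txt link.txt   # a symbolic link: its own file, its own inode
readlink link.txt           # prints the stored path: target.txt
rm target.txt               # removing the target does not touch the link
ls -l link.txt              # the link still exists ...
cat link.txt || true        # ... but following it now fails: broken link
```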

Tuesday, April 30, 2013

Inode

An inode is a data structure in Unix/Linux file systems that stores information (metadata) about a file system object, except the file content and the file name. The command stat retrieves most of the information stored in an inode. Here is an example for the system file bash on a Linux box:

  File: `/usr/bin/bash'
  Size: 902036        Blocks: 1768       IO Block: 4096   regular file
Device: 808h/2056d    Inode: 279026      Links: 1
Access: (0755/-rwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Context: system_u:object_r:shell_exec_t:s0
Access: 2013-05-01 13:59:45.680608666 +0800
Modify: 2013-01-31 22:47:10.000000000 +0800
Change: 2013-03-21 10:29:27.135450340 +0800
 Birth: -


We can see that regular files have the following attributes:
  • Size in bytes
  • Device ID where the file is stored
  • Inode number
  • Link count
  • User ID of the file
  • Group ID of the file
  • Permissions (access rights) of the file
  • Timestamps on last access (atime), modify (mtime), and change (ctime)
To get the file content, we need to consult the inode pointer structure, which is also part of the information stored in the inode. The inode pointer structure consists of four different kinds of pointers:
  • direct pointer,
  • singly indirect pointer,
  • doubly indirect pointer, and
  • triply indirect pointer
These pointers point directly or indirectly to block locations where pieces of file content are stored.


One question remains: where is the file name stored? The answer is that it is stored in the content of the directory that contains the file. Unix/Linux directories are lists of association structures, each of which consists of one file name and one inode number for that file. That is why we need to specify (implicitly or explicitly) the path whenever we want to access a file in the file systems.
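The name-to-inode association can be seen directly: ls -i reads the directory entry, while stat reads the inode itself (the file name a.txt is made up; `stat -c` assumes GNU coreutils):

```shell
set -e
cd "$(mktemp -d)"
echo "x" > a.txt
ls -i a.txt          # the inode number followed by the name
stat -c '%i' a.txt   # the same inode number, from the file's metadata
```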

Monday, April 22, 2013

poor men's PGP

I just finished pmPGP, a CLI for sending/receiving OpenPGP MIME messages.

pmPGP is based on Python and GnuPG; it supports sending emails in the following formats:
  1. plain -- regular email
  2. sign -- RFC3156
  3. encrypt -- RFC3156
  4. sign-encrypt -- RFC3156
  5. Sencrypt -- symmetric encryption (for fun and personal usage)
  6. sign-Sencrypt -- (for fun and personal usage)
Poor men may use pmPGP to store/back up files on email servers.
Sounds interesting? Get it from:

Saturday, January 5, 2013

find all file descriptors used by a process

A file descriptor (FD) is an abstract indicator used to access a file. In Unix-like systems, file descriptors can refer to many objects besides regular files, such as pipes, Unix domain sockets, and Internet sockets.

lsof (list open files) is an open-source command that reports a list of open files and the processes that opened them. To find all file descriptors used by the process with a given pid, we may issue the command:

lsof -p pid

To find all internet sockets used by the process with pid, we may issue:

lsof -i -n -P | grep pid

where -i lists IP sockets only, -n suppresses translation of host names, and -P suppresses translation of port names.


What if lsof is not available on your system?

If your system implements procfs (the proc filesystem, /proc), all file descriptors used by the process with pid can be found in the directory /proc/pid/fd. Therefore, on Linux systems, you may issue:

ls -l /proc/pid/fd

to get the job done. However, a different approach is needed on FreeBSD systems, since procfs is being gradually phased out there. Both fstat (identify active files) and procstat (get detailed process information) allow us to achieve the goal. You may issue:

fstat -p pid
or,
procstat -f pid

where pid is the process ID of interest.
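As a quick self-contained check on a Linux box with procfs, the same idea applied to the current shell ($$ expands to its own pid):

```shell
ls /proc/$$/fd            # at least 0 (stdin), 1 (stdout), 2 (stderr)
readlink /proc/$$/fd/0    # where stdin points: a tty, a pipe, or a file
```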