Introduction to Unix


Robert Kofler

Introduction

What is Unix?

• Unix: the name derives from Unics (Uniplexed Information and Computing Service)
• an operating system that is very stable and well suited for multiple users and multiple tasks
• foundation or model of many modern operating systems such as Linux, Android and Mac OS X
• basic ideas: plain text for storing data, hierarchical file system, many small tools
• an important innovation is the modular design, the “Unix philosophy”: the OS provides a limited set of simple tools that each perform a single, well-defined task
• complex tasks and sophisticated workflows can be executed by concatenating these commands (piping and shell scripting) instead of using a single, large monolithic tool
• Rob Pike, one of the early gurus, wrote (together with Brian Kernighan): “the power of a system comes more from the relationships among programs than from the programs themselves”
• the centerpiece is the kernel, a master control program that provides services to start and stop programs, handle file systems and schedule tasks

Short history of Unix

• MIT and Bell Labs (together with General Electric) developed an OS called Multics in the mid-1960s
• Ken Thompson, Dennis Ritchie and others got very frustrated with Multics, as it had many problems and had become too large for efficient maintenance
• they decided to redo an OS on a much smaller scale
• the name Unix is actually a joke, making fun of Multics; Unix is pronounced like “eunuchs”, i.e. an emasculated Multics (deprived of power and virility)
• 1972-1973: Unix rewritten in the C programming language; before that it was written in assembler. Why is this important?
• 1970-1980: large popularity in academia; increasing acceptance also by commercial start-ups
• 1991: popularity increased even more due to the release of Linux
• 2001: Apple released the first Mac OS X, based on Unix

Evolution of Unix

Figure 1:

Unix in bioinformatics?

• allows dealing with large data: try to open a 100 GB text file in a word processor
• powerful command line tools (Unix philosophy) allow even complex tasks to be done with a single line of code
• allows automating workflows: try to change the content of 1000 files with GUI-based software; the result will be tons of errors (e.g. when attention briefly drifts to the evening beer) and tendinitis (that’s why I went into bioinformatics)
• it will probably still be in use decades from now; an investment in learning Unix will likely pay off for your entire career (command line magic)
• repeatability of the analysis: workflows may be stored as text files; repeatability is important for you (in case you discover an error) and for others (they may want to redo an analysis)
• widely used: much support and many people developing tools

Recommended Readings

• Unix command cheat sheet: http://cheatsheetworld.com/programming/unix-linux-cheat-sheet/
• Brian W. Kernighan, Rob Pike: The Unix Programming Environment
• Love, Merlino Reed.., Beginning Unix
• Powers and Peek: Unix Power Tools

In medias res

Start the command line

Finder => Applications => Utilities => Terminal

Drag the terminal icon into the quick start bar

The three major questions when you find yourself in a novel environment

• Who am I? (whoami)
• Where am I? (pwd)
• Who is the person in my bed? (ls)

Figure 2:

whoami
pwd
ls

Structure of a unix command:

command [parameters] [files]

The strange person in detail

ls -l
ls -l -h
ls -lh # the same as -l -h

• -l long format: list the details of each file
• -h human-readable file sizes

Figure 3:

• col1: file type (file or directory) and permissions
• col2: number of links; for a file the number of hard links, for a directory a count that depends on its content
• col3: owner
• col4: group
• col5: size
• col6, 7, 8: last modified; month, day, and time of day (or the year for older files)
• last col: file name / directory name

Walk of shame: planning the escape

Change directory (cd)

cd Desktop # move into the folder Desktop; relative path
cd /Users/robertkofler/Desktop # move into the folder Desktop; absolute path
cd . # go to the current directory; when is this useful?
cd / # go to the root directory; absolute path
cd .. # go to the parent directory
cd ~ # go to the home folder

Relative path, absolute path

Example of a file tree

What is root?

Unix has a hierarchical file system with root (“/”) as the top folder

Figure 4:


Moving around

Figure 5:

You are at student and want to move to Desktop?

cd Desktop # move into the folder Desktop; relative path
cd /Users/student/Desktop # move into the folder Desktop; absolute path

Task:

Figure 6:

• You are at student and want to move to tasks? (relative and absolute)
• You are at student and want to move to Users? (relative and absolute)
• You are at student and want to move to THE FOLDER? (what’s the name again? relative and absolute)

Figure 7:

• You are at bin and want to move to Temp1? (relative and absolute)
• You are at bin and want to move to python? (relative and absolute)

Task:

• Download the files human-genes.gtf and illumina-reads.fastq.gz from https://drrobertkofler.wikispaces.com/Unix_Alignments
• Navigate to the folder where you stored them
• Tell me their sizes in kb

Info about a command

Omg, I forgot all the options of ls, what should I do??

man ls # manual for ls
# exit with q

Most important buttons in the command line

1.) Tab: autocomplete

Task

• Navigate to the folder where you stored human-genes.gtf using the absolute path and autocomplete
• What could be the advantages of using autocomplete?

2.) Arrow up

Recalls the previous command

Furnishing your new home

Create directories

mkdir mydata
mkdir mydata/test/raw # does this work?
mkdir -p mydata/test/raw

Create empty files

touch mydata/text.txt
touch mydata/test/moretxt.txt

Remove directories

rmdir mydata/test/raw
rmdir mydata/test # does this work?

Note: rmdir only removes empty directories
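The behaviour of mkdir, mkdir -p and rmdir can be tried out safely in a throw-away folder. The following is only a minimal sketch; the scratch folder created by mktemp and the variable names are assumptions made for this illustration and are not part of the course material.

```shell
# Work in a throw-away folder so nothing important can be damaged
scratch=$(mktemp -d)
cd "$scratch"

# Without -p, missing parent folders are an error
if mkdir mydata/test/raw 2>/dev/null; then
    plain_mkdir="worked"
else
    plain_mkdir="failed"
fi

# With -p all missing parent folders are created as well
mkdir -p mydata/test/raw

# rmdir refuses to delete a folder that still has content
if rmdir mydata/test 2>/dev/null; then
    rmdir_nonempty="worked"
else
    rmdir_nonempty="failed"
fi

# Removing from the inside out works, because each folder is empty by then
rmdir mydata/test/raw
rmdir mydata/test
rmdir mydata

echo "mkdir without -p: $plain_mkdir"
echo "rmdir on a non-empty folder: $rmdir_nonempty"
```

Capturing the result in a variable (instead of just watching the error message) makes it easy to see which of the commands succeeded.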

Remove files

rm mydata/test/moretxt.txt
rm -rf mydata # killer command; use with the utmost care (gone is gone)
# -r recursive (subdirectories) -f force (just delete, no questions asked)

Move files

touch test.txt # does this work?
mv test.txt oklahoma.fastq.txt # move can be used to rename
mkdir usa
mv oklahoma.fastq.txt usa/ # move files into a new folder
rm -rf usa

Task:

• Go to your home folder
• create the directory analysis
• create the directory analysis/coursework
• create the directory analysis/coursework/rawdata
• move the two files (human genes and illumina reads) into the directory analysis/coursework/rawdata

Inspecting files

wc myfile.txt # word count; display the number of lines, words and characters (in this order)
head myfile.txt # output the first ten lines of the file
head -50 myfile.txt # output the first 50 lines of the file
tail myfile.txt # output the last 10 lines of the file
tail -2 myfile.txt # output the last 2 lines of the file
cat myfile.txt # output the entire file
cat myfile1.txt myfile2.txt # output the entire first file, then the entire second file (no separation between the files)
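To try these commands without any course data, a small test file can be generated first. This is a sketch under assumptions: the temporary file from mktemp and the variable names are made up for the illustration, and -n 3 is simply the explicit spelling of -3.

```shell
# Create a small test file with 100 numbered lines
demo_file=$(mktemp)
seq 1 100 > "$demo_file"

# Number of lines only (-l); reading from standard input avoids printing the file name
line_count=$(wc -l < "$demo_file" | tr -d ' ')

# First three and last two lines
first_lines=$(head -n 3 "$demo_file")
last_lines=$(tail -n 2 "$demo_file")

echo "lines: $line_count"
echo "$first_lines"
echo "$last_lines"
```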

Task

• Display the first 20 lines of the human genes and the last 5 lines
• How many genes do we have?
• Display the first 10 lines of the illumina reads

Piping basics

The symbol “|” (the pipe) is a magic character in Unix; it allows commands to be concatenated. THIS IS THE SINGLE MOST IMPORTANT COMMAND OF UNIX. The entire power of Unix rests on the pipe. Remember the Unix philosophy: small commands that each have only a very limited function. With the pipe, these small commands can be concatenated to solve complex tasks.

Meaning of the pipe

Use the output of the left command as input for the right command

cat myfile.txt | head -10

• left: cat myfile.txt
• right: head -10
• Note that most Unix commands can be used with a file and within a pipe

head -10 myfile.txt # head with a file
cat myfile.txt | head -10 # head with a pipe
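That the file form and the pipe form give the same result can be checked on a generated file. A minimal sketch, assuming a temporary file from mktemp; all names are made up for the illustration.

```shell
# Create a small test file with 100 numbered lines
demo_file=$(mktemp)
seq 1 100 > "$demo_file"

# The same command, once with a file argument and once fed through a pipe
from_file=$(head -n 3 "$demo_file")
from_pipe=$(cat "$demo_file" | head -n 3)

# Three commands in a row: each one reads what the previous one wrote
count=$(cat "$demo_file" | head -n 5 | wc -l | tr -d ' ')

echo "$from_file"
echo "$from_pipe"
echo "lines after head: $count"
```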

Complex pipes examples

cat myfile.txt | tail -10 | wc # What’s going on here?

Question: Is there an upper limit to this piping?

Thinking Task:

• Relying only on the commands we already know, display the middle 10 lines of human genes :)


Editing file: vi

Vi is a powerful text editor operating solely in the command line. Two such text editors are traditionally found on Unix systems: vi and emacs. There is a battle raging between the religion of vi and the religion of emacs; I’m of the religion of vi. To be productive in the command line you have to know at least one of these two. I’m going to teach vi. However, if you already know emacs, then just relax and sleep.

vi myfile.txt

Vi has two modes: the command mode and the text entry mode. For example, when you press “yy” in the command mode it yanks the current line (like copy), but when you press “yy” in the text mode it simply writes yy.

• Press “i” (insert) to enter the text mode
• Press Esc to exit the text mode and enter the command mode
• type “:w” to save the current file (command mode)
• type “:wq” to save and exit (back into the command line)
• type “:q!” to exit vi without saving
• press “A” to append text at the end of the current line (changes from command mode to text mode)
• press “dd” to delete the current line (command mode)
• press “2dd” to delete two lines (the current one and the next one)
• press “yy” to yank the current line (put it into memory); what is “2yy” doing?
• press “p” to paste the line from the memory into the text
• here is a list of commands: http://www.lagmonster.org/docs/vi.html I urge you to study them in some detail; knowing vi well helps to make you quite powerful in the command line

Task

• create a file that contains a one-sentence description of your master thesis with vi
• create another file that lists your hobbies (one hobby per line)
• from your hobbies delete some lines, copy and paste some others and edit some other lines
• undo the damage and create a nice hobby file; use your name as file name (no spaces); note that you will share your hobbies later on with the others
• display the first two hobbies using head

Copy vs Link

Copy and link files

touch original.txt # create a file
cp original.txt copy.txt # create a copy of a file
ln copy.txt link.txt # create a hard link to a file

Task 4

• Use vi and enter the text “1” into original.txt, then “2” into copy.txt and “3” into link.txt
• Display the content of the three files
• Notice anything interesting? How can you explain this?
• Think of some examples: in which situations would you use copy and when would you use link?

Addendum copy

cp -r folder otherfolder # recursively (including subfolders) copy the content of one folder to another one


Remote access to computers

ssh (Secure shell)

ssh allows you to operate on a remote computer

ssh user_name@computer_name

scp (secure copy)

allows you to copy files between computers:

# scp from to
scp myfile user_name@computer_name:/absolute/path/of/destination

Task

• ssh as user vetgrid03 to computer i122mc146.vu-wien.ac.at
• Explore the files in the root folder
• move to folder /Volumes/Temp/coursework/hobbies2017
• scp your hobbies into the folder /Volumes/Temp/coursework/hobbies2017 (you may want to open a second shell window)
• explore the hobbies of your colleagues

Managing processes

Running process in background

sleep 10 # the terminal sleeps for 10 seconds
sleep 30 & # the command sleeps but you can continue working
# in other words the command is executed in the background

Question: where is this useful?

Move command from background to foreground

sleep 30 &
fg # foreground

Move active command into the background

sleep 30
# press ctrl+z to suspend the command
bg # continue in background

Thus any command can be moved into the background

Display running processes

ps -e

NOTE: the left column is the process id (PID)

Monitor running processes in real time

top
# exit with q (quit)

Sort running processes by CPU usage; display the most computationally demanding processes first

top -o cpu # sort order by cpu (macOS syntax; on Linux use top -o %CPU)

Task

• go to vetgrid03
• which are the most demanding processes?
• what are the PIDs of these processes?

kill a task

kill 39534 # 39534 is a PID
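Background processes and kill can be combined in a small, harmless experiment. The following is a sketch only: it uses sleep as a stand-in for a long analysis, and the variable names are assumptions made for the illustration. The special variable $! and kill -0 are standard shell features, but they are not covered in this course.

```shell
# Start a long-running command in the background
sleep 30 &
pid=$!    # $! holds the PID of the most recent background command

# Check that the process exists (kill -0 sends no signal, it only tests)
if kill -0 "$pid" 2>/dev/null; then
    before="running"
else
    before="gone"
fi

# Terminate it, then wait so that the shell collects the finished process
kill "$pid"
wait "$pid" 2>/dev/null || true

if kill -0 "$pid" 2>/dev/null; then
    after="running"
else
    after="gone"
fi

echo "before kill: $before, after kill: $after"
```

Using $! avoids having to look up the PID with ps or top when you started the process yourself.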

Bioinformatics with the command line

grep

Only retain rows that match the pattern

grep 'blabla' file.txt # get rows that contain 'blabla' from file.txt
cat rawdata/human-genes.gtf | grep 'BRCA'
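A small sketch of grep on a made-up table. The file below is not the course file; it is a four-line stand-in created only for this illustration. The option -c (count matching lines) is a standard grep option that is not introduced above.

```shell
# A tiny made-up gene list: gene name and chromosome, tab-separated
genes=$(mktemp)
printf 'BRCA1\t17\nBRCA2\t13\nTP53\t17\nMYC\t8\n' > "$genes"

# Keep only the rows that contain the pattern
brca_rows=$(grep 'BRCA' "$genes")

# Count the matching rows in two equivalent ways
count_pipe=$(grep 'BRCA' "$genes" | wc -l | tr -d ' ')
count_flag=$(grep -c 'BRCA' "$genes")

echo "$brca_rows"
echo "with wc: $count_pipe, with -c: $count_flag"
```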

Task

• How many lines contain ‘MA’?

cut

Display only column 1 of the file

cat rawdata/human-genes.gtf | cut -f 1

Display several columns of the file

cat rawdata/human-genes.gtf | cut -f 1,9

Use a different field delimiter

cat rawdata/human-genes.gtf | cut -f 1 -d " "

What’s going on here?
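The effect of the delimiter is easiest to see on a single made-up line. The line below only imitates the general shape of a tab-separated table whose last column contains spaces; it is an assumption for the illustration and not taken from the course file.

```shell
# One line with tab-separated columns; the last column contains spaces
line=$(printf 'chr1\t100\tgene_id "A"; gene_name "B"')

# Default: columns are separated by tabs
col1_tab=$(printf '%s\n' "$line" | cut -f 1)
col3_tab=$(printf '%s\n' "$line" | cut -f 3)

# With -d " " the line is split at spaces instead; tabs are then ordinary characters
col1_space=$(printf '%s\n' "$line" | cut -f 1 -d " ")
col2_space=$(printf '%s\n' "$line" | cut -f 2 -d " ")

echo "tab, column 1:   $col1_tab"
echo "tab, column 3:   $col3_tab"
echo "space, column 1: $col1_space"
echo "space, column 2: $col2_space"
```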

Task

• How many genes are on chromosome 19?
• How many genes are on chromosome 1? Did you notice any problem? How could we overcome it?

grep '^blabla$'
# ^ beginning of line
# $ end of line
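What the two anchors change can be sketched on a made-up list of chromosome names (one per line). The file and the variable names are assumptions for this illustration; -c counts the matching lines.

```shell
# Made-up chromosome names, one per line
chroms=$(mktemp)
printf '1\n10\n11\n21\n1\nX\n' > "$chroms"

# Without anchors every line that contains a 1 somewhere is counted
loose=$(grep -c '1' "$chroms")

# ^ and $ force the pattern to span the whole line
exact=$(grep -c '^1$' "$chroms")

echo "lines containing 1: $loose"
echo "lines that are exactly 1: $exact"
```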

uniq and sort

Use vi and create a file (numbers.txt) with the following content:

1
1
2
2
2
3
1
1

The uniq command eliminates all consecutive redundant entries. We want to get only the unique entries in the file. What do we expect?

cat numbers.txt | uniq

What is going on? Any ideas?

We may also count the number of occurrences of each number

cat numbers.txt | uniq -c # count the uniq entries

How could we get the counts of occurrences of each number? Basically we need to sort the entries before using uniq.

cat numbers.txt | sort | uniq -c

NOTE: for this reason sort and uniq frequently come as a pair

NOTE: sort is extremely powerful; it allows sorting by a specific column (-k3,3), by several columns (-k3,3 -k1,1), numerically or alphanumerically, etc. Note that -k3 alone uses everything from column 3 to the end of the line as the sort key.
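The difference between sorting as text and sorting as numbers can be sketched on a small made-up table. The table (a chromosome and a position per line) and the variable names are assumptions for this illustration; the n in -k2,2n requests a numerical comparison for that key.

```shell
# Made-up table: chromosome and position, tab-separated
table=$(mktemp)
printf '2\t500\n10\t30\n2\t40\n10\t200\n1\t7\n' > "$table"

# Column 1 only, sorted so that equal values become neighbours, then counted
counts=$(cut -f 1 "$table" | sort | uniq -c)

# Sort by column 2 as text: 200 comes before 30 because "2" < "3"
as_text=$(sort -k2,2 "$table")

# Sort by column 2 as numbers
as_numbers=$(sort -k2,2n "$table")

echo "$counts"
echo "--- as text"
echo "$as_text"
echo "--- as numbers"
echo "$as_numbers"
```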

Task

• How many genes are on each chromosome?
• Sort the human genes, first by chromosome and then by start position (use the help of man and google).

Problem: We want to obtain a list of genes located on chromosome 3, how could we do this?

As a first part of the problem we need to keep only genes located on chromosome 3. Introducing awk :)

cat rawdata/human-genes.gtf|awk '$1=="3"'
# column 1 needs to match 3
# $1 is column 1
# $3 is column 3 etc
# in case column 1 matches (==) the default behaviour of awk is to print the line

AWK is actually a powerful mini-programming language; learning awk in more detail can be very useful. However, I only have time for a quick introduction. In case we still have some time we will revisit awk in the end.

Now it’s easy to get a list of the genes on chromosome 3

cat rawdata/human-genes.gtf|awk '$1=="3"'|cut -f 4 -d " "
# Who can explain this code to me?

Awk also allows testing for pattern matches, e.g. whether the start position of a gene ends with a zero.

cat rawdata/human-genes.gtf|awk '$4~/0$/'
# $4 column 4
# ~ tilde: match the following pattern within //
# $ inside the pattern: end of the text (here the end of column 4)
# 0$ ends with a zero
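Both kinds of awk conditions can be tried on a made-up table. The table below (chromosome, gene name, start position) and all names are assumptions for this illustration. The last command additionally uses an action in curly braces, a standard awk feature that goes slightly beyond what is shown above.

```shell
# Made-up table: chromosome, gene name, start position (tab-separated)
table=$(mktemp)
printf '3\tgeneA\t120\n13\tgeneB\t45\n3\tgeneC\t300\nX\tgeneD\t10\n' > "$table"

# Exact comparison: only rows where column 1 is exactly "3" (13 is not kept)
chr3=$(awk '$1=="3"' "$table")

# Pattern match: rows where column 3 ends with a zero
ends_zero=$(awk '$3~/0$/' "$table")

# A condition can be followed by an action; here only column 2 is printed
chr3_names=$(awk '$1=="3" {print $2}' "$table")

echo "$chr3"
echo "---"
echo "$ends_zero"
echo "---"
echo "$chr3_names"
```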

Redirecting the output

cat rawdata/human-genes.gtf|awk '$1=="3"'|cut -f 4 -d " " > chr3-genest.txt # store output into file;
# if file already exists overwrite it
cat rawdata/human-genes.gtf|awk '$1=="3"'|cut -f 4 -d " " >> chr3-genest.txt # append output to file
# if file does not exist, create it
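The difference between > and >> can be sketched with echo and a temporary file; the file from mktemp and the variable names are assumptions made for this illustration.

```shell
out=$(mktemp)

echo "first"  > "$out"    # creates or overwrites the file
echo "second" > "$out"    # overwrites again; "first" is lost
echo "third"  >> "$out"   # appends to the existing content

content=$(cat "$out")
echo "$content"
```

Because > silently overwrites, it is worth checking the file name twice before redirecting into a file that took hours to compute.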

Think Task

• Get a list of all genes having exactly 3 exons

Working with zipped files

Zip a file

Zip your hobbies

gzip my-hobbies.txt

Unzip a file

gzip -d my-hobbies.txt

Tasks

• What happened to the file extension of a zipped file?
• What happened to the size of the zipped file? By which factor have your hobbies shrunk?
• What happens when you read a zipped file (e.g. with head, tail)?

Use zipped files within a pipe

A fastq file is the standard raw output of Next Generation Sequencing (e.g. Illumina) and may also be obtained for PacBio or Oxford Nanopore.

gzip -cd rawdata/illumina-reads.fastq.gz | head

Figure 8:

Every fastq entry has four lines

• line 1: read name (starts with an @)
• line 2: sequence of the entry
• line 3: separator line (starts with a +), optionally followed by the read name again
• line 4: quality of the sequence

Details about this file will be presented in another lecture.
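To experiment without the course data, a tiny fastq file can be written by hand and compressed. This is a sketch only: the two entries, the folder from mktemp and all names are made up for the illustration, and the quality strings are placeholders.

```shell
# Write a made-up fastq file with two entries (four lines each)
workdir=$(mktemp -d)
printf '@read1\nACGTN\n+\nIIIII\n@read2\nACGTA\n+\nIIIII\n' > "$workdir/demo.fastq"

# Compress it; the result is stored as demo.fastq.gz
gzip "$workdir/demo.fastq"

# -d decompresses, -c writes to standard output; the .gz file stays untouched
first_entry=$(gzip -cd "$workdir/demo.fastq.gz" | head -n 4)

echo "$first_entry"
```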

Task

• How many entries does this fastq file have?

Question: obtain all reads that contain the character ‘N’

We want to obtain all reads that contain the character ‘N’ and store it in a separate fastq file.

The key problem here is that a single fastq entry has four lines, whereas Unix operates on a line-by-line basis. It would thus be nice to merge the content of several lines into a single one.

paste

Use vi to create the following file (paste-test.txt)

1
2
3
4
5
6
7
8

Now we can try

cat paste-test.txt | paste - -
cat paste-test.txt | paste - - - -

What is the difference?

Back to the fastq

gzip -cd rawdata/illumina-reads.fastq.gz|paste - - - -|head

Next we need to obtain only entries where the sequence contains the character N. Note that column 2 contains the sequence.

gzip -cd rawdata/illumina-reads.fastq.gz|paste - - - - |awk '$2~/N/'|head
# if the read names contain spaces, use awk -F '\t' so that column 2 is really the sequence

Repackaging fastq entries to 4-lines per entry using the translate command

The translate command replaces some characters with others

# tr 'replace_this' 'with_this_character'
echo "Hallo"|tr 'l' 'r'

Another info: special characters

• \t tab
• \n new line

gzip -cd rawdata/illumina-reads.fastq.gz|paste - - - - |awk '$2~/N/'|tr '\t' '\n'|head

Finally zip the output and write it into a novel fastq file


gzip -cd rawdata/illumina-reads.fastq.gz|paste - - - - |awk '$2~/N/'|tr '\t' '\n'|gzip -c > rawdata/fastq-withN.fastq.gz
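The whole pipeline can be tested on a tiny made-up fastq file before running it on real data. The sketch below is built on assumptions: the two entries, the read names with a space in them and all file names are invented for the illustration. It also uses awk -F '\t', which tells awk to split columns at tabs only; without it awk splits at any whitespace, so a space inside the read name would shift the columns.

```shell
# Two made-up entries; the read names contain a space and the letter N
workdir=$(mktemp -d)
printf '@read1 1:N:0\nACGTN\n+\nIIIII\n@read2 1:N:0\nACGTA\n+\nIIIII\n' | gzip -c > "$workdir/demo.fastq.gz"

# Whitespace splitting: column 2 is "1:N:0", so both entries match by mistake
naive_count=$(gzip -cd "$workdir/demo.fastq.gz" | paste - - - - | awk '$2~/N/' | wc -l | tr -d ' ')

# Tab splitting: column 2 is the sequence, so only read1 matches
strict_count=$(gzip -cd "$workdir/demo.fastq.gz" | paste - - - - | awk -F '\t' '$2~/N/' | wc -l | tr -d ' ')

# Full round trip: merge four lines, filter, split again, compress
gzip -cd "$workdir/demo.fastq.gz" | paste - - - - | awk -F '\t' '$2~/N/' | tr '\t' '\n' | gzip -c > "$workdir/withN.fastq.gz"

result=$(gzip -cd "$workdir/withN.fastq.gz")
echo "naive: $naive_count, strict: $strict_count"
echo "$result"
```

Whether -F '\t' is needed depends on the read names in your file; checking the first entries with head before filtering shows which case applies.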

Task

• Display the content of rawdata/fastq-withN.fastq.gz
• How many entries have an N in the sequence?

Think task

• Write all fastq entries that are shorter than 20 bp into a separate zipped fastq file. You need the awk function length(somecolumn)

Send me this file as email rokofler at gmail.com

Advanced topics: iterating over many files

The Unix command line also allows iterating over many files

For example we may have forgotten to add the extension .txt to several text files

# let’s first create a folder containing our files
mkdir iterate
touch iterate/1
...
touch iterate/7

Now let’s change the file names

for i in iterate/*; do mv "$i" "$i.txt"; done

Note $i is the variable that contains the file name. You could use any variable name.

When you are able to wield for-loops and unix commands you can perform powerful analysis in little time :)

Note: you can also destroy your entire data set in little time... so be careful when using for-loops and test them before using them.
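One way to test a for-loop before using it is a dry run that only prints the commands. The following is a sketch under assumptions: the folder from mktemp and the file names (one of them deliberately containing a space) are made up for the illustration.

```shell
# A throw-away folder with three files; one file name contains a space
workdir=$(mktemp -d)
touch "$workdir/1" "$workdir/2" "$workdir/my file"

# Dry run: echo only prints what would be done, nothing is renamed
for i in "$workdir"/*; do
    echo mv "$i" "$i.txt"
done

# Real run; the quotes keep file names with spaces in one piece
for i in "$workdir"/*; do
    mv "$i" "$i.txt"
done

renamed=$(ls "$workdir")
echo "$renamed"
```

Once the printed commands look right, removing the echo turns the dry run into the real loop.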

Summary

I hope this helps to illustrate the Unix philosophy: with a few simple commands, concatenated in creative ways, powerful analyses may be achieved. You don’t need programming for this. You only require a bit of creativity. Also, it’s very short: many tasks can be solved with a single line of code. Note that some commands like awk, the for-loops and shell scripting are very powerful; spending more time learning them will certainly pay off during your further career.