
# Linux Notes


strace -p $pid # running program
strace -c $program # summary
strace -e trace=write # filter
ptrace()


uptime

logs:
dmesg | tail
tail /var/log/messages  ( /var/log/syslog /var/log/kern )
journalctl

/var/log/audit/audit.log SELinux
/var/log/secure

vmstat # summary
vmstat 1 5 -w # every 1 s, 5 reports, wide output (first report is averages since boot)
vmstat -d # disk stats
vmstat -s # summary memory stats

r: runnable (running or waiting to run in queue)
b: uninterruptible sleep (D in ps)

active/inactive: recently used/not

cat /proc/meminfo

softirqs: network tx/rx, timer

dmidecode : hardware info (memory, cpu, board, vendor )

yum install sysstat:
mpstat -P ALL # cpu balance

pidstat
pidstat 1 1
pidstat -p pid

iostat -x 1
sar -n DEV 1 # network throughput
sar -n TCP,ETCP 1 # TCP stats (also: ss, network interface throughput, retransmits)

free -m
top : P sort by CPU, M sort by memory


lscpu
cat /proc/cpuinfo
    flags


df -lT # -l local. -T type
lsblk -f # filesystem
file -s /dev/hda1
mount


/proc/loadavg : uptime
/proc/meminfo : free -m
/proc/buddyinfo : memory fragmentation
/proc/slabinfo : slabs: cache of freq used objects in kernel. inode, xfs, net.
                        vmstat --slabs
/proc/version
/proc/locks
/proc/mounts :  mount
/proc/devices : char (tty, usb) & block devices
/proc/cgroups
/proc/cmdline : linux boot cmdline args: image, root, lang
/proc/interrupts : IRQs per CPU
/proc/dma : ISA Direct Memory Access (floppy, cascade)
/proc/partitions : major/minor blocks
/proc/filesystems : fs types in mount -t  : nodev , xfs
/proc/swaps
/proc/stat : CPUs, processes
/proc/modules : lsmod
/proc/sys/* : sysctl -a

/proc/net/dev : tx/rx: errs, drop
/proc/net/netstat : TCP stats


/proc/1/stat
/proc/1/status : state, memory, signals caught

/proc/1/cmdline
/proc/1/comm
/proc/1/exe : link to executable

/proc/1663/task/*
/proc/2753/environ : initial env vars, NUL-separated: tr '\0' '\n' < /proc/2753/environ

/proc/1/limits
/proc/1/cgroup
/proc/1/ns pid, mnt, user

/proc/1/fd/* : symbolic links
/proc/1/fdinfo/* : info about FDs

/proc/1/mounts, mountinfo, mountstats
/proc/1/oom_*

/proc/1/map_files/* : memory-mapped files
/proc/1/maps :  memory: stack, libraries, heap, vdso, vsyscall (accelerate some syscalls) , /smaps
/proc/1663/numa_maps     Non-Uniform Memory Access
/proc/1/io : io stats

/proc/1/syscall
/proc/1/timers


**Perf numbers**
network: ec2 instances (except tiny ones)
1-2 Gbit/s Baseline, 10 Gbits/s Burst

disk: EBS volumes, general purpose, throughput
128 MiB/s - 250 MiB/s (1,000 MiB/s for io optimized)
                  ~ 2 Gbit/s

**Random**
store ssh session locally:
`ssh <user>@<host> | tee ssh.txt`

downloading a file using remote command without shell login, if allowed:
ssh hostname tar cvjf - /path/to/folder | tar xjf -


**FDs**
opaque identifier to an open file
integer returned by open() syscall
index to an array of open files kept by kernel
(0,1,2 reserved stdin, stdout, stderr)
per process:
    ls -l /proc/$pid/fd
        123 -> /path/to/file

multiple FDs can refer to the same open file, e.g. after dup() or in 2 processes after fork()
open file table: offset, access mode r/w, reference to inode object (in-memory or in-Core inode table)
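
A minimal sketch of the point above: two FDs created with dup() share one open file table entry, so they share the offset. The function name and the `path` parameter are made up for illustration.

```c
#define _DEFAULT_SOURCE   /* expose POSIX declarations even under strict -std */
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Read `nbytes` through one FD, then ask a dup()'d FD for its offset.
 * Returns that offset, or -1 on error. `path` is a hypothetical parameter:
 * any readable file holding at least `nbytes` bytes. */
off_t offset_seen_by_dup(const char *path, size_t nbytes)
{
    char buf[64];
    assert(nbytes <= sizeof buf);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    int fd2 = dup(fd);                  /* second FD, same open file table entry */
    if (fd2 < 0) {
        close(fd);
        return -1;
    }

    off_t off = -1;
    if (read(fd, buf, nbytes) == (ssize_t)nbytes)
        off = lseek(fd2, 0, SEEK_CUR);  /* query the offset via the duplicate */

    close(fd);
    close(fd2);
    return off;
}
```

Two separate open() calls on the same path would instead give two independent offsets.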

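not closing files: process runs out of file descriptors ("Too many open files", EMFILE)
systemd can limit them (LimitNOFILE=)
ulimit can limit them

closing file without killing process: gdb
    gdb -p $PID
    p close($FD)
Bash: http://logan.tw/posts/2016/02/20/open-and-close-files-in-bash/
        exec 3>&-       # close FD 3
        exec {fd}>&-    # close the FD whose number is stored in $fd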

 system-wide limit: cat /proc/sys/fs/file-max
 soft limit for this shell and its children: ulimit -n
 limit per process: cat /proc/$pid/limits
 FDs for process: ls /proc/$pid/fd

**ulimit**: for current Bash session. setrlimit() . Set at /etc/security/limits.conf
control groups, **cgroups**: for user-defined processes
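
Sketch of reading the FD limit from inside a program with getrlimit(), the call behind `ulimit -n`. The function name is hypothetical.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/resource.h>

/* Fetch the soft and hard limits on open file descriptors for the calling
 * process (what `ulimit -n` and `ulimit -Hn` print in a shell).
 * Returns 0 on success, -1 on error. */
int nofile_limits(rlim_t *soft, rlim_t *hard)
{
    assert(soft != NULL && hard != NULL);

    struct rlimit rl;
    if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
        return -1;
    *soft = rl.rlim_cur;   /* enforced value; may be raised up to the hard limit */
    *hard = rl.rlim_max;   /* ceiling; raising it needs privilege */
    return 0;
}
```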

----

- Best way to deal with a piece of hardware that constantly sends interrupts to the OS, with a command buffer of 64?
- memory leak: Valgrind, mtrace() in C
- difference between select, poll, epoll? https://devarea.com/linux-io-multiplexing-select-vs-poll-vs-epoll/#.X2ppb5NKg8N
  select -> poll -> epoll: more efficient, less portable. epoll doesn't work with regular files.
- interrupts. IRQ number, interrupt vector table. DMA. GIC? https://en.wikipedia.org/wiki/Interrupt ,  https://en.wikipedia.org/wiki/Message_Signaled_Interrupts , top-half vs bottom-half
- Memory barrier instructions prevent the CPU from reordering memory accesses across the barrier.

**CPUS**

`taskset -pc 1-2 $pid` process binding
`mount -t cpuset ...` exclusive cpu sets

**Virtual Memory**

- space & temporal locality
- pages & page table
  - programs using more (virtual) memory than RAM, loads faster
  - no need for programmers to be concerned about RAM addresses
  - process isolation & protection (r,w,x)
  - memory sharing: IPC or shared library
- memory allocation: heap: malloc, brk(). mmap()
- trying to access an address that is not mapped:  SIGSEGV, segmentation fault
- segments: text, data, stack, heap

Page size: getconf PAGESIZE (4KB)

VM is an abstraction that provides each process and the kernel with large (~unlimited), private and linear (contiguous) memory.

UNIX: swapping: all memory goes to disk. Linux swapping is paging: pages go to disk.
https://www.linuxjournal.com/article/10678

**mlock()** : locks pages of memory in RAM so they can't be swapped out.

Anonymous memory: private to a process; not tied to a file or device, like heap & stack.

"dirty" pages are modified ones (not yet written to disk). Flushed to disk:
    - 30 s
    - sync()
    - too many dirty pages dirty_ratio


"Fault" does not mean "error" but instead means "unavailable".
    - A minor fault means the page is in memory but not allocated to the requesting process or not marked as present in the memory management unit. It does not require loading from disk.
    - A major fault means the page is not in memory and has to be read from disk.
`ps -o pid,min_flt,maj_flt`
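
The same counters are available to a process through getrusage(). A rough sketch, assuming a freshly mapped anonymous region so that the first write to each page faults; the function name is made up, and the exact count depends on the kernel.

```c
#define _DEFAULT_SOURCE   /* for MAP_ANONYMOUS under strict -std */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <unistd.h>

/* Map `npages` of anonymous memory, write one byte to each page, and return
 * how many minor faults the process took meanwhile (-1 on error). */
long minor_faults_for_touch(size_t npages)
{
    assert(npages > 0);
    size_t pagesz = (size_t)sysconf(_SC_PAGESIZE);
    size_t len = npages * pagesz;
    struct rusage before, after;

    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    int rc = getrusage(RUSAGE_SELF, &before);
    for (size_t i = 0; i < npages; i++)
        p[i * pagesz] = 1;            /* first touch: the kernel allocates the page */
    rc |= getrusage(RUSAGE_SELF, &after);

    munmap(p, len);
    return rc == 0 ? after.ru_minflt - before.ru_minflt : -1;
}
```

`ru_majflt` in the same struct counts the faults that needed disk I/O.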

`pmap $pid` command: memory mappings of a process
`swapon` enables swap; `swapon -s` shows usage, same as `cat /proc/swaps`

tunable sysctl:
  vm.dirty_*
  vm.overcommit_memory
    0: reasonable (default)
    1: always
    2: never
  vm.swappiness 60 (default; higher values favor swapping out anonymous pages over dropping page cache)


The **malloc**() function allocates size bytes and returns a pointer to the allocated memory. _The memory is not initialized_.

The **free**() function frees the memory space pointed to by the pointer.

Normally, **malloc**() allocates memory from the heap, and adjusts the size of the heap as required, using brk(), sbrk()

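Stack is linear (faster), for function calls and local vars. Heap is larger and needs allocator bookkeeping (slower), for dynamically allocated memory. Global vars live in the data/bss segments.

**mmap** maps files (or anonymous memory) into the process address space. Once mapped, a file can be read and written through plain memory accesses, without a system call per access.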

https://medium.com/@sasha_f/why-mmap-is-faster-than-system-calls-24718e75ab37

echo 100000 > /proc/sys/kernel/pid_max

`free -m` -> `/proc/meminfo`

**Disks**

Not all IOPS are the same: sequential vs random access

RAID
    0 : striping (better performance, splitting I/O)
    1: mirroring (every write goes to both disks)
   10: striped mirrors (1+0), good perf
    5: striping with parity, slow writes
    6: double parity, writes even slower than 5

iostat -x
iotop
sar -d
pidstat -d -p pid 1 # default -p ALL

**Networking**
NAPI: New API: grouping packets together for interrupts
IRQ balancer: distribute among CPUs

ss -s
netstat -s
netstat -i
ip -s link
ifconfig
lsof -i
sar -n DEV

tcpdump, wireshark: promiscuous, libpcap
strace


/etc/sysconfig/network-scripts/ifcfg-eth0
DEVICE="eth0"
BOOTPROTO="dhcp"

**File system**
File system: boot block, superblock, I-node table, data blocks

Journaling file systems eliminate the need for lengthy file-system consistency checks after a system crash. A journaling file system logs (journals) all metadata updates to a special on-disk journal file before they are actually carried out.


Virtual Memory File System: tmpfs.
`mount -t tmpfs source target`
By default, a tmpfs file system is permitted to grow to half the size of RAM
A tmpfs file system mounted at /dev/shm is used for the glibc implementation of POSIX shared memory and semaphores.

Copy-On-Write COW filesystem: doesn't overwrite blocks but writes new one, update reference, add old blocks to free list.

`du -sh <dir>`

**Files**

- stat(), file metadata, mostly from inode table
- access() system call checks the accessibility of the file
- sticky bit, for directories, acts as the restricted deletion flag.  This makes it possible to create a directory that is shared by many users, who can each create and delete their own files in the directory but can’t delete files owned by other users. The sticky permission bit is commonly set on the /tmp directory for this reason. `chmod +t file`
- umask default 022 (----w--w-)

inotify_init() syscall

**Signals**

- A signal is a notification to a process that an event has occurred. Signals are sometimes described as software interrupts.
- A process can send a signal to another process or to itself (raise()), but the source is typically the kernel:
    - hardware exception (notified the kernel): dividing by zero for ex SIGFPE
    - user created
    - software event: input available on FD, timer, limit reached
 - Small integer defined in signal.h SIGxxxx
 - standard signals: 1-31
 - event generated -> pending -> delivered
 - delivered immediately to a running process, or when the process is next scheduled to run. Delivery can be deferred by adding the signal to the process's signal mask (blocking it).
 - A signal handler is a function, written by the programmer, that performs appropriate tasks in response to the delivery of a signal
 - Instead of accepting the default for a particular signal, a program can change the action: "signal disposition"
 - two ways of changing the disposition of a signal: signal() and sigaction().
 - kill():  we can use the null signal to test if a process with a specific process ID exists
 - The set of pending signals is only a mask; it indicates whether or not a signal has occurred, but not how many times it has occurred. If the same signal is generated multiple times while it is blocked, then it is recorded in the set of pending signals, and later delivered, just once. One of the differences between standard and realtime signals is that realtime signals are queued.
 - Signal delivery is typically asynchronous, meaning that the point at which the signal interrupts execution of the process is unpredictable. In some cases (e.g., hardware-generated signals), signals are delivered synchronously, meaning that delivery occurs predictably and reproducibly at a certain point in the execution of a program.
 - By default, a signal either is ignored, terminates a process (with or without a core dump), stops a running process, or restarts a stopped process. The particular default action depends on the signal type. Alternatively, a program can use signal() or better, sigaction() to explicitly ignore a signal or to establish a programmer-defined signal handler function that is invoked when the signal is delivered.
 - A process (with suitable permissions) can send a signal to another process using kill(). Sending the null signal (0) is a way of determining if a particular process ID is in use.
 - Each process has a signal mask, which is the set of signals whose delivery is currently blocked.
 - If a signal is received while it is blocked, then it remains pending until it is unblocked. Standard signals can’t be queued; that is, a signal can be marked as pending (and thus later delivered) only once.
 - Using pause(), a process can suspend execution until a signal arrives.
 - Order of delivery of multiple unblocked signals:  the Linux kernel delivers the signals in ascending order. We can’t, however, rely on (standard) signals being delivered in any particular order, since SUSv3 says that the delivery order of multiple signals is implementation-defined.
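
The null-signal test from the list above, as a small sketch. The function name is hypothetical; the EPERM case counts as "exists" because the kernel only checks permissions for a process it found.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/types.h>

/* Returns 1 if a process with this PID exists, 0 if not, -1 on other errors. */
int process_exists(pid_t pid)
{
    assert(pid > 0);          /* pid <= 0 addresses process groups, not one process */

    if (kill(pid, 0) == 0)    /* signal 0: checks only, nothing is delivered */
        return 1;
    if (errno == EPERM)
        return 1;             /* it exists, but we may not signal it */
    if (errno == ESRCH)
        return 0;
    return -1;
}
```

A PID can be reused after its process exits, so a positive answer does not prove it is still the same program.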

about 30 standard signals, 64 in total including realtime signals (signal.h)
default action for most signals is terminate
kill -l

SIGHUP  - 1 (restart/reload some apps)
SIGINT  - 2 Ctrl-C
SIGFPE  - 8
SIGKILL - 9 terminate, can't catch or block
SIGSEGV - 11 segmentation fault
SIGTERM - 15 default
SIGSTOP - 19 stop, can't catch or block; resume with SIGCONT (18)
SIGTSTP - 20 Ctrl-Z (can be caught)


 The format string contained in the Linux-specific `/proc/sys/kernel/core_pattern` file controls the naming of all core dump files produced on the system.


# Processes

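Process id PID has type pid_t (a signed int). Kernel default maximum is 32768 (/proc/sys/kernel/pid_max), can be raised up to 2^22 on 64-bit systems.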

``#include <unistd.h>``

syscalls: `getpid (), getppid ()`

the `ps` utility can show both: `ps -o pid,ppid`

fork() and exec() family
fork() returns child pid to parent, 0 to child, -1 on error

exec family:

- **P** (path): **execvp(), execlp()** : program name is searched for in PATH (without p: need to give full path)
- **V** : **execv(), execvp(), and execve()** : argument list: array of pointers to strings (**L**: arguments passed one by one, C varargs style)
- **E** (env vars): **execve() and execle()** : additional arg of array of env vars.

Because exec replaces the calling program with another one, it never returns unless an error occurs.

`execvp (program, arg_list);`

```
char* arg_list[] = {"ls", "-l", NULL};
```
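
Putting fork(), execvp() and waitpid() together, a minimal sketch. The function name is hypothetical, and the 128 + signal mapping mirrors the shell convention rather than anything the kernel returns.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run argv[0] (searched for in PATH) with the given arguments and wait for it.
 * Returns the exit code, 128 + signal number if a signal killed the child,
 * or -1 if fork/wait failed. */
int run_and_wait(char *const argv[])
{
    assert(argv != NULL && argv[0] != NULL);

    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        execvp(argv[0], argv);   /* returns only on error */
        _exit(127);              /* code shells use for "command not found" */
    }

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    if (WIFEXITED(status))
        return WEXITSTATUS(status);
    if (WIFSIGNALED(status))
        return 128 + WTERMSIG(status);
    return -1;
}
```

The child calls `_exit()` rather than `exit()` so it does not flush stdio buffers copied from the parent.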

# Signals

`<signal.h>`

Signal Disposition and handler.

masking = blocking: delivery is deferred until the signal is unblocked (not the same as ignoring)

- Program errors: SIGBUS, SIGSEGV, SIGFPE (default action terminates the program)
- Signal from a process to another: SIGKILL, SIGTERM
- Send a command to a running program: SIGUSR1, SIGUSR2.
    Also SIGHUP (wakeup/reload)

 `sigaction(signal, disp, previous disp)` to set signal disposition; in the disp arg set `sa_handler` to one of:

 - SIG_DFL : default disp
 - SIG_IGN : ignore signal
 - pointer to signal-handler function.

Signals are asynchronous. A signal-handler function should be minimal: typically it just records that the signal happened, and the main program checks periodically. (A signal can arrive while the handler is running.)

Even assigning a value to a global variable can be dangerous because the assignment may actually be carried out in two or more machine instructions, and a second signal may occur between them, leaving the variable in a corrupted state. If you use a global variable to flag a signal from a signal-handler function, it should be of the special type `sig_atomic_t`
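
A sketch of that pattern with sigaction() and a `sig_atomic_t` flag. The names are made up, and SIGUSR1 is just a convenient signal to raise on ourselves.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <signal.h>
#include <string.h>

static volatile sig_atomic_t got_sigusr1 = 0;

/* The handler does the minimum: record that the signal arrived. */
static void on_sigusr1(int signo)
{
    (void)signo;
    got_sigusr1 = 1;
}

/* Install the handler, send SIGUSR1 to ourselves, restore the previous
 * disposition, and return the flag (1 if the handler ran), or -1 on error. */
int catch_sigusr1_once(void)
{
    struct sigaction sa, old;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigusr1;
    sigemptyset(&sa.sa_mask);      /* block no extra signals inside the handler */

    got_sigusr1 = 0;
    if (sigaction(SIGUSR1, &sa, &old) != 0)
        return -1;
    raise(SIGUSR1);                /* handler has run by the time raise() returns */

    int rc = sigaction(SIGUSR1, &old, NULL);
    assert(rc == 0);
    (void)rc;
    return got_sigusr1;
}
```

A real program would check the flag from its main loop instead of right after raise().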

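A process can send itself the SIGABRT signal with abort(), which terminates the process and dumps core.

`kill (pid, SIGTERM);` (permissions: sender must be privileged, or its real or effective UID must match the target's real or saved set-user-ID)

Exit codes convention 0: success, non-zero error.
Exit status is 8 bits (0-255). Shell convention: codes above 128 mean terminated by a signal, 128 + signal number.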

# Children

Parent & child can finish independently. For the parent to wait for the child to terminate and get its status: the wait() family, 4 syscalls:

```
int child_status;
wait (&child_status);
if (WIFEXITED (child_status))
    printf ("exit code %d\n", WEXITSTATUS (child_status));
```

`waitpid` to wait for a particular child
`wait3` also returns resource usage (CPU time); `wait4` does too and lets you pick which child to wait for

Zombie: child terminated without parent waiting, without cleanup (store child's termination status for ex)
`ps` The child process is marked as defunct, and its status code is Z, for zombie.
The init process automatically cleans up any zombie child processes that it inherits.

`wait`  blocks. To clean up children asynchronously: When a child process terminates (or stopped etc), Linux sends the parent process the `SIGCHLD` signal, so parent can handle that. (Default disposition of this signal is do nothing)

# Threads

A child process gets a copy of memory, FDs etc. A **thread** doesn't get any copy, everything is shared among all threads.

A process and all its threads are executing only one program. If a thread calls an exec function, all other threads are terminated.

**pthreads** : Linux implementation of the POSIX standard thread API. `pthread.h`
Thread ID: `pthread_t`. gcc link library `-lpthread`

All threads in a single program share the same address space.

Each thread has its own call **stack**, however. This allows each thread to execute different code and to call and return from subroutines in the usual way. As in a single-threaded program, each invocation of a subroutine in each thread has its own set of local variables, which are stored on the stack for that thread.

Sometimes it is desirable to duplicate a certain variable so that each thread has a separate copy. GNU/Linux supports this by providing each thread with a `thread-specific data area`.

Use **fork** to create new processes or **pthread_create** to create threads.

The Linux **clone** system call is a generalized form of *fork* and *pthread_create* that allows the caller to specify which resources are shared between the calling process and the newly created process. Also, clone requires you to specify the memory region for the execution stack that the new process will use. It should not ordinarily be used in programs.

### Synchronization

**Synchronization**. Most threaded programs are concurrent programs. In particular, there’s no way to know when the system will schedule one thread to run and when it will run another. The system might even schedule multiple threads to run at literally the same time.

The ultimate cause of most bugs involving threads is that the threads are accessing the same data. That’s one of the powerful aspects of threads, but it can also be dangerous.
**Race conditions** bugs: the threads are racing one another to change the same data structure.

To eliminate race conditions, you need a way to make operations **atomic**. An atomic operation is indivisible and uninterruptible; once the operation starts, it will not be paused or interrupted until it completes, and no other operation will take place meanwhile.

Linux provides **mutexes**, short for MUTual EXclusion locks. A mutex is a special lock that only one thread may lock at a time. If a thread locks a mutex and then a second thread also tries to lock the same mutex, the second thread is blocked, or put on hold. Only when the first thread unlocks the mutex is the second thread unblocked—allowed to resume execution.

Linux guarantees that race conditions do not occur among threads attempting to lock a mutex; only one thread will ever get the lock, and all other threads will be blocked.
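
A small sketch of a mutex protecting a shared counter. The struct, function names and the MAX_THREADS cap are made up for the example. On older glibc, link with `-lpthread`.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define MAX_THREADS 16

struct counter {
    pthread_mutex_t lock;
    long value;
    long iters;                          /* increments each thread performs */
};

static void *worker(void *arg)
{
    struct counter *c = arg;
    for (long i = 0; i < c->iters; i++) {
        pthread_mutex_lock(&c->lock);
        c->value++;                      /* critical section */
        pthread_mutex_unlock(&c->lock);
    }
    return NULL;
}

/* Start `nthreads` threads that each add `iters` to a shared counter.
 * Returns the final value, or -1 if a thread could not be created. */
long count_with_mutex(int nthreads, long iters)
{
    assert(nthreads > 0 && nthreads <= MAX_THREADS);

    struct counter c = { .value = 0, .iters = iters };
    pthread_t tids[MAX_THREADS];
    int started = 0;

    pthread_mutex_init(&c.lock, NULL);   /* NULL attributes: default mutex kind */
    for (; started < nthreads; started++)
        if (pthread_create(&tids[started], NULL, worker, &c) != 0)
            break;
    for (int i = 0; i < started; i++)
        pthread_join(tids[i], NULL);
    pthread_mutex_destroy(&c.lock);

    return started == nthreads ? c.value : -1;
}
```

Without the lock/unlock pair, `c->value++` is a read-modify-write race and the total usually comes out short.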

Mutexes provide a mechanism for allowing one thread to block the execution of another. This opens up the possibility of a new class of bugs, called **deadlocks**. A deadlock occurs when one or more threads are stuck waiting for something that never will occur.

A simple type of deadlock may occur when the same thread attempts to lock a mutex twice in a row. The behavior in this case depends on what kind of mutex is being used. Three kinds of mutexes exist:

- Locking a **fast mutex** (the default kind) will cause a deadlock to occur.
- Locking a **recursive mutex** does not cause a deadlock. The mutex remembers how many times pthread_mutex_lock was called on it by the thread that holds the lock; that thread must make the same number of calls to pthread_mutex_unlock before the mutex is actually unlocked and another thread is allowed to lock it.
- Linux will detect and flag a double lock on an **error-checking mutex** that would otherwise cause a deadlock. The second consecutive call to pthread_mutex_lock returns the failure code `EDEADLK`.
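
Sketch of the error-checking case: the mutex kind is chosen through a mutex attribute object. The function name is hypothetical.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Lock an error-checking mutex twice from the same thread and return what
 * the second pthread_mutex_lock() call returned. */
int relock_errorcheck_mutex(void)
{
    pthread_mutexattr_t attr;
    pthread_mutex_t m;

    int rc = pthread_mutexattr_init(&attr);
    assert(rc == 0);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    pthread_mutex_init(&m, &attr);
    pthread_mutexattr_destroy(&attr);   /* the mutex keeps its own copy of the settings */

    pthread_mutex_lock(&m);
    rc = pthread_mutex_lock(&m);        /* a default mutex would hang here */

    pthread_mutex_unlock(&m);
    pthread_mutex_destroy(&m);
    return rc;
}
```

With `PTHREAD_MUTEX_RECURSIVE` the second lock would return 0 and need a second unlock.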

Occasionally, it is useful to test whether a mutex is locked without actually blocking on it. Use `pthread_mutex_trylock` for this purpose.

A **semaphore** provides a method for blocking the threads while they wait for work to do.
A semaphore is a counter that can be used to synchronize multiple threads. As with a mutex, Linux guarantees that checking or modifying the value of a semaphore can be done safely, without creating a race condition.

Each semaphore has a counter value.
A semaphore supports two basic operations:

- A *wait* operation decrements the value of the semaphore by 1. If the value is already zero, the operation blocks until the value of the semaphore becomes positive
- A *post* operation increments the value of the semaphore by 1. If the semaphore was previously zero and other threads are blocked in a wait operation on that semaphore, one of those threads is unblocked and its wait operation completes (which brings the semaphore’s value back to zero).
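
The wait/post rules above, sketched with a POSIX unnamed semaphore (`semaphore.h`). The function name is made up. It uses `sem_trywait()` at value zero so the example reports an error instead of blocking.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <semaphore.h>

/* Walk a semaphore through wait and post. Returns the errno that
 * sem_trywait() reports when the value is zero (expected: EAGAIN),
 * or -1 if setup failed. */
int trywait_errno_at_zero(void)
{
    sem_t sem;

    if (sem_init(&sem, 0, 1) != 0)  /* 0: shared among threads only; initial value 1 */
        return -1;

    int rc = sem_wait(&sem);        /* 1 -> 0, does not block */
    assert(rc == 0);

    errno = 0;
    rc = sem_trywait(&sem);         /* value is 0: a plain sem_wait() would block */
    int saved = (rc == -1) ? errno : 0;

    sem_post(&sem);                 /* 0 -> 1 */
    rc = sem_trywait(&sem);         /* succeeds now: 1 -> 0 */
    assert(rc == 0);

    sem_destroy(&sem);
    return saved;
}
```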


**Deadlocks** can occur when two (or more) threads are each blocked, waiting for a condition to occur that only the other one can cause.
There's a more general deadlock problem, which can involve not only synchronization objects such as mutexes, but also other resources, such as locks on files or devices. The problem occurs when multiple threads try to lock the *same* set of resources in *different orders*. The solution is to make sure that all threads that lock more than one resource lock them in the same order.

**Signals**. With the old LinuxThreads implementation, threads were separate processes, and a signal sent from outside the program went to the process of the main thread (not POSIX-conformant). With NPTL (Linux 2.6+), a signal sent to the process is delivered to any one thread that is not blocking it, as POSIX specifies.
Within a multithreaded program, it is possible for one thread to send a signal specifically to another thread: *pthread_kill(thread_ID, signal_number)*

Threads should be used for programs that need fine-grained parallelism. For example, if a problem can be broken into multiple, nearly identical tasks, threads may be a good choice.

### IPC

Linux command: **ipcs**

**Shared memory** segments permit fast bidirectional communication among any number of processes. Each user can both read and write, but a program must establish and follow some protocol for preventing race conditions such as overwriting information before it is read. Unfortunately, Linux does not strictly guarantee exclusive access even if you create a new shared segment with IPC_PRIVATE.

Syscalls to manage: **shm**
  shmget() : allocate
  shmctl() : control
  shmat()  : attach
  shmdt()  : detach

Processes must coordinate access to shared memory. **Semaphores** are counters that permit synchronizing multiple threads. Linux provides a distinct alternate implementation of semaphores that can be used for synchronizing processes (called process semaphores or sometimes System V semaphores). Process semaphores are allocated, used, and deallocated like shared memory segments. Although a single semaphore is sufficient for almost all uses, process semaphores come in sets.


**Mapped memory** can be used for interprocess communication or as an easy way to access the contents of a file.

Mapped memory forms an association between a file and a process’s memory. Linux splits the file into page-sized chunks and then copies them into virtual memory pages so that they can be made available in a process’s address space. Thus, the process can read the file’s contents with ordinary memory access. It can also modify the file’s contents by writing to memory. This permits fast access to files.

**mmap**
mmap(
    address/NULL,
    byte length,
    protection PROT_READ |PROT_WRITE |PROT_EXEC,
    extra options flag MAP_FIXED | MAP_PRIVATE | MAP_SHARED,
    FD of file,
    offset from the beginning of the file from which to start the map)

**munmap()** release. Linux automatically unmaps the file when the program terminates.

Specify the MAP_SHARED flag so that writes to the region are visible to other processes mapping the same file and are carried through to the underlying file.
With MAP_PRIVATE instead, writes go to a private copy-on-write copy and never reach the file.
Even with MAP_SHARED the kernel may delay writing dirty pages to disk; you can force it by calling **msync()**

The mmap call can be used for purposes other than interprocess communications. One common use is as a replacement for read and write. For example, rather than explicitly reading a file’s contents into memory, a program might map the file into memory and scan it using memory reads. For some programs, this is more convenient and may also run faster than explicit file I/O operations.
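
A sketch of that "map and scan" use: count a byte in a file through a read-only mapping. Function and parameter names are made up.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Count occurrences of byte `c` in the file at `path` by mapping it
 * read-only and scanning it like an array. Returns -1 on error. */
long count_byte_mmap(const char *path, char c)
{
    assert(path != NULL);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) != 0) {
        close(fd);
        return -1;
    }
    if (st.st_size == 0) {           /* mmap() rejects a zero length */
        close(fd);
        return 0;
    }

    const char *data = mmap(NULL, (size_t)st.st_size, PROT_READ,
                            MAP_PRIVATE, fd, 0);
    close(fd);                       /* the mapping stays valid after close() */
    if (data == MAP_FAILED)
        return -1;

    long n = 0;
    for (off_t i = 0; i < st.st_size; i++)
        if (data[i] == c)
            n++;

    munmap((void *)data, (size_t)st.st_size);
    return n;
}
```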

**Pipes**
A pipe’s data capacity is limited. If the writer process writes faster than the reader process consumes the data, and if the pipe cannot store more data, the writer process blocks until more capacity becomes available. If the reader tries to read but no data is available, it blocks until data becomes available. Thus, the pipe automatically synchronizes the two processes.

A call to pipe creates file descriptors, which are valid only within that process and its children. A process’s file descriptors cannot be passed to unrelated processes; however, when the process calls fork, file descriptors are copied to the new child process. Thus, pipes can connect only related processes.

getconf PIPE_BUF path_to_file
4096

create a pipe:

```
int fds[2];
pipe (fds);
```

duplicate FDs:

```
#include <unistd.h>
int dup(int oldfd);
```

to redirect a process’s standard input to a file descriptor fd, use **dup2**:
`dup2 (fd, STDIN_FILENO);`
The symbolic constant STDIN_FILENO represents the file descriptor for the standard input, which has the value 0. The call closes standard input and then reopens it as a duplicate of fd so that the two may be used interchangeably.
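
pipe(), fork() and dup2() combined in one sketch. It redirects the child's standard output (not input, as in the line above) into the pipe so the parent can read what the command prints. The function name is hypothetical.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run argv in a child whose stdout is the write end of a pipe, and copy
 * what it prints into buf (NUL-terminated, at most len - 1 bytes).
 * Returns the number of bytes stored, or -1 on error. */
ssize_t capture_stdout(char *const argv[], char *buf, size_t len)
{
    assert(argv != NULL && buf != NULL && len > 0);

    int fds[2];                       /* fds[0]: read end, fds[1]: write end */
    if (pipe(fds) != 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {
        close(fds[0]);
        dup2(fds[1], STDOUT_FILENO);  /* stdout now refers to the pipe */
        close(fds[1]);
        execvp(argv[0], argv);
        _exit(127);
    }

    close(fds[1]);                    /* otherwise read() would never see EOF */
    size_t used = 0;
    ssize_t n;
    while (used < len - 1 &&
           (n = read(fds[0], buf + used, len - 1 - used)) > 0)
        used += (size_t)n;
    buf[used] = '\0';

    close(fds[0]);
    waitpid(pid, NULL, 0);            /* reap the child so it is not left a zombie */
    return (ssize_t)used;
}
```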

A common use of pipes is to send data to or receive data from a program being run in a subprocess. Easier with popen() / pclose():

`FILE* stream = popen ("sort", "w");`


**FIFOs / Named Pipes**

A first-in, first-out (FIFO) file is a pipe that has a name in the filesystem. Any process can open or close the FIFO; the processes on either end of the pipe need not be related to each other.

`mkfifo /tmp/fifo`
ls -l: first char is **p** (named pipe)

write `cat > /tmp/fifo` , read `cat < /tmp/fifo`

syscall mkfifo(path, PERMISSIONS)

Access a FIFO just like an ordinary file. To communicate through a FIFO, one program must open it for writing, and another program must open it for reading.

A FIFO can have multiple readers or multiple writers. Bytes from each writer are written atomically up to a maximum size of PIPE_BUF (4KB on Linux). Chunks from simultaneous writers can be interleaved. Similar rules apply to simultaneous reads.

**Sockets**
A socket is a bidirectional communication device that can be used to communicate with another process on the same machine or with a process running on other machines.

**socket(PF_UNIX or PF_INET, SOCK_STREAM or SOCK_DGRAM, 0 protocol)**
socket(namespace, type, protocol) / close() : create / close
connect() : client connects its socket to the server's address

bind() : server binds address (labels socket with an address)
listen() : server marks the socket as accepting connections (sets the backlog)
accept(): takes a pending connection and returns a new socket FD for it

Any technique to write to a file descriptor can be used to write to a socket. The send() function, which is specific to the socket file descriptors, provides an alternative to write with a few additional choices.

Because it resides in a file system, a local socket is listed as a file (**s** in ls -l)
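
The call sequence above on a local (AF_UNIX, same value as PF_UNIX) stream socket. To keep the sketch short, both ends live in one process: on Linux connect() returns once the connection is queued in the listener's backlog. The function name and `path` are made up.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/un.h>
#include <unistd.h>

/* Send `msg` through a local stream socket bound at `path` and read it back
 * on the accepted connection. Returns bytes received, or -1 on error. */
ssize_t unix_socket_echo(const char *path, const char *msg, char *buf, size_t len)
{
    struct sockaddr_un addr;
    ssize_t got = -1;
    int conn = -1;

    assert(strlen(path) < sizeof addr.sun_path);
    memset(&addr, 0, sizeof addr);
    addr.sun_family = AF_UNIX;
    strcpy(addr.sun_path, path);
    unlink(path);                                 /* remove a stale socket file */

    int srv = socket(AF_UNIX, SOCK_STREAM, 0);
    int cli = socket(AF_UNIX, SOCK_STREAM, 0);
    if (srv < 0 || cli < 0)
        goto out;

    if (bind(srv, (struct sockaddr *)&addr, sizeof addr) != 0 ||   /* name the socket */
        listen(srv, 1) != 0 ||                                     /* mark it passive */
        connect(cli, (struct sockaddr *)&addr, sizeof addr) != 0)  /* queued in backlog */
        goto out;

    conn = accept(srv, NULL, NULL);               /* new FD for this connection */
    if (conn < 0)
        goto out;

    if (write(cli, msg, strlen(msg)) == (ssize_t)strlen(msg))
        got = read(conn, buf, len);

out:
    if (conn >= 0) close(conn);
    if (cli >= 0) close(cli);
    if (srv >= 0) close(srv);
    unlink(path);
    return got;
}
```

A real server and client would be separate processes, with the server looping on accept().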


**sparse files** lots of holes (unallocated space), so ls -l shows a bigger size than du.
fallocate: create empty file faster than filling with zeros
dd if=/dev/zero
truncate -s <size>: extends the file with a hole (sparse)


/proc/loadavg : active , runnable processes

### Syscalls

hostname -> **uname**()

**access**(file_path, bitwise or of R_OK, W_OK, and X_OK or F_OK)
    returns: 0 good, -1 bad errno: EACCES, EROFS, ENOENT

**fcntl** (FD, operation): access point for several advanced operations on file descriptors.
Allows a program to place a read lock or a write lock on a file, somewhat analogous to the mutex locks.

More than one process may hold a read lock on the same file at the same time, but only one process may hold a write lock. File may not be both locked for R & W

Note that placing a lock does not actually prevent other processes from opening the file, reading from it, or writing to it, unless they acquire locks with fcntl as well.
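
A sketch of an fcntl write lock and its conflict with a second process. The names are made up. F_SETLK is the non-blocking request; F_SETLKW would wait instead.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Try to take a whole-file write lock without blocking.
 * Returns 0 on success, -1 if the lock is refused. */
static int try_write_lock(int fd)
{
    struct flock fl;
    memset(&fl, 0, sizeof fl);
    fl.l_type = F_WRLCK;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;                    /* 0 means "up to end of file" */
    return fcntl(fd, F_SETLK, &fl);
}

/* Parent locks the file, then a forked child tries the same lock.
 * Returns 1 if the child was refused, 0 if it got the lock, -1 on error. */
int child_is_locked_out(const char *path)
{
    assert(path != NULL);

    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return -1;
    if (try_write_lock(fd) != 0) {
        close(fd);
        return -1;
    }

    pid_t pid = fork();              /* fcntl locks are not inherited by the child */
    if (pid == 0)
        _exit(try_write_lock(fd) == 0 ? 0 : 1);

    int status, result = -1;
    if (pid > 0 && waitpid(pid, &status, 0) == pid && WIFEXITED(status))
        result = WEXITSTATUS(status);

    close(fd);                       /* closing the FD drops the parent's lock */
    return result;
}
```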

**flock** also. The fcntl version has a major advantage: it works with files on NFS3 file systems

**fsync** (FD) flushes buffer to disk. Blocks until finishes. Updates file's modification time, 2 writes. **fdatasync** does not update mod time, so faster 1 write.
Also: **open(O_SYNC)** for synchronous I/O, which causes all writes to be committed to disk immediately.

Shell ulimit -> **getrlimit** ,  **setrlimit** superuser for hardlimits

**getrusage** system call retrieves process statistics from the kernel.

**gettimeofday**

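**mlock**(start address, length). Any pages in the range are locked in RAM. Used for performance or security reasons (e.g. keeping secrets out of swap).
**mlockall** locks a program's entire address space
**munlock** to unlock
Unprivileged processes can lock up to RLIMIT_MEMLOCK (`ulimit -l`); beyond that needs superuser (CAP_IPC_LOCK)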

If a program attempts to perform an operation on a memory location that is not allowed by these mmap permissions, it is terminated with a SIGSEGV (segmentation violation) signal.
Use mmap or **mprotect** together with a SIGSEGV handler to monitor memory access.


**readlink** system call retrieves the target of a symbolic link

**sendfile** system call provides an efficient mechanism for copying data from one file descriptor to another (the intermediate buffer can be eliminated).

**setitimer** system call is a generalization of the alarm call. It schedules the delivery of a signal at some point in the future after a fixed amount of time has elapsed.

**sysinfo** system call fills a structure with system statistics: uptime, totalram, freeram, procs


----

A directory that has the sticky bit set allows you to delete a file only if you are the owner of the file.

whoami command is just like id, except that it shows only the effective user ID.
su program is a setuid program

----
Valgrind, gdb

Finding Memory Leaks Using **mtrace**: The mtrace tool helps diagnose the most common error when using dynamic memory: failure to match allocations and deallocations.

The **ccmalloc library** diagnoses dynamic memory errors by replacing malloc and free with code tracing their use. Link in gcc.

**Electric Fence** halts executing programs on the exact line where a write or a read outside an allocation occurs.

----

The **stat** call obtains this information about a file. Call stat with the path to the file you’re interested in and a pointer to a variable of type struct stat. If the call to stat is successful, it returns 0 and fills in the fields of the structure with information about that file; otherwise, it returns -1.
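
Minimal sketch of the stat() call described above. The function name is hypothetical.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/stat.h>

/* Returns the size in bytes of the file at `path`, or -1 if stat() fails
 * (errno then says why, e.g. ENOENT). */
long long file_size(const char *path)
{
    assert(path != NULL);

    struct stat st;                  /* filled in by the kernel on success */
    if (stat(path, &st) != 0)
        return -1;
    return (long long)st.st_size;
}
```

Other fields of `struct stat` include `st_mode`, `st_uid`, `st_mtime`, and `st_blocks`. For a sparse file, `st_blocks * 512` comes out smaller than `st_size`.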

----
pidof

- Ch. 1: Introduction to the Linux Kernel
- Ch. 2: Getting Started with the Kernel
- Ch. 3: Process Management
- Ch. 5: System Calls
- ...
- Ch. 12: Memory Management
- Ch. 13: The Virtual Filesystem

---

There are often times in a kernel when you do not want to do work at this moment. A good example of this is during interrupt processing. When the interrupt was asserted, the processor stopped what it was doing and the operating system delivered the interrupt to the appropriate device driver. Device drivers should not spend too much time handling interrupts as, during this time, nothing else in the system can run. There is often some work that could just as well be done later on. **Linux's bottom half handlers** were invented so that device drivers and other parts of the Linux kernel could queue work to be done later on.

The kernel's interrupt handling data structures are set up by the device drivers as they request control of the system's interrupts. To do this the device driver uses a set of Linux kernel services that are used to request an interrupt, enable it and to disable it.
The individual device drivers call these routines to register their interrupt handling routine addresses.

Some interrupts are fixed by convention for the PC architecture and so the driver simply requests its interrupt when it is initialized. This is what the floppy disk device driver does; it always requests IRQ 6. There may be occasions when a device driver does not know which interrupt the device will use. This is not a problem for PCI device drivers as they always know what their interrupt number is. Unfortunately there is no easy way for ISA device drivers to find their interrupt number. Linux solves this problem by allowing device drivers to probe for their interrupts.

First, the device driver does something to the device that causes it to interrupt. Then all of the unassigned interrupts in the system are enabled. This means that the device's pending interrupt will now be delivered via the programmable interrupt controller. Linux reads the interrupt status register and returns its contents to the device driver. A non-zero result means that one or more interrupts occurred during the probe. The driver now turns probing off and the unassigned interrupts are all disabled.

----

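- read() and write() can be used with files, sockets, pipes. By default they block until the I/O can proceed.

- select(), poll(), epoll_wait() let one thread watch many file descriptors: they wait until at least one is ready (or a timeout expires) and return the descriptors that are ready, so the following read() or write() does not block.

- epoll doesn't work with regular files (epoll_ctl() fails with EPERM); select() and poll() always report regular files as ready.

- For storage I/O, the blocking problem has classically been solved with thread pools.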


- Database software may not want to use the Linux page cache. For that case a file can be opened with the O_DIRECT flag, which requests direct access to the device and bypasses the cache. This is commonly referred to as Direct I/O.

- With Linux 2.6, the kernel gained an Asynchronous I/O (linux-aio for short) interface. It is simple on the surface: submit I/O with the io_submit system call, and later call io_getevents to receive the events that are ready. It was meant to let programmers make their code fully asynchronous, but because of the way it evolved it fell short of these expectations: it is only reliably asynchronous for files opened with O_DIRECT, and even then io_submit can block in some cases.

- io_uring (Linux 5.1 and later): truly asynchronous, works with any kind of I/O.
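A minimal sketch of the readiness model described in the list above, using Python's `select.epoll` wrapper on a pipe (Linux only). The helper name `ready_fds` is made up for illustration. It also shows the regular-file limitation: registering one with epoll is refused.

```python
import os
import select
import tempfile

def ready_fds(ep, timeout):
    """Return the fds that epoll reports as ready within `timeout` seconds."""
    return [fd for fd, _events in ep.poll(timeout)]

ep = select.epoll()
r, w = os.pipe()
ep.register(r, select.EPOLLIN)

ready_before = ready_fds(ep, 0.1)   # pipe is empty: the wait times out
os.write(w, b"ping")
ready_after = ready_fds(ep, 1.0)    # pipe has data: r is reported ready
data = os.read(r, 4)                # does not block, epoll said r is readable

# A regular file cannot be registered with epoll (EPERM).
with tempfile.TemporaryFile() as f:
    try:
        ep.register(f.fileno(), select.EPOLLIN)
        regular_file_ok = True
    except PermissionError:
        regular_file_ok = False

ep.close()
os.close(r)
os.close(w)
```

Note that `ep.poll()` itself blocks until something is ready or the timeout expires; what it avoids is blocking on any single descriptor.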

Alexei Starovoitov (eBPF) and Jens Axboe (io_uring, block layer) work at Facebook

----

5 Process States:  (state field in struct **task_struct**)

- `D`: uninterruptible sleep (usually I/O); can't be killed with a signal. `TASK_UNINTERRUPTIBLE`
- `R`: running or runnable (on the run queue). `TASK_RUNNING`
- `S`: interruptible sleep (waiting for an event to complete or a signal to arrive). `TASK_INTERRUPTIBLE`
- `T`, `t`: stopped, either by a job control signal (`T`) or because it is being traced (`t`). `__TASK_STOPPED`, `__TASK_TRACED`
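The state letters above are what the kernel exposes in `/proc/<pid>/stat` (and what `ps` prints). A small sketch for reading one, assuming a Linux `/proc`; the function name `proc_state` is only illustrative:

```python
import os

def proc_state(pid):
    """Return the one-letter state code from /proc/<pid>/stat."""
    with open(f"/proc/{pid}/stat") as f:
        stat = f.read()
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split after the last ')'. The state is the next field.
    return stat.rsplit(")", 1)[1].split()[0]

# A process that is busy reading its own stat file is, by definition, running.
own_state = proc_state(os.getpid())
```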

**System calls** and **exception handlers** are well-defined interfaces into the kernel. A process can begin executing in kernel-space only through one of these interfaces; all access to the kernel goes through them.

Most exceptions issued by the CPU are interpreted by Linux as **1) error conditions**. If, for instance, a process performs a division by zero, the CPU raises a “Divide error” exception, and the corresponding exception handler sends a SIGFPE signal to the current process.

There are a couple of cases, however, where Linux exploits CPU exceptions to **2) manage hardware resources** more efficiently. The **Page Fault** exception is one: whenever code accesses an address whose page is currently not mapped or not accessible, the CPU generates a page fault exception and calls the page fault handler.

Process creation is split into two functions: fork() and exec(). The first, fork(), creates a child process that is a copy of the current task. It differs from the parent only in its PID (which is unique), its PPID (parent’s PID, which is set to the original process), and certain statistics and resources, such as pending signals, which are not inherited. The second function, exec(), loads a new executable into the address space and begins executing it.
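The fork-then-exec pattern can be sketched from user space with Python's `os` wrappers around the same system calls. The helper `run_child` and the exit code 7 are made up for the illustration:

```python
import os
import sys

def run_child(argv):
    """fork(), exec argv in the child, and return (child_pid, exit_code)."""
    pid = os.fork()
    if pid == 0:
        # Child: a copy of the parent with its own PID. Replace its image.
        try:
            os.execvp(argv[0], argv)
        finally:
            os._exit(127)          # only reached if exec failed
    # Parent: fork() returned the child's PID; wait for it to finish.
    _, status = os.waitpid(pid, 0)
    return pid, os.WEXITSTATUS(status)

child_pid, exit_code = run_child(
    [sys.executable, "-c", "import sys; sys.exit(7)"]
)
```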

In Linux, fork() is implemented through the use of copy-on-write pages. **Copy-on-write** (or COW) is a technique to delay or altogether prevent copying of the data. Rather than duplicate the process address space, the parent and the child share a single copy, marked read-only. A page is duplicated only when one of them writes to it. If the pages are never written, for example when exec() is called immediately after fork(), they never need to be copied.
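Copy-on-write itself is invisible from user space; what can be observed is its effect: after fork() a write in the child lands in the child's private copy and the parent's data is unchanged. A rough sketch (the helper name and values are illustrative, and the pipe is only there to report the child's view back):

```python
import os

def fork_and_mutate(data):
    """Let a forked child modify `data`; return (child's view, parent's view)."""
    r, w = os.pipe()
    pid = os.fork()
    if pid == 0:
        os.close(r)
        data[0] = 99                     # write goes to the child's own copy of the page
        os.write(w, bytes([data[0]]))
        os._exit(0)
    os.close(w)
    child_view = os.read(r, 1)[0]
    os.close(r)
    os.waitpid(pid, 0)
    return child_view, data[0]

buf = bytearray(b"\x01" * 4096)
child_value, parent_value = fork_and_mutate(buf)
```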

Linux implements fork() via the **clone()** system call.

The **vfork()** system call has the same effect as fork(), except that the page table entries
of the parent process are not copied. Instead, the child executes as the sole thread in the
parent’s address space, and the parent is blocked until the child either calls exec() or exits.

**Threads** provide multiple threads of execution within the same program in a *shared memory address space*. They can also share open files and other resources. Threads enable concurrent programming and, on multiple processor systems, true parallelism. Linux has a unique implementation of threads. To the Linux kernel, there is no concept of a thread. Linux implements all threads as standard processes.

Threads are created the same as normal tasks, with the exception that the clone() system call is passed **flags** corresponding to the specific resources to be shared:
`clone(CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND, 0);`
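Because each thread is a task of its own, every thread shows up as an entry under `/proc/<pid>/task/` while still sharing the process's memory. A small sketch, assuming Linux and Python 3.8+ (for `threading.get_native_id`); the worker function is made up for the example:

```python
import os
import threading

shared = []                      # one list, visible to all threads (shared address space)
tids = []                        # kernel task IDs reported by the workers
release = threading.Event()
barrier = threading.Barrier(4)   # 3 workers + the main thread

def worker(n):
    shared.append(n)
    tids.append(threading.get_native_id())   # TID as the kernel sees it
    barrier.wait()               # all workers have registered
    release.wait()               # stay alive while main inspects /proc

threads = [threading.Thread(target=worker, args=(i,)) for i in range(3)]
for t in threads:
    t.start()
barrier.wait()
task_ids = {int(d) for d in os.listdir("/proc/self/task")}
release.set()
for t in threads:
    t.join()
```

The main thread's TID equals the PID, so `os.getpid()` appears in `task_ids` alongside the three worker TIDs.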

It is often useful for the kernel to perform some operations in the background. The kernel accomplishes this via **kernel threads**: standard processes that exist solely in kernel-space. The significant difference between kernel threads and normal processes is that kernel threads do not have an address space.

All new kernel threads are forked off the **kthreadd** kernel process.

When a process **terminates**, the kernel releases the resources owned by the process and notifies its parent of its demise. The bulk of the work is handled by do_exit(), defined in kernel/exit.c, which completes a number of chores. Afterwards the task is left in the EXIT\_ZOMBIE state, which enables the system to obtain information about a child process after it has terminated. After the parent has obtained information on its terminated child, or signified to the kernel that it does not care, the child’s task_struct is deallocated.
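The zombie window can be observed from user space: a child that has exited but has not yet been waited for shows state `Z` in `/proc/<pid>/stat`, and disappears once the parent reaps it. A sketch under the assumption that SIGCHLD is not being ignored (otherwise the child is reaped automatically); the helper name and exit code are illustrative:

```python
import os
import time

def proc_state(pid):
    """Return the one-letter state code from /proc/<pid>/stat."""
    with open(f"/proc/{pid}/stat") as f:
        return f.read().rsplit(")", 1)[1].split()[0]

pid = os.fork()
if pid == 0:
    os._exit(3)                      # child terminates immediately

# Until the parent waits, the exited child lingers as a zombie.
deadline = time.monotonic() + 5
state = proc_state(pid)
while state != "Z" and time.monotonic() < deadline:
    time.sleep(0.01)
    state = proc_state(pid)

_, status = os.waitpid(pid, 0)       # reap: the kernel can now free the task
exit_code = os.WEXITSTATUS(status)
still_exists = os.path.exists(f"/proc/{pid}")
```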


----

The kernel can preempt a task running in the kernel so long as it does not hold a lock.

----

The mechanism to signal the kernel is a software interrupt: Incur an exception, and the system will switch to kernel mode and execute the exception (system call) handler.

The defined software interrupt on x86 is interrupt number **128**. The system call number must be passed into the kernel. On x86, the syscall number is fed to the kernel via the **eax** register.
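To see that a system call is identified only by its number, one can invoke libc's generic `syscall()` wrapper with a raw number. This is a sketch with assumptions: the table below lists the getpid number for a few architectures only, and on x86-64 libc places the number in `rax` and uses the `syscall` instruction rather than `int 0x80`.

```python
import ctypes
import os
import platform

# getpid syscall number per architecture (partial table, extend as needed).
SYS_GETPID = {"x86_64": 39, "i386": 20, "i686": 20, "aarch64": 172}

libc = ctypes.CDLL(None, use_errno=True)

def raw_getpid():
    """Call getpid by number through libc's syscall() wrapper."""
    nr = SYS_GETPID[platform.machine()]
    return libc.syscall(nr)

pid_via_syscall = raw_getpid()
```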

Later x86 processors added a feature known as **sysenter** (x86-64 uses the similar **syscall** instruction). This feature provides a faster, more specialized way of trapping into a kernel to execute a system call than using the int interrupt instruction.


Interrupts enable hardware to signal the processor asynchronously.

The **device driver** that manages a given piece of hardware registers an interrupt handler.

An interrupt handler has two goals: to execute quickly and to perform a potentially large amount of work. These goals clearly conflict with one another. Because of this, the processing of interrupts is split into two parts, or halves. The interrupt handler is the **top half**. The top half is run immediately upon receipt of the interrupt and performs only the work that is time-critical, such as acknowledging receipt of the interrupt or resetting the hardware. Work that can be performed later is deferred until the **bottom half**.

Bottom halves are implemented with three mechanisms: **softirqs**, **tasklets**, and **work queues**.

**/proc/interrupts** : statistics related to interrupts on the system.
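A small parser sketch for the `/proc/interrupts` layout: a header row naming the CPUs, then one row per IRQ with a count per CPU followed by a description. The sample text is invented for the example; on a real system the input would be `open("/proc/interrupts").read()`.

```python
def parse_interrupts(text):
    """Map each IRQ label to its per-CPU counts and trailing description."""
    lines = text.splitlines()
    ncpu = len(lines[0].split())            # header row: CPU0 CPU1 ...
    table = {}
    for line in lines[1:]:
        name, _, rest = line.partition(":")
        fields = rest.split()
        counts = []
        for field in fields[:ncpu]:         # some rows (e.g. ERR) have fewer columns
            if not field.isdigit():
                break
            counts.append(int(field))
        table[name.strip()] = {
            "per_cpu": counts,
            "desc": " ".join(fields[len(counts):]),
        }
    return table

SAMPLE = """\
           CPU0       CPU1
  0:         36          0   IO-APIC   2-edge      timer
  6:          3          0   IO-APIC   6-edge      floppy
NMI:          1          2   Non-maskable interrupts
ERR:          0
"""

parsed = parse_interrupts(SAMPLE)
```

Summing `per_cpu` across rows gives a quick view of how evenly interrupts are spread over the CPUs.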

----

The kernel performs all page I/O through the page cache.

Writes are maintained in the page cache through a process called write-back caching, which keeps pages “dirty” in memory and defers writing the data back to disk. The flusher “gang” of kernel threads handles this eventual page writeback.
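The amount of dirty data waiting for writeback is visible in `/proc/meminfo` as the `Dirty` and `Writeback` fields. A parsing sketch with invented sample values; on a real system, pass it `open("/proc/meminfo").read()`:

```python
def parse_meminfo(text):
    """Map each /proc/meminfo key to its integer value (kB for most fields)."""
    info = {}
    for line in text.splitlines():
        key, _, value = line.partition(":")
        if value.strip():
            info[key.strip()] = int(value.split()[0])
    return info

SAMPLE = """\
MemTotal:       16303428 kB
Cached:          5120000 kB
Dirty:               412 kB
Writeback:             0 kB
HugePages_Total:       0
"""

info = parse_meminfo(SAMPLE)
pending_kb = info["Dirty"] + info["Writeback"]   # data not yet on disk
```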

Linux implements a modified version of LRU, called the **two-list strategy**. Instead of maintaining one list, the LRU list, Linux keeps two lists: the **active list** and the **inactive list**. Pages on the active list are considered “hot” and are not available for eviction. Pages on the inactive list are available for cache eviction. Pages are placed on the active list only when they are accessed while already residing on the inactive list. Both lists are maintained in a pseudo-LRU manner: Items are added to the tail and removed from the head, as with a queue. The lists are kept in balance: If the active list becomes much larger than the inactive list, items from the active list’s head are moved back to the inactive list, making them available for eviction. The two-list strategy solves the only-used-once failure in a classic LRU and also enables simpler, pseudo-LRU semantics to perform well. This two-list approach is also known as LRU/2; it can be generalized to n-lists, called LRU/n.
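A toy model of the two-list strategy, not the kernel's implementation. The class name, the `capacity` limit, and the balance threshold `max_active_ratio` are assumptions made for the sketch; the notes only say the active list must not become "much larger" than the inactive one.

```python
from collections import OrderedDict

class TwoListCache:
    def __init__(self, capacity, max_active_ratio=2):
        self.capacity = capacity
        self.max_active_ratio = max_active_ratio   # assumed balance threshold
        self.active = OrderedDict()     # first item = head, last item = tail
        self.inactive = OrderedDict()

    def access(self, page):
        """Touch a page; return True on a cache hit, False on a miss."""
        if page in self.active:
            return True                 # already hot, stays in place (pseudo-LRU)
        if page in self.inactive:
            del self.inactive[page]
            self.active[page] = True    # second access: promote to active tail
            self._rebalance()
            return True
        if len(self.active) + len(self.inactive) >= self.capacity:
            self._evict_one()
        self.inactive[page] = True      # first access: enter at inactive tail
        return False

    def _rebalance(self):
        # Demote from the active head while active is much larger than inactive.
        while len(self.active) > self.max_active_ratio * max(len(self.inactive), 1):
            page, _ = self.active.popitem(last=False)
            self.inactive[page] = True

    def _evict_one(self):
        # Only inactive pages are evicted; demote one first if none exist.
        if not self.inactive:
            page, _ = self.active.popitem(last=False)
            self.inactive[page] = True
        self.inactive.popitem(last=False)

cache = TwoListCache(capacity=4)
for page in ["a", "a", "b", "b"]:       # a and b are used twice: they become hot
    cache.access(page)
for page in ["c", "d", "e", "f"]:       # one-time scan of four other pages
    cache.access(page)
```

After the scan, `a` and `b` are still cached on the active list and only the scanned pages have been evicted against each other. A single-list LRU of the same size would have dropped `a` and `b`, which is the only-used-once failure mentioned above.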

----

Magic SysRq (system request) key: Alt+PrintScreen (PC).
`echo 1 > /proc/sys/kernel/sysrq`

Ex: SysRq-i Sends a SIGKILL to all processes except init.

----

Anycast: multiple servers, 1 IP address

unshare command: namespaces

lsb_release: distro specific info

nproc: number of CPUs

List the processes consuming the most CPU: `ps aux --sort=-%cpu | awk '{print $3, $2, $11}' | head -n 15`