How does RAID parity work?
--------------------------
Parity data is used by some RAID levels to achieve redundancy. If a drive in
the array fails, remaining data on the other drives can be combined with the
parity data (using the Boolean XOR function) to reconstruct the missing data.
For example, suppose two drives in a three-drive RAID 5 array contained the
following data:
Drive 1: 01101101
Drive 2: 11010100
To calculate parity data for the two drives, an XOR is performed on their data:
01101101
XOR 11010100
------------
10111001
The resulting parity data, 10111001, is then stored on Drive 3.
Should any of the three drives fail, the contents of the failed drive can be
reconstructed on a replacement drive by subjecting the data from the
remaining drives to the same XOR operation. If Drive 2 were to fail, its data
could be rebuilt using the XOR of the contents of the two remaining drives,
Drive 1 and Drive 3:
Drive 1: 01101101
Drive 3: 10111001
as follows:
10111001
XOR 01101101
------------
11010100
The result of that XOR calculation yields Drive 2's contents. 11010100 is
then stored on Drive 2, fully repairing the array. This same XOR concept
applies to larger arrays, using any number of disks. In the case of a RAID 3
array of 12 drives, 11 drives participate in the XOR calculation shown above
and yield a value that is then stored on the dedicated parity drive.
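The worked example above can be checked with a short Python sketch;
`functools.reduce` generalizes the pairwise XOR to any number of data drives,
as in the 12-drive RAID 3 case (the sample drive values below are
illustrative):

```python
from functools import reduce
from operator import xor

drive1 = 0b01101101
drive2 = 0b11010100

# Parity stored on Drive 3 is the XOR of the data drives.
parity = drive1 ^ drive2
print(f"{parity:08b}")    # 10111001

# If Drive 2 fails, XOR the survivors to rebuild its contents.
rebuilt = drive1 ^ parity
print(f"{rebuilt:08b}")   # 11010100

# The same idea scales to any number of data drives (e.g. RAID 3):
# parity is the XOR of all of them, and any one lost drive is the
# XOR of the remaining drives plus the parity.
drives = [0b00001111, 0b01010101, 0b00110011]   # made-up sample data
wide_parity = reduce(xor, drives)
assert reduce(xor, drives[1:]) ^ wide_parity == drives[0]
```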
What is data striping?
----------------------
In computer data storage, data striping is the technique of segmenting
logically sequential data, such as a file, so that consecutive segments are
stored on different physical storage devices.
Striping is useful when a processing device requests data more quickly than a
single storage device can provide it. By spreading segments across multiple
devices which can be accessed concurrently, total data throughput is
increased. It is also a useful method for balancing I/O load across an array
of disks.
Striping is used across disk drives in redundant array of independent disks
(RAID) storage, network interface controllers, different computers in
clustered file systems and grid-oriented storage, and RAM in some systems.
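A minimal sketch of round-robin striping in Python; the device count and
stripe-unit size here are illustrative, not tied to any particular RAID
implementation:

```python
def stripe(data: bytes, num_devices: int, stripe_unit: int) -> list[list[bytes]]:
    """Cut data into stripe units and deal them out round-robin."""
    devices: list[list[bytes]] = [[] for _ in range(num_devices)]
    chunks = [data[i:i + stripe_unit] for i in range(0, len(data), stripe_unit)]
    for i, chunk in enumerate(chunks):
        devices[i % num_devices].append(chunk)
    return devices

# Consecutive 2-byte segments land on different devices, so all
# three devices can be read concurrently.
print(stripe(b"ABCDEFGHIJKL", num_devices=3, stripe_unit=2))
# [[b'AB', b'GH'], [b'CD', b'IJ'], [b'EF', b'KL']]
```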
RAID Disk types
---------------
Data disk
    - holds data stored on behalf of clients within RAID groups
    - and any data generated about the state of the storage system as a
      result of a malfunction
Spare disk
    - does not hold usable data, but is available to be added to a RAID
      group in an aggregate
    - any functioning disk that is not assigned to an aggregate but is
      assigned to a system functions as a hot spare disk
Parity disk
    - stores row parity information that is used for data reconstruction
      when a single disk drive fails within the RAID group
dParity disk
    - stores diagonal parity information that is used for data
      reconstruction when two disk drives fail within the RAID group, if
      RAID-DP is enabled
How are RAID groups formed?
---------------------------
    - they are formed automatically when you add disks to an aggregate
    - Data ONTAP adds new drives to the most recently created RAID group
      until it reaches its maximum size (default behavior)
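The default fill behavior can be sketched as follows. This is a hypothetical
model, not Data ONTAP code: new drives go into the most recently created RAID
group until it reaches `raidsize`, then a new group is started.

```python
def add_disks(raid_groups, new_disks, raidsize):
    # Default behavior: fill the most recently created RAID group
    # up to raidsize before starting a new one.
    for disk in new_disks:
        if not raid_groups or len(raid_groups[-1]) >= raidsize:
            raid_groups.append([])
        raid_groups[-1].append(disk)
    return raid_groups

print(add_disks([], [f"disk{i}" for i in range(5)], raidsize=3))
# [['disk0', 'disk1', 'disk2'], ['disk3', 'disk4']]
```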
What size of RAID group should I choose?
----------------------------------------
    - the recommended range of RAID group sizes is between 12 and 20 disks
Data ONTAP Storage units
------------------------
1. disks
    - physical device that you put into the shelves
    - e.g., SATA, BSAS, or SAS
2. raid groups
    - a collection of one or more disks providing a RAID level
    - if it's RAID-DP, there is at least one data disk and 2 parity disks
      (parity disk + dParity disk)
3. plexes
    - a collection of one or more raid groups
4. aggregates
    - consists of disks within the raid groups
    - a collection of one or two plexes
    - if unmirrored, it contains a single plex
    - if mirrored, it contains 2 plexes
5. volumes
    - 2 types: traditional and FlexVol
    - traditional -> inherits the properties of its containing aggregate
      (directly tied to the aggr)
    - FlexVol -> loosely coupled to its containing aggregate; you can alter
      its properties on the fly (RECOMMENDED!!!)
6. qtrees
    - subdirectory of the root directory of a volume
    - you can use qtrees to subdivide a volume in order to group LUNs
7. LUNs
    - logical unit of storage under a volume
    - you can create LUNs in the root of a volume (traditional or flexible)
      or in the root of a qtree
    - NOTE: don't create LUNs under Data ONTAP's root volume (/vol/vol0)
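The containment hierarchy above (disks -> raid groups -> plexes ->
aggregates) can be modeled with a few dataclasses; the class and field names
here are illustrative, not an actual ONTAP API.

```python
from dataclasses import dataclass

@dataclass
class RaidGroup:
    disks: list          # disk names, e.g. "3c.10.21"

@dataclass
class Plex:
    raid_groups: list    # one or more RAID groups

@dataclass
class Aggregate:
    plexes: list         # one plex if unmirrored, two if mirrored

    @property
    def mirrored(self) -> bool:
        return len(self.plexes) == 2

aggr = Aggregate(plexes=[Plex(raid_groups=[RaidGroup(disks=["3c.10.21", "3c.10.22"])])])
print(aggr.mirrored)     # False
```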
Disk types based on speed
-------------------------
SAS - Serial Attached SCSI (faster)
BSAS - Bridged SAS / SATA drives (slower); this means you have a SATA disk
       in a SAS enclosure (it is a DS4243 shelf)
How are disks named?
--------------------
{slot}{port}.{shelfID}.{bay}
example: 3c.10.21
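A small sketch that splits the three-part name format above. The helper name
is hypothetical, and it deliberately handles only this form, not shorter
names such as 1b.29 seen in older output:

```python
import re

def parse_disk_name(name: str) -> dict:
    """Split a {slot}{port}.{shelfID}.{bay} disk name, e.g. '3c.10.21'."""
    m = re.fullmatch(r"(\d+)([a-z])\.(\d+)\.(\d+)", name)
    if m is None:
        raise ValueError(f"unrecognized disk name: {name}")
    slot, port, shelf, bay = m.groups()
    return {"slot": int(slot), "port": port, "shelfID": int(shelf), "bay": int(bay)}

print(parse_disk_name("3c.10.21"))
# {'slot': 3, 'port': 'c', 'shelfID': 10, 'bay': 21}
```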
How are disk firmwares updated?
-------------------------------
    - when you assign the disk to a storage system
    - or manually (ask NetApp technical support what your target disk
      firmware is)
Commands
--------
Displaying
    # list all disks on the cluster
    storage show disk
    storage show disk -T   --> adds a TYPE-of-disk column at the end
                               (SAS, BSAS, etc.)
    # show info for a particular disk
    disk show 1b.29
    # display unowned disks
    disk show -n
    # show busy disks
    stats show disk:*:disk_busy
    example output:
    disk:39CF5F4E:715905F5:E3AEEB55:DCADA33F:0[...]000:disk_busy:100%
    # this command shows more of the hardware side of the disk
    filer01> storage show disk 4b.01.13
    Disk:              3a.01.13
    Shelf:             1
    Bay:               13
    Serial:            6SL3LM8B0000N2367MFW
    Vendor:            NETAPP
    Model:             X412_S15K7560A15
    Rev:               NA00
    RPM:               15000
    WWN:               5:000:c5004b:e8877c
    UID:               5000C500:4BE8877F:00000000:00000000:00[...]
    Downrev:           yes
    Pri Port:          A
    Sec Name:          4b.01.13
    Sec Port:          B
    Power-on Hours:    N/A
    Blocks read:       0
    Blocks written:    0
    Time interval:     00:00:00
    Glist count:       0
    Scrub last done:   00:00:00
    Scrub count:       0
    LIP count:         0
    Dynamically qualified:  No
    Current owner:     4294967295
    Home owner:        4294967295
    Reservation owner: 0
    filer01>
    # to see the firmware version of a NetApp disk (it is the column with NAxx)
    sysconfig -a
    storage show disk
    # how to see the raid size of an aggregate?
    aggr status -v
        * raid size = data disks + parity disks
        * under Options, example: raidsize=16
Assigning
    # unown a disk
    disk assign 0c.51 -s unowned -f
    # assign a spare disk to another system
    c
    # assign multiple disks to the local node
    disk assign disk_1 disk_2 … disk_N
Tutorials
---------
Manual Update of Disk Firmware
    1. remove all files under /etc/disk_fw (make a backup just to be sure)
    2. download the target disk firmwares
    3. within 2 minutes it should start updating on its own
    4. once the disk firmwares are upgraded, verify by issuing this command:
       storage show disk -x
Debugging/Troubleshooting
-------------------------
data is copied to a spare disk:

Sun Sep 6 04:13:56 EDT [filer01:raid.rg.diskcopy.start:notice]:
/aggr1/plex0/rg2: starting disk copy from 3a.01.11 to 4b.02.0
Sun Sep 6 04:14:03 EDT [filer01:raid.disk.predictiveFailure:warning]: Disk
/aggr1/plex0/rg2/3a.01.11 Shelf 1 Bay 11 [NETAPP X412_S15K7560A15 NA06] S/N
[6SL3KS740000N23610JD] reported a predictive failure and it is prefailed; it
will be copied to a spare and failed
...
Sun Sep 6 06:31:28 EDT [filer01:raid.rg.diskcopy.done:notice]:
/aggr1/plex0/rg2: disk copy from 3a.01.11 to 4b.02.0 completed in 2:17:32.30