Thursday, May 31, 2018

Disks & Raid Groups


How does RAID parity work?
--------------------------

Parity data is used by some RAID levels to achieve redundancy. If a drive in the
array fails, remaining data on the other drives can be combined with the parity
data (using the Boolean XOR function) to reconstruct the missing data. For
example, suppose two drives in a three-drive RAID 5 array contained the
following data:

Drive 1: 01101101
Drive 2: 11010100

To calculate parity data for the two drives, an XOR is performed on their data:

    01101101
XOR 11010100
------------
    10111001

The resulting parity data, 10111001, is then stored on Drive 3.

Should any of the three drives fail, the contents of the failed drive can be
reconstructed on a replacement drive by subjecting the data from the remaining
drives to the same XOR operation. If Drive 2 were to fail, its data could be
rebuilt using the XOR results of the contents of the two remaining drives,
Drive 1 and Drive 3:

Drive 1: 01101101
Drive 3: 10111001

as follows:

    10111001
XOR 01101101
------------
    11010100

The result of that XOR calculation yields Drive 2's contents. 11010100 is then
stored on Drive 2, fully repairing the array. This same XOR concept applies
similarly to larger arrays, using any number of disks. In the case of a RAID 3
array of 12 drives, 11 drives participate in the XOR calculation shown above and
yield a value that is then stored on the dedicated parity drive.
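The XOR arithmetic above can be sketched in a few lines of Python (illustrative only; a real RAID controller does this per stripe, not per whole-drive byte):

```python
from functools import reduce

def parity(blocks):
    """XOR all blocks together to produce the parity value."""
    return reduce(lambda a, b: a ^ b, blocks)

# The three-drive example from above:
drive1 = 0b01101101
drive2 = 0b11010100
drive3 = parity([drive1, drive2])       # parity disk holds 0b10111001

# If drive 2 fails, XOR the survivors to rebuild it:
rebuilt = parity([drive1, drive3])
assert rebuilt == drive2

# The same call works for a 12-drive RAID 3 set:
# parity(list_of_11_data_values) yields the dedicated parity value.
```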

What is data striping?
----------------------

In computer data storage, data striping is the technique of segmenting logically
sequential data, such as a file, so that consecutive segments are stored on
different physical storage devices.

Striping is useful when a processing device requests data more quickly than a
single storage device can provide it. By spreading segments across multiple
devices which can be accessed concurrently, total data throughput is increased.
It is also a useful method for balancing I/O load across an array of disks.
Striping is used across disk drives in redundant array of independent disks
(RAID) storage, network interface controllers, different computers in clustered
file systems and grid-oriented storage, and RAM in some systems.
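The round-robin idea can be sketched as follows (a toy model; real striping happens at the block layer with fixed stripe widths):

```python
def stripe(data: bytes, n_disks: int, chunk: int):
    """Split logically sequential data into fixed-size chunks and
    deal them round-robin across n_disks (RAID 0 style striping)."""
    disks = [bytearray() for _ in range(n_disks)]
    for i in range(0, len(data), chunk):
        disks[(i // chunk) % n_disks] += data[i:i + chunk]
    return disks

disks = stripe(b"ABCDEFGH", n_disks=2, chunk=2)
# disk 0 holds b"ABEF", disk 1 holds b"CDGH" -- consecutive chunks land
# on different devices, so both can be read concurrently
```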

RAID Disk types
---------------

Data disk
  - holds data stored on behalf of clients within RAID groups
  - and any data generated about the state of the storage system
    as a result of a malfunction

Spare disk
  - does not hold usable data, but is available to be added to a
    RAID group in an aggregate
  - any functioning disk that is not assigned to an aggregate but
    is assigned to a system functions as a hot spare disk

Parity disk
  - stores row parity information that is used for data reconstruction
    when a single disk drive fails within the RAID group

dParity disk
  - stores diagonal parity information that is used for data
    reconstruction when two disk drives fail within the RAID group,
    if RAID-DP is enabled.

How are RAID groups formed?
---------------------------

- they are automatically formed when you add disks to an aggregate
- Data ONTAP adds new drives to the most recently created RAID group until
  it reaches its maximum size (default behavior)

What size of RAID group should I choose?
----------------------------------------

- recommended range of RAID group size is between 12 and 20

Data ONTAP Storage units
------------------------

1. disks
  - physical device that you put into the shelves
  - e.g., SATA, BSAS, or SAS

2. raid groups
  - a collection of one or more disks providing a RAID level
  - if it's RAID-DP, there is at least one data disk and 2 parity
    disks (parity disk + dparity disk)

3. plexes
  - a collection of one or more raid groups

4. aggregates
  - consists of disks within the raid groups
  - collection of one or two plexes
  - if unmirrored, it contains a single plex
  - if mirrored, it contains 2 plexes

5. volumes
  - 2 types: traditional and FlexVol
  - traditional -> inherits properties from its containing aggregate (directly
                   tied to the aggr)
  - FlexVol -> loosely coupled to its containing aggregate; you can alter
               properties on the fly (RECOMMENDED!!!)

6. qtrees
  - subdirectory of the root directory of a volume
  - you can use qtrees to subdivide a volume in order to group LUNs

7. LUNs
  - logical unit of storage under a volume
  - you can create LUNs in the root of a volume (traditional or flexible) or in
    the root of a qtree
  - NOTE: don't create LUNs under Data ONTAP's root volume (/vol/vol0)

Disk types based on speed
-------------------------

SAS - Serial Attached SCSI (faster)
BSAS - Bridged SAS / SATA drives (slower); this
       means you have a SATA disk in a SAS enclosure (e.g., a DS4243 shelf)

How are disks named?
--------------------

{slot}{port}.{shelfID}.{bay}

example: 3c.10.21
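The naming scheme can be split apart with a small helper (hypothetical `parse_disk_name`, not a Data ONTAP command; note that disks also appear in a short two-part form such as 1b.29, which this sketch does not handle):

```python
import re

def parse_disk_name(name: str):
    """Split a disk name of the form {slot}{port}.{shelfID}.{bay},
    e.g. '3c.10.21', into its components."""
    m = re.match(r"(\d+)([a-z])\.(\d+)\.(\d+)$", name)
    if not m:
        raise ValueError(f"unrecognized disk name: {name}")
    slot, port, shelf, bay = m.groups()
    return {"slot": int(slot), "port": port,
            "shelf": int(shelf), "bay": int(bay)}

print(parse_disk_name("3c.10.21"))
# {'slot': 3, 'port': 'c', 'shelf': 10, 'bay': 21}
```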

How are disk firmwares updated?
-------------------------------

- automatically, when you assign the disk to a storage system
- manually (ask NetApp technical support what your target disk firmware is)

Commands
--------

Displaying
# lists all disks on the cluster
storage show disk
storage show disk -T --> adds TYPE of disk column at the end (SAS, BSAS, etc.)

# shows info for a particular disk
disk show 1b.29

# display unowned disks
disk show -n

# shows busy disk
stats show disk:*:disk_busy
example output: disk:39CF5F4E:715905F5:E3AEEB55:DCADA33F:0[...]000:disk_busy:100%

# this command shows more of the hardware side of the disk
filer01> storage show disk 4b.01.13
Disk: 3a.01.13
Shelf: 1
Bay: 13
Serial: 6SL3LM8B0000N2367MFW
Vendor: NETAPP
Model: X412_S15K7560A15
Rev: NA00
RPM: 15000
WWN: 5:000:c5004b:e8877c
UID: 5000C500:4BE8877F:00000000:00000000:00[...]
Downrev: yes
Pri Port: A
Sec Name: 4b.01.13
Sec Port: B
Power-on Hours: N/A
Blocks read:      0
Blocks written:   0
Time interval: 00:00:00
Glist count: 0
Scrub last done: 00:00:00
Scrub count: 0
LIP count: 0
Dynamically qualified: No
Current owner: 4294967295
Home owner: 4294967295
Reservation owner: 0
filer01>
       
# to see firmware version of netapp disk (it is the column with NAxx)
sysconfig -a
storage show disk

# how to see raid size of an aggregate?
aggr status -v
  * raid size = data disks + parity disks
  * under Options, example: raidsize=16

Assigning
# unowning a disk
disk assign 0c.51 -s unowned -f

# assigning a spare disk to another system
# assigning multiple disks to local node
disk assign disk_1 disk_2 … disk_N

Tutorials
---------

Manual Update of Disk Firmware
1. remove all files under /etc/disk_fw (make a backup just to be sure)
2. download the target disk firmware files to /etc/disk_fw
3. within 2 minutes the update should start on its own
4. once the disk firmwares are upgraded, verify by issuing this
   command: storage show disk -x

Debugging/Troubleshooting
-------------------------

data is copied to a spare disk
Sun Sep  6 04:13:56 EDT [filer01:raid.rg.diskcopy.start:notice]: /aggr1/plex0/rg2: starting disk copy from 3a.01.11 to 4b.02.0
Sun Sep  6 04:14:03 EDT [filer01:raid.disk.predictiveFailure:warning]: Disk /aggr1/plex0/rg2/3a.01.11 Shelf 1 Bay 11 [NETAPP   X412_S15K7560A15 NA06] S/N [6SL3KS740000N23610JD] reported a predictive failure and it is prefailed; it will be copied to a spare and failed
...
Sun Sep  6 06:31:28 EDT [filer01:raid.rg.diskcopy.done:notice]: /aggr1/plex0/rg2: disk copy from 3a.01.11 to 4b.02.0 completed in 2:17:32.30

Tuesday, May 29, 2018

Introduction to Python


What is Python?
===============

- an interpreted language (a program installed on your computer reads your
  Python code)
    * although Python is an interpreted language, it sometimes uses compiled
      code (*.pyc) to speed up execution
- not a compiled language (one that converts human-readable code into
  machine code that runs directly on the hardware)
- programming language well suited for automation
    > search and replace
    > small databases
    > specialized GUI applications
    > simple games
    > etc..
- not well suited for large GUI applications or performance-critical games
- cross-platform: can be used on Windows or *NIX systems
- cross-platform: can be used in Windows or *NIX systems

Features of Python
==================

- offers much more error checking than C
- has high-level data types
    > flexible arrays
    > dictionaries
- applicable in more areas than awk or perl
- allows you to split programs into modules
    > file I/O
    > system calls
    > sockets
    > GUI toolkits such as Tk
- interpreted language: no compilation and linking needed
- interpreter can be used interactively (just enter `python` in cli)
- programs written in Python are typically shorter than C/C++/Java counterparts
    > high-level data types can be expressed in a single statement
    > grouping is done by indentation instead of beginning and ending brackets
    > no variable or argument declaration necessary
- extensible (modules can be added)
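A couple of the points above in action (flexible arrays, dictionaries, indentation-based grouping, no declarations; the disk-type RPM figures are just example data):

```python
# a "flexible array" (list) and a dictionary, each built in one statement
squares = [n * n for n in range(1, 6)]      # [1, 4, 9, 16, 25]
rpm_by_type = {"sas": 15000, "bsas": 7200}  # key/value mapping

# grouping by indentation, no variable or argument declarations needed
def rpm_of(disk_type):
    return rpm_by_type.get(disk_type, 0)

print(rpm_of("sas"))   # 15000
```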

Getting Help
============

- Python ships with documentation and manual pages
- ways of getting useful information:
    a. serve the HTML docs over the web for browsing: # pydoc -p 9999
    b. print functions/names defined in a module: `dir(mod_name)` or
       `dir(mod1.mod2)`
    c. open manual pages: `help(mod_name)`
    d. interactive help session: `help()`
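Options (b) and (c) above can be tried directly from a script; `pydoc.render_doc` returns roughly the text that `help(math)` would page through:

```python
import math
import pydoc

# b. names defined in a module
print('sqrt' in dir(math))          # True

# c. the module's manual page as plain text
doc_text = pydoc.render_doc(math)
print(doc_text.splitlines()[0])     # first line of the manual page
```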

Monday, May 28, 2018

DataDomain IPMI


Introduction
============

- a host system is the one controlling your target system
- the target system must have an IPMI port enabled and IPMI credentials
  (username + password)
- to login to target system, you must login first to host system
- tasks you can perform on the target system are power off, power on, and reboot
- NOTE: it is not advisable to use IPMI to power down a target system because it
  is not a graceful method
- example use of IPMI is when a system is unresponsive and you need to reboot it
  remotely

IPMI/BMC Ports in General
=========================

- Some IPMI ports share the physical cable used by a LAN
  interface
  -> IPMI port on DD630 uses the cable of eth0a
  -> IPMI port on DD640 and DD670 uses a dedicated onboard port at the back
     (to the right of it are 4 USB ports)
- Some DDs have dedicated cables for IPMI ports
  -> Some DD4200s' IPMI depends on the MGT interface it's using
- IPMI with BMC firmware will only work when connected to 10/100 Mb ports
- The port speed on switch side must be able to negotiate 10/100 Mb. IPMI/BMC
  will not work if it can only negotiate 1Gb.
- BMC port must have an IP address
- BMC port IP must be pingable from outside the DD
- There are some cases where the BMC port IP is pingable from outside but not
  from inside the DD
- If the link shows an unknown status in the "ipmi show hardware" output, it
  doesn't mean that IPMI is inaccessible
- To check whether it is accessible, ping the IP of the IPMI port and log in
  to that IP

BMC (Baseboard Management Controller)
=====================================

The baseboard management controller (BMC) is a specialized microcontroller
embedded on the motherboard of the Data Domain system.

Sensors built into the Data Domain system report to the BMC on parameters such
as temperature, cooling fan speeds, power status, etc.

The BMC monitors the sensors and can trigger alerts if any of the parameters do
not stay within preset limits.

BMC Ports per DD Model
======================

DD630 - bmc-eth0 (can't be disabled via "ipmi disable bmc-eth0" command;
                 uses cable of eth0a)
DD640 - bmc0a
DD670 - bmc0a
DD4200 - bmc0a

Commands
========

Editing
ipmi config \
    ipaddress \
    netmask \
    gateway
ipmi config dhcp

Controlling
ipmi remote power [on|off|cycle|status] \
   ipmi-target user
ipmi remote power [on|off|cycle|status] \
   ipmi-target user [password ]

Displaying
ipmi user list
ipmi remote console ipmi-target user

Adding
ipmi user add  ## can be used for complex passwords
ipmi user add


Sunday, May 27, 2018

Redis Master-Slave Replication on CentOS 7


1. Install redis on all nodes
yum install -y epel-release
yum install -y redis

2. Configure master
vi /etc/redis.conf  # update the bind line so slaves can reach the master, e.g.: bind 0.0.0.0
firewall-cmd --add-port=6379/tcp

3. Configure slaves
vi /etc/redis.conf  # update the following line: slaveof <master_ip> 6379

4. Restart and enable redis on all nodes
systemctl enable --now redis

5. Login to master and create a key to validate
# redis-cli
127.0.0.1:6379> info replication
127.0.0.1:6379> set 'a' 1

6. verify on the slave nodes that the replication is working
# redis-cli
127.0.0.1:6379> get 'a'  # you must get the value set from the master
127.0.0.1:6379> info replication

Saturday, May 26, 2018

Cisco Device Management


Backing up and Restoring IOS
----------------------------

Basic parts:

  Flash
    - stores IOS (around 29 MB)
    - slow
    - is copied to RAM when device boots

  RAM
    - holds running config and IOS
    - fast but volatile
 
  NVRAM
    - available on some devices to hold running config
    - fast and non-volatile

Where to put back ups?

  HTTP - web server
  FTP - old way; requires a username and password
  TFTP - uses UDP; doesn't require a username or password

Configuration Registers
-----------------------

Determines how the device boots up

2100 (rommon)
  - device will boot here if it can't detect an IOS
2101 (rxboot)
  - limited version of IOS
2102
  - normal boot
2142
  - ignore/bypass NVRAM
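The register values above can be decoded programmatically; this sketch only models the two fields discussed here (boot field in the low 4 bits, bit 6 = ignore NVRAM):

```python
def decode_confreg(reg: int):
    """Decode a Cisco configuration register: boot field is the low
    4 bits (0x0 rommon, 0x1 rxboot, else normal), bit 6 ignores NVRAM."""
    boot = reg & 0x000F
    if boot == 0x0:
        mode = "rommon"
    elif boot == 0x1:
        mode = "rxboot"
    else:
        mode = "normal boot"
    ignore_nvram = bool(reg & 0x0040)
    return mode, ignore_nvram

print(decode_confreg(0x2102))   # ('normal boot', False)
print(decode_confreg(0x2142))   # ('normal boot', True)  <- bypasses NVRAM
```

This is why the password-recovery tutorial below sets 0x2142: the boot field still says "normal boot", but bit 6 makes the device skip the startup config in NVRAM.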

How Cisco devices boot
----------------------

1. checks config register
2. checks for `boot system` command in startup config
3. looks for first IOS image in flash
4. broadcast for a tftp server

Commands
--------

key command:
  copy ?

general syntax of `copy`:
  copy FROM TO

copies flash to TFTP server:
  copy flash: tftp://192.168.1.100/

same as above but uses prompts:
  copy flash tftp

prints flash information:
  show flash

restoring:
  copy tftp flash

copies running config:
  copy running-config DESTINATION

displays configuration register:
  show version

Some notes:

  - don't restore into the running config because it will just merge with the
    present config
  - e.g: copy tftp running-config
  - the proper way is to restore the config from the offsite box into the
    startup config
  - e.g: copy tftp startup-config

Tutorials
---------

recovering enable password
1. Boot into rommon
  -> connect to device via console
  -> power off device via physical switch
  -> power on device
  -> hit CTRL+BREAK

2. Change config register to bypass NVRAM
rommon 1 > confreg 0x2142
rommon 2 > reset

3. Copy the startup config to the running config and change
   enable password
Router> enable
Router# copy startup-config running-config
Router# conf t
Router(config)# enable secret newpassword
Router(config)# exit
Router# copy run start

Friday, May 25, 2018

SSL Termination in Nginx

Overview
--------

- Nginx can act as the SSL endpoint (termination point)
- once a client request is received via the encrypted channel (SSL), it is
  decrypted and passed to the backend server via an unencrypted channel
- can be performed on HTTP and TCP connections


client - encrypted (SSL) -> Nginx proxy server
                                             |
                                             |
                                             |--- unencrypted --> backend server

Requirements
------------

* Nginx Plus R6 or later
* A load-balanced upstream group with several TCP servers
* SSL certificates and a private key (obtained or self-generated)


Configuration
-------------

Configuration is similar to SSL setup in the previous discussion but with the
addition of `proxy_pass` directive

Standard settings
server {
    listen              443 ssl;
    proxy_pass          backend;
    server_name         www.example.com;

    # public key (shared to others)
    ssl_certificate     www.example.com.crt;

    # private key (must be kept private)
    ssl_certificate_key www.example.com.key;

    ssl_protocols       TLSv1 TLSv1.1 TLSv1.2;
    ssl_ciphers         HIGH:!aNULL:!MD5;
    ...
}


Speeding up TCP connections
---------------------------

- SSL handshake is series of messages between client and server to verify that
  the connection is trusted
- default SSL handshake timeout is 60 seconds
- you can change it via `ssl_handshake_timeout`
- must not be set too low (results in handshake failure) or too high (long time
  wait for handshake to complete)

Manually specifying the SSL handshake timeout:

server {
    ...
    ssl_handshake_timeout 10s;
}

Thursday, May 24, 2018

Introduction to Cybersecurity


History of Hacking
------------------

Timeline:

<1970 - early computers, radios
1970  - mainframes of campuses became targets
1980  - PCs were invented
1990  - internet
2000  - bluetooth, tablets, smartphones ..
>2000 - international law for computer crimes was established

"Making things easier for hackers is the fact that early network
technologies such as the Internet were never designed with security
as a goal. The goal was the sharing of information."

Famous hacks through the years:

1988 - 1st internet worm was created by Robert T. Morris, Jr.
1994 - Kevin Lee Poulsen took over the telephone lines of KIIS-FM to win a
       Porsche
1999 - David L. Smith created the "Melissa" virus w/c emails itself to entries
       in the user's address book
2001 - Jan de Wit created "Anna Kournikova" virus w/c reads all entries of
       a user's outlook address book
2002 - Gary McKinnon connected to and deleted critical US military files
2004 - Adam Botbyl (together w/ 2 other friends) stole credit card information
       from Lowe's hardware chain
2005 - Cameron Lacroix hacked into Paris Hilton's phone
2009 - Kristina Vladimirovna Svechinskaya (Russian hacker) helped skim around
       3 million US $ from US banks
2010s - "Stuxnet" virus attacked uranium enrichment facilities
      - "Anonymous" group attacked local government networks

Generic examples of Cyber crimes
--------------------------------

1.  stealing usernames and passwords
2.  network intrusions
3.  social engineering (involves human interaction)
4.  posting/transmitting of illegal material
5.  fraud
6.  software piracy
7.  dumpster diving (reconstruction of broken data)
8.  malicious code (viruses)
9.  unauthorized destruction of data
10. embezzlement (form of financial fraud)
11. data-diddling (modification of information to cover up activities)
12. Denial-of-service (overloads a system resource)
13. ransomware (encrypts files on target system to get money)

Devices and Systems that add security
-------------------------------------

Software:

- VPNs (Virtual Private Networks)
- IPS (Intrusion Prevention Systems)
- firewalls
- ACLs (Access Control Lists)
- biometrics
- smartcards

Physical security:

- cable locks
- device locks
- alarm systems

Malicious Attacks
-----------------

- Denial-of-service (DoS)
- manipulation of stock prices
- identity theft
- vandalism
- credit card theft
- piracy
- theft of service

Known hacker groups
-------------------

Anonymous
LulzSec

Wednesday, May 23, 2018

NBU Duplication and SLP


Things to know about Netbackup duplication
------------------------------------------

- you can duplicate a backup image from cmd or GUI
- by default, restore is being done from the primary copy
- duplication job doesn't show "KB per second" in JAVA console
- from experience, a 35 GB backup took 2 hours and a 32 KB backup took 16
  minutes to duplicate to a DR facility (destination system is a DataDomain w/
  un-aggregated links)
- To duplicate data generally takes longer than to back up data
- Duplication also consumes twice the bandwidth from storage devices than
  backups consume because a duplication job must read from one storage device
  and write to another storage device
- Duplication taxes the NetBackup resource broker (nbrb) twice as much as
  backups
- If nbrb is overtaxed, it can slow the rate at which all types of new jobs are
  able to acquire resources and begin to move data

How are duplication jobs triggered?
-----------------------------------

NetBackup starts a duplication session every five minutes to copy data from a
backup destination to a duplication destination. If a duplication job fails, the
next three duplication sessions retry the job if necessary. If the job fails all
three times, the job is retried every 24 hours until it succeeds.
Duplication occurs as soon as possible after the backup completes.

Concepts about backup service levels
------------------------------------

- service level is based on recovery capability
- Recovery point objective (RPO) is the most recent backup
- Recovery time objective (RTO) is the time required to recover the backup
- RTO of a given backup becomes less critical as the backup ages
- Backup data is at its most valuable immediately after the backup has been made
- Platinum service level = RPO and RTO of 1 or 2 hours --> mission critical
  applications such as order processing systems and transaction processing
  systems
- Gold service level = RPO and RTO of 12 hours or less --> non-critical
  applications such as e-mail, CRM, and HR systems
- Silver service level = RPO and RTO of 1 or 2 days --> non-critical
  applications such as user file and print data, relatively static data
- high cost storage devices are disk, ssds, etc
- low cost storage devices are tapes, virtual tape libraries, etc

Things to know about Netbackup Storage Lifecycle Policy (SLP)
-------------------------------------------------------------

- It is introduced in NBU 6.5
- a Storage Lifecycle Policy is a plan or map of where backup data will be
  stored and for how long
- it automates duplication process and determines how long the backup data will
  reside in each location that it is duplicated to
- when a storage plan changes (e.g., if a new regulation is imposed on your
  business requiring changes to retention periods or the number of copies
  created), you simply need to change a small number of Storage Lifecycle
  Policies, and all associated backups will take the changes into account
  automatically
- after the original backup completes, the Storage Lifecycle Policy process
  creates copies of the image, retrying as necessary until all required copies
  are successfully created
- in practice it is likely that a Backup Policy may have two or three Storage
  Lifecycle Policies covering different types of backup (e.g., daily
  incremental, weekly full, and monthly full)
- a backup policy may have one or more SLPs (e.g., one for a Daily Incr
  schedule and another for a Weekly Full)
- SLP scheduling is built in as of NBU 7.6

SLP Operations
--------------

- duplication jobs will start as soon as the backup completes (backup then
  duplication)
- by default, SLP checks every 5 minutes for backup images that have recently
  completed and require duplication jobs
- SLP groups batches of similar images together for each duplication job, to
  optimize the performance of duplication (when there is enough data, 8 GB by
  default, to warrant a duplication job, duplication is started)
  -> as an example, see "first_duplication_batch_job.jpg"
- default settings of 5 minutes and 8 GB can be varied by setting values in the
  /usr/openv/netbackup/db/config/LIFECYCLE_PARAMETERS
- if a duplication job fails to make a copy of an image, that image will be
  added to a subsequent batch of images to be duplicated with the next
  five-minute sweep of images that need to be copied (this is done 3 times for
  a single image)
- after three failures, the SLP will wait two hours (by default) before trying
  to create that copy of that image again (this retry will continue once every
  two hours (by default) until either the user intervenes or the time of the
  longest retention specified for the image comes to pass)
- copies of a backup image will not be deleted (expired) while at least one
  required copy has failed to duplicate
- In practice, I noticed that SLP starts 30 minutes after a daily incremental
  finishes (both triggered and scheduled backups)
  -> the reason for this is that we don't have a
     /usr/openv/netbackup/db/config/LIFECYCLE_PARAMETERS file in our master
     server
  -> so SLP is using the default value for
     MAX_MINUTES_TIL_FORCE_SMALL_DUPLICATION_JOB, which is 30 minutes
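The batching rule described above can be modeled in a few lines (a toy decision model using the 8 GB / 30 minute defaults from the prose, not NetBackup code):

```python
# defaults described above (sizes in KB, time in minutes)
MIN_BATCH_KB = 8 * 1024 * 1024   # 8 GB minimum duplication batch
FORCE_AFTER_MINUTES = 30         # force small batches after this long

def should_start_duplication(batch_kb: int, minutes_waiting: int) -> bool:
    """Would nbstserv kick off a duplication job for this pending
    batch of images? Either enough data has accumulated, or the
    small batch has waited long enough to be forced."""
    return batch_kb >= MIN_BATCH_KB or minutes_waiting >= FORCE_AFTER_MINUTES

print(should_start_duplication(10 * 1024 * 1024, 5))   # True  (enough data)
print(should_start_duplication(1024, 5))               # False (too small, too soon)
print(should_start_duplication(1024, 30))              # True  (forced by age)
```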

Considerations in setting up Storage Lifecycle Policy (SLP)
-----------------------------------------------------------

1.) It is important to remember that this is not a hierarchical model; it is
    duplicated at the first possible opportunity and occupies all the storage
    locations simultaneously.
2.) In most cases the primary (first) Backup Storage Destination will be a
    high-speed storage device that allows fast restores.
3.) It is not possible to specify the use of the Media Server Encryption Option
    on specific Storage Destinations within a Storage Lifecycle Policy.
4.) A storage destination within a Storage Lifecycle Policy may use either a
    specific Storage Unit or a Storage Unit Group.
5.) It is important to remember this when defining Duplication Storage
    Destinations, as poor design may lead to excessive network traffic and other
    resource contention.
6.) The “Alternate Read Server” setting for a storage destination applies on the
    source destination, not the target destination. This means that the only
    Storage Destination on which the “Alternate Read Server” setting has any
    effect is the first Backup Destination (as this is the source used for all
    duplication).

Setup/Configuration
-------------------

The LIFECYCLE_PARAMETERS file:
/usr/openv/netbackup/db/config/LIFECYCLE_PARAMETERS

MIN_KB_SIZE_PER_DUPLICATION_JOB
This is the size of the minimum duplication batch (default 8 GB).

MAX_KB_SIZE_PER_DUPLICATION_JOB
This is the size of the maximum duplication batch (default 25 GB).

MAX_MINUTES_TIL_FORCE_SMALL_DUPLICATION_JOB
This represents the time interval between forcing duplication sessions for
small batches (default 30 minutes).

IMAGE_EXTENDED_RETRY_PERIOD_IN_HOURS
After duplication of an image fails three times, this is the time interval
between subsequent retries (default 2 hours).

DUPLICATION_SESSION_INTERVAL_MINUTES
This is how often the Storage Lifecycle Policy service (nbstserv) looks to see
if it is time to start a new duplication job(s) (default 5 minutes).

- if this file does not exist, the default values will be used
- not all parameters are required in the file, and there is no order dependency
  in the file
- any parameters omitted from the file will use default values

The syntax of the LIFECYCLE_PARAMETERS file, using default values, is as
follows:
MIN_KB_SIZE_PER_DUPLICATION_JOB 8192
MAX_KB_SIZE_PER_DUPLICATION_JOB 25600
MAX_MINUTES_TIL_FORCE_SMALL_DUPLICATION_JOB 30
IMAGE_EXTENDED_RETRY_PERIOD_IN_HOURS 2
DUPLICATION_SESSION_INTERVAL_MINUTES 5
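A sketch of how such a file could be parsed, with omitted keys falling back to the defaults listed above (hypothetical helper for illustration, not part of NetBackup):

```python
DEFAULTS = {
    "MIN_KB_SIZE_PER_DUPLICATION_JOB": 8192,
    "MAX_KB_SIZE_PER_DUPLICATION_JOB": 25600,
    "MAX_MINUTES_TIL_FORCE_SMALL_DUPLICATION_JOB": 30,
    "IMAGE_EXTENDED_RETRY_PERIOD_IN_HOURS": 2,
    "DUPLICATION_SESSION_INTERVAL_MINUTES": 5,
}

def parse_lifecycle_parameters(text: str) -> dict:
    """Parse LIFECYCLE_PARAMETERS content: any order, any subset;
    omitted parameters keep their default values."""
    params = dict(DEFAULTS)
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[0] in params:
            params[fields[0]] = int(fields[1])
    return params

cfg = parse_lifecycle_parameters("DUPLICATION_SESSION_INTERVAL_MINUTES 10")
print(cfg["DUPLICATION_SESSION_INTERVAL_MINUTES"])   # 10
print(cfg["MIN_KB_SIZE_PER_DUPLICATION_JOB"])        # 8192 (default kept)
```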