NT1 staff and local NDGF site sysadmin meeting. The meeting location is Bern: Universität Bern / UniS, Schanzeneckstr. 1, room A-124.
There will be a dinner the evening before at Altes Tramdepot.
Bern is normally reached by flying to Zurich and taking a train. Trains from Geneva take longer, but are also a possibility. The meeting location is right next to Bern main station.
The past ½ year:
- OPN on redundant(?) 100Gb. (Should be redundant, but fail-over failed when needed)
- Added 0.5PB disk (tarpool).
- Added 2 tapepools (tarpool)
- Compute nodes moved to CentOS 7
- CEs moved to CentOS 7 and ARC6
- Facility people got tired of bothering us and moved on to chemistry
Future:
- Alice VOBOX (alice01) moves to CentOS 7
- Order more compute this month (10-15kHS expected from benchmark runs).
- Convert all pools to tarpools.
- IPv6 on compute.
- 2020: New tape library or upgrade the old.
NTR
The Alice cluster will be upgraded to CentOS 7 and the LRMS will switch to SLURM very soon - maybe this week.
- Tarpooled all disk pools.
- Cache machines upgraded to Xenial; the upgrade to Bionic did not work.
- dCache tarpools - readied for the "reboot-required"
- Tape - TSM mishap: 2 tapes got lost.
- Issue found in ENDIT: the lost tapes caused the logs to fill up.
- Compute - Nested singularity seems to work now.
- Uploads from Abisko are not preferring storage at HPC2N. Remember to tell Ops when changing IP address.
- Net - Half of disk pools using BBR.
- no obvious improvement
- expected to perform better worst-case on bad links.
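To keep track of which pools are in the BBR half, the active congestion control algorithm can be read from the kernel. A minimal sketch, assuming Linux hosts that expose the setting at /proc/sys/net/ipv4/tcp_congestion_control; the helper names are illustrative and not part of any NDGF tooling.

```python
from pathlib import Path

# Standard Linux location of the active TCP congestion control algorithm.
PROC_FILE = Path("/proc/sys/net/ipv4/tcp_congestion_control")


def parse_algorithm(text: str) -> str:
    """Return the algorithm name from the proc file contents."""
    return text.strip().lower()


def current_algorithm(proc_file: Path = PROC_FILE):
    """Return the active algorithm, or None when the file is unavailable."""
    try:
        return parse_algorithm(proc_file.read_text())
    except OSError:
        return None


if __name__ == "__main__":
    algo = current_algorithm()
    print("congestion control:", algo, "| BBR active:", algo == "bbr")
```

Run on each pool host, this gives a quick inventory to compare against when looking for differences between the BBR and non-BBR pools.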
Future:
- Before end of 2019: new tape pools
- Before end of 2020: new disk pools
NTR
- Working on new system for the last half a year.
- Alice jobs are now running, but still have performance issues.
- The Alice queue has 200 nodes with 8 nodes. There might be another 100 nodes available.
- Lots of validation errors on the jobs.
- Storage:
- dCache - 140 TB (evacuation runs at about 600 MB/s, so it takes somewhere between around 64 hours and a week)
- Ganglia is almost ready
- There might be some left over money for more compute and dCache. Could mean that pools could be reduced in size to gain performance.
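The evacuation estimate for the 140 TB of dCache storage can be reproduced with a few lines. This is a sketch assuming decimal units (1 TB = 1,000,000 MB) and a constant transfer rate; the 230 MB/s figure is a hypothetical lower sustained rate, used only to show how the estimate stretches to a week.

```python
def evacuation_hours(size_tb: float, rate_mb_s: float) -> float:
    """Hours needed to move size_tb terabytes at rate_mb_s megabytes/second.

    Assumes decimal units (1 TB = 1e6 MB) and a constant rate.
    """
    seconds = size_tb * 1_000_000 / rate_mb_s
    return seconds / 3600


if __name__ == "__main__":
    best = evacuation_hours(140, 600)
    print(f"140 TB at 600 MB/s: {best:.1f} hours")
    # Hypothetical lower sustained rate, to show how a week is reached.
    slow_days = evacuation_hours(140, 230) / 24
    print(f"140 TB at 230 MB/s: {slow_days:.1f} days")
```

At the full 600 MB/s the evacuation takes just under 65 hours; any drop in the sustained rate pushes it towards the upper end of the range.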
- Abel is being decommissioned.
- Downtime since 15/9-2019
- Low prio kept running. Will probably run until the end of the year.
- Current downtime runs until 30/10; it should be extended.
- New tier 1 will be run on openstack.
- Test cluster since June: AMD EPYC - 12.5 HEPSPEC per core with HT
- 8 vCPU per node
- 30 nodes from June
- Disk limited - running in pilot mode
- October: More disk available - installed as ARC cache
- 10 more nodes
- More servers have arrived
- 12 servers, each with 2 CPUs of 48 cores (x2 threads)
- in total 2304 more vCPU
- ARC Datastaging
- There is trouble keeping the nodes fed with input. It looks like there might be a 1 Gb link somewhere in the system; to fill a cluster it should be at least 10 Gb.
- Norwegian pledge has been saved by Oracle cloud
- Storage:
- New disk pools are on Ceph
- Erasure coded
- 2PB available space
- 300 MB/s speed, should ideally be 4 times more
- New pools are being commissioned
- Minimal alice:
- VOBOX is there. No jobs are run. Should this be closed?
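The capacity figures in this report can be checked with simple arithmetic. The sketch below recomputes the vCPU count of the new servers and estimates the average input bandwidth per job slot for a 1 Gb and a 10 Gb link. It assumes the link is shared evenly across all slots and that every vCPU runs one job, neither of which the notes state; the function names are illustrative.

```python
def total_vcpus(servers: int, cpus_per_server: int, cores_per_cpu: int,
                threads_per_core: int) -> int:
    """Number of hardware threads (vCPUs) across all servers."""
    return servers * cpus_per_server * cores_per_cpu * threads_per_core


def per_slot_mb_s(link_gbit_s: float, slots: int) -> float:
    """Average input bandwidth per job slot in MB/s, link shared evenly."""
    return link_gbit_s * 1000 / 8 / slots


if __name__ == "__main__":
    vcpus = total_vcpus(servers=12, cpus_per_server=2,
                        cores_per_cpu=48, threads_per_core=2)
    print("new vCPUs:", vcpus)
    for link in (1, 10):
        print(f"{link} Gb link: {per_slot_mb_s(link, vcpus):.3f} MB/s per slot")
    # Ceph pools: measured 300 MB/s, ideally four times more.
    print("Ceph target:", 300 * 4, "MB/s")
```

Under these assumptions a 1 Gb bottleneck leaves well under 0.1 MB/s per slot, which is consistent with the difficulty of keeping the nodes fed.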
LHCOPN 100 Gb: The 100 Gb link to Oslo is possibly ready. There are no plans to upgrade the link to Bergen.
Almost all disk pools are tarpools.
Missing:
UCPH - disk and some tape (by end of year)
HPC2N - tape (Done with new pools by end of year)
IJS - old pools
Tape:
Is the responsibility line clear?
endit daemons run by local admins, endit plugins by NDGF ops
UiB is missing a TSM person. Another option would be Enstore. It is cheaper, but UiB would be the only site running it.
Automatic reboots:
Seems to work at HPC2N. Made sure to only reboot during office hours.
Ansible scripts should be cleaned up. Start stop issue should be investigated. (Who?)
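The office-hours restriction on automatic reboots can be expressed as a small gate that the reboot job consults before acting. This is a sketch only: the 09:00-16:00 Monday-to-Friday window is a hypothetical choice, since the notes do not state the hours used at HPC2N, and `reboot_allowed` is an illustrative name rather than part of the existing Ansible scripts.

```python
from datetime import datetime, time

# Hypothetical office hours; the notes do not state the actual window.
OFFICE_START = time(9, 0)
OFFICE_END = time(16, 0)


def reboot_allowed(now: datetime, reboot_required: bool) -> bool:
    """True when a pending reboot may be performed right now."""
    is_weekday = now.weekday() < 5  # Monday=0 ... Friday=4
    in_window = OFFICE_START <= now.time() < OFFICE_END
    return reboot_required and is_weekday and in_window


if __name__ == "__main__":
    print(reboot_allowed(datetime.now(), reboot_required=True))
```

Keeping the decision in one function makes it easy to test the edge cases (weekends, late evenings) separately from the start/stop logic.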
Note: Psql transfer compression is useful, but not above level 2, as the transfer then ends up waiting on CPU - Znikke
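The trade-off behind this note can be illustrated by measuring CPU time against compressed size at different levels. The sketch uses Python's zlib as a stand-in, because the notes do not say which compression mechanism the PostgreSQL transfer uses; the sample data is synthetic and the numbers will differ for a real database.

```python
import time
import zlib


def compression_profile(data: bytes, levels=(1, 2, 6, 9)) -> dict:
    """Map zlib level -> (compressed size in bytes, CPU seconds spent)."""
    profile = {}
    for level in levels:
        start = time.process_time()
        compressed = zlib.compress(data, level)
        elapsed = time.process_time() - start
        assert zlib.decompress(compressed) == data  # round-trip sanity check
        profile[level] = (len(compressed), elapsed)
    return profile


if __name__ == "__main__":
    # Synthetic, repetitive sample standing in for a database dump.
    sample = b"INSERT INTO files VALUES (42, 'some/path', 1024);\n" * 200_000
    for level, (size, cpu) in compression_profile(sample).items():
        print(f"level {level}: {size} bytes, {cpu:.3f} s CPU")
```

If the CPU time per level grows faster than the size shrinks, the link sits idle while the sender compresses, which is the effect described above.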
DRBD tuning: HPC2N has reached 10 Gb/s, but not in conjunction with Ganeti. We would like to reach 25 Gb/s.
The same disk on Dulkis has dropped out of the RAID a number of times. A ticket with Dell must be created. (Chrulle)
Plan:
Ganeti install - Petter
Psql - Chrulle
Power cap should be tested - Chrulle
Monitoring/ganglia - Petter/Jens (nagios test for different pw hashes)
Network failover test - Maswan
Syncauth - could be deprecated - Petter
Add migration to weekly page - Chrulle
Is the 200000 request limit in FTS reasonable? We should be able to support this - Maswan.
During the test we saw files being removed before being transferred out. Looks to be the sweeper stepping in. Vincent will investigate. Possibly caused by the missing protection on files.
Monitoring on the tape pools? Stuff missing from pinboard.
Pools have a timeout (96 hours) for a restore; ENDIT does not. We should make the timeout a bit longer.
The dashboard is missing information needed to debug the checksumming issue at HPC2N.
The Tape Carousel has high priority from ATLAS. ATLAS is running out of disk space and would like to put more on tape. NDGF would like an actual official bandwidth requirement for tapes. This should be communicated to ATLAS.
Znikke suggests that we finish the tape validation for all sites: https://wiki.neic.no/wiki/NDGF_dCache_tape_pool_validation
Niklas reports on current and future technology.
Niklas presents various hardware configurations, slides available. Timescale: In production before Christmas, should last 4 years. No firm conclusion, but mixed use configuration likely.
Maiken presents and demonstrates.
Maiken presents.
Vincent presents.