Changelog for
slurm-16.05.8-30.4.x86_64.rpm :
Fri Jan 20 13:00:00 2017 robert.warmbierAATTwits.ac.za
- Trying to fix issues with the spec file, which doesn't allow stuff
depending on the perlapi to be installed.
Tue Jan 10 13:00:00 2017 robert.warmbierAATTwits.ac.za
- Update to version 16.05.8 (see network:cluster project for list of changes)
- Pulled hwloc 2.0 patches from network:cluster
Thu Jul 21 14:00:00 2016 robert.warmbierAATTwits.ac.za
- Update to version 15.08.12
* Changes in Slurm 15.08.12
- - Do not attempt to power down a node which has never responded if the
slurmctld daemon restarts without state.
- - Fix for possible slurmstepd segfault on invalid user ID.
- - MySQL - Fix for possible race condition when archiving multiple clusters
at the same time.
- - Fix compile for when you don't have hwloc.
- - Fix issue where daemons would only listen on specific address given in
slurm.conf instead of all. If looking for specific addresses use
TopologyParam options No*InAddrAny.
- - Cray - Better robustness when dealing with the aeld interface.
- - job_submit.lua - add array_inx value for job arrays.
- - Perlapi - Remove unneeded/undefined mutex.
- - Fix issue when TopologyParam=NoInAddrAny is set the responses wouldn't
make it to the slurmctld when using message aggregation.
- - MySQL - Fix potential memory leak when rolling up data.
- - Fix issue with clustername file when running on NFS with root_squash.
- - Fix race condition with respect to cleaning up the profiling threads
when in use.
- - Fix issues when building on NetBSD.
- - Fix jobcomp/elasticsearch build when libcurl is installed in a
non-standard location.
- - Fix MemSpecLimit to explicitly require TaskPlugin=task/cgroup and
ConstrainRAMSpace set in cgroup.conf.
- - MYSQL - Fix order of operations issue where if the database is locked up
and the slurmctld doesn't wait long enough for the response it would give
up leaving the connection open and create a situation where the next message
sent could receive the response of the first one.
- - Fix CFULL_BLOCK distribution type.
- - Prevent sbatch from trying to enable debug messages when using job arrays.
- - Prevent sbcast from enabling "--preserve" when specifying a jobid.
- - Prevent wrong error message from spank plugin stack on GLOB_NOSPACE error.
- - Fix proctrack/lua plugin to prevent possible deadlock.
- - Prevent infinite loop in slurmstepd if execve fails.
- - Prevent multiple responses to REQUEST_UPDATE_JOB_STEP message.
- - Prevent possible deadlock in acct_gather_filesystem/lustre on error.
- - Make it so --mail-type=NONE doesn't throw an invalid error.
- - If no default account is given for a user when creating (only a list of
accounts) no default account is printed, previously NULL was printed.
- - Fix for tracking a node's allocated CPUs with gang scheduling.
- - Fix hidden error during _rpc_forward_data call.
- - Fix bug resulting from wrong order-of-operations in _connect_srun_cr(),
and two others that cause incorrect debug messages.
- - Fix backwards compatibility with sreport going to <= 14.11 coming from
>= 15.08 for some reports.
* Changes in Slurm 15.08.11
- - Fix for job "--contiguous" option that could cause job allocation/launch
failure or slurmctld crash.
- - Fix to setup logs for single-character program names correctly.
- - Backfill scheduling performance enhancement with large number of running
jobs.
- - Reset job's prolog_running counter on slurmctld restart or reconfigure.
- - burst_buffer/cray - Update job's prolog_running counter if pre_run fails.
- - MYSQL - Make the error message more specific when removing a reservation
and it doesn't meet basic requirements.
- - burst_buffer/cray - Fix for script creating or deleting persistent buffer
would fail "paths" operation and hold the job.
- - power/cray - Prevent possible divide by zero.
- - power/cray - Fix bug introduced in 15.08.10 preventing operation in many
cases.
- - Prevent deadlock for flow of data to the slurmdbd when sending reservation
that wasn't set up correctly.
- - burst_buffer/cray - Don't call Datawarp "paths" function if script includes
only create or destroy of persistent burst buffer. Some versions of Datawarp
software return an error for such scripts, causing the job to be held.
- - Fix potential issue when adding and removing TRES which could result
in the slurmdbd segfaulting.
- - Add cast to memory limit calculation to prevent integer overflow for
very large memory values.
- - Bluegene - Fix issue with reservations resizing under the covers on a
restart of the slurmctld.
- - Avoid error message of "Requested cpu_bind option requires entire node to
be allocated; disabling affinity" being generated in some cases where
task/affinity and task/cgroup plugins used together.
- - Fix version issue when packing GRES information between 2 different versions
of Slurm.
- - Fix for srun hanging with OpenMPI and PMIx
- - Better initialization of node_ptr when dealing with protocol_version.
- - Fix incorrect type when initializing header of a message.
- - MYSQL - Fix incorrect usage of limit and union.
- - MYSQL - Remove 'ignore' from alter ignore when updating a table.
- - Documentation - update prolog_epilog page to reflect current behavior
if the Prolog fails.
- - Documentation - clarify behavior of 'srun --export=NONE' in man page.
- - Fix potential gres underflow on restart of slurmctld.
- - Fix sacctmgr to remove a user who has no associations.
Mon Apr 25 14:00:00 2016 robert.warmbierAATTwits.ac.za
- Update to version 15.08.10
- Clean-up of spec file
* Changes in Slurm 15.08.10
- - Fix issue where if a slurmdbd rollup lasted longer than 1 hour the
rollup would effectively never run again.
- - Demote error message in the pmi2 code to debug, as the issue can be expected
and retries are done, making the error message a little misleading.
- - Power/cray: Don't specify NID list to Cray APIs. If any of those nodes are
not in a ready state, the API returned an error for ALL nodes rather than
valid data for nodes in ready state.
- - Fix potential divide by zero when tree_width=1.
- - checkpoint/blcr plugin: Fix memory leak.
- - If using PrologFlags=contain: Don't launch the extern step if a job is
cancelled while launching.
- - Remove duplicates from AccountingStorageTRES
- - Fix backfill scheduler race condition that could cause invalid pointer in
select/cons_res plugin. Bug introduced in 15.08.9.
- - Avoid double calculation on partition QOS if the job is using the same QOS.
- - Do not change a job's time limit when updating unrelated field in a job.
* Changes in Slurm 15.08.9
- - BurstBuffer/cray - Defer job cancellation or time limit while "pre-run"
operation in progress to avoid inconsistent state due to multiple calls
to job termination functions.
- - Fix issue with resizing jobs and limits not being kept track of correctly.
- - BGQ - Remove redeclaration of job_read_lock.
- - BGQ - Tighter locks around structures when nodes/cables change state.
- - Make it possible to change CPUsPerTask with scontrol.
- - Make it so scontrol update part qos= will take away a partition QOS from
a partition.
- - Fix issue where SocketsPerBoard didn't translate to Sockets when CPUS=
was also given.
- - Add note to slurm.conf man page about setting "--cpu_bind=no" as part
of SallocDefaultCommand if a TaskPlugin is in use.
- - Set correct reason when a QOS' MaxTresMins is violated.
- - Ensure that a job is completely launched before trying to suspend it.
- - Remove historical presentations and design notes. Only distribute
maintained doc/html and doc/man directories.
- - Remove duplicate xmalloc() in task/cgroup plugin.
- - Backfill scheduler to validate correct job partition for job submitted to
multiple partitions.
- - Force close on exec on first 256 file descriptors when launching a
slurmstepd to close potential open ones.
- - Step GRES value changed from type "int" to "int64_t" to support larger
values.
- - Fix getting reservations to database when database is down.
- - Fix issue with sbcast not doing a correct fanout.
- - Fix issue where steps weren't always getting the gres/tres involved.
- - Fixed double read lock on getting job's gres/tres.
- - Fix display for RoutePlugin parameter to display the correct value.
- - Fix route/topology plugin to prevent segfault in sbcast when in use.
- - Fix Cray slurmconfgen_smw.py script to use nid as nid, not nic.
- - Fix Cray NHC spawning on job requeue. Previous logic would leave nodes
allocated to a requeued job as non-usable on job termination.
- - burst_buffer/cray plugin: Prevent a requeued job from being restarted while
file stage-out is still in progress. Previous logic could restart the job
and not perform a new stage-in.
- - Fix job array formatting to display arrays with step functions as
[0-100:2] rather than [0,2,4,6,8,...].
- - FreeBSD - replace Linux-specific set_oom_adj to avoid errors in slurmd log.
- - Add option for TopologyParam=NoInAddrAnyCtld to make the slurmctld listen
on only one port like TopologyParam=NoInAddrAny does for everything else.
- - Fix burst buffer plugin to prevent corruption of the CPU TRES data when bb
is not set as an AccountingStorageTRES type.
- - Suppress error messages in acct_gather_energy/ipmi plugin after repeated
failures.
- - Change burst buffer use completion email message from
"SLURM Job_id=1360353 Name=tmp Staged Out, StageOut time 00:01:47" to
"SLURM Job_id=1360353 Name=tmp StageOut/Teardown time 00:01:47"
- - Generate burst buffer use completion email immediately after teardown
completes rather than at job purge time (likely minutes later).
- - Fix issue when adding a new TRES to AccountingStorageTRES for the first
time.
- - Update gang scheduling tables when job manually suspended or resumed. Prior
logic could mess up job suspend/resume sequencing.
- - Update gang scheduling data structures when job changes in size.
- - Associations - prevent hash table corruption if uid initially unset for
a user, which can cause slurmctld to crash if that user is deleted.
- - Avoid possibly aborting srun on SIGSTOP while creating the job step due to
threading bug.
- - Fix deadlock issue with burst_buffer/cray when a newly created burst
buffer is found.
- - burst_buffer/cray: Set environment variables just before starting job rather
than at job submission time to reflect persistent buffers created or
modified while the job is pending.
- - Fix check of per-user qos limits on the initial run by a user.
- - Fix gang scheduling resource selection bug which could prevent multiple jobs
from being allocated the same resources. Bug was introduced in 15.08.6.
- - Don't print the Rgt value of an association from the cache as it isn't
kept up to date.
- - burst_buffer/cray - If the pre-run operation fails then don't issue
duplicate job cancel/requeue unless the job is still in run state. Prevents
jobs hung in COMPLETING state.
- - task/cgroup - Fix bug in task binding to CPUs.
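The compact job array display mentioned in the 15.08.9 changes above
([0-100:2] rather than [0,2,4,6,8,...]) can be illustrated with a short Python
sketch. This is not Slurm's implementation; the function name format_task_ids
and the rule that a run needs at least three IDs before it is collapsed are
assumptions made only for this example.

```python
def format_task_ids(ids):
    """Render task IDs compactly, collapsing evenly spaced runs.

    Hypothetical helper for illustration only; not part of Slurm.
    """
    ids = sorted(set(ids))
    parts = []
    i = 0
    while i < len(ids):
        # Try to extend an evenly spaced run starting at ids[i].
        j = i
        step = 1
        if i + 1 < len(ids):
            step = ids[i + 1] - ids[i]
            j = i + 1
            while j + 1 < len(ids) and ids[j + 1] - ids[j] == step:
                j += 1
        if j - i + 1 >= 3:
            # Collapse runs of three or more IDs (assumed threshold).
            if step == 1:
                parts.append(f"{ids[i]}-{ids[j]}")
            else:
                parts.append(f"{ids[i]}-{ids[j]}:{step}")
            i = j + 1
        else:
            # Too short to collapse: emit a single ID and move on.
            parts.append(str(ids[i]))
            i += 1
    return "[" + ",".join(parts) + "]"


if __name__ == "__main__":
    print(format_task_ids(range(0, 101, 2)))
```

Calling format_task_ids(range(0, 101, 2)) returns "[0-100:2]", while IDs that
do not form a run, such as [1, 5], are listed individually.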
* Changes in Slurm 15.08.8
- - Backfill scheduling properly synchronized with Cray Node Health Check.
Prior logic could result in highest priority job getting improperly
postponed.
- - Make it so daemons also support TopologyParam=NoInAddrAny.
- - If scancel is operating on large number of jobs and RPC responses from
slurmctld daemon are slow then introduce a delay in sending the cancel job
requests from scancel in order to reduce load on slurmctld.
- - Remove redundant logic when updating a job's task count.
- - MySQL - Fix querying jobs with reservations when the id's have rolled.
- - Perl - Fix use of uninitialized variable in slurm_job_step_get_pids.
- - Launch batch job requesting --reboot after the boot completes.
- - Move debug messages like "not the right user" from association manager
to debug3 when trying to find the correct association.
- - Fix incorrect logic when querying assoc_mgr information.
- - Move debug messages to debug3 notifying a gres_bit_alloc was NULL for
gres types without a file.
- - Sanity Check Patch to setup variables for RAPL if in a race for it.
- - GRES - Fix minor typecast issues.
- - burst_buffer/cray - Increase size of intermediate variable used to store
buffer byte size read from DW instance from 32 to 64-bits to avoid overflow
and reporting invalid buffer sizes.
- - Allow an existing reservation with running jobs to be modified without
Flags=IGNORE_JOBS.
- - srun - don't attempt to execve() a directory with a name matching the
requested command
- - Do not automatically relocate an advanced reservation for individual cores
that spans multiple nodes when nodes in that reservation go down (e.g.
a 1 core reservation on node "tux1" will be moved if node "tux1" goes
down, but a reservation containing 2 cores on node "tux1" and 3 cores on
"tux2" will not be moved if node "tux1" goes down). Advanced reservations for
whole nodes will be moved by default for down nodes.
- - Avoid possible double free of memory (and likely abort) for slurmctld in
background mode.
- - contribs/cray/csm/slurmconfgen_smw.py - avoid including repurposed compute
nodes in configs.
- - Support AuthInfo in slurmdbd.conf that is different from the value in
slurm.conf.
- - Fix build on FreeBSD 10.
- - Fix hdf5 build on ppc64 by using correct fprintf formatting for types.
- - Fix cosmetic printing of NO_VALs in scontrol show assoc_mgr.
- - Fix perl api for newer perl versions.
- - Fix for jobs requesting cpus-per-task (eg. -c3) that exceed the number of
cpus on a core.
- - Remove unneeded perl files from the .spec file.
- - Flesh out filters for scontrol show assoc_mgr.
- - Add function to remove assoc_mgr_info_request_t members without freeing
structure.
- - Fix build on some non-glibc systems by updating includes.
- - Add new PowerParameters options of get_timeout and set_timeout. The default
set_timeout was increased from 5 seconds to 30 seconds. Also re-read current
power caps periodically or after any failed "set" operation.
- - Fix slurmdbd segfault when listing users with blank user condition.
- - Save the ClusterName to a file in SaveStateLocation, and use that to
verify the state directory belongs to the given cluster at startup to avoid
corruption from multiple clusters attempting to share a state directory.
- - MYSQL - Fix issue when rerolling monthly data to work off correct time
period. This would only hit you if you rerolled a 15.08 prior to this
commit.
- - If FastSchedule=0 is used make sure TRES are set up correctly in accounting.
- - Fix sreport's truncation of columns with large TRES and not using
a parsing option.
- - Make sure count of boards are restored when slurmctld has option -R.
- - When determining if a job can fit into a TRES time limit after resources
have been selected set the time limit appropriately if the job didn't
request one.
- - Fix inadequate locks when updating a partition's TRES.
- - Add new assoc_limit_continue flag to SchedulerParameters.
- - Avoid race in acct_gather_energy_cray if energy requested before available.
- - MYSQL - Avoid having multiple default accounts when a user is added to
a new account and making it a default all at once.
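The 15.08.8 change that saves the ClusterName to a file in SaveStateLocation
and checks it at startup can be sketched in Python as follows. This is a
minimal illustration and not Slurm's code: the function name verify_state_dir,
the default file name "clustername", and the choice to create the file when it
is missing are assumptions made for the example.

```python
import os


def verify_state_dir(state_dir, cluster_name, filename="clustername"):
    """Check that state_dir belongs to cluster_name.

    Hypothetical helper for illustration only; not part of Slurm.
    On first use the cluster name is recorded; on later runs a mismatch
    raises RuntimeError so two clusters cannot share one state directory.
    """
    path = os.path.join(state_dir, filename)
    try:
        with open(path, "r", encoding="utf-8") as handle:
            recorded = handle.read().strip()
    except FileNotFoundError:
        # No record yet: claim the directory for this cluster (assumed behavior).
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(cluster_name + "\n")
        return
    if recorded != cluster_name:
        raise RuntimeError(
            f"state directory {state_dir!r} belongs to cluster "
            f"{recorded!r}, not {cluster_name!r}"
        )
```

A daemon following this pattern would call the check once at startup, before
reading any saved state, and refuse to continue if it raises.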
Mon Jan 25 13:00:00 2016 robert.warmbierAATTwits.ac.za
- Update to version 15.08.7
Fri Jan 8 13:00:00 2016 robert.warmbierAATTwits.ac.za
- version 15.08.6
Thu Dec 17 13:00:00 2015 robert.warmbierAATTwits.ac.za
- version 15.08.5
Mon Nov 23 13:00:00 2015 robert.warmbierAATTwits.ac.za
- Update to 15.08.4
* Many bug fixes. See NEWS file
Thu Jan 15 13:00:00 2015 robert.warmbierAATTgmx.de
- version 14.11.3
* Many bug fixes. See NEWS file
* native systemd support
Sun Nov 2 13:00:00 2014 scorotAATTfree.fr
- add missing systemd requirements
- add missing rclink
Sun Nov 2 13:00:00 2014 scorotAATTfree.fr
- version 14.03.9
* Many bug fixes. See NEWS file
- add systemd support
Sat Jul 26 14:00:00 2014 scorotAATTfree.fr
- version 14.03.6
* Added support for native Slurm operation on Cray systems
(without ALPS).
* Added partition configuration parameters AllowAccounts,
AllowQOS, DenyAccounts and DenyQOS to provide greater control
over use.
* Added the ability to perform load based scheduling. Allocating
resources to jobs on the nodes with the largest number of idle
CPUs.
* Added support for reserving cores on a compute node for system
services (core specialization)
* Add mechanism for job_submit plugin to generate error message
for srun, salloc or sbatch to stderr.
* Support for Postgres database has long since been out of date
and problematic, so it has been removed entirely. If you
would like to use it the code still exists in <= 2.6, but will
not be included in this and future versions of the code.
* Added new structures and support for both server and cluster
resources.
* Significant performance improvements, especially with respect
to job array support.
- update files list
Sun Mar 16 13:00:00 2014 scorotAATTfree.fr
- update to version 2.6.7
* Support for job arrays, which increases performance and ease of
use for sets of similar jobs.
* Job profiling capability added to record a wide variety of job
characteristics for each task on a user configurable periodic
basis. Data currently available includes CPU use, memory use,
energy use, Infiniband network use, Lustre file system use, etc.
* Support for MPICH2 using PMI2 communications interface with much
greater scalability.
* Prolog and epilog support for advanced reservations.
* Much faster throughput for job step execution with --exclusive
option. The srun process is notified when resources become
available rather than periodic polling.
* Support improved for Intel MIC (Many Integrated Core) processor.
* Advanced reservations with hostname and core counts now supports
asymmetric reservations (e.g. specific different core count for
each node).
* External sensor plugin infrastructure added to record power
consumption, temperature, etc.
* Improved performance for high-throughput computing.
* MapReduce+ support (launches ~1000x faster, runs ~10x faster).
* Added "MaxCPUsPerNode" partition configuration parameter. This
can be especially useful to schedule GPUs. For example a node
can be associated with two Slurm partitions (e.g. "cpu" and
"gpu") and the partition/queue "cpu" could be limited to only a
subset of the node's CPUs, ensuring that one or more CPUs would
be available to jobs in the \"gpu\" partition/queue.
Thu Jun 6 14:00:00 2013 scorotAATTfree.fr
- version 2.5.7
* Fix for linking to the select/cray plugin to not give warning
about undefined variable.
* Add missing symbols to the xlator.h
* Avoid placing pending jobs in AdminHold state due to backfill
scheduler interactions with advanced reservation.
* Accounting - make average by task not cpu.
* POE - Correct logic to support poe option "-euidevice sn_all"
and "-euidevice sn_single".
* Accounting - Fix minor initialization error.
* POE - Correct logic to support srun network instances count
with POE.
* POE - With the srun --launch-cmd option, report proper task
count when the --cpus-per-task option is used without the
--ntasks option.
* POE - Fix logic binding tasks to CPUs.
* sview - Fix race condition where new information could have
slipped past the node tab and we didn't notice.
* Accounting - Fix an invalid memory read when slurmctld sends
data about start job to slurmdbd.
* If a prolog or epilog failure occurs, drain the node rather
than setting it down and killing all of its jobs.
* Priority/multifactor - Avoid underflow in half-life calculation.
* POE - pack missing variable to allow fanout (more than 32
nodes)
* Prevent clearing reason field for pending jobs. This bug was
introduced in v2.5.5 (see "Reject job at submit time ...").
* BGQ - Fix issue with preemption on sub-block jobs where a job
would kill all preemptable jobs on the midplane instead of just
the ones it needed to.
* switch/nrt - Validate dynamic window allocation size.
* BGQ - When --geo is requested do not impose the default
conn_types.
* RebootNode logic - Defers (rather than forgets) reboot request
with job running on the node within a reservation.
* switch/nrt - Correct network_id use logic. Correct support for
user sn_all and sn_single options.
* sched/backfill - Modify logic to reduce overhead under heavy
load.
* Fix job step allocation with --exclusive and --hostlist option.
* Select/cons_res - Fix bug resulting in error of "cons_res: sync
loop not progressing, holding job #"
* checkpoint/blcr - Reset max_nodes from zero to NO_VAL on job
restart.
* launch/poe - Fix for hostlist file support with repeated host
names.
* priority/multifactor2 - Prevent possible divide by zero.
* srun - Don't check for executable if --test-only flag is
used.
* energy - On a single node only use the last task for gathering
energy, since we don't currently track energy usage per task
(only per step). Otherwise we get double the energy.
Sat Apr 6 14:00:00 2013 scorotAATTfree.fr
- version 2.5.4
* Support for Intel® Many Integrated Core (MIC) processors.
* User control over CPU frequency of each job step.
* Recording power usage information for each job.
* Advanced reservation of cores rather than whole nodes.
* Integration with IBM's Parallel Environment including POE (Parallel
Operating Environment) and NRT (Network Resource Table) API.
* Highly optimized throughput for serial jobs in a new
"select/serial" plugin.
* CPU load information is available
* Configurable number of CPUs available to jobs in each SLURM
partition, which provides a mechanism to reserve CPUs for use
with GPUs.
Sat Nov 17 13:00:00 2012 scorotAATTfree.fr
- remove runlevel 4 from init script thanks to patch1
- fix self obsoletion of slurm-munge package
- use fdupes to remove duplicates
- spec file reformatting
Sat Nov 17 13:00:00 2012 scorotAATTfree.fr
- put perl macro in a better place within install section
Sat Nov 17 13:00:00 2012 scorotAATTfree.fr
- enable numa on x86_64 arch only
Sat Nov 17 13:00:00 2012 scorotAATTfree.fr
- add numa and hwloc support
- fix rpath with patch0
Fri Nov 16 13:00:00 2012 scorotAATTfree.fr
- fix perl module files list
Mon Nov 5 13:00:00 2012 scorotAATTfree.fr
- use perl_process_packlist macro for the perl files cleanup
- fix some summaries length
- add cgroups directory and the example cgroup.release_common file
Sat Nov 3 13:00:00 2012 scorotAATTfree.fr
- spec file cleanup
Sat Nov 3 13:00:00 2012 scorotAATTfree.fr
- first package